This article is aimed at high-level business users who want to understand the modern data warehouse and how it can be integrated with data science libraries in the asset management industry.
If you search the web for “modern data warehouse” you will find a plethora of articles, mainly provided by software vendors, that define the solution as “a system that allows you to bring together all your data at scale and to get insights easily through dashboards and analytical tools”.
History of Data Warehousing
Let us take a quick look at the history of data warehousing:
- In the 1980s, relational databases were used to store data and generate reports. Over time they became hard to manage as data volumes grew and data needed to be integrated from multiple sub-systems.
- In the 2000s the idea of a data warehouse took hold: data was consolidated from various sub-systems and arranged in dimensional models so it could be quickly sliced and diced. Ralph Kimball was the driving force behind this approach.
- Recently, in the last five years or so, the amount and variety of data has overwhelmed standard database systems. Enterprises need to store data at huge volume, in real time, and even store unstructured data such as images or log files. To solve this problem the data lake was proposed, which is essentially a file system in the cloud (I use the term file system loosely, as it's not like a file system on your PC but a service that stores binary data and provides ways to get at that data quickly and easily). Examples are Amazon S3, Azure Blob Storage, or Google Cloud Storage. However, the data lake suffers from a lack of structure, no support for ACID transactions, and slow queries when joining across data sets; it was often called a data swamp.
- Very recently, in the last year or so, software vendors such as Databricks (with Delta Lake) and Snowflake have created systems that allow the data lake to be managed much better. These systems create a metadata layer on top of the file system so that developers can work with familiar structures such as tables and partitions, and use MERGE operations with ACID transaction support (they also have built-in query optimizers that speed up joins across data sets). We are now getting close to a data warehouse on top of a highly scalable, low-cost file system: the Lake House.
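The MERGE (upsert) semantics mentioned above are worth making concrete. The sketch below shows the behaviour in plain Python over an in-memory dict standing in for a table; the position records and field names are invented for illustration. Delta Lake provides the same "update matched rows, insert unmatched rows" semantics at scale, with ACID guarantees, over files in the lake.

```python
# Sketch of MERGE (upsert) semantics: WHEN MATCHED update, WHEN NOT MATCHED insert.
# The "table" is a dict keyed by a primary key; this is only an illustration of
# the semantics, not of how Delta Lake is implemented.

def merge(target: dict, updates: list[dict], key: str = "id") -> dict:
    """Return a new table with updates applied: matched keys overwritten, new keys inserted."""
    merged = dict(target)  # leave the original untouched, like a transactional write
    for row in updates:
        merged[row[key]] = row
    return merged

# A target "table" of positions and a batch of incoming updates (illustrative data).
positions = {1: {"id": 1, "ticker": "AAPL", "qty": 100},
             2: {"id": 2, "ticker": "MSFT", "qty": 50}}
updates = [{"id": 2, "ticker": "MSFT", "qty": 75},   # matched  -> update
           {"id": 3, "ticker": "GOOG", "qty": 20}]   # no match -> insert

result = merge(positions, updates)
```

In Delta Lake the same operation is expressed as a MERGE statement or API call against files in the lake, and the metadata layer makes it atomic.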
The general theme is that the volume and complexity of data has increased and overwhelmed standard database systems. Data has been moved into data lakes as they scale better, store any type of data, and have a lower cost of ownership. Data lake management systems such as Delta Lake are driving the new Lake House architecture.
Another trend on the rise is data science and AI, which are not well supported by relational database systems. There are connectors from data science libraries into relational databases; however, data science systems generally use specialized file formats such as Parquet or Avro that store data much more efficiently. It is no surprise that Databricks Delta Lake stores its data as Parquet files and integrates with many data science libraries.
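Part of why column-oriented formats like Parquet are so efficient for analytics is that each column is stored contiguously, so a query that touches one field reads only that field's values rather than every whole record. A toy illustration of the two layouts in plain Python (the holdings data is invented; real Parquet also adds compression and encoding on top of the columnar layout):

```python
# Row layout: one record per row; summing one field still touches every record.
rows = [{"ticker": "AAPL", "qty": 100, "price": 190.0},
        {"ticker": "MSFT", "qty": 50,  "price": 410.0},
        {"ticker": "GOOG", "qty": 20,  "price": 140.0}]

# Columnar layout: each field stored contiguously, as Parquet lays data out on disk.
columns = {
    "ticker": [r["ticker"] for r in rows],
    "qty":    [r["qty"] for r in rows],
    "price":  [r["price"] for r in rows],
}

# An aggregation over one column scans only that column's values.
total_qty = sum(columns["qty"])
```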
The Rise of Data Science and Open Source
Unlike the enterprise software world, the data science community has embraced the open-source movement, and most of its libraries are open source and free. By combining Delta Lake (free!) with a distributed computation framework like Apache Spark (also free!) we can create a modern data warehouse that is very low cost and supports high-performance calculations and AI "out of the box".
Wash, Rinse, Repeat
There is a repeating theme in the asset management industry for managing data and performing calculations:
- CONSOLIDATE – aggregate data from multiple source systems such as trading or accounting.
- MANAGE – apply master data management by enriching the data with classifications and groupings.
- CALCULATE – run calculations on enriched data such as Performance, Risk, Compliance, and Benchmarks.
By creating a Lake House and using open-source, free software, the asset management industry could build high-value calculation systems at low cost. The CALCULATE phase is of special interest, as data science is all about performing rapid calculations over large volumes of data. There are many powerful calculation frameworks such as Hadoop (a bit dated now), Apache Spark, Pandas, Azure Data Lake Analytics, Amazon EMR, Google BigQuery, etc. Any one of these could bring large efficiencies when paired with a Lake House architecture.
Functional IT is working on an Investment Compliance Monitoring solution using Databricks Delta Lake and Apache Spark. Benchmark tests have shown that we can run 10,000 rule calculations in 10-15 minutes, whereas current systems based on relational databases take 6-8 hours. We plan to explore many more solutions using this architecture in the future. No doubt many other commercial vendors will also move to this architecture given its cost-benefit ratio.
The asset management industry should embrace the new wave of big data tools to create low-cost high value systems. Moving to the cloud is inevitable but may take some time due to security and data privacy concerns.
A follow-up article will explore dimensional modelling and master data management in the data lake, both key to providing a complete solution in a real-world scenario. I'll also discuss common patterns in asset management data and common pitfalls when building data pipelines (a holdings file is not a transaction file!).