I read this book over the weekend to get a better understanding of the lakehouse pattern. It's produced by Databricks, the company behind Delta Lake, so it's a little biased towards their own product offering. It's an easy read for non-technical people and only about 30 pages long, so it's worth a look if you want a brief overview of this emerging pattern.
Relational databases served well for decades but are now overwhelmed by the volume of data, largely due to the rise of the internet. Companies created many separate databases to try to manage the size, but data became fragmented and siloed.
The data warehouse tried to solve this problem by bringing data together into a single consolidated database. It too became strained as data volumes grew yet again and new types of data needed to be stored, such as images, videos and binary files.
The data lake was the next iteration, as it could store any type of data in huge volumes backed by cheap storage such as Amazon S3, Azure Blob Storage and Google Cloud Storage. Hadoop and MapReduce, both open source and free, were used to process the data. A newer analysis framework called Spark proved faster and easier to use and became the go-to tool for data scientists.
But data lakes had a few problems:
- It is hard to structure the data as tables with columns, as in a standard database.
- There is no support for transactions, which guarantee that a set of changes either all succeed or all fail together; without them, a partial failure can leave data in an inconsistent state.
- Query performance is poor, for example when joining fact and dimension tables in the so-called "star schema".
This led to the latest pattern, the lakehouse: a database management system that sits over a data lake. This part of the book pushes the Databricks products, but there is no denying they have done an amazing job creating software that gives all the benefits of a database management system while using cheap storage. Other systems, such as Snowflake and Azure Synapse, do a similar job but at a far higher cost.
Databricks Delta Lake:
- Allows the creation of tables to enforce structure.
- Supports database transactions.
- Compacts and organizes data files.
- Runs very fast queries thanks to its high-performance query engine.
- Supports the all-important MERGE statement, which is essential when building data pipelines (my own point, not mentioned in the book).
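Since MERGE matters so much for pipelines, here is a tiny plain-Python sketch of what a MERGE (upsert) does: source rows update matching target rows on a key, and rows with no match are inserted. This is purely illustrative; in Delta Lake the same result is a single `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` SQL statement.

```python
def merge(target, source, key):
    """Sketch of MERGE/upsert semantics: rows in `source` update
    matching rows in `target` (by `key`) or are inserted if new."""
    by_key = {row[key]: row for row in target}
    for row in source:
        by_key[row[key]] = row  # matched -> update, not matched -> insert
    return list(by_key.values())


positions = [{"id": 1, "qty": 10}, {"id": 2, "qty": 5}]
updates = [{"id": 2, "qty": 7}, {"id": 3, "qty": 1}]
merge(positions, updates, "id")
# -> [{'id': 1, 'qty': 10}, {'id': 2, 'qty': 7}, {'id': 3, 'qty': 1}]
```

Without MERGE you are stuck doing a delete-then-insert or a full table rewrite on every pipeline run, which is why its absence in plain data lakes hurts so much.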
Creating a lakehouse enables the storage of huge amounts of data with the same features as a data warehouse, such as support for tables, transactions and high-speed queries, with the added benefit of enabling AI and machine learning workloads.
Functional IT is researching and using the new lakehouse architecture to create a range of software products for the asset management industry, starting with an investment compliance system. Our benchmark tests have shown that calculations run 25 times faster than similar systems based on standard relational databases, at a much lower cost.