This blog post is for business users and technical implementers who want to understand the real-world scenario of managing master data in a data lake.
In my post on the modern data warehouse I reviewed the history of data warehousing and a newer trend known as the Lake House. A Lake House is essentially a database management system layered over a distributed file store, which allows for scale, low cost and high performance. This sounds like a great way to store different types of data that feed higher-value calculation systems, but there is a big piece missing: master data management.
Performance attribution, compliance limit checking, custom benchmarks, fee calculations and client reporting all require data to be classified and grouped. Managing these classifications and groups is known as master data management, or MDM. MDM has other components too, such as matching records from different sources, but I won’t go into those in this post as I want to focus on the main problem of classifying data.
Let’s look at a concrete example, a Regulation 28 compliance rule: “A money market instrument issued by a foreign bank may not exceed 5% of the portfolio per issuer”. This rule cannot be calculated unless instruments are classified as “money market” and “foreign” and have an “issuer”. Another good example is creating a dimensional model within the Lake House for BI tools and dashboards. There is not much benefit in exposing holdings data to end users without classifications, such as the ICB classifications, because that is how they will want to slice and dice the data.
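To make the dependency on classifications concrete, here is a minimal Python sketch of the rule above. The `Holding` fields, the `issuer_breaches` function and the sample values in the usage note are all hypothetical names and data chosen for illustration, and the check that the issuer is a bank is left out for brevity. It is not how any particular compliance engine works.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    # All field names are illustrative assumptions, not a real schema.
    instrument_id: str
    market_value: float
    asset_class: str  # classification, e.g. "money market" or "equity"
    region: str       # classification, "foreign" or "local"
    issuer: str       # issuing entity, e.g. a bank


def issuer_breaches(holdings, limit=0.05):
    """Return {issuer: weight} where foreign money market exposure exceeds the limit."""
    total = sum(h.market_value for h in holdings)
    if total <= 0:
        return {}

    # Sum exposure per issuer, but only for instruments that carry
    # both the "money market" and the "foreign" classification.
    exposure = defaultdict(float)
    for h in holdings:
        if h.asset_class == "money market" and h.region == "foreign":
            exposure[h.issuer] += h.market_value

    # Keep only issuers whose share of the whole portfolio is above the limit.
    return {
        issuer: value / total
        for issuer, value in exposure.items()
        if value / total > limit
    }
```

Calling `issuer_breaches` on a portfolio where one foreign issuer holds 6% of the value in money market instruments flags that issuer. Remove the `asset_class`, `region` or `issuer` fields and the rule cannot be evaluated at all, which is the point of the example.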
MDM in the Lake House
It’s pretty clear that MDM is a key part of adding value to data, but how does this get done in the Lake House? Having grappled with this problem for a few years while building data warehouse management systems, my advice is to avoid doing it in the Lake House. MDM requires a relational data model that supports joins and constraints across tables with primary and foreign keys, and this type of data model is not well supported in a data lake.
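As a rough illustration of what "joins and constraints with primary and foreign keys" buys you, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a relational MDM store. The table names, column names and helper functions are assumptions made up for this example, not the schema of any real MDM product.

```python
import sqlite3


def build_mdm_schema(conn):
    """Create a tiny, hypothetical classification model with enforced keys."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE classification (
            classification_id INTEGER PRIMARY KEY,
            scheme TEXT NOT NULL,
            name   TEXT NOT NULL,
            UNIQUE (scheme, name)
        );
        CREATE TABLE instrument (
            instrument_id TEXT PRIMARY KEY,
            issuer        TEXT NOT NULL
        );
        CREATE TABLE instrument_classification (
            instrument_id     TEXT    NOT NULL REFERENCES instrument(instrument_id),
            classification_id INTEGER NOT NULL REFERENCES classification(classification_id),
            PRIMARY KEY (instrument_id, classification_id)
        );
        """
    )


def try_link(conn, instrument_id, classification_id):
    """Link an instrument to a classification; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO instrument_classification VALUES (?, ?)",
                (instrument_id, classification_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def instruments_in(conn, scheme, name):
    """Join across the three tables to list instruments in one classification."""
    rows = conn.execute(
        """
        SELECT i.instrument_id
        FROM instrument AS i
        JOIN instrument_classification AS ic ON ic.instrument_id = i.instrument_id
        JOIN classification AS c ON c.classification_id = ic.classification_id
        WHERE c.scheme = ? AND c.name = ?
        ORDER BY i.instrument_id
        """,
        (scheme, name),
    )
    return [row[0] for row in rows]
```

The database itself refuses a link to a classification that does not exist, or a duplicate link, so bad master data never gets in. That kind of enforced guarantee is what tends to be missing when the same tables are plain files in a lake.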
It’s better to export the data into a specialized MDM system and, once the management work has been completed, load it back into the Lake House. Another option is to use the Lake House as a store of raw data and then export that into various calculation sub-systems, but then we are back in the cycle of classifying data many times, once within each sub-system, which is no easy task.
Functional IT is creating a compliance rule checking system built on top of a Lake House, and we have taken the approach of pushing data into a specialized MDM system. A typical process flow would be to:
- Extract data from a source system into the staging area of the Lake House.
- Export the data from the staging area into an external classification system backed by a relational database.
- Complete the MDM, i.e. set up the classifications and groups required for compliance rules.
- Import the classification data back into the Lake House where it can be used to run high performance calculations.
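The four steps above can be sketched end to end in a few lines of Python. This is only a toy under stated assumptions: a list of dicts stands in for the Lake House staging area, an in-memory `sqlite3` database stands in for the external classification system, and every function, table and column name is hypothetical.

```python
import sqlite3


def extract_to_staging(source_rows):
    """Step 1: land the source rows unchanged in a (pretend) staging area."""
    return [dict(row) for row in source_rows]


def export_to_mdm(staging, conn):
    """Step 2: push the distinct instruments into the relational MDM store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS instrument ("
        "instrument_id TEXT PRIMARY KEY, asset_class TEXT, region TEXT)"
    )
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO instrument (instrument_id) VALUES (?)",
            [(row["instrument_id"],) for row in staging],
        )


def classify(conn, instrument_id, asset_class, region):
    """Step 3: the MDM work itself, here reduced to setting two classifications."""
    with conn:
        conn.execute(
            "UPDATE instrument SET asset_class = ?, region = ? WHERE instrument_id = ?",
            (asset_class, region, instrument_id),
        )


def import_classifications(staging, conn):
    """Step 4: bring the classifications back and attach them to the staged rows."""
    lookup = {
        instrument_id: {"asset_class": asset_class, "region": region}
        for instrument_id, asset_class, region in conn.execute(
            "SELECT instrument_id, asset_class, region FROM instrument"
        )
    }
    return [{**row, **lookup[row["instrument_id"]]} for row in staging]
```

In a real deployment each step would be a pipeline job rather than a function call, but the shape is the same: raw data goes out, classified data comes back, and the calculations run on the enriched rows inside the Lake House.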
The emerging Lake House pattern will have many benefits for the asset management industry, but careful thought is required as to how the data will be used in real-world scenarios, i.e. calculations, BI and reporting.