This blog is for technical readers who want a better idea of what Apache Spark can do when running high-performance calculations over asset manager data.
Compliance limit checking involves running a large number of rules over portfolio holdings data. Regulatory rules or client mandates can require hundreds of rules across many portfolios, resulting in an explosion in the number of calculations required. Relational databases and standard coding techniques perform very poorly here, with compliance rule runs taking several hours. This is a problem because runs can exceed the overnight window, and it is also impractical to re-run rules after corrections are made because it takes too long to get results.
In my previous blog post I wrote about how new data science frameworks, such as Apache Spark, are a good fit for this problem because they can handle large numbers of concurrent calculations. Paired with data lake storage, they could provide a low-cost, high-performance solution to this tricky problem.
Setting up the environment
To investigate this problem, we set up an environment consisting of:
- About 1 000 portfolio records, about 2 000 instrument records and about 4.5 million holdings records stored in a Databricks Delta Lake running on Microsoft Azure blob storage.
- We loaded the ICB classifications from the JSE and manufactured fake Issuer classifications.
- Our rule set consisted of 100 rules, each performing the actions of a real compliance limit check (sketched in code after this list):
  - Fetch holdings data for a portfolio and date.
  - Filter holdings data based on a random classification query such as “Industry=Financials”.
  - Group holdings data around the issuer.
  - Aggregate values for each group.
  - Compare aggregate values to total holdings for limits.
  - Save breach results back into the delta lake.
  - Capture a snapshot of data for audit purposes.
- Each rule ran for 100 portfolios, resulting in 10 000 rule executions.
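To make the steps concrete, here is a minimal PySpark sketch of a single rule execution. The table names (holdings, breaches), column names (portfolio_id, holding_date, industry, issuer, market_value) and the 10% limit are illustrative assumptions rather than our actual schema, but the shape of the query is what each rule does.

```python
from pyspark.sql import functions as F

def run_rule(spark, portfolio_id, run_date, classification_filter, limit_pct=0.10):
    """Sketch of one compliance rule execution (assumed table and column names)."""
    # 1. Fetch holdings for the portfolio and date from the Delta table.
    holdings = (spark.table("holdings")
                .where((F.col("portfolio_id") == portfolio_id) &
                       (F.col("holding_date") == run_date)))

    # Total portfolio value, used as the denominator for the limit check.
    total_value = holdings.agg(F.sum("market_value")).first()[0]

    # 2. Filter holdings on a classification query, e.g. "industry = 'Financials'".
    filtered = holdings.where(classification_filter)

    # 3 & 4. Group around the issuer and aggregate the exposure for each group.
    exposures = (filtered.groupBy("issuer")
                 .agg(F.sum("market_value").alias("exposure")))

    # 5. Compare each group's aggregate exposure to the total holdings limit.
    breaches = (exposures
                .withColumn("exposure_pct", F.col("exposure") / F.lit(total_value))
                .where(F.col("exposure_pct") > limit_pct)
                .withColumn("portfolio_id", F.lit(portfolio_id))
                .withColumn("run_date", F.lit(run_date)))

    # 6. Save breach results back into the delta lake (an audit snapshot of the
    #    inputs could be written the same way; omitted here for brevity).
    breaches.write.format("delta").mode("append").saveAsTable("breaches")

# Example: run one rule for one portfolio (hypothetical identifiers).
# run_rule(spark, "PF0001", "2021-06-30", "industry = 'Financials'")
```

In practice the 10 000 rule executions are submitted concurrently, for example from a thread pool on the driver, so that Spark can keep all the worker cores busy.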
Spark is a distributed computing system that runs on clusters of virtual machines. We selected the following settings for the Spark cluster (an illustrative cluster specification follows the list):
- One driver node with 16 cores and 64 GB RAM.
- Between 5 and 10 worker nodes, each with 8 cores and 32 GB RAM, with auto-scaling enabled so that VMs could be created on demand.
- Nodes are Linux virtual machines that run at low cost (more on that later).
- There are various limits on core counts and VM types available in each Azure region, so we picked a “middle of the road” configuration that would be available in any region at low cost.
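For reference, a cluster of this shape can be captured as a Databricks cluster specification. The sketch below is only illustrative: the runtime version and the Azure node types (Standard_D8s_v3 for workers, Standard_D16s_v3 for the driver) are assumptions chosen to match the core and RAM counts above, not necessarily the exact VM sizes we used.

```python
# Illustrative Databricks cluster spec (Clusters API fields expressed as a Python dict).
cluster_spec = {
    "cluster_name": "compliance-rule-engine",
    "spark_version": "9.1.x-scala2.12",         # any recent LTS runtime would do
    "driver_node_type_id": "Standard_D16s_v3",  # 16 cores, 64 GB RAM
    "node_type_id": "Standard_D8s_v3",          # 8 cores, 32 GB RAM per worker
    "autoscale": {"min_workers": 5, "max_workers": 10},  # scale workers on demand
}
```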
The Results
All 10 000 rule executions completed in just under 10 minutes, typically between 8.5 and 9.5 minutes; we repeated the run multiple times to make sure the results were reproducible. The results were very impressive, and we estimate a further improvement of around 20% is possible by optimizing queries, joins, partitions, merge operations and so on (one example of this kind of tuning is sketched below). As it stands, it's a huge improvement over standard techniques that take several hours to run the same number of rules.
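As one example of the tuning we have in mind, partitioning the holdings table by date and compacting/co-locating the underlying files reduces how much data each rule has to scan. The snippet below uses standard Delta Lake / Databricks commands; holdings_df and the column names are the same assumed schema as in the earlier sketch.

```python
# Re-write the holdings table partitioned by date so each rule scans only one partition.
(holdings_df.write.format("delta")
    .partitionBy("holding_date")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # needed when changing the partitioning of an existing table
    .saveAsTable("holdings"))

# Compact small files and co-locate rows by portfolio to speed up per-portfolio filters.
spark.sql("OPTIMIZE holdings ZORDER BY (portfolio_id)")
```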
How Much Does It Cost?
There is no point creating a fast calculation system if it's too expensive, and part of being a cloud developer is making sure you use services that don't incur high costs (unless you really need them and the business case is suitable).
In this case we use Azure blob storage for the data lake, and in terms of overall cost it's practically free at about 50c/GB/month. There are some costs for reading and writing data, but they are also low for this use case.
The real cost is compute time for the virtual machines, which run at about R10/hour for each worker and R16/hour for the driver. With 10 workers that works out to about R116/hour, and factoring in idle time at start-up and shut-down we used about 20 minutes of compute time, so a run cost us roughly R39. Once again this is an insignificant cost relative to the value added, and it leaves room to scale up the number of nodes or the VM specifications. Costs can be reduced by up to 70% if VMs are reserved, but that would only make sense if we were constantly running calculations, which is unlikely.
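If you want to plug in your own node counts and rates, the estimate is a few lines in a notebook cell; the figures below are simply the ones quoted above.

```python
# Rough compute-cost estimate for one run, using the rates quoted above (Rand per hour).
workers, worker_rate, driver_rate = 10, 10.0, 16.0
hourly_cost = workers * worker_rate + driver_rate      # = R116/hour
run_hours = 20 / 60                                    # ~20 minutes incl. start-up idle time
print(f"Cost per run: ~R{hourly_cost * run_hours:.0f}")  # ~R39
```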
Conclusion
After a few months of research, we were able to create a high-performance rule engine using Delta Lake and Apache Spark. We believe the same principles could be applied to other types of calculations such as performance, risk, fees and benchmarks. This shows that free, open-source data science frameworks can provide a low-cost, high-performance solution for asset management calculations.