Business Vault Secrets
Hello! This time we want to look at the Business Vault. Business Vaults are an extremely powerful feature of the Data Vault method, but they aren't given much publicity. We've seen many confused teams who've lost time and had to unpick mistakes. It's a pity, because with a bit of guidance those mistakes could have been avoided. Before we get started, who are we? We're Datavault UK – the UK's leading Data Vault consultancy.
We run the UK Data Vault User Group, where those who are interested in Data Vault can meet to learn and share their experiences. And for data engineers, we've written and support a package you can use with the free dbt tool to build your Data Vault solution. This diagram shows the various layers mentioned in the Data Vault method. A commonly mentioned component is the Business Vault. This is the place where we store the results of business rule calculations and other types of derived data.
Dan defines the Business Vault as follows. I think the key points are that it contains wholly derivable data, and that it's calculated after the Raw Data Vault and before loading the Information Marts. So let's start by busting the first misunderstanding. The Business Vault isn't a separate layer. It's not a separate schema. It's held inside the Raw Vault. Yes – the Business Vault has new tables, but they're overlaid on the Raw Vault structure. Marts then feed on both the Raw Data Vault and the Business Vault tables. What types of data can the Business Vault hold? We might need to pre-calculate data for efficiency reasons, so it can be consumed by a downstream dashboard.
Calculate Quality Measures
We might need to calculate helper tables that drive better performance. We might calculate new values such as ratios. We might be using the results of data science for our business. And finally, we might be checking data quality and want to calculate quality measures. Let's look at the first item on the list. Some processing might be needed to get data out of the Vault. Your users might not want the full granularity of data held in the Raw Vault. They may want aggregated, filtered, or masked data in their reporting Marts.
The point here is that the data isn't transformed. It's simply added up, grouped, masked, or otherwise processed before it's consumed. There are three possible places to build consumption rules. You can run them in the Raw Vault, creating Business Vault tables. You can run them on the way out to the Mart, or you can run them in the end-user BI tool's cache. If processing is intensive and the results are used by more than one Mart, you're best running the rule inside the Vault to populate Business Vault tables. If processed data is used in one Mart only, you can run those rules as you populate the Mart.
And finally, you might consider implementing local processing rules in the end-user BI tool if the data is only used in one display. Things to consider when deciding where to code the presentation rules are: security – it's better to keep data in one place in the Vault and only let out what's needed; performance and cost – the Vault is generally cheaper and faster; what your BI tool can and can't do – it may have limits on data volumes or the number of feeds per day; what your users need – aggregate data is usually enough for them; who needs access to what; and what overlaps there are between groups of end-users. Next on the list is helper tables. These are the Data Vault Pit and Bridge tables. These are powerful techniques that support virtual reporting Marts and dimension and fact tables. We could spend a while on these, but we don't have time, so we'll introduce them here; you can check Dan's book for more information. I'm sure we'll cover them in a future video. So what does a Pit, or point-in-time, table do? Say we have a Satellite: customer details from our CRM system.
This table has an effective date. There are no start or end dates here. So to find which Satellite rows are valid for a given date, we have to find, for each customer's primary key, the record with the greatest effective date before the reference date. This can be quite an expensive calculation, although SQL window functions help to reduce the load. One Satellite table may be OK for that sort of processing. But what if you've got two, three, or more Satellites off the same Hub? Perhaps one Satellite is fed throughout the day, another gets a weekly update, and others are fed daily. Which values are valid for a given date?
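The window-function approach can be sketched like this – the table and column names are made up for illustration, and the example uses SQLite (window functions need SQLite 3.25 or later):

```python
# Sketch: pick the latest satellite row per customer as of a reference date,
# using a window function instead of a correlated MAX() subquery.
# Table and column names are illustrative, not from the video.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer_details (
    customer_hk    TEXT,
    effective_from TEXT,
    customer_name  TEXT
);
INSERT INTO sat_customer_details VALUES
    ('HK1', '2023-01-01', 'Alice Ltd'),
    ('HK1', '2023-03-01', 'Alice Holdings'),
    ('HK2', '2023-02-15', 'Bob plc');
""")

# Rank each customer's rows on or before the reference date by
# effective_from descending; rank 1 is the row valid on that date.
rows = conn.execute("""
SELECT customer_hk, customer_name FROM (
    SELECT customer_hk, customer_name,
           ROW_NUMBER() OVER (
               PARTITION BY customer_hk
               ORDER BY effective_from DESC
           ) AS rn
    FROM sat_customer_details
    WHERE effective_from <= '2023-03-10'
)
WHERE rn = 1
ORDER BY customer_hk
""").fetchall()
print(rows)  # [('HK1', 'Alice Holdings'), ('HK2', 'Bob plc')]
```

Even with the window function, this ranking has to be repeated in every query that needs point-in-time data – which is exactly the cost a Pit table removes.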
Pit Tables
That's quite a calculation – I don't want to repeat it each time we query the data. So a Pit table holds a pointer to the record in each of the Satellites that is valid for each customer for each day. For example – and this is common in many businesses – our users might want to process end-of-day data only. The Pit table is then calculated after loading the Raw Vault, as a Business Vault process: we find the end-of-day records for each customer in each Satellite and insert references into the Pit table.
When we want to find the relevant Satellite records, we do an equijoin with the Pit table, which is much faster than looking for the greatest end date or greatest effective date. The Pit table is quite long but thin, and we can trim it to hold only a few months of reference data. Now we can look at Bridge tables. It's common to have queries that navigate Hubs and Links to fetch data from Satellites across the data model.
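A minimal sketch of building and then using a Pit table, with hypothetical table names, again in SQLite – the Pit is populated once per snapshot date, and queries then reduce to a cheap equijoin:

```python
# Sketch: build a Pit table after each raw load, then query via equijoin.
# All names (sat_customer, pit_customer) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer (customer_hk TEXT, effective_from TEXT, name TEXT);
INSERT INTO sat_customer VALUES
    ('HK1', '2023-01-02', 'Alice'),
    ('HK1', '2023-01-03', 'Alicia'),
    ('HK2', '2023-01-02', 'Bob');
CREATE TABLE pit_customer (customer_hk TEXT, snapshot_date TEXT,
                           sat_effective_from TEXT);
""")

# For every snapshot date, store a pointer (here the effective_from value)
# to the satellite row that was valid on that date.
conn.execute("""
INSERT INTO pit_customer
SELECT s.customer_hk, d.snapshot_date, MAX(s.effective_from)
FROM (SELECT '2023-01-02' AS snapshot_date
      UNION ALL SELECT '2023-01-04') d
JOIN sat_customer s ON s.effective_from <= d.snapshot_date
GROUP BY s.customer_hk, d.snapshot_date
""")

# Reads are now a simple equijoin instead of a MAX() per query.
rows = conn.execute("""
SELECT p.snapshot_date, s.customer_hk, s.name
FROM pit_customer p
JOIN sat_customer s
  ON s.customer_hk = p.customer_hk
 AND s.effective_from = p.sat_effective_from
ORDER BY p.snapshot_date, s.customer_hk
""").fetchall()
print(rows)
```

With more Satellites off the same Hub, the Pit table would simply carry one pointer column per Satellite, and the read-side query stays a flat set of equijoins.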
Here we have an example: three Hubs joined by two Links, but the real model could join many, many more, and possibly radiate out across multiple chains as well. The SQL to query that is quite repetitive, and you can make mistakes writing it if you're not concentrating. Different grains can also cause the query to grow and run into performance problems. Bridge tables pre-calculate the navigation for each reference date, just like Pit tables do for Satellites.
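A toy illustration of the idea, with invented link names: the two-Link navigation is materialised once as a Bridge, so downstream queries join a single flat table instead of repeating the chain.

```python
# Toy Bridge table: pre-compute the Hub -> Link -> Hub navigation once.
# All table names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);
CREATE TABLE link_order_product  (order_hk TEXT, product_hk TEXT);
INSERT INTO link_customer_order VALUES ('C1', 'O1'), ('C1', 'O2');
INSERT INTO link_order_product  VALUES ('O1', 'P1'), ('O2', 'P1');

-- Materialise the two-Link chain as a Bridge.
CREATE TABLE bridge_customer_product AS
SELECT lco.customer_hk, lco.order_hk, lop.product_hk
FROM link_customer_order lco
JOIN link_order_product lop ON lop.order_hk = lco.order_hk;
""")

rows = conn.execute(
    "SELECT * FROM bridge_customer_product ORDER BY order_hk"
).fetchall()
print(rows)  # [('C1', 'O1', 'P1'), ('C1', 'O2', 'P1')]
```

A production Bridge would also carry the reference date and, as noted below, can carry pre-summed measures as extra columns.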
Load a Mart
They're great tools to deliver aggregates, since they can also store sums of values as extra attributes on the table. If you look closely at Pit and Bridge tables together, the secret is this: the Pit table is a star schema dimension, and the Bridge is quite close to a fact table. Pit and Bridge tables then pre-calculate the work involved in loading a Mart – and you can build a Mart as views on these helper tables to give you extra agility. Business rules calculate new data.
A rule could work within a table – for example, the ratio of two columns. Or it can work across tables – say we want to select a customer name from two tables: we want it from this table here unless it's missing, in which case we look it up in that table over there – an integration rule. Business rule results are stored in business Satellites. These look like other Satellites, except the record source column records the business rule rather than a source system.
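That integration rule can be sketched with a COALESCE – the satellite names and the rule name here are hypothetical:

```python
# Sketch of an integration rule: prefer the CRM name, fall back to billing
# when it's missing. Table names and the rule name are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer_crm     (customer_hk TEXT, name TEXT);
CREATE TABLE sat_customer_billing (customer_hk TEXT, name TEXT);
INSERT INTO sat_customer_crm     VALUES ('HK1', 'Alice Ltd'), ('HK2', NULL);
INSERT INTO sat_customer_billing VALUES ('HK1', 'ALICE LIMITED'),
                                        ('HK2', 'Bob plc');
""")

# The result set would be loaded into a business Satellite whose record
# source is the rule name, e.g. 'BR_BEST_NAME' (illustrative).
rows = conn.execute("""
SELECT c.customer_hk,
       COALESCE(c.name, b.name) AS best_name,
       'BR_BEST_NAME'           AS record_source
FROM sat_customer_crm c
JOIN sat_customer_billing b ON b.customer_hk = c.customer_hk
ORDER BY c.customer_hk
""").fetchall()
print(rows)
```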
Business rule Satellite
Business rule Satellites hang off the same Hubs or Links that the raw Satellites use. A common use case for a business rule Satellite is to select the best view of records across a set of source systems. If sources disagree, which one should we prefer for the date of birth, or name, or address, and so on? We could code that decision process as a rule, run it after each raw load, and populate a clean business rule Satellite to store the results. Another feed for the Business Vault is the results of data science.
Let's say a data scientist has run some experiments and found something interesting and useful for the business. So they've built a model and deployed it into production. The model takes some data from a Data Lake and our Data Vault, and it produces some new data. As far as the Data Vault is concerned, the model is really just another source system. So we feed the results into a staging layer and load them back into the Vault as normal. Finally, you might want to measure data quality.
One way of doing this is to write views on the data to expose quality problems – you could, for example, expose orphaned Satellites, malformed Hub keys, missing dates, or out-of-range data, and so on. If the data is OK, the views are empty. If not, we can see the offending records in the view. These views are almost a form of Satellite that we can attach to a Hub or Link record, giving the rule its context. So how can we implement our business rules? One option is to calculate our business Satellites and populate them directly from a business rule. But perhaps there's a better way.
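The quality-view idea described above can be sketched as a LEFT JOIN that returns only satellite rows with no matching Hub key – all names are illustrative:

```python
# Sketch of a data quality view: expose satellite rows whose Hub key is
# missing (orphans). An empty view means the check passes. Names invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, name TEXT);
INSERT INTO hub_customer VALUES ('HK1');
INSERT INTO sat_customer VALUES ('HK1', 'Alice'), ('HK9', 'Orphan');

CREATE VIEW qv_orphan_sat_customer AS
SELECT s.*
FROM sat_customer s
LEFT JOIN hub_customer h ON h.customer_hk = s.customer_hk
WHERE h.customer_hk IS NULL;
""")

rows = conn.execute("SELECT * FROM qv_orphan_sat_customer").fetchall()
print(rows)  # [('HK9', 'Orphan')] -- the offending record
```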
Let's say we have our regular Data Vault load, as illustrated: stage data through ETL into the Raw Vault. We then apply business rules to that raw data. If we treat our rules engine as another source system, we can feed its output back to the staging layer and then load the data just as we do any source data. As we've already built robust loading utilities, we can use them to load that Business Vault data properly.
Well, that’s a quick overview of the Business Vault. In summary – don’t be confused – the Business Vault is just a part of the Vault, not a separate layer. Business rules can create new Satellites and sometimes they may create a Link or Hub too. There are different types of rule – summaries, derivations, helper tables, data science output, and quality checks.