Data Vault: Is it the best approach?
Data Vault by definition is “A System of Business Intelligence containing the necessary components needed to accomplish enterprise vision in Data Warehousing and Information Delivery”. But what does this really mean? Data Vault is a blend of Methodology that is consistent and repeatable; Architecture that is multi-tier; and Modeling that is flexible and scalable for Big Data. The golden rule of Data Vault is that 100% of the data is loaded into the Enterprise Data Warehouse 100% of the time and must be 100% loaded to be considered complete. Data is loaded as soon as it is available. Smaller data sets that load more frequently are preferred over large batch processing in Data Vault.
The Methodology used in Data Vault is designed to be agile by nature. It begins with properly scoping deliverables into 2- to 3-week ‘sprints’. The short turnaround within each sprint has many benefits: it gives the team an attainable and measurable goal while staying within scope, managing risks, adhering to governance, and combining best practices for BI. These repeatable processes improve with each sprint while incrementally building consumable solutions.
Within Architecture and Modeling, units of work are multi-tiered so that the process can handle both volume and velocity. The design scales to the variety of data and results in an auditable structure. These are consistent and repeatable processes by nature. Hard and soft rules are also applied. Hard rules are those that do not change the content of individual fields or the grain of the data. Soft rules change or interpret the data, or change its grain. Hard rules are applied on the way in to the data warehouse (rules in), and soft rules are applied on the way out (rules out). The soft rules provide managed Self-Service BI for the business with minimum IT involvement.
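As a rough illustration of the split between hard and soft rules, here is a minimal Python sketch. The field names (customer_id, order_date, amount) and both function names are hypothetical and not part of any Data Vault standard; the hard rule only aligns a data type, while the soft rule interprets the data and changes its grain from one row per order to one row per customer.

```python
from collections import defaultdict
from decimal import Decimal


def hard_rule_in(raw_row: dict) -> dict:
    """Hard rule (rules in): cast types only; field content and grain are unchanged."""
    return {
        "customer_id": raw_row["customer_id"],
        "order_date": raw_row["order_date"],
        "amount": Decimal(raw_row["amount"]),  # text -> exact decimal, same value
    }


def soft_rule_out(rows: list) -> dict:
    """Soft rule (rules out): aggregate orders per customer, which changes the grain."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


# Hypothetical source rows, one per order.
raw_rows = [
    {"customer_id": "C1", "order_date": "2024-01-05", "amount": "10.50"},
    {"customer_id": "C1", "order_date": "2024-01-09", "amount": "4.50"},
    {"customer_id": "C2", "order_date": "2024-01-07", "amount": "7.00"},
]

loaded = [hard_rule_in(row) for row in raw_rows]  # still one row per order
report = soft_rule_out(loaded)                    # one entry per customer
```

Keeping the interpretation in soft_rule_out means the loaded rows stay auditable against the source, and a changed business definition only requires changing the rule applied on the way out.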
Data Vault is built around business keys, each of which uniquely identifies a business entity (following the business model closely). Those keys are distributed across a series of hubs, links, and satellites. Hubs contain a list of unique business keys with a low propensity to change. Links represent the relationships between business keys. Satellites contain the descriptive attributes of the data (for both business keys and their relationships), including the attributes’ changes over time. Because the business keys are carried throughout the load as the primary key structure that associates the three entities, no ‘look-up’ is required, which increases efficiency and parallelism at load time.
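The three entity types can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not a reference implementation: the Vault class, its method names, and the sample keys are all hypothetical, and a real Data Vault would live in database tables. The sketch shows that every structure is keyed by the business key itself, so a satellite or link row can be written without first looking anything up in a hub.

```python
from dataclasses import dataclass, field


@dataclass
class Vault:
    """Toy in-memory model; every structure is keyed by business keys."""
    hubs: dict = field(default_factory=dict)        # hub name -> set of business keys
    links: dict = field(default_factory=dict)       # link name -> set of key tuples
    satellites: dict = field(default_factory=dict)  # sat name -> list of (key, load_ts, attrs)

    def load_hub(self, hub: str, business_key: str) -> None:
        # A hub keeps each business key once, however often it arrives.
        self.hubs.setdefault(hub, set()).add(business_key)

    def load_link(self, link: str, *business_keys: str) -> None:
        # A link records a relationship between business keys.
        self.links.setdefault(link, set()).add(tuple(business_keys))

    def load_satellite(self, sat: str, key: str, attrs: dict, load_ts: str) -> None:
        # A satellite is insert-only: append a row only when the attributes changed.
        history = self.satellites.setdefault(sat, [])
        previous = [a for k, _, a in history if k == key]
        if not previous or previous[-1] != attrs:
            history.append((key, load_ts, dict(attrs)))


vault = Vault()
vault.load_hub("customer", "C1")
vault.load_hub("customer", "C1")  # duplicate key, hub is unchanged
vault.load_hub("order", "O9")
vault.load_link("customer_order", "C1", "O9")
vault.load_satellite("customer_details", "C1", {"city": "Oslo"}, "2024-01-01")
vault.load_satellite("customer_details", "C1", {"city": "Oslo"}, "2024-01-02")    # no change
vault.load_satellite("customer_details", "C1", {"city": "Bergen"}, "2024-01-03")  # change kept
```

After these calls the customer hub holds a single key, and the satellite holds two rows for C1, which is how attribute history is preserved over time.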
Implementing a Data Vault allows code generation and enhances automation. With proper modeling, it provides consistency and scalability as well as fault tolerance. Increased parallelism and reduced volumes during the load allow hubs, links, and satellites to load simultaneously, which decreases overall execution time. No data is ever deleted from the Data Vault, under the philosophy that all data is relevant.
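To make the idea of simultaneous loading concrete, the following Python sketch runs three independent loaders over the same staged rows using the standard library's ThreadPoolExecutor. The loader functions, the row fields, and the LOADERS mapping are hypothetical, chosen only for illustration. The loaders can run side by side because each one derives what it needs from the business keys already present in every row, so none waits on another.

```python
from concurrent.futures import ThreadPoolExecutor


def build_hub(rows: list) -> set:
    # Unique customer business keys.
    return {r["customer_id"] for r in rows}


def build_link(rows: list) -> set:
    # Relationships between customer and order business keys.
    return {(r["customer_id"], r["order_id"]) for r in rows}


def build_satellite(rows: list) -> list:
    # Descriptive attributes, keyed by the customer business key.
    return [(r["customer_id"], {"name": r["name"]}) for r in rows]


# Hypothetical mapping of target structures to their loaders.
LOADERS = {"hub": build_hub, "link": build_link, "satellite": build_satellite}


def load_all(rows: list) -> dict:
    """Run every loader concurrently; no loader depends on another's output."""
    with ThreadPoolExecutor(max_workers=len(LOADERS)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in LOADERS.items()}
        return {name: future.result() for name, future in futures.items()}


staged = [
    {"customer_id": "C1", "order_id": "O1", "name": "Ada"},
    {"customer_id": "C1", "order_id": "O2", "name": "Ada"},
    {"customer_id": "C2", "order_id": "O3", "name": "Lin"},
]
result = load_all(staged)
```

In a real warehouse the gain would come from the database executing the three loads concurrently; threads here only stand in for that behavior.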
The Data Vault Methodology includes multiple components of CMMI Level 5, best practices from Six Sigma, and the total quality management life cycle. Data Vault projects have short, scope-controlled release cycles that are repeatable, consistent, and measurable, making the methodology a valuable tool for handling Big Data.
Is Data Vault the best approach? It offers a unique solution to business and technical problems alike. It is focused squarely on data integration efforts across the enterprise and is built from solid foundational concepts. Many of its benefits are a by-product of that basic engineering. Sticking to the foundational rules and standards will help get an integration project off the ground quickly and easily. This proven approach can comprehensively address the data management needs of a data warehouse. Contact us today if you would like to hear more.