Using a Data Quality Framework to Accelerate Measuring Data Quality
Let’s face it: doing data governance and data quality right is hard. If it were not, why would so many organizations struggle to start and sustain a data governance program? Perhaps the reason is that organizations have a difficult time really understanding and quantifying data quality issues. They need a repeatable approach to managing the full life cycle of data, one that leads to consistency, accuracy, and scalability and supports a high degree of automation. This is where a data quality framework can assist and provide real business benefits.
The term “framework” is often overused and perhaps misunderstood. For the purposes of this discussion, I am defining a framework as a formal architectural artifact. This artifact provides guidance by connecting abstract ideas, verbally and graphically, in a way that clearly shows their relationships, context and meaning so as to create tangible outcomes. It goes beyond an academic exercise to provide project teams with a blueprint that guides the articulation and development of processes and solutions. It can also help to illuminate build vs. buy decisions and to ensure the solution is comprehensive, that is to say, no required functionality has been left unaccounted for.
Seacoast Bank developed a high-level data quality framework that models six essential elements: (1) Data Governance; (2) Data Curation; (3) Data Automation; (4) Data Instrumentation; (5) Publication; and (6) Remediation. Let’s explore each of these elements and highlight its areas of focus and accompanying essentials.
Assuming the end goal in this discussion is to elevate the overall “goodness” or quality of data, it is of paramount importance that ownership of business or data domains is not only identified but also formally documented and agreed upon by the owners. The primary responsibility of the data owner is to ensure direct support of data stewards and to provide resources to define and approve all data quality rules being implemented. Without data owners, there is no easy way to assign responsibility for correcting the data, so accountability is the key.
One of the biggest challenges to measuring and quantifying data quality is common to most organizations: data resides in dozens if not hundreds of data sources. Having a data warehouse at the disposal of the data quality team will help to consolidate and centrally manage data. The process of curation will inevitably highlight issues with the data, and it provides a way to standardize and format the data so that it is optimized for both data quality analysis and reporting. At Seacoast Bank, we implemented a logical or virtual data warehouse on top of a hosted environment. The hosted environment was implemented as a large collection of star schemas broken down by subject area.
By implementing virtualized views on top of the data warehouse, we were able to take advantage of the data curation that naturally took place on the hosted environment. The virtual views greatly simplified the data quality assessment process by eliminating complex SQL joins that typically would be needed as part of the data profiling activities.
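To illustrate how a view can hide star-schema joins from the profiling step, here is a minimal sketch using an in-memory SQLite database. The table, column and view names (fact_account, dim_customer, dim_product, v_deposit_account) and the sample rows are hypothetical, and SQLite stands in for whatever warehouse platform is actually in use.

```python
import sqlite3

# Hypothetical star schema: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_account (
    account_id   TEXT PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    open_date    TEXT,
    closed_date  TEXT
);
INSERT INTO dim_customer VALUES (1, 'Ada Example'), (2, 'Bo Sample');
INSERT INTO dim_product VALUES (10, 'Checking'), (20, 'Savings');
INSERT INTO fact_account VALUES
    ('A-1', 1, 10, '2020-01-15', NULL),
    ('A-2', 2, 20, '2021-06-01', '2021-03-01'),
    ('A-3', 1, 20, '2019-09-30', '2022-02-10');

-- The view performs the joins once, so profiling can treat it as a flat table.
CREATE VIEW v_deposit_account AS
SELECT f.account_id, c.customer_name, p.product_name, f.open_date, f.closed_date
FROM fact_account AS f
JOIN dim_customer AS c ON c.customer_key = f.customer_key
JOIN dim_product  AS p ON p.product_key  = f.product_key;
""")

# Profiling queries against the view need no joins of their own.
total = conn.execute("SELECT COUNT(*) FROM v_deposit_account").fetchone()[0]
null_closed = conn.execute(
    "SELECT COUNT(*) FROM v_deposit_account WHERE closed_date IS NULL"
).fetchone()[0]
print(f"{null_closed} of {total} accounts have no closed date")
```

The profiling queries at the end touch only the view, which is the simplification described above: the join logic lives in one place instead of being repeated in every data quality rule.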
A key aspect of maintaining timely and trusted data quality measures is to automate everything that can be automated. While some parts of the end-to-end process require hands-on analysis, nearly all other operations can and ultimately should be automated. One example of a manual touchpoint is working with business subject matter experts to define and craft the data quality rules. That activity requires conversation, adds insight to the data, and simply cannot be automated. For data analysis, we use tools to initially profile the data, but once we do, we take the very same rules and implement them as data quality jobs that run on a scheduled basis. They can run as frequently as the data changes, but once a day is generally sufficient for most sources. The process we use to roll up or aggregate results, which I will discuss in more detail under data instrumentation, is also a scheduled process. Lists of records that fail tests can also be designed into the data quality processing stream, and those lists can be sent as email attachments to distribution lists or handled by a workflow management system. Our future plans call for “tickets” to be created automatically via a ServiceNow API when some data quality threshold is met. For example, the trigger could be a set of loan records that fail one or more tests on data accuracy or completeness.
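The flow described above (run a rule, collect the failing records, and raise a ticket when a threshold is crossed) can be sketched as follows. This is an illustration under stated assumptions, not the bank's implementation: the names RuleResult, run_rule and ticket_payload are hypothetical, the threshold is read as "pass rate falls below a minimum," and the payload is a plain dictionary that a scheduler could hand to an email job or ticketing client. No real ServiceNow call is shown.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    rule_name: str
    total: int
    failed_ids: list

    @property
    def pass_pct(self) -> float:
        # An empty source has nothing failing, so treat it as fully passing.
        if self.total == 0:
            return 100.0
        return 100.0 * (self.total - len(self.failed_ids)) / self.total


def run_rule(rule_name, records, passes, key="loan_id"):
    """Apply one rule to every record and keep the keys of the failures."""
    failed = [r[key] for r in records if not passes(r)]
    return RuleResult(rule_name, len(records), failed)


def ticket_payload(result, threshold_pct):
    """Return a ticket payload if the pass rate is below the threshold, else None."""
    if result.pass_pct >= threshold_pct:
        return None
    return {
        "short_description": f"Data quality rule failed: {result.rule_name}",
        "pass_pct": round(result.pass_pct, 2),
        "failed_ids": result.failed_ids,
    }


# Hypothetical loan records; one is missing its interest rate.
loans = [
    {"loan_id": "L-100", "interest_rate": 4.25},
    {"loan_id": "L-101", "interest_rate": None},
    {"loan_id": "L-102", "interest_rate": 6.10},
    {"loan_id": "L-103", "interest_rate": 5.00},
]

result = run_rule(
    "interest rate is populated",
    loans,
    lambda r: r["interest_rate"] is not None,
)
payload = ticket_payload(result, threshold_pct=95.0)
print(result.pass_pct, result.failed_ids)
```

In a scheduled job, the same run_rule call would execute once per rule per run, and only a non-empty payload would be forwarded to the distribution list or workflow system.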
Data instrumentation is about putting hooks or rules in place to fully understand the level of quality in your data. The more widely the hooks are dispersed among all your data assets, the better the overall assessment or measurement will be. We measure data quality along two aspects: data or business domain, and data quality dimension. Because Seacoast Bank is a financial services company, our data domains are deposits, loans, customer data and so on. Our data quality dimensions are standardized on widely recognized terms such as accuracy, completeness, conformity, consistency, uniqueness and validity. Every data quality rule is tagged with two values: the data domain and the data quality dimension with which it is associated. For example, if we had a data quality rule that stated “account closed dates must be greater than account open dates,” then the rule would be tagged with the “deposit” data domain and the “validity” data quality dimension. There is some subjectivity as to which data quality dimension is assigned to which rule, but the most important thing is that one and only one dimension is assigned to each rule. Once all the data quality rules are defined as described, it is relatively easy to aggregate the results by data domain, data quality dimension or both.
As part of our data quality framework, we implemented the concept of a data quality index. The data quality index (DQI) is the computed average of the results of all the data quality rules implemented. Each rule is always constructed to report the percentage of records passing it, so the DQI measures the level of “goodness” of our data on a scale from 0 percent (very poor quality) to 100 percent (perfect quality). An index is also constructed at the data quality dimension level, referred to as the key quality indicator (KQI). The KQI measures how good your data quality is for accuracy, completeness and so on.
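The aggregation described above can be sketched in a few lines. The rule names and pass percentages below are made-up sample values, and index_by is a hypothetical helper. The sketch assumes a simple unweighted average, which is how the text describes the index.

```python
from collections import defaultdict
from statistics import mean

# Each rule carries exactly one data domain and one data quality dimension,
# plus the percentage of records that passed it on the latest run.
rules = [
    {"rule": "closed date after open date", "domain": "deposits",
     "dimension": "validity", "pass_pct": 98.0},
    {"rule": "tax id is populated", "domain": "customer",
     "dimension": "completeness", "pass_pct": 90.0},
    {"rule": "interest rate is populated", "domain": "loans",
     "dimension": "completeness", "pass_pct": 80.0},
    {"rule": "one record per customer", "domain": "customer",
     "dimension": "uniqueness", "pass_pct": 100.0},
]


def index_by(rules, tag):
    """Average the pass percentages of the rules that share a tag value."""
    groups = defaultdict(list)
    for r in rules:
        groups[r[tag]].append(r["pass_pct"])
    return {value: mean(pcts) for value, pcts in groups.items()}


dqi = mean(r["pass_pct"] for r in rules)   # overall index across all rules
kqi = index_by(rules, "dimension")         # one index per quality dimension
by_domain = index_by(rules, "domain")      # one index per data domain

print(dqi)        # 92.0
print(kqi)        # {'validity': 98.0, 'completeness': 85.0, 'uniqueness': 100.0}
print(by_domain)  # {'deposits': 98.0, 'customer': 95.0, 'loans': 80.0}
```

Because each rule has exactly one dimension and one domain, every rule contributes to exactly one group in each breakdown, so the groupings never double count a rule.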
Think of the DQI as a stock market index: over time, it will fluctuate. Just as a stock market index is made up of a large number of independent measures of individual stocks, the DQI is an average of a large number of independent data quality measures.
Once the DQI is calculated, stakeholders will want to know how the data is performing in terms of quality. The most comprehensive way to publicize this is to have all measures published on your business analytics or reporting platform. At Seacoast Bank, we have created a data quality dashboard that shows the overall DQI and the key quality indices for both data domains and dimensions, with drill-downs to all the rules that make up those indices. We are also able to see the trend line of the DQI and the trending of passing and failing rules. Leveraging the visual features of our reporting tool, we use bar charts to show which dimensions contribute more to the overall index. Our index is shown as a color-coded speedometer, so stakeholders can see at a glance whether we are in the “green” or whether there are opportunities for improvement.
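A color-coded gauge like the one described needs a mapping from the index value to a color band. The sketch below shows one way to do that. The function name dqi_band and the cut-off values of 90 and 75 are assumptions chosen for illustration, since the article does not state which thresholds the dashboard uses.

```python
def dqi_band(dqi, green_at=90.0, amber_at=75.0):
    """Map a DQI percentage to a gauge color; the cut-offs are illustrative."""
    if not 0.0 <= dqi <= 100.0:
        raise ValueError(f"DQI must be between 0 and 100, got {dqi}")
    if dqi >= green_at:
        return "green"
    if dqi >= amber_at:
        return "amber"
    return "red"


# Hypothetical daily DQI values, as a trend line would show them.
history = [88.5, 91.0, 92.0]
bands = [dqi_band(value) for value in history]
print(bands)  # ['amber', 'green', 'green']
```

Keeping the cut-offs as parameters lets each data domain or dimension use its own targets without changing the banding logic.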
Having already explored defining data ownership, automating processing and developing specific data quality measures, we’ll now address the last component of our data quality framework: ensuring there is a means to improve data quality. The goal is that, by consistently managing the data life cycle through updates and corrections, data will stay in or move into alignment with expected values. Ideally, your data quality platform should provide easy access to the collections of data that need attention, for example as categorized lists. As mentioned previously, it is best that these lists be generated automatically by data quality testing and distributed to data stewards. In cases where the correct data values can be identified, even the remediation steps can be automated. Either way, it is important to keep a record of who changed what and when, both for auditing purposes and for analytics.
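Keeping a record of who changed what and when can be as simple as writing an audit entry alongside every correction. The following is a minimal sketch, assuming records are plain dictionaries with an "id" key. AuditEntry, apply_correction and the sample account are hypothetical names and data, and a real system would persist the log to a database rather than hold it in a list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    changed_by: str
    changed_at: datetime


def apply_correction(record, field_name, new_value, changed_by, audit_log):
    """Update one field in place and append who changed what and when."""
    entry = AuditEntry(
        record_id=record["id"],
        field_name=field_name,
        old_value=record.get(field_name),
        new_value=new_value,
        changed_by=changed_by,
        changed_at=datetime.now(timezone.utc),
    )
    record[field_name] = new_value
    audit_log.append(entry)
    return entry


# Hypothetical account that failed the "closed date after open date" rule.
account = {"id": "A-2", "open_date": "2021-06-01", "closed_date": "2021-03-01"}
audit_log = []
apply_correction(account, "closed_date", "2022-03-01", "steward.jdoe", audit_log)
print(account["closed_date"], audit_log[0].old_value, audit_log[0].changed_by)
```

The same function serves both manual fixes by a data steward and automated remediation. Only the changed_by value differs, which keeps the audit trail uniform for later analysis.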
Implementing a data quality framework can bring speed, standardization, and visibility to understanding the level of goodness in an organization’s data. At Seacoast Bank, with a plan in hand, we were able to quickly implement a large number of rules over a two-month period that quantified data quality and allowed us to improve it. By adopting some or all of what we implemented, your organization will be on the right path to giving your stakeholders greater visibility into your data quality and how you are improving it.