Data is becoming an increasingly valuable asset to companies as the tools and technology to exploit its value continue to develop. However, while good quality data analysed properly can bring value, the costs incurred from poor quality data can be considerable.
A report by the Data Warehousing Institute in 2002 estimated business costs of $600 billion arising from poor data quality. Gartner Group in 2013 suggested data quality problems cost American companies on average $14.2 million a year, while in a recent report IBM put the total cost of poor data quality to the US economy at a staggering $3.1 trillion a year. These figures need to be interpreted with some caution (these organizations supply data quality management solutions, so it is probably in their interests that such numbers are high). Even so, their scale indicates that data quality management has the potential to deliver huge value to organizations that implement it correctly.
In order to manage data quality one first needs to define it. This is slightly more challenging than it may at first appear. Generally, data quality is thought to be synonymous with fitness for purpose. In other words, data is said to be of good quality if it satisfies the requirements of its intended use. This means that data quality is not a wholly intrinsic property of a data set: the same data can be good for one use and poor for another.
Different approaches to defining data quality have been taken: empirical, theoretical and ontological. Such efforts have for the most part resulted in lists of attributes or dimensions describing data quality (accuracy, completeness, consistency and so on) and categories into which these attributes are grouped (Representational, Security, Accessibility and so on). There are a number of problems with this approach:
1) There are a large number of possible attributes describing data quality and little consensus as to which are essential.
2) Attributes can map onto more than one category which makes measurement difficult.
3) The same attributes are sometimes defined differently in different places in the literature.
4) Measures of data quality are often treated as if they are measuring something intrinsic to the data when in fact they aren’t.
A group of researchers, sometimes referred to as the Italian School, have drawn attention to the problem of purpose dependence. In particular, the work of Phyllis Illari is relevant here. Consider the example of a marketer using a data set to determine who to post marketing material to and a financial analyst using the same data set to score credit risk. The marketer will likely need a higher level of accuracy in the address fields than the analyst if the data is to be fit for purpose. Illari points out, however, that it is not just that purpose determines how accurate is accurate enough, as in the example above, but that purpose permeates accuracy and all other attributes of data quality. This is because when trying to represent a real-world system in a data store, the features we choose to represent and the data used to represent them are informed by our purpose.
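The marketer and analyst example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Purpose` class, the `fit_for_purpose` function, the sample records and the threshold values are all hypothetical, and completeness (whether a field is filled in at all) is used as a simple stand-in for accuracy, which would need a trusted reference to measure.

```python
from dataclasses import dataclass


@dataclass
class Purpose:
    """A hypothetical description of what one use of the data requires."""
    name: str
    # Minimum fraction of records that must have a usable value, per field.
    # The thresholds used below are invented for illustration.
    required_completeness: dict[str, float]


def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def fit_for_purpose(records, purpose):
    """True if every field the purpose cares about meets its threshold."""
    return all(
        completeness(records, field) >= threshold
        for field, threshold in purpose.required_completeness.items()
    )


# One data set, made up for this example: two of four addresses are missing.
records = [
    {"name": "A", "address": "1 High St", "income": 30000},
    {"name": "B", "address": "", "income": 45000},
    {"name": "C", "address": "3 Low Rd", "income": 52000},
    {"name": "D", "address": None, "income": 61000},
]

# Two purposes with different requirements of the same fields.
marketing = Purpose("mailing campaign", {"address": 0.95})
credit = Purpose("credit scoring", {"address": 0.50, "income": 0.95})

for purpose in (marketing, credit):
    print(purpose.name, "->", fit_for_purpose(records, purpose))
```

With these made-up numbers the same records fail the marketer's requirement and pass the analyst's, so the quality verdict belongs to the pairing of data and purpose rather than to the data alone. The sketch only captures the weaker point about thresholds; Illari's stronger claim, that purpose also shapes which fields exist and how they are recorded, is not something a threshold check can show.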
Purpose dependency has practical implications. Business intelligence practitioners typically use data warehouses in their work. Data warehouses are large repositories of information collected from various sources inside and sometimes outside the organisation, which have been processed and integrated into an optimal format for query and analysis. Data contained in data warehouses may be repurposed data, that is, data used for a purpose other than that for which it was originally intended. Purpose dependency implies that systems designed to provide optimal data quality will likely need to have a high degree of domain specificity. Organisations that optimise data quality for one purpose may then face difficulties if this data needs to be repurposed.
The above is just a (very) brief introduction to data quality and purpose dependence. If you are interested in reading more about data quality, check out the work of Richard Wang and colleagues at MIT, who were instrumental in setting up the International Conference on Information Quality (ICIQ). Also worth reading is Illari and Floridi's paper on IQ-Purpose and Dimensions from the 2012 conference.