What is data quality?

Very often, when we talk or write about data quality in our industry, the discussion seems superficial and lacks depth. There is a lot of room for misunderstandings, which can render the discussion as a whole fruitless. This post should help to classify the arguments in the quality discussion and bring more depth to it.

The five approaches to quality

Let’s start with a general framework of quality that helps locate arguments in their semantic field. According to David Garvin (1984), there are five principal approaches to quality.

Transcendent Approach

According to the transcendent approach, quality is an innate excellence that is absolute and universal: “High quality data needs to be perfect and flawless.” A general problem is that it’s actually pretty hard to tell what “perfect data” looks like and how to achieve it. Nevertheless, this approach is fairly common in research. Validity as a “transcendent goal”, for example, often leads to the trouble of finding a good trade-off between internal and external validity.

Product-based Approach

The product-based approach views quality as the result of the right ingredients and attributes of the product – in our case, data. “High quality data has carefully selected respondents in the sample, who write a lot of words into open text fields.” Here, data quality is quite tangible and can be measured precisely. However, this understanding of quality is very formalistic and therefore often too superficial.

User-based Approach

The user-based approach starts from the premise that different users may have different wants and requirements. Here, the highest data quality is whatever best satisfies these needs. Hence, data quality is highly individual and subjective: what is high quality for one user can be average or poor quality for another.

Manufacturing-based Approach

The manufacturing-based definition focuses on the process of producing data – or, in research terminology, on methodology: “Good data is collected in adherence to scientific standards and the best practices of our industry.” While this approach makes data highly comparable, it sometimes doesn’t fit the researcher’s task at hand.

Value-based Approach

Last but not least, there is the value-based approach, which sees quality as a positive return on investment (or, more specifically, Return on Insight). Here, data is of high quality if the cost of collecting it is minimal while the benefit of using it is maximal. At first sight, this approach seems reasonable, but it also has its downsides: it doesn’t say much about the properties of the data itself, but rather about the information needs of the user.

The five approaches to data quality: innate excellence, methodology & process, return on insight, data properties, and user requirements

Competing Views on Quality

All these approaches often lead to competing views on quality. Data collectors, for example, may pay attention to methodology and data formats, while research buyers focus on their individual needs and the Return on Insight. Even within companies there can be different perspectives: members of the sales or marketing department may see the customers’ perspective as paramount, while project managers see quality in terms of well-defined specifications and processes. Being aware of these different views can help improve the communication about quality – and consequently the quality itself.

But even if you have everyone on the same page, you may have difficulty finding the right approach. Take observational data as an example: this method can be the best choice to answer your research questions, but you may also run into complex data formats, missing values or outliers. This, in turn, can affect the return on insight and demand a different approach.
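
To make this tangible, here is a minimal sketch of the kind of screening such data may require: it counts missing values and flags outliers using Tukey’s fences. The function name, the example values and the threshold k are illustrative assumptions, not part of any specific toolchain.

```python
import pandas as pd

def screen_column(series: pd.Series, k: float = 1.5) -> dict:
    """Report missing values and flag outliers via Tukey's fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return {
        "missing": int(series.isna().sum()),
        "outliers": int(outliers.sum()),
    }

# Illustrative usage with made-up observation data (durations in seconds):
durations = pd.Series([12.0, 14.5, None, 13.2, 480.0])
print(screen_column(durations))  # {'missing': 1, 'outliers': 1}
```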

To keep it short: it’s not easy to tell what data quality actually is. Everyone claims to have it, but a closer look reveals that the corresponding arguments often fall apart. It would probably be naïve to merely call for a more holistic perspective, as the different approaches stand in an innate tension. This doesn’t mean that data quality is an illusion or arbitrary, but it reminds us that data quality requires effort and doesn’t fall into place by itself. In any case, data quality starts with good communication about what is expected.


In the previous section, we explored a theoretical framework for categorizing arguments about data quality. With this broader perspective in mind, we will now turn to the practical side: what matters most, and how we can achieve it.

The Empirical Approach

Richard Wang and Diane Strong conducted a very interesting piece of research in the 1990s. In the first step, they asked data consumers to list all attributes that came to mind when thinking about data quality. In the second step, these attributes were ranked by importance. A factor analysis consolidated the initial 179 attributes into a smaller set of data quality dimensions in four major categories.

Intrinsic Data Quality

Intrinsic Data Quality includes “Accuracy” and “Objectivity”, meaning the data needs to be correct and free of partiality. While these two dimensions seem pretty self-explanatory, “Believability” and “Reputation” are less obvious. Interestingly, they are not about the data itself but about its source – either the respondents or the fieldwork provider: respondents need to be real and authentic, while the fieldwork provider should be trustworthy and reputable.

Contextual Data Quality

Contextual Data Quality means that some aspects of data quality can only be assessed in light of the task at hand. As this context can vary a lot, attaining high contextual data quality is not always easy. Most of the contextual dimensions (Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data) require thorough planning before setting up and conducting the research. Conversely, it is hard to improve contextual data quality once the data has been collected (e.g. by sending reminders to improve completeness).
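
To make “Completeness” a bit more concrete, here is a minimal sketch of a per-column completeness check; the function name and the example columns are illustrative assumptions.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> pd.Series:
    """Share of non-missing values per column, lowest first."""
    return df.notna().mean().sort_values()

# Illustrative usage with made-up survey data:
responses = pd.DataFrame({
    "age": [34, 27, None, 51],
    "open_text": ["good", None, None, "fine"],
})
print(completeness(responses))
# open_text    0.50
# age          0.75
```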

Representational Data Quality

Representational data quality refers to the way data is formatted (concise and consistent) and the degree to which meaning can be derived from it (interpretability and ease of understanding). Simply imagine the data validation routines of an online survey. When asking for a respondent’s age, for example, you would make sure everyone (consistency) enters the age in whole years (conciseness), or even within the age groups you’re particularly interested in (ease of understanding). In any case, the respondent is prevented from submitting erroneous or extreme values (interpretability).
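
As a minimal sketch of such a validation routine – the function name, age limits and behaviour are illustrative assumptions, not a description of any particular survey platform:

```python
def validate_age(raw: str, min_age: int = 18, max_age: int = 99) -> int | None:
    """Validate an age answer: whole years only, within plausible bounds."""
    try:
        age = int(raw)                 # enforce whole years (conciseness)
    except (TypeError, ValueError):
        return None                    # reject anything that isn't a whole number
    if not min_age <= age <= max_age:  # block erroneous or extreme values
        return None
    return age

print(validate_age("34"))    # 34
print(validate_age("34.5"))  # None – not a whole number
print(validate_age("140"))   # None – outside the plausible range
```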

Accessibility Data Quality

The two dimensions within this category can work against each other and therefore require a good balance: Accessibility is about how easily and effortlessly data can be retrieved, while Access Security is about how access can be limited and controlled. These aspects have received increasing attention in recent years – think of online dashboards or data warehouses.

Towards excellent data quality

As you can see, “Intrinsic Data Quality” mainly depends on selecting the right data source, “Contextual Data Quality” on planning the study thoroughly, “Representational Data Quality” on collecting the data in the right way, and “Accessibility Data Quality” on reporting the data correctly. Or, more generally: at each stage of the research process we have to deal with different tasks and challenges in order to achieve the best possible outcome.

In the first section, we discussed how different perspectives on data quality can compete. While it remains true that the requirements of all stakeholders need to be addressed, it is possibly even more important that every link in the value chain contributes to the overall quality when collecting and processing the data. As research has become a complex process with divided responsibilities, we have to make sure that quality standards are met throughout the whole process.
