Data Quality - What is it and how good does my data need to be?
Faulty data brings problems and costs. We explain what data quality is and how good your data should be
Münster, Muenster, MÜNSTER or MUENSTER, 0000-0000-00 as a customer contact number, 99/99/99 as a purchase date... The list of examples of faulty data is long, and the problems and costs of poor data quality are real: from failing to reach a customer, to addressing them incorrectly in a newsletter, to incorrect invoicing, to name just a few. Decisions based on bad data cannot be good ones. According to a survey by Experian Marketing Services, 73% of German companies believe that inaccurate data prevents them from delivering an outstanding customer experience. Good data quality is thus crucial for a company's day-to-day operations and, above all, a key success factor for Data Science projects. But what does data quality actually mean, how good does the data have to be for a Data Science project, and how can you check the quality of your data? We will address these questions in this article.
WHAT IS DATA QUALITY AND WHY IS DATA QUALITY SO IMPORTANT?
Definition: Data quality describes how well the data is suited for intended applications. In this context, we therefore also speak of "fitness for use", i.e. the suitability of the data for the intended purpose. The quality of data is thus very context-dependent. While the quality of data may be sufficient for one particular use case, it may still be insufficient for another.
And why is it so important? In a Data Science project, everything is based on data as a resource. In the project, data from a wide variety of sources is brought together and then analyzed. Your data thus serves as input for any analysis model. So, true to the adage "garbage in, garbage out", even a sophisticated algorithm is of no use if the quality of the data is poor. Even though a data science project can fail for many reasons, the success of the project often depends on the quality of the available data.
Investments in measures that ensure the quality of the data are therefore crucial for the success of a project, and they pay off well beyond the project itself. After all, a lack of data quality can result in considerable costs for a company.
POOR DATA QUALITY COSTS
- On average, faulty data costs companies up to $15 million in lost revenue (Gartner's Data Quality Market Study). A study published in MIT Sloan Management Review puts the cost of poor data quality at 15% to 25% of revenue.
- 50% of IT budgets are spent on data reprocessing (Zoominfo).
- Once a record has been captured, it costs $1 to verify it, $10 to clean it, and $100 if it remains erroneous (Zoominfo).
Fundamentally, however, the consequences of poor data quality reach far beyond financial losses. They range from effects on employee confidence in decisions and on customer satisfaction to productivity losses (e.g., due to additional time required for data preparation) and compliance problems.
WHAT ARE THE SOURCES OF POOR DATA QUALITY?
The sources of poor data quality can be very diverse, as the following graphic illustrates. First and foremost, however, is the data entry process, whether by employees or customers.
The Sources of Poor Data Quality (Source: The Data Warehousing Institute, 2002, Data Quality and the Bottom Line)
HOW CAN YOU MEASURE DATA QUALITY?
In practice, there are a variety of criteria that can be used to evaluate the quality of data. The most common evaluation criteria include the following:
- Correctness: Does the data factually match reality?
- Consistency: Does the data from different systems match?
- Completeness: Does the data set contain all necessary attributes and values?
- Uniformity: Is the data in the appropriate and same format?
- Freedom from redundancy: Are there no duplicates within the data sets?
- Accuracy: Is the data sufficiently accurate?
- Timeliness: Does the data reflect the current state?
- Unambiguity: Can each record be interpreted unambiguously?
- Traceability: Is the origin of the data traceable?
- Relevance: Does the data meet the respective information needs?
- Availability: Is the data accessible to authorized users?
The criteria of correctness, completeness, uniformity, accuracy, and freedom from redundancy generally refer to the content and structure of the data and cover a variety of the sources of error most commonly associated with poor data quality. These mostly include data entry errors, such as typos, duplicate data entries, and missing or incorrect data values, among others.
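Some of these content-related criteria can be turned into simple metrics. The following is a minimal sketch in plain Python, not a definitive implementation: the sample records, the field names, and the choice of ISO dates as the "expected" format are all assumptions made for illustration. It computes one possible indicator each for completeness, uniformity, and freedom from redundancy.

```python
import re
from collections import Counter

# Hypothetical customer records; field names and values are made up.
records = [
    {"customer_id": 1, "city": "Münster", "purchase_date": "2021-03-14"},
    {"customer_id": 2, "city": "MUENSTER", "purchase_date": "14/03/2021"},
    {"customer_id": 2, "city": "MUENSTER", "purchase_date": "14/03/2021"},
    {"customer_id": 3, "city": None, "purchase_date": "2021-04-01"},
]

# Assumption: dates are expected in ISO format (YYYY-MM-DD).
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def completeness(rows, field):
    """Share of rows in which the field is filled."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniformity(rows, field, pattern):
    """Share of filled values that match the expected format."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.fullmatch(v)) / len(values)


def duplicate_share(rows):
    """Share of rows that are exact copies of an earlier row."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return sum(n - 1 for n in counts.values()) / len(rows)


print(completeness(records, "city"))                    # 0.75
print(uniformity(records, "purchase_date", ISO_DATE))   # 0.5
print(duplicate_share(records))                         # 0.25
```

Which thresholds count as "good enough" for these numbers depends on the use case, in line with the "fitness for use" definition above.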
The following graphic uses examples to provide an overview of the errors hidden behind the individual criteria, as well as possible causes and countermeasures.
Examples of data quality problems, possible causes, and countermeasures.
WHAT IS SUFFICIENTLY GOOD DATA QUALITY?
Of course, the more complete, consistent and error-free your data, the better. However, it is nearly impossible to ensure that all data meets the above criteria 100%. In fact, your data doesn't even have to be perfect; it has to meet the needs of the people who use it and the purpose for which it will be used.
How good does the quality of the data need to be for a Data Science project? Unfortunately, there is no universal answer to this question. As is often the case, there are a number of aspects that affect the required data quality. These include, among other things, the purpose for which the data is to be used, the use case, and the desired modeling procedure. The quality of the data also depends on the type of errors it contains and the extent to which these can be corrected during the data preparation phase of a data science project.
What data quality errors can be corrected?
- Errors that can be corrected with relatively little effort are, for example, duplicate data entries.
- Errors that can be corrected with increased effort are, for example, mixing or deviation of formats.
- Errors that cannot be corrected, on the other hand, are, for example, invalid data, missing entries, or errors caused by swapping input fields.
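The three levels of effort can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not a general cleaning routine: the field names, the lookup table `CANONICAL_CITIES`, and the set `PLACEHOLDER_DATES` are hypothetical. Spelling variants are mapped to one canonical form, exact duplicates are dropped, and rows with placeholder values are set aside because the true value cannot be recovered from the data alone.

```python
# Hypothetical lookup table: normalized key -> canonical spelling.
CANONICAL_CITIES = {"muenster": "Münster"}

# Hypothetical placeholder values that hide a missing purchase date.
PLACEHOLDER_DATES = {"99/99/99", "00/00/00"}


def normalize_city(value):
    """Map known spelling variants to one canonical form."""
    key = value.casefold().replace("ü", "ue")
    return CANONICAL_CITIES.get(key, value)


def clean(rows):
    """Return (cleaned rows, rows that cannot be corrected)."""
    cleaned, unfixable, seen = [], [], set()
    for row in rows:
        # Correctable with increased effort: deviating formats.
        row = dict(row, city=normalize_city(row["city"]))
        # Not correctable: the real date is simply unknown.
        if row["purchase_date"] in PLACEHOLDER_DATES:
            unfixable.append(row)
            continue
        # Correctable with little effort: exact duplicates.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned, unfixable


rows = [
    {"customer": "A", "city": "Münster", "purchase_date": "2021-03-14"},
    {"customer": "A", "city": "MUENSTER", "purchase_date": "2021-03-14"},
    {"customer": "B", "city": "Muenster", "purchase_date": "99/99/99"},
]
cleaned, unfixable = clean(rows)
print(len(cleaned), len(unfixable))  # 1 1
```

Note that the second row only becomes recognizable as a duplicate after the city name has been normalized, which is why the order of the cleaning steps matters.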
Problems in data quality can therefore be corrected after the fact to varying degrees. To prepare the data successfully, data scientists and the business departments need to work together so that it is clear which data is correct and which needs to be corrected. A so-called data dictionary can help ensure that everyone understands what the data contains.
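A data dictionary can be as simple as a shared table that documents each field. As a rough sketch, assuming hypothetical field names and a deliberately minimal structure (description, expected type, required or not), it could also be kept in machine-readable form and used to check records:

```python
# Hypothetical data dictionary; fields and descriptions are made up.
DATA_DICTIONARY = {
    "customer_id": {
        "description": "Unique customer number",
        "type": int,
        "required": True,
    },
    "city": {
        "description": "City of the billing address",
        "type": str,
        "required": True,
    },
    "newsletter": {
        "description": "Customer agreed to receive the newsletter",
        "type": bool,
        "required": False,
    },
}


def check_record(record, dictionary):
    """Return a list of human-readable problems for one record."""
    problems = []
    for field, spec in dictionary.items():
        value = record.get(field)
        if value is None:
            if spec["required"]:
                problems.append(f"{field}: missing required value")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(
                f"{field}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields nobody has documented are a finding in their own right.
    for field in sorted(record.keys() - dictionary.keys()):
        problems.append(f"{field}: not documented in the data dictionary")
    return problems


for problem in check_record(
    {"customer_id": "17", "city": None, "segment": "B2B"}, DATA_DICTIONARY
):
    print(problem)
```

For the sample record, the check reports a wrongly typed `customer_id`, a missing `city`, and an undocumented `segment` field. The descriptions themselves are what the business departments and data scientists would agree on together.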
So even though some errors can be fixed, the better approach is always to prevent them from occurring in the first place. Our checklist below is designed to help you give your data an initial quality check.
Our checklist for your data quality
Data is now considered the fourth factor of production alongside land, capital and labor. It should therefore be treated as a critical resource and managed accordingly, if that is not already the case. Ensuring high data quality requires a comprehensive data quality management system. After all, data quality is by no means purely an IT issue, but a management task. It is a small but important cog in an overall data strategy. Various measures are necessary, including initial, one-time measures as well as activities that must be carried out on an ongoing basis.
In conclusion, we would like to provide you with the following best practice measures:
- Make the quality of your data a priority.
- Automate the ingestion of your data.
- Maintain your master data and metadata.
- Prevent errors and don't just treat them.
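Preventing errors usually means validating values at the moment they are entered rather than during later data preparation. The sketch below illustrates the idea for the two faulty examples from the introduction. The function names, the date format `%d/%m/%y`, and the rule that a contact number must not consist only of zeros are assumptions chosen for illustration.

```python
import re
from datetime import date, datetime


def parse_purchase_date(text, fmt="%d/%m/%y"):
    """Return a date, or raise ValueError for impossible or future dates."""
    parsed = datetime.strptime(text, fmt).date()  # rejects e.g. 99/99/99
    if parsed > date.today():
        raise ValueError(f"purchase date {text!r} lies in the future")
    return parsed


def validate_contact_number(text):
    """Return the digits of the number, or raise ValueError for dummy input."""
    digits = re.sub(r"\D", "", text)
    if not digits or set(digits) == {"0"}:
        raise ValueError(f"contact number {text!r} looks like a placeholder")
    return digits


for raw in ("14/03/21", "99/99/99"):
    try:
        print(raw, "->", parse_purchase_date(raw))
    except ValueError as error:
        print(raw, "-> rejected:", error)
```

Rejecting such values at entry forces a decision while the correct information is still available, for example by asking the customer or employee again.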
After all, data quality problems not only impact the success of a data science project, but also have far-reaching consequences for the company as a whole. The good news for your data science project, however, is that you don't need the perfect data set. And some errors, though by no means all (!), can be fixed by data scientists during the data preparation process.