The path to the data product - How you profit sustainably from Data Science
The data science project is defined. The proof of concept has been successfully implemented. And now? Unfortunately, this is often the end of the...
Faulty data causes problems and costs. We explain what data quality is and how good your data needs to be.
Münster, Muenster, MÜNSTER or MUENSTER, 0000-0000-00 as a customer contact number, 99/99/99 as a purchase date: the list of examples of faulty data is long, and the problems and costs of poor data quality are real. They range from not reaching a customer, to addressing them incorrectly in a newsletter, to incorrect invoicing, to name just a few. Decisions based on bad data cannot be good decisions. According to a survey by Experian Marketing Services, 73% of German companies believe that inaccurate data prevents them from delivering an outstanding customer experience. Good data quality is thus crucial for a company's day-to-day operations and, above all, a key success factor for Data Science projects. But what does data quality actually mean, how good does the data have to be for a Data Science project, and how can you check the quality of your data? We address these questions in this article.
Definition: Data quality describes how well the data is suited for intended applications. In this context, we therefore also speak of "fitness for use", i.e. the suitability of the data for the intended purpose. The quality of data is thus very context-dependent. While the quality of data may be sufficient for one particular use case, it may still be insufficient for another.
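To make "fitness for use" more tangible, here is a minimal sketch in Python. It measures how completely a field is filled in and compares the result with the minimum that a use case requires. The records, the field names, the two use cases, and the 70% thresholds are all hypothetical and only serve as an illustration; real requirements have to come from the respective use case.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "A. Schmidt", "email": "a.schmidt@example.com", "phone": None},
    {"name": "B. Meyer", "email": None, "phone": None},
    {"name": "C. Weber", "email": "c.weber@example.com", "phone": "0251 123456"},
    {"name": "D. Koch", "email": "d.koch@example.com", "phone": None},
]


def completeness(rows, field):
    """Share of rows in which `field` is filled in (0.0 to 1.0)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


# Assumed requirements per use case: field -> minimum completeness.
USE_CASES = {
    "newsletter": {"email": 0.7},
    "phone_campaign": {"phone": 0.7},
}


def fit_for_use(rows, requirements):
    """True if every required field reaches its minimum completeness."""
    return all(
        completeness(rows, field) >= minimum
        for field, minimum in requirements.items()
    )


for use_case, requirements in USE_CASES.items():
    print(use_case, fit_for_use(records, requirements))
```

With these sample records, the same data set passes the check for the newsletter but fails it for the phone campaign, because most phone numbers are missing. The data did not change, only the purpose did.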
And why is it so important? In a Data Science project, everything is based on data as a resource. In the project, data from a wide variety of sources is brought together and then analyzed. Your data thus serves as input for any analysis model. So, true to the adage "garbage in, garbage out", even a sophisticated algorithm is of no use if the quality of the data is poor. Even though a data science project can fail for many reasons, the success of the project often depends on the quality of the available data.
Investments in measures that ensure the quality of the data are therefore crucial for the success of a project, but also more than worthwhile beyond that. After all, a lack of data quality can result in considerable costs for a company.
The consequences of poor data quality, however, reach far beyond financial losses. They range from reduced employee confidence in decisions and lower customer satisfaction to productivity losses (e.g., due to the additional time required for data preparation) and compliance problems.
The sources of poor data quality can be very diverse, as the following graphic illustrates. The most important source, however, is the data entry process, whether the data is entered by employees or by customers.
The Sources of Poor Data Quality (Source: The Data Warehousing Institute, 2002, Data Quality and the Bottom Line)
In practice, there are a variety of criteria that can be used to evaluate the quality of data. The most common evaluation criteria include the following:
The criteria of correctness, completeness, uniformity, accuracy, and freedom from redundancy generally refer to the content and structure of the data and cover a variety of the sources of error most commonly associated with poor data quality. These mostly include data entry errors, such as typos, duplicate data entries, and missing or incorrect data values, among others.
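Some of these errors can be detected automatically. The following Python sketch counts missing values, placeholder values, and duplicate entries, and it maps spelling variants of a city name to a single form. The two placeholder patterns and the Münster spellings come from the examples at the beginning of this article; the sample rows, field names, and function names are assumptions made for illustration. It is a starting point under these assumptions, not a complete data quality check.

```python
import re
import unicodedata

# Placeholder patterns taken from the examples in this article; extend as needed.
PLACEHOLDER_PATTERNS = [
    re.compile(r"^0{4}-0{4}-0{2}$"),  # 0000-0000-00 as a contact number
    re.compile(r"^99/99/99$"),        # 99/99/99 as a purchase date
]


def normalize_city(value):
    """Map variants such as 'Muenster' or 'MÜNSTER' to one comparable form."""
    text = unicodedata.normalize("NFC", value).strip().casefold()
    for umlaut, replacement in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(umlaut, replacement)
    return text


def is_placeholder(value):
    """True if the value matches one of the known placeholder patterns."""
    return any(pattern.match(value) for pattern in PLACEHOLDER_PATTERNS)


def duplicate_key(row):
    """Key for duplicate detection: normalized name plus normalized city."""
    name = (row.get("name") or "").strip().casefold()
    city = normalize_city(row.get("city") or "")
    return (name, city)


def quality_report(rows):
    """Count missing values, placeholder values, and duplicate rows."""
    report = {"missing": 0, "placeholder": 0, "duplicate": 0}
    seen = set()
    for row in rows:
        for value in row.values():
            if value is None or value.strip() == "":
                report["missing"] += 1
            elif is_placeholder(value.strip()):
                report["placeholder"] += 1
        key = duplicate_key(row)
        if key in seen:
            report["duplicate"] += 1
        seen.add(key)
    return report


# Hypothetical sample rows.
rows = [
    {"name": "Anna Schmidt", "city": "Münster", "phone": "0251 123456", "purchase_date": "12/03/21"},
    {"name": "anna schmidt", "city": "MUENSTER", "phone": "0000-0000-00", "purchase_date": "99/99/99"},
    {"name": "Ben Meyer", "city": "Muenster", "phone": None, "purchase_date": "01/02/22"},
]

print(quality_report(rows))
```

Note that such checks only find errors that follow a known pattern. Whether a plausible-looking value is actually correct can usually only be decided with knowledge from the specialist department.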
The following graphic uses examples to provide an overview of the errors hidden behind the individual criteria, as well as possible causes and countermeasures.
Examples of data quality problems, possible causes, and countermeasures.
Of course, the more complete, consistent and error-free your data, the better. However, it is nearly impossible to ensure that all data meets the above criteria 100%. In fact, your data does not even have to be perfect; it only has to meet the needs of the people who use it and the purpose for which it will be used.
How good does the quality of the data need to be for a Data Science project? Unfortunately, there is no universal answer to this question. As is often the case, there are a number of aspects that affect the required data quality. These include, among other things, the purpose for which the data is to be used, the use case, and the desired modeling procedure. The quality of the data also depends on the type of errors it contains and the extent to which these can be corrected during the data preparation phase of a data science project.
Problems in data quality can therefore be corrected after the fact, but only to varying degrees. To prepare the data successfully, data scientists and the specialist departments need to work together so that it is clear which data is correct and which needs to be corrected. A so-called data dictionary, which documents what each field contains, helps everyone understand what is in the data.
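A data dictionary does not have to be a document only. As a minimal sketch, it can also be kept in a machine-readable form, so that records can be checked against it. The field names, descriptions, and validation rules below are hypothetical; in practice they would be agreed between the data scientists and the specialist departments.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FieldSpec:
    description: str                 # what the field means, in plain language
    required: bool                   # whether a value must be present
    is_valid: Callable[[str], bool]  # rule a present value has to satisfy


# Hypothetical data dictionary for a small customer table.
DATA_DICTIONARY = {
    "customer_id": FieldSpec("Internal customer number, digits only", True, str.isdigit),
    "city": FieldSpec("City of the billing address", True, lambda v: v.strip() != ""),
    "newsletter": FieldSpec("Newsletter consent, 'yes' or 'no'", False, lambda v: v in {"yes", "no"}),
}


def validate(record):
    """Return a list of human-readable problems for one record."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        value = record.get(name)
        if value is None or value == "":
            if spec.required:
                problems.append(f"{name}: missing required value")
            continue
        if not spec.is_valid(value):
            problems.append(f"{name}: invalid value {value!r}")
    # Fields that nobody has documented are reported as well.
    for name in record:
        if name not in DATA_DICTIONARY:
            problems.append(f"{name}: not documented in the data dictionary")
    return problems


for problem in validate({"customer_id": "10A1", "city": "", "fax": "123"}):
    print(problem)
```

Keeping the description next to the rule means that the explanation for people and the check for machines are maintained in one place.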
So even though some errors can be fixed, the better approach is always to prevent them from occurring in the first place. Our checklist below is designed to help you give your data an initial quality check.
Data is now considered the fourth factor of production alongside land, capital and labor. It should therefore be regarded as a critical resource and managed accordingly. Ensuring high data quality requires comprehensive data quality management. After all, data quality is by no means purely an IT issue, but a management task. Data quality is a small but important cog in an overall data strategy. Various measures are necessary, including initial, one-time measures as well as activities that must be carried out on an ongoing basis.
In conclusion, we would like to provide you with the following best practice measures:
After all, data quality problems not only affect the success of a data science project, but also have far-reaching consequences for the company as a whole. The good news for your data science project, however, is that you don't need a perfect data set. And some errors, though by no means all (!), can be fixed by data scientists during data preparation.