Reliable data is the basis of successful advanced analytics projects. Only correct data leads to valid data mining models, which is why data-driven companies must maintain and enhance this production factor.
Data-driven companies need reliable data
It is not only data scientists who depend on reliable, correct data for the development of value-generating data mining solutions. Users from all business areas, such as marketing, sales, production, and human resources, recognize the relevance of data for their business and processes. This is changing how data and analytics are handled.
Future analytics landscapes will be increasingly heterogeneous: on the one hand through the collection, processing, analysis, and visualization of differently structured data from various internal and external sources, and on the other hand through additional analytical solutions that run on or alongside the data warehouse, or directly in the operational systems.
With this development, the significance of reliable user data and consistent master data grows: the need for a unified, trustworthy data basis for analytical demands, whether to optimize existing processes or to enable innovation, not only persists but becomes ever more important. Institutionalized data quality management and master data management are therefore essential requirements for every data-driven company.
The principles for data quality management
The CXP Group regularly conducts market surveys and supports companies in their digitalization projects. The following principles for data quality management summarize the essential findings from this work together with the status quo of data quality management in companies. With the aid of these principles, companies can raise their awareness of the relevance of reliable data and launch initiatives to optimize their data quality:
- For various reasons, data always contains errors
Errors in data arise through human input errors (approximately five percent of all manual inputs contain errors), processing errors (calculation errors, transfer errors, data format errors), and intentionally incorrect inputs or fraud (e.g. employee bonus systems that pay premiums for new customers lead to more duplicates; the first sketch after this list shows a simple duplicate check). In addition, data depicts reality, and reality changes (e.g. the moment a customer moves, the address stored in the database becomes wrong). Technical validity can also suffer from the way data is modeled, stored, used, or displayed.
- Data quality is defined by the usage context
The required quality of data arises not from absolute, universally valid quality standards, but from the usage context (e.g. customer data used on the one hand to calculate a customer's credit standing and on the other for customer segmentation in target marketing). In the former case, stricter data quality rules must apply, as the customer can be directly (and negatively) affected. In the latter case, the impact of poor data quality is limited, as the consequences of a flawed customer segmentation model are less directly noticeable.
- Data must be visible in order to identify quality flaws
Data errors in the ERP system, for example, are not always discovered there, because not all data in the system is relevant for the process; errors often first show up in the display, e.g. in a report. Referential errors in particular only become apparent when all the data is processed or displayed together, as the second sketch after this list illustrates.
- Correction of errors should always take place as close as possible to the place where the data originated
Errors are better discovered during input than later, after processing, e.g. by taking precautions so that only valid values can be recorded: duplicate checks, company-rule checks, predefined selection values, etc. The third sketch after this list shows such input-time validation.
- Data quality management takes place primarily in the organizational, process, and technological dimensions
Data quality management takes place in the organization, e.g. via responsibilities, organizational units, and requirements management; in processes, e.g. via guidelines, user profiles, and use cases; and in technology, e.g. in the architecture and software, and through concepts and principles of use.
- Organization and processes are more important than technology
Organizational aspects have a bigger influence on data quality than technical aspects. Processes are at least as important as tools.
- Responsibility for data must lie, and be clearly assigned, within the business department
Data quality management is not an IT task. Tried and tested are the managerial role of the “data owner”, who among other things defines and controls the quality criteria, taking statutory requirements into account where applicable (retention periods, data protection), and the operational role of the “data steward”, who continuously monitors and corrects the data.
- Data quality management must include all data
Data quality management must include transactional business data, master data, machine data, and data generated by people.
- Data quality deteriorates on its own and must therefore be constantly monitored and improved
Data “ages” and must be maintained constantly. Data quality metrics help to monitor the level of quality; the final sketch after this list computes a few simple examples.
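To make the principles above more concrete, the following sketches illustrate them in Python. First, the duplicate problem from the error-sources principle: a minimal sketch that flags likely duplicate customer records by comparing a normalized key. The field names (`name`, `city`) and the normalization rules are illustrative assumptions, not a fixed standard.

```python
# Minimal sketch: detect likely duplicate customer records by comparing a
# normalized key built from name and city. Field names and normalization
# rules are illustrative assumptions.

def normalize(value: str) -> str:
    """Lowercase and strip whitespace/punctuation so trivial input
    variations ('Miller GmbH' vs ' miller gmbh.') map to the same key."""
    return "".join(ch for ch in value.lower().strip() if ch.isalnum())

def find_duplicates(records: list[dict]) -> list[tuple[dict, dict]]:
    """Group records by normalized (name, city) key and report collisions."""
    seen: dict[tuple[str, str], dict] = {}
    duplicates = []
    for record in records:
        key = (normalize(record["name"]), normalize(record["city"]))
        if key in seen:
            duplicates.append((seen[key], record))
        else:
            seen[key] = record
    return duplicates

customers = [
    {"id": 1, "name": "Miller GmbH", "city": "Berlin"},
    {"id": 2, "name": " miller gmbh.", "city": "berlin"},  # likely duplicate of 1
    {"id": 3, "name": "Schmidt AG", "city": "Hamburg"},
]
for a, b in find_duplicates(customers):
    print(f"Possible duplicate: record {a['id']} vs record {b['id']}")
```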
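Second, the visibility principle: a referential error, such as an order pointing to a customer that does not exist, is invisible as long as each record is viewed in isolation, and only surfaces when both data sets are processed together. The table and field names below are again assumptions for illustration.

```python
# Minimal sketch of a referential check: each order is valid on its own;
# the error only becomes visible when orders and customers are combined.

customers = [{"customer_id": 1, "name": "Miller GmbH"},
             {"customer_id": 2, "name": "Schmidt AG"}]
orders = [{"order_id": 100, "customer_id": 1},
          {"order_id": 101, "customer_id": 7},  # refers to a non-existent customer
          {"order_id": 102, "customer_id": 2}]

known_ids = {c["customer_id"] for c in customers}

# An orphaned order only shows up once both data sets are processed together.
orphans = [o for o in orders if o["customer_id"] not in known_ids]
for order in orphans:
    print(f"Order {order['order_id']} references unknown customer "
          f"{order['customer_id']}")
```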
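Third, correction at the point of origin: a sketch of input validation that rejects invalid entries immediately, combining predefined selection values, a format rule, a company rule, and a duplicate check. The concrete rules and field names are assumed for the example.

```python
# Minimal sketch of validation at the point of data entry: reject invalid
# input immediately instead of repairing it downstream. Allowed values,
# rules, and field names are illustrative assumptions.

import re

ALLOWED_COUNTRIES = {"DE", "FR", "IT"}     # predefined selection values
existing_emails = {"info@miller.example"}  # stand-in for a duplicate check

def validate_customer(form: dict) -> list[str]:
    """Return a list of error messages; an empty list means the input
    may be stored. Each rule mirrors a check named in the principle."""
    errors = []
    if not form.get("name", "").strip():
        errors.append("name must not be empty")                 # company rule
    if form.get("country") not in ALLOWED_COUNTRIES:
        errors.append("country must be one of the predefined values")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", form.get("email", "")):
        errors.append("email has an invalid format")            # format rule
    elif form["email"] in existing_emails:
        errors.append("email already exists (possible duplicate)")
    return errors

print(validate_customer({"name": "", "country": "XX",
                         "email": "info@miller.example"}))
# -> all three problems are reported at input time, not weeks later in a report
```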
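Finally, monitoring: a sketch that computes three common data quality metrics (completeness, validity, freshness) over a handful of records. Which fields are measured, which format rules apply, and where the thresholds lie are assumptions that every company must set for its own usage context.

```python
# Minimal sketch of data quality metrics for ongoing monitoring. The fields,
# the postcode rule, and the freshness threshold are illustrative assumptions.

from datetime import date, timedelta

records = [
    {"email": "a@example.com", "postcode": "10115", "last_verified": date(2024, 3, 1)},
    {"email": "",              "postcode": "1011",  "last_verified": date(2019, 6, 1)},
    {"email": "c@example.com", "postcode": "20095", "last_verified": date(2023, 11, 5)},
]

total = len(records)
as_of = date(2025, 1, 1)  # evaluation date, fixed here for reproducibility

# Completeness: share of records with a non-empty email.
completeness = sum(1 for r in records if r["email"]) / total

# Validity: share of records whose postcode matches the expected format
# (five digits, as in Germany -- an assumed rule).
validity = sum(1 for r in records
               if r["postcode"].isdigit() and len(r["postcode"]) == 5) / total

# Freshness: share of records verified within the last two years,
# reflecting the principle that data "ages".
cutoff = as_of - timedelta(days=730)
freshness = sum(1 for r in records if r["last_verified"] >= cutoff) / total

print(f"completeness={completeness:.0%} "
      f"validity={validity:.0%} freshness={freshness:.0%}")
```

Tracked over time, such figures show whether data quality is drifting downward and whether corrective measures are taking effect.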
Conclusion and recommendations
Not only data-driven companies learn, often painfully, that the production factor “data”, exactly like the other production factors “labor”, “land”, and “capital”, is not available in the right quantity and quality per se, but must be established, maintained, and developed. After determining the status of their data quality, companies can take measures, such as adjusting their organization, optimizing their processes, and improving their technical support, which will efficiently increase and secure their data quality.