Data Management and Quality
Many processes in data management affect the quality of data in a database. These processes fall into three categories. The first two are processes that add new data to the database and processes that manipulate existing data in the database, such as data cleansing.
The remaining category covers processes that reduce the accuracy of the data over time without actually changing the data. This loss of accuracy is typically the result of changes in the real world that aren’t captured by the database’s collection processes.
Arkady Maydanchik describes the issue of data cleansing in the first chapter of his book Data Quality Assessment.
Cleansing in Data Management
Data cleansing is becoming increasingly common as more organizations incorporate this process into their data management policies. Traditional data cleansing is a relatively safe process since it’s performed manually, meaning that a staff member must review the data before making any corrections.
However, modern data cleansing techniques generally involve automatically making corrections to the data based on a set of rules. This rules-driven approach makes corrections more quickly than a manual process, but it also increases the risk of introducing inaccurate data since an automated process affects more records.
Computer programs often implement these rules, and the programs themselves are an additional source of data inaccuracy since they may have bugs that affect data cleansing.
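As a minimal sketch of the rules-driven approach, the following Python snippet applies a single rule to a small list of records. The field names, the rule itself, and the apply_rule helper are assumptions made for illustration and do not come from Maydanchik’s book. The dry-run mode reports which records a rule would touch before anything changes, which is one possible way to limit the reach of a faulty rule.

```python
from datetime import date

# Hypothetical rule: a hire date in the future is treated as invalid.
def future_hire_date(record, today):
    return record["hire_date"] is not None and record["hire_date"] > today

def apply_rule(records, today, dry_run=True):
    """Return the ids the rule matches; modify records only when dry_run is False."""
    matched = []
    for record in records:
        if future_hire_date(record, today):
            matched.append(record["id"])
            if not dry_run:
                # Clear the value for manual review instead of guessing a date.
                record["hire_date"] = None
    return matched

# Illustrative records, invented for this example.
records = [
    {"id": 1, "hire_date": date(2010, 5, 1)},
    {"id": 2, "hire_date": date(2999, 1, 1)},
    {"id": 3, "hire_date": None},
]

# Dry run: see what the rule would change without touching the data.
matched = apply_rule(records, today=date(2024, 1, 1), dry_run=True)
```

Reviewing the matched ids before rerunning with dry_run=False keeps a person in the loop, much as traditional manual cleansing does.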
Problems with Data Cleansing
Part of the risk of automatic data cleansing is due to the complexity of the rules in a typical data management environment, and these rules frequently fail to reflect the organization’s actual data requirements. The data may still be incorrect after executing the data-cleansing process, even when it complies with the theoretical data model. Because many data quality problems are complex and interrelated, correcting one data element may create additional problems in related data elements.
For example, employee data includes attributes that are closely related such as employment history, pay history and position history. Correcting one of these attributes is likely to make it inconsistent with the other employment data attributes.
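One way to catch this kind of side effect is to re-check related attributes after each correction. The sketch below assumes a simplified employee record in which position history and pay history are lists of entries with a start date. The record layout and the find_inconsistencies helper are hypothetical, and the check covers only one kind of inconsistency: history that starts before the hire date.

```python
from datetime import date

def find_inconsistencies(employee):
    """Return a list of readable problems; an empty list means no problem was found."""
    problems = []
    hire = employee["hire_date"]
    for label in ("position_history", "pay_history"):
        entries = employee[label]
        if entries and hire is not None:
            earliest = min(entry["start"] for entry in entries)
            if earliest < hire:
                problems.append(f"{label} starts {earliest} before hire date {hire}")
    return problems

# Illustrative record, invented for this example.
employee = {
    "hire_date": date(2005, 3, 1),  # value after a hypothetical correction
    "position_history": [{"start": date(2001, 6, 15), "title": "Analyst"}],
    "pay_history": [{"start": date(2005, 3, 1), "rate": 50000}],
}

problems = find_inconsistencies(employee)
```

Here the corrected hire date conflicts with the position history, so the check reports one problem for someone to review.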
Another factor that contributes to the problems with modern data cleansing is the complacency that data management personnel often exhibit after implementing this process. The combination of these factors often means that data cleansing creates more problems than it solves.
Case Study
The following case study from Maydanchik’s book illustrates the risk of data cleansing. It involved a large corporation with over 15,000 employees and a history of acquiring other businesses. This client needed to cleanse the employment history in its human resources system, primarily due to the large number of incorrect or missing hire dates for its employees.
These inaccuracies were a significant problem because the hire date was used to calculate retirement benefits for the client’s employees. Several sources of legacy data were available, allowing for the creation of several algorithms to cleanse the employment data.
However, many of these employees were hired by an acquired business rather than directly hired by the client corporation. The calculation of the retirement benefits was supposed to be based on the date that the client acquired the employee instead of the employee’s original hire date, but the original data specifications didn’t reflect this business requirement.
This discrepancy caused the data-cleansing process to apply many changes incorrectly. Fortunately, this process also produced a complete audit trail of the changes, which allowed the data analyst to correct these inconsistencies without too much difficulty.
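The value of an audit trail can be sketched in a few lines of Python. This is not the client’s actual system: the log structure, the rule name, the dates, and the set_hire_date and revert helpers are all assumptions made for illustration. The idea is that every automated change records the old value, so the changes made by a rule that turns out to be wrong can be undone.

```python
from datetime import date

audit_log = []  # each entry holds enough information to undo one change

def set_hire_date(employee, new_value, rule_name):
    """Change the hire date and log the old value alongside the rule responsible."""
    audit_log.append({
        "employee_id": employee["id"],
        "field": "hire_date",
        "old": employee["hire_date"],
        "new": new_value,
        "rule": rule_name,
    })
    employee["hire_date"] = new_value

def revert(employees_by_id, rule_name):
    """Undo, newest first, every logged change made by the given rule."""
    reverted = 0
    for entry in reversed(audit_log):
        if entry["rule"] == rule_name:
            employees_by_id[entry["employee_id"]][entry["field"]] = entry["old"]
            reverted += 1
    return reverted

# Illustrative data: the stored date is the acquisition date, and a hypothetical
# rule overwrites it with an original hire date taken from legacy data.
employees = {7: {"id": 7, "hire_date": date(1998, 4, 1)}}
set_hire_date(employees[7], date(1992, 9, 14), "legacy_original_hire")

# The rule did not match the business requirement, so its changes are undone.
reverted = revert(employees, "legacy_original_hire")
```

Recording the rule name with each change makes it possible to undo one rule’s output without disturbing corrections made by other rules.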
This data-cleansing project was completed satisfactorily in a relatively short period of time, but many such projects create errors that remain in the database for years.