Improving Data Quality: The 3 Core Functions of a Data Cleanse

In previous posts, we established that data needs to be clean in order for organizations to make sound decisions, and we introduced a 5-part framework for sustaining data quality.

Once an organization has determined the critical pieces of data, and established a governance process around it, now it is time to improve the quality of data.

Data cleansing is an expensive and time-consuming process.

TWEETTweet: “Data cleansing is an expensive & time-consuming process” #dataquality

Before attempting any clean up, analyze and quantify the quality issues that you are trying to address. For example, are there any fields that are null but should have a value? Are data types enforced (only numbers in an “integer” type field, etc.)? Are values within an expected range? Is there conflicting information?

Then, make sure you have resources dedicated to these 3 core functions of a comprehensive data quality improvement project:

1. Data Transformation

Multiple source systems may have various standards for data entry, resulting in the same pieces of data captured in different ways. Restricting input values to a set list, like in a dropdown menu, is one way to ensure data uniformity. In cases where free text is entered, there may need to be an operation that finds all permutations of a specific term and maps them to a single value.

Consider credit card expiration date, for example. When shopping online, a consumer may encounter any of the following input formats:

  • Separate fields for month (spelled out or abbreviated) and year
  • Separate fields for month (2 digits) and year (2 or 4 digits)
  • One field for both (2-digit month, slash, 2 or 4-digit year)

In this example, there are at least 6 possible ways to capture a single piece of data: December, 2020; Dec, 2020; 12, 20; 12, 2020; 12/20; 12/2020.

When aggregating records from multiple systems that capture data differently, the Data Transformation component will convert the data into one specific format, and store it consistently.

2. Data Normalization and Deduplication

Data normalization reduces data redundancy and simplifies database design.

As we’ve discussed, most organizations use various systems to capture different pieces of dat. When there is overlap—when the same information is captured in different source systems—you will need an approach to normalize the aggregated record.

For example, if you are merging client information from Salesforce and financial data from NetSuite, normalizing the redundancies would mean storing only demographic information from Salesforce, and only financial information from NetSuite, once the correct match is made.

In another example, when merging 2 medical records from separate systems, it’s only necessary to keep 1 set of demographics. However, the data from both health records should be merged, as long as they represent separate encounters.

3. Manual Cleanup

In certain situations, it makes sense to manually edit the data. Some edits may be semi-automated, like “find and replace” specific values.

Although a tedious task, this exercise provides insight into most common data entry issues, and can help shape the data governance policy.

Becoming a data-driven organization is logical, and most businesses are actively striving toward this goal. However, only a data quality management approach that includes sustainable maintenance and a comprehensive clean-up effort will allow businesses to make informed decisions and maximize value from strategic investments.

The solution does not necessarily depend only on purchasing a fancy analytics system, or any other expensive software or hardware; it primarily depends on understanding and managing data on continuous basis.

TWEETTweet: Improving #DataQuality: The 3 Core Functions of a Data Cleanse #Abraic

Related Posts


Please share with your network:

Leave a Comment