TACKLING DUPLICATE DATA
As the amount of data collected in the world increases, so are the problems related to it!
One of the biggest problem being Data Duplication.
Here is an image to explain the concept of data duplication.
Three blocks are allocated space for the same information. A person’s name is stored three times, making the data redundant. Imagine the amount of space being wasted.
Once duplicate data enters the data management system, it creates serious issues.
Storing values multiple times wastes space.
When the value of a field changes, multiple observations need to be updated.
If we forget to change values in one of the multiple records, the database would have inconsistent data.
This kind of issues can be dealt with in multiple ways. Introducing a primary key or a foreign key is one such example.
A Primary Key is a field that identifies each record in a table uniquely It can also be a combination of fields. For example, the customer id in a retail store database.
Sometimes a field in one table is linked uniquely with a field in another table. This is the concept of
Foreign Key. In the above example, say we have a Sales table which has 'Customer id' and the
respective sales data. The Customer id here can have a foreign key reference to the Customer table in
order to avoid data redundancy.
In this blog, I would like to discuss another kind of data duplication challenge that these keys cannot address.
Sometimes the name of a customer is saved with different spellings at different instances. Few instances may have upper case, few lower case; few may have abbreviations, and few may just be misspelled.
More often than once, each and every enterprise faces this problem, yet there is no regularised
mechanism to solve this issue! Record duplication in suppliers, customers, and products tables could
lead to errors, increased costs & missed opportunities.
In such cases, even if there is a way out, the matching issues are different for each of the cases, so a single technique/ algorithm cannot hold good for all.
Here we will discuss Fuzzy String matching between two observations. To understand this, let me first introduce the term Levenshtein distance.
It is the distance (or difference) between two words, which is calculated as the minimum number of single-character edits necessary to change one word to the other.
Edit distance is the metric which calculates the number of operations (insertions, deletions or
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost
The distance is normalized, by dividing the distance with the sum of length of the two words. Hence, the normalized distance is always between 0 and 1.0.
Understanding the Levenshtein distance with examples -
• It is at least the difference between the sizes of the two strings.
Eg - The Levenshtein distance between MISS and MISSING is 3
• It is not more than the length of the longer string.
Eg - The Levenshtein distance between Z and AFTERNOON is 9
• It is zero only if the strings are equal.
Eg - The Levenshtein distance between BURN and BURN is 0.
Real-time applications of the edit distance – Spell check, Spam filtering, Google search recommender systems, etc.
Fuzzy string matching is the technique used to find strings that match a pattern. The term Fuzzy means vague & general.
There are various methods to calculate the distance between strings - Sellers, Damerau-Levenshtein, Hamming, and more. However, in this example, we will go ahead with the Levenshtein distance.
Below is an example of a company’s database of vendors. Using Fuzzy String matching, we will understand the percentage of similarity between the observations.
Python Libraries: fuzzywuzzy, Levenstein, re, numpy, string, pandas
Requirements: Jupyter Notebook, or equivalent interface
1. First, we read the data frame.
2. All entries of the document are converted to Lower case and all abbreviations are expanded, duplicates are also removed.
3. The Data is cleaned and common suffixes such as private, limited, etc are removed.
4. A map of truncated and original column entries are made for future restoration of the data. i.e valuemap.
5. Clean data is compared using the fuzz.ratio function from package fuzzywuzzy, with a cutoff percentage of 79%.
6. Now their original values are extracted using valuemap, and another round of comparison is done using fuzzywuzzy.
7. The fuzzy matches along with their matching percentage are presented to the client for further discussions regarding the duplicate data issue.
The computation of the Levenshtein distance is roughly proportional to the product of the two string lengths. Hence it becomes computationally complex to calculate the distance between longer strings. Therefore we truncate the records to its bare minimum necessity and then perform the fuzzy string match yielding more effective results.
This is an example of using Fuzzy String Matching algorithm using Levenshtein distance to solve problems of data duplication with non-matching records. Using the fuzzy logic, data deduplication can be carried out in critical master data. This has numerous applications across industries such as BFSI, Pharma, CPG, Manufacturing and Retail. Most companies are looking to implement BI tools such as tableau, prescriptive and predictive ML & AI applications – reliable data achieved through such solutions as discussed in this article will enable the success of these efficiently
Data Scientist- NeenOpal Analytics @Pratiksha Sunder
Machine Learning based Forecasting in Supply Chain
March 28, 2018
Predictive Customer Lifetime Value
March 28, 2018