Ways of Dealing with String Duplicates


What is Data Duplication?

Often identical data entries are stored in the same data storage system or in different systems due to human errors and limitations of the data entry systems, resulting in duplicate data being produced. For instance, such an error may happen if a customer’s information is re-entered with a slight modification or typographical error and the data entry system fails to detect or flag the discrepancy. One other cause of data duplication can be when we have multiple sources feeding the same information to the database. Data duplication is one of the most prevalent problems with data quality. Businesses may incur time, financial, and resource costs.

Some issues that occur due to duplicate data are:

  • It leads to inaccurate analysis and reporting. For instance, the count of customers or revenue by customers can be misleading when a lot of duplicates are present in the customer data.
  • Loss of productivity fixing records.
  • Duplicate data hinders personalization.
  • Wasted marketing budget.
  • Inaccurate information on sales calls.

Detecting Duplicates:

The duplication in the data can be identified in a number of ways. While the simplest method involves the comparison of all records with each other, it happens to be highly computationally intensive. So, using one of the built-in indexing strategies, such as blocking, we create a smart set of candidate links. The sets include a collection of records grouped based on specified column(s). In the next phase, the candidate links are compared based on the defined data attributes. Depending on the type of comparison required in the data, we have different similarity indices or distance metrics.

Some of the distance metrics we use for comparison are:

Levenshtein distance:It is basically a string metric that is used to measure the distance between two sequences. The smallest number of single-character edits (insertions, deletions, or substitutions) required to change from one word into the other is the Levenshtein distance between two words. It is one of the most widely used distance metrics.

Jaro-Winkler distance:The Jaro–Winkler distance is another similarity metric that is used to compare two strings. The value ranges from 0-1 where 0 means that the two strings have no similarity while 1 signifies the two strings are completely equal. It gives more importance to the initial characters of the string.

Handling duplicates:

After detecting the duplicates, the next step is to actually deal with the detected records as the end goal of de-duplication is getting rid of the extra information which can skew the data or generate incorrect numbers due to the extra records.

The simplest way to deal with the detected duplicate records involves dropping the extra records.
This method can usually be rather automated with little developmental effort and does not need domain knowledge.
Although this approach is quite easy to understand and implement, it could result in information loss or incorrectly detected duplicates being dropped.

Dropping and selecting – In this procedure, the desired rows are chosen based on specific criteria, and the remaining rows are dropped.

Merging the rows - Rows are combined, the majority of the data is kept, and the rows are enriched, making this method the most ideal. Due to the numerous possibilities when dealing with duplicate records, the programming effort is quite significant.

The aforementioned methods are better than just simple dropping of duplicates, but require extensive domain knowledge and data-specific knowledge to determine the actual useful data as differentiation of the correct and incorrect records can only be achieved by working closely with the stakeholders who manage and use the data.

Avoiding duplicates in the future

As hashing and matching won't directly work because near duplicates will still be identified as different records by SQL, there isn't a specific foolproof solution to avoid near duplicates.

The best way to handle it is to have tight business rules and procedures in place and check for near duplicates while entering new records - Incorporation of near duplicate algorithm within the data entry system. This can be achieved by comparing the near duplicates within small blocks(any differentiating parameters) and would be accurate within that particular block, but this comes with a large processing cost that must be taken into account and established if this approach is realistic for the organisation.

Written by:

Akash Gupta

Data Scientist


Written by:

Piyush Mathur

Data Scientist


Related Post

Leave a Reply