Data De-Duplication - A case study



A leading health, wellness, and nutrition brand based in Australia was generating leads from multiple sources, both paid and organic. Its database held over 200,000 leads, many of them potential duplicates.

The Problem

Because of these duplicate leads, significant resources were being wasted on pursuing the same individuals through different channels.

The consequences included:

  • Increased marketing costs due to marketing expenditure on the same individual.
  • Sending the same catalog multiple times to the same individual.
  • Negative feedback from customers and degradation of the brand’s reputation as a consequence of contacting individuals repeatedly.
  • Loss of productivity as a lot of time was spent fixing the data manually.
  • Delay in personalization implementation due to lack of confidence in the data.

The Marketing Head wanted to improve marketing operations and, based on the data, roll out personalized product options for different consumer segments.

Leads Acquisition

Companies often buy leads from online platforms. When users show interest in certain products or services on these platforms, their information, such as name and contact details, is sold to other businesses as leads.

In addition to the leads captured through its own website, the client was actively acquiring leads from other platforms to maximize its reach in the market.
Because leads came from multiple sources, important fields such as phone numbers and emails arrived in inconsistent formats, so traditional hash-based exact matching failed to identify the duplicates. In many cases, customers also entered different contact information on different platforms, which added to the inconsistency.
Furthermore, a significant number of in-house duplicates were introduced because the client’s website responded slowly and gave no immediate feedback on data submissions, leading users to submit the same form multiple times.


The primary aim of the analysis was to separate unique leads from leads with a high probability of being duplicates, and to suggest methods for preventing duplicates in the future.


On the data front, each column was cleaned up in close collaboration with the client. For example, phone numbers were cleaned with regular expressions and special characters were removed from addresses, so that the values could be compared reliably to find duplicates.
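The cleanup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the default country code of 61 (Australia), and the specific rules are assumptions chosen for the example. It also shows why hashing raw values misses duplicates that hashing normalized values catches.

```python
import hashlib
import re


def normalize_phone(raw, default_country_code="61"):
    """Reduce a phone number to digits with a country code prefix.

    Assumption: numbers with a leading 0 are national numbers that
    belong to `default_country_code` (61 is used here for Australia).
    """
    digits = re.sub(r"\D", "", raw)      # drop spaces, dashes, brackets, '+'
    if digits.startswith("00"):          # international call prefix
        digits = digits[2:]
    elif digits.startswith("0"):         # national trunk prefix
        digits = default_country_code + digits[1:]
    return digits


def normalize_email(raw):
    """Trim whitespace and lowercase the address."""
    return raw.strip().lower()


def normalize_address(raw):
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return " ".join(cleaned.split())


def lead_key(phone, email):
    """Exact-match key: identical inputs give identical hashes."""
    return hashlib.sha256(f"{phone}|{email}".encode("utf-8")).hexdigest()


# Two submissions by the same (hypothetical) person from different sources.
raw_a = ("+61 412 345 678", "Jane@Example.com")
raw_b = ("0412-345-678", " jane@example.com ")

# Hashing the raw values treats them as different leads.
raw_match = lead_key(*raw_a) == lead_key(*raw_b)

# Hashing the normalized values identifies them as the same lead.
clean_a = lead_key(normalize_phone(raw_a[0]), normalize_email(raw_a[1]))
clean_b = lead_key(normalize_phone(raw_b[0]), normalize_email(raw_b[1]))
clean_match = clean_a == clean_b
```

Normalization only helps when the same contact details were entered in different formats. Leads where the customer supplied genuinely different details still need a similarity-based comparison.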

A typical implementation of deduplication compares every record with every other record, which requires on the order of N×N comparisons. The number of comparisons can be reduced considerably by blocking on address parameters: in this case, only records in the same city or neighborhood were compared, which cut the runtime significantly.
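A minimal sketch of blocking, under stated assumptions: the field names (`id`, `name`, `phone`, `city`), the sample leads, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative choices, not details from the project.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(leads, block_key="city"):
    """Yield pairs of leads that share the same block (here: the same city)."""
    blocks = defaultdict(list)
    for lead in leads:
        blocks[lead[block_key].strip().lower()].append(lead)
    for members in blocks.values():
        # Pairwise comparison happens only inside a block.
        yield from combinations(members, 2)


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def probable_duplicates(leads, threshold=0.85):
    """Flag pairs with the same phone number or very similar names."""
    flagged = []
    for a, b in candidate_pairs(leads):
        same_phone = a["phone"] == b["phone"]
        name_score = similarity(a["name"].lower(), b["name"].lower())
        if same_phone or name_score >= threshold:
            flagged.append((a["id"], b["id"]))
    return flagged


# Hypothetical, already-normalized sample data.
leads = [
    {"id": 1, "name": "Jane Citizen", "phone": "61412345678", "city": "Sydney"},
    {"id": 2, "name": "Jane Citzen", "phone": "61412345678", "city": "sydney "},
    {"id": 3, "name": "Jane Citizen", "phone": "61412345678", "city": "Perth"},
    {"id": 4, "name": "Ravi Kumar", "phone": "61498765432", "city": "Sydney"},
]

pairs_compared = len(list(candidate_pairs(leads)))  # 3, versus 6 without blocking
flagged = probable_duplicates(leads)                # [(1, 2)]
```

The sample also shows the trade-off: lead 3 has the same name and phone number as lead 1 but sits in a different block, so it is never compared. Blocking saves comparisons at the cost of missing duplicates whose blocking field differs, which is why the choice of blocking key matters.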

Based on our suggestions, a number of UI changes were also introduced to prevent duplicate records as far as possible. These included separating the country code from the phone number and replacing the free-text box with a numeric input for the phone number.


After implementing the methods described above, we observed a drastic improvement in the quality of the data. We were able to reduce the share of duplicates from about 15% to 1%. One of the lead sources had about 80% duplicates, so we suggested removing that source.
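Once leads have been flagged, a per-source duplicate rate like the one mentioned above can be computed along these lines. This is a sketch only: the `source` field, the source names, and the sample records are hypothetical and do not reflect the client's data.

```python
from collections import Counter


def duplicate_rate_by_source(leads, duplicate_ids):
    """Return the fraction of each source's leads that were flagged as duplicates."""
    totals = Counter(lead["source"] for lead in leads)
    dupes = Counter(
        lead["source"] for lead in leads if lead["id"] in duplicate_ids
    )
    return {source: dupes[source] / totals[source] for source in totals}


# Hypothetical sample: five leads from one source, four from another.
sample_leads = (
    [{"id": i, "source": "platform_a"} for i in range(1, 6)]
    + [{"id": i, "source": "website"} for i in range(6, 10)]
)
flagged_ids = {1, 2, 3, 4, 6}  # ids flagged as duplicates by an earlier step

rates = duplicate_rate_by_source(sample_leads, flagged_ids)
```

A source whose rate is far above the others is a candidate for review or removal.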

Owing to this improvement in the data, the company was able to reduce its marketing expenditure by 30% by eliminating repeated calls to the same potential customers. It also improved the perception of the company significantly. As confidence in the data increased, a better-personalized product range was implemented.

Written by:

Harsh Dutta

Data Scientist


Kshitij Thakur

Data Scientist

