Methods for identifying duplicates

Duplication detection methods

Below is a short series of resources on duplication detection in Big Data. Not many resources are presented here, as this is just a few hours' worth of research. It is intended to help others get started. I am sure there are many more resources and ideas that I am missing.

http://www.mmds.org/

This resource (the Mining of Massive Datasets book site) is more theoretical. Chapter 3 covers duplication detection in Big Data and is available in both slide and document form.

The approach suggested is hashing followed by Jaccard Similarity. Hashing first groups (blocks) items that are likely to be similar; Jaccard Similarity is then computed only between items that fall in the same hashed block. This avoids having to compare every item with every other item.
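
The hash-then-compare idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not code from the book: the MinHash banding scheme, the function names, the shingle length and the numbers of bands and rows are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-character shingles of a normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """One minimum per seeded hash function (a MinHash signature)."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def candidate_pairs(docs, bands=10, rows=2):
    """Hash each band of the signature to a bucket; pairs sharing a bucket are candidates."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash_signature(shingles(text), bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def find_duplicates(docs, threshold=0.5):
    """Compute Jaccard similarity only for candidate pairs, not for all pairs."""
    result = []
    for a, b in sorted(candidate_pairs(docs)):
        similarity = jaccard(shingles(docs[a]), shingles(docs[b]))
        if similarity >= threshold:
            result.append((a, b, similarity))
    return result
```

For example, `find_duplicates({"a": "John Smith, London", "b": "john smith,  london", "c": "Completely different record"})` compares only the pairs that share at least one bucket. Because the hashing step is probabilistic, near-duplicates can occasionally be missed; more bands with fewer rows make that less likely at the cost of more comparisons.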

 

Python library

https://pypi.python.org/pypi/NearDuplicatesDetection/0.2.0

I have found a Python module (library) that covers duplication detection, although I have not tested it. The module takes a similar approach to the one outlined above: hashing followed by Jaccard Similarity.

 

More advanced tools:

Below is a brief exploration of the use of distributed technologies (Spark) in record matching. I don’t think we need Spark for the Sprint. However, it is worth researching, as it would be needed for large-scale problems, so we may need such tools later in the project.

 

Using Spark to merge Brazilian Health Databases 

http://ceur-ws.org/Vol-1330/paper-04.pdf

This is an example from Brazil concerning the joining of several large health databases. The databases do not share unique identifiers, so the records have to be linked by similarity matching, at scale.

 

Hashing, Blocking and Similarity matching using Spark

They use Spark (Scala). First they standardise the features (data wrangling). The features are then hashed (bigrams mapped to bit vectors, e.g. [10110000110101]), blocked (by region) and similarity matched using the Sorensen index.
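
To make the hashing and blocking steps concrete, here is a minimal sketch in plain Python rather than Spark. It is not the code from the paper: the vector length, the use of a single SHA-1 hash per bigram, and the names `standardise`, `to_bit_vector` and `block_by` are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

VECTOR_BITS = 64  # assumed vector length, chosen only for this example


def standardise(value):
    """Lower-case and collapse whitespace (a stand-in for the data wrangling step)."""
    return " ".join(value.lower().split())


def bigrams(text):
    """Return the set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def to_bit_vector(text, bits=VECTOR_BITS):
    """Set one bit per bigram and return the bit vector as a Python int."""
    vector = 0
    for gram in bigrams(standardise(text)):
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % bits
        vector |= 1 << position
    return vector


def block_by(records, key="region"):
    """Group records by a blocking key so that only records in the same block are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return dict(blocks)
```

With this encoding, two spellings of a name that share most of their bigrams also share most of their 1 bits, which is what the similarity measure then exploits. Blocking by region means a record is never compared against records from other regions.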

 

Similarity measure

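Sorensen: D(a,b) = 2h / ( |a| + |b| ), where h is the number of positions at which both vectors a and b have a 1, and |a| and |b| are the numbers of 1’s in a and b respectively. The index therefore compares the positions of the 1’s in the two vectors: the common positions are counted in h, doubled, and divided by the total number of 1’s in a and b. A value of 1 is a complete match.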
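
As a small worked example of the formula, the sketch below computes the Sorensen index for bit vectors stored as Python integers. The function name `sorensen_dice` and the choice to return 1.0 for two empty vectors are assumptions, not something taken from the paper.

```python
def sorensen_dice(a, b):
    """Sorensen index 2h / (|a| + |b|) for two bit vectors held as ints."""
    h = bin(a & b).count("1")  # positions where both vectors have a 1
    total = bin(a).count("1") + bin(b).count("1")  # |a| + |b|
    if total == 0:
        return 1.0  # assumption: two empty vectors are treated as identical
    return 2 * h / total


a = 0b10110000110101  # seven 1's
b = 0b10110000110000  # five 1's, all of them also set in a
score = sorensen_dice(a, b)  # 2 * 5 / (7 + 5) = 10 / 12
```

Here h = 5, |a| = 7 and |b| = 5, so D(a,b) = 10/12, roughly 0.83. Comparing a vector with itself gives 1, and two vectors with no 1's in common give 0.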

 

Machine Learning Methods

Machine learning can also be used for duplication detection. The method suggested below is the use of supervised machine learning (for record linkage).
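
The general idea of supervised record linkage can be sketched as follows: turn each pair of records into similarity features, then train a classifier on pairs that have been labelled as match or non-match. This sketch is plain Python with a hand-written logistic regression, not Spark ML, and it is not the method from the talk linked below. The records, the two features and all function names are hypothetical.

```python
import math
from difflib import SequenceMatcher


def rec(name, birth_year):
    """Build a hypothetical record."""
    return {"name": name, "birth_year": birth_year}


def pair_features(rec_a, rec_b):
    """Features for a pair: name similarity in [0, 1] and a same-birth-year flag."""
    name_sim = SequenceMatcher(None, rec_a["name"], rec_b["name"]).ratio()
    same_year = 1.0 if rec_a["birth_year"] == rec_b["birth_year"] else 0.0
    return [name_sim, same_year]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def match_probability(model, rec_a, rec_b):
    """Estimated probability that two records refer to the same entity."""
    weights, bias = model
    x = pair_features(rec_a, rec_b)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical labelled pairs: 1 = same person, 0 = different people.
TRAINING_PAIRS = [
    (rec("maria silva", 1980), rec("maria silva", 1980), 1),
    (rec("joao souza", 1975), rec("joao sousa", 1975), 1),
    (rec("maria silva", 1980), rec("pedro alves", 1962), 0),
    (rec("ana costa", 1990), rec("carlos lima", 1985), 0),
]

model = train(
    [pair_features(a, b) for a, b, _ in TRAINING_PAIRS],
    [label for _, _, label in TRAINING_PAIRS],
)
```

A new pair can then be scored with `match_probability(model, rec_a, rec_b)` and treated as a match above some threshold such as 0.5. In practice the training set would need to be far larger and the features richer; four pairs are used here only to keep the example short.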

 

Record matching with Spark ML (video)

https://speakerdeck.com/aseigneurin/record-linkage-a-real-use-case-with-spark-ml

 

Slides (for the video above)

https://speakerd.s3.amazonaws.com/presentations/b91b10b4150746a0848bccc8139ad940/Record_Linkage__a_real_use_case_with_Spark_ML.pdf