Afrika Statistika

The exact probability law for the approximated similarity from the Minhashing method

Soumaila Dembele and Gane Samb Lo


We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman ($RU$) algorithm and of a modified version of it, denoted $RUM$. These algorithms aim to estimate the similarity index between huge texts in the context of the web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the algorithm. Some extensions are given.


We propose a probabilistic framework in which we study the probability law of the Rajaraman and Ullman algorithm ($RU$) as well as a modified version of this algorithm, denoted $RUM$. These algorithms aim to estimate the similarity index between large texts in the context of the Web. We provide a basis of validity for this method by showing that, for carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity given by the $RUM$ algorithm. Generalizations are discussed.
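
The abstract does not spell out the $RU$ and $RUM$ algorithms, so the following is only a minimal sketch of the textbook minhashing idea they build on, not the authors' method. It assumes that the similarity index is the Jaccard similarity and that the hash functions are idealized as uniformly random permutations of the union of the two sets; the function names (`jaccard`, `minhash_signature`, `estimate_similarity`) and the parameters `num_perm` and `seed` are hypothetical. Under these assumptions, the fraction of matching signature entries is a random similarity whose expectation equals the exact similarity.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def minhash_signature(items, permutations):
    """For each permutation (a dict element -> rank), keep the smallest rank
    taken by an element of `items`."""
    return [min(perm[x] for x in items) for perm in permutations]


def estimate_similarity(a, b, num_perm=200, seed=0):
    """Estimate the Jaccard similarity of two non-empty sets by minhashing.

    Assumption: each 'hash function' is a uniformly random permutation of the
    union of the two sets, which is the idealized case of the method.
    """
    rng = random.Random(seed)
    universe = sorted(set(a) | set(b))
    permutations = []
    for _ in range(num_perm):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        permutations.append(dict(zip(universe, ranks)))
    sig_a = minhash_signature(a, permutations)
    sig_b = minhash_signature(b, permutations)
    # Two minima coincide exactly when the lowest-ranked element of the union
    # lies in the intersection, which happens with probability J(A, B).
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / num_perm


if __name__ == "__main__":
    A = set(range(0, 60))
    B = set(range(30, 90))
    print("exact similarity:    ", jaccard(A, B))
    print("estimated similarity:", estimate_similarity(A, B, num_perm=2000))
```

Each signature entry matches with probability equal to the exact similarity, so the estimate is an average of `num_perm` Bernoulli trials and its spread shrinks as `num_perm` grows. Practical implementations typically replace explicit permutations with hash functions, which is where the choice of probability law starts to matter.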

Article information

Afr. Stat., Volume 12, Number 1 (2017), 1199-1218.

Received: 1 March 2017
Revised: 3 April 2017
First available in Project Euclid: 22 April 2017

Digital Object Identifier: doi:10.16929/as/2017.1199.100

Primary: 62E15: Exact distribution theory; 62F12: Asymptotic properties of estimators; 68R05: Combinatorics; 68R15: Combinatorics on words; 68Q97

Minhashing algorithms; similarity estimation; probability laws; convergence of algorithms


Dembele, Soumaila; Lo, Gane Samb. The exact probability law for the approximated similarity from the Minhashing method. Afr. Stat. 12 (2017), no. 1, 1199--1218. doi:10.16929/as/2017.1199.100.
