Afrika Statistika

The exact probability law for the approximated similarity from the Minhashing method

Soumaila Dembele and Gane Samb Lo

Abstract

We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman algorithm $RU$ and of a modified version of it, denoted by $RUM$. These algorithms aim to estimate the similarity index between very large texts in the context of the web. We provide a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the algorithm. Some extensions are given.
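As a rough illustration of the claim that the exact similarity is the expectation of the random similarity, the following Python sketch compares the exact Jaccard similarity of two small sets with a textbook minhash estimate based on random permutations. It is not the $RU$ or $RUM$ algorithm studied in the article, whose details are not given here; the function names `jaccard` and `minhash_estimate`, the parameter `num_perm`, and the use of uniformly random permutations are all assumptions made for the example.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_estimate(a, b, num_perm=200, seed=0):
    """Estimate the Jaccard similarity with num_perm random permutations.

    Hypothetical helper: for each uniformly random ranking of the union,
    the two sets share the same minimum rank with probability equal to
    their Jaccard similarity, so the fraction of matches is an unbiased
    estimate of it.
    """
    a, b = set(a), set(b)
    universe = sorted(a | b)  # sorted only to make runs reproducible
    if not universe:
        return 1.0
    if not a or not b:
        return 0.0
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_perm):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)  # one random permutation of the universe
        rank = dict(zip(universe, ranks))
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            matches += 1
    return matches / num_perm


if __name__ == "__main__":
    doc1 = {"the", "quick", "brown", "fox"}
    doc2 = {"the", "lazy", "brown", "dog"}
    print("exact    :", jaccard(doc1, doc2))
    print("estimated:", minhash_estimate(doc1, doc2, num_perm=2000))
```

With more permutations the estimate concentrates around the exact value; in practice the permutations are usually replaced by hash functions so that the universe never has to be enumerated.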

Abstract (translated from the French)

We propose a probabilistic framework in which we study the probability law of the Rajaraman and Ullman algorithm $RU$ and of a modified version of this algorithm, denoted $RUM$. These algorithms aim to estimate the similarity index between very large texts in the context of the Web. We give a basis for the validity of this method by showing that, for carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity given by the RUM algorithm. Generalizations are discussed.

Article information

Source
Afr. Stat., Volume 12, Number 1 (2017), 1199-1218.

Dates
Received: 1 March 2017
Revised: 3 April 2017
First available in Project Euclid: 22 April 2017

Permanent link to this document
https://projecteuclid.org/euclid.as/1492826422

Digital Object Identifier
doi:10.16929/as/2017.1199.100

Mathematical Reviews number (MathSciNet)
MR3638979

Zentralblatt MATH identifier
1362.62033

Subjects
Primary: 62E15: Exact distribution theory; 62F12: Asymptotic properties of estimators; 68R05: Combinatorics; 68R15: Combinatorics on words; 68Q97

Keywords
Minhashing algorithms; similarity estimation; probability laws; convergence of algorithms

Citation

Dembele, Soumaila; Lo, Gane Samb. The exact probability law for the approximated similarity from the Minhashing method. Afr. Stat. 12 (2017), no. 1, 1199--1218. doi:10.16929/as/2017.1199.100. https://projecteuclid.org/euclid.as/1492826422