What is Locality-sensitive hashing

Understanding Locality-Sensitive Hashing

Introduction

Locality-sensitive hashing (LSH) is an algorithm used in computer science for solving the nearest neighbor search problem in high-dimensional spaces. The problem is that exact nearest-neighbor algorithms (linear search or k-d trees) are infeasible for large datasets. In these types of scenarios, LSH comes to the rescue. It quickly finds approximate nearest neighbors, making it ideal for applications such as recommendation systems, image recognition, DNA analysis, and more.

The Idea behind LSH

The idea behind LSH is to group similar items into the same buckets. Buckets are sets of items that have been hashed into the same value. Hash functions generate these values, and they are designed to map similar inputs to the same hash values with a high probability and dissimilar inputs to different values. Hashing reduces the dimensionality of the problem by projecting the original data into a lower-dimensional space, but the space is not ordinary. Instead, it is a space of hash values. Hash functions define this space. In this space of hash values, similar items are close to each other, while dissimilar items are far apart.

Hashing Functions in LSH

There are two types of hashing functions used in LSH: random projection and permutation-based hashing.

Random projection: Random projection makes use of the Johnson-Lindenstrauss Lemma. This lemma states that points in high-dimensional space can be transformed to a lower-dimensional Euclidean space while preserving the pairwise distances between them. Random projection generates a series of random vectors, which are then used to project the high-dimensional vectors to lower-dimensional ones. The distance between these vectors is approximated by the dot product of the projected vectors.
Permutation-based hashing: Permutation-based hashing generates hashes by permuting elements of the feature vector. As an example, consider a document with terms “apple”, “banana” and “carrot”. If we generate two permutations: “carrot”, “apple”, “banana” and “banana”, “apple”, “carrot”, then we could hash the documents using the hash values of these two permutations of the document. The idea behind permutation-based hashing is that, even though two documents might not share any common terms, there is a chance that they share the same permutation values.

Locality-Sensitive Hashing Algorithm

The LSH algorithm works as follows:

Step 1: Apply a series of hash functions to each object in the dataset.
Step 2: Group similar objects into the same bucket. Similar objects are defined as those which hash to the same bucket.
Step 3: Only objects within the same bucket require further checking for similarity.

Buckets and Thresholds

LSH generates a large number of small buckets and groups all vectors that have been hashed to the same bucket. The similarity search is performed by scanning only the items in the same bucket in which a query vector lies. To improve the performance of the algorithm, LSH makes use of multiple hash tables. That means hashing the same data multiple times with different hash functions. The final output of LSH is the union of all similar candidates obtained from all hash tables.

Another important concept in LSH is the threshold distance or threshold similarity. In most cases, LSH returns candidates, which are approximately similar, and not exactly similar. Nevertheless, the user can specify a threshold distance or similarity, or the threshold can be set dynamically. Typically, lower threshold values would yield fewer candidates, while higher threshold values would return more candidates.

Conclusion

Locality-sensitive hashing is a powerful algorithm for solving the nearest neighbor search problem in high-dimensional spaces. It is widely used in various applications such as recommendation systems, image recognition, DNA analysis, and more. LSH works by generating hash functions that project the original data into a lower-dimensional space to group similar items into the same buckets. The algorithm has potentially high recall, meaning, it will retrieve most relevant candidates, but at the same, the number of false-positives is also high. Practitioners need to be careful when setting threshold values and should choose the ones that suit their needs. Overall, locality-sensitive hashing is an intuitive and efficient way to approximate nearest neighbors in high-dimensional spaces.

Related AI Basics