/pyLSHash

Locality Sensitive Hashing, fuzzy-hash, min-hash, simhash, aHash, pHash, dHash。基于 Hash值的图片相似度、文本相似度

Primary LanguagePythonMIT LicenseMIT

PyPI Python package codecov License Python Platform stars

A fast Python implementation of locality sensitive hashing.

Algorithm Function Application Features
fuzzy-hash Map text or string or file to 64-bits (or other) hash values. Similar contents hash similar hash values Fast compare similar contents Suitable for text/string/file
min-hash Map sets to signature matrices and find similar sets by calculating Jaccard similarity Similarity retrieval Suitable for text, network, audio, and other data
SimHash Convert high-dimensional data such as text and images into fixed-length vectors, and map similar vectors to the same bucket through hash functions Text and image similarity retrieval Suitable for high-dimensional data
aHash Compress images to a fixed size and map similar images to the same bucket through hash functions Similar image retrieval Has some robustness to scaling and slight deformations
dHash Convert images to grayscale and calculate difference values, then map similar images to the same bucket through hash functions Similar image retrieval Has some robustness to scaling and slight deformations
pHash Convert images to DCT coefficients and map similar images to the same bucket through hash functions Similar image retrieval Has some robustness to scaling, brightness, translation, rotation, and noise addition
LSH Map high-dimensional vectors to low-dimensional space and map similar vectors to the same bucket through hash functions Fast search for approximate vectors Suitable for large-scale high-dimensional data

Highlights

  • Fast hash calculation for large amount of high dimensional data through the use of numpy arrays.
  • Built-in support for persistency through Redis.
  • Multiple hash indexes support.
  • Built-in support for common distance/objective functions for ranking outputs.

Installation

pyLSHash depends on the following libraries:

  • numpy
  • redis (if persistency through Redis is needed)

To install:

$ pip install pyLSHash

Quickstart

fuzzy-hash

sentence1 = '''
近期,有一部热播硬核电视剧引发全网关注。与其他硬核电视剧不同的是,这部电视剧真的硬“核”,含“核”量高达100%。
这就是《许你万家灯火》——首部全景反映我国核电工业发展历程的电视剧。
《许你万家灯火》极具年代感,这是因为取景地之一是**核动力的发源地——九〇九基地。
**第一代核潜艇研发实验基地
也是**核动力研究设计院的前身

剧组在基地里面内置景,1:1复刻了主要场景十余个,
包括第一座陆上模式堆主控室、核电大院、核电办公楼、核电家属楼、零号点、医院、图书馆、大礼堂等,
高度还原了老一辈核工业人的研发和生活环境。

而《许你万家灯火》的创作题材,便是**完全自主知识产权的三代核电技术——“华龙一号”。
从核潜艇研发起步的**核工业
如何实现拥有世界一流核电站的梦想?
'''

sentence2 = '''
你好:
近期,有一部热播硬核节目引发全网关注。与其他硬核节目不同的是,这部电视剧真的硬“核”,含“核”量高达100%。
这就是《许你万家灯火》——首部全景反映我国核电工业发展历程的电视剧。
《许你万家灯火》极具年代感,这是因为取景地之一是**核动力的发源地——九〇九基地。
**第一代核潜艇研发实验基地
也是**核动力研究设计院的前身
剧组在基地里面内置景,1:1复刻了主要场景十余个,
包括第一座陆上模式堆主控室、核电大院、核电办公楼、核电家属楼、零号点、医院、图书馆、大礼堂等,
高度还原了老一辈核工业人的研发和生活环境。
而《许你万家灯火》的创作题材,便是**完全自主知识产权的三代核电技术——“华龙一号”。
从核潜艇研发起步的**核工业
如何实现拥有世界一流核电站的梦想?
感谢!
'''

from pyLSHash import FuzzyHash

fuzzy_hash = FuzzyHash()

hash1 = fuzzy_hash.get_hash(sentence1.encode('utf-8'))
hash2 = fuzzy_hash.get_hash(sentence2.encode('utf-8'))

print(hash1)
print(hash2)

corr = fuzzy_hash.compare(hash1, hash2)
print('corr = {}%'.format(corr))

b'24:NCRqxthHLDYTvxiiIhNM+Nkr6gy8C4xB6YR514cLCxd6tXKlru2uEj:tBHATdN+OuNOZrIxnAa' b'24:TsoR7RmxthHLDYTvxiiIhNM+Nkr6gy8o4xB6YR514cLCxd6tXilru2uEUv:fR7RmBHATdN+OulOZrIxdA7' corr = 86%

Look at examples/example_fuzzy_hash.py

SimHash

from pyLSHash import SimHash, hamming

sim_hash = SimHash()

sh1 = sim_hash.get_hash(sentence1)
sh2 = sim_hash.get_hash(sentence2)

corr = 1 - hamming(sh1, sh2) / sim_hash.len_hash
print(sh1)
print(sh2)
print('corr = {}'.format(corr))

957004571726091744
943493772323861728
corr = 0.890625

Look at examples/example_simhash.py

minHash

from pyLSHash import min_hash

k = 3  # minHash 值的维度

x1 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
x2 = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]

n = len(x1)  # 向量的维度
min_hash_val1 = min_hash.get_min_hash(x1, n, k)
min_hash_val2 = min_hash.get_min_hash(x2, n, k)
print(min_hash_val1)
print(min_hash_val2)

[1, 0, 0]
[1, 0, 0]

Look at examples/example_min_hash.py

aHash/dHash/pHash

aHash

a_hash_img1 = img_hash.a_hash(PIL.Image.open(img1))
a_hash_img2 = img_hash.a_hash(PIL.Image.open(img2))
hamming_distance = hamming(a_hash_img1, a_hash_img2)

dHash

d_hash_img1 = img_hash.d_hash(PIL.Image.open(img1))
d_hash_img2 = img_hash.d_hash(PIL.Image.open(img2))
hamming_distance = hamming(d_hash_img1, d_hash_img2)

pHash

p_hash_img1 = img_hash.p_hash(PIL.Image.open(img1))
p_hash_img2 = img_hash.p_hash(PIL.Image.open(img2))
hamming_distance = hamming(p_hash_img1, p_hash_img2)

outputs:

[aHash]: img1 = 0xffc3c3db819f0000, img2 = 0xffc3c3cb819f0000
hamming_distance = 1
[dHash]: img1 = 0x7ffae0c63d188743, img2 = 0x7ffae0c23d188743
hamming_distance = 1
[pHash]: img1 = 0xa8a0008200000000, img2 = 0xa8a0008200000000
hamming_distance = 0

Look at examples/example_img_hash.py

LSHash

To create 6-bit hashes for input data of 8 dimensions:

from pyLSHash import LSHash

lsh = LSHash(hash_size=6, input_dim=8)
lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
lsh.index([2, 3, 4, 5, 6, 7, 8, 9])
# attach extra_data
lsh.index([2, 3, 4, 5, 6, 7, 8, 9], extra_data="some vector info")
lsh.index([10, 12, 99, 1, 5, 31, 2, 3])

res = lsh.query([1, 2, 3, 4, 5, 6, 7, 7])

[((1, 2, 3, 4, 5, 6, 7, 8), 1.0), ((2, 3, 4, 5, 6, 7, 8, 9), 11)]

User defined distance function

def l1norm_dist(x, y):
    return sum(abs(x - y))


res2 = lsh.query([1, 2, 3, 4, 5, 6, 7, 7], dist_func=l1norm_dist)

print(res2)

Use Redis

from pyLSHash import LSHash

lsh = LSHash(hash_size=6, input_dim=8
             , storage_instance=RedisStorage({'host': 'localhost', 'port': 6379, 'decode_responses': True}))

lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
lsh.index([2, 3, 4, 5, 6, 7, 8, 9])
# attach extra_data
lsh.index([2, 3, 4, 5, 6, 7, 8, 9], extra_data="some vector info")
lsh.index([10, 12, 99, 1, 5, 31, 2, 3])
lsh.index([10, 12, 99, 1, 5, 31, 2, 3])

res = lsh.query([1, 2, 3, 4, 5, 6, 7, 7])

Use other database as storage

from pyLSHash import LSHash
from pyLSHash.storage import StorageBase
import redis
import json


class MyStorage(StorageBase):
    def __init__(self):
        self.storage = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

    def keys(self, pattern="*"):
        return self.storage.keys(pattern)

    def set_val(self, key, val):
        self.storage.set(key, val)

    def get_val(self, key):
        return self.storage.get(key)

    def append_val(self, key, val):
        self.storage.rpush(key, json.dumps(val))

    def get_list(self, key):
        res_list = [json.loads(val) for val in self.storage.lrange(key, 0, -1)]
        return tuple((tuple(item[0]), item[1]) for item in res_list)

    def clear(self):
        for key in self.storage.keys():
            self.storage.delete(key)


lsh = LSHash(hash_size=6, input_dim=8
             , storage_instance=MyStorage())

lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
lsh.index([2, 3, 4, 5, 6, 7, 8, 9])
lsh.index([2, 3, 4, 5, 6, 7, 8, 9], extra_data="some vector info")
lsh.index([10, 12, 99, 1, 5, 31, 2, 3])
lsh.index([10, 12, 99, 1, 5, 31, 2, 3])

res = lsh.query([1, 2, 3, 4, 5, 6, 7, 7])

save&load model

lsh.save_uniform_planes("filename.pkl")
lsh.load_uniform_planes("filename.pkl")

clear indexed data

lsh.clear_storage()

Other examples