# Questions tagged [simhash]

15 questions

1 vote, 1 answer, 914 views

### How to compare the similarity of documents with Simhash algorithm?

I'm currently creating a program that computes a near-duplicate score within a corpus of text documents (5000+ docs).
I'm using Simhash to generate a unique fingerprint of each document (thanks to this GitHub repo).
My data is:
data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm'...
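As a rough illustration of the comparison this question asks about, here is a minimal pure-Python sketch of a simhash fingerprint and the Hamming distance between two fingerprints. It does not use the library from the question; the helper names (`features`, `simhash`, `hamming`), the use of MD5 as the per-feature hash, and the third document are all assumptions made for the example.

```python
import hashlib
import re


def features(text, width=3):
    """Character shingles of the lowercased text with non-word characters removed."""
    s = re.sub(r"[^\w]+", "", text.lower())
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


def simhash(text, bits=64):
    """Build a `bits`-wide fingerprint by letting every feature vote on each bit."""
    votes = [0] * bits
    for feat in features(text):
        # Assumption: the top `bits` bits of an MD5 digest serve as the feature hash.
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") >> (128 - bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when more features voted 1 than 0.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


docs = {
    1: "Im testing simhash algorithm.",
    2: "test of simhash algorithm",
    3: "completely unrelated sentence about cooking pasta",  # hypothetical extra document
}
prints = {key: simhash(text) for key, text in docs.items()}
print(hamming(prints[1], prints[2]), hamming(prints[1], prints[3]))
```

A smaller distance suggests more similar documents; what threshold counts as a near-duplicate depends on the corpus and has to be tuned.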

1 vote, 0 answers, 97 views

### how to allot index number using SimhashIndex() to a document dataset?

This code applies the Simhash function to four sets of data.
import re
from simhash import Simhash, SimhashIndex
def get_features(s):
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
data = {
    1: u'How are you? I Am fine. blar b...
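To make the `get_features` helper from the excerpt easier to follow, the sketch below runs the same logic on a short sample string. The only change is that `width` is exposed as a parameter, which is an assumption for illustration; the sample input is made up.

```python
import re


def get_features(s, width=3):
    # Same steps as the excerpt: lowercase, drop non-word characters,
    # then take every overlapping slice of `width` characters.
    s = s.lower()
    s = re.sub(r"[^\w]+", "", s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


print(get_features("How are you?"))
# ['how', 'owa', 'war', 'are', 'rey', 'eyo', 'you']
```

Inputs shorter than `width` still yield a single feature, because the range is clamped to at least one slice.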

1 vote, 1 answer, 96 views

### Pandas: matrix calculation on values

I have a dataframe like this:
       apple  aple  apply
apple      0     0      0
aple       0     0      0
apply      0     0      0
I want to calculate the string distance between each pair, e.g. apple -> aple. My desired end result is:
       apple  aple  apply
apple      0    32     14
aple      32     0     30
apply     14    30      0
Cu...

1 vote, 2 answers, 0 views

### simhash like algorithm to compare two text documents

The problem is:
I have a collection of text documents, and I want to pick the one most similar to an input document.
The input document could be an exact match or a partly modified copy.
The algorithm must be very fast.
So far, I have found simhash for fingerprinting the documents in the collection. Is there any...

1 vote, 2 answers, 1.3k views

### Similarity Hash function(simhash)

I have a problem using a hash function. I have to assign a number (128-bit or 64-bit) to every word in the document, so that the hash value of 'similarity' is near that of 'similar'. That means, if the hash value of similarity => 10022 (say), then similar => 10025, which should be near the value of a similar word. als...

1 vote, 1 answer, 1.3k views

### calculate pairwise simhash “distances”

I want to construct a pairwise distance matrix where the 'distances' are the similarity scores between two strings as implemented here. I was thinking of using scikit-learn's pairwise distance method to do this, as I've used it before for other calculations and the easy parallelization is great.
He...
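For a sense of what such a matrix looks like, here is a small pure-Python sketch that fills a symmetric matrix from any two-argument distance function. It is not the scikit-learn approach the question mentions; `pairwise_matrix` is a hypothetical helper, and the fingerprints are made-up 4-bit values rather than real simhashes.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def pairwise_matrix(values, dist):
    """Symmetric matrix of dist(values[i], values[j]) with zeros on the diagonal."""
    n = len(values)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # The distance is symmetric, so each pair is computed only once.
        matrix[i][j] = matrix[j][i] = dist(values[i], values[j])
    return matrix


fingerprints = [0b1011, 0b1001, 0b0110]  # made-up 4-bit values
for row in pairwise_matrix(fingerprints, hamming):
    print(row)
# [0, 1, 3]
# [1, 0, 4]
# [3, 4, 0]
```

This computes n(n-1)/2 distances, so for large collections an index or a parallel routine would likely be needed instead.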

2 votes, 1 answer, 482 views

### MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here:
https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
could yield three clusters ({A}, {B,C,D} and {E}), for instance, if its results were:
A -> h01
B -> h02
C -> h02
D ->...
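For contrast with the SimHash side of this question, the following is a minimal sketch of MinHash, which estimates the Jaccard similarity of two sets from the fraction of matching signature positions. The seeded MD5 hash family, the function names, and the contents of the two sets are assumptions chosen for illustration, not taken from the question or the linked post.

```python
import hashlib


def _seeded_hash(seed, item):
    """One member of a hash family, selected by `seed` (assumption: seeded MD5)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


A = {"a", "b", "c", "d"}  # hypothetical set contents
B = {"a", "b", "c", "e"}
true_jaccard = len(A & B) / len(A | B)  # 3 shared out of 5 distinct elements
estimate = estimated_jaccard(minhash_signature(A), minhash_signature(B))
print(true_jaccard, estimate)
```

The estimate approaches the true value as `num_hashes` grows. SimHash, by comparison, compresses each item into one fingerprint and compares fingerprints by Hamming distance.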

2 votes, 4 answers, 772 views

### Hash function that maps similar inputs to similar outputs?

Is there a hash function where small changes in the input result in small changes in the output? For example, something like:
hash('Foo') => 9e107d9d372bb6826bd81d3542a419d6
hash('Foo!') => 9e107d9d372bb6826bd81d3542a419d7

1 vote, 1 answer, 913 views

### Hamming distance (Simhash python) giving out unexpected value

I was checking out the Simhash module (https://github.com/leonsim/simhash).
I presume that Simhash('String').distance(Simhash('Another string')) is the Hamming distance between the two strings. Now, I am not sure I fully understand the get_features(string) method, as shown in (https://leons...

13 votes, 3 answers, 6.7k views

### SimHash implementation in Java?

Has anyone come across a simhash function implemented in Java?
I've already searched for it, but couldn't find anything.

2 votes, 2 answers, 424 views

### What more advantageous minhash over simhash?

I am working with simhash, but I have also read that minhash is more effective.
I don't understand why.
Please explain: what are the advantages of minhash over simhash?

2 votes, 0 answers, 303 views

### SimHash implementation in R [closed]

Is there an implementation of simhash in R?
(SimHash is a hash algorithm created by Moses Charikar, which gives similar objects similar hashes.)

2 votes, 1 answer, 1.3k views

### comparing web pages - simhash, and DOM edge node processing

This isn't a programming issue yet!
But I'm looking into how you'd compare web pages to see if the pages are the same/similar. This is a personal project, not for work/school... (just sayin!)
I've found a few basic simhash implementations, and was wondering if anyone could point me to a really good...

2 votes, 0 answers, 270 views

### MongoDB support search Bitwise XOR and Bit Count?

I would like to move from MySQL to MongoDB. One question I cannot find an answer to is whether I can get or simulate XOR and bit count, which I need.
In MySQL I would do:
SELECT BIT_COUNT(SimHash ^ $SimHash) as simhash ... ORDER BY simhash
Is it possible to do something similar in MongoDB?
Basically...
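To spell out what the MySQL expression computes, here is a client-side sketch in plain Python that ranks records by `BIT_COUNT(SimHash ^ $SimHash)`, that is, by Hamming distance to a query fingerprint. It makes no claim about what MongoDB supports on the server; the record layout, the `rank_by_hamming` helper, and the 8-bit values are hypothetical.

```python
def bit_count(n):
    """Number of set bits, the same quantity MySQL's BIT_COUNT() returns."""
    return bin(n).count("1")


def rank_by_hamming(records, query):
    """Sort records by the number of bits in which their simhash differs from `query`."""
    return sorted(records, key=lambda r: bit_count(r["simhash"] ^ query))


records = [  # made-up rows with 8-bit fingerprints
    {"id": 1, "simhash": 0b11110000},
    {"id": 2, "simhash": 0b11111111},
    {"id": 3, "simhash": 0b11110001},
]
print([r["id"] for r in rank_by_hamming(records, 0b11110000)])
# [1, 3, 2]
```

Sorting on the client requires fetching every candidate fingerprint first, so this only suits small result sets or a pre-filtered subset.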

5 votes, 2 answers, 2.4k views

### Make a Sim Hash (Locality Sensitive Hashing) Algorithm more accurate?

I have 'records' (basically CSV strings) of two names and one address. I need to find records that are similar to each other: basically the names and address portions all look 'alike' as if they were interpreted by a human.
I used the ideas from this excellent blog post: http://knol.google.com/k...