Questions tagged [simhash]

1

votes
1

answer
914

Views

How to compare the similarity of documents with Simhash algorithm?

I'm currently creating a program that can compute near-dupliate score within a corpus of text documents (+5000 docs). I'm using Simhash to generate a uniq footprint of a document (thanks to this github repo) my datas are : data = { 1: u'Im testing simhash algorithm.', 2: u'test of simhash algorithm'...
Dany M
1

votes
0

answer
97

Views

how to allot index number using SimhashIndex() to a document dataset?

This code implements Simhash function of four set of data. import re from simhash import Simhash, SimhashIndex def get_features(s): width = 3 s = s.lower() s = re.sub(r'[^\w]+', '', s) return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))] data = { 1: u'How are you? I Am fine. blar b...
shipika singh
1

votes
1

answer
96

Views

Pandas: matrix calculation on values

I have dataframe like this: apple aple apply apple 0 0 0 aple 0 0 0 apply 0 0 0 I want to calculate string distance e.g apple -> aple etc. My end result is here: apple aple apply apple 0 32 14 aple 32 0 30 apply 14 30 0 Cu...
Shakeel Mumtaz
1

votes
2

answer
0

Views

simhash like algorithm to compare two text documents

The problem is: I have a collection of text documents, i want to pick up the most similar one to the input one. The input text document could be exactly match or modified partly. The algorithm must be very fast. Currently, I found simhash to take a fingerprint from collection documents. Is there any...
xijing dai
1

votes
2

answer
1.3k

Views

Similarity Hash function(simhash)

I have a problem with using hash function. I have to assign some number(128 bit or 64 bit) with every word in the document. So, the hash value of 'similarity' must be near with 'similar'. That means, if has value of similarity=>10022(say) then similar=>10025. which should near with similar word. als...
Sanjaya Pandey
1

votes
1

answer
1.3k

Views

calculate pairwise simhash “distances”

I want to construct a pairwise distance matrix where the 'distances' are the similarity scores between two strings as implemented here. I was thinking of using sci-kit learn's pairwise distance method to do this, as I've used it before for other calculations and the easy parallelization is great. He...
user139014
2

votes
1

answer
482

Views

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if its results were: A -> h01 B -> h02 C -> h02 D ->...
cjauvin
2

votes
4

answer
772

Views

Hash function that maps similar inputs to similar outputs?

Is there a hash function where small changes in the input result in small changes in the output? For example, something like: hash('Foo') => 9e107d9d372bb6826bd81d3542a419d6 hash('Foo!') => 9e107d9d372bb6826bd81d3542a419d7
Paul Wicks
1

votes
1

answer
913

Views

Hamming distance (Simhash python) giving out unexpected value

I was checking out Simhash module ( https://github.com/leonsim/simhash ). I presume that the Simhash('String').distance(Simhash('Another string')) is the hamming distance between the two strings. Now, I am not sure I understand this 'get_features(string) method completely, as shown in (https://leons...
Ishan Sharma
13

votes
3

answer
6.7k

Views

SimHash implementation in Java?

Has anyone come across a simhash function implemented in Java? I've already searched for it, but couldn't find anything.
Joel
2

votes
2

answer
424

Views

What more advantageous minhash over simhash?

I am working with simhash but also see minhash is more effective. But I don't understand. Please explain for me: What more advantageous minhash over simhash ?
xfr1end
2

votes
0

answer
303

Views

SimHash implementation in R [closed]

Is there an implementation of simhash in R? (SimHash is a hash algorithm created by Moses Charikaris which gives similar objects similar hashes)
dzeltzer
2

votes
1

answer
1.3k

Views

comparing web pages - simhash, and DOM edge node processing

This isn't a programming issue yet! But I'm looking into how you'd compare web pages to see if the pages are the same/similar. This is a personal project, not for work/school... (just sayin!) I've found a few basic simhash implementations, and was wondering if anyone could point me to a really good...
tom smith
2

votes
0

answer
270

Views

MongoDB support search Bitwise XOR and Bit Count?

I would like to move from MYSQL to MongoDB, one of the question I can not find answer for, if I can get or simulate XOR and Bit Count, which I need. In MYSQL I would do: SELECT BIT_COUNT(SimHash ^ $SimHash) as simhash ... ORDER BY simhash It is possible to do something similar in MongoDB ? Basically...
2ge
5

votes
2

answer
2.4k

Views

Make a Sim Hash (Locality Sensitive Hashing) Algorithm more accurate?

I have 'records' (basically CSV strings) of two names and and one address. I need to find records that are similar to each other: basically the names and address portions all look 'alike' as if they were interpreted by a human. I used the ideas from this excellent blog post: http://knol.google.com/k...
banncee