-
-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Too many collisions? #27
Comments
That's weird. Can you make sure you are comparing integers, and not strings? |
I am not following. I have both of those hashes stored in the database just like that. I could not find out how to get and store the 01010101011 representation. So the query actually would look like SELECT c.*, BIT_COUNT('12e627593dbc2307' ^ '12d2552c66ddc94b') as hamming_distance |
Following up on this. What am I doing wrong or does it not work for what I want it to? |
Any help on this would be greatly appreciated. Either I am not following how to get the binary value of the hash or what I am doing is correct and it has limitations. |
Sorry for my Eng. Why you can't loaded all md5 hashes from DB to script, and use with your test pics: |
Because by the time I get done hashing all of the images there will be 7 times as many as yours. All things being equal that would be more than 5 seconds. So I need the MySQL query to work, I am just seeing no documentation on how to store and ultimately compare the binary version of a hash. For instance is there a "$hash->get_binary"? I could then store that and pass it into the query. |
Let me ask this. Is the 0101011100 just an example and not the actual format? Following through the code for distance and when I call it passing the hash I get: hash: 7e426e403d1e71f | hexdec: 568622213714011935 Should I be storing and using this hexdec value (even though it isn't 0's and 1's)? |
Ok, try with hexdec But maybe need test with more images |
ALTER TABLE photoHash ADD hashBit BIT(YOURCOUNT) for storing 01010101011 representation, but this not INT for BIT_COUNT |
From the readme it has: Image 1 hash: 3c3e0e1a3a1e1e1e (0011110000111110000011100001101000111010000111100001111000011110) How do I generate the information in the parenthesis? I have the hash, just need to know how to get it as all 0's and 1's. |
In PHP you could do |
Thanks jdreesen. So now I am attempting to store that data, but what format? The only thing that appears to keep it like this is varchar. But when I do the query like this: SELECT i.*, BIT_COUNT(101001010110000000110011011011010010101111101100011010100001101 ^ i.hash) as hamming_distance After processing only 10 images and grabbing one of the hashes all 10 are being returned as 1. Obviously I am not storing it correctly or I am not comparing correctly, or something. I apologize that I am having this much difficulty with this, but other than getting a hash none of this is really defined. |
I think I have this figured out. I made the MySQL field BigINT Unsigned, I installed bcmath(so I don't get E+19 values) and I am storing and comparing against the hexdec value of the hash. I did a limited test and it got an exact match against one image (the others were in the 20's). I did another test where there were two similar images (one color and one black and white) and it matched perfect on the color and was like 10 on the B&W). Looks like it will take 80+ hours before it will have hashed everything. I will do some more testing and see if it all is working and if so we can consider this "issue" closed (just might want to take the learnings and apply to the readme. |
Hi @doughboyks, could you elaborate on your use of bcmath to solve this ? My hashes int representation get truncated to 9223372036854775807 aka PHP_INT_MAX when I try to store them. I'm considering splitting the hash in two parts for now, like
And then querying potential duplicates this way, assuming hash_1 and hash_2 being BIGINT columns, and supplied parameters being [int image.hash_1, int image.hash_2, int treshold]. select ID, BIT_COUNT(hash_1 XOR ?) + BIT_COUNT(hash2 XOR ?) AS distance FROM pictures HAVING distance < ? But this really feels dirty, I'd like to avoid this approach.
/* Picture.php, Laravel5.5 Eloquent Model */
/**
* @param int $level
* Returns potential duplicates of this picture.
*/
public function findPotentialDuplicates(int $level = 5) {
$hash_1 = $this->hash_1;
$hash_2 = $this->hash_2;
return DB::table('pictures')
->selectRaw('id, BIT_COUNT(hash_1 ^ ?) + BIT_COUNT(hash_2 ^ ?) AS distance', [$h1, $h2])
->havingRaw('distance < ?', [$level])
->get();
}
/**
* @param Picture $a
* @param Picture $b
* @return int
*/
public static function compare(Picture $a, Picture $b) {
$hasher = self::getHasher();
return $hasher->distance($a->phash, $b->phash);
} But hey, it works for now, I'm getting similar and "opposite" images with similarity cursors, couldn't get fancier ! Thanks for documenting your issue, it allowed me to resolve this. |
@Lucassifoni, any progress on this issue? |
@scratcher28, splitting the hash in two columns worked fine with our approx. 120.000 images, with relatively fast queries on a modest server (meaning : it wasn't annoying to use), but we moved on and this system isn't in use anymore. Re-reading this after a year, my approach was naïve and I should have designed something better. |
@Lucassifoni, do you know/have any better alternatives to find duplicates? |
With the conversation on this having hit on performance, I wanted to add that I find performance for hash comparisons quite bearable if you have GMP installed. Personally, I'm going to avoid doing any fancy database shenanigans as it'd lock me into a particular storage mechanism. For 50,000 hash comparisons: While the exact times will vary depending on server performance, hashes, the position of the moon, and any number of other things, I strongly suspect the 80% reduction in execution time will be pretty consistent. |
@scratcher28 the finding duplicates part was good enough, we supplemented it with some heuristics as a duplicate flagging pipeline :
Depending on the number of flags, you could show a popup to the user to ask for confirmation and word it more or less strongly. If the user says "yes, they're the same", we keep the one that had a better resolution. |
For optimizing mysql query checkout this stackoverflow answer. It suggests to narrow down search results by adding all the bits together and caching that value in a seperate column - that will allow you to exclude checking images that are obviously not alike. |
I agree it would be helpful for some of this to exist in the install instructions. To solve for PHP displaying numbers in E notation, I ended up making this work by installing bcmath. On Ubuntu: To solve for numbers being rounded to 9223372036854775807 in the database, make sure your column is bigint unsigned. bcmath causes issues with number display (carrying what should be integers out to a crazy number of decimals). You might also find these functions helpful, in particular bcround()
|
I have tested the perceptual hash with more than 5000 images, it occasionally generate weird results, but only with very high distance. What I don't understand though, is how it can find really similar pictures at very high distances: I have absolutely no explanation for this behaviour, but I will try to merge these PR and test it again:
|
@VincentChalnot thank you so much for detailed information. We are also experiencing the same issue, although we use DifferenceHash. 8 seems to be the sweet spot for us too. How strange, maybe that can be my lottery numbers this weekend 🤣 |
There's something wrong with the included perceptual hash algorithm. I could not figure out what exactly. The other included algorithms will usually provide better results. |
@jenssegers I get similar results even with DifferenceHash algorithm though. So maybe there's something else happening? |
Check the discussion at #52 and the PR #58 that are resolving a lot of issues. #52 (comment) Things I will try to look into to detect more duplicates (not really related to pHash)
|
I tried to save the hash (toInt) into mysql, and I found the value were casted which max length is 8 byte(64 bits), so the distance calculated by mysql bit_count is not correct actually. We try to split the hash(toHex) into 2 parts and save them into two columns, add both distances of column as the real distance, sound weird ,but it works! if MySQL could XOR more than 64 bits, we can save the image hash as what we calculate in PHP |
Maybe I am doing it wrong but I ran through about 1,250 images (out of more than 350,000) and hashed (Perceptual) them. I then stored the hash in the database and I am using the SQL Bit_COUNT to get the hamming distance.
I took a random hash and ran the query and ended up with 2 at a distance of 0 (different hashes) and maybe 50 or more at a distance of 1. The farthest away is a distance of 27.
These are the two images that had different hashes but yet were still 0 away.
Hash: 12d2552c66ddc94b (image: https://10deb7fbfece20ff53da-95da5b03499e7e5b086c55c243f676a1.ssl.cf1.rackcdn.com/a1afc58c6ca9540d057299ec3016d726_l.jpg)
Hash: 12e627593dbc2307 (image: https://10deb7fbfece20ff53da-95da5b03499e7e5b086c55c243f676a1.ssl.cf1.rackcdn.com/3b8a614226a953a8cd9526fca6fe9ba5_l.jpg)
As you can see these are not anywhere close to the same.
SELECT c.*, BIT_COUNT('12e627593dbc2307' ^ i.hash) as hamming_distance
FROM images i
where hash is not null
ORDER BY hamming_distance ASC
Will this not work on "created" images? Maybe the sample size is too small for an accurate comparison. I think I am reading it is converted to an 8x8 image...maybe in my case it should be MUCH larger but I am not sure where to start.
The text was updated successfully, but these errors were encountered: