Use hashing to determine albums.
With the hashing solution, we could run duplicate detection between the files we currently generate and the files in albums.
>>> import hashlib
>>> image_file = open('2017-11-11/20171111_170331.jpg', 'rb').read()
>>> image_file2 = open('Saturday in Rockford/20171111_170331.jpg', 'rb').read()
>>> hashlib.md5(image_file).hexdigest() == hashlib.md5(image_file2).hexdigest()
True
If we find a match, we can create a file called albums.json or something to that effect and use the current directory to pull the album name. The json could look something like this:
[
{
"Saturday in Rockford": [
"20171111_170331.jpg",
"mySecondIMG2.png",
....
]
},
...
]
As much as I would hate to introduce yet another json file that we worked hard to remove from takeout folder, this is required since we can't assume an image belongs to only one album and just stick them in there. I'm open to hearing other solutions here though.
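For reference, here is a minimal sketch of how such a file could be produced. It is not the project's implementation: `file_md5`, `build_albums` and `write_albums_json` are hypothetical names, and it assumes the date folders and album folders are passed in as two separate lists.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in binary mode."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def build_albums(date_dirs, album_dirs):
    """List, per album folder, the files whose content also exists in a date folder."""
    # Hash every file in the date folders once.
    known = {
        file_md5(p)
        for d in date_dirs
        for p in Path(d).iterdir()
        if p.is_file()
    }
    albums = []
    for album_dir in map(Path, album_dirs):
        matches = sorted(
            p.name
            for p in album_dir.iterdir()
            if p.is_file() and file_md5(p) in known
        )
        if matches:
            # Same shape as the example above: a list of {album name: [file names]}.
            albums.append({album_dir.name: matches})
    return albums


def write_albums_json(albums, out_path="albums.json"):
    """Write the album list to disk, keeping non-ASCII album names readable."""
    text = json.dumps(albums, indent=2, ensure_ascii=False)
    Path(out_path).write_text(text, encoding="utf-8")
```

Because the album name is only a key in the list, the same file name can appear under several albums, which is the case the json file is meant to handle.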
An alternative would be to just create album folders and allow duplicates in those folders.
I thought about using folders with symlinks inside. It seems so obvious and cross-platform, though I didn't see any gallery app using it, and I'm not experienced with platforms other than Linux. What do you think about it?
Using json with filenames would complicate things when used with --divide-to-dates, but we could fix that
Also, I would only hash files when their sizes are identical. Comparing sizes is what the duplicate finding currently does, and it works because "date folders" usually contain ~5-20 photos, so it is very unlikely that two photos of identical size are actually different.
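A rough sketch of that "size first, hash second" idea, under the assumption that duplicates are searched within one folder at a time. `file_md5` and `find_duplicates` are hypothetical names, not functions from the script.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return groups of paths in `folder` that have identical content."""
    # Cheap pass: group files by size.
    by_size = defaultdict(list)
    for path in Path(folder).iterdir():
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, so no need to hash at all
        # Expensive pass: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_md5(path)].append(path)
        duplicates.extend(
            sorted(group) for group in by_hash.values() if len(group) > 1
        )
    return duplicates
```

Most files have a unique size, so in practice only a small fraction of them would ever be read and hashed.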
The only concern in my head was Windows, but it seems like more recent versions of Windows support it.
https://askubuntu.com/questions/470758/symbolic-links-vs-windows-shortcuts
I will do some testing on this to see:
- Do soft links work well on > Windows 7 (works fine on all mac/*nix). I assume most have at least Windows 7 OS.
- How do programs like photoprism/nextcloud view/handle these softlinks.
@TheLastGimbus I don't think soft links are the solution unfortunately. Their behavior is very reliant on the file system they are created on and later used on. Windows can have unexpected behavior between ntfs and exFAT. Further it won't work if the user uses object storage...(like me).
At this point I think the best option is to allow duplicate files in album folders by default with an option to skip album folders. This would make it so that the user doesn't have to comb their directories of albums either.
Later as a third option we could create the json file. This way they can use that file to retrofit their album solution to whatever their specific file system and album options would be. You could write a script that creates these albums and soft links on your system in other words.
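To illustrate the "retrofit" idea, here is a sketch of such a user-side script. It assumes the albums.json shape proposed earlier in this thread and that file names are unique across the processed photos; `link_albums` and its parameters are hypothetical. On Windows, creating symlinks may require extra privileges.

```python
import json
import os
from pathlib import Path


def link_albums(albums_json, photos_root, albums_root):
    """Create one folder per album and symlink each listed photo into it."""
    # Index the processed photos by file name (assumes names are unique).
    by_name = {p.name: p for p in Path(photos_root).rglob("*") if p.is_file()}
    albums = json.loads(Path(albums_json).read_text(encoding="utf-8"))

    created = []
    for entry in albums:
        for album_name, file_names in entry.items():
            album_dir = Path(albums_root) / album_name
            album_dir.mkdir(parents=True, exist_ok=True)
            for name in file_names:
                target = by_name.get(name)
                link = album_dir / name
                # Skip photos we cannot find and links that already exist.
                if target is None or link.exists() or link.is_symlink():
                    continue
                os.symlink(target.resolve(), link)
                created.append(link)
    return created
```

Someone on object storage or exFAT could swap the `os.symlink` call for a copy or an upload without touching the rest.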
Thoughts?
@TheLastGimbus I've had some time to ponder on this. It is very possible that users have uploaded the same photo twice or duplicates from different albums. I think it would make more sense to do a global duplicate check (across date and album folders) now that we have an accurate file hashing dup check. This will also incorporate album detection. I will run through a trial of how much memory and time it will take to track and run everything in this phase if we implement it this way.
Let me know if you are absolutely not interested in doing this.
I am very much interested! I'm wondering how much trouble my script was for you to modify, since it was quickly written without any intention to expand...
Doing "one big global duplicate check" makes sense - if the duration of doing it will be ~1.5x the current, it's still okay (by the way, with hashing that big, we could even try to go multi-core 😆 (if it isn't a problem in Python))
And, yea, if you will have full list of all the hashes - you will be able to do "album check" on it
By the way, try to get the info about the album from json, not the folder name - and since the "album metadata json" is named in the user's language (mine are "metadane.json" in Polish, instead of "metadata.json") - just read every json in the folder and check its tags :/
But you can optimise it by checking the folder name for regex first 👍
Okay, great tip on the *.json scan vs metadata.json.
I prefer using both json and folder names and will handle if the same names exist in both.
Re: split function, shouldn't be too big of a fix once we have the main logic there.
we could even try to go multi-core 😆 (if it isn't a problem in Python)
The problem wouldn't be python, it would be sharing the duplication dictionaries between threads/procs. We could certainly do a divide and conquer approach where we save the dictionaries of each phase to disk and then have a merge step...but that may be overkill 😆.
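One way around the shared-dictionary problem, sketched under assumptions: the workers only return `(path, digest)` pairs and the merge into a single dictionary happens in the calling thread. This version uses threads, relying on CPython's hashlib releasing the GIL while it hashes larger buffers; a process pool would have the same overall shape. `hash_file` and `global_hash_index` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def hash_file(path):
    """Return (path, md5 hex digest), reading the file in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def global_hash_index(root, max_workers=4):
    """Map content hash -> sorted list of every file under `root` with that content."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    index = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Workers never touch `index`; results are merged here, one at a time.
        for path, digest in pool.map(hash_file, files):
            index[digest].append(path)
    return {digest: sorted(paths) for digest, paths in index.items()}
```

Any hash that maps to more than one path is a duplicate, and if those paths sit in both a date folder and an album folder, that is also the album membership.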
I am a bit confused by some of the conditional logic that chooses different outputs. Can revisit that later maybe.
By the way, try to get the info about the album from json, not the folder name - and since the "album metadata json" is named in the user's language (mine are "metadane.json" in Polish, instead of "metadata.json") - just read every json in the folder and check its tags :/
@TheLastGimbus, is there a clear reason why we can't just use the folder name? I have manually scanned about 25 folders' metadata.json files and they always match the folder name. Even ones with duplicate names are using the exact same name.
"albumData": {
"title": "2020/09/24 #2",
I could do a programmatic check I assume, but there's not much to be gained here it seems and the cost is high (having to scan and open every json file in a directory that has possibly thousands of json files just to find a single metadata file doesn't seem worth it to me). Initially I'm going to implement it using just the folder name, and we can revisit using the metadata.json during the PR.
BTW, the example above shows a date folder. I will leave the way we treat date folders as is; the non-date folders will be the only ones I attempt to pull album names from.
is there a clear reason why we can't just use the folder name?
To get the album's name? No. To get the fact that the folder is the album? As we have seen in other issues (#30), probably yes
Looks like Google keeps changing how "date folders" work, so we can't rely on simple regex 😕 - we need to check the json that it's really an album, sorry...
Edit: by the way, seeing how much you commit to this, I'm honestly sorry that this code is so messy and not commented anyhow 😅 - I did not expect that it will need maintenance...
No worries :) It's not too bad really and already solves a lot of issues. It's great work and I haven't explicitly told you that or said thank you so much! I'm looking forward to helping make this a bit easier for folks to use to migrate away from the google surveillance.
On the note of datetime formats, we already have some dependencies on this format. I think it's worth it to keep up with this and ask someone to submit an issue if this changes.
e.g.
_datetime.strptime(dir, '%Y:%m:%d %H:%M:%S').strftime('%Y:%m:%d %H:%M:%S')
Edit: Also it seems like all folders (date or album) contain the same format of metadata.json and I don't see any particular field other than the title that would indicate date directory vs an album directory.
{
"albumData": {
"title": "2016/08/11",
{
"albumData": {
"title": "Saturday in Rockford",
we already have some dependencies on this format
If you mean the format of the "date folder name", and the snippet below checking if it's valid - yes, I made it dependent, and it turns out (#30) Google flipped it upside down. I'm especially concerned with this now, because I just ordered a Takeout for my mom, and it looks like it has this "new format" too 😬 I'll update when I get to download and play with it myself...
Also it seems like all folders (date or album) contain the same [...] metadata.json and I don't see [anything] that would indicate date directory vs an album directory.
Oh, great...
Ugh... what now?
Okay, we need to see if Google fully migrated to this new "year folders" format - then we will (potentially) transition fully to it...
Meanwhile, I could try to email Google as @OscarVanL - maybe they will take a look, considering how much attention this repo got... after all, is it that hard to just give us one simple, consistent .json that would contain all we need? Then we would only do the job of setting the right lastModified 🙏
I just ordered a Takeout for my mom, and it looks like it has this "new format" too 😬 I'll update when I get to download and play with it myself.
Looks like you're right! Just in the last week or so they rolled out an update that doesn't use the date folders anymore; they now bunch photos together by year instead of by day (e.g. "Photos from 2012"). This, along with more condensed metadata, cut my export from 93GB to 75GB, and the takeout process itself was way faster. However, there are some strange things I noticed...
The json files are not named exactly the same as their photos:
For the jpg file 58530_158191067526424_100000065954143_509976_49.jpg we find a json file (as .json, not .jpg.json) that is missing the last character of the id for this uploaded photo: 58530_158191067526424_100000065954143_509976_4.json
?!?!?!? I think it's probably a bug in the takeout code if I had to guess. It's consistent across what I've seen
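If the script has to cope with this, one possible lookup order is sketched below. It is only a guess based on the single example above: `find_sidecar_json` is a hypothetical helper, and the "drop the last character" fallback is an assumption that may not cover other naming variants.

```python
from pathlib import Path


def find_sidecar_json(image_path):
    """Return the json file that most likely describes `image_path`, or None."""
    image_path = Path(image_path)
    candidates = [
        image_path.with_name(image_path.name + ".json"),  # photo.jpg.json
        image_path.with_suffix(".json"),                  # photo.json
    ]
    if len(image_path.stem) > 1:
        # Assumption from the example above: same name minus its last character.
        candidates.append(image_path.with_name(image_path.stem[:-1] + ".json"))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

The last fallback can pick up another photo's json when two names differ only in their final character, so it should be treated as a last resort.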
So...bad news...as is, the takeout code doesn't work for newer takeouts and there is no more metadata.json file (at least in my second zip file...maybe it will be in my first one)
good news...this is a relatively easy fix that I can add in to my current changes. My basic idea is to treat any directory prefixed with "Photos from " as a non-album directory and we'll basically just remove any attempt to infer date from directory name.
Let me know if this makes sense and if I should just update my changes to the new format. I'll be testing with my new takeout instead.
missing the last character of the id
Probably because of #8
treat any directory prefixed with "Photos from"
Don't get too excited - it says "photos from" - get prepared for my "Zdjęcia z..." in Polish 💖😍 - luckily, they are not
I'll download my takeout and write here how it looks...
That's fair....ugh well at this point I'm not prepared to support backwards compatibility.
Now that it's divided by year we can just consider that a valid album and treat those folders the same as any other album. Though it will make the albums json huge.
Another thought is to just remove all the album processing and file moving and simply remove duplicates locally in each directory, update the exif data, remove the json files in place. I kind of like the by year albums for my use case. What do you think?
there is no more metadata.json file (at least in my second zip file...maybe it will be in my first one)
Okay nevermind, there is a metadata.json (or metadane.json for you ;) ) in the album folders, and even better, these files do not exist in the date (Photos from, Zdjęcia z..., etc...) folders. So an easy way to determine if we're dealing with an album is to do the check you mentioned above: if we don't find a json file in the folder that contains an "albumData" key, then we are dealing with a regular date folder.
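A minimal sketch of that check, assuming the metadata file keeps the `"albumData": {"title": ...}` layout shown earlier in this thread. `find_album_data` is a hypothetical name; it reads every json in the folder so that localized file names such as metadane.json or metadaten.json are found too.

```python
import json
from pathlib import Path


def find_album_data(folder):
    """Return the "albumData" dict from any json in `folder`, or None if there is none."""
    for json_path in sorted(Path(folder).glob("*.json")):
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed json cannot be the album metadata
        if isinstance(data, dict) and isinstance(data.get("albumData"), dict):
            return data["albumData"]
    return None
```

A folder is then an album when `find_album_data(folder)` is not None, and the album name can be read from its `"title"` key rather than from the folder name.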
Zdjęcia z..., etc...
By the way, I'm not sure about it. I wanted to download this now but - despite how unbelievable this sounds - the whole of Google is down 😆 - but that's probably how it's going to be - even if not, just wait till Google makes it separate in every language
No worries, if it's in any language now...but would be good to verify that you see the same with your downloads as well regarding the existence of a metadane.json for albums vs non-album directory.
If it's all good with you, I'm going to update this to follow the new standard. Are you okay with that?
I did a 'German' takeout of my photos and as far as I can tell, the folder 'Trash' is called 'Papierkorb' and the folder 'Archive' is called 'Archiv'. These are true German names. The same holds true for the 'metadata.json' files, which get translated to 'metadaten.json'.
But the yearly folders with all pictures are all called 'Photos from YYYY', where the word 'from' is not translated to German.
It seems that the 'Trash' and 'Archive' folders get translated to the native language, but the yearly folders do not.
Edit:
The folders 'Photos from YYYY', 'Archive' and 'Trash' do NOT contain any 'metadata.json' files, but the other folders do.
Thanks @JoLander!! This helps! I think we definitely have a way to determine what is an album vs a date directory moving forward. It's up to @TheLastGimbus how we use that moving forward now that Google has done a pretty decent job at condensing the massive amount of folders for date directories.
Update from me - luckily, they are actually "Photos from..." - but Google will probably treat this like a bug and "fix it" back to "Zdjęcia z..."
It's up to @TheLastGimbus how we use that
What's up to me? How we detect "albums/year folders"?
Okay nevermind, there is metadata.json for the album folders and [...] these files do not exist in the date (Photos from...) folders
Sooo - the "year folders" don't contain metadata.json - great, then use that 🎉
We could also keep "recognize folder type from name" function in place, unused, ready to switch if Google changes that/someone reports a bug - but that's just redundancy
What's up to me? How we detect "albums/year folders"?
Sorry that wasn't clear. I meant to say we need to ask you if we should just move to only support the new format and just tell people if they have an old takeout to run a new one? I think backwards compatibility is a tall order that would complicate the code.
We could also keep "recognize folder type from name" function in place, unused, ready to switch if Google changes that/someone reports a bug - but that's just redundancy.
I'd rather have one method that works now and probably for a while vs making more complexity.
But we do have backwards compatibility:
pip install google-photos...==1.2.0
:)))
No worries, you don't need to care about the old scheme