Notes for Windows-Users
lukasalexanderweber opened this issue ยท 10 comments
Hello,
first of all thank you for publishing your great work! It saved a lot of time for me.
In sake of improving open source software, I would like to point out three things I noticed when using your script.
- Wrong Hashes
In line 65, you are using os.path.join() to generate the new file string. In my case, this leads to the following string, since I'm using Windows and Backslash is Standard:
file:C:/Users/weberlu/Desktop/neuer_pfad/gross\15-0000598.png
This does not lead to an error, however, the MD5 hash is completely wrong because of that.
Changing line 65 from
source_asset_path = 'file:'+os.path.join(directory_name, asset['name'])
to
source_asset_path = 'file:'+ directory_name + '/' + asset['name']
fixes this issue.
- Extra Line between each Line
After running your script, I get a blanc line between each JSON entry:
{
"name": "Test",
"securityToken": "Test Token",
I fixed this by adding
line = line.replace("\\n","")
below
line = line.replace(old_source_directory, node_ready_new_source_directory)
- Remove click functionality
The click functionality is not stable for Windows users. Just pass the new_source_directory and target_directory (with Slashes, not Backslashes!) as string parameter into the main() function!
Thats all from me and maybe helps another windows user.
Keep up the good work!
Lukas
All this is super helpful, thank you the original code and the windows fixes. I made the first two fixes easily, but I think I'm still stuck on the click issue, 3. I tried:
C:\Users\booby2\update-vott-assets>python update_vott_assets.py G:/Lehua/TrainLabeledMele G:/Lehua/ImagesTrain
And got the error msg below.
The other thing (while I have you) is that I didn't just move the folder within the same drive, I changed the drive letter (from E to G). Should I start with E for one of these directories, even though E no longer is mapped in this machine? Should I just re-map G to E? Or can I get this code to work by running this script properly. Thanks so much for the help!
Running Windows Data Science VM on Azure
Traceback (most recent call last):
File "update_vott_assets.py", line 234, in
main()
File "c:\Miniconda\lib\site-packages\click\core.py", line 829, in call
return self.main(*args, **kwargs)
File "c:\Miniconda\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\Miniconda\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\Miniconda\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "update_vott_assets.py", line 164, in main
vott_dict = json.load(f)
File "c:\Miniconda\lib\json_init_.py", line 293, in load
return loads(fp.read(),
File "c:\Miniconda\lib\json_init_.py", line 357, in loads
return _default_decoder.decode(s)
File "c:\Miniconda\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "c:\Miniconda\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Add this at the end of the file:
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--newsource",
help="the path to the directory that contains the images that were originally tagged")
parser.add_argument("-t", "--target",
help="the path to the directory that contains all the -asset.json files and the .vott file")
args = parser.parse_args()
update_md5_hash_id(args.newsource.replace("\\", "/"), args.target.replace("\\", "/"))
you can then run the script using
python update_vott_asset.py -s G:\Lehua\ImagesTrain -t G:\Lehua\TrainLabeledMele
(I guess your images are on ImagesTrain)
@cnrmck I would love to add a pull request for the Windows-Sollution, could you grant me the rights to collaborate?
Till then here is the full Script:
import fileinput
import hashlib
import json
import glob
import sys
import os
import re
import argparse
"""
# TODO: the provider, found in the vott file encryts the connection information for local filestystem.
"providerOptions": {
"encrypted": "eyJjaXBoZXJ0ZXh0IjoiM2RkMDM5Y2Y4ZGJjYjk1MzQ3ZTczMGRlYTZmNzg2MjdhZjRhN2E0MWNiNGRjNWViNDgyZWI5NzRmNDE0YWNjNDM5MmU1MGU0NzNhMzQ1MjUyYWM4YTIxM2YzODljOTEzYjhhYTkyNDhlMjIzMGNiZjQyZGM2ZjA4ZmM5OWY5MTMwODg1MjE3ZjQ1MmI3YmMyMmZhOTQ3ZTczODlmOTljN2E5MDkxOTA4MGM0MzcyMjhjYzViZGMzYWYwMTA4YjQ2ODJhZTg0ZmFmZjUyNTU0NzVlZDRkYjY3MGQ0ZGVkYjQ4YzdkOTJiN2ViNmJiZTI0OTIwMjg3ZTNiZmIzMzM2YzBmNjhlYzgzMjhhZWI5ZTBkZDExZTJhN2Y5MjRjYjEyNjczYmM1Nzk5NGQzNzIwZTdiMGZlY2M3MGJjOTlkMTgiLCJpdiI6IjM0OTU5NDBmOTJmMjE4MDJjMGM2Y2M0ZDM3MzlmYTYwNTllMTU2NGU2N2E3ZWI2NCJ9"
},
Unencrypt this, transform it (replace with the new path), then reencrypt. This way your source and destination directories will be automatically fixed.
"""
def get_single_file_with_suffix(directory, suffix):
"""
Return the file in directory with the specified suffix.
If none is found, error.
If more than one found, error.
Args:
directory: the path (relative or full) to the directory with the file in it.
suffix: the suffix of the file *including* the '.' (e.g. '.jpg' not 'jpg').
suffix: can also be a list of suffixes to try (e.g. ['.jpg', '.jpeg'])
"""
candidate_files = []
if type(suffix) is list:
# rename suffix to suffixes to make list comprehension more readable
suffixes = suffix
# this monster gets all the filenames that end in the suffix found in suffixes and flattens into a single list
candidate_files = [filename for suffix in suffixes for filename in \
glob.glob(os.path.join(directory, '*{}'.format(suffix)))]
elif type(suffix) is str:
# if only a single string is passed, the files that end in it
candidate_files = glob.glob(os.path.join(directory, '*{}'.format(suffix)))
if len(candidate_files) > 1:
raise Exception("Should have no more than one '{}' file in {}".format(suffix, directory))
elif len(candidate_files) == 0:
raise Exception("No file found with suffix '{}' in {}".format(suffix, directory))
else:
final_file = candidate_files[0]
return final_file
def map_old_vott_path_and_id_to_new(vott_dict, directory_name):
"""
Return a mapping of the old ids to the new ids
new ids are the md5 hash of the full path (including %20 as whitespace) to the source asset
"""
# initialize the dictionary to contain the mapping
old_to_new_ids = {}
# iterate through all the assets
for asset in vott_dict['assets'].values():
# get what will be the new path of the source asset
source_asset_path = 'file:'+ directory_name + '/' + asset['name']
# map the old id to the hexdigest of the full path to the source asset
old_to_new_ids[asset['id']] = hashlib.md5(source_asset_path.encode('utf-8')).hexdigest()
return old_to_new_ids
def replace_old_contents(target_directory, old_to_new_ids, old_source_directory,
node_ready_new_source_directory):
"""
Replace the contents of .vott and .json files in the target directory and its subdirectories
with the new asset ids and the new source directory
Essentially, go line by line through all files and replace the source directory from the old
machine to the one that will be used with this machine.
Args:
target_directory (`str`): path to the target directory
old_to_new_ids (:obj:`dict`): dictionary mapping from old asset id to new id
old_source_directory (`str`): path to the old source directory
node_ready_new_source_directory (`str`): path to new source directory, made ready for node
Return:
None
"""
# get the full path of all vott and json files in the target directory and subdirectories
files = [f for suffix in ('**/*.vott', '**/*.json') for f in glob.glob(os.path.join(target_directory, suffix), recursive=True) if os.path.isfile(f) == True]
# open an inplace fileinput so that stdout of this script becomes the input to the provided files
for byteline in fileinput.input(files=files, inplace=True, mode='rb'):
try:
# has to be opened in byte mode (to prevent unicode decode errors) then converted to a string
line = byteline.decode()
# replace every instance of the old id in every file with the proper new id
for id_pair in old_to_new_ids.items():
# the old asset id
old_id = id_pair[0]
# the new asset id
new_id = id_pair[1]
# replace the old id with the new one in this line
line = line.replace(old_id, new_id)
# replace the old directory name with the new one in this line
line = line.replace(old_source_directory, node_ready_new_source_directory)
line = line.replace("\\n","")
# the fileinput stream is open (thanks to inplace=True) so everything that goes to stdout
# goes into the original file (idgi, just works)
try:
sys.stdout.write(line)
except:
sys.stdout.write(line.encode('utf-8'))
except UnicodeDecodeError as e:
pass
def update_md5_hash_id(new_source_directory, target_directory):
"""
This script solves the problem of transferring assets labeled with VoTT from one machine to
another. Important: File paths with slash, not backslash!
Arguments:
target_directory -- the path to the directory that contains all the -asset.json files and the
.vott file
new_source_directory -- the path to the directory that contains the images that were
originally tagged (not yet tested with videos)
Note that the new_source_directory must contain ALL of the assets that were originally present
in the labeling process.
\b
Purpose:
The problem arises due to VoTT using the md5 hash of the absolute path of the source asset
(image or video) as the asset_id. This id is used whenever the asset is looked up, so transferring
assets and VoTT files from one machine to another breaks VoTT's ability to recognize the labels
because the filepath on two different machines look different (different usernames are enough to
cause the problem).
Running this script solves the problem by creating a new asset id for each asset in the provided
new_source_directory and updating the contents of the target_directory files to reference those
asset ids.
"""
# node uses %20 in place of spaces
node_ready_new_source_directory = re.sub(' ', '%20', new_source_directory)
# get the vott file that references all the asset files
vott_file = get_single_file_with_suffix(target_directory, '.vott')
# get a dictionary representation of the vott file
with open(vott_file, 'r') as f:
vott_dict = json.load(f)
# get the value of the 'path' key out of the the vott dictionary (a string referencing the old file)
path_value = list(vott_dict['assets'].values())[0]['path']
# get the source directory of the old files (to substitute with the new one)
# e.g. keep the '/home/dir' part of 'file:/home/dir/file.txt'
old_source_directory = os.path.split(path_value[len('file:'):])[0]
# get a dictionary that maps the old asset ids to the new ones
old_to_new_ids = map_old_vott_path_and_id_to_new(vott_dict, node_ready_new_source_directory)
print("Replacing old asset ids in file names with the new asset ids")
for old_asset_path in glob.glob(os.path.join(target_directory, '*-asset.json')):
# get the basename of the old_asset
old_asset_file = os.path.basename(old_asset_path)
# get the asset id out of the asset.json file
# i.e. the ba4eb9e76e2148bb7dc5b82bdccb7dbc in ba4eb9e76e2148bb7dc5b82bdccb7dbc-asset.json
old_asset_id = old_asset_file.split('-')[0]
# look up the new id to use for this file
new_id = old_to_new_ids[old_asset_id]
# rename the file so that it has the new asset id in its name, replacing the old one
os.rename(old_asset_path, os.path.join(target_directory, new_id+'-asset.json'))
print("Replacing old paths and asset ids in the files themselves, this may take a while.")
replace_old_contents(target_directory, old_to_new_ids, old_source_directory,
node_ready_new_source_directory)
# some variables used in the final instructions
source_connection = vott_dict['sourceConnection']['name']
target_connection = vott_dict['targetConnection']['name']
security_token_name = vott_dict['securityToken']
final_instructions = '''
Done! Only a couple remaining steps:
1. Open VoTT
2. Click Home then click Open Local Project
3. Navigate to '{target_directory}'
4. Open the '{vott_file}' file. If it opens without error, you're done! Otherwise:
- You get Error loading project file: You need to add the right security token
1. Click Settings (the gear icon)
2. Ensure you have a listing for '{security_token}' and the right key
(I can't help you there, ask the person that originally labeled these assets)
3. Try loading the '{vott_file}' file again.
and/or
- You get an unknown error or no images show up: You need to update your Connections
1. Click the Plug icon
2. Update '{}' by pointing its connection to:
'{}'
3. Update '{}' by pointing its connection to:
'{target_directory}'
Make sure to hit the Save button after editing.
4. Try clicking the Bookmark button to reload the '{vott_file}' file again. It should now work!
'''.format(source_connection, new_source_directory,
target_connection, security_token = security_token_name,
target_directory = target_directory, vott_file = os.path.basename(vott_file))
print(final_instructions)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--newsource",
help="the path to the directory that contains the images that were originally tagged")
parser.add_argument("-t", "--target",
help="the path to the directory that contains all the -asset.json files and the .vott file")
args = parser.parse_args()
update_md5_hash_id(args.newsource.replace("\\", "/"), args.target.replace("\\", "/"))
I actually got to this point with the old code, but now am getting key errors, which is strange, because when I search for that key it's in the json assets, and links to a photo that is in the image (training) directory. I'll test your new code now, though @lukasalexanderweber - - I notice they are relying on click, though, so do you think this is the click problem you described earlier?
Replacing old asset ids in file names with the new asset ids
Traceback (most recent call last):
File "C:\Users\booby2\update-vott-assets\update_vott_assets.py", line 234, in
main()
File "c:\Miniconda\lib\site-packages\click\core.py", line 829, in call
return self.main(*args, **kwargs)
File "c:\Miniconda\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\Miniconda\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\Miniconda\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Users\booby2\update-vott-assets\update_vott_assets.py", line 186, in main
new_id = old_to_new_ids[old_asset_id]
KeyError: '0001361e1cd054f0259269892633011b'
same problem, but thank you for fixing the dependence on click. So strange because the asset is in both places. There was a space in the photo id, which got translated to %20 but I read through the code and thought that took care of it. Any ideas are welcome...
Replacing old asset ids in file names with the new asset ids
Traceback (most recent call last):
File "C:\Users\booby2\update-vott-assets\update_vott_assets2.py", line 240, in
update_md5_hash_id(args.newsource.replace("\", "/"), args.target.replace("\", "/"))
File "C:\Users\booby2\update-vott-assets\update_vott_assets2.py", line 187, in update_md5_hash_id
new_id = old_to_new_ids[old_asset_id]
KeyError: '0001361e1cd054f0259269892633011b'
@lukasalexanderweber Sorry I missed this. Yes, I've added you as a collaborator. Thank you for your contribution. I'll respond to the rest at a later date I don't have time at the moment.
@lukasalexanderweber As I think about this more, maybe you should fork this repository and then I can just point windows users to your repo. That way we don't have to maintain a script compatible with both Windows and Unix, we can just each maintain our own. It's a pretty simple script, so no real benefit would be gained by trying to support both platforms.
@lukasalexanderweber As I think about this more, maybe you should fork this repository and then I can just point windows users to your repo. That way we don't have to maintain a script compatible with both Windows and Unix, we can just each maintain our own. It's a pretty simple script, so no real benefit would be gained by trying to support both platforms.
Good Idea, I forked it into https://github.com/lukasalexanderweber/update-vott-assets
same problem, but thank you for fixing the dependence on click. So strange because the asset is in both places. There was a space in the photo id, which got translated to %20 but I read through the code and thought that took care of it. Any ideas are welcome...
Replacing old asset ids in file names with the new asset ids
Traceback (most recent call last):
File "C:\Users\booby2\update-vott-assets\update_vott_assets2.py", line 240, in
update_md5_hash_id(args.newsource.replace("", "/"), args.target.replace("", "/"))
File "C:\Users\booby2\update-vott-assets\update_vott_assets2.py", line 187, in update_md5_hash_id
new_id = old_to_new_ids[old_asset_id]
KeyError: '0001361e1cd054f0259269892633011b'
I have in mind I tackeled this problem, too.. But sadly I can't remember how I solved it :( If I find a sollution in future work I'll let you know.