sign-language-processing/datasets

bug(signsuisse): missing/invalid fields

I ran a more thorough check on the data and found some further missing/invalid fields.

Code:

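import os

import pandas as pd

# `dataset` and `sign_language_lookup_table` are assumed to be defined earlier.
data = []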
for datum in dataset["train"]:
# for datum in itertools.islice(dataset["train"], 0, 10):
    current = {
        'id': datum['id'].numpy().decode('utf-8'),
        'name': datum['name'].numpy().decode('utf-8'),
        'spokenLanguage': datum['spokenLanguage'].numpy().decode('utf-8'),
        'signedLanguage': sign_language_lookup_table[datum['signedLanguage'].numpy().decode('utf-8')],
        'category': datum['category'].numpy().decode('utf-8'),
        'definition': datum['definition'].numpy().decode('utf-8'),
        'paraphrase': datum['paraphrase'].numpy().decode('utf-8'),
        'example': datum['exampleText'].numpy().decode('utf-8'),
        'url': datum['url'].numpy().decode('utf-8'), 
        'video': datum['video'].numpy().decode('utf-8'),
        'poseMediapipe': datum['pose']['path'].numpy().decode('utf-8'),
        'exampleVideo': datum['exampleVideo'].numpy().decode('utf-8'),
        'examplePoseMediapipe': datum['examplePose']['path'].numpy().decode('utf-8'),
    }
    data.append(current)

df = pd.DataFrame.from_records(data, index='id')

print('Check fields in metafile:')
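# 'id' is the DataFrame index, which to_dict('records') drops; reset_index() restores it as a column.
for item in df.reset_index().to_dict('records'):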
    for key, value in item.items():
        if key not in ['video', 'poseMediapipe', 'exampleVideo', 'examplePoseMediapipe']:
            if not value or value == 'empty':
                print(f"id={item['id']}, name={item['name']} has empty {key}")

        if key in ['video', 'poseMediapipe']:
            if not os.path.exists(value):
                print(f"id={item['id']}, name={item['name']} has invalid {key} path (unexpected!)")

        if key in ['exampleVideo', 'examplePoseMediapipe']:
            if not os.path.exists(value):
                if item['example']:
                    print(f"id={item['id']}, name={item['name']} has invalid {key} path (unexpected!)")
                else:
                    print(f"id={item['id']}, name={item['name']} has invalid {key} path (expected)")
print('---------------------------------------')

and the log file:
signsuisse.log

Looking at the log file:

  • There are no invalid video paths, but there are 1041 invalid poseMediapipe paths, all of them unexpected.
  • There are 16 empty examples, but 1446 invalid examplePoseMediapipe paths. Only 16 of those are expected (no example exists), which leaves 1430 unexpected.
  • For example: the missing examplePoseMediapipe for LE HAVRE is expected, since that entry has no example, while the missing examplePoseMediapipe for TRIPLE is unexpected.
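
For anyone who wants to reproduce these counts from the log, here is a minimal sketch that tallies messages per field. It assumes each log line has exactly the format printed by the check script above; `tally_log` is a hypothetical helper name, and the ids in `sample` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line formats, taken from the print statements in the check script:
#   id=<id>, name=<name> has invalid <key> path (expected)
#   id=<id>, name=<name> has invalid <key> path (unexpected!)
#   id=<id>, name=<name> has empty <key>
INVALID_RE = re.compile(r"has invalid (\w+) path \((expected|unexpected)!?\)\s*$")
EMPTY_RE = re.compile(r"has empty (\w+)\s*$")


def tally_log(lines):
    """Count log messages per (kind, field) pair."""
    counts = Counter()
    for line in lines:
        invalid = INVALID_RE.search(line)
        if invalid:
            field, status = invalid.groups()
            counts[(f"invalid ({status})", field)] += 1
            continue
        empty = EMPTY_RE.search(line)
        if empty:
            counts[("empty", empty.group(1))] += 1
    return counts


# Illustrative lines only; the ids here are placeholders, not real data.
sample = [
    "id=1, name=LE HAVRE has empty example",
    "id=1, name=LE HAVRE has invalid examplePoseMediapipe path (expected)",
    "id=2, name=TRIPLE has invalid examplePoseMediapipe path (unexpected!)",
]

for (kind, field), n in sorted(tally_log(sample).items()):
    print(f"{kind:22} {field:22} {n}")
```

To run it on the real log, pass the open file instead of `sample`, e.g. `with open("signsuisse.log") as f: counts = tally_log(f)`.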

So a lot of Mediapipe pose estimations are missing. @AmitMY could you check whether you see the same on your side? I'm not sure whether this is a download problem on my end.