sign-language-processing/datasets

Getting "ValueError: Dataset with name already registered" errors when trying to run PyTest

Closed this issue · 23 comments

          OK, now I get 22 errors, but _from pytest_ we are making progress! 

image

Originally posted by @cleong110 in #53 (comment)

I thought I had this figured out using the following procedure:

conda create -n sign_language_datasets_source pip python=3.10 # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source 
# navigate to the repo
git pull # to make sure it's up to date
python -m pip install . #python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .

But when I follow the above steps, creating a new conda env and installing from source, I still get this.

OK, seriously, how does all this registering stuff work?

tensorflow/datasets#552 also talks about how to unregister

I mean the issue stack traces show

sign_language_datasets\datasets\__init__.py:1: in <module>
    from .aslg_pc12 import AslgPc12
sign_language_datasets\datasets\aslg_pc12\__init__.py:3: in <module>
    from .aslg_pc12 import AslgPc12

So maybe it shouldn't be in both places?

OK, If I...

  1. comment out everything in the https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py other than SignDatasetConfig
  2. remember to pip install from source again
    Then I finally get actual PyTest Tests running again.

So my https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py looks like

from .config import SignDatasetConfig
# from .aslg_pc12 import AslgPc12
# from .asl_lex import AslLex
# from .autsl import AUTSL
# from .chicagofswild import ChicagoFSWild
# from .dgs_types import DgsTypes
# from .dgs_corpus import DgsCorpus
# from .dicta_sign import DictaSign
# from .how2sign import How2Sign
# from .mediapi_skel import MediapiSkel
# from .rwth_phoenix2014_t import RWTHPhoenix2014T
# from .sign2mint import Sign2MINT
# from .signbank import SignBank
# from .signtyp import SignTyp
# from .signsuisse import SignSuisse
# from .sign_wordnet import SignWordnet
# from .swojs_glossario import SwojsGlossario
# from .wlasl import Wlasl
# from .ngt_corpus import NGTCorpus
# from .bsl_corpus import BslCorpus
# from .wmt_slt import WMTSLT

and then I run

python -m pip install . && pytest .

hmmmm the problem with commenting all datasets from this file, is that when we want to use a dataset, we use the following import, which registers all datasets

import sign_language_datasets.datasets

I'm ok with not doing this, and instead, whenever using a dataset, for example autsl, having to import

import sign_language_datasets.datasets.autsl

Though it might cause confusion and users not finding the right dataset.

What do you think?

Well, I thought I'd go back and see what tfds does to handle all this, and I have noticed a few things.

  1. Their example code doesn't have you import so specifically, it's just one top-level import and a .load method. So if we follow their practice it'd be more intuitive for people who are used to tfds.
import tensorflow_datasets as tfds
import tensorflow as tf

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
  1. They structure their repo and __init__.py files a bit differently. Most notably they have:

image

The other thing I tested was whether I could get pytest running on their repo, and yes I was able to, without this import issue.

I guess I'll give it a whirl and see?

hmmm, actually, just removing https://github.com/sign-language-processing/datasets/blob/master/__init__.py might be sufficient. I did

git clone https://github.com/sign-language-processing/datasets.git
cd datasets
conda create -n sign_language_datasets_source pip python=3.10 
conda activate sign_language_datasets_source 
git pull # to make sure it's up to date
python -m pip install . #python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest . # Gives me the import errors
rm __init__.py
pytest . # it works

Made a .py file to load a dataset, and it seems to work in the expected way:

load_phoenix.py:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig

import itertools
     
if __name__ == "__main__":
    config = SignDatasetConfig(name="only-annotations", version="3.0.0", include_video=False)
    rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

    for datum in itertools.islice(rwth_phoenix2014_t["train"], 0, 10):
        print(datum['gloss'].numpy().decode('utf-8'))
        print(datum['text'].numpy().decode('utf-8'))
        print()
pip install . # make sure I'm using the edited version with the init gone
python load_phoenix.py # it loads

So I guess maybe that __init__.py at the top of the repo was causing the double-registration issue? I'm still not fully confident I understand the process of what's going on, but removing it seems to make things work without causing any issues I know of

If you make a PR I could see exactly what you did there. I am fine with following tfds

Made a pull request!

resolved, pull request merged