Getting "ValueError: Dataset with name already registered" errors when trying to run PyTest
> OK, now I get 22 errors, but _from pytest_ we are making progress!

Originally posted by @cleong110 in #53 (comment)
I thought I had this figured out using the following procedure:
```shell
conda create -n sign_language_datasets_source pip python=3.10  # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
```
But when I follow the above steps, creating a new conda env and installing from source, I still get this.
OK, seriously, how does all this registering stuff work?
Having a look at the original `tensorflow_datasets` library, we've got:
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/object_detection/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/object_detection/lvis/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/datasets/imagenet2012/__init__.py
- https://tensorflow.google.cn/datasets/common_gotchas?hl=en#new_datasets_should_be_self-contained_in_a_folder explains that new datasets should be self-contained in a folder
- https://www.tensorflow.org/datasets/add_dataset#add_an_import_for_registration explains how registration via import works
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/registered.py is where the magic happens
- tensorflow/datasets#552 also discusses how to unregister a dataset
Does it occur in Colab?
Yes: https://colab.research.google.com/drive/160LWW9pamKpkcb0YzQrHgamRNfqNpFsx?usp=sharing
So it's not just my machine.
I mean, the issue stack traces show:

```
sign_language_datasets\datasets\__init__.py:1: in <module>
    from .aslg_pc12 import AslgPc12
sign_language_datasets\datasets\aslg_pc12\__init__.py:3: in <module>
    from .aslg_pc12 import AslgPc12
```
So maybe it shouldn't be in both places?
OK, if I...
- comment out everything in the https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py other than SignDatasetConfig
- remember to pip install from source again
Then I finally get actual pytest tests running again.
So my https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py looks like:

```python
from .config import SignDatasetConfig

# from .aslg_pc12 import AslgPc12
# from .asl_lex import AslLex
# from .autsl import AUTSL
# from .chicagofswild import ChicagoFSWild
# from .dgs_types import DgsTypes
# from .dgs_corpus import DgsCorpus
# from .dicta_sign import DictaSign
# from .how2sign import How2Sign
# from .mediapi_skel import MediapiSkel
# from .rwth_phoenix2014_t import RWTHPhoenix2014T
# from .sign2mint import Sign2MINT
# from .signbank import SignBank
# from .signtyp import SignTyp
# from .signsuisse import SignSuisse
# from .sign_wordnet import SignWordnet
# from .swojs_glossario import SwojsGlossario
# from .wlasl import Wlasl
# from .ngt_corpus import NGTCorpus
# from .bsl_corpus import BslCorpus
# from .wmt_slt import WMTSLT
```
and then I run:

```shell
python -m pip install . && pytest .
```
hmmmm the problem with commenting all datasets from this file, is that when we want to use a dataset, we use the following import, which registers all datasets
import sign_language_datasets.datasets
I'm ok with not doing this, and instead, whenever using a dataset, for example autsl
, having to import
import sign_language_datasets.datasets.autsl
Though it might cause confusion and users not finding the right dataset.
What do you think?
Well, I thought I'd go back and see what tfds does to handle all this, and I noticed a few things.

- Their example code doesn't have you import so specifically; it's just one top-level import and a `.load` method. So if we follow their practice it'd be more intuitive for people who are used to tfds:

```python
import tensorflow_datasets as tfds
import tensorflow as tf

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
```
- They structure their repo and `__init__.py` files a bit differently. Most notably they have:
  - none in the top level;
  - one in the `tensorflow_datasets` subdirectory, which imports the subcategories/subfolders such as `image_classification`, with a try-except built in;
  - then the `__init__.py` in the subcategory folder finally imports the actual datasets, most of which don't have their own subfolder, just a `.py` file...
  - but some do, and those also have an `__init__.py` with a full import path, which also gets imported one level up in the category folder.
The other thing I tested was whether I could get pytest running on their repo, and yes I was able to, without this import issue.
So translating it to `sign_language_datasets`, if we wanted the layout to be identical we would:

- remove this: https://github.com/sign-language-processing/datasets/blob/master/__init__.py
- keep this, adding in the try-except? https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/__init__.py
- change the imports in `datasets` (https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py) to use full paths, e.g. `sign_language_datasets.datasets.aslg_pc12`, to match the pattern of the subcategory inits like `image_classification`
- change the inits in individual dataset folders to be full paths as well, e.g. https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/wmt_slt/__init__.py should match the pattern from https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/cifar10_n/__init__.py
I guess I'll give it a whirl and see?
Hmmm, actually, just removing https://github.com/sign-language-processing/datasets/blob/master/__init__.py might be sufficient. I did:

```shell
git clone https://github.com/sign-language-processing/datasets.git
cd datasets
conda create -n sign_language_datasets_source pip python=3.10
conda activate sign_language_datasets_source
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .  # gives me the import errors
rm __init__.py
pytest .  # it works
```
Made a .py file to load a dataset, and it seems to work in the expected way:

`load_phoenix.py`:

```python
import itertools

import tensorflow_datasets as tfds

import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig

if __name__ == "__main__":
    config = SignDatasetConfig(name="only-annotations", version="3.0.0", include_video=False)
    rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

    for datum in itertools.islice(rwth_phoenix2014_t["train"], 0, 10):
        print(datum['gloss'].numpy().decode('utf-8'))
        print(datum['text'].numpy().decode('utf-8'))
        print()
```
```shell
pip install .  # make sure I'm using the edited version with the init gone
python load_phoenix.py  # it loads
```
So I guess maybe that `__init__.py` at the top of the repo was causing the double-registration issue? I'm still not fully confident I understand the process of what's going on, but removing it seems to make things work without causing any issues I know of.
If you make a PR, I could see exactly what you did there. I am fine with following tfds.
Made a pull request!
Resolved, pull request merged.