Getting "ValueError: Dataset with name already registered" errors when trying to run PyTest
> OK, now I get 22 errors, but _from pytest_ we are making progress!

Originally posted by @cleong110 in #53 (comment)
I thought I had this figured out using the following procedure:
```shell
conda create -n sign_language_datasets_source pip python=3.10  # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
```
But when I follow the above steps, creating a new conda env and installing from source, I still get this.
OK, seriously, how does all this registering stuff work?
Having a look at the original `tensorflow_datasets` library, we've got:
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/object_detection/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/object_detection/lvis/__init__.py
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/datasets/imagenet2012/__init__.py
- https://tensorflow.google.cn/datasets/common_gotchas?hl=en#new_datasets_should_be_self-contained_in_a_folder explains that new datasets should be self-contained in a folder
- https://www.tensorflow.org/datasets/add_dataset#add_an_import_for_registration explains how registration via import works
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/registered.py is where the magic happens
- tensorflow/datasets#552 also discusses how to unregister a dataset
Does it occur in Colab?
Yes: https://colab.research.google.com/drive/160LWW9pamKpkcb0YzQrHgamRNfqNpFsx?usp=sharing
So it's not just my machine.
I mean, the issue stack traces show:

```
sign_language_datasets\datasets\__init__.py:1: in <module>
    from .aslg_pc12 import AslgPc12
sign_language_datasets\datasets\aslg_pc12\__init__.py:3: in <module>
    from .aslg_pc12 import AslgPc12
```
So maybe it shouldn't be in both places?
OK, if I...
- comment out everything in the https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py other than SignDatasetConfig
- remember to pip install from source again
Then I finally get actual pytest tests running again.
So my https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py looks like:

```python
from .config import SignDatasetConfig

# from .aslg_pc12 import AslgPc12
# from .asl_lex import AslLex
# from .autsl import AUTSL
# from .chicagofswild import ChicagoFSWild
# from .dgs_types import DgsTypes
# from .dgs_corpus import DgsCorpus
# from .dicta_sign import DictaSign
# from .how2sign import How2Sign
# from .mediapi_skel import MediapiSkel
# from .rwth_phoenix2014_t import RWTHPhoenix2014T
# from .sign2mint import Sign2MINT
# from .signbank import SignBank
# from .signtyp import SignTyp
# from .signsuisse import SignSuisse
# from .sign_wordnet import SignWordnet
# from .swojs_glossario import SwojsGlossario
# from .wlasl import Wlasl
# from .ngt_corpus import NGTCorpus
# from .bsl_corpus import BslCorpus
# from .wmt_slt import WMTSLT
```
and then I run:

```shell
python -m pip install . && pytest .
```
hmmmm the problem with commenting all datasets from this file, is that when we want to use a dataset, we use the following import, which registers all datasets
import sign_language_datasets.datasets
I'm ok with not doing this, and instead, whenever using a dataset, for example autsl
, having to import
import sign_language_datasets.datasets.autsl
Though it might cause confusion and users not finding the right dataset.
What do you think?
Well, I thought I'd go back and see what tfds does to handle all this, and I noticed a few things.

- Their example code doesn't have you import so specifically; it's just one top-level import and a `.load` method. So if we follow their practice it'd be more intuitive for people who are used to tfds:

```python
import tensorflow_datasets as tfds
import tensorflow as tf

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
```
- They structure their repo and `__init__.py` files a bit differently. Most notably they have:
  - none in the top level;
  - one in the `tensorflow_datasets` subdirectory, which imports the subcategories/subfolders such as `image_classification`, with a try-except built in;
  - then the `__init__.py` in the subcategory folder finally imports the actual datasets, most of which don't have their own subfolder, just a `.py` file...
  - but some do, and those also have an `__init__.py` with a full import path, which also gets imported one level up in the category folder.
The other thing I tested was whether I could get pytest running on their repo, and yes I was able to, without this import issue.
So translating it to `sign_language_datasets`, if we wanted the layout to be identical we would:

- remove this: https://github.com/sign-language-processing/datasets/blob/master/__init__.py
- keep this, adding in the try-except? https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/__init__.py
- change the imports in `datasets` (https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py) to use full paths, e.g. `sign_language_datasets.datasets.aslg_pc12`, to match the pattern of the subcategory inits like `image_classification`
- change the inits in individual dataset folders to be full paths as well, e.g. https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/wmt_slt/__init__.py should match the pattern from https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/cifar10_n/__init__.py
I guess I'll give it a whirl and see?
Hmmm, actually, just removing https://github.com/sign-language-processing/datasets/blob/master/__init__.py might be sufficient. I did:

```shell
git clone https://github.com/sign-language-processing/datasets.git
cd datasets
conda create -n sign_language_datasets_source pip python=3.10
conda activate sign_language_datasets_source
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .  # gives me the import errors
rm __init__.py
pytest .  # it works
```
Made a .py file to load a dataset, and it seems to work in the expected way:

`load_phoenix.py`:

```python
import itertools

import tensorflow_datasets as tfds

import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig

if __name__ == "__main__":
    config = SignDatasetConfig(name="only-annotations", version="3.0.0", include_video=False)
    rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

    for datum in itertools.islice(rwth_phoenix2014_t["train"], 0, 10):
        print(datum['gloss'].numpy().decode('utf-8'))
        print(datum['text'].numpy().decode('utf-8'))
        print()
```
```shell
pip install .  # make sure I'm using the edited version with the init gone
python load_phoenix.py  # it loads
```
So I guess maybe that `__init__.py` at the top of the repo was causing the double-registration issue? I'm still not fully confident I understand the process of what's going on, but removing it seems to make things work without causing any issues I know of.
If you make a PR, I could see exactly what you did there. I am fine with following tfds.
Made a pull request!
Resolved, pull request merged.