ServiceNow/picard

Using the provided eval code on different dataset

Closed this issue · 20 comments

Hey,

I have tried running "make eval" and was able to retrieve numbers on Spider-Dev.

Now, I would like to use the provided T5+PICARD model for evaluation on other datasets such as Spider-DK or Spider-Realistic. Is there an easy and convenient way to use these datasets for eval only, e.g., via a command-line argument?

Thanks

Hi @salokr,
I'm afraid there isn't a code path for these datasets. Contributions are welcome. You'd have to make a copy of https://github.com/ElementAI/picard/blob/main/seq2seq/datasets/spider/spider.py, e.g. https://github.com/ElementAI/picard/blob/main/seq2seq/datasets/spider-realistic/spider-realistic.py, and then change it so that it downloads the dataset and pre-processes it correctly. You also need to add some straightforward code here: https://github.com/ElementAI/picard/blob/e37020b6eee18bff865d9d2ba852bd636f3ed777/seq2seq/utils/dataset_loader.py#L86.
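For illustration, the change in `dataset_loader.py` amounts to dispatching on the dataset name to the right loading script. A minimal, hypothetical sketch (the names `DATASET_PATHS` and `resolve_dataset` are mine, not from the repo; the real code passes the resolved path to `datasets.load_dataset`):

```python
# Hypothetical sketch of the dataset dispatch in seq2seq/utils/dataset_loader.py.
# Adding a dataset means adding a branch (here: a dict entry) for its name,
# pointing at the copied-and-adapted loading script.
DATASET_PATHS = {
    "spider": "./seq2seq/datasets/spider",
    # new entry for the adapted loading script:
    "spider_realistic": "./seq2seq/datasets/spider-realistic",
}

def resolve_dataset(name: str) -> str:
    """Return the loading-script path for a dataset name."""
    try:
        return DATASET_PATHS[name]
    except KeyError:
        raise NotImplementedError(f"dataset {name!r} is not supported")

print(resolve_dataset("spider_realistic"))  # ./seq2seq/datasets/spider-realistic
```

The actual function also wires up the matching pre-processing and metric code, so a real contribution would touch those spots as well.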
Good luck, Torsten

see also #59 #58 #57

Thank you for this @tscholak .

I have one more query: can I avoid using the Docker images altogether and use the code as is? Is there a way to do this?

I will train T5 on my own dataset and then follow the steps to use PICARD. I tried installing the packages from the ".toml" file using "python -m pip install .", but I get an error that "PICARD is not a package", so the installation halts every time.

```
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... error
ERROR: Command errored out with exit status 1:
  command: /projects/salokr/envs/picardEnv/bin/python /projects/salokr/envs/picardEnv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp4hwpkbkp
  cwd: /tmp/pip-req-build-j09ifpxd
  Complete output (20 lines):
  Traceback (most recent call last):
    File "/projects/salokr/envs/picardEnv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
      main()
    File "/projects/salokr/envs/picardEnv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/projects/salokr/envs/picardEnv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 133, in prepare_metadata_for_build_wheel
      return hook(metadata_directory, config_settings)
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/api.py", line 44, in prepare_metadata_for_build_wheel
      builder = WheelBuilder(poetry)
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/builders/wheel.py", line 57, in __init__
      super(WheelBuilder, self).__init__(poetry, executable=executable)
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/builders/builder.py", line 85, in __init__
      self._module = Module(
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/utils/module.py", line 73, in __init__
      PackageInclude(
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/utils/package_include.py", line 22, in __init__
      self.check_elements()
    File "/tmp/pip-build-env-_ju152to/overlay/lib/python3.8/site-packages/poetry/core/masonry/utils/package_include.py", line 80, in check_elements
      raise ValueError("{} is not a package.".format(root.name))

  ValueError: picard is not a package.
```

Saurabh

Hi @salokr,
Yes, you can train your model with just the Python code and dependencies. The repository uses Poetry; running `poetry install` will install the Python dependencies. That should not give you an error like the one above.
Contributions for additional datasets are welcome!

Hi @tscholak, thank you for the response. I tried "poetry install", but I get the same "picard is not a package" error. Attaching the screenshot for your reference:

[screenshot]

> Contributions for additional datasets are welcome!
I can give it a try but with no promises :)

Saurabh

🤔 perhaps `poetry install --no-root` will work. This should only install dependencies.

I think the packages were installed in the last run of poetry install, because when I tried running `poetry install --no-root`, I got:

[screenshot]

Hey @salokr,
Thanks for sharing these numbers, they are quite interesting. Is there any code you could share? There is an open issue and call for help with implementing additional dataset code. It would be amazing if you were to contribute by opening a pull request. 😊

thank you @salokr!

Hi @tscholak,

Quick question:
You're probably already aware of this, but would you be interested in extending support for your code to be used with Singularity as well, instead of only Docker? If so, I can open another pull request adding a temporary ReadMe file; you can take a look, and if you're interested we can merge it.

Why Singularity? I have found that editing images via Singularity is very easy: you can make changes directly to the code by creating editable directories. So, for issues like issue_12, we don't have to push and pull the changes back before making edits; files can be edited directly, like a conventional Python file.

Another reason: sometimes we don't have support for Docker at all. 😆

Happy to hear your thoughts and I will close this issue after your response.

Saurabh

Hi @salokr,
The issue I would have with that is that someone would need to maintain it. I am unfamiliar with Singularity, and therefore cannot fix any issues that may come up.

That being said, I'm curious what the procedure would look like. If it's just another section in the readme, I'm probably ok with it :)

Upon reading further, it turns out that there is a simpler solution for editing Docker images, and it doesn't even require pushing the changes back to git (as with Docker): using Singularity and its sandbox mode.

Using a sandbox, we can create a container inside a writable directory. The resulting directory operates just like a container built from a SIF/Docker image, but is editable.

In this way, you can navigate to the app/ directory of this editable container, make changes directly to the files (no need to push to git for the changes to take effect), and simply execute the seq2seq/run_seq2seq.py file.

In summary, it is like using a Docker image on our own system but with edit permissions. Singularity creates a writable directory called app where all the files and folders from the repo are stored; edit them and run directly. There are probably about 10 steps in total (7 for Singularity and 3 to add support for spider_real/dk/dev) to get eval numbers on a new dataset.
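The Singularity side of those steps might look roughly like this. This is a hedged sketch: the `--writable` exec invocation and the in-container paths are my assumptions, not commands verified against this exact image.

```shell
# Build an editable sandbox directory from the published Docker image
# (instead of a read-only .sif file); this can take 15-20 minutes.
singularity build --sandbox text-to-sql-eval \
    docker://tscholak/text-to-sql-eval:e37020b6eee18bff865d9d2ba852bd636f3ed777

# Edit the repo files directly inside the sandbox, e.g. (assumed path):
#   text-to-sql-eval/app/seq2seq/utils/dataset_loader.py

# Run the evaluation from inside the writable sandbox; the config path
# here is hypothetical and should match whatever `make eval` uses.
singularity exec --writable text-to-sql-eval \
    python /app/seq2seq/run_seq2seq.py configs/eval.json
```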

And, yes, I mean that I can add details for singularity at the end of your ReadMe file OR create a new one and add a link to the original Readme for interested users.

Thanks for this information.
What are you going to do about the non-python dependencies?

Whatever you have provided in the container will be there; no need to install anything from external sources. You just have to execute:

```shell
# Using the --sandbox flag creates a writable directory text-to-sql-eval
# instead of a non-editable "sif" file. It will take a while (15-20 minutes).
singularity build --sandbox text-to-sql-eval docker://tscholak/text-to-sql-eval:e37020b6eee18bff865d9d2ba852bd636f3ed777
```

Singularity will create a directory called text-to-sql-eval. Within this directory, all the environments and packages (conda etc.) you've provided are already available. Something like this:
[screenshot]

Upon navigating to ./opt/conda/bin, you will find:
[screenshot]

And inside the ./app directory, all the files from this git repo are there:
[screenshot]

Finally, make the desired changes inside any of the files you want and run the image (the same way you run Docker); more specifically, run the directory text-to-sql-eval.

It may seem a bit complex, but I have found this solution to be easier (this is how I got the numbers on all versions of Spider).

Thanks again!
Have you confirmed that the binaries actually work inside this environment? I imagine there will be issues with dynamic linking of libraries.

I can take a look into this if you can tell me which binaries to test, and how to check that they actually work inside this environment.

My experience so far: I have used your eval image with the following steps to get Spider numbers (nothing additional), and I never faced any issue regarding libraries/dependencies:

[screenshot]

Alright then, feel free to add a section to the readme!

Sure thanks :)