Original Spider Dataset used in the Paper + Setup Instructions to Run on a New Database

Question

Opened this issue a year ago · 1 comment

Hey,

I tried following the setup instructions in README.md, but I think they are no longer valid because of changes in the Spider dataset.

For example,

python3 -u preprocess/process_dataset.py --dataset_path data/train.json --raw_table_path data/tables.json --table_path data/tables.bin --output_path 'data/train.bin' --skip_large --semantic_graph

The Spider dataset has no 'data/train.json' file, but it does have 'data/train_spider.json' and 'data/train_others.json'.
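One possible workaround is to concatenate the two files into a single 'data/train.json'. The sketch below assumes that the preprocessing script expects the combined training set and that both files hold a top-level JSON list of examples; neither assumption is confirmed by the README. The helper name merge_spider_train is hypothetical.

```python
import json
from pathlib import Path


def merge_spider_train(data_dir,
                       parts=("train_spider.json", "train_others.json"),
                       out_name="train.json"):
    """Concatenate the Spider training splits into one JSON list.

    Assumption: each input file contains a top-level JSON list of examples.
    Returns the number of examples written.
    """
    data_dir = Path(data_dir)
    merged = []
    for name in parts:
        with open(data_dir / name, encoding="utf-8") as f:
            examples = json.load(f)
        if not isinstance(examples, list):
            raise ValueError(f"{name} does not contain a JSON list")
        merged.extend(examples)

    # Write the combined list next to the input files.
    with open(data_dir / out_name, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```

Calling merge_spider_train("data") would then write 'data/train.json', after which the original command can be run unchanged.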

I tried changing the file name, but I got the following error.

Firstly, preprocess the original databases ...
Traceback (most recent call last):
  File "/sensei-fs/users/saudi/text2sql/ISESL-SQL/preprocess/process_dataset.py", line 74, in <module>
    tables = process_tables(processor, tables_list, args.table_path, args.verbose)
  File "/sensei-fs/users/saudi/text2sql/ISESL-SQL/preprocess/process_dataset.py", line 26, in process_tables
    tables[each['db_id']] = processor.preprocess_database(each, verbose=verbose)
  File "/sensei-fs/users/saudi/text2sql/ISESL-SQL/preprocess/common_utils.py", line 100, in preprocess_database
    c = [w.lemma.lower() for s in doc.sentences for w in s.words]
  File "/sensei-fs/users/saudi/text2sql/ISESL-SQL/preprocess/common_utils.py", line 100, in <listcomp>
    c = [w.lemma.lower() for s in doc.sentences for w in s.words]
AttributeError: 'NoneType' object has no attribute 'lower'

If possible, please upload the original dataset used in the paper to Google Drive and share the link.

Also, please provide instructions for running the pipeline on a new database, such as which files need to be created in the data folder and which scripts to run.

Your help on this is much appreciated.

Best,
Saud Iqbal

Answer 1 · 2023-10-09T12:02:05.000Z

Have you resolved this issue? I don't think it's closely related to the dataset. In your library version, some words may not have a lemma, so w.lemma is None. You can try falling back to the word's text:
[w.lemma.lower() if w.lemma is not None else w.text.lower() for s in doc.sentences for w in s.words]
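The fallback can be checked in isolation with a small helper. This is only a sketch: it assumes each word object exposes a lemma attribute (as in the traceback) and a text attribute holding the surface form. The name safe_lemmas and the stand-in document built from SimpleNamespace are hypothetical and are only there so that the example runs without the NLP library installed.

```python
from types import SimpleNamespace


def safe_lemmas(doc):
    """Lower-cased lemmas, falling back to the surface text when lemma is None."""
    return [
        (w.lemma if w.lemma is not None else w.text).lower()
        for s in doc.sentences
        for w in s.words
    ]


# Stand-in for a parsed document: one sentence, the second word has no lemma.
doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        SimpleNamespace(text="Singers", lemma="singer"),
        SimpleNamespace(text="ID", lemma=None),
    ])
])

print(safe_lemmas(doc))  # ['singer', 'id']
```

If this behaves as expected, the same expression can replace the list comprehension on line 100 of preprocess/common_utils.py.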