imsb-uke/scGAN

"IndexError: index out of bounds" when running 'python main.py --param parameters.json --process'

JohnWang1997 opened this issue · 5 comments

The output of the command is as follows:
$ python main.py --param parameters.json --process
( Omit some warnings )
Clustering of the raw data is done to 3 clusters.
Filtering of the raw data is done with minimum 10 cells per gene.
Filtering of the raw data is done with minimum 3 genes per cell.
Cells number is 2700 , with 13714 genes per cell.
Scaling of the data is done using normalize_per_cell_LS with 20000
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/mnt/d/Linux/software/anaconda/envs/scGAN/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2560, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/_libs/index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/d/Linux/software/anaconda/envs/scGAN/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/mnt/d/PythonProject/scGAN/preprocessing/write_tfrecords.py", line 149, in read_and_serialize
sc_genes, d = process_line(line)
File "/mnt/d/PythonProject/scGAN/preprocessing/write_tfrecords.py", line 113, in process_line
dset = line.obs['split'][0]
File "/mnt/d/Linux/software/anaconda/envs/scGAN/lib/python3.6/site-packages/pandas/core/series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "/mnt/d/Linux/software/anaconda/envs/scGAN/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2566, in get_value
return libts.get_value_box(s, key)
File "pandas/_libs/tslib.pyx", line 1017, in pandas._libs.tslib.get_value_box
File "pandas/_libs/tslib.pyx", line 1032, in pandas._libs.tslib.get_value_box
IndexError: index out of bounds
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "main.py", line 94, in <module>
process_files(exp_folders)
File "/mnt/d/PythonProject/scGAN/preprocessing/write_tfrecords.py", line 175, in process_files
for res in results:
File "/mnt/d/Linux/software/anaconda/envs/scGAN/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
IndexError: index out of bounds
————————————————————————————————————————————————————
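For reference, the lookup that fails here, `line.obs['split'][0]`, mixes label-based and position-based indexing. A minimal sketch of the unambiguous, position-based spelling, assuming `line.obs` is a pandas DataFrame indexed by cell barcode (the barcode and column values below are made up for illustration):

```python
import pandas as pd

# Illustrative stand-in for `line.obs`: a one-row DataFrame whose index is a
# cell barcode string, as anndata uses. With a non-integer index, `series[0]`
# relies on a label-then-position fallback; `.iloc[0]` is always positional.
obs = pd.DataFrame({"split": ["train"]}, index=["AAACATACAACCAC-1"])

value = obs["split"].iloc[0]  # first element, regardless of index labels
print(value)
```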
Thank you very much for your answer!

To help me debug, would you please attach your parameters.json file?
Most likely I'm going to also need the associated dataset (or a reduced version with a few single cells in it should be enough, as long as it reproduces the issue on your end).

Thanks for your answer. I tested the --process command with the pbmc3k dataset; apart from changing the input and output directories, I did not modify any other hyperparameters in the .json file.
In addition, I stepped through the code in a Jupyter notebook to debug it. The same error was raised at the same location (File "/mnt/d/PythonProject/scGAN/preprocessing/write_tfrecords.py", line 175), yet the tfrecords could still be generated.
data_and_json.zip
TF_records.zip
test_process(ipynb).zip

Thanks for your data and the notebook. Unfortunately we're not able to reproduce the issue so far.
Could you also send us your package versions, obtained with pip freeze?
That would help rule out the possibility that the issue is caused by differing package versions.
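For reference, the versions can be captured into a file like this (a sketch, assuming pip is on the PATH; the file name is arbitrary):

```shell
# Dump every installed package with its exact version, then attach the file.
pip freeze > environment.txt

# Quick check of the package most relevant here; prints a fallback message
# instead of failing when anndata is absent.
grep -i anndata environment.txt || echo "anndata not installed"
```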

After some investigation, it appears that you have been using a more recent version of anndata that is unfortunately not compatible with our code. With version 0.7, anndata introduced breaking changes to its data structures.
With an older version, however, you will be able to preprocess the data and proceed with training scGAN.
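A small sketch of a version guard along these lines (not part of the scGAN codebase; the function name and cutoff are illustrative, based on 0.7 being the first breaking release):

```python
def anndata_is_compatible(version_str, first_breaking=(0, 7)):
    """Return True for anndata versions older than the 0.7 breaking release.

    Compares the major/minor components numerically rather than lexically,
    so e.g. "0.10" is correctly treated as newer than "0.7".
    """
    major_minor = tuple(int(part) for part in version_str.split(".")[:2])
    return major_minor < first_breaking


print(anndata_is_compatible("0.6.18"))  # True: predates the breaking changes
print(anndata_is_compatible("0.7.0"))   # False: incompatible .obs structures
```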
The Dockerfile we initially provided was, frankly, not very helpful for running scGAN. I have just committed a new version that will let you set up a suitable environment easily. (Alternatively, if you prefer not to use Docker, you will also find a requirements.txt file that you can use with pip to install all the Python dependencies.)

I'm also attaching an h5ad file of the 10x 3kPBMCs dataset that I generated with the correct version of anndata. Naturally, after setting up the right environment, you can also use our Jupyter notebook again to convert any other dataset from 10x format to h5ad.
3kPBMCs_06.zip

I hope this helps. Let me know if you need more support.

Thank you very much for your detailed answers @pierremac and @fhausmann. I downgraded the anndata package to version 0.6.18 as you suggested, and that solved the problem. Wish you success in your work!