hempelc/metagenomics-vs-totalRNASeq

Issue with generating final output files

Closed this issue · 2 comments

Hi.
Firstly, thanks so much for creating this GitHub repository. It's amazing work and I've been finding it so helpful with the analysis of my data.
I'm having a problem with the final step of the pipeline - generating final output files.
The merge_on_outer.py subscript results in an error (see below). I think this is because there is no column named 'counts' in my mapping file. I reviewed the mapping step, which appeared to go well. The samtools coverage command works as expected, and then, following your code, cols 1 and 7 are kept for the final file; these are the node name and the mean coverage score. The latter is then renamed to coverage.

So, what I wanted to check was: are these the correct columns to take forward from the coverage.sam file, i.e. should I be using coverage or counts? Should the merge_on_outer.py script work in the absence of a column named 'counts'? Or should coverage simply be renamed to counts?

Hope that all makes sense. I'm definitely not used to python scripts and have learnt most of this stuff from going through the work of others.

Traceback (most recent call last):
  File "/home/rjs202/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'counts'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/shared/home/rjs202/RNAseq/subscripts/merge_on_outer.py", line 25, in <module>
    df3['counts'].fillna(0, inplace=True)
  File "/home/rjs202/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/rjs202/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'counts'

Best wishes,
Richard

Hi Richard,

Thanks for the kind words, I'm glad that the code proved useful!

You're definitely on the right track. In the script merge_on_outer.py, the line that causes the issue is

df3['counts'].fillna(0, inplace=True)

It looks for the column 'counts' but can't find it (since the column we want to edit is called 'coverage') and therefore raises an error.

This should be easily fixed by changing 'counts' to 'coverage' in the script. I made the change and pushed it to the repo; you can either re-download the script or, if you have made any local changes that you want to keep, just make that one change yourself.
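
For illustration, here is a minimal sketch of what the fix amounts to. It is not the actual content of merge_on_outer.py: the two small tables, the key column name `sequence_name`, and the variable names are hypothetical stand-ins, and it only assumes that the script does an outer merge followed by filling missing values with 0.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables being merged
# (names and values are made up for illustration).
taxonomy = pd.DataFrame({
    "sequence_name": ["NODE_1", "NODE_2", "NODE_3"],
    "species": ["Escherichia coli", "Bacillus subtilis", "Listeria monocytogenes"],
})
coverage = pd.DataFrame({
    "sequence_name": ["NODE_1", "NODE_3", "NODE_4"],
    "coverage": [12.5, 3.0, 7.25],  # mean depth per sequence
})

# An outer merge keeps rows from both tables; sequences missing from
# one side get NaN in the columns that come from the other side.
merged = taxonomy.merge(coverage, on="sequence_name", how="outer")

# merged["counts"] would raise KeyError: 'counts' here, because the
# column is named "coverage". Checking first gives a clearer message.
value_col = "coverage"
if value_col not in merged.columns:
    raise KeyError(
        f"Expected column {value_col!r}, found {list(merged.columns)}"
    )

# Sequences without any mapped reads get a coverage of 0.
merged[value_col] = merged[value_col].fillna(0)

print(merged)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a column selection, is a choice made for this sketch; as far as I know, recent pandas versions warn about the in-place form, so the assignment is the safer pattern.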

Unfortunately, I don't have access to the data anymore to double-check that this fixes the issue but I'm pretty sure it will. If you run into any additional issues, please feel free to reach out again.

Best,
Chris

Hi Chris,
Yes, this works now that I've made the change to the script. Glad to know I am understanding something about the scripts.
Thanks again for your hard work on this.
BW,
Richard