bgzip, tabix and sorting
carolinehey opened this issue · 5 comments
Hi @wdecoster
I have 2 nanopolish output tsv files (calls and frequencies). I would like to use them as input for methplotlib. I have tried to sort as you wrote in an example but when i want to use tabix it gives an unsorted error.
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k2,2 -k3,3) | bgzip > sample8_sort_calls.tsv.gz
tabix -S1 -s1 -b3 -e4 sample8_methylation_calls_2.tsv.gz
Looking at the sorted and decompressed tsv file i see that the problem is between the plus and minus strand:
Should it only be sorted by the 3 column and how can this be done?
When i sort like this and use tabix afterwards the file does not work in methplotlib
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k 3) | bgzip > sample8_sort_calls.tsv.gz
(methplotlib_caroline_python3) root@srvodeappgsq02:/longread/G128-2021-Imprinting-disorders/20220315_1228_1G_PAI63326_dd13a3c5# methplotlib -m sample8_sort2_calls.tsv.gz -n calls -w chr15:24,500,000-25,500,000 -f /longread/resources/reference_genomes/GRCh38_Hg38/hg38.fa
Traceback (most recent call last):
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 142, in pandas._libs.index.IndexEngine.get_loc
TypeError: '85285 False
85303 True
85322 True
85341 True
85358 True
...
13851 True
13873 True
13895 False
13917 False
13939 True
Name: log_lik_ratio, Length: 166259, dtype: bool' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1086, in setitem
self._set_with_engine(key, value)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1147, in _set_with_engine
loc = self.index.get_loc(key)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3628, in get_loc
self._check_indexing_error(key)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5637, in _check_indexing_error
raise InvalidIndexError(key)
pandas.errors.InvalidIndexError: 85285 False
85303 True
85322 True
85341 True
85358 True
...
13851 True
13873 True
13895 False
13917 False
13939 True
Name: log_lik_ratio, Length: 166259, dtype: bool
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/methplotlib_caroline_python3/bin/methplotlib", line 8, in
sys.exit(main())
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/methplotlib/methplotlib.py", line 26, in main
meth_browser(meth_data=meth_data,
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/methplotlib/methplotlib.py", line 53, in meth_browser
meth_traces = plots.methylation(meth_data, dotsize=dotsize, binary=binary, minqual=minqual)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/methplotlib/plots.py", line 106, in methylation
make_per_read_meth_traces_llr(table=meth.table,
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/methplotlib/plots.py", line 183, in make_per_read_meth_traces_llr
table.loc[:, "llr_scaled"] = rescale_log_likelihood_ratio(table["log_lik_ratio"].copy())
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/methplotlib/plots.py", line 269, in rescale_log_likelihood_ratio
llr[llr > 0] = scaler.fit_transform(llr[llr > 0].values.reshape(-1, 1)).tolist()
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1128, in setitem
self._set_values(indexer, value)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1192, in _set_values
self._mgr = self._mgr.setitem(indexer=key, value=value)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 337, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 304, in apply
applied = getattr(b, f)(**kwargs)
File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 943, in setitem
values[indexer] = value
ValueError: shape mismatch: value array of shape (96245,1) could not be broadcast to indexing result of shape (96245,)
For the calls file you should sort on the first (chromosome), third (start), and fourth (end) columns.
Sorry for my ignorance, but how is this done?
And should it also be done on the frequency file?
I think
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k1,1n -k3,3 -k4,4) | bgzip > sample8_sort_calls.tsv.gz
The frequency file has chromosome, start and end in columns 1,2,3 so that would probably look like
cat <(head -n1 sample8_methylation_freqs.tsv) <(tail -n +2 sample8_methylation_freqs.tsv | sort -k1,1n -k2,2 -k3,3) | bgzip > sample8_sort_freqs.tsv.gz
Please create a separate issue for this and also provide the version of methplotlib that you are using.