yfukasawa/LongQC

pandas.errors.EmptyDataError: No columns to parse from file

Closed this issue · 6 comments

Hi Yoshinori

I am getting the following error when running:

python longQC.py sampleqc -x pb-hifi -o longqc ccs.bam

I made sure that ccs.bam is a 16GB file

Thank you for the help

longQC:2021-05-07 09:18:02,824:592:INFO:Filteration finished.
longQC:2021-05-07 09:18:12,834:598:INFO:Generating coverage related plots...
Traceback (most recent call last):
  File "/LongQC-1.2.0b/longQC.py", line 956, in <module>
    main(args)
  File "/LongQC-1.2.0b/longQC.py", line 62, in main
    args.handler(args)
  File "/LongQC-1.2.0b/longQC.py", line 602, in command_sample
    lc = LqCoverage(cov_path, isTranscript=args.transcript, control_filtering=pb_control)
  File "/LongQC-1.2.0b/lq_coverage.py", line 88, in __init__
    self.df = pd.read_table(table_path, sep='\t', header=None)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers.py", line 689, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers.py", line 1898, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 521, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Hi @grpiccoli,

Could you share the contents of the file named coverage_err.txt? Also, what is the size of coverage_out.txt?
Both should be under /path/to/longqc/analysis/minimap2/coverage_*.txt.
One possible cause is minimap2-coverage running out of memory; it usually requires 50-60 GB for a 4 Gbp index (the default).
Unfortunately, it is very hard to predict how much memory your dataset will need, because it also depends on k-mer frequency. Having 100 GB of memory is a bit safer, and if the machine has less memory, please specify a smaller size with the --index option (e.g. 500M).
coverage_out.txt should have 5000 lines; could you check whether it does, so we can confirm or rule out this possibility?
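
For anyone who wants to run the two checks above in one go, here is a minimal sketch. It is not part of LongQC; the function name `inspect_coverage_outputs` and the report keys are made up for illustration, and it only assumes the file names and the 5000-line expectation mentioned above.

```python
from pathlib import Path


def inspect_coverage_outputs(minimap2_dir, expected_lines=5000, tail=5):
    """Report size and line count of coverage_out.txt and the tail of coverage_err.txt.

    Hypothetical helper for illustration; not part of LongQC.
    """
    base = Path(minimap2_dir)
    out_path = base / "coverage_out.txt"
    err_path = base / "coverage_err.txt"

    report = {"out_bytes": None, "out_lines": None, "complete": False, "err_tail": []}

    if out_path.is_file():
        report["out_bytes"] = out_path.stat().st_size
        with out_path.open("rb") as handle:
            # Count lines without loading the whole file into memory.
            report["out_lines"] = sum(1 for _ in handle)
        report["complete"] = report["out_lines"] == expected_lines

    if err_path.is_file():
        lines = err_path.read_text(errors="replace").splitlines()
        # The last few lines usually show where the process stopped.
        report["err_tail"] = lines[-tail:]

    return report


if __name__ == "__main__":
    # Placeholder path from this thread; substitute the real output directory.
    print(inspect_coverage_outputs("/path/to/longqc/analysis/minimap2"))
```

An empty or short coverage_out.txt together with an abruptly ending coverage_err.txt would be consistent with the process having been stopped early, although it does not by itself prove a memory problem.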

Yoshinori

Thank you, Yoshinori, for all the help. The fix was to increase the RAM as described.

Hello! I experienced the same issue. It looks like resubmitting the job with 100 GB of RAM fixed the problem.

Maybe the error message could be more explicit?
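
One way this could look, as a rough sketch only: a pre-check that runs before the `pd.read_table` call seen in the traceback and fails with a message pointing at the likely cause. The names `EmptyCoverageTableError` and `require_nonempty_table` are hypothetical and do not exist in LongQC; the wording of the hint is based on the advice earlier in this thread.

```python
import os


class EmptyCoverageTableError(RuntimeError):
    """Hypothetical error type for illustration; not part of LongQC."""


def require_nonempty_table(table_path):
    """Raise a descriptive error if the coverage table is missing or empty.

    Intended to be called just before parsing the table, so the user sees
    a hint instead of a bare pandas EmptyDataError.
    """
    if not os.path.isfile(table_path):
        raise EmptyCoverageTableError(
            f"{table_path} was not created. Check coverage_err.txt in the same directory."
        )
    if os.path.getsize(table_path) == 0:
        raise EmptyCoverageTableError(
            f"{table_path} is empty. minimap2-coverage may have stopped early, "
            "for example after running out of memory. Check coverage_err.txt, "
            "give the job more memory, or pass a smaller --index value."
        )
    return table_path
```

A size check like this only catches the completely empty case; a truncated but non-empty table would still need a line-count check.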

Hi, I have the same problem, but memory shouldn't be an issue on my 1 TB machine. The coverage_err.txt contains the following error message:

[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::mm_mapopt_update::728.532*18.16] mid_occ = 3164
[M::main::728.532*18.16] loaded/built the index for 239301 target sequence(s)
[M::worker_pipeline::776.181*19.72] mapped 5000 sequences. (Peak RSS: 50.844 GB)
munmap_chunk(): invalid pointer

Seems to be a minimap2 issue.

Have you seen that before?

Thanks,
Chris

hello @VDaric,

Sorry for the late reply; I simply didn't notice your comment.
Thanks for your feedback. I will consider some nicer ways to handle the memory issue, maybe something to catch a killed process.
Also, technically it is very hard to predict in advance how much RAM will be required, as it depends on the dataset content. From my experience, samples with skewed GC content may require more RAM, possibly due to inflated spurious matches.
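
As a rough illustration of what catching a killed process could look like (a sketch only, assuming a POSIX system and that the child is launched through Python's `subprocess`; `describe_exit` and `run_and_report` are hypothetical names, not LongQC functions): on POSIX, `subprocess` reports a negative return code when the child was terminated by a signal.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable message."""
    if returncode == 0:
        return "finished normally"
    if returncode < 0:
        # POSIX only: -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        hint = ""
        if name == "SIGKILL":
            # Assumption: SIGKILL is frequently, but not always, a memory limit.
            hint = " (often the out-of-memory killer or a scheduler memory limit)"
        return f"killed by {name}{hint}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run a command and return (returncode, description)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, describe_exit(proc.returncode)
```

A crash inside the program itself, such as an abort raised by the C library, would show up here under a different signal name than a kill for memory reasons, which could help tell the two cases apart.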

hello @ctxchris,

Thank you for your report.
Yes, I agree. This issue should not have been caused by a lack of memory, as the process used 50 GB at most before the crash.

As for the message, as far as I remember, I have never seen this error. I'm sorry to say that I have no idea at present...
By any chance, is your data public? If I can get the same data, I can try to reproduce the issue on my end.

Yoshinori