What files/directories are actually needed as input?
nick-youngblut opened this issue · 6 comments
Most of the docs/tutorials demonstrate specifying the entire run folder as input.
However, for pipelines such as Nextflow, it is much more efficient to specify only the specific input files required.
Given that the package is labeled "interop", one would assume that only the InterOp directory (or just the main *.bin files in the directory, such as SummaryRunMetricsOut.bin) is needed, but it is not clear how to read just the InterOp directory or specific *.bin files with the Python wrapper.
For instance, if I use the summary.py example (after updating to Python 3), I get:
2023-12-09 21:06:24,486 - Skipping - cannot read RunInfo.xml: - No format found to parse ErrorMetricsOut.bin with version: 6 of 3
/io/./interop/io/metric_stream.h::read_metrics (111)
However, the input directory that I specified contains:
|-- InterOp
`-- RunInfo.xml
The RunInfo.xml is present, and all *.bin files are in the InterOp directory:
InterOp/AlignmentMetricsOut.bin
InterOp/BasecallingMetricsOut.bin
InterOp/CorrectedIntMetricsOut.bin
InterOp/EmpiricalPhasingMetricsOut.bin
InterOp/ErrorMetricsOut.bin
InterOp/EventMetricsOut.bin
InterOp/ExtendedTileMetricsOut.bin
InterOp/ExtractionMetricsOut.bin
InterOp/FWHMGridMetricsOut.bin
InterOp/ImageMetricsOut.bin
InterOp/InsertSizeMetricsOut.bin
InterOp/OpticalMetricsOut.bin
InterOp/OpticalModelMetricsOut.bin
InterOp/PFGridMetricsOut.bin
InterOp/QMetrics2030Out.bin
InterOp/QMetricsByLaneOut.bin
InterOp/QMetricsOut.bin
InterOp/RawFWHMGridMetricsOut.bin
InterOp/ReconstructionMetricsOut.bin
InterOp/SummaryRunMetricsOut.bin
InterOp/SweepMetricsOut.bin
InterOp/TileMetricsOut.bin
Moreover, summary.py generates nothing for MiSeq runs. If I add:
print(f"Summary size: {summary.size()}")
print(f"Summary lane count: {summary.lane_count()}")
print(f"Summary surface count: {summary.surface_count()}")
I get:
Summary size: 0
Summary lane count: 0
Summary surface count: 0
The MiSeq run folder that I'm using contains all output files for a successful MiSeq run.
The RunInfo.xml file shows that the counts should be 1 and not 0:
<?xml version="1.0"?>
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2">
<Run Id="XXX" Number="45">
<Flowcell>XXX</Flowcell>
<Instrument>XXX</Instrument>
<Date>231205</Date>
<Reads>
<Read NumCycles="151" Number="1" IsIndexedRead="N" />
<Read NumCycles="151" Number="2" IsIndexedRead="N" />
</Reads>
<FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
</Run>
</RunInfo>
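For anyone who wants to confirm the expected counts independently of the interop library, here is a minimal sketch that reads the FlowcellLayout attributes straight from RunInfo.xml using only the Python standard library. The helper name `read_flowcell_layout` is made up for illustration and is not part of interop; the XML is the one shown above.

```python
import xml.etree.ElementTree as ET

# RunInfo.xml content copied from this post (identifiers redacted as XXX).
RUN_INFO = """<?xml version="1.0"?>
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2">
<Run Id="XXX" Number="45">
<Flowcell>XXX</Flowcell>
<Instrument>XXX</Instrument>
<Date>231205</Date>
<Reads>
<Read NumCycles="151" Number="1" IsIndexedRead="N" />
<Read NumCycles="151" Number="2" IsIndexedRead="N" />
</Reads>
<FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
</Run>
</RunInfo>
"""


def read_flowcell_layout(run_info_xml: str) -> dict:
    """Return the FlowcellLayout attributes of a RunInfo.xml string as integers.

    Hypothetical helper for illustration; not part of the interop package.
    """
    root = ET.fromstring(run_info_xml)
    layout = root.find("./Run/FlowcellLayout")
    if layout is None:
        raise ValueError("RunInfo.xml has no Run/FlowcellLayout element")
    return {name: int(value) for name, value in layout.attrib.items()}


layout = read_flowcell_layout(RUN_INFO)
# LaneCount and SurfaceCount come back as 1 for this run.
print(layout)
```

This only shows what the run folder declares; it says nothing about why the summary object reports 0.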
Addressing the first concern, the error message reported in Python below is incorrect. It should have said just Skipping and not Skipping - cannot read RunInfo.xml.
The second part of the error is the important bit:
No format found to parse ErrorMetricsOut.bin with version: 6 of 3
This means that you are trying to parse version 6 of the ErrorMetricsOut.bin with a version of the interop library that only supports up to version 3. Upgrading the interop library will address this issue.
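To see which format version each file in a run uses before deciding whether an upgrade is needed, a rough sketch like the following may help. It assumes that the first byte of each InterOp *.bin file is its format version number, which is how the published InterOp binary layouts describe the header; treat that as an assumption and check it against the format documentation. The function name `interop_file_versions` is made up here and is not part of the interop package.

```python
from pathlib import Path


def interop_file_versions(interop_dir):
    """Map each *.bin file name in interop_dir to its first byte.

    Assumption: the first byte of an InterOp binary file is the format
    version. Empty files are skipped. Hypothetical helper, not an interop API.
    """
    versions = {}
    for path in sorted(Path(interop_dir).glob("*.bin")):
        with path.open("rb") as handle:
            first = handle.read(1)
        if first:
            versions[path.name] = first[0]
    return versions
```

Calling `interop_file_versions("/path/to/run/InterOp")` returns a plain dict, so an entry such as `ErrorMetricsOut.bin` mapped to 6 would match the "version: 6 of 3" part of the message above.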
interop/src/examples/python/summary.py
Lines 34 to 37 in b3a1089
The second issue sounds like a bug. It may be in the older version of interop you are using based on the previous issue, or it may still be in the library. I will need to investigate this.
I cannot reproduce this issue with a local MiSeq run and the latest version of the library.
> Upgrading the interop library will address this issue.
$ pip install interop==1.3.0
ERROR: Could not find a version that satisfies the requirement interop==1.3.0 (from versions: 1.1.18, 1.1.19, 1.1.21, 1.1.22, 1.1.23)
I'm using Ubuntu 22.04 & python 3.9.19. My python env:
Package    Version
---------- -------
numpy      1.26.2
pip        23.3.1
setuptools 68.2.2
wheel      0.41.3
Based on the setup.py.in file, it seems like version 1.3.0 should be compatible with my environment.
Note: Installation of interop v1.3.0 via bioconda doesn't install the python package.
Looks like there is a bug when building the Python 3.9 wheel for manylinux. That wheel is missing from PyPI.
I will have a PR out to fix that.
As for bioconda, we don't support that and I don't know much about it.
As for the list of files, the InterOp files listed on this site plus the RunInfo.xml are required.
You can load individual files, but we don't document that route and I don't recommend it.
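For pipeline use (e.g. Nextflow), one way to act on this is to stage only RunInfo.xml and the InterOp directory into a small folder and pass that folder as the run folder. The sketch below uses only the Python standard library; the function name `stage_interop_inputs` is made up for illustration, and because the exact list of required *.bin files is not spelled out in this thread, it simply copies every *.bin file it finds.

```python
import shutil
from pathlib import Path


def stage_interop_inputs(run_folder, dest):
    """Copy RunInfo.xml and InterOp/*.bin from run_folder into dest.

    Hypothetical helper, not part of interop. Returns the sorted list of
    copied *.bin file names. Everything else in the run folder is ignored.
    """
    run_folder, dest = Path(run_folder), Path(dest)
    run_info = run_folder / "RunInfo.xml"
    interop_dir = run_folder / "InterOp"
    if not run_info.is_file():
        raise FileNotFoundError(f"missing {run_info}")
    if not interop_dir.is_dir():
        raise FileNotFoundError(f"missing {interop_dir}")

    # Recreate the layout the library was shown reading: <dest>/RunInfo.xml
    # next to <dest>/InterOp/*.bin.
    (dest / "InterOp").mkdir(parents=True, exist_ok=True)
    shutil.copy2(run_info, dest / "RunInfo.xml")

    copied = []
    for bin_file in sorted(interop_dir.glob("*.bin")):
        shutil.copy2(bin_file, dest / "InterOp" / bin_file.name)
        copied.append(bin_file.name)
    return copied
```

The staged folder has the same shape as the input directory shown at the top of this issue (an InterOp directory plus RunInfo.xml), so it can be passed wherever the full run folder was passed before.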