Illumina/interop

What files/directories are actually needed as input?

nick-youngblut opened this issue · 6 comments

Most of the docs and tutorials demonstrate passing the entire run folder as input.

However, for workflow managers such as Nextflow, it is much more efficient to stage only the input files that are actually required.

Given that the package is named "interop", one would assume that only the InterOp directory (or just the main *.bin files in that directory, such as SummaryRunMetricsOut.bin) is needed, but it is not clear how to read just the InterOp directory or specific *.bin files with the Python wrapper.

For instance, if I use the summary.py example (after updating to Python 3), I get:

2023-12-09 21:06:24,486 - Skipping - cannot read RunInfo.xml:  - No format found to parse ErrorMetricsOut.bin with version: 6 of 3
/io/./interop/io/metric_stream.h::read_metrics (111)

However, the input directory that I specified contains:

|-- InterOp
`-- RunInfo.xml

The RunInfo.xml is present, and all *.bin files are in the InterOp directory:

InterOp/AlignmentMetricsOut.bin
InterOp/BasecallingMetricsOut.bin
InterOp/CorrectedIntMetricsOut.bin
InterOp/EmpiricalPhasingMetricsOut.bin
InterOp/ErrorMetricsOut.bin
InterOp/EventMetricsOut.bin
InterOp/ExtendedTileMetricsOut.bin
InterOp/ExtractionMetricsOut.bin
InterOp/FWHMGridMetricsOut.bin
InterOp/ImageMetricsOut.bin
InterOp/InsertSizeMetricsOut.bin
InterOp/OpticalMetricsOut.bin
InterOp/OpticalModelMetricsOut.bin
InterOp/PFGridMetricsOut.bin
InterOp/QMetrics2030Out.bin
InterOp/QMetricsByLaneOut.bin
InterOp/QMetricsOut.bin
InterOp/RawFWHMGridMetricsOut.bin
InterOp/ReconstructionMetricsOut.bin
InterOp/SummaryRunMetricsOut.bin
InterOp/SweepMetricsOut.bin
InterOp/TileMetricsOut.bin

Moreover, summary.py generates nothing for MiSeq runs. If I add:

        print(f"Summary size: {summary.size()}")
        print(f"Summary lane count: {summary.lane_count()}")
        print(f"Summary surface count: {summary.surface_count()}")

I get:

Summary size: 0
Summary lane count: 0
Summary surface count: 0

The MiSeq run folder that I'm using contains all output files for a successful MiSeq run.

The RunInfo.xml file shows that the counts should be 1 and not 0:

<?xml version="1.0"?>
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2">
  <Run Id="XXX" Number="45">
    <Flowcell>XXX</Flowcell>
    <Instrument>XXX</Instrument>
    <Date>231205</Date>
    <Reads>
      <Read NumCycles="151" Number="1" IsIndexedRead="N" />
      <Read NumCycles="151" Number="2" IsIndexedRead="N" />
    </Reads>
    <FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
  </Run>
</RunInfo>
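
For reference, the expected counts can be read directly from RunInfo.xml with the Python standard library, independently of the interop package. This is a minimal sketch that assumes the layout shown above (a Run element containing a FlowcellLayout element); the helper name flowcell_layout is made up for illustration and is not part of interop.

```python
import xml.etree.ElementTree as ET


def flowcell_layout(run_info_xml: str) -> dict:
    """Return the FlowcellLayout attributes of a RunInfo.xml document as ints.

    Hypothetical helper; assumes a Run/FlowcellLayout element as in the
    RunInfo.xml quoted in this issue.
    """
    root = ET.fromstring(run_info_xml)
    layout = root.find("Run/FlowcellLayout")
    if layout is None:
        raise ValueError("RunInfo.xml has no Run/FlowcellLayout element")
    return {name: int(value) for name, value in layout.attrib.items()}


# Trimmed copy of the RunInfo.xml from this issue.
EXAMPLE = """<?xml version="1.0"?>
<RunInfo Version="2">
  <Run Id="XXX" Number="45">
    <FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
  </Run>
</RunInfo>"""

if __name__ == "__main__":
    print(flowcell_layout(EXAMPLE))
```

Comparing these values with what summary.lane_count() and summary.surface_count() report makes it easy to see whether the summary was actually populated.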

To address the first concern: the error message logged by the Python example (quoted below) is misleading. It should have said just "Skipping", not "Skipping - cannot read RunInfo.xml".

The second part of the error is the important bit:
No format found to parse ErrorMetricsOut.bin with version: 6 of 3

This means that you are trying to parse version 6 of the ErrorMetricsOut.bin with a version of the interop library that only supports up to version 3. Upgrading the interop library will address this issue.
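
To see which files in a run use a newer format than the installed library understands, the format version of each file can be inspected without interop. The sketch below rests on one assumption: that the first byte of each InterOp metric file holds its format version, as the InterOp binary format descriptions indicate for the standard metric files. The function name interop_format_versions is hypothetical.

```python
from pathlib import Path


def interop_format_versions(interop_dir):
    """Map each *.bin file name in interop_dir to the value of its first byte.

    Assumption: the first byte of an InterOp metric file is its format
    version. This is not an interop API, only a diagnostic sketch.
    """
    versions = {}
    for path in sorted(Path(interop_dir).glob("*.bin")):
        with path.open("rb") as handle:
            first = handle.read(1)
        # An empty file has no header; record None rather than guessing.
        versions[path.name] = first[0] if first else None
    return versions


if __name__ == "__main__":
    import sys

    for name, version in interop_format_versions(sys.argv[1]).items():
        print(f"{name}: {version}")
```

Run against the InterOp directory, a value of 6 for ErrorMetricsOut.bin would match the "version: 6 of 3" part of the error above.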

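try:
    run_metrics.read(run_folder_path, valid_to_load)
except Exception as ex:  # Python 3 form of "except Exception, ex:"
    # The "cannot read RunInfo.xml" text is the misleading part of this message.
    logging.warning("Skipping - cannot read RunInfo.xml: %s - %s" % (run_folder, str(ex)))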

The second issue sounds like a bug. It may be specific to the older version of interop you are using (see the previous issue), or it may still be present in the library. I will need to investigate this.

I cannot reproduce this issue with a local MiSeq run and the latest version of the library.

"Upgrading the interop library will address this issue."

$ pip install interop==1.3.0
ERROR: Could not find a version that satisfies the requirement interop==1.3.0 (from versions: 1.1.18, 1.1.19, 1.1.21, 1.1.22, 1.1.23)

I'm using Ubuntu 22.04 and Python 3.9.19. My Python env:

Package    Version
---------- -------
numpy      1.26.2
pip        23.3.1
setuptools 68.2.2
wheel      0.41.3

Based on the setup.py.in file, it seems like version 1.3.0 should be compatible with my environment.

Note: Installing interop v1.3.0 via bioconda doesn't install the Python package.
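
A quick way to confirm which interop distribution, if any, is visible to a given interpreter is to query the installed package metadata. This is a small sketch using only the standard library (importlib.metadata, available from Python 3.8); installed_version is an illustrative helper name. It only detects distributions that ship package metadata, so a result of None does not by itself prove the module cannot be imported.

```python
import platform
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Useful details to include when reporting an installation problem.
    print("Python:", platform.python_version())
    print("Platform:", platform.platform())
    print("interop:", installed_version("interop") or "not installed")
```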

It looks like there is a bug when building the Python 3.9 wheel for manylinux. That wheel is missing from PyPI.

I will have a PR out to fix that.

As for bioconda, we don't support that and I don't know much about it.

As for the list of files, the InterOp files listed on this site plus the RunInfo.xml are required.

You can load individual files, but we don't document that route and I don't recommend it.
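
For a pipeline that stages inputs explicitly, the recommended inputs (RunInfo.xml plus the InterOp directory) can be checked before interop is called. The following is a minimal sketch using only the standard library; missing_inputs is a hypothetical helper, and because the exact list of required *.bin files is not reproduced in this thread, it only checks that at least one *.bin file is present.

```python
from pathlib import Path


def missing_inputs(run_folder):
    """Return a list of problems with the minimal inputs; empty if none found.

    Checks only for RunInfo.xml and a non-empty InterOp directory. Which
    specific *.bin files are required is not verified here.
    """
    run_folder = Path(run_folder)
    problems = []
    if not (run_folder / "RunInfo.xml").is_file():
        problems.append("RunInfo.xml not found")
    interop_dir = run_folder / "InterOp"
    if not interop_dir.is_dir():
        problems.append("InterOp directory not found")
    elif not any(interop_dir.glob("*.bin")):
        problems.append("InterOp directory contains no *.bin files")
    return problems


if __name__ == "__main__":
    import sys

    for problem in missing_inputs(sys.argv[1]):
        print("Problem:", problem)
```

A staged directory containing only RunInfo.xml and InterOp/, as in the listing at the top of this issue, would pass this check.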