COMBINE-lab/salmon

Clarity on how 'library_types' is inferred for meta_info.json

Closed this issue · 4 comments

Hi!

I'm trying to tidy up some of the strandedness analysis in nf-core/rnaseq, and have gone down the rabbit hole a little bit, hoping you can shortcut things for me.

nf-core/rnaseq's test profile runs Salmon on subsampled FASTQ files and uses Salmon to infer the strandedness, by extracting the library_types from a meta_info.json.

In an example run (~50k reads) this is unstranded (IU), but I'm having trouble reconciling with what I see in the corresponding lib_format_counts.json:

{
    "read_files": "[ WT_REP2.subsampled_R1.fastq.gz, WT_REP2.subsampled_R2.fastq.gz]",
    "expected_format": "IU",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 39909,
    "num_assigned_fragments": 39909,
    "num_frags_with_concordant_consistent_mappings": 41509,
    "num_frags_with_inconsistent_or_orphan_mappings": 3363,
    "strand_mapping_bias": 0.11922715555662628,
    "MSF": 0,
    "OSF": 0,
    "ISF": 4949,
    "MSR": 0,
    "OSR": 0,
    "ISR": 36560,
    "SF": 1754,
    "SR": 1609,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}

Is it possible for me to do my own inference of IU from those numbers, or does it happen elsewhere? Is there a minimum number of reads/ mappings required to make this assessment, such that 'IU' is just a default?

I'm hoping to derive my own logic, so that we can harmonise it with the calculations we make from RSeQC's output, for example.

I've been poking through LibraryTypeDetector without much joy, if I replicate that logic in Python I get 'ISR', not 'IU'.

To partially answer my own question, I think you need 50k reads to get a proper strandedness assessment, and I don't think it's configurable.

Hey @pinin4fjords, did you fully figure it out? I'm having the exact same question while implementing my own RNA-seq pipeline that closely resembles nf-core's one.

@tdsone I think this is the right logic: https://github.com/COMBINE-lab/salmon/blob/master/include/LibraryTypeDetector.hpp.

The main source of my confusion on this post was that I think Salmon just chucks back 'IU' for read numbers below 50k. It confused me less on realistic read numbers.

I've simplified quite a bit to just work with the strandedness component, using the numbers from lib_format_counts.json. It seems to produce results broadly as expected, but might be a bit naive, for example the numbers are mappings rather than fragments which could throw things off a bit.

Ah, yes - that makes sense. Thank you.