ftwkoopmans/msdap

Can I keep protein groups separate?

ht-lau opened this issue · 1 comments

ht-lau commented

Hi,
I think it is more appropriate to open a new thread on this.

I wonder if there is a way to keep protein group separate between ambiguous peptides. For example,

peptide A, GRIA1; GRIA2
peptide B, GRIA1

remove_ambiguous_proteingroups = FALSE will output GRIA1;GRIA2
remove_ambiguous_proteingroups = TRUE will output GRIA1

I would like to know if there is a way that I can keep both GRIA1 and GRIA1;GRIA2 in the output, even if I will have to manipulate the input. The ability to keep them apart can be very valuable, as shown in these recent manuscripts:

https://doi.org/10.1038/s41467-023-41558-2
https://doi.org/10.1101/2023.09.19.558203

Thanks
HT

I'm not sure exactly what you are asking, so I will try to disentangle things by walking through how MS-DAP deals with peptide-to-protein mappings from A to Z.

  1. raw data processing software identifies peptides and performs protein inference; this software applies some algorithm (e.g. based on IDPicker or ProteinProphet) to assign observed peptides to proteingroups. At this point, decisions are made on how to deal with unique and shared peptides (e.g. "razor" peptides might be assigned to one proteingroup in a winner-takes-all approach). How exactly this is done depends on the respective software (DIA-NN / Spectronaut / MaxQuant / FragPipe / etc.).

  2. when you load your dataset into MS-DAP, the peptide-to-proteingroup assignments are used as-is. So the "protein_id" assigned to each peptide/precursor in the MS-DAP peptide data table is the exact same as provided by upstream software. So if upstream software states that peptide X is assigned to "GRIA1" and peptide Y to "GRIA1;GRIA2" we assume that is correct / makes sense.

Note that MS-DAP cannot know at this point which peptides are assigned to proteingroup A because they are unique to the respective protein, and which are "razor peptides" that are assigned to A for lack of other evidence. Hence, we use all input data as-is and currently offer no further control over peptide-to-protein assignments.

  3. differential expression analysis (DEA) in MS-DAP yields proteingroup-level statistics (log2fc and p-value) for each statistical contrast. For each proteingroup in the output (e.g. differential_abundance_analysis.xlsx) the respective set of unique gene symbols is shown (gene_symbols_or_id column). Importantly, this is still consistent with the input data: for each row/proteingroup, the results are based on the subset of peptides originally assigned to this proteingroup that also pass all filtering rules that you defined, such as filter_min_detect.

Note that at this stage, we are looking at proteingroup-level statistics (e.g. output from DEqMS or MSqRob). Peptide information is lost at this point (i.e. the proteingroup info is a summary of respective peptides).

  4. if you want to summarize/simplify the DEA results, you may use the new summarise_stats function introduced with MS-DAP 1.0.6 (see its documentation page). All this function does is filter and summarise the DEA results (which only contain proteingroup-level information). For example, if you set remove_ambiguous_proteingroups=TRUE it will filter the DEA table and simply remove all rows where the gene_symbols_or_id column contains multiple gene symbols.
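To make the filter in the last step concrete, here is a minimal Python sketch of the logic as described, not MS-DAP's actual R implementation. The toy rows and the helper names `is_ambiguous` and `drop_ambiguous` are hypothetical; only the column name gene_symbols_or_id and the option remove_ambiguous_proteingroups come from the text above, and the ';' separator is an assumption based on the GRIA1;GRIA2 example.

```python
# Illustrative only: mimics the described row filter on a toy DEA table.
# Assumption: multiple gene symbols in gene_symbols_or_id are ';'-separated.

def is_ambiguous(gene_symbols_or_id: str) -> bool:
    """True when the field lists more than one distinct gene symbol."""
    symbols = {s.strip() for s in gene_symbols_or_id.split(";") if s.strip()}
    return len(symbols) > 1


def drop_ambiguous(dea_rows, remove_ambiguous_proteingroups=False):
    """Hypothetical helper: optionally drop rows with multiple gene symbols."""
    if not remove_ambiguous_proteingroups:
        return list(dea_rows)
    return [r for r in dea_rows if not is_ambiguous(r["gene_symbols_or_id"])]


# Toy proteingroup-level table (statistics columns omitted for brevity)
dea_rows = [
    {"gene_symbols_or_id": "GRIA1"},
    {"gene_symbols_or_id": "GRIA1;GRIA2"},
]

kept_all = drop_ambiguous(dea_rows, remove_ambiguous_proteingroups=False)
kept_unique = drop_ambiguous(dea_rows, remove_ambiguous_proteingroups=True)

print([r["gene_symbols_or_id"] for r in kept_all])
print([r["gene_symbols_or_id"] for r in kept_unique])
```

In this sketch, both rows survive when the option is off, and only the GRIA1 row survives when it is on.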

Specific to your question:

> I wonder if there is a way to keep protein group separate between ambiguous peptides. For example,
> peptide A, GRIA1; GRIA2
> peptide B, GRIA1

If the upstream software assigned peptides A and B in your example to different proteingroups, they will also be in separate proteingroups throughout MS-DAP. As mentioned above, we use the peptide-to-proteingroup assignments from the dataset you import from DIA-NN/FragPipe/etc. as-is.
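As a rough illustration of what "as-is" means for this example, the Python sketch below groups peptides by the exact protein_id string they arrive with. It is a toy model, not MS-DAP code: the table layout and the helper `group_as_is` are hypothetical, and only the protein_id column name comes from the explanation above.

```python
from collections import defaultdict

# Toy peptide table; protein_id values are taken verbatim from upstream software.
peptides = [
    {"peptide": "A", "protein_id": "GRIA1;GRIA2"},
    {"peptide": "B", "protein_id": "GRIA1"},
]


def group_as_is(peptide_rows):
    """Group peptides by the exact protein_id string; no splitting or merging."""
    groups = defaultdict(list)
    for row in peptide_rows:
        groups[row["protein_id"]].append(row["peptide"])
    return dict(groups)


proteingroups = group_as_is(peptides)
print(proteingroups)
```

Because the grouping key is the unmodified string, "GRIA1" and "GRIA1;GRIA2" remain two distinct proteingroups, each with its own peptides.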

> I would like to know if there is a way that I can keep both GRIA1 and GRIA1;GRIA2 in the output

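They are by default: both proteingroups stay in the DEA output unless you filter the ambiguous one out yourself, e.g. by setting remove_ambiguous_proteingroups=TRUE in summarise_stats.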