Issue with repLoad and repGermline when using O. mykiss IMGT data
bengarcia015 opened this issue · 2 comments
🐛 Bug
I am currently trying to analyze BCR sequencing data, processing it with IMGT/HighV-QUEST before using immunarch for further analysis. I should note that the data are from rainbow trout (Oncorhynchus mykiss) and that I am attempting to use the BCR pipeline, so I am not sure whether this is causing any of the issues I am encountering. This is my first time reporting a bug, so I apologize if there is too much or not enough information, or if I missed anything! Overall, the package is working beautifully, except for a couple of issues:
First, when using repLoad on the .airr files that were exported from IMGT, I receive a number of warnings, copied below. The AIRR file is downloaded directly from IMGT, with no modifications:
l12_all <- repLoad('/Users/benjamingarcia/Documents/imgt_airr/import/PBS_24_NC.tsv')
== Step 1/3: loading repertoire files... ==
Processing "" ...
-- [1/1] Parsing "/Users/benjamingarcia/Documents/imgt_airr/import/PBS_24.tsv" -- airr
[!] Removed 10553 clonotypes with no nucleotide and amino acid CDR3 sequence.
[!] Warning: found NAs in clonal counts. Setting them to 1's.
== Step 2/3: checking metadata files and merging files... ==
Processing "" ...
-- Metadata file not found; creating a dummy metadata...
== Step 3/3: processing paired chain data... ==
Done!
Warning messages:
1: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
2: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
3: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
4: In validate_tabular(data, schema = schema) :
Warning: rev_comp is not logical for row(s): 1, 3, 4, 5, 6, 9, 10, 12, 13, 14, 17, 19, 20, 22, 24, 25, 31, 32, 34, 38, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 64, 65, 66, 68, 70, 73, 74, 77, 78, 83, 84, 86, 87, 90, 94, 95, 97, 98, 99, 100, 101, 104, 105, 106, 107, 110, 111, 117, 119, 120, 121, 122, 127, 128, 129, 130, 131, 132, 138, 139, 141, 142, 143, 146, 148, 151, 154, 156, 158, 159, 161, 162, 163, 164, 166, 167, 168, 169, 171, 173, 174, 175, 176, 178, 179, 182, 183, 185, 187, 190, 191, 192, 194, 195, 200, 201, 202, 203, 204, 205, 207, 209, 215, 217, 218, 219, 221, 222, 224, 225, 226, 227, 228, 229, 231, 233, 234, 236, 237, 239, 240, 243, 247, 250, 251, 252, 253, 255, 257, 259, 263, 265, 268, 270, 271, 272, 274, 275, 278, 280, 281, 282, 283, 285, 286, 289, 290, 291, 292, 293, 294, 295, 299, 300, 301, 304, 305, 306, 308, 309, 312, 313, 314, 315, 316, 318, 319, 321, 322, 325, 326, 327, 328, 329, 333, 335, 339, 340, 341, 348, 353, 355, 356, 357, 358, 360, 368, 3 [... truncated]
5: In validate_tabular(data, schema = schema) :
Warning: productive is not logical for row(s): 1, 3, 4, 5, 6, 9, 10, 12, 13, 14, 17, 19, 20, 22, 24, 25, 31, 32, 34, 38, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 64, 65, 66, 68, 70, 73, 74, 77, 78, 83, 84, 86, 87, 90, 94, 95, 97, 98, 99, 100, 101, 104, 105, 106, 107, 110, 111, 117, 119, 120, 121, 122, 127, 128, 129, 130, 131, 132, 138, 139, 141, 142, 143, 146, 148, 151, 154, 156, 158, 159, 161, 162, 163, 164, 166, 167, 168, 169, 171, 173, 174, 175, 176, 178, 179, 182, 183, 185, 187, 190, 191, 192, 194, 195, 200, 201, 202, 203, 204, 205, 207, 209, 215, 217, 218, 219, 221, 222, 224, 225, 226, 227, 228, 229, 231, 233, 234, 236, 237, 239, 240, 243, 247, 250, 251, 252, 253, 255, 257, 259, 263, 265, 268, 270, 271, 272, 274, 275, 278, 280, 281, 282, 283, 285, 286, 289, 290, 291, 292, 293, 294, 295, 299, 300, 301, 304, 305, 306, 308, 309, 312, 313, 314, 315, 316, 318, 319, 321, 322, 325, 326, 327, 328, 329, 333, 335, 339, 340, 341, 348, 353, 355, 356, 357, 358, 360, 368, [... truncated]
6: In validate_tabular(data, schema = schema) :
Warning: complete_vdj is not logical for row(s): 1, 3, 4, 5, 6, 9, 10, 12, 13, 14, 17, 19, 20, 22, 24, 25, 31, 32, 34, 38, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 64, 65, 66, 68, 70, 73, 74, 77, 78, 83, 84, 86, 87, 90, 94, 95, 97, 98, 99, 100, 101, 104, 105, 106, 107, 110, 111, 117, 119, 120, 121, 122, 127, 128, 129, 130, 131, 132, 138, 139, 141, 142, 143, 146, 148, 151, 154, 156, 158, 159, 161, 162, 163, 164, 166, 167, 168, 169, 171, 173, 174, 175, 176, 177, 178, 179, 182, 183, 185, 187, 190, 191, 192, 194, 195, 200, 201, 202, 203, 204, 205, 207, 209, 215, 217, 218, 219, 221, 222, 224, 225, 226, 227, 228, 229, 231, 233, 234, 236, 237, 239, 240, 243, 247, 250, 251, 252, 253, 255, 257, 259, 263, 265, 268, 270, 271, 272, 274, 275, 278, 280, 281, 282, 283, 285, 286, 289, 290, 291, 292, 293, 294, 295, 299, 300, 301, 304, 305, 306, 308, 309, 312, 313, 314, 315, 316, 318, 319, 321, 322, 325, 326, 327, 328, 329, 333, 335, 339, 340, 341, 348, 353, 355, 356, 357, 358, 36 [... truncated]
7: In validate_tabular(data, schema = schema) :
Warning: vj_in_frame is not logical for row(s): 1, 3, 4, 5, 6, 9, 10, 12, 13, 14, 17, 19, 20, 22, 24, 25, 31, 32, 34, 38, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 64, 65, 66, 68, 70, 73, 74, 77, 78, 83, 84, 86, 87, 90, 94, 95, 97, 98, 99, 100, 101, 104, 105, 106, 107, 110, 111, 117, 119, 120, 121, 122, 127, 128, 129, 130, 131, 132, 138, 139, 141, 142, 143, 146, 148, 151, 154, 156, 158, 159, 161, 162, 163, 164, 166, 167, 168, 169, 171, 173, 174, 175, 176, 177, 178, 179, 182, 183, 185, 187, 190, 191, 192, 194, 195, 200, 201, 202, 203, 204, 205, 207, 209, 215, 217, 218, 219, 221, 222, 224, 225, 226, 227, 228, 229, 231, 233, 234, 236, 237, 239, 240, 243, 247, 250, 251, 252, 253, 255, 257, 259, 263, 265, 268, 270, 271, 272, 274, 275, 278, 280, 281, 282, 283, 285, 286, 289, 290, 291, 292, 293, 294, 295, 299, 300, 301, 304, 305, 306, 308, 309, 312, 313, 314, 315, 316, 318, 319, 321, 322, 325, 326, 327, 328, 329, 333, 335, 339, 340, 341, 348, 353, 355, 356, 357, 358, 360 [... truncated]
8: In validate_tabular(data, schema = schema) :
Warning: stop_codon is not logical for row(s): 1, 3, 4, 5, 6, 9, 10, 12, 13, 14, 17, 19, 20, 22, 24, 25, 31, 32, 34, 38, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 64, 65, 66, 68, 70, 73, 74, 77, 78, 83, 84, 86, 87, 90, 94, 95, 97, 98, 99, 100, 101, 104, 105, 106, 107, 110, 111, 117, 119, 120, 121, 122, 127, 128, 129, 130, 131, 132, 138, 139, 141, 142, 143, 146, 148, 151, 154, 156, 158, 159, 161, 162, 163, 164, 166, 167, 168, 169, 171, 173, 174, 175, 176, 178, 179, 182, 183, 185, 187, 190, 191, 192, 194, 195, 200, 201, 202, 203, 204, 205, 207, 209, 215, 217, 218, 219, 221, 222, 224, 225, 226, 227, 228, 229, 231, 233, 234, 236, 237, 239, 240, 243, 247, 250, 251, 252, 253, 255, 257, 259, 263, 265, 268, 270, 271, 272, 274, 275, 278, 280, 281, 282, 283, 285, 286, 289, 290, 291, 292, 293, 294, 295, 299, 300, 301, 304, 305, 306, 308, 309, 312, 313, 314, 315, 316, 318, 319, 321, 322, 325, 326, 327, 328, 329, 333, 335, 339, 340, 341, 348, 353, 355, 356, 357, 358, 360, 368, [... truncated]
9: In .which_recomb_type(df[[IMMCOL$v]]) :
Can't determine the type of V(D)J recombination. No insertions will be presented in the resulting data table.
The organization of the O. mykiss VDJ loci is different to that of mammals, so it is certainly possible that this is causing some of the issues with the "not logical" values being reported.
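One way to narrow down the "not logical" warnings is to look at the raw text in those columns before any type guessing happens. The AIRR TSV format encodes booleans as T/F (true/false are also accepted), so anything else, including blank cells, is a candidate cause. The following is only a minimal sketch in base R: the temporary file and its rows are hypothetical stand-ins for the IMGT export, and the list of accepted spellings is an assumption, not the validator's actual rule.

```r
# Hypothetical stand-in for a few rows of an IMGT-exported AIRR file.
airr_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "sequence_id\trev_comp\tproductive\tvj_in_frame\tstop_codon",
  "seq1\tF\tT\tT\tF",
  "seq2\t\t\t\t",
  "seq3\tF\tF\tF\tT"
), airr_path)

# Read every column as text so nothing is coerced before we look at it.
raw <- read.delim(airr_path, colClasses = "character", quote = "",
                  check.names = FALSE)

logical_cols <- c("rev_comp", "productive", "vj_in_frame", "stop_codon")

# Spellings treated as valid booleans here (an assumption for this sketch).
allowed <- c("T", "F", "TRUE", "FALSE", "true", "false")

# Row numbers whose value is not one of the accepted spellings.
bad_rows <- lapply(raw[logical_cols], function(x) which(!x %in% allowed))

# Tabulate what is actually stored in each column.
value_counts <- lapply(raw[logical_cols], table)

print(bad_rows)
print(value_counts)
```

If the flagged rows turn out to be blank in every boolean column, the warnings would point to sequences that IMGT could not annotate rather than to anything specific to the O. mykiss loci.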
The next issue I have encountered arises with repGermline. The code and outcome are listed below:
repGermline(l12_all$data, .species = 'OncorhynchusMykiss')
Error in `map2()`:
ℹ In index: 1.
ℹ With name: PBS_24.
Caused by error in `merge_reference_sequences()`:
! After merging with reference, the data from sample PBS_24 is empty.
There were no valid alleles in the data!
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: In validate_mandatory_columns(., sample_name) :
437 rows from 5796 in sample PBS_24 were dropped because of missing values in mandatory columns FR1.nt, CDR1.nt, FR2.nt, CDR2.nt, FR3.nt, CDR3.nt, FR4.nt!
2: In merge_reference_sequences(., reference, "V", species, sample_name) :
Genes or alleles Oncmyk_Ar IGHV1D-1201 F, Oncmyk_Ar IGHV11-28-301 F, Oncmyk_Ar IGHV6-3101 ORF, Oncmyk_Ar IGHV8-39-201 F, Oncmyk_Sw IGHV1-1801 F, Oncmyk_Sw IGHV11-2501 F, Oncmyk_Ar IGHV10D-702 F, Oncmyk_Sw IGHV1-4201 F, Oncmyk_Ar IGHV1-2102 F, Oncmyk_Ar IGHV1-39-501 F, Oncmyk_Ar IGHV1-1301 F, Oncmyk_Sw IGHV6D-7601 F, Oncmyk_Ar IGHV8-4601 F, Oncmyk_Ar IGHV3-2002 F, Oncmyk_Sw IGHV1D-7301 F, Oncmyk_Sw IGHV12D-5601 F, Oncmyk_Sw IGHV3D-3001 ORF, Oncmyk_Ar IGHV1D-14-301 F, Oncmyk_Ar IGHV7D-17-102 F, Oncmyk_Ar IGHV1-47-401 F, Oncmyk_Ar IGHV16-3701 F, Oncmyk_Ar IGHV9-1502 F, Oncmyk_Ar IGHV9D-202 F, Oncmyk_Sw IGHV15-4801 P, Oncmyk_Sw IGHV1-2101 P, Oncmyk_Ar IGHV6-402 F, Oncmyk_Sw IGHV16-1401 ORF, Oncmyk_Sw IGHV1-201 F, Oncmyk IGHV8-502 F, Oncmyk_Sw IGHV6D-4001 F, Oncmyk_Ar IGHV8-1902 ORF, Oncmyk_Ar IGHV2-801 F, Oncmyk_Ar IGHV2-2802 F, Oncmyk_Ar IGHV11-47-501 ORF, Oncmyk_Sw IGHV9-2301 F, Oncmyk_Sw IGHV9-1501 F, Oncmyk_Ar IGHV12D-7102 F, Oncmyk_Ar IGHV16-39-30 [... truncated]
I have tried going through and modifying the names to remove the "Oncmyk_Sw " and "Oncmyk_Sw" from the names, as well as everything except for the core portion of the locus name, to no avail.
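For anyone trying to reproduce the renaming step, here is a rough sketch of how the labels from the warning could be reduced to the core gene/allele name in base R. The helper name `strip_imgt_label` and the `reference_names` vector are hypothetical, and whether the stripped names match the reference that repGermline uses is exactly the open question, so the comparison at the end is only a way to check, not a fix.

```r
# Drop a leading species/strain prefix such as "Oncmyk_Ar " or "Oncmyk "
# and a trailing functionality label (F, ORF or P) from an IMGT gene label.
strip_imgt_label <- function(x) {
  x <- sub("^[A-Za-z]+(_[A-Za-z]+)? +", "", x)
  sub(" +(F|ORF|P)$", "", x)
}

# Labels copied from the warning message above.
calls <- c("Oncmyk_Ar IGHV1D-1201 F",
           "Oncmyk_Sw IGHV3D-3001 ORF",
           "Oncmyk IGHV8-502 F")

core <- strip_imgt_label(calls)

# Hypothetical reference names, only to illustrate the comparison.
reference_names <- c("IGHV1D-1201", "IGHV8-502")

# Names present in the data but absent from the reference.
unmatched <- setdiff(core, reference_names)

print(core)
print(unmatched)
```

Printing the unmatched names next to a few names from the actual reference should show whether the mismatch is in the prefix, the functionality label, or the allele notation itself.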
Thank you for any help you are able to offer, and please let me know if there is any further information/data that you need to help diagnose this!
Ben
Hi Ben, for BCR analysis we need nucleotide data, but it seems like there isn't any in your IMGT data. Is this correct?
Is there a way to export IMGT files in AIRR Standard Format instead of the format that IMGT uses?
I have imported both formats, including the IMGT-exported AIRR formatted data, and it seems like that may have solved some of the issues. However, I am still running into one main error when loading the data:
[!] Warning: found NAs in clonal counts. Setting them to 1's.
I think the error may initially have been caused by my use of pRESTO as the pre-IMGT cleaning method, which collapses the counts and puts them into the sequence name. I have removed them from the name and put them into the "consensus_count" column of the AIRR-formatted file. Is there another column where I should put these counts so that they are loaded into the "clones" entry of the immunarch object?
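In the AIRR Rearrangement schema, the field meant for copy number is `duplicate_count`, so one thing worth trying is copying the collapsed counts into that column before loading. This is a sketch under the assumption that immunarch's AIRR parser takes clonal counts from `duplicate_count`; I have not confirmed that against the source. The file contents below are hypothetical.

```r
# Hypothetical AIRR-style file with the collapsed counts in consensus_count.
in_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "sequence_id\tjunction\tconsensus_count",
  "seq1\tTGTGCG\t12",
  "seq2\tTGTGCA\t3"
), in_path)

# Read everything as text so other columns (e.g. T/F flags) are written
# back unchanged; blank cells become NA and are written back as blanks.
tab <- read.delim(in_path, colClasses = "character", quote = "",
                  na.strings = "", check.names = FALSE)

# Copy the counts into the AIRR copy-number field.
tab$duplicate_count <- tab$consensus_count

out_path <- tempfile(fileext = ".tsv")
write.table(tab, out_path, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")

# Re-read the result to confirm the new column is present.
check <- read.delim(out_path, colClasses = "character", quote = "",
                    na.strings = "", check.names = FALSE)
print(check)
```

The rewritten file can then be passed to repLoad in place of the original to see whether the "found NAs in clonal counts" warning goes away.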