We devised a methodology to reconstruct the genome of a given wheat accession (query accession) using the provided genomes of the set of the wheat accessions (database accessions) based on the IBSpy haplotype data (https://github.com/Uauy-Lab/IBSpy). For a given genome fragment, this method could find the potentially original place among the database.
The methodology (MLP finding) comprises the following steps:
The k-mers of the bread wheat reference sequence to the k-mers of each database were compared and counted the number of variations within each 50 kb window with the Identity-by-State (IBS) Python pipeline (IBSpy: https://github.com/Uauy-Lab/IBSpy), and then 1 Mb haplotypes were assigned with the affinity propagation clustering technique (with a window size of 1 Mb) based on IBSpy values computed from 20 consecutive 50-kb windows (IBSpy: https://github.com/Uauy-Lab/IBSpy).
Pedigree tracing was employed to track the ancestry of query accession among the database accessions. Building upon the haplotype assignments in non-overlapping 1 Mb windows, we reconstructed each query's genome using extended matched haplotype blocks from the database accessions, employing an in silico strategy featuring a minimum tiling path approach.
On a per-chromosome basis, we began by comparing the haplotypes of the query's first window with the corresponding chromosome's haplotypes in each database accession. The length of Matched Haplotype Windows (MHW) from each database accession was noted. The database accession with the longest MHW was identified as the potential progenitor contributing to that genomic region. Subsequent MHWs from other database accessions were discarded.
The comparison process was iterated from the second window onward. At each step, we identified the longest MHW and its associated database accession. This iterative comparison continued window by window until covering the entire chromosome.
We then aligned the physical positions of identified MHWs from each window. Overlapping MHWs were removed, retaining those forming the minimum tiling path for reconstructing the chromosome. The reconstructed path allowed observation of the Mosaic composition of the query's genome originating from the database accessions. This strategic process delineates the closest database accession at a 1 Mb resolution for any given genomic region within the query accession. The contribution percentage of each Watkins line to a cultivar was quantified as the ratio of its total MHWs to the entire genome windows.
The Python libraries gzip, math, re, np from numpy were needed for running the MLP_finding.py.
MLP_finding has relatively few options, you can look at them with the --help
command.
python MLP_finding.py --help
usage: MLP_finding.py [-h] [-i HAPLPTYPE_FILE] [-r REFERENCE] [-q QUERY_FILE] [-t TARGET_FILE] [-c CHR_LENGTH] [-w WINDOW_SIZE] [-o OUTPUT_PATH]```
optional arguments:
-h, --help # Show this help message and exit
-i HAPLPTYPE_FILE, --haplptype_file HAPLPTYPE_FILE # The haplotype assignment file, including the haplotypes of the query accessions and database accessions
-r REFERENCE, --reference REFERENCE # The IBSpy values and haplotypes were generated based which pangenome
-q QUERY_FILE, --query_file QUERY_FILE # Query accessions. One line for each query accession
-t TARGET_FILE, --target_file TARGET_FILE # The database accessions. One line for each database accession
-c CHR_LENGTH, --chr_length CHR_LENGTH # The chromosome length of the 10+ pangenome which was provided in the demo folder
-w WINDOW_SIZE, --window_size WINDOW_SIZE # The window size used for the haplotype assignment in IBSpy
-o OUTPUT_PATH, --output_path OUTPUT_PATH # Output path
In the demo/results folder, there are seven files were generated by MLP_finding.py with the commands:
python MLP_finding.py -i haplotype.tsv.gz -r chinese -q query_samples.txt -t target_samples.txt -c chr_length.tsv -w 1000000 -o ./results
1) matrix_MLP_blocks.txt shows the MLP fragments.
2) matrix_contribution_percentage.txt shows the contribution percentage of database accessions to each query accession.
3) matrix_cumulative_contribution_percentage.txt shows the cumulative contribution percentages of database accessions to each query accession. The database accessions were ranked by their contribution percentages, and the contribution percentage was calculating with the database accession-specific fragments contributed by this accession. For example, the first and second accession contribute 10% and 9% of the query's genome respectively. The 6% of fragments contributed by the second accession was totally covered by the fragments contributed by the first accession. Then values for them in this file were 10% and 3%, respectively.
4) matrix_cumulative_contribution_percentage_sample.txt shows the names of the database accessions in the relative positions of the matrix_cumulative_contribution_percentage.txt.
5) hap_num_window.txt shows the haplotype numbers in each window.
6) standerd_genotype.txt shows the transformed haplotype format used for MLP_finding.py.
7) weighted_matrix.txt shows the penalty score coefficients used for the haplotype comparison and smooth.