T2T-chm13-chrX

Generate single copy k-mer sites in a given assembly, as described in the Miga et al. 2019 T2T chrX paper.

These locus were used for marker assisted mapping to avoid mapping biases due to assembly errors. The script provided here generates marker sites in the assembly where the k-mer is unique (found only once) in the assembly and is coming from the single copy peak from the Illumina read set (collected from barcode trimmed 10X Genomics reads).

Memory and CPU requirements

  • mem: 10~12 Gb

  • cpu: will use available cpus, up to 64

  1. Download the CHM13 single copy k-mer meryl database from here.
# Untar
tar xvf CHM13.10X.k21.gt5.lt58.meryl.tar

# Give a more meaningful name
ln -s CHM13.10X.k21.gt5.lt58.meryl 10X_single.meryl
  1. Get the meryl v1.0

  2. Run collect_single_copy_kmers.sh, which does

  • Collect unique k-mers in the assembly
meryl equal-to 1 k=21 [ count memory=12 asm.fasta output asm.meryl ] output asm_1.meryl
  • Intersect asm_1.meryl with single copy k-mers from the read set
meryl intersect 10X_single.meryl asm_1.meryl output markers.meryl
  • Lookup and dump the positions
# Run this for asm.fasta or on each chromosome ($chr) to speed up
meryl-lookup -dump -memory 12 -sequence $chr.fasta -mers markers.meryl |\
  awk -F "\t" '$(NF-4)=="T" {print $1"\t"$(NF-5)"\t"$(NF-5)+21"\t"($(NF-2)+$NF)}' \
  > ${chr}_markers.bed

Once markers.meryl are generated, collect_single_copy_kmers.sh can be run in parallel for each sequence(contig or scaffold) in asm.fasta. When no specific sequence ID ($chr) is provided, it may take several hours to generate the complete markers.bed.

  1. Optionally, generate marker density track loadable to IGV with IGVtools
cat *_markers.bed > markers.bed
igvtools count markers.bed markers.tdf asm.fasta.fai

The last column in markers.bed contains the k-mer frequency found in the Illumina read set. This can be used to further stratify more confident markers.