Long-read Repeat Element pipeline

for PacBio single-cell MAS-seq data

LoRE workflow

[Figure: minimal workflow]

[Figure: maximal workflow]

Clone the repository:

git clone https://github.com/siebrenf/lore.git

Create the conda environment:

conda env create -n lore -f lore/requirements.yaml
conda activate lore

Install LoRE in the conda environment:

pip install -e ./lore

Change directory into the LoRE folder.

Activate the conda environment:

conda activate lore

Update the config.yaml.

Adapters, primers and barcodes for the 5' Kinnex kit can be downloaded by LoRE, or can be placed inside the results directory.
The results directory (as well as any other directory) can be set in the config. The default results directory is ./results.
The genome and gene annotation need to be obtained manually. You will need to specify their locations in the config, as well as the symbol for the mitochondria.
- If the pigeon classify output shows a low number of reads per cell, the genome and/or gene annotation may be incomplete.
Additional documentation for most rules (steps) in the workflow can be found in the code.
Optional outputs currently include bigwigs (for track visualization) and a QC report. Both are recommended, but add some computational load.
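The reference requirements above can be checked before starting a run. Below is a minimal sketch, not part of LoRE. It assumes an uncompressed genome FASTA, and it assumes the mitochondrial symbol is the name of the mitochondrial sequence in that FASTA (LoRE's config may define it differently). The function names are hypothetical.

```python
from pathlib import Path


def fasta_sequence_names(fasta_path):
    """Return the first word of every '>' header line in a FASTA file."""
    names = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    names.append(parts[0])
    return names


def check_reference(genome, annotation, mito_symbol):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Both files have to be obtained manually, so check that they exist.
    for label, path in (("genome", genome), ("gene annotation", annotation)):
        if not Path(path).is_file():
            problems.append(f"{label} not found: {path}")
    # Assumption: the mitochondrial symbol matches a sequence name in the genome.
    if Path(genome).is_file() and mito_symbol not in fasta_sequence_names(genome):
        problems.append(f"mitochondrial symbol {mito_symbol!r} not found in {genome}")
    return problems
```

Calling `check_reference` with the genome path, annotation path and mitochondrial symbol from your config returns the problems as readable messages, so they can be fixed before the dry run.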

Test your config:

snakemake --snakefile lore/Snakefile --configfile config.yaml --dry-run

Run your config:

nice snakemake --use-conda --snakefile lore/Snakefile --configfile config.yaml --resources parallel_downloads=1 mem_mb=100_000 -j 60 > log.txt 2>&1
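The dry run and the real run can also be chained from Python, so the real run starts only if the dry run succeeds. This is a sketch, not part of LoRE. It reuses the flags from the two commands above, and the function names and defaults are hypothetical.

```python
import subprocess


def snakemake_command(dry_run=False, cores=60, mem_mb=100_000,
                      snakefile="lore/Snakefile", configfile="config.yaml"):
    """Build the snakemake call from this README as an argument list."""
    if dry_run:
        return ["snakemake", "--snakefile", snakefile,
                "--configfile", configfile, "--dry-run"]
    return ["nice", "snakemake", "--use-conda",
            "--snakefile", snakefile, "--configfile", configfile,
            "--resources", "parallel_downloads=1", f"mem_mb={mem_mb}",
            "-j", str(cores)]


def run_workflow(cores=60, mem_mb=100_000, log_path="log.txt"):
    """Dry-run first; start the real run only if the dry run succeeds."""
    # check=True raises CalledProcessError if the dry run fails.
    subprocess.run(snakemake_command(dry_run=True), check=True)
    # Send stdout and stderr to the log file, like '> log.txt 2>&1'.
    with open(log_path, "w") as log:
        result = subprocess.run(snakemake_command(cores=cores, mem_mb=mem_mb),
                                stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Calling `run_workflow(cores=60, mem_mb=100_000)` from the directory that holds config.yaml mirrors the shell commands. Lower both values on a smaller machine.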

To do: implement TE/RE detection using the output of either:
- isoseq_groupdedup (an unaligned FASTA and BAM file)
- pbmm2_align (an aligned BAM file)
  - current settings:
    - multimapped reads are included: reads are assigned to any number of locations (nice for TEs).
    - unmapped reads are included.
- repositories of interest for this purpose are listed below.
learn more about the pigeon classify filter settings (for the gene level, some filters may be more lenient).
integrate genomepy to get a genome & gene annotation.
- figure out the requirements for a "good" reference genome & gene annotation.
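To see how the pbmm2_align settings noted above show up in practice, the SAM flag field of the aligned reads can be tallied. This is a sketch, not part of LoRE. It assumes the BAM has been converted to SAM text (for example with samtools view), and it assumes multimapped reads appear as secondary or supplementary records. The flag bits come from the SAM specification, and the function names are hypothetical.

```python
from collections import Counter

# Flag bits defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_flag(flag):
    """Map a SAM flag integer to a coarse alignment type."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SECONDARY:
        return "secondary"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    return "primary"


def count_alignment_types(sam_lines):
    """Count alignment types over an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[classify_flag(int(fields[1]))] += 1
    return counts
```

For example, pipe the output of `samtools view` on the aligned BAM into a script that calls `count_alignment_types(sys.stdin)`. A high share of secondary records would confirm that multimapped reads are being kept.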

de novo Repeat library construction pipelines
- https://doi.org/10.1186/s12864-021-08117-9
  - https://github.com/kacst-bioinfo-lab/TE_ideintification_pipeline # nice workflow figure
  - could be useful for non-model organisms/strains
RE/TE pipelines
- SoloTE
  - easy to use: use pbmm2 output
  - https://doi.org/10.1038/s42003-022-04020-5
  - https://github.com/bvaldebenitom/SoloTE
- TrEMOLO
  - looks good + snakemake!
  - https://doi.org/10.1186/s13059-023-02911-2
  - https://github.com/DrosophilaGenomeEvolution/TrEMOLO
- tldr
  - looks easy to use
  - https://doi.org/10.1016/j.molcel.2020.10.024
  - https://github.com/adamewing/tldr
  - works with "partial detection, assembly and annotation of non-reference TE insertions"
- TELR
  - looks good! pure python
  - https://doi.org/10.1093/nar/gkac794
  - https://github.com/bergmanlab/TELR
  - "relies on the detection of abnormally mapped reads upon the reference genome that can be linked to a TE"
- teNanoporePipeline
  - may be usable
  - https://doi.org/10.1016/j.isci.2023.108214
  - https://github.com/javiercguard/teNanoporePipeline
  - Tools: RepeatMasker, Dfam
- LoRTE
  - not usable (python 2!)
  - https://doi.org/10.1186/s13100-017-0088-x
  - dead link