Scripts to call L1 insertions from single-cell ATLAS-seq experiments.
- cutadapt (tested version: 1.14)
- bwa (tested version: 0.7.16a)
- Picard tools (tested version: 1.136)
- bedtools (tested version: 2.25.0)
- seqtk (tested version: 1.0; note that seqtk is only required if subsampling of sequencing data is used - to reduce time of analysis in tests)
- GNU grep/awk
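A quick way to confirm that the command-line dependencies listed above are reachable is sketched below. The helper name `missing_tools` is hypothetical and not part of the ATLAS-seq scripts. Picard is typically run as a Java .jar rather than as a command on the PATH, so it is not checked here, and tool versions are not verified.

```shell
#!/bin/sh
# Print the name of every tool (given as arguments) that is not found on PATH.
# Hypothetical helper, not part of the ATLAS-seq scripts.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool cannot be found
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools cutadapt bwa bedtools seqtk grep awk)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing tools:" $missing
else
    echo "All command-line dependencies found."
fi
```

seqtk can be dropped from the list if subsampling is not used.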
Place the .atlas.conf file in your home folder and edit it according to your configuration.
Prepare directories as follows:
project/
|-- annotations
|-- data
|-- results
`-- scripts
The data folder should contain the .fastq files.
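The layout above can be created in one step. This is a minimal sketch that assumes the project root is named `project`, as in the tree, and that it is created in the current directory.

```shell
#!/bin/sh
# Create the expected project layout (safe to re-run thanks to -p).
mkdir -p project/annotations project/data project/results project/scripts

# List the .fastq files found in the data folder, if any.
ls project/data/*.fastq 2>/dev/null || echo "No .fastq files in project/data yet."
```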
- A human reference genome sequence (ex: hg19.fa) and its bwa index (ex: hg19.fa.amb, .ann, .bwt, .fai, .pac, .sa). Their location is indicated in the .atlas.conf file.
- Adjust the paths of the annotation files in the annotations/annotations_sc_5atlas.txt and annotations/annotations_sc_5atlas.txt files.
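Before starting a run, it may be worth checking that the reference genome is accompanied by all the index files listed above. The sketch below is only an illustration: the function name `missing_index_files` and the `REF_GENOME` variable are hypothetical, and the actual reference path is the one set in your .atlas.conf.

```shell
#!/bin/sh
# Print every index file that is missing (or empty) next to a reference FASTA.
# Usage: missing_index_files /path/to/hg19.fa
missing_index_files() {
    ref=$1
    for ext in amb ann bwt fai pac sa; do
        # -s is true only for files that exist and are not empty
        [ -s "${ref}.${ext}" ] || printf '%s\n' "${ref}.${ext}"
    done
}

# Hypothetical default; point this at the reference configured in .atlas.conf.
ref="${REF_GENOME:-hg19.fa}"
missing=$(missing_index_files "$ref")
if [ -n "$missing" ]; then
    printf 'Missing or empty index files:\n%s\n' "$missing"
fi
```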
Starting from a sequencing run, this script:
- demultiplexes the sequencing reads by sample
- trims the reads and maps them to the provided reference genome
- clusters the mapped reads and identifies potential breakpoints
Example for a 3'-atlas-seq run:
The barcode file .bc is a tabular text file with 3 columns (index name, index sequence, sample name). An example is provided in the annotations folder.
cd project/results
../scripts/atlas-clustering_v2.2_forktest_nosoft.sh \
-d 0 \
-b ../annotations/3pp.bc \
../data/R05_INS-203.ATLAS-seq.E6T_E6C1_3prime.fastq
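Since the clustering script relies on the 3-column barcode file, a quick format check before launching a run can save time. The sketch below assumes that "tabular" means tab-delimited. The function name `check_barcodes` and the content of `example.bc` are made up for illustration and do not come from the repository.

```shell
#!/bin/sh
# Check that every non-empty line of a barcode file has exactly 3 tab-separated fields.
# Prints the offending lines and returns non-zero if any are found.
check_barcodes() {
    awk -F '\t' 'NF > 0 && NF != 3 { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
                 END { exit bad }' "$1"
}

# Made-up example content: index name, index sequence, sample name.
printf 'idx1\tACGTAC\tsampleA\nidx2\tTGCATG\tsampleB\n' > example.bc
check_barcodes example.bc && echo "example.bc looks well-formed"
```

The same function can be pointed at ../annotations/3pp.bc or ../annotations/5pp.bc.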
Example for a 5'-atlas-seq run:
The barcode file 5pp.bc has 3 columns: index name, index sequence, sample name.
cd project/results
../scripts/atlas-clustering_v2.2_forktest_nosoft.sh \
-d 0 \
-f \
-b ../annotations/5pp.bc \
../data/R07_INS-208.ATLAS-seq_E6C1_5prime_E6C3_5prime.fastq
mkdir -p project/results/pooled_single_cells
cp project/results/*_*atlas*/* project/results/pooled_single_cells/
This script calls L1 peaks within large local amplification clusters and annotates them.
cd project/results/pooled_single_cells
../../scripts/atlas-seq_single_cells_Acount.sh
This filtering step removes artefactual amplifications arising in the whole-genome amplification (WGA) procedure.
cd project/results/pooled_single_cells
for file in *.3atlas.v2.2forktest_nosoft.best.insertions.10kb.full.annotated.bed ;
do
# sample name: everything before the first dot of the file name
name="${file%%.*}" ;
# keep header lines (starting with #) and the rows that pass the column-based filter
awk '($1~/^#/) || (($5>=3 && $8+$7>=1) && (($19~/L1HS\|Ta/) || (($19=="." && $21!~/L1PA/) && ($17=="." || $11==0 ) && ($NF<=1))))' "${name}.3atlas.v2.2forktest_nosoft.best.insertions.10kb.full.annotated.bed" \
> "${name}.3atlas.v2.2forktest_nosoft.best.insertions.10kb.full.annotated.filtered.bed" ;
# keep only the first 6 columns of the filtered rows
cut -f1-6 "${name}.3atlas.v2.2forktest_nosoft.best.insertions.10kb.full.annotated.filtered.bed" \
> "${name}.3atlas.v2.2forktest_nosoft.best.insertions.10kb.filtered.true.bed" ;
done
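To see how many candidate insertions the filter removes, the row counts before and after filtering can be compared per sample. This is a minimal sketch run from the pooled_single_cells folder. The helper `count_rows` is hypothetical, and it assumes that header lines start with `#`, as in the awk filter above.

```shell
#!/bin/sh
# Count non-header rows (lines not starting with '#') in a BED-like file.
count_rows() {
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    grep -vc '^#' "$1" || true
}

# Print: sample name, rows before filtering, rows after filtering.
for f in *.full.annotated.bed; do
    [ -e "$f" ] || continue              # no matching file: skip the literal pattern
    filtered="${f%.bed}.filtered.bed"
    [ -e "$filtered" ] || continue       # filtering not run yet for this sample
    printf '%s\t%s\t%s\n' "${f%%.*}" "$(count_rows "$f")" "$(count_rows "$filtered")"
done
```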