Cutadapt timeout
priscilla-glenn opened this issue · 40 comments
Hi,
I've tried running scATACpipe, and while it started without issues this time, the cutadapt step has timed out. I've included my nextflow.log for your convenience, as I'm not sure why it was taking so long to run. I did notice one discrepancy: cutadapt said it was trying to use 32 cores (pasted below) even though in my config file I said the max should be 28 cores. So my one guess would be that the computer is trying to use 32 cores when I didn't give it access to 32. Would there be a way to fix it so cutadapt isn't trying to use 32 cores?
Thanks!
Command output:
This is cutadapt 3.4 with Python 3.9.1
Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
Processing reads on 32 cores in paired-end mode ...
nextflow.log_May12.txt
Command wrapper:
This is cutadapt 3.4 with Python 3.9.1
Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
Processing reads on 32 cores in paired-end mode ...
Hi @priscilla-glenn, the --cores=0 flag directs cutadapt to auto-detect and use all available cores. There are multiple places to specify config parameters to Nextflow; the order of their priorities can be found here. Your "28 cores" setting may have been overridden depending on where you specified it.
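To see which value actually wins after all config files are merged, you can print the fully resolved configuration with Nextflow's built-in config command, e.g.:
# Print the resolved configuration for the same profiles used in the run,
# then look for cpu-related settings:
nextflow config -c Mouse_custom.config -profile docker,local | grep -i cpus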
To further debug the cutadapt timeout issue, you can go to the working directory, e.g. /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841, and check out the .command.log file. You can also share it here.
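For reference, every Nextflow task directory contains a set of hidden files that are useful for this kind of debugging:
cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841
ls -a               # .command.sh, .command.run, .command.log, .command.err, .exitcode
cat .command.log    # the task's combined stdout/stderr
cat .exitcode       # the exit status, if the task got that far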
Thank you. I will try resetting some of the config parameters to see if that helps.
In addition, here is the .command.log file for your reference.
This is cutadapt 3.4 with Python 3.9.1
Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
Processing reads on 32 cores in paired-end mode ...
nxf-uiXqo6FQ015ZXvCk4oxA87gZ
My current config file (before making further adjustments):
params {
preprocess = 'default'
input_fastq = '/home/blackmonlab/big-storage/Priscilla/SingleCell/MouseSC/scATACpipe_mouse_samplesheet.csv'
outdir = './scATAC_mouse2'
species_latin_name = 'Mus musculus'
ref_fasta_ensembl = 'mus_musculus'
split_fastq = false
barcode_correction = 'pheniqs'
whitelist_barcode = 'assets/whitelist_barcodes'
filter = 'both'
doublet_removal_algorithm = 'archr'
archr_thread = 8
archr_blacklist = false
archr_batch_correction_harmony = true
filter_sample = false
filter_seurat_ilsi = false
custom_peaks = false
archr_scrnaseq = false
archr_scrnaseq_grouplist = false
max_memory = '110.GB'
max_cpus = 28
max_time = '240.h'
}
And here is the .command.run file as well. Please let me know if there are any other files that could help with the troubleshooting. Thank you!
#!/bin/bash
# NEXTFLOW TASK: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1)
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x
NXF_ENTRY=${1:-nxf_main}
nxf_tree() {
local pid=$1
declare -a ALL_CHILDREN
while read P PP;do
ALL_CHILDREN[$PP]+=" $P"
done < <(ps -e -o pid= -o ppid=)
pstat() {
local x_pid=$1
local STATUS=$(2> /dev/null < /proc/$1/status grep -E 'Vm|ctxt')
if [ $? = 0 ]; then
local x_vsz=$(echo "$STATUS" | grep VmSize | awk '{print $2}' || echo -n '0')
local x_rss=$(echo "$STATUS" | grep VmRSS | awk '{print $2}' || echo -n '0')
local x_peak=$(echo "$STATUS" | grep -E 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n '0 0')
local x_pmem=$(awk -v rss=$x_rss -v mem_tot=$mem_tot 'BEGIN {printf "%.0f", rss/mem_tot*100*10}' || echo -n '0')
local vol_ctxt=$(echo "$STATUS" | grep '\bvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
local inv_ctxt=$(echo "$STATUS" | grep '\bnonvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
cpu_stat[x_pid]="$x_pid $x_pmem $x_vsz $x_rss $x_peak $vol_ctxt $inv_ctxt"
fi
}
pwalk() {
pstat $1
for i in ${ALL_CHILDREN[$1]:=}; do pwalk $i; done
}
pwalk $1
}
nxf_stat() {
cpu_stat=()
nxf_tree $1
declare -a sum=(0 0 0 0 0 0 0 0)
local pid
local i
for pid in "${!cpu_stat[@]}"; do
local row=(${cpu_stat[pid]})
[ $NXF_DEBUG = 1 ] && echo "++ stat mem=${row[*]}"
for i in "${!row[@]}"; do
if [ $i != 0 ]; then
sum[i]=$((sum[i]+row[i]))
fi
done
done
[ $NXF_DEBUG = 1 ] && echo -e "++ stat SUM=${sum[*]}"
for i in {1..7}; do
if [ ${sum[i]} -lt ${cpu_peak[i]} ]; then
sum[i]=${cpu_peak[i]}
else
cpu_peak[i]=${sum[i]}
fi
done
[ $NXF_DEBUG = 1 ] && echo -e "++ stat PEAK=${sum[*]}\n"
nxf_stat_ret=(${sum[*]})
}
nxf_mem_watch() {
set -o pipefail
local pid=$1
local trace_file=.command.trace
local count=0;
declare -a cpu_stat=(0 0 0 0 0 0 0 0)
declare -a cpu_peak=(0 0 0 0 0 0 0 0)
local mem_tot=$(< /proc/meminfo grep MemTotal | awk '{print $2}')
local timeout
local DONE
local STOP=''
[ $NXF_DEBUG = 1 ] && nxf_sleep 0.2 && ps fx
while true; do
nxf_stat $pid
if [ $count -lt 10 ]; then timeout=1;
elif [ $count -lt 120 ]; then timeout=5;
else timeout=30;
fi
read -t $timeout -r DONE || true
[[ $DONE ]] && break
if [ ! -e /proc/$pid ]; then
[ ! $STOP ] && STOP=$(nxf_date)
[ $(($(nxf_date)-STOP)) -gt 10000 ] && break
fi
count=$((count+1))
done
echo "%mem=${nxf_stat_ret[1]}" >> $trace_file
echo "vmem=${nxf_stat_ret[2]}" >> $trace_file
echo "rss=${nxf_stat_ret[3]}" >> $trace_file
echo "peak_vmem=${nxf_stat_ret[4]}" >> $trace_file
echo "peak_rss=${nxf_stat_ret[5]}" >> $trace_file
echo "vol_ctxt=${nxf_stat_ret[6]}" >> $trace_file
echo "inv_ctxt=${nxf_stat_ret[7]}" >> $trace_file
}
nxf_write_trace() {
echo "nextflow.trace/v2" > $trace_file
echo "realtime=$wall_time" >> $trace_file
echo "%cpu=$ucpu" >> $trace_file
echo "rchar=${io_stat1[0]}" >> $trace_file
echo "wchar=${io_stat1[1]}" >> $trace_file
echo "syscr=${io_stat1[2]}" >> $trace_file
echo "syscw=${io_stat1[3]}" >> $trace_file
echo "read_bytes=${io_stat1[4]}" >> $trace_file
echo "write_bytes=${io_stat1[5]}" >> $trace_file
}
nxf_trace_mac() {
local start_millis=$(nxf_date)
/bin/bash -euo pipefail /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.sh
local end_millis=$(nxf_date)
local wall_time=$((end_millis-start_millis))
local ucpu=''
local io_stat1=('' '' '' '' '' '')
nxf_write_trace
}
nxf_fd() {
local FD=11
while [ -e /proc/$$/fd/$FD ]; do FD=$((FD+1)); done
echo $FD
}
nxf_trace_linux() {
local pid=$$
command -v ps &>/dev/null || { >&2 echo "Command 'ps' required by nextflow to collect task metrics cannot be found"; exit 1; }
local num_cpus=$(< /proc/cpuinfo grep '^processor' -c)
local tot_time0=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
local cpu_time0=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
local io_stat0=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
local start_millis=$(nxf_date)
trap 'kill $mem_proc' ERR
/bin/bash -euo pipefail /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.sh &
local task=$!
mem_fd=$(nxf_fd)
eval "exec $mem_fd> >(nxf_mem_watch $task)"
local mem_proc=$!
wait $task
local end_millis=$(nxf_date)
local tot_time1=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
local cpu_time1=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
local ucpu=$(awk -v p1=$cpu_time1 -v p0=$cpu_time0 -v t1=$tot_time1 -v t0=$tot_time0 -v n=$num_cpus 'BEGIN { pct=(p1-p0)/(t1-t0)*100*n; printf("%.0f", pct>0 ? pct : 0) }' )
local io_stat1=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
local i
for i in {0..5}; do
io_stat1[i]=$((io_stat1[i]-io_stat0[i]))
done
local wall_time=$((end_millis-start_millis))
[ $NXF_DEBUG = 1 ] && echo "+++ STATS %CPU=$ucpu TIME=$wall_time I/O=${io_stat1[*]}"
echo "nextflow.trace/v2" > $trace_file
echo "realtime=$wall_time" >> $trace_file
echo "%cpu=$ucpu" >> $trace_file
echo "rchar=${io_stat1[0]}" >> $trace_file
echo "wchar=${io_stat1[1]}" >> $trace_file
echo "syscr=${io_stat1[2]}" >> $trace_file
echo "syscw=${io_stat1[3]}" >> $trace_file
echo "read_bytes=${io_stat1[4]}" >> $trace_file
echo "write_bytes=${io_stat1[5]}" >> $trace_file
[ -e /proc/$mem_proc ] && eval "echo 'DONE' >&$mem_fd" || true
wait $mem_proc 2>/dev/null || true
while [ -e /proc/$mem_proc ]; do nxf_sleep 0.1; done
}
nxf_trace() {
local trace_file=.command.trace
touch $trace_file
if [[ $(uname) = Darwin ]]; then
nxf_trace_mac
else
nxf_trace_linux
fi
}
nxf_container_env() {
cat << EOF
export HDF5_USE_FILE_LOCKING="FALSE"
export RHDF5_USE_FILE_LOCKING="FALSE"
export NXF_SINGULARITY_CACHEDIR="./work/singularity"
export PYTHONNOUSERSITE="1"
export R_PROFILE_USER="/.Rprofile"
export R_ENVIRON_USER="/.Renviron"
export PATH="$PATH:/mnt/md1/Priscilla/SingleCell/scATACpipe/bin"
EOF
}
nxf_sleep() {
sleep $1 2>/dev/null || sleep 1;
}
nxf_date() {
local ts=$(date +%s%3N);
if [[ ${#ts} == 10 ]]; then echo ${ts}000
elif [[ $ts == *%3N ]]; then echo ${ts/\%3N/000}
elif [[ $ts == *3N ]]; then echo ${ts/3N/000}
elif [[ ${#ts} == 13 ]]; then echo $ts
else echo "Unexpected timestamp value: $ts"; exit 1
fi
}
nxf_env() {
echo '============= task environment ============='
env | sort | sed "s/\(.*\)AWS\(.*\)=\(.\{6\}\).*/\1AWS\2=\3xxxxxxxxxxxxx/"
echo '============= task output =================='
}
nxf_kill() {
declare -a children
while read P PP;do
children[$PP]+=" $P"
done < <(ps -e -o pid= -o ppid=)
kill_all() {
[[ $1 != $$ ]] && kill $1 2>/dev/null || true
for i in ${children[$1]:=}; do kill_all $i; done
}
kill_all $1
}
nxf_mktemp() {
local base=${1:-/tmp}
mkdir -p "$base"
if [[ $(uname) = Darwin ]]; then mktemp -d $base/nxf.XXXXXXXXXX
else TMPDIR="$base" mktemp -d -t nxf.XXXXXXXXXX
fi
}
nxf_fs_copy() {
local source=$1
local target=$2
local basedir=$(dirname $1)
mkdir -p $target/$basedir
cp -fRL $source $target/$basedir
}
nxf_fs_move() {
local source=$1
local target=$2
local basedir=$(dirname $1)
mkdir -p $target/$basedir
mv -f $source $target/$basedir
}
nxf_fs_rsync() {
rsync -rRl $1 $2
}
nxf_fs_rclone() {
rclone copyto $1 $2/$1
}
nxf_fs_fcp() {
fcp $1 $2/$1
}
on_exit() {
exit_status=${nxf_main_ret:=$?}
printf $exit_status > /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.exitcode
set +u
[[ "$tee1" ]] && kill $tee1 2>/dev/null
[[ "$tee2" ]] && kill $tee2 2>/dev/null
[[ "$ctmp" ]] && rm -rf $ctmp || true
docker rm $NXF_BOXID &>/dev/null || true
sync || true
exit $exit_status
}
on_term() {
set +e
docker kill $NXF_BOXID
}
nxf_launch() {
docker run -i --cpu-shares 12288 --memory 8192m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /mnt/md1/Priscilla/SingleCell/scATACpipe:/mnt/md1/Priscilla/SingleCell/scATACpipe -w "$PWD" -u
}
nxf_stage() {
true
# stage input files
rm -f R1_Sample_S1_1.barcoded.fastq.gz
rm -f R2_Sample_S1_1.barcoded.fastq.gz
ln -s /mnt/md1/Priscilla/SingleCell/scATACpipe/work/ca/08da56da2ae1c4e0385a13edd8b6fc/R1_Sample_S1_1.barcoded.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz
ln -s /mnt/md1/Priscilla/SingleCell/scATACpipe/work/ca/08da56da2ae1c4e0385a13edd8b6fc/R2_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
}
nxf_unstage() {
true
[[ ${nxf_main_ret:=0} != 0 ]] && return
}
nxf_main() {
trap on_exit EXIT
trap on_term TERM INT USR2
trap '' USR1
[[ "${NXF_CHDIR:-}" ]] && cd "$NXF_CHDIR"
export NXF_BOXID="nxf-$(dd bs=18 count=1 if=/dev/urandom 2>/dev/null | base64 | tr +/ 0A)"
NXF_SCRATCH=''
[[ $NXF_DEBUG > 0 ]] && nxf_env
touch /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.begin
set +u
set -u
[[ $NXF_SCRATCH ]] && cd $NXF_SCRATCH
nxf_stage
set +e
ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
local cout=$ctmp/.command.out; mkfifo $cout
local cerr=$ctmp/.command.err; mkfifo $cerr
tee .command.out < $cout &
tee1=$!
tee .command.err < $cerr >&2 &
tee2=$!
( nxf_launch ) >$cout 2>$cerr &
pid=$!
wait $pid || nxf_main_ret=$?
wait $tee1 $tee2
nxf_unstage
}
$NXF_ENTRY
Typically, cutadapt should be very fast. I think the next debugging step is to try running the problematic command directly. Here is how:
- go to the working directory:
cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841
- activate the container for cutadapt, hukai916/cutadapt_xenial:0.1: run docker run -it hukai916/cutadapt_xenial:0.1 if you are using docker, or singularity shell xxx.sif if you are using singularity; the xxx.sif file should be under the work/singularity directory.
- copy the command from .command.sh and run it to see if you can replicate the error.
Btw, I noticed another config, /mnt/md1/Priscilla/SingleCell/scATACpipe/Mouse_custom.config, which might set max_cpus = 32 and override the default?
Thank you.
I went to the working directory, cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841, and activated the container for cutadapt successfully. My terminal shows (base) root@b407de9f1cc3:/# on the far left.
I then tried to run the command from .command.sh:
cutadapt --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
However, I get an error stating it cannot find my input file. And when I look in the directory using ls while in the docker container, the files from my original working directory are no longer listed.
ERROR: Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/cutadapt/pipeline.py", line 552, in run
with self._opener.xopen(self.path, 'rb') as f:
File "/opt/conda/lib/python3.9/site-packages/cutadapt/utils.py", line 186, in xopen
f = open_raise_limit(
File "/opt/conda/lib/python3.9/site-packages/cutadapt/utils.py", line 51, in open_raise_limit
f = func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/xopen/init.py", line 609, in xopen
return _open_gz(filename, mode, compresslevel, threads)
File "/opt/conda/lib/python3.9/site-packages/xopen/init.py", line 505, in _open_gz
return igzip.open(filename, mode)
File "/opt/conda/lib/python3.9/site-packages/isal/igzip.py", line 92, in open
binary_file = IGzipFile(filename, gz_mode, compresslevel)
File "/opt/conda/lib/python3.9/site-packages/isal/igzip.py", line 151, in init
super().init(filename, mode, compresslevel, fileobj, mtime)
File "/opt/conda/lib/python3.9/gzip.py", line 173, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'R1_Sample_S1_1.barcoded.fastq.gz'
As for the Mouse_custom.config, that is the config I generated for this project, and in it I set max_cpus = 28.
Just thought I'd give an update.
I went into the cutadapt.nf code and replaced the option variable with cores=28. However, when I resume the session, although I can see that the .command.sh file now states cores=28 instead of cores=0, it still doesn't appear to work. When I first restart the session, I can see on my system monitor that it starts working and the cores are being used; however, after about five minutes it goes into a cycle of never finishing cutadapt, and the system shows that not much processing is occurring on the cores.
In addition, I ran cutadapt 4.4 outside of scATACpipe to get an estimate of how long cutadapt should take on my data, and it finished in about 30 minutes.
At this point, I'm not sure why it won't finish cutadapt, so hopefully getting it to run in the cutadapt container will let us debug the issue.
Thanks,
May-17 11:28:37.222 [Task submitter] INFO nextflow.Session - [a0/bc5742] Submitted process > SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1)
May-17 11:33:36.310 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]
May-17 11:38:36.387 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]
May-17 11:43:36.451 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]
Hi, I am on vacation and might be slow to respond.
To use docker, you need to mount your local folder into the container; check out here. Or, you can run the docker command in non-interactive mode without the -it flag. Let me know if you have any questions.
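For example, a minimal sketch of re-running the task interactively with the pipeline root mounted; mounting the whole scATACpipe folder (as the pipeline itself does in .command.run) lets the staged symlinks resolve to their targets in other work directories:
docker run --rm -it \
  -v /mnt/md1/Priscilla/SingleCell/scATACpipe:/mnt/md1/Priscilla/SingleCell/scATACpipe \
  -w /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841 \
  hukai916/cutadapt_xenial:0.1 \
  bash .command.sh    # or paste the cutadapt line from .command.sh directly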
Thanks for your help. I was able to mount the folder which held the actual barcode files (not the symlinks) and run cutadapt. It ran correctly and finished in 25 minutes, so I'm not sure why it's not working when I run the full pipeline. At this point I will try to redownload scATACpipe and recreate the config file.
Maybe also try a fresh start without using -resume.
Hi, so I did a fresh start without using resume, and no luck.
Though I did notice, both the last time I tried to resume and this time with the fresh restart, that my swap memory was suddenly filling up, and as soon as the swap filled, my core usage dropped and cutadapt started cycling. This only happens when running through scATACpipe, as I was successfully able to run the cutadapt container by itself and to run cutadapt on my own outside of any container.
I have 32 cores and 130 GB of memory on my computer, and 488,981,686 read pairs in my files. Is this a system requirement issue, or is there a parameter I could adjust for use with the scATACpipe container?
Thank you
The resources on your computer should be more than sufficient for cutadapt. I don't quite understand why the container would not work with the pipeline. I have created a new container using the latest version of cutadapt (v4.4); do you mind changing the following line:
scATACpipe/modules/local/cutadapt.nf
Line 12 in 8e00489
to container "hukai916/cutadapt:0.1" and trying again?
Hi, I will try that and let you know how it goes.
Also, I tried running the pipeline on another computer in the lab, set to 60 cores and 200 GB of memory. Again cutadapt failed, but this time it gave a different error:
"WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
What's interesting is that when I watch my system monitor, my memory has not filled up; it's only the swap that suddenly fills, and it's my understanding that the computer shouldn't use swap until memory is full. So I don't know why it's filling the swap instead. Is there anything in the container settings about swap?
One idea I have as I rerun it is to lower the core number to give each core more memory. I'll keep you posted. Thanks for all your help with this.
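One way to watch the memory behavior while the task runs is with standard monitoring tools in a second terminal:
free -h -s 5    # print memory and swap usage every 5 seconds
docker stats    # live per-container CPU and memory usage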
Good morning,
I've just checked the run I started this weekend, and after adjusting the cutadapt module and running with fewer cores, I received an error in the "GET_WHITELIST_BARCODE" step. I've attached the nextflow log and my config.
Caused by:
Process SCATACPIPE:PREPROCESS_DEFAULT:GET_WHITELIST_BARCODE (1) terminated with an error exit status (1)
Command executed:
# step1: concatenate all barcode reads or read chunks that belong to the sample_name:
cat barcode_Sample_S1_*.fastq.gz > barcode_Sample_S1_combined.fastq.gz
# step2: determine the whitelist barcode, use 10,000 reads for quick test:
(zcat barcode_Sample_S1_combined.fastq.gz || true) | awk '{ print $0 } NR==40000 {exit}' | gzip > subset.fastq.gz
# note "|| true" is to capture and skip the SIGPIPE error
get_whitelist_barcode.py whitelist_barcodes subset.fastq.gz whitelist_Sample_S1_
Command exit status:
1
Command output:
(empty)
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 55, in
fractions = [get_fraction(index_fastq, barcode_file) for barcode_file in barcode_files]
File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 55, in
fractions = [get_fraction(index_fastq, barcode_file) for barcode_file in barcode_files]
File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 27, in get_fraction
with open(barcode_file) as f:
IsADirectoryError: [Errno 21] Is a directory: 'whitelist_barcodes/.nextflow'
Work dir:
/home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/0e/ff0d1341e03ec0e39bf50fe9235b8c
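In case it is useful context: judging from the traceback, get_whitelist_barcode.py opens every entry inside the whitelist_barcodes folder as a barcode file, so a stray directory such as .nextflow (created if nextflow was ever launched from that folder) breaks it. A possible workaround, sketched under that assumption, is to point whitelist_barcode at a clean copy that contains only the barcode files:
mkdir -p ~/whitelist_barcodes_clean
cp assets/whitelist_barcodes/* ~/whitelist_barcodes_clean/    # plain files only; hidden entries like .nextflow are not matched by *
# then set whitelist_barcode to the new path in the config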
Okay, so cutadapt actually finished! But then of course a new error had to pop up, this time with bwa mem: "[e7/7d7c3b] NOTE: Process SCATACPIPE:PREPROCESS_DEFAULT:BWA_MAP (1) terminated with an error exit status (1) -- Execution is retried (2)"
I checked the .command.err and .command.log and both say:
"WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 4714624 sequences (240000007 bp)...
[M::process] read 4714608 sequences (240000072 bp)...
samtools sort: failed to read header from "-""
and the .command.sh file says:
#!/bin/bash -euo pipefail
filename=$(basename bwa_index/*.bwt)
index_name="${filename%.*}"
sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
outname="${sample_name%%.*}"
outname="${outname#R1_}"
bwa mem -t 24 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 24 -m 715827882 -O bam -o ${outname}.sorted.bam
Sorry for the late reply, I am back to work now.
It seems that your bwa_index folder is empty. By default, if bwa index files are not supplied, scATACpipe builds them from the supplied genome fasta file. Can you check the SCATACPIPE:PREPROCESS_DEFAULT:BWA_INDEX output first? Let me know if you have any questions.
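A quick way to sanity-check that, sketched with standard tools (a complete bwa index consists of five files: .amb, .ann, .bwt, .pac, .sa):
find work -path '*bwa_index*' -type f -exec ls -lh {} \;    # locate the index and confirm all five files exist with non-zero sizes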
Hi, welcome back. I hope you had a great vacation. I tried rerunning the pipeline without supplying any index files so everything would be the default version. I've included a lot below to try to figure out why it's not working. Let me know if you need any other files.
First, here is my config file for this run so it should have created everything itself.
// Save the following configurations into custom.config
params {
preprocess = 'default'
input_fastq = '/home/blackmonlab/Desktop/scATACpipe/scATACpipe/sample_sheet_test_data1.csv'
outdir = './MouseSC3'
species_latin_name = 'Mus musculus'
ref_fasta_ensembl = 'mus_musculus'
split_fastq = false
barcode_correction = 'pheniqs'
whitelist_barcode = '/home/blackmonlab/Desktop/scATACpipe/scATACpipe/assets/whitelist_barcodes'
filter = 'both'
doublet_removal_algorithm = 'archr'
archr_thread = 8
archr_blacklist = false
archr_batch_correction_harmony = true
filter_sample = false
filter_seurat_ilsi = false
custom_peaks = false
archr_scrnaseq = false
archr_scrnaseq_grouplist = false
max_memory = '200.GB'
max_cpus = 32
max_time = '240.h'
}
// Use this command: sudo nextflow run main.nf -c Mouse_custom.config -profile docker,local.
But here is the error again.
Caused by:
Process `SCATACPIPE:PREPROCESS_DEFAULT:BWA_MAP (1)` terminated with an error exit status (1)
Command executed:
filename=$(basename bwa_index/*.bwt)
index_name="${filename%.*}"
sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
outname="${sample_name%%.*}"
outname="${outname#R1_}"
bwa mem -t 32 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam
Command exit status:
1
Command output:
(empty)
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 6286194 sequences (320000070 bp)...
[M::process] read 6286100 sequences (320000029 bp)...
samtools sort: failed to read header from "-"
Work dir:
/home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/8d/b51980964f520ea4ee9555e00a3dfc
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
-- Check '.nextflow.log' file for details
When I look at the output for BWA_INDEX, the command log seems to indicate it worked.
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[bwa_index] Pack FASTA... 28.16 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=5456444902, availableWord=395935328
[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999990 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999990 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999990 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999990 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999990 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999990 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999990 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999999990 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 1099999990 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 1199999990 characters processed.
[BWTIncConstructFromPacked] 130 iterations done. 1299999990 characters processed.
[BWTIncConstructFromPacked] 140 iterations done. 1399999990 characters processed.
[BWTIncConstructFromPacked] 150 iterations done. 1499999990 characters processed.
[BWTIncConstructFromPacked] 160 iterations done. 1599999990 characters processed.
[BWTIncConstructFromPacked] 170 iterations done. 1699999990 characters processed.
[BWTIncConstructFromPacked] 180 iterations done. 1799999990 characters processed.
[BWTIncConstructFromPacked] 190 iterations done. 1899999990 characters processed.
[BWTIncConstructFromPacked] 200 iterations done. 1999999990 characters processed.
[BWTIncConstructFromPacked] 210 iterations done. 2099999990 characters processed.
[BWTIncConstructFromPacked] 220 iterations done. 2199999990 characters processed.
[BWTIncConstructFromPacked] 230 iterations done. 2299999990 characters processed.
[BWTIncConstructFromPacked] 240 iterations done. 2399999990 characters processed.
[BWTIncConstructFromPacked] 250 iterations done. 2499999990 characters processed.
[BWTIncConstructFromPacked] 260 iterations done. 2599999990 characters processed.
[BWTIncConstructFromPacked] 270 iterations done. 2699999990 characters processed.
[BWTIncConstructFromPacked] 280 iterations done. 2799999990 characters processed.
[BWTIncConstructFromPacked] 290 iterations done. 2899999990 characters processed.
[BWTIncConstructFromPacked] 300 iterations done. 2999999990 characters processed.
[BWTIncConstructFromPacked] 310 iterations done. 3099999990 characters processed.
[BWTIncConstructFromPacked] 320 iterations done. 3199999990 characters processed.
[BWTIncConstructFromPacked] 330 iterations done. 3299999990 characters processed.
[BWTIncConstructFromPacked] 340 iterations done. 3399999990 characters processed.
[BWTIncConstructFromPacked] 350 iterations done. 3499999990 characters processed.
[BWTIncConstructFromPacked] 360 iterations done. 3599999990 characters processed.
[BWTIncConstructFromPacked] 370 iterations done. 3699999990 characters processed.
[BWTIncConstructFromPacked] 380 iterations done. 3799999990 characters processed.
[BWTIncConstructFromPacked] 390 iterations done. 3899999990 characters processed.
[BWTIncConstructFromPacked] 400 iterations done. 3999999990 characters processed.
[BWTIncConstructFromPacked] 410 iterations done. 4099999990 characters processed.
[BWTIncConstructFromPacked] 420 iterations done. 4199999990 characters processed.
[BWTIncConstructFromPacked] 430 iterations done. 4299999990 characters processed.
[BWTIncConstructFromPacked] 440 iterations done. 4399999990 characters processed.
[BWTIncConstructFromPacked] 450 iterations done. 4499999990 characters processed.
[BWTIncConstructFromPacked] 460 iterations done. 4599999990 characters processed.
[BWTIncConstructFromPacked] 470 iterations done. 4699999990 characters processed.
[BWTIncConstructFromPacked] 480 iterations done. 4799789782 characters processed.
[BWTIncConstructFromPacked] 490 iterations done. 4892030822 characters processed.
[BWTIncConstructFromPacked] 500 iterations done. 4974010886 characters processed.
[BWTIncConstructFromPacked] 510 iterations done. 5046870966 characters processed.
[BWTIncConstructFromPacked] 520 iterations done. 5111625190 characters processed.
[BWTIncConstructFromPacked] 530 iterations done. 5169174934 characters processed.
[BWTIncConstructFromPacked] 540 iterations done. 5220321238 characters processed.
[BWTIncConstructFromPacked] 550 iterations done. 5265776182 characters processed.
[BWTIncConstructFromPacked] 560 iterations done. 5306172630 characters processed.
[BWTIncConstructFromPacked] 570 iterations done. 5342073078 characters processed.
[BWTIncConstructFromPacked] 580 iterations done. 5373977430 characters processed.
[BWTIncConstructFromPacked] 590 iterations done. 5402330102 characters processed.
[BWTIncConstructFromPacked] 600 iterations done. 5427525990 characters processed.
[BWTIncConstructFromPacked] 610 iterations done. 5449916134 characters processed.
[bwt_gen] Finished constructing BWT in 614 iterations.
[bwa_index] 1742.02 seconds elapse.
[bwa_index] Update BWT... 25.69 sec
[bwa_index] Pack forward-only FASTA... 22.23 sec
[bwa_index] Construct SA from BWT and Occ... 781.42 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw bwa_index/Genome.primary.chrPrefixed.fa.gz
[main] Real time: 2638.710 sec; CPU: 2599.526 sec
And in the 6c work folder, there is a bwa_index folder with files in it.
-rw-r--r-- 1 root root 5961 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.amb
-rw-r--r-- 1 root root 2478 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.ann
-rw-r--r-- 1 root root 2728222532 May 31 11:33 Genome.primary.chrPrefixed.fa.gz.bwt
-rw-r--r-- 1 root root 682055614 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.pac
-rw-r--r-- 1 root root 1364111280 May 31 11:47 Genome.primary.chrPrefixed.fa.gz.sa
And in the cutadapt c6 folder, there are files as well.
log_cutadapt_Sample_S1_1.txt R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz
R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
and all seem to have something in them.
Here is the beginning of the R2_sample file:
"@NTAGCTCGAGATTGGG:VH00580:38:AACMLYGM5:1:1101:65437:1284 3:N:0:TTTGGACG
GTCCTGGCTTGTTTTATGTCAACTTGACACAAGCTAGAGTCATCTTAGAGG
+
CCCCCCCCCCC;CCCCCCC-C;CCCCCCCCCCCCCCC-CCCCCCCCCCCCC
@NTACCGGACGGATGGG:VH00580:38:AACMLYGM5:1:1101:65475:1284 3:N:0:TTTGGACG
TGGTTAAACTAGGGAATGATTTTGAATAGAGCACAATAGCATTAAGAGTCG
+
CCCC;CC;C-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC
@NGAGAGATGGCTTATC:VH00580:38:AACMLYGM5:1:1101:65627:1284 3:N:0:TTTGGACG
GTTAAGCAAAGTGTACTTTGAGGTATCATCGTTAAAAAAATAAAATTGATG
"
Thoughts? And thank you!
It seems that the bwa mem step is to blame. I suggest the following debugging steps:
- go to the working dir:
cd /home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/8d/b51980964f520ea4ee9555e00a3dfc
- activate the corresponding docker container (should be hukai916/bwa_xenial:0.1)
- the actual commands executed by this module are (.command.sh):
filename=$(basename bwa_index/*.bwt)
index_name="${filename%.*}"
sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
outname="${sample_name%%.*}"
outname="${outname#R1_}"
bwa mem -t 32 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam
Try running each command one at a time; make sure index_name is correct and that the index file is actually there (see also the sketch below). Let me know.
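One extra tip: because the two tools are joined by a pipe, a bwa mem failure is masked and only surfaces as samtools sort: failed to read header from "-". Running the two stages separately (a sketch; run the filename/index_name lines first) makes the real error visible:
bwa mem -t 8 bwa_index/$index_name \
  R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz \
  > test.sam                                                   # watch stderr for the actual bwa error
samtools sort -@ 8 -m 1G -O bam -o test.sorted.bam test.sam    # only run once bwa succeeds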
Thank you. I am actually about to head on vacation now myself for a week so I won't be able to test this out until I get back.
Hi,
I'm back in the lab and working on debugging this. When I mount my files to the container, it cannot use files that are symlinks pointing outside the mounted folder and gives the same file-not-found error. So, to get around this, I went to the original bwa_index folder at 6c/9d5c273a42ec4f4369368036147c95.
I ran the naming code and everything was correct.
I then went to the folders with the actual sample files at c6/518fd88378bc056f65ec7bf026a2be and ran the following command:
bwa mem -t 32 ../../6c/9d5c273a42ec4f4369368036147c95/bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam
And it ran! I have an output file called "Sample_S1_1.sorted.bam" in my current directory: c6/518fd88378bc056f65ec7bf026a2be
Given this, I'm not sure why it was unable to find the index files when run through the pipeline inside the container. Thoughts?
Hi, quick question: I'm going through the pipeline manually now, and I was wondering how the adapter used in cutadapt was chosen?
Thanks!
Hi, the adapters used for cutadapt default to the first 13 bp of the standard Illumina adapter. They can be configured to use other sequences here:
Lines 31 to 32 in 8e00489
Thanks hukai916. I'm progressing through the steps well. I'm now at the DEDUP_BAM module and am curious what I should use as the $barcode_tag?
In scATACpipe, DEDUP_BAM is called twice. The first run is a rough de-duplication without cell barcode correction; its aim is to find the inflection point that determines the number of valid cells. The second run deduplicates using the corrected cell barcodes saved in the "CB" tag.
Therefore, for the first run the barcode_tag is set to "N/A", while for the second run it is set to "CB". These are set automatically if you run the pipeline automatically instead of manually. For your convenience, the corresponding lines are below:
First run:
scATACpipe/workflows/preprocess_default.nf
Line 159 in 8e00489
Second run:
scATACpipe/workflows/preprocess_default.nf
Line 229 in 8e00489
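If you are checking this manually, one way to confirm whether corrected barcodes are present in a BAM is to look for the CB tag directly (standard samtools; the BAM name is taken from the commands earlier in this thread):
samtools view mm39.filtered.bam | head -n 1 | tr '\t' '\n' | grep '^CB:' || echo "no CB tag found"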
It is still mysterious to me why you cannot run the pipeline automatically in your environment.
Thank you! And I'm not sure either, but going through the pipeline step by step has at least been good for me to learn the process. It also means I can more easily go back and adjust any mapping steps if needed. I'll probably try to rerun the pipeline in the near future, but for now I'll keep doing this.
As for the remove_dup step: I tried running it and ran into this error. Do you know which version of sinto you used?
python3 ../../../remove_duplicate.py --inbam mm39.filtered.bam --barcode_tag "N/A" --outdir ./ --outbam mm39.dedup.bam --nproc 75
Traceback (most recent call last):
File "/home/blackmonlab/Desktop/Priscilla/SingleCell/mm39/raw_data/Fastqs/../../../remove_duplicate.py", line 271, in <module>
intervals = utils.chunk_bam_by_chr(bam_temp, nproc)
^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'sinto.utils' has no attribute 'chunk_bam_by_chr'
Hi, when I investigated sinto.utils (https://github.com/timoast/sinto/blob/master/sinto/utils.py), I found chunk_bam but nothing called chunk_bam_by_chr. Was this in an older version? Did you create it? Is there a way I can use this segment of the pipeline and then export the results?
I tried running chunk_bam itself, and it froze my computer while creating empty temporary files whose names sometimes started on one chromosome and ended on another. So I'm not exactly sure what happened, but I'm digging into it.
Thanks for all your help.
Hi,
The sinto used here is a customized version. When you run the pipeline, the docker container ("hukai916/sinto_kai:0.3") that harbors this customized sinto is downloaded automatically. You can find the script here: https://github.com/hukai916/sinto/tree/master/sinto
The function chunk_bam_by_chr is used for removing duplicates (the DEDUP_BAM module). The chunk_bam function implemented in the original sinto package may group multiple chromosome regions into the same chunk, which is not compatible with our multi-processing strategy; therefore, we implemented chunk_bam_by_chr so that each chunk comes from a single chromosome.
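As an alternative to vendoring the custom sinto locally, the script could be run inside the container that already ships it; a sketch, assuming remove_duplicate.py and the BAM sit in the current directory:
docker run --rm -v "$PWD":"$PWD" -w "$PWD" hukai916/sinto_kai:0.3 \
  python3 remove_duplicate.py --inbam mm39.filtered.bam --barcode_tag "N/A" \
  --outdir ./ --outbam mm39.dedup.bam --nproc 8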
Hi, I downloaded the sinto folder and put it into the same folder as the remove_duplicate.py script. I called it sinto2 and updated the script to import utils from sinto2.
I then tried running the script as shown below, but I am getting weird results with chunks spanning scaffolds and such. Any ideas why this would be, or how I should download your version of sinto?
The __init__.py script is still finding the metadata for "sinto".
(scATACpipe) blackmonlab@blackmonlab-M-CLASS:~/Desktop/Priscilla/SingleCell/mm39/raw_data/Fastqs$ python3 ../../../remove_duplicate.py --inbam mm39.filtered.bam --barcode_tag "N/A" --outdir ./ --outbam mm39.dedup.bam --nproc 50
Processing chunk_chr1_0.0_to_chrJH584295.1_39.52 ...
Processing chunk_chr1_3903085.58_to_chrJH584295.1_79.04 ...
Processing chunk_chr1_7806171.16_to_chrJH584295.1_118.56 ...
Processing chunk_chr1_11709256.74_to_chrJH584295.1_158.08 ...
Processing chunk_chr1_15612342.32_to_chrJH584295.1_197.60000000000002 ...
Which sinto folder did you download? I assume you are running the script locally; if so, do you have 50 cores (nproc = 50)? Can you share the input bam so that I can try to replicate the issue? Thanks,
Hi, I git cloned sinto from the link you shared and then renamed the sinto folder within it to sinto2, which I copied to my working directory.
I then downloaded the remove_duplicate.py script into the same working directory and updated it to import sinto2.
My computer has up to 100 CPUs available.
Here is a link to the original bam file and the filtered bam I created: https://drive.google.com/file/d/1lc0nscqrh3zWaCfkXUxTpewBKfjPJTj7/view?usp=sharing
Hi, I just wanted to follow up and check if you were able to download the bam file correctly?
Here is a workflow document I have for what I've been doing.
https://docs.google.com/document/d/1a6F1WyY4PI8PJtYyo2VBMqJt_Z_mxKELRK8wbBFavDk/edit?usp=sharing
Can you grant me access? Thanks!
Shared!
Thanks! I confirm that I have access to both files now, let me look into it and let you know.
The bam files are too large to download; can you try downsampling to 10% and sharing the link? You can use samtools view -b -s 0.1 foo.bam > sampled.bam to randomly downsample 10% of the BAM. Thanks,
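A quick sanity check on the result can't hurt (the integer part of -s acts as an optional random seed):
samtools view -b -s 42.1 foo.bam > sampled.bam    # seed 42, keep ~10% of reads
samtools flagstat sampled.bam                     # confirm the read counts look plausible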
You should be receiving a notice from google with the new shared files. Thanks!
Hi @priscilla-glenn, I was able to download and downsample your original bam file to 5% for testing purposes. I can reproduce what you observed with the custom sinto (that you saved as sinto2). If you see info like below:
Processing chunk_chr1_0.0_to_chrJH584295.1_329.3333333333333 ...
Processing chunk_chr1_32525713.166666668_to_chrJH584295.1_658.6666666666666 ...
Processing chunk_chr1_65051426.333333336_to_chrJH584295.1_988.0 ...
Processing chunk_chr1_97577139.5_to_chrJH584295.1_1317.3333333333333 ...
Processing chunk_chr1_130102852.66666667_to_chrJH584295.1_1646.6666666666665 ...
Processing chunk_chr1_162628565.83333334_to_chrJH584295.1_1976.0 ...
It basically means that you are actually invoking the correct sinto version. The function chunk_bam_by_chr splits each chromosome into --nproc pieces, and one processor handles every ith piece from each chromosome. For example, with --nproc 10, each chromosome/scaffold is split into 10 pieces: processor 1 analyzes the first piece from each chromosome/scaffold, processor 2 analyzes the second piece from each chromosome/scaffold, and so on. Meanwhile, each parallel job prints a line consisting of the start coordinate of the first chromosome piece and the end coordinate of the last piece it is processing, which is what you see above. Let me know if this is unclear.
Thank you very much. And so that means the final output is still correct even though the printing in the terminal is as shown above? Or do I need to adjust the code in some way? Thank you so much for your help in this!
Correct. Though the printing is sort of confusing, the results should be right; no need to adjust. When the run finishes, you should see summary info like below.
Soft_clipped (reads): 631524
No corrected barcode (of all paired fragments): 0
Not properly mapped (fragments): 0
Total unique fragments: 20434638
Total duplicate fragments: 194386
Thank you very much, it worked!
Regarding the get_whitelist_barcode.py script, it states:
"""
Given whitelist_barcode folder, index_fastq, outfile_prefix, select the correct whitelist with the highest overlapping fraction. If fraction less than 0.5, exits with error msg.
"""
Wouldn't you want the barcode_fastq instead of the index_fastq?
The index_fastq here refers to the fastq file containing the cell barcodes, also called the cell index.
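For intuition, the "overlapping fraction" the script selects on can be approximated by hand; a rough sketch, where whitelist.txt stands in for one of the whitelist files (hypothetical name):
zcat subset.fastq.gz | awk 'NR % 4 == 2' > observed.txt    # extract the barcode sequence lines
hits=$(grep -Fxc -f whitelist.txt observed.txt || true)    # reads whose barcode is in the whitelist
total=$(wc -l < observed.txt)
awk -v h="$hits" -v t="$total" 'BEGIN { print h / t }'     # the fraction; must be >= 0.5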