COMBINE-lab/usefulaf

Error when running `index`

naity2 opened this issue · 8 comments

Hi @DongzeHE,

Sorry to bother you again. I pulled the latest Docker image (5edb1b9eb9ee), but the pipeline still fails. I have pasted the log below. FYI, the files splici_fl86_t2g_3col.tsv and splici_fl86.fa do exist in the ref subdirectory.

Thank you very much!

ALEVIN_FRY_HOME did not exist; creating it.
salmon 1.8.0
salmon version 1.8.0 is sufficiently new.
alevin-fry 0.5.1
alevin-fry version 0.5.1 is sufficiently new.
/opt/conda/bin/time command appears to execute a valid GNU time

Extracting the splici reference using command 

 ./build_splici_ref.R /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/genome.fa /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/genes.gtf 91 /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/custome_index/ref/    --filename-prefix splici 

trying URL 'http://cran.r-project.org/src/contrib/BiocManager_1.30.18.tar.gz'
Content type 'application/x-gzip' length 289602 bytes (282 KB)
==================================================
downloaded 282 KB

* installing *source* package ‘BiocManager’ ...
** package ‘BiocManager’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (BiocManager)

The downloaded source packages are in
	‘/tmp/Rtmpxgx7wq/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: http://cran.r-project.org

Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.3 (2022-03-10)
Installing package(s) 'BiocVersion', 'fishpond'
also installing the dependencies ‘abind’, ‘gtools’, ‘qvalue’, ‘svMisc’

trying URL 'http://cran.r-project.org/src/contrib/abind_1.4-5.tar.gz'
Content type 'application/x-gzip' length 21810 bytes (21 KB)
==================================================
downloaded 21 KB

trying URL 'http://cran.r-project.org/src/contrib/gtools_3.9.2.1.tar.gz'
Content type 'application/x-gzip' length 223562 bytes (218 KB)
==================================================
downloaded 218 KB

trying URL 'https://bioconductor.org/packages/3.14/bioc/src/contrib/qvalue_2.26.0.tar.gz'
Content type 'application/x-gzip' length 2759011 bytes (2.6 MB)
==================================================
downloaded 2.6 MB

trying URL 'http://cran.r-project.org/src/contrib/svMisc_1.2.3.tar.gz'
Content type 'application/x-gzip' length 148255 bytes (144 KB)
==================================================
downloaded 144 KB

trying URL 'https://bioconductor.org/packages/3.14/bioc/src/contrib/BiocVersion_3.14.0.tar.gz'
Content type 'application/x-gzip' length 969 bytes
==================================================
downloaded 969 bytes

trying URL 'https://bioconductor.org/packages/3.14/bioc/src/contrib/fishpond_2.0.1.tar.gz'
Content type 'application/x-gzip' length 2144860 bytes (2.0 MB)
==================================================
downloaded 2.0 MB

* installing *source* package ‘abind’ ...
** package ‘abind’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (abind)
* installing *source* package ‘gtools’ ...
** package ‘gtools’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG   -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1648493055476/work=/usr/local/src/conda/r-base-4.1.3 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix  -c init.c -o init.o
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG   -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1648493055476/work=/usr/local/src/conda/r-base-4.1.3 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix  -c roman2int.c -o roman2int.o
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG   -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1648493055476/work=/usr/local/src/conda/r-base-4.1.3 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix  -c setTCPNoDelay.c -o setTCPNoDelay.o
x86_64-conda-linux-gnu-cc -shared -L/opt/conda/lib/R/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/lib -Wl,-rpath-link,/opt/conda/lib -L/opt/conda/lib -o gtools.so init.o roman2int.o setTCPNoDelay.o -L/opt/conda/lib/R/lib -lR
installing to /opt/conda/lib/R/library/00LOCK-gtools/00new/gtools/libs
** R
** data
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (gtools)
* installing *source* package ‘qvalue’ ...
** using staged installation
** R
** data
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (qvalue)
* installing *source* package ‘svMisc’ ...
** package ‘svMisc’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (svMisc)
* installing *source* package ‘BiocVersion’ ...
** using staged installation
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (BiocVersion)
* installing *source* package ‘fishpond’ ...
** using staged installation
** libs
x86_64-conda-linux-gnu-c++ -std=gnu++11 -I"/opt/conda/lib/R/include" -DNDEBUG  -I'/opt/conda/lib/R/library/Rcpp/include' -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib   -fpic  -fvisibility-inlines-hidden  -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1648493055476/work=/usr/local/src/conda/r-base-4.1.3 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix  -c RcppExports.cpp -o RcppExports.o
x86_64-conda-linux-gnu-c++ -std=gnu++11 -I"/opt/conda/lib/R/include" -DNDEBUG  -I'/opt/conda/lib/R/library/Rcpp/include' -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib   -fpic  -fvisibility-inlines-hidden  -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1648493055476/work=/usr/local/src/conda/r-base-4.1.3 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix  -c readEDS.cpp -o readEDS.o
x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared -L/opt/conda/lib/R/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/lib -Wl,-rpath-link,/opt/conda/lib -L/opt/conda/lib -o fishpond.so RcppExports.o readEDS.o -L/opt/conda/lib/R/lib -lR
installing to /opt/conda/lib/R/library/00LOCK-fishpond/00new/fishpond/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (fishpond)

The downloaded source packages are in
	‘/tmp/Rtmpxgx7wq/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'AnnotationDbi', 'BiocFileCache', 'biomaRt', 'GenomeInfoDb',
  'GenomicFeatures', 'gert', 'glmnet', 'later', 'limma', 'openssl',
  'randomForest', 'RSQLite', 'S4Vectors', 'usethis'
Warning message:
package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'eisaR' 'BSgenome' 
Downloading GitHub repo COMBINE-lab/roe@HEAD

Skipping 11 packages not available: SingleCellExperiment, fishpond, S4Vectors, GenomicRanges, GenomicFeatures, GenomeInfoDb, eisaR, BSgenome, Biostrings, BiocGenerics, Biobase
* checking for file ‘/tmp/Rtmpxgx7wq/remotes1d66ba9b37/COMBINE-lab-roe-0733b48/DESCRIPTION’ ... OK
* preparing ‘roe’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
Omitted ‘LazyData’ from DESCRIPTION
* building ‘roe_0.2.1.tar.gz’

* installing *source* package ‘roe’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (roe)
- Locating required files...
- Processing spliced transcripts and introns...
  - Loading the input files
  - Processing spliced transcripts
  - Processing introns
  - Extracting sequences from the genome
- Writing outputs
  - Writing spliced and intron sequences
Done

Done. Building index.

building index:
./simpleaf: line 238: ./build_splici_ref.R: No such file or directory
command: ./build_splici_ref.R /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/genome.fa /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/genes.gtf 91 /home/ubuntu/data/nextflow/work/c7/cb0af62f40d2e67960a090c9dfccd2/custome_index/ref/    --filename-prefix splici
=============

Hello @naity2,

Don't say that! This is absolutely my fault. I hit something on my keyboard right before I committed. I apologize for the mistake, and thank you so much for bearing with me.

The Docker image has been updated. Could you please try one more time? Thank you so much for your patience.

-Dongze

Thank you @DongzeHE!

I made progress until encountering this error:

[2022-05-27 16:10:22.194] [puff::index::jointLog] [info] mphf size = 949.331 MB
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk size = 131360999
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 0 = [0, 131361001)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 1 = [131361001, 262722000)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 2 = [262722000, 394083006)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 3 = [394083006, 525444005)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 4 = [525444005, 656805011)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 5 = [656805011, 788166010)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 6 = [788166010, 919527015)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 7 = [919527015, 1050888014)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 8 = [1050888014, 1182249032)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 9 = [1182249032, 1313610031)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 10 = [1313610031, 1444971030)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 11 = [1444971030, 1576332029)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 12 = [1576332029, 1707693028)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 13 = [1707693028, 1839054027)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 14 = [1839054027, 1970415026)
[2022-05-27 16:10:25.552] [puff::index::jointLog] [info] chunk 15 = [1970415026, 2101775943)
[2022-05-27 16:10:59.311] [puff::index::jointLog] [info] finished populating pos vector
[2022-05-27 16:10:59.311] [puff::index::jointLog] [info] writing index components
[2022-05-27 16:11:02.429] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2022-05-27 16:11:03.160] [jLog] [info] done building index
for info, total work write each  : 2.331    total work inram from level 3 : 4.322  total work raw : 25.000 
Bitarray      7963247552  bits (100.00 %)   (array + ranks )
final hash        316176  bits (0.00 %) (nb in final hash 941)
cp: cannot stat '/home/ubuntu/data/nextflow/work/5c/cdf68384993c8cad32347c679c3f10/custome_index/ref//transcriptome_splici_fl86_t2g_3col.tsv': No such file or directory

Hi @naity2,

Could you try again?

-Dongze

I also noticed that you are using Nextflow. We recently wrote a Nextflow workflow for alevin-fry, and we provide the quantification results for many public datasets that we have processed. 😀

Thank you, @DongzeHE!

I have successfully finished running the pipeline and will check out your nextflow workflow.

May I ask you one more question? I used the load_fry() function to load the output into an SCE object. However, the rownames are Ensembl IDs instead of gene names. Would it be possible to add gene names to the index/t2g_3col.tsv file so that I can map the gene IDs to names for downstream analysis?

Hi @naity2 ,

Great! I apologize for all the mistakes I made, and thank you so much for bearing with me.

As for your question: unfortunately, we can't change the format of the t2g_3col.tsv file, because alevin-fry requires this specific format. However, you can generate a gene ID to gene name mapping with the following commands, provided that gffread is installed. If it isn't, you can install it via conda or from its repo:

# generate gff from gtf
$ gffread genes.gtf -o genes.gff

# make the file
$ grep "gene_name" genes.gff | cut -f9 | cut -d';' -f2,3 | sed 's/=/ /g' | sed 's/;/ /g' | cut -d' ' -f2,4 | sort | uniq > geneid_to_name.txt

I will also try to include the geneid_to_name.txt file in the output folder of ./simpleaf index soon. Thank you for your suggestions!
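
If the intermediate GFF conversion is inconvenient, the same mapping can likely be pulled straight from the GTF. The following is only a sketch under assumptions: it expects Ensembl/GENCODE-style attributes of the form `gene_id "..."; gene_name "...";` in column 9, and the function name `gtf_gene_id_to_name` is hypothetical (it is not part of simpleaf, roe, or gffread). Records without a gene_name attribute are skipped.

```shell
# Hypothetical helper: print unique "gene_id<TAB>gene_name" pairs from a GTF.
# Assumes GTF-style attributes in column 9: gene_id "X"; gene_name "Y";
gtf_gene_id_to_name() {
  awk -F '\t' '
    $0 !~ /^#/ && NF >= 9 {
      id = ""; name = ""
      # The text before the value is 9 chars (gene_id ") or 11 chars
      # (gene_name "); one closing quote follows the value.
      if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
      if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
      if (id != "" && name != "") print id "\t" name
    }
  ' "$1" | sort -u
}

# Usage (file names are placeholders):
#   gtf_gene_id_to_name genes.gtf > geneid_to_name.tsv
```

Note that this variant writes a tab-separated file, whereas the gffread pipeline above produces a space-separated one.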

-Dongze

rob-p commented

Thanks for bearing with us @naity2!

I like your idea @DongzeHE. Additionally, I think we could add an optional argument in roe and pyroe to read such a mapping and provide gene names in addition to IDs as labels.
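
Until such an option exists, the relabeling idea can be sketched outside of R. This is an illustration only, not the roe or pyroe interface: the function name `relabel_gene_ids` and both input files are hypothetical. It assumes a two-column mapping file (gene ID, gene name, separated by a space or tab) and a file with one gene ID per line, and it keeps the ID as the label whenever no name is found.

```shell
# Hypothetical helper: relabel_gene_ids MAPPING_FILE ID_FILE
# Prints one label per input ID, in the original order.
relabel_gene_ids() {
  awk '
    NR == FNR { name[$1] = $2; next }         # first file: load ID -> name
    { print (($1 in name) ? name[$1] : $1) }  # second file: name, else the ID
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   relabel_gene_ids geneid_to_name.txt gene_ids.txt > gene_labels.txt
```

Falling back to the ID keeps the output the same length as the input, so the labels can be attached back to the rows in order. The sketch assumes the mapping file is non-empty.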

Awesome! Thank you @DongzeHE @rob-p again for this wonderful tool and have a great long weekend!