mhalushka/miRge3.0

Error with Alignment when performing differential expression analysis

Closed this issue · 24 comments

Hello,

I am having an issue with the alignment step of miRge3.0. My pre-processing step works just fine but I have no idea why my alignment step is not working. I assume it might have to do with my flags or the library but I'm not too sure.

I would really appreciate it if I can get some help!

My command I used is this (this command was taken from the script I made):
miRge3.0 -s /data2/samir/miRNA_Pecaut/fastq/2018_03_14/CFG1721_0Gy_48hr_Male.fastq.gz -lib /home/samir/miRNA_Pecaut/miRge3_Lib -on mouse -db miRGeneDB -o /data2/samir/miRNA_Pecaut/miRge --bowtie-path /data2/samir/miRNA_Pecaut/miRge/bowtie-1.3.0-linux-x86_64 -cpu 12 -a AACTGTAGGCACCATCAAT -dex -mdt /home/samir/miRNA_Pecaut/DESmetadata.csv

My run.log is down below:
bowtie version: 1.3.0
cutadapt version: 2.8
Samtools version: 1.10
Collecting and validating input files...

miRge3.0 will process 1 out of 1 input file(s).

Cutadapt finished for file CFG1721_0Gy_48hr_Male in 48.2767 second(s)
Collapsing finished for file CFG1721_0Gy_48hr_Male in 0.4544 second(s)

Matrix creation finished in 0.647 second(s)

Data pre-processing completed in 50.0465 second(s)

Alignment in progress ...
Traceback (most recent call last):
File "/home/samir/.local/bin/miRge3.0", line 8, in
sys.exit(main())
File "/home/samir/.local/lib/python3.8/site-packages/mirge/main.py", line 105, in main
pdDataFrame = bwtAlign(args,pdDataFrame,workDir,ref_db)
File "/home/samir/.local/lib/python3.8/site-packages/mirge/libs/manifoldAlign.py", line 100, in bwtAlign
alignPlusParse(bwtExec, bwt_iter, pdDataFrame, args, workDir)
File "/home/samir/.local/lib/python3.8/site-packages/mirge/libs/manifoldAlign.py", line 19, in alignPlusParse
bowtie = subprocess.run(str(bwtExec), shell=True, check=True, stdout=subprocess.PIPE, text=True, stderr=subprocess.PIPE, universal_newlines=True)
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '/data2/samir/miRNA_Pecaut/miRge/bowtie-1.3.0-linux-x86_64/bowtie -x /home/samir/miRNA_Pecaut/miRge3_Lib/mouse/index.Libs/mouse_mirna_MirGeneDB -n 0 -f --norc -S --threads 12 /data2/samir/miRNA_Pecaut/miRge/miRge.2022-07-11_11-29-59/bwtInput.fasta' returned non-zero exit status 1.

Thank you for your time!
-Samir

Hi @Samir-A,

This is probably due to incorrect path of the miRge libraries, can you try the following command and let me know if that works.

$ /data2/samir/miRNA_Pecaut/miRge/bowtie-1.3.0-linux-x86_64/bowtie -x /home/samir/miRNA_Pecaut/miRge3_Lib/mouse/index.Libs/mouse_mirna_MirGeneDB -n 0 -f --norc -S --threads 12 /data2/samir/miRNA_Pecaut/miRge/miRge.2022-07-11_11-29-59/bwtInput.fasta > test_out.sam

This will guide us if the error is with path of miRge libraries. Also, can you confirm the size of the temp bowtie input file by the following command:

ls -lh /data2/samir/miRNA_Pecaut/miRge/miRge.2022-07-11_11-29-59/bwtInput.fasta

Thanks,
Arun.

Thanks for reaching out Arun!

So I tried those commands and my fasta file does have size but my library does not exist for bowtie. I put the output of the commands in a text document.

I followed the directions to create the miRge library from the website but is seems the library for bowtie is not being created. Do I need to run a command to build a bowtie index or does the miRge command make the index for me?

miRge.txt

@Samir-A,

We provide pre-indexed libraries. It is probably incorrect path, can you try the following and show me the list of index files in the directory:

$ ls /home/samir/miRNA_Pecaut/miRge3_Lib/
$ ls /home/samir/miRNA_Pecaut/miRge3_Lib/mouse/index.Libs/

Thank you,
Arun

Yes! I put the output into the text doc.

Was I supposed to copy over a /mouse/index.Libs/ index folder?

Thanks,
Samir

miRge.txt

@Samir-A ,

You have skipped a step from the documentation. That is to unzip the file. Can you follow the command and this should fix the issue.
$ cd /home/samir/miRNA_Pecaut/miRge3_Lib/
$ tar -xzf mouse.tar.gz
This will have all the required subfolders (so, no need to copy over).
Ref: https://mirge3.readthedocs.io/en/master/quick_start.html#command-line-interface-cli

This will extract all the index files and bowtie will then be able to locate it and align to the reference libraries. Please let me know if you have other questions.

Also, to note, you have provided just one file CFG1721_0Gy_48hr_Male.fastq.gz and -dex requires two or more input files. If you are running in a script (as you have mentioned previously) to perform DE, you should have all the input files seperated by comma. For more detailed documentation, click here. I hope this helps.

Thank you,
Arun.

Ahh my mistake. Thanks for the great support! Also, nice catch for noticing I just used one file, did this because I wanted to test my command using one file before I set up my script to run on all my fastq files. One more thing, I can just use the *fastq.gz notation rather than listing the fastq's with ","'s right?

Thanks,
Samir

@Samir-A ,

Thank you.

Unfortunately, *.fastq.gz won't work. Instead you can try creating a folder named, input_files and move all your files there and then you can provide the folder name to -s as shown below:

miRge3.0 -s input_files

Alternatively, you can list all the file names in a text file and provide the text file name as input:
$ ls *.fastq.gz > input_files.txt
$ miRge3.0 -s input_files.txt

miRge3.0 will verify if the files are fastq or fastq.gz so, if there are any other files in the list or folder it will be omitted. I hope this helps.

Thank you,
Arun.

Hello Arun,

So I unzipped the library and set up my command and it does run but when I get to the running differential expression part, it say I am missing a package. I did install the R package locfit but it still says I don't have the package when I try the command again. Any ideas?

Here is my run.log:

bowtie version: 1.3.0
cutadapt version: 2.8
Samtools version: 1.10
Collecting and validating input files...

miRge3.0 will process 5 out of 5 input file(s).

Cutadapt finished for file CFG1721_0Gy_48hr_Male in 48.9274 second(s)
Collapsing finished for file CFG1721_0Gy_48hr_Male in 0.4676 second(s)

Cutadapt finished for file CFG1724_0Gy_4hr_Male in 58.9655 second(s)
Collapsing finished for file CFG1724_0Gy_4hr_Male in 6.4946 second(s)

Cutadapt finished for file CFG1725_0Gy_4hr_Male in 57.2982 second(s)
Collapsing finished for file CFG1725_0Gy_4hr_Male in 10.8474 second(s)

Cutadapt finished for file CFG1728_0Gy_48hr_Male in 45.6886 second(s)
Collapsing finished for file CFG1728_0Gy_48hr_Male in 10.4223 second(s)

Cutadapt finished for file CFG1735_0Gy_4hr_Male in 62.6832 second(s)
Collapsing finished for file CFG1735_0Gy_4hr_Male in 17.873 second(s)

Matrix creation finished in 4.5638 second(s)

Data pre-processing completed in 329.0319 second(s)

Alignment in progress ...
Alignment completed in 101.6451 second(s)

Summarizing and tabulating results...
Summary completed in 13.7124 second(s)

Performing differential expression...

Error: package or namespace load failed for ‘DESeq2’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
there is no package called ‘locfit’
Execution halted

The analysis completed in 453.6618 second(s)

done

Thanks for the help!
Samir

@Samir-A ,

The package DESeq2 is missing in R. This is one time installation only, can you follow the steps below:

Type R on command prompt to open R: $ R
Once you enter in the R terminal, execute the following commands.
> if (!require("BiocManager", quietly = TRUE))
> install.packages("BiocManager")
>BiocManager::install("DESeq2")
Ref: https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Install ggplot
>install.packages(“ggplot2”)

Thank you,
Arun.

Hey Arun,

Thanks for the help! Everything seems to work now!

But I do want to ask a quick question. Which reference genome in the mouse library? Is it mm9 or mm10?

Thanks,
-Samir

@Samir-A,

Cool. GRCm38 is the genome version (mm10).

Thank you,
Arun

Hey Arun,

I got an error when running miRge3.0 that I am not sure is bad. Do you mind telling me if it will disrupt my data?

mkdir: cannot create directory ‘/data2/samir/miRNA_Pecaut/miRge’: File exists
mkdir: cannot create directory ‘/data2/samir/miRNA_Pecaut/miRge/All_Batches’: File exists
CFG1724_0Gy_4hr_Male.fastq.gz,CFG1725_0Gy_4hr_Male.fastq.gz,CFG1735_0Gy_4hr_Male.fastq.gz,CFG2249_0Gy_4hr_Male.fastq.gz,CFG2376_0Gy_4hr_Male.fastq.gz,CFG1734_2Gy_4hr_Male.fastq.gz,CFG1746_2Gy_4hr_Male.fastq.gz,CFG1747_2Gy_4hr_Male.fastq.gz,CFG2253_2Gy_4hr_Male.fastq.gz,CFG2375_2Gy_4hr_Male.fastq.gz
bowtie version: 1.3.0
cutadapt version: 2.8
Samtools version: 1.10
Collecting and validating input files...

miRge3.0 will process 10 out of 10 input file(s).

Cutadapt finished for file CFG1724_0Gy_4hr_Male in 57.2258 second(s)
Collapsing finished for file CFG1724_0Gy_4hr_Male in 0.804 second(s)

Cutadapt finished for file CFG1725_0Gy_4hr_Male in 56.288 second(s)
Collapsing finished for file CFG1725_0Gy_4hr_Male in 8.6722 second(s)

Cutadapt finished for file CFG1735_0Gy_4hr_Male in 62.8297 second(s)
Collapsing finished for file CFG1735_0Gy_4hr_Male in 13.8347 second(s)

Cutadapt finished for file CFG2249_0Gy_4hr_Male in 81.2089 second(s)
Collapsing finished for file CFG2249_0Gy_4hr_Male in 13.7892 second(s)

Cutadapt finished for file CFG2376_0Gy_4hr_Male in 35.2416 second(s)
Collapsing finished for file CFG2376_0Gy_4hr_Male in 13.9044 second(s)

Cutadapt finished for file CFG1734_2Gy_4hr_Male in 58.4758 second(s)
Collapsing finished for file CFG1734_2Gy_4hr_Male in 19.7145 second(s)

Cutadapt finished for file CFG1746_2Gy_4hr_Male in 41.2486 second(s)
Collapsing finished for file CFG1746_2Gy_4hr_Male in 22.887 second(s)

Cutadapt finished for file CFG1747_2Gy_4hr_Male in 48.5739 second(s)
Collapsing finished for file CFG1747_2Gy_4hr_Male in 25.6193 second(s)

Cutadapt finished for file CFG2253_2Gy_4hr_Male in 87.6589 second(s)
Collapsing finished for file CFG2253_2Gy_4hr_Male in 27.7313 second(s)

Error in DESeqDataSetFromMatrix(countData = countData, colData = metaData, :
ncol(countData) == nrow(colData) is not TRUE
Calls: DESeqDataSetFromMatrix -> stopifnot
Execution halted
Cutadapt finished for file CFG2375_2Gy_4hr_Male in 48.4084 second(s)
Collapsing finished for file CFG2375_2Gy_4hr_Male in 27.6984 second(s)

Matrix creation finished in 8.0713 second(s)

Data pre-processing completed in 768.5629 second(s)

Alignment in progress ...
Alignment completed in 163.7851 second(s)

Summarizing and tabulating results...
Summary completed in 26.3132 second(s)

Performing differential expression...

The analysis completed in 968.8842 second(s)

done
[3] Done nohup sh runmiRge.sh > runmiRge.out

@Samir-A ,

I don't understand the error clearly, what was your input command line argument?
With respect to DESeq2, your metadata information and your input query FASTQ file names don't match. You can refer to the metadata example from here.

Thank you,
Arun.

Yes and the vice versa is also true, that you have more input files than the names mentioned in metadata.csv file. (you should have metadata only for samples you are processing).

The output contains a PCA and a Volcano plot, see the PCA and assess if the like samples are clustering togather, or if there are any major variance driving the samples. If your PCA looks good, then look at what miRNAs are significantly differentially expressed and check if those miRNAs have been reported in literature to work that is similar to your analysis. Further, you can study targets and validate those downstream. (I know you thought of this, but just to mention - you should discuss the results and interpretation with your lab head or PI).

I hope this helps.

Thank you,
Arun.

Hey Arun,

So I got another error with differential expression. Do you have any idea why the processing stops at differential expression?

I also found a fix for the issue online. Here is the link: https://support.bioconductor.org/p/82136/

I put the run.log below and my script I am using.

cutadapt version: 2.8
Samtools version: 1.10
Collecting and validating input files...

miRge3.0 will process 10 out of 10 input file(s).

Cutadapt finished for file CFG1724_0Gy_4hr_Male in 58.3145 second(s)
Collapsing finished for file CFG1724_0Gy_4hr_Male in 0.8755 second(s)

Cutadapt finished for file CFG1725_0Gy_4hr_Male in 56.8507 second(s)
Collapsing finished for file CFG1725_0Gy_4hr_Male in 9.4328 second(s)

Cutadapt finished for file CFG1735_0Gy_4hr_Male in 61.855 second(s)
Collapsing finished for file CFG1735_0Gy_4hr_Male in 15.117 second(s)

Cutadapt finished for file CFG2249_0Gy_4hr_Male in 82.5928 second(s)
Collapsing finished for file CFG2249_0Gy_4hr_Male in 14.8281 second(s)

Cutadapt finished for file CFG2376_0Gy_4hr_Male in 37.3864 second(s)
Collapsing finished for file CFG2376_0Gy_4hr_Male in 15.0727 second(s)

Cutadapt finished for file CFG1734_2Gy_4hr_Male in 58.3754 second(s)
Collapsing finished for file CFG1734_2Gy_4hr_Male in 19.4078 second(s)

Cutadapt finished for file CFG1746_2Gy_4hr_Male in 38.5711 second(s)
Collapsing finished for file CFG1746_2Gy_4hr_Male in 21.4128 second(s)

Cutadapt finished for file CFG1747_2Gy_4hr_Male in 48.3227 second(s)
Collapsing finished for file CFG1747_2Gy_4hr_Male in 24.7802 second(s)

Cutadapt finished for file CFG2253_2Gy_4hr_Male in 77.8626 second(s)
Collapsing finished for file CFG2253_2Gy_4hr_Male in 27.6064 second(s)

Cutadapt finished for file CFG2375_2Gy_4hr_Male in 48.8205 second(s)
Collapsing finished for file CFG2375_2Gy_4hr_Male in 27.7913 second(s)

Matrix creation finished in 8.3169 second(s)

Data pre-processing completed in 762.3164 second(s)

Alignment in progress ...
Alignment completed in 166.4888 second(s)
Summarizing and tabulating results...
Summary completed in 27.6557 second(s)

Performing differential expression...

converting counts to integer mode
Error in DESeqDataSet(se, design = design, ignoreRank) :
all variables in design formula must be columns in colData
Calls: DESeqDataSetFromMatrix -> DESeqDataSet
Execution halted

The analysis completed in 968.0665 second(s)

Thanks for all your help, you have been very helpful!
script.txt

@Samir-A ,
Can you send me a sample DESmetadata.csv file?

Hey Arun,

Sure,
DESmetadata.csv

@Samir-A ,

Got it. As per documentation, the metadata column header should be id,group. This is what is read in the R-script (here). Sorry, this requirement is case sensitive. Let me know if this works out.

At line 33 in your shell script, you need to edit awk 'BEGIN{print "id,Group"} to awk 'BEGIN{print "id,group"}. This should fix the problem.

Thank you,
Arun.

Thanks Arun, that did the trick. Also, it there a way to name the output folder a specific name?

Thanks,
Samir

@Samir-A ,

Great. Yes, there is an option to name the output folder with the following parameter:
-onam or --outDirName, this is not part of documentation or command line arguments (to retain the development default settings and to avoid special characters in the names created by users). Here is an example usage.

miRge3.0 -s SRR772403.fastq -a illumina -lib miRge3_Lib -on human -db miRGeneDB -o output_dir -onam myOwn_Folder_Name

Plesae note: You should not provide spaces or special characters in the output folder name when using the onam parameter. This may lead to system errors (I/O errors).

I hope this helps.

Thank you,
Arun.

Hey Arun,

The command works great! I do want to ask about the -bam flag though. So when I put the -bam flag to get the bam files for each of my replicates and ran samtools view on my bams, I did not get any output.

I wanted to ask is there a specific way I need to use the bam flag?

Here is my script:
script.txt

Thanks,
Samir

Hi @Samir-A,

I thought I answered your question while I was responding to another issue simultaneously. This must have skipped my mind. The bam option works for human libraries and is not implemented for other species. However, we are now revising the llbraires to incorporate updates and adding bam argument is one of them.

Thank you,
Arun.

Hey Arun,

Thanks for letting me know. So, is there an alternative way to visualize our data then? (other than the volcano plots and PCAs)

Thanks,
Samir

@Samir-A,

There is this html miRge3_visualization.html in the output directory, that will enable interactive visualization of summary, most abundant miRNAs across each sample etc. However, if your main goal is to visualize read coverage between two groups, we are working on it, it should be available soon. On the other hand, you could consider plotting heatmaps in R (for significantly differentially expressed miRNAs) using miR.RPM.csv across two groups which I feel is more acceptable for reporting in a scientific journal. Let me know your thoughts.

pheatmap

Thank you,
Arun.