suhrig/arriba

using own fa file and gtf file error: gzip: stdin: not in gzip format

Opened this issue · 8 comments

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa to a .gz file and add the link to download.sh, it shows the error: gzip: stdin: not in gzip format.

If I use the .fa file directly, I get a different error:
EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

I have checked the first character:
head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Do you have any suggestions for this problem? I want to use my own .fa and .gtf files.

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.
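A minimal sketch of the difference, using a tiny stand-in file (example.fa and its contents are hypothetical). gzip -c writes a gzip stream and gzip -t checks that a file is one; a .zip archive that merely carries a .gz name fails that check. The final line decompresses from a stream, which is what the "stdin" in the error message suggests the script does (an assumption, not something confirmed in this thread).

```shell
# Create a tiny stand-in FastA file (hypothetical name and contents).
printf '>1 test\nACGT\n' > example.fa

# Compress with gzip (not zip), keeping the original file.
gzip -c example.fa > example.fa.gz

# Verify the result: gzip -t exits 0 only for a valid gzip stream.
if gzip -t example.fa.gz; then
    echo "example.fa.gz is valid gzip"
fi

# Decompress to a stream and show the first line.
gunzip -c example.fa.gz | head -n 1
```

Running gzip -t on the file you uploaded should tell you quickly whether the archive itself is the problem.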

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Sorry, I missed copying the >. I think my file already has this format, but it still shows the same error:

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Sep 05 00:16:05 ...... FATAL ERROR, exiting

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I just found that the > cannot be copied.

Your FastA file starts with a line break. You must remove the line break.

sed -i '/^$/D' my_genomeviral.fa
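To confirm what STAR actually reads, it may help to inspect the first byte of the exact file named in the error message (my_genomeviral.fa, not Homo_sapiens.GRCh38.dna.primary_assembly.fa). A small sketch on a stand-in file; demo.fa, demo.fixed.fa and the first_byte helper are hypothetical names used only for illustration:

```shell
# Stand-in for my_genomeviral.fa with a leading empty line (hypothetical contents).
printf '\n>1 test\nACGT\n' > demo.fa

# Print the first byte as a decimal character code: 10 is a line feed, 62 is '>'.
first_byte() { head -c 1 "$1" | od -An -tu1 | tr -d ' \n'; }

echo "before: $(first_byte demo.fa)"

# Drop empty lines. Writing to a new file sidesteps differences in how
# sed implementations handle the -i option.
sed '/^$/d' demo.fa > demo.fixed.fa

echo "after: $(first_byte demo.fixed.fa)"
```

If the code printed for your file is still 10 after the fix, the file is probably being regenerated (for example by the download script) after you edited it.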

Yes, I have used this before, but it still shows the same error.

head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Hm, it probably has something to do with how you added the file to the download_references.sh script. Can you attach your modified script?

Sorry for the slow response. I am currently busy. I will have more time next week.

In the meantime, have you tried building the STAR index yourself? What the script does is actually not so complicated. All it does is concatenate the RefSeq viruses to the human genome and then build a STAR index from the result. Maybe it's easiest if you build the index manually?
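A rough sketch of those two steps, not the script's actual code. The function names, file names, output directory and thread count are placeholders, the viral FastA is assumed to be downloaded already, and --sjdbOverhang 100 is simply STAR's documented default rather than a value taken from the script.

```shell
# Concatenate the human genome and the viral genomes into one
# uncompressed FastA file. Arguments: genome FastA, viral FastA, output.
combine_fasta() {
    cat "$1" "$2" > "$3"
}

# Build a STAR index from the combined FastA and a GTF annotation.
# Arguments: combined FastA, GTF file, index directory.
build_star_index() {
    mkdir -p "$3"
    STAR --runMode genomeGenerate \
         --genomeDir "$3" \
         --genomeFastaFiles "$1" \
         --sjdbGTFfile "$2" \
         --sjdbOverhang 100 \
         --runThreadN 8
}

# Example invocation (commented out because it needs the real reference files;
# viral.fa, annotation.gtf and STAR_index are placeholder names):
# combine_fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa viral.fa my_genomeviral.fa
# build_star_index my_genomeviral.fa annotation.gtf STAR_index
```

Before running the second step, check that the combined file starts with > and contains no empty lines.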