using own fa file and gtf file error: gzip: stdin: not in gzip format

Question

Opened this issue 4 months ago · 8 comments

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa to a .gz file and add the link to download.sh, it shows the error: gzip: stdin: not in gzip format.

If I use the .fa file directly, I get a different error:
EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

I have checked the first character:
head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Do you have any suggestions for this problem? I want to use my own .fa and .gtf files.

Answer 1 · 2024-09-05T04:08:01.000Z

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.
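To tell the two formats apart, it may help to test the file with gzip itself and to look at its first two bytes. The following is only a minimal sketch on a tiny stand-in file (demo.fa and the scratch directory are hypothetical, not the real genome):

```shell
# Scratch directory with a tiny stand-in FASTA (hypothetical file, not the real genome).
workdir=$(mktemp -d)
printf '>1 demo\nNNNNNNNN\n' > "$workdir/demo.fa"

# Compress with gzip (not zip); -c writes to stdout, so the original file is kept.
gzip -c "$workdir/demo.fa" > "$workdir/demo.fa.gz"

# gzip -t tests the file and fails if it is not in gzip format.
if gzip -t "$workdir/demo.fa.gz"; then
    echo "demo.fa.gz is valid gzip"
fi

# gzip files start with the magic bytes 1f 8b; zip archives start with "PK" (50 4b).
magic=$(head -c 2 "$workdir/demo.fa.gz" | od -An -tx1 | tr -d ' \n')
echo "magic bytes: $magic"
```

If the same check on the file that the script downloads prints something other than 1f8b, the file is not a gzip file, whatever its extension says.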

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Answer 2 · 2024-09-05T04:17:11.000Z

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Sorry, the > was lost when I copied the text. I think my file already has this format, but it still shows the same error:

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Sep 05 00:16:05 ...... FATAL ERROR, exiting

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Answer 3 · 2024-09-05T04:19:31.000Z

I just found that the > cannot be copied.

Answer 4 · 2024-09-05T04:27:54.000Z

Your FastA file starts with a line break. You must remove the line break.

sed -i '/^$/D' my_genomeviral.fa
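Note that the error message names my_genomeviral.fa, so that is the file whose first byte matters, not the original download. A rough way to check the first byte before and after removing empty lines is sketched below. It runs on a small stand-in file (demo.fa is hypothetical) and writes the cleaned copy to a new file instead of editing in place:

```shell
# Stand-in FASTA that starts with an empty line (hypothetical file).
workdir=$(mktemp -d)
fasta="$workdir/demo.fa"
printf '\n>1 demo\nNNNNNNNN\n' > "$fasta"

# Print the first byte of a file as a decimal character code.
first_byte() {
    head -c 1 "$1" | od -An -tu1 | tr -d ' \n'
}

before=$(first_byte "$fasta")
echo "before: $before"   # 10 is a line feed, as in the STAR error message

# Delete empty lines; the result goes to a new file rather than being edited in place.
sed '/^$/d' "$fasta" > "$fasta.clean"

after=$(first_byte "$fasta.clean")
echo "after: $after"     # 62 is the '>' character
```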

Answer 5 · 2024-09-05T04:43:46.000Z

Your FastA file starts with a line break. You must remove the line break.
sed -i '/^$/D' my_genomeviral.fa

Yes, I have used this before, but it still shows the same error:

head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Answer 6 · 2024-09-05T06:23:03.000Z

Hm, it probably has something to do with how you added the file to the download_references.sh script. Can you attach your modified script?

Answer 7 · 2024-09-05T14:59:03.000Z

ASSEMBLIES[my_genome]=" https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/ Homo_sapiens.GRCh38.dna.primary_assembly.fa <https://m-d953b2.9601a.bd7c.data.globus.org/umms-chadbren-dataden/apurvadb/10236-CB/Homo_sapiens.GRCh38.dna.primary_assembly.fa>
ANNOTATIONS[my_annotation]=" https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/Homo_sapiens.GRCh38.105.transcript.gtf
COMBINATIONS["my_genome+my_annotation"]="my_genome+my_annotation"

I just added these links to download_references.sh. I also found that if I use the ASSEMBLIES entry you provide together with my own ANNOTATIONS file, it shows this error:

Sep 05 01:30:42 ...... FATAL ERROR, exiting

.log file:

Downloading assembly: http://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Appending RefSeq viral genomes
Downloading annotation: https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/Homo_sapiens.GRCh38.105.transcript.gtf
STAR --runMode genomeGenerate --genomeDir STAR_index_my_genomeviral_my_annotation --genomeFastaFiles my_genomeviral.fa --sjdbGTFfile my_annotation.gtf --runThreadN 8 --sjdbOverhang 250
STAR version: 2.7.11b compiled: 2024-08-28T19:55:43-04:00 gl-login2.arc-ts.umich.edu: /gpfs/accounts/chadbren_root/chadbren99/ppxinyi/STAR-2.7.11b/source
Sep 05 01:29:52 ..... started STAR run
Sep 05 01:29:52 ... starting to generate Genome files
Sep 05 01:30:42 ..... processing annotations GTF

Answer 8 · 2024-09-09T22:30:37.000Z

Sorry for the slow response. I am currently busy. I will have more time next week.

In the meantime, have you tried building the STAR index yourself? What the script does is actually not that complicated. All it does is concatenate the RefSeq viruses to the human genome and then build a STAR index from the result. Maybe it's easiest if you build the index manually?
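A manual build along those lines could look roughly like the sketch below. It is not the script's actual code: the helper names combine_fasta and build_star_index and all file names are hypothetical, and the STAR options are simply copied from the log posted above. The concatenation step is demonstrated on two tiny stand-in files, and the STAR call is only defined, not run.

```shell
# Concatenate FASTA files into one; awk 'NF' drops blank lines and
# terminates every line with a newline, even if an input lacks a final one.
combine_fasta() {
    out=$1
    shift
    awk 'NF' "$@" > "$out"
}

# Hypothetical wrapper around the STAR call from the log above.
# Arguments: index directory, combined FASTA, GTF annotation.
build_star_index() {
    STAR --runMode genomeGenerate --genomeDir "$1" \
        --genomeFastaFiles "$2" --sjdbGTFfile "$3" \
        --runThreadN 8 --sjdbOverhang 250
}

# Demo of the concatenation on stand-in files (not real genomes).
workdir=$(mktemp -d)
printf '\n>1 demo human\nNNNN' > "$workdir/human.fa"    # leading blank line, no final newline
printf '>NC_demo virus\nACGT\n' > "$workdir/viral.fa"
combine_fasta "$workdir/combined.fa" "$workdir/human.fa" "$workdir/viral.fa"

headers=$(grep -c '^>' "$workdir/combined.fa")
echo "sequences in combined file: $headers"

# With real inputs, the index build would then be something like:
# build_star_index STAR_index_dir combined.fa my_annotation.gtf
```

Before running STAR on the real files, it is worth checking that the combined FASTA starts with > and that the GTF uses the same chromosome names as the FASTA headers.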