broadinstitute/Drop-seq

CD24A Expression Missing

Closed this issue · 14 comments

Hi,

I recently installed Dropseq v2.3.0 and am trying to compare this to v1.12 which I used to get my data from.
While the data extraction seem to be agreeing from the old version to the new, I am missing the CD24A/CD24 expressions.
This used to show up quite well when the data was run through in the 1.12 version, but I do not see a single expression of it in the new.

I've tried lowering the minimum barcode and genes read to 100 (I used to use 800 from the older version) and it still didn't show anything.

Is this a problem that has been observed before, or could a part of my library references, or my data be corrupted?
Any input will help.

Thanks,
Yehyun

Are you using the same GTF as previously? Having different gene annotations will definitely change some results.

The only change one might notice in the expression outputs between versions is how we handle reads that cross exon/intron boundaries. Previously, any read that had at least one base in an exon was counted as being intronic. Now, the analysis is more strict and reads must be entirely contained in an exon to be counted, so if your reads all cross an exon/intron boundry, you may have issues. If that's the case, you might try adding an argument to your DGE run LOCUS_FUNCTION_LIST=INTRONIC to include reads that are in introns (or span exon/intron boundaries) to see if those genes return.

If you inspect your reads and annotations for those genes with IGV, you may be able to confirm that reads aren't falling cleanly into exons (or spanning exons).

I am using the GTF that was posted in the link provided in the Drop-seq manual. I believe it is that same source that the previous version got its GTF from (I am not sure whether the website renewed/updated its files).

I've tried running LOCUS_FUNCTION_LIST=INTRONIC, but still no sign of CD24 genes were found.

It seems strange to me that an entire gene (CD24, or CD24A) information is missing from the data, when it was quite well expressed in the previous version.
It seems like other genes of interest somewhat agree with the previous version.

Please let me know if you have any other clues about this phenomenon.
Thanks much.

After some investigation, we noticed the following on hg19 in the original GTF:

HG357_PATCH protein_coding exon 107422453 107422630 . - . gene_id "ENSG00000272398"; transcript_id "ENST00000606017"; exon_number "1"; gene_name "CD24"; gene_biotype "protein_coding"; transcript_name "CD24-201"; exon_id "ENSE00003696480";

The particular gene you're interested in is on a contig that isn't included in the version of the hg19 sequence we used as metadata. This means that reads wouldn't align to it, and the gene model wouldn't be included in the metadata.

In GRCh38 (which we've been using for a few years) we see expression of this gene in iPSC cells that I happen to be working on currently. Looks like that gene has moved to chr6.

chr6 havana exon 106969831 106971834 . - . gene_id ENSG00000272398; gene_version 5; transcript_id ENST00000619133; transcript_version 4; exon_number 3; gene_name CD24; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name CD24-204; transcript_source havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS75499; exon_id ENSE00003727194; exon_version 1; tag basic; transcript_support_level 2;

I can't speak to which set of annotations you're currently using, but that's what I'd look for in your data. Switching versions of metadata clearly has an impact here, but it's not something we'd 'fix'.

Thanks for this comment, I think it solves much of my confusion.

I actually suspected something like this and downloaded different versions of GTF files, but due to my lack of understanding of the Dropseq codes, I could not get the create_Dropseq_reference_metadata.sh to create the files (it says a header from the gtf file is missing at one of the steps).

Could you provide a set of fasta and gtf files that would pass this and/or the line of codes that I can input to resolve this issue?

Thanks much in advance.

Here's the line call:
create_Drop-seq_reference_metadata.sh -n hg19_old -r /z/yechoi/references/GSM1629193_ERCC/GSM1629193_hg19_ERCC.fasta -s hg19 -g /z/yechoi/references/hg38/Homo_sapiens.GRCh38.101.gtf -d /z/yechoi/Desktop/Drop-seq_tools-2.3.0/ -o /z/yechoi/bams/new_bam -t $TMPDIR -a $STAR/STAR -b $BGZIP/bgzip -i $BGZIP/samtools

Here is the error msg that pops up at the ConvertToRefFlat step:
Exception in thread "main" picard.PicardException: No header line found in file /z/yechoi/bams/new_bam/hg19_old.gtf at picard.util.TabbedTextFileWithHeaderParser.<init>(TabbedTextFileWithHeaderParser.java:127) at org.broadinstitute.dropseqrna.annotation.GTFParser.<init>(GTFParser.java:66) at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:113) at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:111) at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:78) at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74) at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67) at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51) at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61) at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42) ERROR: non-zero exit status 1 executing /z/yechoi/Desktop/Drop-seq_tools-2.3.0//ConvertToRefFlat ANNOTATIONS_FILE=/z/yechoi/bams/new_bam/hg19_old.gtf SEQUENCE_DICTIONARY=/z/yechoi/bams/new_bam/hg19_old.dict OUTPUT=/z/yechoi/bams/new_bam/hg19_old.refFlat

Here's the 10 lines of the original GTF file that I am using:
#!genome-build GRCh38.p13 #!genome-version GRCh38 #!genome-date 2013-12 #!genome-build-accession NCBI:GCA_000001405.28 #!genebuild-last-updated 2020-03 1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1"; 1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1"; 1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1"; 1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

There are a few problems:

  1. You're using hg19 fasta and hg38 annotations (GSM1629193_hg19_ERCC.fasta, Homo_sapiens.GRCh38.101.gtf)

  2. The error looks like the file may have been altered and is a single line. It should look like this:

Homo_sapiens.GRCh38.101.gtf
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.28
#!genebuild-last-updated 2020-03
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";

I downloaded a copy of Homo_sapiens.GRCh38.101.gtf, and it converts perfectly fine with ConvertToRefFlat.

[Tue Sep 29 14:55:28 EDT 2020] ConvertToRefFlat ANNOTATIONS_FILE=/downloads/Homo_sapiens.GRCh38.101.gtf SEQUENCE_DICTIONARY=/downloads/Frazer/GRCh38_maskedAlt.dict OUTPUT=/downloads/test.txt VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Sep 29 14:55:28 EDT 2020] Executing as nemesh@wmcaa-b26 on Mac OS X 10.15.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_171-b11; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: null
INFO 2020-09-29 14:55:41 GTFParser Seen many non-increasing record positions. Printing Read-names as well.
INFO 2020-09-29 14:55:43 GTFParser read 100,000 GTF records. Elapsed time: 00:00:14s. Time for last 100,000: 11s. Last read position: 1:59,056,629
....
INFO 2020-09-29 14:56:16 GTFParser read 2,900,000 GTF records. Elapsed time: 00:00:47s. Time for last 100,000: 1s. Last read position: 22:41,902,970
[Tue Sep 29 14:56:16 EDT 2020] org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat done. Elapsed time: 0.81 minutes.
Runtime.totalMemory()=640155648

Sorry, I was running a hg19 command previously and forgot to change out the fasta file.
Here's the actual code that I am running:
create_Drop-seq_reference_metadata.sh -n hg38 -r /z/yechoi/references/hg38/Homo_sapiens.GRCh38.cdna.all.fa -s hg38 -g /z/yechoi/references/hg38/Homo_sapiens.GRCh38.101.gtf -d /z/yechoi/Desktop/Drop-seq_tools-2.3.0/ -o /z/yechoi/bams/new_bam -t $TMPDIR -a $STAR/STAR -b $BGZIP/bgzip -i $BGZIP/samtools

As for the GTF file, I think the "code" option of the Comment window got rid of the line skips.
Here is what I have in the GTF file:

#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.28
#!genebuild-last-updated 2020-03
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

I am getting the same error output even with the code line above. Also, can you please try inputting it into create_Drop-seq_reference_metadata.sh instead of ConvertToRefFlat? The FilterBam step outputs a GTF file, and on my setup, the GTF file is empty.

alecw commented

Hi @yechoi942 ,

We can't provide the level of support you need. If the error you're getting is still No header line found in file, I believe this is a line termination problem. The program thinks the file is one big line, and since that line starts with a comment (pound sign), it thinks the file is essentially empty. It is weird because the file being read by ConvertToRefFlat is one that was created by FilterGtf, and one would think that line terminators written by one java program would be understandable by another java program. I suspect there is something strange about your compute environment. These are programs that have been run many times by many people without issues, and I'm almost certain that if we try to reproduce on our end, we won't see the problem.

Do you a colleague who can look at your files and compute environment who can help you with this?

Regards, Alec

Thanks, Alec.
Currently, I do not know anyone who shares the exactly the same compute environment.
If it's okay, can I list a couple of questions and see if I can get answers about them?

  1. I tried running the commands individually (CreateSequenceDictionary, ConverToRefFlat, ReduceGtf, and CreateIntervalsFiles) instead of using the combined MetaData Creation Program (create_Drop-seq_reference_metadata.sh).
    In this case, the program did not throw an error out. Would this create the same set of outputs as the metadata.sh program?

  2. When I did the steps described above, I found out that the resulting refFlat file is empty though it processes without any errors. I've checked the .dict file is not empty (though I do not know whether it is in a correct format). Do you know why this might be?

Please let me know if you have any ideas about these.

Also, it would be great if you could provide a link to the hg38 fasta and gtf files that work with your system. If all fails, I will try the files that you tried on your own to conclude that it really is the compute environment issue.

Thanks to you both for putting up with this.

alecw commented

My guess is that the sequence names in your GTF do not match the sequence names in your FASTA (and therefore the sequence names in the sequence dictionary you created). A common problem is for the sequences to be named "chr1" in one file and just "1" in the other file. You should be able to look in both the sequence dictionary the the GTF and check that the sequence names look compatible. You can also pass VERBOSITY=DEBUG to ConvertToRefFlat. If you do that, it will print a message for each line that contains a sequence that doesn't match one in your sequence dictionary.

alecw commented

Hi Yehyun,

I don't know. It all depends on your reference and GTF. However, if you add VERBOSITY=DEBUG to ConvertToRefFlat, you'll find out if there are any other unrecognized sequence names.

Regards, Alec

Thanks.

I tagged "chr" to each row and found a fasta formatted to have the identical namings of gene locations which solved this issue.
Now I am seeing CD24 in my new data that's generated.

Thanks all for your help.