czbiohub-sf/tabula-muris

Where can I find the raw unprocessed reads?

FalkoHof opened this issue · 12 comments

Hey,
is it possible to download the Tabula Muris data as unaligned, non-adapter-trimmed reads from AWS, or is this only possible from SRA?

I checked the AWS bucket, but both folders, '10x_bam_files' and 'facs_bam_files', seem to contain only BAM files with the *.Aligned.out.sorted.bam suffix. In that case I would assume that even if unaligned reads are kept, the reads have all been preprocessed/trimmed?

Thanks!
Falko

@jamestwebber might be able to help

The data on AWS is not adapter trimmed. The insert size for these reads is pretty large, so there was no need to do so.

Thanks for the response! I also assume discordant reads were kept, so I could get all original reads from the bam file?

I believe so, but I haven't verified this by doing the round trip back to a FASTQ file. If you give it a try, we could compare read counts with the original FASTQ files.
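One way to do the counting half of that round trip is sketched below. It is a minimal sketch, not the method anyone in this thread used: it assumes the reads have already been written out as plain four-line FASTQ records (for example by samtools fastq), and the function name `count_unique_reads` and the example records are hypothetical.

```python
import io


def count_unique_reads(fastq_handle):
    """Count distinct read names in a FASTQ stream.

    Assumes strict four-line records. Anything after the first
    whitespace in the header and a trailing /1 or /2 mate suffix are
    dropped, so both mates of a pair are counted once.
    """
    names = set()
    for i, line in enumerate(fastq_handle):
        if i % 4 != 0:  # only every fourth line is a header line
            continue
        name = line.strip()[1:].split()[0]  # drop leading '@' and comments
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]
        names.add(name)
    return len(names)


# Hypothetical toy input: one complete pair plus one lone mate.
example = io.StringIO(
    "@read1/1\nACGT\n+\nIIII\n"
    "@read1/2\nTTGA\n+\nIIII\n"
    "@read2/1\nGGCC\n+\nIIII\n"
)
print(count_unique_reads(example))
```

Running the same function on the FASTQ from SRA and on the FASTQ recovered from the BAM would give two numbers that can be compared directly.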

We discussed with the AWS public data folks and decided to upload only BAM files because we thought that would cover everyone's needs. But if they are missing reads (maybe chimeric reads or something) that might need to be changed.

Thanks for the swift reply! I will download a few and report back in about a week at the latest.
Best,
Falko

We checked three files:
SRR6571079 / A1-MAA000400-3_8_M-1-1
SRR6571474 / A10-MAA000586-3_8_M-1-1
SRR6571475 / A11-MAA000586-3_8_M-1-1

See below. The FASTQ column contains the number of reads reported as "spots read" by fasterq-dump [SRR]. The BAM column shows the number of unique read pairs from the AWS BAM files, as reported by samtools fastq [BAM].

Total read pairs:
SRR         FASTQ    BAM
SRR6571079  3288555  3048963
SRR6571474  1284467  1175468
SRR6571475  2070727  1903004

So it seems that the bam files do not contain all reads?
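To put the gap in perspective, the short sketch below computes the share of read pairs missing from each BAM file, using only the counts reported above. The helper name `missing_fraction` is made up for illustration. The shortfall works out to roughly 7 to 8.5 percent per file.

```python
# (FASTQ read pairs, BAM read pairs) as reported in this thread.
counts = {
    "SRR6571079": (3288555, 3048963),
    "SRR6571474": (1284467, 1175468),
    "SRR6571475": (2070727, 1903004),
}


def missing_fraction(fastq_pairs, bam_pairs):
    """Fraction of the original read pairs that is absent from the BAM."""
    return (fastq_pairs - bam_pairs) / fastq_pairs


for run, (fastq_pairs, bam_pairs) in counts.items():
    missing = fastq_pairs - bam_pairs
    share = missing_fraction(fastq_pairs, bam_pairs)
    print(f"{run}: {missing} pairs missing ({share:.1%})")
```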

Hm, you must be right. The BAM files on AWS are the output from STAR, and my best guess is that it split out reads that it identified as chimeras or splice junctions. That's unfortunate. It would have been nice to have the raw files in that resource, rather than only in SRA.

Do you have any plans to upload the raw files to AWS? Having FASTQ or unaligned BAM files as well would be awesome!

We ended up uploading the BAM files after discussion with the AWS public data team, but we didn't realize we'd be missing out on a small number of reads that are potentially interesting. Given that AWS is hosting the data for us, I don't think there's a plan to add the FASTQ files as well, but they should be available from SRA if you want them.

I'm using the 10x "bamtofastq" tool to attempt recovery of the original FASTQ reads from the AWS BAM files. The tool runs, but the barcodes are found in the R1 "reads" files, and these only contain 26 nt total for all reads.

This is my first time playing with this type of data, so I may be out of bounds here, but I'm guessing these BAMs aren't useful for recovering the raw reads with 10x's tool?

For the 10x data, I believe the bamtofastq tool should produce a 26bp R1 file containing the barcode and UMI, an 8bp file of indexes, and an R2 file which contains the actual mRNA read. Is that what you're getting?
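As an illustration of why a 26 nt R1 is expected, here is a minimal sketch that splits one R1 sequence into its cell barcode and UMI. The 16 + 10 split is an assumption based on 10x Chromium v2 chemistry and is not stated in this thread; the function name `split_r1` and the example sequence are hypothetical.

```python
BARCODE_LEN = 16  # assumption: 10x v2 chemistry cell barcode length
UMI_LEN = 10      # assumption: 10x v2 chemistry UMI length


def split_r1(seq):
    """Split a 26 nt R1 sequence into (cell barcode, UMI)."""
    expected = BARCODE_LEN + UMI_LEN
    if len(seq) != expected:
        raise ValueError(f"expected {expected} nt, got {len(seq)}")
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:]


# Hypothetical R1 sequence, made up for this example.
barcode, umi = split_r1("AAACCTGAGAAACCATTTGCAGGTCA")
print(barcode, umi)
```

The mRNA sequence itself is not in R1 at all, which is why the R2 file is the one to look at for the actual reads.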

That tool isn't going to work on the SmartSeq2 BAM files (I'm not sure what it will produce).

You're right! I stand corrected. I just found those R2 files and came back here to delete my comment, but you beat me to it :p
Thank you!