broadinstitute/Drop-seq

Tag Bam

v-mahughes opened this issue · 3 comments

I am running your Drop-seq workflow on Linux. I have BAM files corresponding to the expected output of each step, so I can compare each of my intermediate BAM files against them. When I compare my Fastq_to_Sam output using 'cmp --silent Four_S2_unmapped.bam Four_unmapped.bam || echo "files are different"', it reports that the files differ, but when I use samtools to view the first 50 lines of each, they appear to be identical. My output is Four_S2_unmapped.bam and the expected output is Four_unmapped.bam. Is there another way to check whether they are equal, or to find out where they differ? Please let me know if you need anything else. Here are the first 10 lines of each BAM file.

Four_S2_unmapped.bam:

NB501935:1009:HLJLJBGXJ:1:11101:10000:10887 77 * 0 0 * * 0 0 TTAGTAGTAATTGGCGGCCCC AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10887 141 * 0 0 * * 0 0 GAGTGAGGGGGCCGCCAATTACTACTAAGTACTCTGCGTTGATACCACTG AAAAAEEEEAEAEAEA/EEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10940 77 * 0 0 * * 0 0 ACTTACCCCATCACTCCTGTC AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10940 141 * 0 0 * * 0 0 GAGTGATGGGGTAAGTGTACTCTGCGTTGATACCACTGCTTCCGCGGACA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEAEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:12295 77 * 0 0 * * 0 0 TCCCGTTCGTCGAGGTAGGGG AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:12295 141 * 0 0 * * 0 0 TTTCTAGATCTTGTAGGTGTGCTTAATTGTTTTTTATAAATTTTTTTTTG AAAAAAEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEAEEEEEE< RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:14274 77 * 0 0 * * 0 0 TTGCCACCAAGCAGTGGTATC AAAAA/EEEE/EEEE6EEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:14274 141 * 0 0 * * 0 0 CAAGTACTCTGCGTTGATACCACTGCTTGGTGGCAAGTACTCTGCGTTGA A//AAEEEEEE/EEEEEEAE/AEEEEEE/E//EEE//E/EEAA//EEEA/ RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:15227 77 * 0 0 * * 0 0 ATCATGTTCAGTTCACCTCAC AAAAAEEAEAEEEEEE/EE/E RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:15227 141 * 0 0 * * 0 0 GAGTGAACAGGACATGATGTACTCTGCGTTGATACCAAAAAAAAAAAAAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/E/EAEAEEAA RG:Z:A

Four_unmapped.bam:

NB501935:1009:HLJLJBGXJ:1:11101:10000:10887 77 * 0 0 * * 0 0 TTAGTAGTAATTGGCGGCCCC AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10887 141 * 0 0 * * 0 0 GAGTGAGGGGGCCGCCAATTACTACTAAGTACTCTGCGTTGATACCACTG AAAAAEEEEAEAEAEA/EEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10940 77 * 0 0 * * 0 0 ACTTACCCCATCACTCCTGTC AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:10940 141 * 0 0 * * 0 0 GAGTGATGGGGTAAGTGTACTCTGCGTTGATACCACTGCTTCCGCGGACA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEAEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:12295 77 * 0 0 * * 0 0 TCCCGTTCGTCGAGGTAGGGG AAAAAEEEEEEEEEEEEEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:12295 141 * 0 0 * * 0 0 TTTCTAGATCTTGTAGGTGTGCTTAATTGTTTTTTATAAATTTTTTTTTG AAAAAAEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEAEEEEEE< RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:14274 77 * 0 0 * * 0 0 TTGCCACCAAGCAGTGGTATC AAAAA/EEEE/EEEE6EEEEE RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:14274 141 * 0 0 * * 0 0 CAAGTACTCTGCGTTGATACCACTGCTTGGTGGCAAGTACTCTGCGTTGA A//AAEEEEEE/EEEEEEAE/AEEEEEE/E//EEE//E/EEAA//EEEA/ RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:15227 77 * 0 0 * * 0 0 ATCATGTTCAGTTCACCTCAC AAAAAEEAEAEEEEEE/EE/E RG:Z:A
NB501935:1009:HLJLJBGXJ:1:11101:10000:15227 141 * 0 0 * * 0 0 GAGTGAACAGGACATGATGTACTCTGCGTTGATACCAAAAAAAAAAAAAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/E/EAEAEEAA RG:Z:A

alecw commented

Hi @v-mahughes ,

I'm not sure how your expected outputs are created, but anyway...

First of all, look at the headers of the pair of files. You may see a difference there.
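
As a rough sketch of that header check, the Python below takes the SAM text of two files and reports the header lines that appear in only one of them. The function names and the sample header values are made up for illustration; @PG lines are skipped by default on the assumption that they record program invocations and so tend to differ between runs.

```python
def header_lines(sam_text):
    """Return the header lines (those starting with '@') from SAM-formatted text."""
    return [line for line in sam_text.splitlines() if line.startswith("@")]


def diff_headers(sam_a, sam_b, ignore_tags=("@PG",)):
    """Return (only_in_a, only_in_b), skipping header record types in ignore_tags."""
    def keep(line):
        return line.split("\t", 1)[0] not in ignore_tags

    a = [line for line in header_lines(sam_a) if keep(line)]
    b = [line for line in header_lines(sam_b) if keep(line)]
    only_a = [line for line in a if line not in b]
    only_b = [line for line in b if line not in a]
    return only_a, only_b


# Hypothetical headers: only the read group ID "A" comes from the listing above.
mine = "@HD\tVN:1.6\tSO:queryname\n@RG\tID:A\tSM:Four_S2\n@PG\tID:run1"
expected = "@HD\tVN:1.6\tSO:queryname\n@RG\tID:A\tSM:Four\n@PG\tID:run2"

only_mine, only_expected = diff_headers(mine, expected)
print("only in mine:", only_mine)
print("only in expected:", only_expected)
```

The header text itself can be produced with 'samtools view -H file.bam' and read into a string before calling diff_headers.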

Second, BAM files are bgzipped (block gzip compressed). Two BAM files can contain identical content without being byte-for-byte identical. They may have been written with different compression levels, or by different gzip implementations whose output is functionally equivalent but not identical. A byte-level tool such as cmp will report those files as different. Instead, you can convert the files (both header and records) to text, for example with samtools view -h, and compare the text.
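
To illustrate the compression point, here is a small Python sketch that compresses the same bytes at two gzip levels and then compares both the compressed bytes and the decompressed content. It uses plain gzip from the standard library rather than the block-gzip format that BAM actually uses, so treat it as an analogy; the payload is one record from the listing above repeated many times, and content_digest is a helper name invented for this example.

```python
import gzip
import hashlib

# One record from the listing above, repeated to give the compressor some work.
record = (
    "NB501935:1009:HLJLJBGXJ:1:11101:10000:10887\t77\t*\t0\t0\t*\t*\t0\t0\t"
    "TTAGTAGTAATTGGCGGCCCC\tAAAAAEEEEEEEEEEEEEEEE\tRG:Z:A\n"
)
payload = (record * 1000).encode("ascii")

# Same input, two compression levels. mtime=0 keeps the gzip timestamp fixed,
# so any byte difference comes from the compression settings alone.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)


def content_digest(gz_bytes):
    """SHA-256 of the decompressed content, independent of how it was compressed."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


print("compressed bytes identical:", fast == best)
print("decompressed content identical:", content_digest(fast) == content_digest(best))
```

For real BAM files the analogous check is to compare the text produced by 'samtools view -h' for each file rather than the raw bytes.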

Regards, Alec

v-mahughes commented

Thanks! That makes sense.

I have another question, related to the STAR alignment step:

Is genomeDir the same as the directory that contains the interval files generated by Drop-seq's CreateIntervalFiles function, or do I need to use STAR to generate it?

alecw commented

Hi @v-mahughes ,

It sounds like your first question is answered, so I'm going to close this ticket.

Yes, you need to use STAR to create the genome directory. See the STAR documentation for details.
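
For orientation only, here is a hedged Python sketch that assembles a STAR genome-generation command line. The STAR options shown (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --runThreadN) are standard STAR options, but the function name, the file paths, the thread count, and the overhang value are placeholders assumed for this example, not values taken from the Drop-seq pipeline.

```python
import shlex


def star_genome_generate_cmd(genome_dir, fasta, gtf, overhang, threads=4):
    """Build the argument list for a STAR genomeGenerate run (it is not executed here)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", str(genome_dir),      # output directory, later passed as genomeDir
        "--genomeFastaFiles", str(fasta),    # reference sequence
        "--sjdbGTFfile", str(gtf),           # gene annotation
        "--sjdbOverhang", str(overhang),     # commonly set to read length minus 1
        "--runThreadN", str(threads),
    ]


# Placeholder paths; 49 assumes 50-base reads like the second read in the listing above.
cmd = star_genome_generate_cmd("star_index", "reference.fasta", "reference.gtf", overhang=49)
print(shlex.join(cmd))
```

The printed command can be reviewed and run in a shell, or the list can be passed to subprocess.run once the paths point at real files. The directory it writes is the one to give as genomeDir in the alignment step.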

Regards, Alec