score_guides not giving expected results

Question

score_guides not giving expected results

Closed this issue 6 years ago · 8 comments

We’re having an issue with dashit that might be a bug. The files we’ve been working with are in /scratch/emilyAIHprelim. We started with sub_ALL.fasta and ran score_guides with the -s option using the file /scratch/emily/guide_libraries/nribo2_150.csv. That created two files, sub_all_dashed.fasta and sub_all_undashed.fasta. When we now run score_guides against each of those, we get the expected results:

Scoring guides against sub_ALL_dashed.fasta
149 guides in ../guide_libraries/nribo2_150.csv vs. sub_ALL_dashed.fasta hit 687633/687633 = 100.00%
Parsing FASTA took 19.752558574s
dashit@ip-172-31-19-203:/scratch/emily/AIHprelim$ score_guides ../guide_libraries/nribo2_150.csv sub_ALL_undashed.fasta 
Scoring guides against sub_ALL_undashed.fasta
149 guides in ../guide_libraries/nribo2_150.csv vs. sub_ALL_undashed.fasta hit 0/312367 = 0.00%
Parsing FASTA took 26.848835082s

However, when we then go back and look at individual guides that were in nribo2_150.csv and grep for them in these two files, we do not get the expected results.
Based on score_guides, it looks like the parsing into _dashed.fasta and _undashed.fasta worked fine, but based on the grep results it looks like it didn’t work at all.

"Site, Site index, Number of reads covered by site, cumulative number of reads covered",,,
GAGGCCTCTCCAGTCCGCCG,1,1,1
TCCGATAACGAACGAGACTC,2,1,2
TTTTTCAAAGTAAACGCTTC,3,1,3
AGGCACTCGCATTCCACGCC,4,1,4
TCCCCGATCCCCATCACGAA,5,1,5
GAATAACGCCGCCGCATCGC,6,1,6
CACAGTTATCCAAGTAGGAG,7,1,7
GCCGCCGCAGGTGCAGATCT,8,1,8
TAGAGGTGAAATTCTTGGAC,9,1,9
dashit@ip-172-31-19-203:/scratch/emily/AIHprelim$ grep GAGGCCTCTCCAGTCCGCCG sub_ALL_dashed.fasta -c
6214
dashit@ip-172-31-19-203:/scratch/emily/AIHprelim$ grep GAGGCCTCTCCAGTCCGCCG sub_ALL_undashed.fasta -c
15769

ok now I’ve also grepped for the next 9 hits and found that all of them are in _dashed.fasta but are NOT in undashed.fasta, which is what is supposed to happen. So this might be a problem of mis-treating the first line of the file but correctly treating the rest (which @alyden actually suggested earlier could be the cause!) On the other hand, this does not explain the broader result that led us to find this question, which is that when we ran optimize_guides on _undashed.fasta, we came up with tons of guides that hit very close to the guides that should have already been eliminated with the split on nribo2_150.csv.

Answer 1 · 2018-05-18T21:04:37.000Z

@emilydc What server are those files on? If I can get a copy I'll take a look today and see what's going on.

Answer 2 · 2018-05-18T22:37:55.000Z

They're on DASHit.czbiohub.org, you need the id_dashit ssh key from Emily's laptop, then you can Ssh -i id_dashit dashit@dashit.czbiohub.org

…

On Fri, May 18, 2018, 16:04 neevor ***@***.***> wrote: What server are those files on? If I can get a copy I'll take a look today and see what's going on. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJ4MELRKKHmXnhtvw_O1rMuvf9mbtpQgks5tzzdlgaJpZM4UEBgC> .

Answer 3 · 2018-05-18T22:41:02.000Z

That sounds good. I can also run it locally if the files being used are in s3. Whatever is easiest for everyone.

Answer 4 · 2018-05-23T18:37:00.000Z

This problem is caused by a malformed /scratch/emily/guide_libraries/nribo2_150.csv file.

score_guides expects two lines of meta data at the top of the guides CSV file, namely:

Total number of reads: 1000000
Site, Site index, Number of reads covered by site, cumulative number of reads covered

score_guides, expecting this format, skips the first two lines and starts reading guides on the third line of the guides CSV input.

This issue is occurring because in /scratch/emily/guide_libraries/nribo2_150.csv, the first two lines are:

"Site, Site index, Number of reads covered by site, cumulative number of reads covered",,,
GAGGCCTCTCCAGTCCGCCG,1,1,1

Since score_guides skips the first two lines, in /scratch/emily/guide_libraries/nribo2_150.csv it skips the GAGGCCTCTCCAGTCCGCCG guide, which is why your grep is seeing it in the _undashed.fasta file.

The next question is why /scratch/emily/guide_libraries/nribo2_150.csv is malformed. I re-ran optimize_guides and it correctly created a guides CSV with the expected metadata:

@emilydc , was /scratch/emily/guide_libraries/nribo2_150.csv created by hand, or by optimize_guides?

I'll add some error checking code in score_guides to guard against input in an unexpected format.

Answer 5 · 2018-05-23T19:05:56.000Z

Added error checking against malformed guides CSV files to score_guides in 4c7466a

Marking resolved

Answer 6 · 2018-05-23T19:21:17.000Z

@EmperorDali nice. I wasn't able to get an ssh key to get access to that server. Can we put something in 1password for getting access to that box?

Answer 7 · 2018-05-23T19:47:48.000Z

Awesome, thanks for figuring it out! Nribo2_150.csv was indeed made by hand, and I neglected to realize it was not in the exact right format. It might be good to add a way of inputting a guide library by hand - for example, if someone else sent me an existing library in the form of just a list of guides, I might want to see how they perform against my dataset. Emily

…

Sent from my iPhone

On May 23, 2018, at 3:21 PM, neevor ***@***.***> wrote: @EmperorDali nice. I wasn't able to get an ssh key to get access to that server. Can we put something in 1password for getting access to that box? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Answer 8 · 2018-05-23T20:32:43.000Z

I think for now, if someone sends you guides, you can paste them into a CSV file, just grab the first two header lines from a real guides CSV and make sure the guides start on line 3.

If this becomes onerous, we can automate it, but I think it makes sense to err on the side of keeping things simple until we really need a new feature added.