make *well* a mandatory column in samples CSVs and enable it to be used to drop samples
jbloom opened this issue · 8 comments
I think the samples CSVs should have a mandatory well column, as it seems like a good idea to track which well samples were run in. This could be useful for QC etc?
We could also then add a good way to drop samples in the configuration by specifying these well names.
@anloes, what do you think? If we want to do this, the main thing it would require is retrospectively assigning well labels to the samples you have already run that are specified in https://github.com/jbloomlab/flu_seqneut_DRIVE_2021-22_repeat_vax/tree/main/data/plates.
Is it easy to do that somehow?
I could then add code that also keeps track of wells, and allows specific samples to be specified for dropping in the YAML by well if they are low quality.
Yes, I should be able to add a well column to the data.
Is it just the _S1_ etc. in the FASTQ names? I wasn't sure if the well number was somehow already encoded in the FASTQ names. If so, I can extract it.
Yes, though some plates were run together in sets of 2 so possible numbers are S1-S192. I would suggest that we accept at least the possibility for 4 plates to be multiplexed on the same run. All plates in the current dataset are loaded longform, with dilutions across rows, but other plates that may be added to the project will have a rotated format and 12 no-serum controls instead of 16.
OK, so is it basically the case that per plate the wells will be S1 to S96 (corresponding to wells 1-96), or S97 to S192 (again corresponding to wells 1-96), etc.?
If so, I can parse that out easily for the past plates.
I don't think we need to use the wells automatically: I think it is still better for you to keep annotating sample, dilution, etc like you do now to support arbitrary layout.
I think it is just good to start making it standard to also track the well as an identifier in files.
Does this sound reasonable?
Yes, that is correct: S1-S96 correspond to plate 1 and S97-S192 correspond to plate 2. Yes, it seems reasonable to track well ids.
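For reference, the numbering described above could be parsed along these lines. This is only a sketch, not the pipeline's code: the function name `plate_and_well_number` and the example file name are made up, and it assumes the sample number appears as `_S<number>_` in the FASTQ name and that each plate fills exactly 96 consecutive numbers. The plate index returned is just the plate's position within the run.

```python
import re

WELLS_PER_PLATE = 96  # assumption: every plate on the run uses all 96 wells

# Assumption: the sample number appears as "_S<number>_" in the FASTQ name.
_S_NUMBER = re.compile(r"_S(\d+)_")


def plate_and_well_number(fastq_name, wells_per_plate=WELLS_PER_PLATE):
    """Return (plate index within the run, well number 1-96), both 1-based."""
    match = _S_NUMBER.search(fastq_name)
    if match is None:
        raise ValueError(f"no _S<number>_ found in {fastq_name!r}")
    s_number = int(match.group(1))
    if s_number < 1:
        raise ValueError(f"sample number must be >= 1, got {s_number}")
    plate_index = (s_number - 1) // wells_per_plate + 1
    well_number = (s_number - 1) % wells_per_plate + 1
    return plate_index, well_number


# Hypothetical file name: S97 is the first well of the second plate on the run.
print(plate_and_well_number("sample_S97_L001_R1_001.fastq.gz"))
```

Nothing here is capped at two plates, so a run with 4 multiplexed plates (S1-S384) is handled the same way. Note that `search` takes the first `_S<number>_` it finds, so a sample name that itself contains that pattern would need a stricter regex.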
Do you think it is better to track these as 1, 2, ..., 96? Or as A1, A2, ..., H12? Either is fine with me, so whichever you think is the more informative convention.
If the latter, how do I convert? In your current plates, does 2 correspond to A2 or B1?
2 corresponds to B1: 1-8 are A1-H1, and 9-16 would be A2-H2. Which is more informative depends on exactly how these are being used. I would lean towards A1, B2, etc., though, as that clarifies the plate format, whereas 1-96 could be interpreted as running in either direction.
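The conversion described here (numbers run down each column, so 1-8 are A1-H1 and 9-16 are A2-H2) could look roughly like this. It is a sketch with a made-up function name, `well_label`, and it assumes a standard 8-row by 12-column plate; plates loaded in another direction would need a different mapping.

```python
ROWS = "ABCDEFGH"  # 8 rows of a 96-well plate
N_COLUMNS = 12


def well_label(well_number):
    """Convert a 1-96 well number to a label, counting down columns first."""
    if not 1 <= well_number <= len(ROWS) * N_COLUMNS:
        raise ValueError(f"well number must be 1-96, got {well_number}")
    row = ROWS[(well_number - 1) % len(ROWS)]
    column = (well_number - 1) // len(ROWS) + 1
    return f"{row}{column}"


for number in (1, 2, 8, 9, 16, 96):
    print(number, well_label(number))
```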
@anloes, the pipeline now tracks wells as required input in the samples CSV.
You can also drop problematic wells with wells_to_drop for that plate.
I tested that this works for your data by adding
wells_to_drop:
- B3 # no counts
under the config for plate10, and it successfully drops that well, which has no data.
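To illustrate the idea (this is not the actual seqneut-pipeline implementation), dropping wells from a samples CSV might be sketched as below. The function names `read_samples` and `drop_wells`, the `sample` and `dilution` column names, and the example rows are all invented; only the mandatory `well` column and the `wells_to_drop` list come from the discussion above. Raising an error for a well that is not in the CSV is a design choice of this sketch, meant to catch typos in the config.

```python
import csv
import io


def read_samples(csv_text):
    """Read a samples CSV, requiring the mandatory 'well' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "well" not in reader.fieldnames:
        raise ValueError("samples CSV must have a 'well' column")
    return list(reader)


def drop_wells(rows, wells_to_drop):
    """Return the rows whose well is not listed in wells_to_drop."""
    to_drop = set(wells_to_drop)
    unknown = to_drop - {row["well"] for row in rows}
    if unknown:
        raise ValueError(f"wells_to_drop not in samples CSV: {sorted(unknown)}")
    return [row for row in rows if row["well"] not in to_drop]


# Invented example data; real samples CSVs have more columns.
example_csv = """well,sample,dilution
A3,serum1,20
B3,serum1,40
C3,serum1,80
"""

kept = drop_wells(read_samples(example_csv), ["B3"])
print([row["well"] for row in kept])
```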
I have not yet updated the repo for your project to use this new pipeline version for that submodule, as I am going to work on some additional improvements. If you get to that point before me, simply pull the latest version of seqneut-pipeline into your project submodule, commit that, and open a pull request to update it. Otherwise I will do that at some point in the next few days.