jbloomlab/seqneut-pipeline

make *well* a mandatory column in samples CSVs and enable it to be used to drop samples

jbloom opened this issue · 8 comments

jbloom commented

I think the samples CSVs should have a mandatory well column, since it seems like a good idea to track which well each sample was run in. This could be useful for QC, among other things.

We could also then add a good way to drop samples in the configuration by specifying these well names.

@anloes, what do you think? If we want to do this, the main thing it would require is retrospectively assigning well labels to the samples you have already run that are specified in https://github.com/jbloomlab/flu_seqneut_DRIVE_2021-22_repeat_vax/tree/main/data/plates.

Is it easy to do that somehow?

I could then add code that also keeps track of wells, and allows specific low-quality samples to be dropped by well in the YAML.

anloes commented

Yes, I should be able to add a well column to the data.

jbloom commented

Is it just the _S1_ etc in the FASTQ names? I wasn't sure if well number was somehow already encoded in the FASTQ names. If so, I can extract it.

anloes commented

Yes, though some plates were run together in sets of 2, so the possible numbers are S1-S192. I would suggest that we allow for the possibility of at least 4 plates being multiplexed on the same run. All plates in the current dataset are loaded longform, with dilutions across rows, but other plates that may be added to the project will have a rotated format and 12 no-serum controls instead of 16.

jbloom commented

OK, so is it basically the case that per-plate the wells will be S1 to S96 (corresponding to wells 1 - 96), or S97 to S192 (corresponding to wells 1-96), etc?

If so, I can parse that out easily for the past plates.
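A minimal sketch of that parsing, assuming Illumina-style FASTQ names that contain a `_S<number>_` token and 96 wells per plate. The function names and the example file names are hypothetical, not part of seqneut-pipeline.

```python
import re

# Matches the first `_S<number>_` token, e.g. the `_S97_` in
# "plate2-serumA_S97_R1_001.fastq.gz" (a hypothetical file name).
# If a sample name itself contains such a token, this would need tightening.
_SAMPLE_NUMBER = re.compile(r"_S(\d+)_")


def parse_sample_number(fastq_name):
    """Return the integer N from the `_SN_` token in a FASTQ file name."""
    match = _SAMPLE_NUMBER.search(fastq_name)
    if match is None:
        raise ValueError(f"no _S<number>_ token in {fastq_name!r}")
    return int(match.group(1))


def plate_and_well_number(sample_number, wells_per_plate=96):
    """Split a run-wide sample number into 1-based (plate index, well number).

    With 96 wells per plate, S1-S96 map to plate 1 and S97-S192 to plate 2;
    the same arithmetic extends to 4 or more multiplexed plates.
    """
    if sample_number < 1:
        raise ValueError(f"sample number must be >= 1, got {sample_number}")
    plate_index, offset = divmod(sample_number - 1, wells_per_plate)
    return plate_index + 1, offset + 1
```

For example, `plate_and_well_number(parse_sample_number("plate2-serumA_S97_R1_001.fastq.gz"))` gives plate 2, well number 1.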

I don't think we need to use the wells automatically: I think it is still better for you to keep annotating sample, dilution, etc like you do now to support arbitrary layout.

I think it is just good to start making it standard to also track the well as an identifier in files.

Does this sound reasonable?

anloes commented

Yes, that is correct: S1-S96 correspond to plate 1 and S97-S192 correspond to plate 2. And yes, it seems reasonable to track well IDs.

jbloom commented

Do you think it is better to track these as 1, 2, ..., 96? Or as A1, A2,..., H12? Either is fine with me, so whatever you think is more informative convention.

If the latter, how do I convert? In your current plates, does 2 correspond to A2 or B1?

anloes commented

2 corresponds to B1: 1-8 are A1-H1, and 9-16 would be A2-H2. Which convention is more informative depends on exactly how these are being used. I would lean towards A1, B2, etc. though, as that clarifies the plate format, whereas 1-96 could be interpreted as running in either direction.
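The numbering described here runs down each column first (column-major order) on a 96-well plate with rows A-H and columns 1-12. A small sketch of the conversion from a per-plate well number to a well name, where the function name is hypothetical:

```python
ROWS = "ABCDEFGH"  # 8 rows of a 96-well plate
N_COLUMNS = 12


def well_number_to_name(well_number):
    """Convert a 1-based, column-major well number (1-96) to a name like 'B1'."""
    if not 1 <= well_number <= len(ROWS) * N_COLUMNS:
        raise ValueError(f"well number must be in 1-96, got {well_number}")
    # Numbers fill a column top to bottom before moving to the next column.
    column_index, row_index = divmod(well_number - 1, len(ROWS))
    return f"{ROWS[row_index]}{column_index + 1}"
```

Under this convention 1 maps to A1, 2 to B1, 8 to H1, 9 to A2, and 96 to H12. Plates loaded in a rotated format would need a different mapping, which is one reason to store the explicit well name rather than the number.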

jbloom commented

@anloes, the pipeline now tracks wells as required input in the samples CSV.

You can also drop problematic wells with wells_to_drop for that plate.

I tested this works for your data by adding

wells_to_drop:
  - B3  # no counts 

under the config for plate10, and it successfully drops that well that has no data.
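To illustrate the idea only, here is a rough sketch of how rows of a samples CSV could be filtered by well. It is not the pipeline's actual implementation: the `drop_wells` function, the column names other than `well`, and the choice to raise an error for wells that are not present are all assumptions.

```python
import csv
import io


def drop_wells(rows, wells_to_drop):
    """Return the rows whose 'well' value is not in `wells_to_drop`.

    `rows` is a list of dicts, e.g. from csv.DictReader. Raises ValueError if
    the 'well' column is missing or a requested well is not on the plate
    (an assumed policy, chosen here to catch typos in the config).
    """
    if any("well" not in row for row in rows):
        raise ValueError("samples are missing the mandatory 'well' column")
    to_drop = set(wells_to_drop)
    missing = to_drop - {row["well"] for row in rows}
    if missing:
        raise ValueError(f"wells to drop not found in samples: {sorted(missing)}")
    return [row for row in rows if row["well"] not in to_drop]


# Hypothetical samples CSV content, for illustration only.
samples_csv = """well,serum,dilution_factor
A3,serum1,20
B3,serum1,40
C3,serum1,80
"""
rows = list(csv.DictReader(io.StringIO(samples_csv)))
kept = drop_wells(rows, ["B3"])
print([row["well"] for row in kept])
```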

I have not yet updated the repo for your project to use this new pipeline version for that submodule, as I am going to work on some additional improvements. If you get to that point before me, simply pull the latest version of seqneut-pipeline into your project submodule, commit that, and open a pull request to update it. Otherwise I will do that at some point in the next few days.