Interview preparation for applying for the start-up company AHEAD.
Here are a few quick ways to make use of the current repository and the related ML models.
- A tiny Swin Transformer model for image classification
In [1]: import FlowCal ...: import numpy as np ...: from ahead.util import get_fsc_ssc_chunks ...: from PIL import Image ...: from transformers import pipeline In [2]: pipe = pipeline( ...: "image-classification", ...: "phunc20/swin-tiny-patch4-window7-224-finetuned-wuhan") In [3]: file_flow_id = "flowrepo_covid_EU_013_flow_001" ...: fcs_path = next((data_dir/f'raw_fcs/{file_flow_id}').glob("*.fcs")) In [4]: chunk_generator = get_fsc_ssc_chunks( ...: fcs_path, ...: chunk_size=10_000, ...: typ="A", ...: gate_fraction=0.75, ...: ) In [5]: chunk = next(chunk_generator) ...: chunk.shape Out[5]: (10000, 2) In [6]: FlowCal.plot.density2d( ...: chunk, ...: mode="scatter", ...: savefig="chunk.png", ...: ) In [7]: image = Image.open("chunk.png") ...: pipe(image) Out[7]: [{'score': 0.999931812286377, 'label': 'sick'}, ...: {'score': 6.822467548772693e-05, 'label': 'healthy'}]
- Python version 3.8 is recommended by the
FlowCal
doc page - And the packages inside
requirements.txt
- For Python newcomers, it suffices to
$ pip install -r requirements.txt
- In particular,
FlowCal
alone requires many dependecies, so it is highly possible that any package you need is included
- For Python newcomers, it suffices to
Alternatively, instead of requirements.txt
, one could also install the current repo (so that one could use
some of the repo's code)
- Either by
$ pip install git+https://github.com/phunc20/ahead.git
- Or download the code and install from local
$ pip install -e <downloaded_repos_local_path> $ # e.g. say you've downloaded (and unzipped) the repo at $ # ~/downloads/ahead $ # then it suffices to $ # pip install -e ~/downloads/ahead
- (# events) in each sample differs quite drastically
- It could go as small as
680
events - Could also go as big as several hundreds of thousand events, e.g.
360_000
- It could go as small as
- Seemingly imbalanced binary classes (Sick/Healthy)
First come a few observations:
-
For each FCS file, We are given 35 channels, among which we are allowed to only use 31, i.e. those whose
"use"
column is marked1
inEU_marker_channel_mapping.xlsx
- In this particular challenge, the marker information seems useless because including it won't bring any benefit.
- As someone with almost no domain knowledge, I honestly do not know which of
the 31 channels to discard/keep. I came up with 3 ways to proceed:
- Use all 31 channels
- Use 31 channels with dimensionality reduction
- Suggested by
FlowCal
's tutorial and by the problem description, maybe we could use FSC-SSC (gated) density plot to do some kind of image classification. In particular,- I shall not attempt to do calibration to MEF because we are not provided with calibration bead data
- I do not see fluorescence channel either, so I will only focus on 2D FSC-SSC density plots
- If we choose to do this as an image classification task, then one immediate benefit is that even patients whose sample are of less number of events could be inferred using the same model (although the correctness of the inference is not guaranteed)
- As an alternative to the previous, i.e the 3rd way, aside from
- FSC-A vs SSC-A
- FSC-W vs SSC-W
- FSC-H vs SSC-H
we could have tried all sorts of 2D density plots invovling all the
combinations of the
31
channels, i.e. roughly 31 choose 2 combinations
-
As mentioned in the above subsection, the number of events varies quite a lot. Consequently, I decide to group
- 500
- 1,000
- 5,000
- 10,000
events together to make up an instance.
By doing so,- Not only do we increase the number of samples/instances
- But it also enables us to do prediction when future data come in small number of events
This should be similar to sliding windows. To make things simple, I choose not to have the windows overlap one another.
Rmk.
- At the end, I do not find enough time to do all the listed experiments
Possessing cell samples from 40 patients, let's isolate 4 patients
(2 sick, 2 healthy) to form our validation set. Besides,
let's choose those whose (# events)
are intermediate.
- 2 sick:
flowrepo_covid_EU_034_flow_001
andflowrepo_covid_EU_048_flow_001
with98_608
and123_154
events, resp. - 2 healthy:
flowrepo_covid_EU_013_flow_001
andflowrepo_covid_EU_004_flow_001
with170_075
and183_001
events, resp.
Most of my developped code are in
docs/notebooks/
- and
ahead/
Because I developped the code on a very old laptop (Thinkpad X61s with Intel(R) Core(TM)2 Duo CPU, to be more precise), most of the code I wrote I ran on Colab. Sorry for not devoting more time on converting them into Python script.
But from another point of view, since most of the code carries experiment smells, i.e. nothing really of production-ready, so I think it isn't so bad that they were in Jupyter-notebook form.
By reading the question description
and by reading FlowCal
's tutorials
side-by-side, it's hard to not have
suspected that the gating techniques
introduced in
https://flowcal.readthedocs.io/en/latest/python_tutorial/gate.html
may help with the bonus question.
In particular, the
FlowCal.gate.density2d
helps reduce manual works to only one
input arg: gate_fraction
.
I think it could be considered automatic.