theislab/scTab

Slight mismatch in data cell numbers

Closed this issue · 2 comments

Hi!

According to your manuscript and your docs here, the total dataset size is 22,189,056 cells.
However, when I run the code, I get 22,190,622 cells. The same number appears in your notebook here (the output of cell 10).
As far as I know, no further filtering happens after this step. What explains the gap?

Regards, Gaetan

Hi @gdewael,

This is because in 03_write_store_merlin.ipynb I truncate the train, validation and test sets so that the size of each is a multiple of 1024 (the row group size of the parquet files). That way every row group is full, and the last parquet file of each split does not end with a partially filled row group. Some of the code in the other notebooks assumes that the row groups are always full.

You can see this in the last cell of the 03_write_store_merlin.ipynb notebook:

n_samples = X.shape[0]
n_samples = (n_samples // ROW_GROUP_SIZE) * ROW_GROUP_SIZE
X = X[:n_samples].rechunk((CHUNK_SIZE, -1))
obs_ = obs_.iloc[:n_samples].copy()
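
As a quick sanity check, the truncation rule can be tested against the two cell counts quoted in this thread. This is a minimal sketch and not code from the repository: the helper name `truncate_to_full_row_groups` is made up for illustration, and the value 2500 is a toy split size, not a real one. It only assumes the row group size of 1024 and the three splits mentioned above.

```python
ROW_GROUP_SIZE = 1024  # row group size stated in the reply above


def truncate_to_full_row_groups(n_samples: int, row_group_size: int = ROW_GROUP_SIZE) -> int:
    """Return the largest multiple of row_group_size that is <= n_samples.

    Hypothetical helper; it mirrors the integer-division step in the snippet above.
    """
    return (n_samples // row_group_size) * row_group_size


# Cell counts quoted in this thread.
n_before = 22_190_622  # count seen in the notebook output
n_after = 22_189_056   # count reported in the manuscript and docs
gap = n_before - n_after

# Each of the three splits (train, validation, test) drops at most
# ROW_GROUP_SIZE - 1 cells, so the total gap cannot exceed this bound.
max_gap = 3 * (ROW_GROUP_SIZE - 1)

print(gap)                       # 1566
print(n_after % ROW_GROUP_SIZE)  # 0: a sum of multiples of 1024 is itself a multiple
print(gap <= max_gap)            # True

# Toy example with a made-up split size.
print(truncate_to_full_row_groups(2500))  # 2048
```

The reported total is itself a multiple of 1024, and the gap of 1566 cells is below the maximum of 3069 that three truncated splits could lose, so both numbers are consistent with the explanation.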

Let me know if you have any more questions.

Best,
Felix

That clears it up. Thanks for the response!