The kana file format contains an embedded HDF5 file that captures the analysis state of the kana application.
This embedded state file stores the parameters and results for each step in a simple single-cell RNA-seq analysis.
By storing this state, we can easily reload existing analyses into kana without recomputation.
It is also straightforward to extract results from this state in other data analysis frameworks (e.g., R/Bioconductor).
The kanaval repository contains a specification of the expected structure and content of the state file.
We use a specification-as-code approach that enforces the specification with a validator library, implemented in header-only C++ for portability to any system that supports a foreign function interface.
It is thus possible to create kana files from other languages, validate them, and upload them to kana.
The first 8 bytes define an unsigned 64-bit integer in little-endian, specifying the format type. This is used to denote whether the input data files are embedded (0) or linked (1); the former is used for export to a standalone file while the latter is used to save the state to the browser's cache.
The next 8 bytes define another unsigned 64-bit integer describing the format version.
We use semantic versioning where each component of the version (major, minor, patch) is described by 3 digits, i.e., XXXYYYZZZ.
The next 8 bytes define another unsigned 64-bit integer specifying the size of the HDF5 file containing the analysis state.
Let's call this value state_nbytes.
The next state_nbytes bytes contain an HDF5 state file.
Each analysis step is represented by an HDF5 group that contains the parameters and results.
See the next section for details on the expected groups.
The remaining bytes contain the embedded input files when dealing with an embedded format type.
Each file can be excised by reading the offsets and sizes in the inputs group in the state file.
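The layout above can be sketched as a small header parser. This is a minimal illustration and not part of kanaval: the names KanaHeader, SemVer, parse_header, split_version and inputs_start are hypothetical, and splitting XXXYYYZZZ into major, minor and patch fields of 3 decimal digits each is our reading of the version description.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container for the three 8-byte header fields.
struct KanaHeader {
    std::uint64_t format_type;   // 0 = embedded inputs, 1 = linked inputs
    std::uint64_t version;       // packed as XXXYYYZZZ
    std::uint64_t state_nbytes;  // size of the HDF5 state file in bytes
};

// Assumed decoding of XXXYYYZZZ into its three components.
struct SemVer {
    std::uint64_t major_part;
    std::uint64_t minor_part;
    std::uint64_t patch_part;
};

constexpr std::size_t kHeaderSize = 24;  // three unsigned 64-bit integers

// Decode an unsigned 64-bit little-endian integer byte by byte,
// so the result does not depend on the host's endianness.
inline std::uint64_t read_u64_le(const unsigned char* p) {
    std::uint64_t value = 0;
    for (int i = 7; i >= 0; --i) {
        value = (value << 8) | p[i];
    }
    return value;
}

// Parse the header from the raw bytes of a kana file.
inline KanaHeader parse_header(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < kHeaderSize) {
        throw std::runtime_error("file is too short to contain a header");
    }
    KanaHeader h;
    h.format_type = read_u64_le(bytes.data());
    h.version = read_u64_le(bytes.data() + 8);
    h.state_nbytes = read_u64_le(bytes.data() + 16);
    if (h.format_type > 1) {
        throw std::runtime_error("format type should be 0 (embedded) or 1 (linked)");
    }
    if (h.state_nbytes > bytes.size() - kHeaderSize) {
        throw std::runtime_error("state file extends past the end of the file");
    }
    return h;
}

// Split a packed version into components, 3 decimal digits each (assumption).
inline SemVer split_version(std::uint64_t version) {
    return SemVer{version / 1000000, (version / 1000) % 1000, version % 1000};
}

// Byte offset at which the embedded input files would begin;
// the HDF5 state file occupies [kHeaderSize, inputs_start).
inline std::uint64_t inputs_start(const KanaHeader& h) {
    return kHeaderSize + h.state_nbytes;
}
```

Under this reading, a version field of 2001000 would decode to 2.1.0. The state file bytes could then be written to disk and opened with an HDF5 library to look up the per-file offsets and sizes.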
Inside the HDF5 state file, each analysis step is represented by an HDF5 group.
Version 3.0:
- Inputs
- RNA quality control
- ADT quality control
- CRISPR quality control
- Cell filtering
- RNA normalization
- ADT normalization
- CRISPR normalization
- Feature selection
- RNA PCA
- ADT PCA
- CRISPR PCA
- Combine embeddings
- Batch correction
- Neighbor index
- k-means clustering
- SNN graph clustering
- Choose clustering
- Marker detection
- Custom selections
- Cell labelling
- t-SNE
- UMAP
File metadata is stored in its own group.
Version 2.1:
- Inputs
- RNA quality control
- ADT quality control
- Cell filtering
- RNA normalization
- ADT normalization
- Feature selection
- RNA PCA
- ADT PCA
- Combine embeddings
- Batch correction
- Neighbor index
- k-means clustering
- SNN graph clustering
- Choose clustering
- Marker detection
- Custom selections
- Cell labelling
- t-SNE
- UMAP
Version 1.2:
- Inputs
- Quality control
- Normalization
- Feature selection
- PCA
- k-means clustering
- SNN graph clustering
- Choose clustering
- Marker detection
- Custom selections
- Cell labelling
- t-SNE
- UMAP
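To see what changed between versions, the step lists above can be compared mechanically. The following is only a sketch: the strings are the human-readable step labels from the lists, not the HDF5 group names (which are defined by the specification itself), and steps_v2_1, steps_v3_0 and added_steps are hypothetical names.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Step labels for version 2.1, copied from the list above.
inline std::set<std::string> steps_v2_1() {
    return {
        "Inputs", "RNA quality control", "ADT quality control",
        "Cell filtering", "RNA normalization", "ADT normalization",
        "Feature selection", "RNA PCA", "ADT PCA", "Combine embeddings",
        "Batch correction", "Neighbor index", "k-means clustering",
        "SNN graph clustering", "Choose clustering", "Marker detection",
        "Custom selections", "Cell labelling", "t-SNE", "UMAP"
    };
}

// Step labels for version 3.0, copied from the list above.
inline std::set<std::string> steps_v3_0() {
    return {
        "Inputs", "RNA quality control", "ADT quality control",
        "CRISPR quality control", "Cell filtering", "RNA normalization",
        "ADT normalization", "CRISPR normalization", "Feature selection",
        "RNA PCA", "ADT PCA", "CRISPR PCA", "Combine embeddings",
        "Batch correction", "Neighbor index", "k-means clustering",
        "SNN graph clustering", "Choose clustering", "Marker detection",
        "Custom selections", "Cell labelling", "t-SNE", "UMAP"
    };
}

// Steps present in 'newer' but absent from 'older'.
// std::set iterates in sorted order, as std::set_difference requires.
inline std::vector<std::string> added_steps(const std::set<std::string>& older,
                                            const std::set<std::string>& newer) {
    std::vector<std::string> out;
    std::set_difference(newer.begin(), newer.end(),
                        older.begin(), older.end(),
                        std::back_inserter(out));
    return out;
}
```

Calling added_steps(steps_v2_1(), steps_v3_0()) yields the three CRISPR steps, while swapping the arguments yields an empty list, since every step listed for version 2.1 is also listed for version 3.0.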
Calling the validate() function checks the state file and throws a reasonably informative error if there are any problems.
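#include "H5Cpp.h"
#include "kanaval/validate.hpp"
// path, embedded and version are supplied by the caller: path locates the
// HDF5 state file, while embedded and version presumably mirror the format
// type and format version fields of the kana file.
H5::H5File handle(path, H5F_ACC_RDONLY);
kanaval::validate(handle, embedded, version);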