SpatiotemporalSpiking

This repository contains the code used in Sukman and Stark, 2022, Cortical pyramidal and parvalbumin cells exhibit distinct spatiotemporal extracellular electric potentials.

Requirements

The code was tested on machines running Windows (Windows 10) and Linux (Ubuntu 18), but is expected to work on any OS.

The code was tested with Python 3.9.1 and the packages described in the requirements.txt file. Other versions of Python and/or the packages are not guaranteed to work or yield similar results.

Some statistical tests were conducted using MATLAB R2021b. In addition, calculating mutual information values (Module 4) requires installing the package developed in Timme and Lapish, 2018 (eNeuro).

Modules

The code should be executed from the src directory and the input to all code segments should be provided by changing only the constants.py and paths.py files.

The code files, all found under the src/ directory, are grouped into 5 modules:

1. Raw data parsing

The module is executed by running read_data.py.
The module is used to read the raw data, extract only the relevant spikes, and assign labels to them (see Spike sorting and ground truth labels under Materials and Methods). Utility functions for this module are under data_utils. For this module, be sure to update the following (for additional customization, see constants.py; a hedged example configuration is sketched after the list below):

  • constants.py:
    • SESSION_EMPTY_INDS - A dictionary whose keys are the names of all sessions to read and whose values are lists of shanks to skip. It was created manually based on prior knowledge, yet it can also be extracted automatically from the tagging mat file (see DATA_MAT_PATH below).
  • paths.py:
    • DATA_PATH - Path to raw data. Data should contain .spk, .res, and .clu files according to the format presented in Hazan et al., 2006 (J. Neurosci. Methods) and a .stm file with a field of pairs indicating the start and end times of light stimuli (see more information under Provided data below).
    • DATA_MAT_PATH - Path to mat file containing tagging data. The mat file should have 6 fields, filename (the name of the recording session), shankclu (pairs of recording shank and unit), region (recording region tag, 0 corresponds to nCX; >=1 corresponds to CA1), act (binary list indicating if the unit was activated by light stimuli), exc (binary list indicating if the unit was excitatory), and inh (binary list indicating if the unit was inhibitory).
    • DATA_TEMP_PATH - Path in which the module will save the outputs as numpy arrays after removing spikes that occurred during light stimuli.
    • TEMP_PATH_FULL - Path in which the module will save the outputs as numpy arrays including spikes occurring during light stimuli.
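
As a minimal sketch (the session name comes from the provided dataset, while the shank indices and directory names are placeholders), the relevant entries might look like:

```python
# constants.py -- sketch; the shank indices to skip are hypothetical
SESSION_EMPTY_INDS = {
    'es25nov11_13': [2, 4],  # read this session, skipping shanks 2 and 4
}

# paths.py -- sketch; the directory names are placeholders
DATA_PATH = '../data_files/raw_data/'        # .spk, .res, .clu, and .stm files
DATA_MAT_PATH = '../data_files/tags.mat'     # tagging mat file (fields described above)
DATA_TEMP_PATH = '../temp_data/no_light/'    # outputs excluding spikes during light stimuli
TEMP_PATH_FULL = '../temp_data/with_light/'  # outputs including spikes during light stimuli
```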

2. Feature extraction

The module is executed by running preprocessing_pipeline.py.
The module is used to extract features from the outputs of Module 1 (see Tables 1-3 for descriptions of the features, and the Chunking subsection of Materials and Methods for the chunk statistics extracted from the original features). Utility functions for this module are under data_utils and features. For this module, be sure to update the following (inputs for previous modules should not be changed; for additional customization, see constants.py; a hedged example configuration is sketched after the list below):

  • constants.py:

    • SEED - The seed used for the randomized chunk partition.
    • CHUNK_SIZES - Chunk sizes to use; note that 0 means no chunking.
    • TRANS_WF - Binary value indicating whether to perform delta-transformation on the waveforms before extracting waveform-based features. If True, only waveform-based features will be extracted.
  • paths.py:

    • SAVE_PATH - Path to save features and metadata for every unit. This output will not include the chunk statistics.
    • SAVE_PATH_RICH - Path to save features and metadata for every unit including the chunk statistics for every feature.
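
As a minimal sketch (all values are illustrative rather than the settings used in the manuscript):

```python
# constants.py -- sketch; the seed and chunk sizes are illustrative
SEED = 0                   # seed for the randomized chunk partition
CHUNK_SIZES = [0, 25, 50]  # 0 means no chunking
TRANS_WF = False           # True: delta-transform waveforms and extract only waveform-based features

# paths.py -- sketch; the directory names are placeholders
SAVE_PATH = '../outputs/cluster_data/'            # features and metadata, without chunk statistics
SAVE_PATH_RICH = '../outputs/cluster_data_rich/'  # features and metadata, with chunk statistics
```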

3. Machine learning

The module is executed by running ml_pipeline.py.
The module is used to create and evaluate cell type classification models based on the outputs of Module 2 (see the Classification subsection under Materials and Methods for methodology). Outputs of the module include the datasets and the evaluation results. Results are saved in five files: (1) a CSV file with performance and feature importance for every iteration, modality, and chunk size; (2) an NPY file with the raw predictions on the test set for all trained models; (3) an NPY file with the raw SHAP values for each sample for all trained models; (4) an NPY file with the raw predictions on the non-trained-upon-region test set, if a region-based partition was performed; and (5) an NPY file with the raw SHAP values for each sample of the non-trained-upon-region test set for all trained models, if a region-based partition was performed. Utility functions for this module are under ml. For this module, be sure to update the following (inputs for previous modules should not be changed; for additional customization, see constants.py; a hedged example configuration is sketched after the list below):

  • constants.py:

    • MODALITIES - Names of modalities coupled with feature indices.
    • CHUNKS_MAP - Mapping between modalities and the chunk sizes to test; use this to test different chunk sizes for different modalities.
    • RUN_NAME - Identifier of the run; the outputs of the module will be saved in a directory named accordingly.
    • REGION_BASED - Whether to perform a region-based partition, i.e., train the models on a single region (see TRAIN_CA1).
    • TRAIN_CA1 - Relevant only when REGION_BASED is True, in which case controls the training region.
    • SHUFFLE - Whether to shuffle the labels before training. If 0 or False, no shuffling occurs; otherwise, the shuffling is repeated SHUFFLE times.
    • GS_ZERO - Whether to perform the grid search procedure on the non-chunked data (True corresponds to no chunking, i.e., a chunk size of 0).
  • paths.py:

    • DATASET_PATH - Path in which datasets will be saved.
    • RESULTS_PATH - Path in which results files will be saved.
    • ML_INPUT_PATH - Path to the post-processed data.
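
As a minimal sketch (MODALITIES and CHUNKS_MAP are omitted since their exact structure is defined and explained in constants.py; all other values are illustrative):

```python
# constants.py -- sketch; see constants.py for the exact structure of MODALITIES and CHUNKS_MAP
RUN_NAME = 'example_run'  # outputs are saved in a directory with this name
REGION_BASED = False      # True: train the models on a single region
TRAIN_CA1 = True          # used only when REGION_BASED is True; selects the training region
SHUFFLE = 0               # 0/False: no label shuffling; N > 0: shuffle the labels N times
GS_ZERO = True            # perform the grid search on the non-chunked (chunk size 0) data

# paths.py -- sketch; the directory names are placeholders
DATASET_PATH = '../outputs/datasets/'
RESULTS_PATH = '../outputs/results/'
ML_INPUT_PATH = '../outputs/cluster_data_rich/'  # typically points to SAVE_PATH_RICH from Module 2
```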

4. Statistics

The module is executed by running the individual notebooks inside the statistics_analysis/ directory.
The module is used to perform statistical analysis, mainly based on the outputs of Module 3. Note that not all tests are performed in the notebooks. Some statistics are performed in the next module (specifically statistics related to Figure 3C, Figure 4, and Table 4). In addition, some statistical tests are performed in MATLAB (specifically Kruskal-Wallis tests, correlation significance, and mutual information calculation). Note that some results files used for the analysis were modified to include the importance values of the original features, the feature families, and the spatial event groups; the code for this manipulation is provided and explained in statistics_analysis/combine_res.py. In addition, p-values based on permutation tests may vary across executions due to the random nature of the sampling. Utility functions for this module are under statistics. For this module, be sure to update the following (inputs for previous modules should not be changed; a hedged example configuration is sketched after the list below):

  • paths.py:
    • MAIN_RES - Results of the regular execution and feature importance for the original features
    • BASE_RES - Baseline (using shuffled labels) of the regular execution and feature importance for the original features
    • FAMS_RES - Results for the spatial models of the regular execution and feature importance for the spatial feature families
    • BASE_FAMS_RES - Baseline (using shuffled labels) for the spatial models of the regular execution and feature importance for the spatial feature families
    • EVENTS_RES - Results for the spatial models of the regular execution and feature importance for the spatial event groups
    • BASE_EVENTS_RES - Baseline (using shuffled labels) for the spatial models of the regular execution and feature importance for the spatial event groups
    • REGION_CA1_RES - Results of the CA1-trained models with importance for the original features based on the CA1 test set
    • REGION_NCX_RES - Results of the NCX-trained models with importance for the original features based on the NCX test set
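
A minimal sketch, pointing the result-path constants at the provided results (the file names and extensions below are placeholders; use the actual names of the files under data_files/results/):

```python
# paths.py -- sketch; file names are placeholders for the files under data_files/results/
MAIN_RES = '../data_files/results/main_results.csv'
BASE_RES = '../data_files/results/baseline_results.csv'
FAMS_RES = '../data_files/results/spatial_families_results.csv'
BASE_FAMS_RES = '../data_files/results/baseline_spatial_families_results.csv'
EVENTS_RES = '../data_files/results/spatial_events_results.csv'
BASE_EVENTS_RES = '../data_files/results/baseline_spatial_events_results.csv'
REGION_CA1_RES = '../data_files/results/region_ca1_results.csv'
REGION_NCX_RES = '../data_files/results/region_ncx_results.csv'
```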

5. Visualization

The module is executed by running vis_figures.py.
The module is used to create the panels of the figures in the manuscript and to perform specific statistical analyses (see above). Utility functions for this module are under vis_utils. Note that some visual aspects of the outputs were added or fixed manually. For this module, be sure to update the following (inputs for previous modules should not be changed; a short sanity-check sketch follows the list below):

  • paths.py:
    • MAIN_PREDS - Raw predictions for test set samples for each iteration, modality, and chunk size of the regular execution
    • TRANS_WF_RES - Results of the execution with waveform-based features calculated on the transformed spikes (no chunk statistics).
    • CORR_MAT - Mat file containing Spearman's correlation coefficients between the features, together with the corresponding p-values.
    • MI_MAT - Mat file containing mutual information values between the features, together with the corresponding p-values.
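
Before running vis_figures.py, a quick sanity check of the statistics mat files can help; a minimal sketch, assuming the script is run from the src directory (so that paths.py is importable) and that CORR_MAT and MI_MAT already point at the provided files:

```python
from scipy.io import loadmat

import paths  # assumes execution from the src directory

# List the variables stored in the correlation and mutual-information mat files;
# the variable names inside the files are not assumed here.
for mat_path in (paths.CORR_MAT, paths.MI_MAT):
    mat = loadmat(mat_path)
    print(mat_path, [key for key in mat if not key.startswith('__')])
```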

Provided data

A dataset is provided at https://doi.org/10.5281/zenodo.7273925 for testing the modules.

  • Raw data are provided for a single shank from a single recording session under data_files/raw_data/es25nov11_13/. Files with .spk, .res, and .clu formats are constructed as described in Hazan et al., 2006. The .xml file contains metadata for the recording session, and the .stm file contains information about the light stimuli (start and end times, type of stimuli, intensity of stimuli, and shank). A short sketch for verifying the dataset layout appears after this list.
  • A mat file containing tagging information is provided in data_files/tags.mat. The file contains, for every unit from every shank (shankclu) and recording session (filename), information about the functionality (excitatory: exc; inhibitory: inh), responsiveness to light (act), and region (region; 0 corresponds to neocortex and >=1 corresponds to CA1).
  • As the raw data provided for testing the first module are a subset of the full dataset used in the manuscript, all post-processed data (after feature extraction) are also provided under data_files/postprocess_data/. The directory includes two sub-directories: cluster_data/ containing the features of all three modalities and cluster_data_wf_trans/ which contains only waveform-based features extracted from the transformed spikes. Note that cluster_data/ does not contain chunk statistics. To extract chunk statistics you may utilize preprocessing_pipeline.py.
  • Since execution times can be long (see Replicating results below), results files are provided under data_files/results/ and can be used for Modules 4 and 5. See paths.py for explanations of the different result files.
  • Statistical data, specifically mat files containing the p-values of the correlation coefficients and mutual information values are available under data_files/statistics/.
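
A minimal sketch for verifying that the provided dataset is in place (the root directory below is a placeholder for wherever the Zenodo archive was extracted):

```python
from pathlib import Path

# Placeholder root for the extracted Zenodo archive.
data_root = Path('../data_files')

# The raw-data directory should contain .spk, .res, .clu, .xml, and .stm files.
raw_dir = data_root / 'raw_data' / 'es25nov11_13'
for ext in ('.spk', '.res', '.clu', '.xml', '.stm'):
    matches = [p.name for p in raw_dir.glob(f'*{ext}*')]
    print(ext, matches if matches else 'MISSING')

# The tagging file and the post-processed, results, and statistics directories should also exist.
for rel in ('tags.mat', 'postprocess_data', 'results', 'statistics'):
    print(rel, (data_root / rel).exists())
```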

Replicating results

  1. To add chunk statistics to the data, run preprocessing_pipeline.py. Be sure to run Module 1 beforehand, or alternatively manually execute only the add_chunk_statistics function. Due to having to load and modify large files, execution may take a few minutes on a standard PC (tested on a 7th-generation Intel i7 processor with 16 GB of RAM).
  2. Run Module 3 with NUM_ITER=50 and ML_INPUT_PATH set to SAVE_PATH_RICH. This may take around a day on a standard PC; to validate the procedure, here and in the next steps, you may run it with fewer iterations. Then repeat with ML_INPUT_PATH set to SAVE_PATH and adapt MODALITIES in constants.py accordingly (see explanation within the file). Also remember to change output names to avoid overwriting previously computed performance. Finally, combine the two output files using the combine_chunks function from statistics_analysis/combine_res.py, followed by get_feature_imp, get_events_imp, and get_family_imp. A condensed sketch of the settings for steps 2-4 appears after this list.
  3. To get chance-level results, run Module 3 with NUM_ITER=1000, SHUFFLE=1, and ML_INPUT_PATH set to SAVE_PATH_RICH. It is highly recommended to use CHUNKS_MAP (see constants.py) to shorten run times. On a standard PC this process may take several weeks, mainly due to the SHAP analysis when deep trees happen to be trained; hence, it is possible to change or forgo the grid search procedure, though results may vary. The process can be accelerated using parallel processing, which allowed us to perform the analysis in less than a week. You should then run get_feature_imp from statistics_analysis/combine_res.py.
  4. Last, execute Module 3 with the REGION_BASED flag set to True to train the models on a single region. Change NUM_ITER back to 50 and repeat the process twice, with TRAIN_CA1 set to True and to False. You should then run get_feature_imp from statistics_analysis/combine_res.py.
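
A condensed sketch of the constants and paths changed across steps 2-4 (the values are taken from the steps above; everything else keeps its defaults, and ML_INPUT_PATH refers to the constants defined in paths.py):

```python
# Step 2 (main run) -- constants.py / paths.py sketch
NUM_ITER = 50
ML_INPUT_PATH = SAVE_PATH_RICH  # then repeat with ML_INPUT_PATH = SAVE_PATH and adapted MODALITIES

# Step 3 (chance level) -- constants.py / paths.py sketch
NUM_ITER = 1000
SHUFFLE = 1
ML_INPUT_PATH = SAVE_PATH_RICH  # consider setting CHUNKS_MAP to shorten run times

# Step 4 (region-based) -- constants.py sketch
NUM_ITER = 50
REGION_BASED = True
TRAIN_CA1 = True                # repeat with TRAIN_CA1 = False
```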