EnnyvanBeest/UnitMatch

Removing empty clusters is not doing what it is supposed to

dervinism opened this issue · 27 comments

Hi,

This line in the ExtractKilosortData messes up the data:

[clusinfo, sp, emptyclus] = RemovingEmptyClusters(clusinfo, sp);

There are no empty clusters in the TSV file. All sorts of other non-empty clusters are being removed instead.

Hi,
could you please be a bit more specific on how you use it?
For example, which Kilosort version do you use? Are you doing manual curation? I haven't ran into this problem myself before so it's a bit hard to debug without additional information.

I use Kilosort3 with manual curation in Phy2. I commented the line and replaced it with emptyclus = [];. So the cluster info appears to be loaded correctly with tdfread but executing the remaining of the ExtractKilosortData function ends up with what appears to be original KS clusters prior to spikesorting. I've also noticed that some of your modified spikes functions are not working the same way as the original ones.

Below are a few lines which may be linked to this issue:

AllUniqueTemplates = cat(1, AllUniqueTemplates, unique(sp.spikeTemplates));

and later

clusinfo.cluster_id = AllUniqueTemplates;

So it appears that after commenting out the line [clusinfo, sp, emptyclus] = RemovingEmptyClusters(clusinfo, sp);, the sp variable is not being updated causing other issues downstream.

I don't know why ExtractKilosortData.m is giving you problems. We typically don't do manual curation so perhaps this is where the problem lies? If you prefer to stick to your own clusinfo/sp extraction, you can use UnitMatch without using our preprocessing pipeline as explained in DEMO_UNITMATCH.m .

Hope that helps.

Ok, I commented out clusinfo.cluster_id = AllUniqueTemplates; line and the info now appears to be correct.

I noticed that there are multiple channel order/ coordinate parameters in the script and so I am not sure which of them are used to determine channel coordinates. There are UMparam.AllChannelPos, clusinfo.ch, and clusinfo.depth. I am guessing that the first parameter can be used to reorder channels as it is set by the user. Is this correct?

clusinfo.ch is the recording channel (index)
clusinfo.depth is depth on probe
UMparam.AllChannelPos (needed for UnitMatch) is the position in x,y,(z), per recording channel. E.g. for Neuropixels 384 recording channels each has their own x and y coordinates (and z with other probe configurations than Neuropixels).

There are a lot of data clean-up/ checks in the script (e.g. automatic curation / checking channel configuration / etc.) that may not be ideal for you depending on your recording device or software used (I have not tested with Phy2/Kilosort 3 yet). So please refer to the DEMO_UNITMATCH.m to see what the input is for the UnitMatch pipeline and feel free to prepare your data not using extractkilosortdata.m.
The reason all these checks are done in the extractkilosortdata.m script is because we did find empty clusters in the data (they were left empty after merging by kilosort), and we also found the channel positions were not always read out properly, so here it checks the channelpositions from the raw (spikeGLX) data - besides doing automated curation (optional) with Bombcell. If you are certain that it works for you commenting out some lines, of course that's an option, but I personally cannot guarantee that it works properly in that case. Please carefully inspect your data before feeding it into the UnitMatch pipeline.

Thanks, I understand the meaning of these parameters. My question was rather whether any of them (e.g., UMparam.AllChannelPos) are used to reorder channels for UnitMatch (in the case that channels in the binary file are not ordered in line with the physical coordinates).

I have another question. I see that UnitMatch merges units within the same recording channel when assigning a unique ID. For manually curated units this feature can be undesirable. Is there a way to disable within-session unit mergers?

UMparam.AllChannelPos has the coordinates per channel, ordered by how SpikeMap is ordered (each neuron's waveform data should have the same amount of channels as AllChannelPos). There's no reordering.

UnitMatch main output is the matching probability. Indeed, we're implementing different 'UniqueID algorithms' (which are under construction, please keep an eye on the changes in the main branch). It is our view that if a within-session probability is very high there is no reason to not merge it if you trust the output. It is difficult to disable within-session unit merges because this will likely cause a situation where unit A and B on day 1 with a high matching probability (=merge) are both to be merged with unit C on day 2.

However, unit assignment is an add-on to the pipeline. The user is free to use the match probability table in a pair-wise (recording) fashion and decide for themselves how/when to assign matches - dependent on their scientific question. In fact, we encourage this, as one of the strong points of this code is that the output is matching probabilities rather than a binary choice.

I find this confusing. UMparam.AllChannelPos seems to duplicate clusinfo.depth. Are clusinfo.ch and clusinfo.depth used in any way at all by UnitMatch?

not everyone uses Kilosort. So UMparam.AllChannelPos can either be generated using the provided ExtractKilosortData.m if you do use Kilosort. Or it can just be provided as an input.

Re. what is used please read the information on the DEMO scripts:

% UMparam.RawDataPaths = {'\\path\to\firstrecording','\\path\to\secondrecording','\\path\to\nthrecording'};  % This is a cell array with info on where to find the decompressed recording (.cbin files) --> Necessary when you want UnitMatch to do waveform extraction
% UMparam.AllDecompPaths = {'\\path\to\firstrecording','\\path\to\secondrecording','\\path\to\nthrecording'};  % This is a cell array with info on where to find the decompressed recording (.bin files) --> Necessary when you want UnitMatch to do waveform extraction
% UMparam.AllChannelPos = {[RecordingSites_Recording1],[RecordingSites_Recording2]}; % These are coordinates of every recording channel on the probe (e.g. nRecordingChannels x 2)
% clusinfo = struct; % Note, this can be kilosort input, 
% - clusinfo (this is a struct that contains per unit the following information):
% * cluster_id (e.g. kilosort output clus_id)
% * Good_ID: ones for units that should be included in the analysis
% * RecSesID: Recording Session ID
% * Probe: Which probe (if just 1, ones of numel cluster_id)
% * Depth: depth on probe (optional)
% * Shank: Which shank (optional)
% * Coordinates: Typically 3D Allen Common Coordinate framework coordinates per unit (optional)

I was wondering if there is a way to track units across recording session that do not have the same (consistent) channel ordering? Or some way of specifying how channels can be descrambled correctly. I have data that certain sessions were spike sorted with parts of Neuropixels recording banks swapped around. Interestingly, UnitMatch still tracks units on the same channels across these session despite the fact that these units are clearly not the same units and only have superficial similarities. I find that a lot of, if not most of, matched units are like that. Have you tried running UnitMatch on a similar data? Have you noticed anything like that?

Yes there is, you can set
PipelineParams.separateIMRO = 0; % Run for every IMRO separately (for memory reasons or when having multiple probes this might be a good idea)

Part of why unitmatch works well is location / spatial trajectory based. Therefore, if the coordinates linked to a unit are wrong, it might match units wrong. It's therefore essential that UMparam.AllChannelPos is correct per session so that the correct recording channel is matched to the correct position.
Nevertheless, we have quantifield unitmatch performance on acute datasets (see supplementary figures in the bioRxiv manuscript) and did not find a lot of matches despite the same channel positions.

Thanks, this is what I was looking for. Now I can match units at the correct positions.

Actually, it does not working correctly yet. Many units are still being matched in locations where they shouldn't be matched. It seems like some hybrid regime (a bug somewhere?).

I tried cleaning output files and folders and rerunning the script but seem to run into the following error:

Extracting raw waveforms. Progress:   0%
Error using memmapfile/set.format
The Format field specified is invalid.

Error in memmapfile/set.Format (line 205)
        obj.format = v;

Error in memmapfile (line 586)
                obj.(hCapitalize(fieldname)) = varargin{i+1};

Error in ExtractAndSaveAverageWaveforms (line 79)
                    ap_data = memmapfile(AllDecompPaths{GoodRecSesID(uid)}, 'Format', {'int16', [nChannels, n_samples], 'data'});

Error in UnitMatch (line 104)
Path4UnitNPY = ExtractAndSaveAverageWaveforms(clusinfo,param);

Error in DEMO_UNITMATCH_ms_preliminary (line 134)
[UniqueIDConversion, MatchTable, WaveformInfo, UMparam] = UnitMatch(clusinfo, UMparam);

So this issue is fixed by replacing the line in UnitMatch.m

param.nChannels = length(Allchannelpos{1})+1; %First assume there's a sync channel as well.

by

param.nChannels = length(Allchannelpos{1})+param.nSyncChans; %First assume there's a sync channel as well.

However, I ran the script with UMparam.separateIMRO = 0; and UMparam.separateIMRO = 1;, and in both case the results are the same. The coordinates that I supply in UMparam.AllChannelPos are ignored. Is there a way to make the algorithm pay attetntion to the coordinates in UMparam.AllChannelPos?

Could you please show me here what UMparam.AllChannelPos looks like for you? And you don't run the extractkilosortdata.m anymore, right? Because this would load in the channel information from the IMRO information of spikeglx, overwriting your manual UMparam.AllChannelPos.
It would also help to see the 'ProjectedLocation' figure. This should match the UMParam.AllChannelPos you give as an input.

The param.nSync problem should be fixed with the pull from just now. I assume you had this error on the main branch, not on the workinprogress branch?

re. PipelineParams.separateIMRO = 0, you're right. This is only a useful parameter if you use the full example pipeline (main_example.m), which I think you don't. If it's set to 1, it basically just finds which probes (serial numbers) and channelmaps were the same and only analyzes those together that had the same channelmap+probe. Otherwise it analyzes all data from a given mouse simultaneously.

For the unitmatch core function we expect the user to input the kilosort directories and the channelmaps related to these kilosort directories.

But if I don't run extractkilosortdata, then clusinfo is not generated and I get the error:

Error using getClusinfo
Cannot access 'clusinfo' because 'Z:\SUN-IN-Petersen-lab\EphysData\MartynasDervinis\ms-preliminary\03_data\PP01\PP01_2020-06-25_15-54-22\PreparedData.mat' does not exist.

Error in DEMO_UNITMATCH_ms_preliminary (line 125)
clusinfo = getClusinfo(UMparam.KSDir); % prepare clusinfo struct

Is there another way to generate clusinfo variable?

I am afraid your usecase is unclear to me still. Why don't you want to use the IMRO table as stored in SpikeGLX meta information? (which is what is extracted as well with extractkilosortdata.m). I don't think it's wise to use any other IMRO table/ channel map than the one you actually used to record. Otherwise unitmatch is unlikely to function properly.

If I am missing something and you really need to have another channelmap than the one stored, you can alwasy first run extractkilsosortdata.m, and then overwrite UMparam.AllChannelPos again with the one you manually added - again, I'm unsure what the rationale would be for this.

But the data was not collected using SpikeGLX but OpenEphys.

I tried overwriting AllChannelPos field but that did not work either. Presumably because clusinfo generation is affected by this field. My guess is that AllChannelPos has to be overwritten at a correct stage before clusinfo is generated. Is that true?

I see. Are you sure you don't have a sync channel then? Because if you assume you don't while you do, you're ending up with reading in completely wrong waveforms.
I've received some data from someone before and this pipeline does work on their data (openEphys + Kilosort).
Please check:
https://github.com/EnnyvanBeest/UnitMatch/blob/main/MATLAB/DEMO_UNITMATCH_OpenEphys.m

There is no sync channel. The extracted waveforms are fine, I inspected them.

The example you sent me is extracting kilosort data and generating clusinfo file automatically. It does not apply to my case. But anyway, the issue is not with running the script. This has been tested and works ok. The issue is specifically with having certain recording sessions that have some channels swapped around.

It finally seems to work with channel reordering. I updated the fork. There are also a few functions that could be useful (https://github.com/petersen-lab/petersen-lab-matlab/blob/main/spikes/detectMatchingUnits.m and https://github.com/petersen-lab/petersen-lab-matlab/blob/main/spikes/reorderUnitMatchOutput.m).

I think this is very specific to your data acquisition. Indeed the channel order of UMparam.AllChannelPos should be in the order of how it's stored (and it needs to be a separate input per KSdir), so that it's matching the order of how the waveforms are then extracted from the raw data.

Either way, glad it works for you now as well.