arbor-sim/arbor

Streamline sampling.

Opened this issue · 16 comments

  1. Make the shape of record outputs (as returned by simulation.samples()) consistent between probesets and vector (or _cell kinded) probes. See #1898 (comment)
  2. Have an option to set the probe recording policy upon declaration of the probe.
    a. The moment sim.sample() is called could be seen as implicitly setting the recording policy: why else would one set a schedule and store a handle?
  3. Have a way to label the probe, so that a user can discover the location of the recorder (instead of, or in addition to, (gid, lid)), or retrieve it by a custom label.
    a. And/or have a way to translate between cell_member, probe_id/probe_address, and handle (this could be the label from step 2, not the tag that's already there).
  4. Have a tutorial showing this off.
  1. This is Python-only, right? But I like it! Less complication to deal with.
  2. Seems orthogonal, spikes and samples have been separate so far. Better split this into a separate issue.
  3. Ditto, and the location is set by the user? Also we already label them. Or am I misunderstanding something?
  4. Not sure I understand.

Points 1 and 2 were phrased incorrectly; I think the spike detector was inserted by someone else using my hands. The recording policy of any probe should be settable upon defining the probe(set).

Points 2 and 3 are different solutions to the same problem, so I made that clearer. The problem is:

class my_rcep(arbor.recipe):
    def probes(self, gid):
        return [
            arbor.cable_probe_membrane_voltage('"midpoints"'),
            arbor.cable_probe_membrane_voltage('(root)'),
            arbor.cable_probe_membrane_voltage_cell(),
            # etc.
        ]

handle0 = sim.sample((0, ?), arbor.some_schedule(...))
handle1 = sim.sample((0, ??), arbor.some_schedule(...))
handle2 = sim.sample((0, ???), arbor.some_schedule(...))
handle3 = sim.sample((0, 9999), arbor.some_schedule(...))  # 9999 need not even be an existing index

Better is:

class my_rcep(arbor.recipe):
    def probes(self, gid):
        return [
            arbor.cable_probe_membrane_voltage("recorder_1", '"midpoint"'),
        ]

handle0 = sim.sample("recorder_1", arbor.some_schedule(...))

or

my_rcord = arbor.cable_probe_membrane_voltage('"midpoint"')

class my_rcep(arbor.recipe):
    def probes(self, gid):
        return [
            my_rcord,
        ]

sim.samples(my_rcord)

I like the idea of having vector probes be equivalent to multi-location scalar probes. No idea, though, whether this is efficiently realisable in C++.

Definitely pro labels as well, it ties in nicely with our remaining place-like methods. Should be something like [(probe, label)] though.

It occurs to me now that sim.sample() is maybe the most logical place to set a default sampling policy; perhaps calling it should even set the policy implicitly: why else create a handle and set a schedule?

What's the reason for separating probes from samplers, in the context of being able to set samplers to not record? Is there still an impact on memory/performance if all probes had a sampler attached by default? If not, we could reduce cognitive overhead by not having to label (or index) them separately.

  1. We could add one extra arg (a schedule) to probe constructors such that their sampler is defined.
  2. Reading them out after the simulation could be through a new probe method (e.g. ProbeClass.samples()). To extract data, users could just loop over sim.probes() and call the method, and filter on the metadata it returns if they need to.
  3. Switching samplers off could be done through policies or a new user callable probe method.
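The three points above could look roughly like this. A toy mock-up to make the proposal concrete; every name here (MockProbe, record, samples) is hypothetical, nothing below exists in Arbor's API today:

```python
# Toy mock-up of the proposed flow; all names are invented for illustration.
class MockProbe:
    def __init__(self, where, schedule=None):
        self.where = where          # point 1: the sampler is defined at construction
        self.schedule = schedule    # no schedule means "declared but not recording"
        self._data = []

    def record(self, t, value):
        # point 3: switching a sampler off amounts to dropping its schedule
        if self.schedule is not None:
            self._data.append((t, value))

    def samples(self):
        # point 2: read out after the run through a probe method
        return list(self._data)


on = MockProbe('"midpoint"', schedule=0.1)
off = MockProbe('"midpoint"')  # no schedule: records nothing
for t, v in [(0.0, -65.0), (0.1, -64.9)]:
    on.record(t, v)
    off.record(t, v)

print(on.samples())   # [(0.0, -65.0), (0.1, -64.9)]
print(off.samples())  # []
```

Extraction would then be a loop over sim.probes() calling samples(), filtering on metadata where needed.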

What do you think, @boeschf?

I was not part of that particular design, but it has some merits:

  • you can treat the recipe as opaque and select subsets of the probes to actually sample. That might help during model construction or when doing detailed analyses, e.g. peeking into STATE
  • we can separate the what+where from the timepoints we evaluate the values on.
  • also, different processes (=simulation objects) in the distributed case might do different things

I agree, though, that it was built with the power user in mind and solves for the 10% case.
But keep in mind that in the C++ API you never manifest these arrays as in Python: you supply a callback
that is executed on the values. The Python API uses this mechanism to build up its arrays.
How about we layer some convenience on top? Example adapted from single_cell_recipe.py:
Before

# ...
    def probes(self, gid):
        # just to have two ...
        return [
            arbor.cable_probe_membrane_voltage('"midpoint"'),
            arbor.cable_probe_membrane_voltage('"midpoint"'),
        ]
# ...

handles = [
    sim.sample((0, 0), arbor.regular_schedule(0.1)),
    sim.sample((0, 1), arbor.regular_schedule(0.1)),
]
sim.run(tfinal=30)

data, meta = sim.samples(handles[0])[0]
# ...
data, meta = sim.samples(handles[1])[0]

After

# ...
    def probes(self, gid):
        return [
            arbor.cable_probe_membrane_voltage('"midpoint"'),
            arbor.cable_probe_membrane_voltage('"midpoint"'),
        ]
# ...


handles = sim.sample_all(arbor.regular_schedule(0.1)) 
sim.run(tfinal=30)

for data, meta in sim.samples(handles):
  # ...

From an API standpoint this is only marginally different from setting a default schedule. I slightly prefer it for being consistent with the existing API and worded actively, i.e. 'do that' vs. 'here, have an object'.

One more layer of abstraction could be futures, where the handle is empty before the run and becomes the data as soon as Arbor has filled it.
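A minimal pure-Python sketch of that futures idea; SampleFuture and ToySim are invented for illustration, not Arbor API:

```python
# Toy sketch: a handle that is empty before the run and holds the data after.
class SampleFuture:
    def __init__(self):
        self._data = None

    def done(self):
        return self._data is not None

    def result(self):
        if self._data is None:
            raise RuntimeError("no samples yet: simulation has not run")
        return self._data


class ToySim:
    def __init__(self):
        self._futures = []

    def sample(self):
        fut = SampleFuture()
        self._futures.append(fut)
        return fut

    def run(self):
        # stand-in for the integration loop filling the sample buffers
        for i, fut in enumerate(self._futures):
            fut._data = [(0.0, -65.0 + i)]


sim = ToySim()
handle = sim.sample()
print(handle.done())    # False: empty before the run
sim.run()
print(handle.result())  # [(0.0, -65.0)]
```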

This helper can be used in the meantime, to reduce the problem to one index to keep track of:

def simpler_sample(self, rec, sched):
    # Attach a sampler to every probe of every cell; returns handles as [gid][probe index].
    if hasattr(rec, "probes"):
        smplrs = []
        for gid in range(rec.num_cells()):
            smplrs_on_gid = []
            for pi in range(len(rec.probes(gid))):
                smplrs_on_gid.append(self.sample((gid, pi), sched))
            smplrs.append(smplrs_on_gid)
        return smplrs

But this works, no? You hand it the recipe explicitly.

This works, yes, but it would be better if I could probe sim for the rec, which is not possible atm. But @boeschf wrote he's working on sth already, so I'll just keep this snippet around in case anyone asks. You still need to bookkeep an index, after all.

Ok, I was just a bit confused. So, yes, this is cumbersome and slow as in $\mathcal{O}(N_\mathrm{cell})$, too.

My current stance is: We can simplify the sampling interface (for now, cable cell only) by

Data

  • probe data will be either
    • double :: scalar probe
    • std::pair<const double*, const double*> :: vector probe (an iterator range)
  • these two can be unified into the second case

Meta

  • mlocation ::= (branch id, pos) for scalar probes
  • mcable_list ::= [mcable] = [(branch id, prox, dist)] for vector probes
  • cable_probe_point_info :: for point mechanism state
  • [cable_probe_point_info] :: for cell-wide point mechanism state queries,
    where
struct cable_probe_point_info {
    // Target number of point process instance on cell.
    cell_lid_type target;

    // Number of combined instances at this site.
    unsigned multiplicity;

    // Point on cell morphology where instance is placed.
    mlocation loc;
};

Unification:

  • we can merge cases 1 and 2 by replacing mlocation with mcable(id, pos, pos)
  • as well as 3 and 4 by always using a vector
  • we also can merge 1+2 and 3+4 by using this
    struct cable_probe_point_info {
      // Target number of point process instance on cell.
      cell_lid_type target;
      // Number of combined instances at this site.
      unsigned multiplicity;
    };
    
    struct cable_probe_metadata {
      // Point on cell morphology where instance is placed.
      mlocation loc;
      // If a point query, then this is set
      std::optional<cable_probe_point_info> point_info;
    };
  • this also removes a series of cases from the std::any, and maybe we can 'just' use a variant

In Python we can then return a NumPy array of the proper shape (#samples, #points) and a list of metadata [info]
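For instance (a sketch, assuming the backend hands over n_samples sample times, n_samples flat value vectors of length n_points, and one unified metadata entry per point; the dict layout of the metadata is made up for illustration):

```python
import numpy as np

# Pretend the backend delivered 3 samples of a 2-point vector probe.
times = [0.0, 0.1, 0.2]
values = [[-65.0, -64.0], [-64.9, -63.9], [-64.8, -63.8]]
meta = [{"loc": (0, 0.25), "point_info": None},
        {"loc": (0, 0.75), "point_info": None}]  # one entry per point

# one array of shape (n_samples, 1 + n_points): time column, then values,
# matching the data[:, 0] / data[:, 1:] access pattern used below
data = np.column_stack([times, np.asarray(values)])
print(data.shape)  # (3, 3)
print(len(meta))   # 2
```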

I'm not really convinced that we need to modify the C++ API. It's flexible enough to support any number of variations in the Python API, and was designed to support future probeable things that can have arbitrary return types. This, in fact, has already demonstrated its utility in the extended probe return-types that were added to the cable cell API.

It is still the case, isn't it, that we want to be able to support arbitrary novel cell types in the future? If so, let's not specialize the interface for cable cells.

Just to expand a bit, the idea with probe return values is that they constitute a contract between the cell type and the sampler: the sampler expects to get the type of data that is provided by cell groups that handle the cell type for that sort of probe address, and the sampler can confirm that it has the contract correct by casting the any_ptr in the sample record and the any_ptr in the returned probe metadata.

This allows us to support arbitrary information returned by probes on cells in the future that we haven't yet considered. It also allows us to handle specific cell types and probe addresses in the Python API in a safe manner.

We 'just' added LIF probes in #2021, which have a new, disjoint set of probe ctors and data/metadata. They are much less interesting,
which is why this discussion deals in cable cells only. I am fine with switching implementations across cell types.

My gripe is with these fellows:

void vector_sampler(arb::probe_metadata pm, std::size_t n, const arb::sample_record* samples) {
    auto* cables_ptr = any_cast<const arb::mcable_list*>(pm.meta);
    auto* point_info_ptr = any_cast<const std::vector<arb::cable_probe_point_info>*>(pm.meta);

    assert((cables_ptr != nullptr) || (point_info_ptr != nullptr));

    unsigned n_entities = cables_ptr ? cables_ptr->size() : point_info_ptr->size();

    std::cout << std::fixed << std::setprecision(4);
    for (std::size_t i = 0; i<n; ++i) {
        auto* value_range = any_cast<const arb::cable_sample_range*>(samples[i].data);
        assert(value_range);
        const auto& [lo, hi] = *value_range;
        assert(n_entities==hi-lo);

        for (unsigned j = 0; j<n_entities; ++j) {
            std::cout << samples[i].time << ", ";
            if (cables_ptr) {
                arb::mcable where = (*cables_ptr)[j];
                std::cout << where.prox_pos << ", " << where.dist_pos << ", ";
            } else {
                arb::mlocation loc = (*point_info_ptr)[j].loc;
                std::cout << loc.pos << ", ";
            }
            std::cout << lo[j] << '\n';
        }
    }
}

We know --- hence the asserts --- a few things about the data shape beforehand. Using the model I suggested above,
we could leverage that directly. Furthermore, this function is really two functions, depending on whether cables_ptr or
point_info_ptr is set, but we cannot know which without running it. Worse, we'd have to write another two functions for
the scalar case. Imagine regularly switching probe types. Thus:

  • Treating vector and scalar probes always as vector removes half the burden on the implementer
  • Making meta = (mcable, optional<point_info>) allows one sampler for all cases
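Modelled in Python (the (cable, point_info) metadata layout is the proposal above, not anything Arbor ships): with always-vector data and unified metadata, one code path serves every probe kind:

```python
# Unified scheme: data is always a vector of values (scalars arrive as
# length-1 vectors); metadata is one (cable, point_info-or-None) per entry.
def to_rows(samples, meta):
    rows = []
    for t, values in samples:
        assert len(values) == len(meta)  # shape is known up front
        for (cable, point_info), v in zip(meta, values):
            rows.append((t, cable, point_info, v))
    return rows


# vector probe: two cables, no point info
vec = to_rows([(0.0, [-65.0, -64.0])],
              [((0, 0.0, 0.5), None), ((0, 0.5, 1.0), None)])
# scalar probe: one zero-length cable (prox == dist), same code path
sca = to_rows([(0.0, [-65.0])], [((0, 0.5, 0.5), None)])
print(len(vec), len(sca))  # 2 1
```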

I agree, though, on the extensibility argument; adding a probe type within a cell type without conforming to the suggested interface would be painful. I'm also eyeing the idea of cell groups as a plugin. Nesting anys could help...

These problems even bubble up to the Python interface:

for (gid, tag), handle in handles.items():
    for data, meta in sim.samples(handle):
        columns = []
        if isinstance(meta, list):
            columns += list(map(str, meta))
        else:
            columns.append(str(meta))
        df = pd.DataFrame(data=data[:, 1:], columns=columns, index=data[:, 0])
        # do something w/ df

at least, this could be:

for (gid, tag), handle in handles.items():
    for data, meta in sim.samples(handle):
        columns = list(map(str, meta))
        df = pd.DataFrame(data=data[:, 1:], columns=columns, index=data[:, 0])
        # ...

and I'd even prefer:

for (gid, tag), handle in handles.items():
    data, meta = sim.samples(handle)
    columns = list(map(str, meta))
    df = pd.DataFrame(data=data[:, 1:], columns=columns, index=data[:, 0])
    # ...