Continvvm/continuum

[Question] rehearsal example clarification regarding train/valid splits

kirk86 opened this issue · 2 comments

In the rehearsal example there's the following template:

scenario = ClassIncremental(
    CIFAR100(data_path="my/data/path", download=True, train=True),
    increment=10,
    initial_increment=50
)

memory = rehearsal.RehearsalMemory(
    memory_size=2000,
    herding_method="barycenter"
)

for task_id, taskset in enumerate(scenario):
    if task_id > 0:
        mem_x, mem_y, mem_t = memory.get()
        taskset.add_samples(mem_x, mem_y, mem_t)

    loader = DataLoader(taskset, shuffle=True)
    for epoch in range(epochs):
        for x, y, t in loader:
            ...  # Do your training here

    # Herding based on the barycenter (as iCaRL did) needs features,
    # so we need to extract those features, but beware to use a loader
    # without shuffling.
    loader = DataLoader(taskset, shuffle=False)

    features = my_function_to_extract_features(my_model, loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features)
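
A note on why the loader must not shuffle here: `memory.add` pairs each feature with the raw sample at the same index, so the feature order must match the dataset order exactly. A toy illustration of that indexing invariant in plain Python (no torch or continuum involved):

```python
import random

# Raw samples in dataset order, and "features" extracted for them
# with an UNSHUFFLED loader: positions line up.
samples = ["img0", "img1", "img2", "img3"]
features = [f"feat_of_{s}" for s in samples]

paired = list(zip(samples, features))
assert all(f == f"feat_of_{s}" for s, f in paired)

# Had extraction run over a shuffled loader, pairing by position
# would silently attach features to the wrong samples.
shuffled = samples[:]
while shuffled == samples:
    random.shuffle(shuffled)
features_shuffled = [f"feat_of_{s}" for s in shuffled]
mispaired = list(zip(samples, features_shuffled))
assert any(f != f"feat_of_{s}" for s, f in mispaired)
```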

How does that change if we add train/valid splits?

for task_id, taskset in enumerate(scenario):
    if task_id > 0:
        mem_x, mem_y, mem_t = memory.get()
        taskset.add_samples(mem_x, mem_y, mem_t)

    dataset_train, dataset_val = tasks.split_train_val(taskset, val_split=0.1)
    train_loader = tud.DataLoader(dataset_train, shuffle=True)
    val_loader = tud.DataLoader(dataset_val, shuffle=False)
    
    for epoch in range(epochs):
        for x, y, t in train_loader:
            ...  # Do your training here

    # Herding based on the barycenter (as iCaRL did) needs features,
    # so we need to extract those features, but beware to use a loader
    # without shuffling.
    unshuffled_loader = DataLoader(taskset, shuffle=False)  # --> here should it be taskset or dataset_train?

    features = my_function_to_extract_features(my_model, unshuffled_loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features) # --> scenario[task_id].get_raw_samples() returns all samples in the current taskset?

If unshuffled_loader uses dataset_train then len(features) $\neq$ len(scenario[task_id].get_raw_samples()[0]).
Another question: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split? Because, in my understanding, the taskset contains all samples before the train/valid split, right?

Hi @kirk86 ,
Thanks for the issue.

--> here should it be taskset or dataset_train?
I would do it based on dataset_val to avoid overfitting when sampling, but it is a matter of choice, I believe.

--> scenario[task_id].get_raw_samples() returns all samples in the current taskset?
Yes.

If unshuffled_loader uses dataset_train then len(features) $\neq$ len(scenario[task_id].get_raw_samples()[0]).
True. Is that a problem?

Another question: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split?
Usually we would/should use a separate memory buffer for validation, with data from all tasks seen so far; or split the taskset before adding samples, and add only train samples to the rehearsal memory. Otherwise, some samples might be in train for one task and later in val.

The taskset contains all the samples of the current task.
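
To make the separate-validation-buffer idea concrete, here is a minimal pure-Python sketch; the `ValBuffer` class and all names are invented for illustration (in practice you would store tensors and build the buffer with continuum's taskset API):

```python
class ValBuffer:
    """Toy buffer accumulating held-out validation samples across tasks."""

    def __init__(self):
        self.x, self.y, self.t = [], [], []

    def add(self, xs, ys, task_id):
        self.x += list(xs)
        self.y += list(ys)
        self.t += [task_id] * len(xs)


val_memory = ValBuffer()

# Simulate 3 tasks of 10 samples each. Each task is split BEFORE the
# rehearsal memory sees anything, so a given sample can never be in
# train for one task and reappear in val later.
for task_id in range(3):
    xs = [f"task{task_id}_sample{i}" for i in range(10)]
    ys = [task_id] * 10
    n_val = max(1, len(xs) // 10)          # e.g. a 10% validation split
    val_x, train_x = xs[:n_val], xs[n_val:]
    val_y, train_y = ys[:n_val], ys[n_val:]
    val_memory.add(val_x, val_y, task_id)
    # ... train on train_x / train_y; the rehearsal memory only ever
    # receives samples drawn from train_x ...

# After the last task, the validation buffer holds data from ALL tasks
# seen so far, and none of it overlaps with any train split.
assert sorted(set(val_memory.t)) == [0, 1, 2]
assert len(val_memory.x) == 3
```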

I hope those answers help :)

Hi @TLESORT,
thanks for the reply.

If unshuffled_loader uses dataset_train then len(features) $\neq$ len(scenario[task_id].get_raw_samples()[0]).
True. Is that a problem?

I think so, because memory.add(arg1, arg2) throws an error when the two arguments don't have an equal number of samples.
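
Concretely, the invariant being violated is that features must be extracted over exactly the samples that `get_raw_samples` returns, one feature per sample. A toy sketch of that length check in plain Python (the `add_to_memory` function only mimics the kind of validation a buffer would do; it is not continuum's actual code):

```python
def add_to_memory(raw_x, raw_y, features):
    # A buffer can only pair features with samples one-to-one.
    if len(features) != len(raw_x):
        raise ValueError(f"{len(features)} features for {len(raw_x)} samples")
    return list(zip(raw_x, raw_y, features))


raw_x = ["s0", "s1", "s2", "s3"]
raw_y = [0, 0, 1, 1]

# Features computed over the FULL task: lengths agree, add succeeds.
ok = add_to_memory(raw_x, raw_y, [f"f_{x}" for x in raw_x])
assert len(ok) == 4

# Features computed over a train split only (3 of 4 samples): mismatch.
train_x = raw_x[:3]
try:
    add_to_memory(raw_x, raw_y, [f"f_{x}" for x in train_x])
except ValueError:
    mismatch_caught = True
assert mismatch_caught
```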

Another question: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split?
Usually we would/should use a separate memory buffer for validation, with data from all tasks seen so far; or split the taskset before adding samples, and add only train samples to the rehearsal memory. Otherwise, some samples might be in train for one task and later in val.

I didn't quite get the first part about using a separate memory buffer for validation. I'll apply your suggested changes to the MWE below; let me know if there's a misunderstanding on my part.

for task_id, taskset in enumerate(scenario):
    dataset_train, dataset_val = tasks.split_train_val(taskset, val_split=0.1)
    train_loader = tud.DataLoader(dataset_train, shuffle=True)
    val_loader = tud.DataLoader(dataset_val, shuffle=False)

    if task_id > 0: # --> as suggested, first split taskset and then add samples to dataset_train only
        mem_x, mem_y, mem_t = memory.get()
        dataset_train.add_samples(mem_x, mem_y, mem_t)
    
    for epoch in range(epochs):
        for x, y, t in train_loader:
            ...  # Do your training here

    # beware use a loader without shuffling.
    unshuffled_loader = DataLoader(dataset_val, shuffle=False)  # --> as suggested, dataset_val to avoid overfitting?

    features = my_function_to_extract_features(my_model, unshuffled_loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features) # --> how should this change, is it dataset_train.get_raw_samples?

Could you also illustrate what you meant by "Usually we would/should use a separate memory buffer for validation with data from all tasks seen so far", and show where exactly it fits in the MWE?