Cadene/bootstrap.pytorch

Improve logging (logs.json) with SQLite

Cadene opened this issue · 0 comments

tl;dr: SQLite will replace logs.json

Our current implementation

We use a Logger object that stores data as lists of values associated with keys in a Python dictionary. This dictionary is kept in RAM. At the end of each train or eval epoch, Logger creates/flushes a logs.json file in the experiment directory.

logs/myexperiment/logs.json
{
  "train_epoch.epoch": [0, 1, 2, 3, 4, 5],
  "train_epoch.acc_top1": [0.0, 5.7, 13.8, 20.4, 28.1, 37.9]
}
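
For reference, a minimal sketch of this behaviour (the class and method names below are illustrative, not the exact Logger API):

import json
import os

class Logger:

    def __init__(self, dir_logs):
        self.path = os.path.join(dir_logs, "logs.json")
        self.values = {}  # key -> list of values, kept in RAM

    def log_value(self, key, value):
        self.values.setdefault(key, []).append(value)

    def flush(self):
        # the whole file is rewritten on every flush
        with open(self.path, "w") as f:
            json.dump(self.values, f)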

Its problems

  • If the code crashes before a flush, the data is lost, even though we want to use Logger precisely to monitor things such as CPU or GPU memory usage right before a crash!
  • We need to rewrite the full JSON file each time a new value is added.
  • We need to load the full JSON file each time we want to visualize something, even if only one new value was added.

Our constraints

  • We want to keep our logs inside the experiment directory (no external SQL/NoSQL database server; an embedded file such as SQLite might be acceptable).
  • We want to write new values only (for instance, at epoch 10 we only write the values of epoch 10).
  • We want concurrent reads and writes (at least on different keys).

Some propositions

The following tools store the data on the file system (not in RAM).

h5py (one file)


logs/myexperiment/logs.h5

Pros:

  • Uses NumPy arrays
  • Easy random access: data['train_epoch.epoch'][10]

Cons:

  • Extendable datasets (when you do not specify the final size up front) seem to require an explicit resize call (see the sketch below).
  • We have encountered many HDF5 bugs in the past when reading or writing from multiple threads/processes.
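
A minimal sketch of the append pattern this would impose, assuming the single-file layout above (the dataset name is taken from the logs.json example):

import h5py

with h5py.File("logs/myexperiment/logs.h5", "a") as f:
    key = "train_epoch.acc_top1"
    if key not in f:
        # maxshape=(None,) makes the dataset extendable
        f.create_dataset(key, shape=(0,), maxshape=(None,), dtype="f8")
    dset = f[key]
    dset.resize((dset.shape[0] + 1,))  # explicit resize before each append
    dset[-1] = 37.9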

LMDB


logs/myexperiment/logs/train_epoch.epoch.lmdb

Pros:

Cons:

  • Cumbersome to use
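
To illustrate the point above, a minimal sketch with the py-lmdb bindings, assuming the one-environment-per-key layout (here for train_epoch.acc_top1); everything has to be encoded to and decoded from raw bytes by hand:

import struct

import lmdb

env = lmdb.open("logs/myexperiment/logs/train_epoch.acc_top1.lmdb", map_size=2**26)
with env.begin(write=True) as txn:
    # key = step index as big-endian uint32, value = metric as big-endian double
    txn.put(struct.pack(">I", 5), struct.pack(">d", 37.9))
with env.begin() as txn:
    value = struct.unpack(">d", txn.get(struct.pack(">I", 5)))[0]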

netCDF


logs/myexperiment/logs.nc
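
For completeness, a minimal sketch of how appending could look with the netCDF4 Python bindings (the variable name and the unlimited "step" dimension are assumptions, not an existing schema):

import os

from netCDF4 import Dataset

path = "logs/myexperiment/logs.nc"
with Dataset(path, "a" if os.path.exists(path) else "w") as ds:
    if "step" not in ds.dimensions:
        ds.createDimension("step", None)  # unlimited dimension, grows on append
    if "train_epoch_acc_top1" not in ds.variables:
        ds.createVariable("train_epoch_acc_top1", "f8", ("step",))
    var = ds.variables["train_epoch_acc_top1"]
    var[var.shape[0]] = 37.9  # writing one index past the end appends a value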

One CSV (or binary) file per key

logs/myexperiment/logs/train_epoch.epoch.csv

Pros:

  • Very easy to understand and track

Cons:

  • Creates one file per tracked variable
  • Associating different variables for the same time step requires reading different files and aligning them
  • Difficult to implement properly (we would be reinventing the wheel)
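
That said, the per-key append itself is simple; a minimal sketch (the helper name and the directory layout are assumptions):

import csv
import os

def append_value(dir_logs, key, step, value):
    # hypothetical helper: one CSV file per key, one row appended per value
    path = os.path.join(dir_logs, "logs", key + ".csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["step", "value"])
        writer.writerow([step, value])

append_value("logs/myexperiment", "train_epoch.acc_top1", 5, 37.9)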

SQLite


logs/myexperiment/logs.sqlite

Pros:

  • A single file can grow large enough for our needs
  • Allows easy concurrent reads/writes
  • Caching system (TODO source)
  • Binary encoding
  • Indexing (easy to read only what we want)
  • Meta-data: timestamp, epoch_id, iteration_id
  • Fault-tolerant (in case of a crash)

Cons:

  • Requires a library to read the logs, and users must know SQL to write custom queries/applications (we could add a wrapper over SQLite in Logger)
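
A minimal sketch of what the write/read path could look like (the logs table, its columns, and the WAL pragma are assumptions, not a settled schema):

import sqlite3
import time

conn = sqlite3.connect("logs/myexperiment/logs.sqlite")
conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
conn.execute("""
    CREATE TABLE IF NOT EXISTS logs (
        key       TEXT,
        value     REAL,
        epoch_id  INTEGER,
        iter_id   INTEGER,
        timestamp REAL
    )""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_logs_key ON logs (key)")

# write new values only: one INSERT per logged value
conn.execute(
    "INSERT INTO logs (key, value, epoch_id, iter_id, timestamp) VALUES (?, ?, ?, ?, ?)",
    ("train_epoch.acc_top1", 37.9, 5, None, time.time()),
)
conn.commit()

# read only what we want, thanks to the index on key
rows = conn.execute(
    "SELECT epoch_id, value FROM logs WHERE key = ? ORDER BY epoch_id",
    ("train_epoch.acc_top1",),
).fetchall()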

Comparing experiments with SQLite

import sqlite3

# all_experiments / list_of_metrics are assumed to exist; queries use the logs table sketched above
databases = []
for experiment in all_experiments:
    databases.append(sqlite3.connect("logs/{}/logs.sqlite".format(experiment)))
for experiment, database in zip(all_experiments, databases):
    for metric in list_of_metrics:
        # aggregates are computed by SQLite and may already be in its page cache
        min_metric = database.execute("SELECT MIN(value) FROM logs WHERE key = ?", (metric,)).fetchone()[0]
        max_metric = database.execute("SELECT MAX(value) FROM logs WHERE key = ?", (metric,)).fetchone()[0]
        # ... agglomerate min_metric/max_metric across experiments in Python