HDF5 dataset format: how to convert

I presume training on HDF5 will be more efficient than on any of the other formats. And at least compared with the line/GT file pairs, filesystem performance might be much better, too.

So my question is: how do I convert existing datasets into HDF5 format?

Hi Robert, at the moment there is no script that converts data from the command line. When running cross-fold training, the data is copied to HDF5 before the training starts. Have a look here:

calamari/calamari_ocr/ocr/training/cross_fold.py

Lines 77 to 90 in 3b1969b

    
    # else load the data of each fold and write it to hd5 data files
    with ExitStack() as stack:
        folds = [
            stack.enter_context(Hdf5DatasetWriter(os.path.join(self.output_dir, "fold{}".format(i))))
            for i in range(self.n_folds)
        ]
        for i, sample in tqdm_wrapper(
            enumerate(data_generator.generate()),
            progress_bar=progress_bar,
            total=len(data_generator),
            desc="Creating hdf5 files",
        ):
            sample: Sample = sample
            folds[i % self.n_folds].write(sample.inputs, sample.targets)

For my own training, I've hacked together some lines of code at https://github.com/andbue/nashi/blob/master/ocr/nashi_ocr/nashi_client.py to save preprocessed data in a single HDF5 file, so I can re-run training and prediction without the need for preprocessing the images again. If I had the time, it would be sensible to integrate some of that into calamari, I guess.

Hi Andreas – thanks for your fast feedback!

I think I understood the writer part, but could you please fill me in on the reader side (for file pairs)? What's the minimal / best pattern to instantiate a data generator – scripts.dataset_viewer.DataWrapper perhaps?

That's where I would have started as well. Maybe make a copy of dataset_viewer.py that sets PipelineMode.EVALUATION and writes sample.inputs and sample.targets to the Hdf5DatasetWriter instead of showing them in pyplot. If I'm not totally mistaken, this should work with all kinds of datasets. Just in case you end up with something helpful for other users as well: feel free to put it in a PR!
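
Roughly, the loop could look like the sketch below. It is untested against calamari itself and only shows the shape of the conversion: iterate over the samples and pass `sample.inputs` and `sample.targets` to a writer that is used as a context manager, as in the cross_fold.py snippet. `Sample`, `SampleWriter`, `InMemoryWriter` and `convert` are hypothetical names made up for this example; for a real conversion the stand-in writer would be replaced by calamari's `Hdf5DatasetWriter`, and the sample list by the data generator that dataset_viewer.py sets up.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Protocol, Tuple


@dataclass
class Sample:
    """Hypothetical stand-in for a calamari sample: one line image plus its text."""
    inputs: Any   # preprocessed line image, e.g. a numpy array
    targets: str  # ground-truth transcription


class SampleWriter(Protocol):
    """Anything usable as `with writer: writer.write(inputs, targets)`."""
    def __enter__(self) -> "SampleWriter": ...
    def __exit__(self, *exc_info) -> None: ...
    def write(self, inputs: Any, targets: str) -> None: ...


class InMemoryWriter:
    """Demo writer that only collects samples (assumption: the real writer
    exposes the same write(inputs, targets) call seen in cross_fold.py)."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Any, str]] = []
        self.closed = False

    def __enter__(self) -> "InMemoryWriter":
        return self

    def __exit__(self, *exc_info) -> None:
        # A file-backed writer would flush and close its file here.
        self.closed = True

    def write(self, inputs: Any, targets: str) -> None:
        self.rows.append((inputs, targets))


def convert(samples: Iterable[Sample], writer: SampleWriter) -> int:
    """Write every sample to the writer and return how many were written."""
    count = 0
    with writer:
        for sample in samples:
            writer.write(sample.inputs, sample.targets)
            count += 1
    return count


# Tiny demo with two fake "line images" (nested lists instead of arrays).
demo_writer = InMemoryWriter()
n_written = convert([Sample([[0, 1]], "ab"), Sample([[1, 0]], "cd")], demo_writer)
```

Keeping the writer behind a small interface means the loop can be tried on a handful of samples first, before pointing it at a full dataset and a real HDF5 file.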

Understood, thanks! I'll give it a try.
