BoomBox

BoomBox is a small program that adds noise to TXT or ALTO XML files. It was created for an experiment on data used in handwritten text recognition (HTR).

Set up

Create a virtual environment with Python's virtualenv or with Anaconda, then use pip to install the requirements.

(venv)~$ pip install -r requirements.txt

Tuning the boombox

You can adjust different parameters to define the quantity and the type of noise made by the boombox. All of these parameters can be set in the config.py file.

Random Seed

You can make the noise somewhat reproducible by using a seed.
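As an illustration of why a seed helps, here is a minimal sketch using Python's standard random module. The function name noisy_positions and its parameters are hypothetical and are not taken from the boombox's code or from config.py; the sketch only shows that a seeded generator selects the same characters on every run.

```python
import random


def noisy_positions(text, probability, seed=None):
    """Return the indices of the characters selected to receive noise.

    Hypothetical helper: the boombox's own selection logic may differ.
    """
    # A dedicated generator keeps the seed local to this call.
    rng = random.Random(seed)
    return [i for i in range(len(text)) if rng.random() < probability]


first = noisy_positions("this is my example", 0.3, seed=42)
second = noisy_positions("this is my example", 0.3, seed=42)
print(first == second)  # prints True: same seed, same selection
```

Without a seed (seed=None), the generator is initialised from system entropy and two runs will usually select different characters.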

Word replacement scenario

In an attempt to imitate transcribers' errors, the boombox can delete a word or replace it with a fixed character repeated 2 to 5 times. By default, an activated word replacement scenario produces one of the following outcomes:

"": this is my example -> this is example
"X": this is my example -> this is XXX example

You can set the probability that one or the other replacement string is chosen, and you can add more replacement strings.

You can also set the probability that the word replacement scenario occurs.

Since this scenario runs after all the other modifications have been applied to the text, keep in mind that it might overwrite some of the noise introduced at character level.
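The behavior described above can be sketched as follows. This is an assumption-laden illustration, not the boombox's implementation: the function name replace_word, its parameters, and the choice to modify a single randomly picked word are all hypothetical.

```python
import random


def replace_word(text, options, weights, probability, rng):
    """Delete one word or replace it with a repeated filler character.

    Hypothetical sketch: `options` are the replacement strings ("" means
    deletion), `weights` their relative probabilities, and `probability`
    the chance that the scenario is applied at all.
    """
    if rng.random() >= probability:
        return text  # scenario not triggered
    words = text.split(" ")
    index = rng.randrange(len(words))
    filler = rng.choices(options, weights=weights, k=1)[0]
    if filler == "":
        del words[index]  # "": the word is dropped
    else:
        words[index] = filler * rng.randint(2, 5)  # "X": XX to XXXXX
    return " ".join(words)


rng = random.Random(0)
print(replace_word("this is my example", ["", "X"], [0.5, 0.5], 1.0, rng))
```

Adding another replacement string then amounts to appending an entry to options and a matching entry to weights.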

Typos

The boombox induces realistic typos in the text. Using our own adaptation of the typo¹ library, it can generate 9 different types of typos:

swap will swap two adjacent letters: example -> exapmle
delete will remove 1 letter: example -> examle
insert will add 1 letter, considered a keyboard neighbor: example -> exampole
nearby will replace 1 letter with another letter, considered a keyboard neighbor: example -> examole
similar will replace 1 letter with another letter considered visually similar: example -> examqle
agglomerate will stick two words together, simulating a forgotten space: my example -> myexample
repeat will repeat a character: example -> exampple
unichar will take a double character and replace it with a single character: happy -> hapy
split will add a random space in the middle of a word: example -> exam ple

It is possible to deactivate some of these scenarios and to modify how likely one scenario is to occur.

Note that, given the behavior of the typo library and the fact that we apply several transformation scenarios in a row, a typo can be added on top of another typo (with the risk of cancelling it).

Lastly, you can set how likely it is for a typo (whatever its type) to be added to the data.
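The following sketch illustrates four of the scenarios listed above with plain string operations. It is not the code of the typo library or of the boombox's adaptation: the function signatures and the explicit index argument are assumptions made so that the examples are deterministic.

```python
def swap(word, i):
    """Swap the letters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete(word, i):
    """Remove the letter at position i."""
    return word[:i] + word[i + 1:]


def repeat(word, i):
    """Duplicate the letter at position i."""
    return word[:i] + word[i] + word[i:]


def unichar(word):
    """Collapse the first doubled letter into a single one."""
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            return word[:i] + word[i + 1:]
    return word  # no doubled letter found


print(swap("example", 3))    # exapmle
print(delete("example", 4))  # examle
print(repeat("example", 4))  # exampple
print(unichar("happy"))      # hapy
# Applying the same swap twice restores the original word.
print(swap(swap("example", 3), 3))  # example
```

The last line shows how stacking typos can cancel noise: a second swap at the same position undoes the first one.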

Press Play

Make sure you are in the boombox's main directory to run run.py, which accepts the following options:

Boombox: a tool for generating noisy data for HTR training.Use config.py to set the parameters of the noise to be applied.

options:
  -h, --help            show this help message and exit
  -i PATH, --path PATH  path to the folder containing the files to be processed
  -t TYPE, --type TYPE  type of file to be processed: text or alto
  -o PATH_OUT, --path_out PATH_OUT
                        path to the folder where the noisy files will be saved
  --cer CER             character error rate to aim for
  -wr WORD_REPLACE, --word_replace WORD_REPLACE
                        word replacement probability
  --auto_dirname        if activated, automatically name the output directory after the noise level reached

You can use --cer and -wr to override the values set in config.py when calling the script.

The most basic command to let the boombox play would be:

(venv)~/boombox$ python run.py -i path/to/source/files/ -t txt

(venv)~/boombox$ python run.py -i path/to/source/files/ -t alto

The other options are... optional. 🥁

Targeted noise level vs. actual noise level

If you set a targeted character error probability of 10% and a word replacement probability of 0%, you will likely not end up with a text containing exactly 10% noise. There are several reasons for this. First, the value you set is only a probability. Second, as mentioned above, a typo scenario can occur on a previously modified part of the text: if you swap the same letters twice, you actually cancel the modification. Finally, when word replacement is activated, it can erase a word that didn't contain any typo, potentially adding a lot of noise to the data at once.

Luckily, the boombox applies a Character Error measure at the end of the transformation process to assess the noise level actually added to the text. If you use the --auto_dirname option, the new noisy dataset will be saved in a folder named after the noise level actually reached.
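To make the idea of a measured noise level concrete, here is a minimal sketch of a character error rate computed as the Levenshtein edit distance divided by the length of the original text. This is a common definition, but it is an assumption here: the boombox may compute its measure differently, and the function names below are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, noisy):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, noisy) / len(reference)


# Deleting "my " removes 3 of 18 characters.
print(round(character_error_rate("this is my example", "this is example"), 3))  # 0.167
```

A single deleted word in a short line already yields a rate well above a 10% target, which is one reason the measured level can differ from the targeted one.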

¹ Note that our adaptation of the typo library also adds an AZERTY configuration for neighboring characters and adapts the set of visually similar characters to the case of handwritten characters.

alix-tz/boombox