sparkfish/shabby-pages

Add missing script that generates the clean / dirty split

jboarman opened this issue · 5 comments

I may have just missed it, but I don't see the script that generates the pages from the PDFs. We chose 150 DPI output, for example. If we wanted to regenerate the dataset at a different resolution or with new PDF sources, we would need this script.

The way I did it requires poppler-utils:

pdftoppm -r preferred_resolution -png document.pdf some_name
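For reference, the pdftoppm call above could be wrapped in a small script so the pages can be regenerated at another resolution or from new PDFs. This is only a sketch, not the script that was used for the dataset: the function names `build_pdftoppm_command` and `render_pdfs` and the directory layout are assumptions made for illustration, and it assumes `pdftoppm` from poppler-utils is on the PATH.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Return the pdftoppm argument list for rendering a PDF to PNG pages."""
    return [
        "pdftoppm",
        "-r", str(dpi),      # resolution in DPI (150 was the value chosen for the dataset)
        "-png",              # write PNG files instead of the default PPM
        str(pdf_path),
        str(output_prefix),  # pdftoppm appends a page number and ".png" to this prefix
    ]


def render_pdfs(pdf_dir, out_dir, dpi=150):
    """Render every PDF in pdf_dir into PNG pages under out_dir (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        command = build_pdftoppm_command(pdf_path, out_dir / pdf_path.stem, dpi)
        # check=True raises CalledProcessError if pdftoppm exits with an error
        subprocess.run(command, check=True)
```

Calling `render_pdfs("pdfs", "pages", dpi=300)` would then re-render everything at 300 DPI, with each page image named after its source PDF.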

Generating the clean/dirty split is another matter. I believe @kwcckw has the code for this, maybe in a notebook.

That's awesome that you tracked that issue right here in GH! 👍

To generate the clean/dirty split, we need a dataset with both clean and dirty images. Do we have that dataset here, or will this be a general script?

This should be a general script since shabby is more about creating a repeatable recipe than a specific dataset.
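As a rough illustration of what a general script might look like, the sketch below pairs each clean page with a dirty counterpart under parallel folders. It is not the code from the pull request: the function name `write_clean_dirty_pairs`, the `clean`/`dirty` folder names, and the `make_dirty` callable are all hypothetical, and `make_dirty` stands in for whatever degradation step the recipe actually uses.

```python
import shutil
from pathlib import Path


def write_clean_dirty_pairs(page_dir, out_dir, make_dirty):
    """Copy each PNG page to out_dir/clean and write its degraded twin to out_dir/dirty."""
    clean_dir = Path(out_dir) / "clean"
    dirty_dir = Path(out_dir) / "dirty"
    clean_dir.mkdir(parents=True, exist_ok=True)
    dirty_dir.mkdir(parents=True, exist_ok=True)

    names = []
    for page in sorted(Path(page_dir).glob("*.png")):
        shutil.copyfile(page, clean_dir / page.name)
        # make_dirty is a placeholder: it takes the clean file's bytes
        # and returns the bytes to store as the dirty version
        (dirty_dir / page.name).write_bytes(make_dirty(page.read_bytes()))
        names.append(page.name)
    return names  # file names shared by each clean/dirty pair
```

Keeping the degradation step as a parameter means the same recipe can be rerun on any set of pages, which fits the goal of a repeatable recipe rather than a fixed dataset.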

I added the code in pull request #73, so this should be resolved now.