Add missing script that generates the clean / dirty split

Question

jboarman opened this issue 2 years ago · 5 comments

I may have just missed it, but I don't see the script that generates the page images from the PDFs. We chose 150 DPI output, for example. If we wanted to regenerate the dataset at a different resolution or with new PDF sources, we would need this script.

Answer 1 · 2023-05-13T20:31:38.000Z

The way I did it requires poppler-utils: pdftoppm document.pdf some_name -r preferred_resolution -png
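For anyone who wants to script this over many PDFs, here is a minimal Python sketch that wraps the same pdftoppm call. It is not the project's actual script. The function names build_pdftoppm_command and render_pdf_pages are hypothetical, and the default of 150 DPI is taken from the question above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Return the argv list that renders every page of a PDF to PNG.

    Hypothetical helper: it only assembles the pdftoppm call quoted in this thread.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def render_pdf_pages(pdf_path, output_dir, dpi=150):
    """Render one PDF into output_dir and return the generated PNG paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install poppler-utils first")

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # pdftoppm appends "-<page number>.png" to the prefix it is given.
    prefix = output_dir / Path(pdf_path).stem
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(output_dir.glob(f"{prefix.name}-*.png"))
```

Calling render_pdf_pages("document.pdf", "pages", dpi=300) would regenerate the pages at 300 DPI, assuming poppler-utils is installed.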

Generating the clean/dirty split is another matter. I believe @kwcckw has the code for this, maybe in a notebook.

Answer 2 · 2023-05-13T20:57:26.000Z

That's awesome that you tracked that issue right here in GH! 👍

Answer 3 · 2023-05-14T01:30:20.000Z

To generate the clean/dirty split, we need a dataset with clean and dirty images. Do we have that dataset here, or should this be a general script?

Answer 4 · 2023-05-14T23:40:19.000Z

This should be a general script since shabby is more about creating a repeatable recipe than a specific dataset.

Answer 5 · 2023-05-17T12:12:26.000Z

I added the code in pull request #73, so this should be resolved now.