Prevent Skipping?
Ryandonofrio3 opened this issue · 6 comments
Hello. I simply want to process all pages of my PDF, but about 50% of the pages are skipped due to "repetitions". I can manually confirm that they are not repeats. For instance, for my 6-page PDF of an academic text, only the methods section was output. Is there a way to disable this and "force" the entire output?
(.venv) PS C:\Users\--\Desktop\Nougat> nougat .\t3.pdf -o .\output\
c:\Users\---\Desktop\Nougat\.venv\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3484.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
  0%| | 0/2 [00:00<?, ?it/s]WARNING:root:Found repetitions in sample 0
INFO:root:Processing file t3.pdf with 6 pages
WARNING:root:Skipping page 1 due to repetitions.
 50%|███████████████████████████████████████████████████████▌ | 1/2 [00:21<00:21, 21.34s/it]WARNING:root:Found repetitions in sample 1
WARNING:root:Skipping page 5 due to repetitions.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.44s/it]
(.venv) PS C:\Users\---\Desktop\Nougat>
Ok makes sense. I'll add support next week
That would be great, I am running into the same issue as well
That would be very useful!!
Done in 8ad92cc
Will update PyPI shortly.
@lukas-blecher How do you set this behaviour? I'm on the latest commit. That commit is just a moved line AFAICT.
Ok, the commit is weird. Add --no-skipping when calling nougat.
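For anyone scripting this, here is a minimal sketch of the same call from Python. It only assumes what this thread states: that the `nougat` CLI takes a PDF path, `-o` for the output directory, and the `--no-skipping` flag. The helper name `build_nougat_command` is hypothetical, not part of nougat, and the flag is only available if your installed version includes the change above.

```python
import subprocess
import sys
from pathlib import Path


def build_nougat_command(pdf_path, output_dir, no_skipping=True):
    """Build the argument list for a nougat CLI call.

    Hypothetical helper for illustration; it is not part of nougat itself.
    """
    cmd = ["nougat", str(pdf_path), "-o", str(output_dir)]
    if no_skipping:
        # Flag named in this thread; assumes a nougat version that supports it.
        cmd.append("--no-skipping")
    return cmd


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python run_nougat.py <pdf> <output_dir>
    pdf, out = Path(sys.argv[1]), Path(sys.argv[2])
    subprocess.run(build_nougat_command(pdf, out), check=True)
```

On the command line, the equivalent of the call from the original report would be `nougat .\t3.pdf -o .\output\ --no-skipping`.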