VikParuchuri/marker

Too much memory used for a big PDF: 800 pages cost 80 GB of RAM

whp98 opened this issue · 3 comments

Sometimes it also fails with a CUDA OOM error.
My GPU is a 4060 Ti 16 GB.

The PDF is this one: https://github.com/yuanliangding/books/blob/master/%E8%AE%A1%E7%AE%97%E6%9C%BA-%E7%BC%96%E7%A8%8B%E8%AF%AD%E8%A8%80-JAVA/Java%E5%B9%B6%E5%8F%91%E7%BC%96%E7%A8%8B%E5%AE%9E%E6%88%98.pdf

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 15.60 GiB of which 747.88 MiB is free. Including non-PyTorch memory, this process has 2.46 GiB memory in use. Of the allocated memory 2.14 GiB is allocated by PyTorch, and 166.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Error converting PDF to Markdown: Command '['marker_single', '/home/zzz/文档/PDF/Java并发编程实战.pdf', '/home/sss/dsadas/pdf-to-markdown/output']' returned non-zero exit status 1.

That's not an absurd limit to hit; many PDF services have page/file limits for this reason. You can work around it by slicing your PDF with another PDF library, converting each slice, and then joining the outputs at the end.
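
A rough sketch of the slice-and-join idea, in case it helps. It assumes `pypdf` as the "other PDF lib" (any library that can copy pages would do), and the helper names `chunk_ranges`, `split_pdf`, and `join_markdown` and the chunk size of 100 pages are made up for illustration. Running marker on each chunk is left out, since where the Markdown ends up depends on your marker version.

```python
from pathlib import Path


def chunk_ranges(total_pages, chunk_size):
    """Return (start, end) page-index pairs, end exclusive, covering all pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_pdf(pdf_path, out_dir, chunk_size=100):
    """Write the PDF as consecutive chunk files and return their paths in order."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the range helper above works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for i, (start, end) in enumerate(chunk_ranges(len(reader.pages), chunk_size)):
        writer = PdfWriter()
        for page_index in range(start, end):
            writer.add_page(reader.pages[page_index])
        chunk_path = out_dir / f"chunk_{i:03d}.pdf"
        with open(chunk_path, "wb") as fh:
            writer.write(fh)
        paths.append(chunk_path)
    return paths


def join_markdown(md_paths, joined_path):
    """Concatenate per-chunk Markdown files in the order given."""
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in md_paths]
    Path(joined_path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
```

You would call `split_pdf(...)`, convert each chunk one at a time so only one slice is in memory, then pass the resulting Markdown files to `join_markdown(...)` in the same order. Smaller chunks lower peak memory at the cost of more process startups.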

The CPU RAM issue should be fixed now.