Feature request: URL extraction
ShakirAkbari opened this issue · 1 comments
Requesting an additional feature when extracting information from PDFs.
Can you please add the ability to extract URLs from the document?
I wrote this code to pull link text and links out of PDFs. Maybe you could incorporate part of it into your codebase, with an extract_links_from_pdf option in the settings to enable it:
from gc import get_objects

def parse_pdf(filename):
def extract_links_from_pdf(pdf_path):
def extract_text_with_positions(pdf_path, page_num):
def associate_text_with_links(page_text, page_links):
def merge_adjacent_links(links):

if __name__ == '__main__':
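The function bodies did not survive in the paste above, so here is a rough, library-free sketch of what the last two helpers might do. It is a guess based on the function names only, not the original code. It assumes words and link annotations have already been extracted from a page as (value, rectangle) pairs, with rectangles given as (x0, y0, x1, y1); those data shapes and the dictionary keys "url" and "text" are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed rectangle layout: (x0, y0, x1, y1) in page coordinates.
Rect = Tuple[float, float, float, float]


def associate_text_with_links(
    page_text: List[Tuple[str, Rect]],
    page_links: List[Tuple[str, Rect]],
) -> List[Dict[str, str]]:
    """Attach to each link the words whose centre lies inside its rectangle."""
    results = []
    for url, (lx0, ly0, lx1, ly1) in page_links:
        words = []
        for word, (x0, y0, x1, y1) in page_text:
            # Use the word's centre so small overlaps at the edges are ignored.
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            if lx0 <= cx <= lx1 and ly0 <= cy <= ly1:
                words.append(word)
        results.append({"url": url, "text": " ".join(words)})
    return results


def merge_adjacent_links(links: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge consecutive entries that point at the same URL.

    A link that wraps across two lines often shows up as two annotations;
    this joins their text into a single entry.
    """
    merged: List[Dict[str, str]] = []
    for link in links:
        if merged and merged[-1]["url"] == link["url"]:
            merged[-1]["text"] = f'{merged[-1]["text"]} {link["text"]}'.strip()
        else:
            merged.append(dict(link))  # copy so the input is not mutated
    return merged


if __name__ == "__main__":
    # Hypothetical page content: one link wrapped over two lines.
    words = [
        ("See", (0, 0, 20, 10)),
        ("the", (25, 0, 40, 10)),
        ("docs", (0, 12, 25, 22)),
    ]
    links = [
        ("https://example.com/docs", (24, 0, 41, 10)),
        ("https://example.com/docs", (0, 12, 26, 22)),
    ]
    print(merge_adjacent_links(associate_text_with_links(words, links)))
```

Getting the words and link rectangles out of the PDF in the first place depends on whichever PDF library the project already uses, so that step is left out here.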