VikParuchuri/marker

Is there any way to enhance the bibliography?

flight505 opened this issue · 0 comments

Hello,

I've encountered problems with how bibliographies are handled during conversion. The issues are as follows:

  • The numbering is incorrect.
  • The bibliographies themselves are not parsed correctly or individually.
  • FYI: running on a MacBook Pro M3 Max (conversion takes less than a minute, but boot time is long; see [frankbaele]#259 as a possible solution).

I believe improvements in this area could significantly enhance the usefulness of the conversion process. Below is the PDFConverter class I use, which converts PDF files to Markdown using either Marker or LlamaParse so that I can evaluate which converter is better.

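# Imports the snippet relies on. The llama_index import path is an assumption
# and differs between versions (older releases: `from llama_index import ...`).
import os
import time

import streamlit as st
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse


class PDFConverter: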
    """
    A class for converting PDF files to Markdown format using either Marker or LlamaParse methods.

    Attributes:
        input_folder (str): Path to the folder containing input PDF files.
        output_folder (str): Path to the folder where converted Markdown files will be saved.
        method (str): Conversion method, either "marker" or "llamaparse".
        parser (LlamaParse): LlamaParse object for PDF parsing (only if method is "llamaparse").
    """

    def __init__(
        self,
        input_folder,
        output_folder,
        method="marker",
    ):
        """
        Initialize the PDFConverter object.

        Args:
            input_folder (str): Path to the folder containing input PDF files.
            output_folder (str): Path to the folder where converted Markdown files will be saved.
            method (str, optional): Conversion method, either "marker" or "llamaparse". Defaults to "marker".
        """
        self.input_folder = input_folder
        self.output_folder = output_folder
        self.method = method
        if method == "llamaparse":
            self.parser = LlamaParse(
                result_type="markdown",
                language="en",
                parsing_instruction="You are parsing publications, extract the complete content of the document including tables and the bibliography, skip images. Use markdown formatting for the output.",
            )

    def ensure_directory_exists(self, directory):
        """
        Create the directory if it doesn't exist.

        Args:
            directory (str): Path to the directory to be created.
        """
        if not os.path.exists(directory):
            os.makedirs(directory)

    def convert(self):
        """
        Convert PDF files to Markdown format.

        This method handles the entire conversion process, including:
        - Ensuring input and output directories exist
        - Counting files to be converted
        - Converting files
        - Displaying progress using Streamlit
        """
        self.ensure_directory_exists(self.input_folder)
        self.ensure_directory_exists(self.output_folder)

        pdf_files = [f for f in os.listdir(self.input_folder) if f.endswith(".pdf")]
        total_files = len(pdf_files)
        already_converted = self.count_converted_files(pdf_files)
        to_convert = total_files - already_converted

        with st.spinner(f"Converting PDF files using {self.method.capitalize()}..."):
            msg = st.toast("Checking for PDF files to convert...", icon="🔎")
            time.sleep(0.5)
            if not pdf_files:
                msg.toast("No PDF files found in the input folder.", icon="👎")
                st.stop()
            if to_convert == 0:
                msg.toast("All PDF files are already converted.", icon="👍")
            else:
                msg.toast(
                    f"Total: {total_files}, Already Converted: {already_converted}, To Convert: {to_convert}"
                )

            converted_files_count = 0
            for i, pdf_file in enumerate(pdf_files):
                if self.convert_pdf_to_md(pdf_file):
                    converted_files_count += 1
                if to_convert != 0:
                    time.sleep(0.5)
                    st.toast(f"Converting {i+1}/{total_files} files...", icon="☕")

            time.sleep(0.5)
            st.toast(
                f"Conversion completed! {converted_files_count} files converted.",
                icon="👍",
            )

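    def count_converted_files(self, pdf_files):
        """
        Count how many of the given PDF files already have converted output.

        For Marker, both the .md file and the _meta.json file must exist;
        for LlamaParse, only the .md file is checked.
        """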
        converted_count = 0
        for pdf_file in pdf_files:
            md_folder = os.path.join(self.output_folder, os.path.splitext(pdf_file)[0])
            md_file = os.path.splitext(pdf_file)[0] + ".md"
            md_path = os.path.join(md_folder, md_file)

            if self.method == "marker":
                meta_path = os.path.join(
                    md_folder, os.path.splitext(pdf_file)[0] + "_meta.json"
                )
                if os.path.exists(md_path) and os.path.exists(meta_path):
                    converted_count += 1
            else:  # LlamaParse method
                if os.path.exists(md_path):
                    converted_count += 1

        return converted_count

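    def convert_pdf_to_md(self, pdf_file):
        """
        Convert a single PDF file, skipping it if output already exists.

        Returns:
            bool: True if the file was converted in this call, False otherwise.
        """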
        pdf_path = os.path.join(self.input_folder, pdf_file)
        md_folder = os.path.join(self.output_folder, os.path.splitext(pdf_file)[0])
        md_file = os.path.splitext(pdf_file)[0] + ".md"
        md_path = os.path.join(md_folder, md_file)

        if self.method == "marker":
            meta_path = os.path.join(
                md_folder, os.path.splitext(pdf_file)[0] + "_meta.json"
            )

            if os.path.exists(md_path) and os.path.exists(meta_path):
                return False

            command = f"marker_single '{pdf_path}' '{self.output_folder}' --batch_multiplier 4"
            os.environ["OCR_ALL_PAGES"] = "True"
            os.environ["EXTRACT_IMAGES"] = "False"
            os.environ["DEFAULT_LANG"] = "English"
            os.system(command)

            return os.path.exists(md_path) and os.path.exists(meta_path)
        else:  # LlamaParse method
            if os.path.exists(md_path):
                return False

            os.makedirs(md_folder, exist_ok=True)  # Create the subfolder

            file_extractor = {".pdf": self.parser}
            documents = SimpleDirectoryReader(
                input_files=[pdf_path], file_extractor=file_extractor
            ).load_data()

            with open(md_path, "w", encoding="utf-8") as f:
                f.write(documents[0].text)

            return True
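
A side note on the `marker_single` call above: building the command as an f-string for `os.system` breaks on paths that contain a single quote. A possible hardening, shown only as a sketch, is to pass an argument list and an explicit environment to `subprocess.run`. The helper name `build_marker_invocation` is hypothetical; the command, flag, and environment variables are the same ones used in the class above.

```python
import os
import subprocess


def build_marker_invocation(pdf_path, output_folder, batch_multiplier=4):
    """Return (command, env) for marker_single without shell quoting.

    Hypothetical helper: it only repackages the command, flag and
    environment variables already used in convert_pdf_to_md.
    """
    command = [
        "marker_single",
        pdf_path,
        output_folder,
        "--batch_multiplier",
        str(batch_multiplier),
    ]
    # Copy the current environment instead of mutating os.environ globally.
    env = dict(
        os.environ,
        OCR_ALL_PAGES="True",
        EXTRACT_IMAGES="False",
        DEFAULT_LANG="English",
    )
    return command, env


# Usage inside convert_pdf_to_md (not executed here):
#   command, env = build_marker_invocation(pdf_path, self.output_folder)
#   subprocess.run(command, env=env, check=False)
```

Because the arguments are passed as a list, no shell is involved and file names with quotes or spaces reach `marker_single` unchanged.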

Attempts to Fix or Workarounds

  • [ x ] Checked the documentation for any known issues related to bibliography parsing.
  • [ x ] Experimented with different parser settings.

I would greatly appreciate any suggestions or insights on how to address these issues.

Example of Marker bibliography

Adibuzzaman M, DeLaurentis P, Hill J, Benneyworth BD (2018) Big data in healthcare—the promises, challenges and opportunities from a research perspective: a case study with a model database.

AMIA Annu Symp Proc 2017:384–392 Agbo CC, Mahmoud QH, Eklund JM (2019) Blockchain technology in healthcare: a systematic review. Healthcare 7:56 Aguet F, Brown AA, Castel SE, Davis JR, He Y, Jo B et al. (2017)
Genetic effects on gene expression across human tissues. Nature 550:204–213 Akbarian S, Liu C, Knowles JA, Vaccarino FM, Farnham PJ, Crawford GE et al. (2015) The PsychENCODE project. Nat Neurosci 18:1707–1712 Allen N, Sudlow C, Downey P, Peakman T, Danesh J, Elliott P et al.

(2012) UK Biobank: current status and what it means for epidemiology. Health Policy Technol 1:123–126 Assis-Hassid S, Grosz BJ, Zimlichman E, Rozenblum R, Bates DW
(2019) Assessing EHR use during hospital morning rounds: a multi-faceted study. PLoS ONE 14:e0212816 Bang CS, Baik GH (2019) Using big data to see the forest and the trees: endoscopic submucosal dissection of early gastric cancer in Korea. Korean J Intern Med 34:772–774 Bender D, Sartipi K (2013) HL7 FHIR: an agile and RESTful approach to healthcare information exchange. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems, IEEE. pp 326–331 Bibault J-E, Giraud P, Burgun A (2016) Big Data and machine learning in radiation oncology: state of the art and future prospects. Cancer Lett 382:110-117 Blobel B (2018) Interoperable EHR systems—challenges, standards and solutions. Eur J Biomed Inf 14:10–19 Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ (2018)
Next-generation machine learning for biological networks. Cell 173:1581–1592.

Campbell PJ, Getz G, Stuart JM, Korbel JO, Stein LD (2020) Pancancer analysis of whole genomes. Nature https://www.nature. com/articles/s41586-020-1969-6 Chambers DA, Amir E, Saleh RR, Rodin D, Keating NL, Osterman TJ, Chen JL (2019) The impact of Big Data research on practice, policy, and cancer care. Am Soc Clin Oncol Educ Book Am Soc Clin Oncol Annu Meet 39:e167–e175 Char DS, Shah NH, Magnus D (2018) Implementing machine learning in health care—addressing ethical challenges. N Engl J Med 378:981-983 Cho WC (2015) Big Data for cancer research. Clin Med Insights Oncol 9:135–136 Cnudde P, Rolfson O, Nemes S, Kärrholm J, Rehnberg C, Rogmark C,
Timperley J, Garellick G (2016) Linking Swedish health data registers to establish a research database and a shared decision-making tool in hip replacement. BMC Musculoskelet Disord 17:414 Cohn EG, Hamilton N, Larson EL, Williams JK (2017) Self-reported race and ethnicity of US biobank participants compared to the US
Census. J Community Genet 8:229–238 Connelly R, Playford CJ, Gayle V, Dibben C (2016) The role of administrative data in the big data revolution in social science research. Soc Sci Res 59:1–12.
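
As a stopgap for output like the above, a post-processing step could re-split the run-together text into individual entries. The following is a minimal sketch and not part of Marker: `split_bibliography`, `AUTHOR` and `ENTRY_START` are hypothetical names, and it assumes an author-year style in which every entry begins with a comma-separated list of "Surname Initials" authors followed by "(YYYY)". Names that do not fit that shape (for example surnames with lowercase particles) would likely be split in the wrong place.

```python
import re

# One author in "Surname AB" form, e.g. "Agbo CC" or "Bibault J-E".
AUTHOR = r"[A-Z][\w'’-]+ [A-Z]{1,3}(?:-[A-Z])?"

# Assumed entry start: author list, optional "et al.", then "(YYYY)".
ENTRY_START = re.compile(
    rf"\b{AUTHOR}(?:, {AUTHOR})*(?: et al\.)? \(\d{{4}}[a-z]?\)"
)


def split_bibliography(raw_text):
    """Split a run-together author-year bibliography into entries."""
    # Collapse line breaks so an entry broken across lines is rejoined.
    text = " ".join(raw_text.split())
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    if not starts:
        return [text] if text else []
    if starts[0] != 0:
        # Keep any leading fragment as its own item instead of dropping it.
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


sample = (
    "Agbo CC, Mahmoud QH, Eklund JM (2019) Blockchain technology in "
    "healthcare: a systematic review. Healthcare 7:56 Aguet F, Brown AA, "
    "Castel SE, Davis JR, He Y, Jo B et al. (2017)\n"
    "Genetic effects on gene expression across human tissues. Nature "
    "550:204–213 Akbarian S, Liu C, Knowles JA, Vaccarino FM, Farnham PJ, "
    "Crawford GE et al. (2015) The PsychENCODE project. Nat Neurosci "
    "18:1707–1712"
)

# Print the entries with fresh sequential numbers.
for number, entry in enumerate(split_bibliography(sample), start=1):
    print(f"{number}. {entry}")
```

Numbering the entries with `enumerate` after splitting also sidesteps the incorrect numbering, since the numbers are regenerated rather than taken from the converted text.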

Example of LlamaParse bibliography (which is worse and skips most of the text)

Bibliography

  • Shendure and Ji (2008)
  • Topol (2019a)
  • Stephens et al. (2015)
  • Wetterstrand (2019)
  • Hasin et al. (2017)
  • Madhavan et al. (2018)
  • Adibuzzaman et al. (2018)
  • Krumholz (2014)
  • Fessele (2018)

Thank you.