Fixing the parsing of scientific notation and non-ASCII characters
Opened this issue · 2 comments
Because PDF encodings are a mess, parsing often leads to poor results with non-ASCII text of all kinds. For example, scientific notation in the form 10^3 is typically parsed as "103" by PyMuPDF and as "10 3" by grobid, which is really unfortunate when one tries to extract quantitative information (though the latter is fixable by post-processing). Similarly, non-ASCII characters are a bit of an RNG. For example, in one paper I was looking at IFN-β was parsed as IFN-\x02 by PyMuPDF and IFN- by grobid.
A few questions on this:
- I think grobid results are generally more fixable than PyMuPDF results. The newest paper for this repo uses grobid for some of its results - are you planning to add your own implementation of grobid parsing to this repo?
- As far as I can tell, there's currently no post-processing for the PyMuPDF parsing, and the recent paper also doesn't mention any post-processing for the grobid results. Do you currently plan to implement post-processing to fix some of this?
Some of these things would be relatively easy to fix. For example, when using grobid, scientific number notation could be rescued with something like this:
```python
import re

def rescue_scientific_notation(text):
    # replace the space between 10 and the exponent with ^
    # (\d+ rather than \d*, so "10 cells" is not turned into "10^cells")
    text = re.sub(r'([*x]?)(10) (\d+)', r'\1\2^\3', text)
    return text
```
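To illustrate on the kind of output grobid produces (the negative-exponent handling and the example strings here are my additions, not from the issue, and may need tuning against real parses):

```python
import re

def rescue_scientific_notation(text):
    # same idea as above, but also allowing a negative exponent after the space
    return re.sub(r'([*x]?)(10) (-?\d+)', r'\1\2^\3', text)

print(rescue_scientific_notation("a titer of 2 x 10 3 PFU"))  # a titer of 2 x 10^3 PFU
print(rescue_scientific_notation("diluted to 10 -6"))         # diluted to 10^-6
```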
Similarly, weird non-ASCII encodings could probably be rescued by asking ChatGPT to guess the correct character (this is a really crude example implementation):
```python
import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

def get_client():
    api_key = "an-api-key"
    # the async client is needed so asyncio.gather actually runs requests concurrently
    client = AsyncOpenAI(api_key=api_key)
    return client

def extract_gpt_response(response):
    return response.choices[0].message.parsed

async def fetch_structured_response(query, model, response_format, system_prompt, client):
    response = await client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            query,
        ],
        response_format=response_format,
    )
    return response

async def get_structured_responses(queries, model, response_format, system_prompt, client):
    # run all requests concurrently
    responses = await asyncio.gather(
        *(fetch_structured_response(query, model, response_format, system_prompt, client)
          for query in queries)
    )
    return [extract_gpt_response(r) for r in responses]

def return_queries(contents):
    return [{"role": "user", "content": content} for content in contents]

class TranslationDictionary(BaseModel):
    original_char: list[str]
    new_char: list[str]

system_prompt_conversion = """
You are looking at a text from a publication that contains potentially falsely encoded non-ASCII characters.
You are given examples in the format character, ord(char), hex(ord(char)), and the 20 characters before and after the character.
Your task is to create a python dictionary that maps each non-ASCII character to its correct ASCII equivalent.
Note that you can also match to the original character. Only do so if it genuinely makes sense in context.
If you cannot find a match, replace the character with U+2205.
"""
```
```python
def replace_text(text, replacements):
    # apply each mapping in turn with plain str.replace
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

async def preclean_parsed_text(text, openai_client):
    ascii_chars = [chr(i) for i in range(128)]
    greek_chars = [chr(i) for i in range(945, 970)]
    other_known_chars = ["©", "±", "°", "′", "Δ", "ï", "ö", "ü", "ä", "\n", "\r", "\t"]
    known_chars = ascii_chars + greek_chars + other_known_chars
    # fix some known replacements
    known_replacements = {"×": "x", "ï": "i", " ": " ", "u ¨": "ü", "a ¨": "ä", "o ¨": "ö"}
    text = replace_text(text, known_replacements)
    # rescue scientific notation (only works for grobid parsing)
    text = rescue_scientific_notation(text)
    # ------------------------------------------------------------
    # try to clean up unusual characters
    unusual_chars = []
    for char_index, char in enumerate(text):
        if char not in known_chars:
            # max() guards against a negative start index early in the text
            context = text[max(0, char_index - 20):char_index + 20]
            unusual_chars.append((char, ord(char), hex(ord(char)), context))
    # order by the ord() value
    unusual_chars.sort(key=lambda x: x[1])
    # keep at most 10 examples per char
    unusual_chars_filtered = []
    curr_char = ""
    i = 0
    for char, e1, e2, e3 in unusual_chars:
        if char != curr_char:
            curr_char = char
            i = 0
        if i < 10:
            unusual_chars_filtered.append((char, e1, e2, e3))
        i += 1
    queries = return_queries([str(unusual_chars_filtered)])
    responses = await get_structured_responses(
        queries, model="gpt-4o",
        response_format=TranslationDictionary,
        system_prompt=system_prompt_conversion, client=openai_client)
    response = responses[0]
    # create the replacement dictionary
    replacement_dict = dict(zip(response.original_char, response.new_char))
    # replace the text
    text = replace_text(text, replacement_dict)
    return text
```
This correctly fixes most of the worst offenders and the interferon names:
```python
replacement_dict = {
    'ϩ': '+', 'Ϫ': '-', 'ϫ': 'x', 'Ϯ': '+', 'Ͻ': '<', 'Ͼ': '>', 'Ј': "'",
    '؉': '+', '؊': '-', 'ء': 'U+2205', '\n': '\n', '†': '\n', '‡': '\n',
    '␣': 'α', '': 'β', '␥': 'γ'}
```
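Applying such a dictionary is then straightforward; a minimal helper (my own sketch, not code from the repo) would be:

```python
def replace_text(text, replacements):
    # apply each mapping in turn with plain str.replace
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

replacement_dict = {'ϩ': '+', 'Ͻ': '<', '␥': 'γ'}
print(replace_text("p Ͻ 0.05 in IFN-␥ϩ cells", replacement_dict))
# p < 0.05 in IFN-γ+ cells
```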
Are there plans to implement parsing post-processing?
@loesinghaus Hello! I'm here to help you with any issues or questions you have. Let's work together to solve the problem.
Currently, there is no specific mention of post-processing for handling scientific notation or non-ASCII characters in the paper-qa codebase. The existing implementation uses libraries like PyMuPDF and grobid for parsing, but there is no documented post-processing logic for correcting issues like those you've described [1].
Your suggestion to implement post-processing, such as using regular expressions to correct scientific notation or leveraging AI models to handle non-ASCII characters, could be a valuable addition. If you are considering contributing these improvements, integrating them into the existing parsing functions or creating a dedicated post-processing module would be a logical approach.
Hi @loesinghaus yes we can add a post-parsing callback, it sounds like a good idea. It will need to be async so we can call LLMs in it.
We have started a callback section here: https://github.com/Future-House/paper-qa/blob/v5.7.0/paperqa/settings.py#L509-L536
Feel free to open a PR for this
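For anyone picking this up, the general shape of such a hook might look roughly like the sketch below. All names here are hypothetical, not the actual paper-qa callback API; the settings link above is the real extension point:

```python
import asyncio
from typing import Awaitable, Callable

# hypothetical signature: an async callback applied to each parsed text chunk
PostParseCallback = Callable[[str], Awaitable[str]]

async def apply_post_parse_callbacks(text: str, callbacks: list[PostParseCallback]) -> str:
    # run the callbacks in order; each one may await, e.g. an LLM call
    for callback in callbacks:
        text = await callback(text)
    return text

async def fix_notation(text: str) -> str:
    # stand-in for a real rescue step like rescue_scientific_notation
    return text.replace("10 3", "10^3")

print(asyncio.run(apply_post_parse_callbacks("2 x 10 3", [fix_notation])))
# 2 x 10^3
```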