FeatureRequest/HelpNeeded: highlight is not an exact subset of the text content
thiswillbeyourgithub opened this issue · 4 comments
Hi,
I'm the dev behind LogseqMarkdownParser and am working on a small script to directly turn highlights into anki flashcards.
It's not yet working because I'm running into an issue with text formats.
You see, I don't just want to send the highlight to Anki: I want to grab the 1000-ish characters before and after the highlight, make a cloze card from them (i.e. put a hole in the text where the highlight is, so you have to guess the content), and then send that to Anki.
The main issue is that, for example, I have this highlight:
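For illustration, here is a minimal sketch of that card-building step, assuming the character offsets of the highlight inside the article text are already known. The function name and its parameters are hypothetical; `{{c1::...}}` is Anki's cloze deletion syntax.

```python
def make_cloze(text: str, start: int, end: int, context: int = 1000) -> str:
    """Build an Anki cloze string around text[start:end].

    Hypothetical helper: `start` and `end` are assumed to be the character
    offsets of the highlight inside the full article text.
    """
    # Keep at most `context` characters on each side of the highlight.
    before = text[max(0, start - context):start]
    hidden = text[start:end]
    after = text[end:end + context]
    # Wrap the highlight in Anki's cloze deletion markers.
    return before + "{{c1::" + hidden + "}}" + after
```

Note that this only works if the highlight can be located in the article text in the first place, which is exactly where things go wrong.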
For example, suppose ΔW is the weight update for a weight matrix W∈RA×B.
And the relevant section of text is this:
For example, suppose \\(\\Delta W\\) is the weight update for a weight ' 'matrix \\(W \\in \\mathbb{R}^{A \\times B}\\).
I'm guessing this is mathjax.
I can't seem to find a good python lib to parse mathjax into text, or text into mathjax, let alone reliably.
So is it possible to:
- Either add {{{rawText}}} for the highlight, which would not be parsed (so it would still contain the MathJax)
- Or parse the content of the article just like the highlight (currently only the highlight is parsed to text)
- Also, the highlight positions seem to be broken: they are all equal to 0 on my end. Is this normal?
Thanks!
Hi! Just a quick bump, as I would really like to wrap up my project while I have some free time :) But if you can't find the time to take a look, it's totally fine of course!
Hi, I think what you are seeing in the highlight text is raw text, or at least markdown. Can you post a screenshot of the highlight itself?
Here's the highlighted section of the text:
The article link is that one: https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html
Hi,
I decided to go the "most robust way" anyway and implement a function that finds the substring of a corpus that best matches the highlight. This is computationally intensive and will probably be an issue for very long texts, but at least I can move on towards finishing this.
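For anyone landing here with the same problem, a rough sketch of what such a best-substring search could look like, using a sliding window and `difflib.SequenceMatcher` from the Python standard library. This is an illustration under assumptions, not the code actually used in the project: the function name and the `step` and `slack` parameters are hypothetical, and the window is deliberately longer than the highlight because the raw article text carries extra MathJax markup.

```python
from difflib import SequenceMatcher


def find_best_substring(corpus: str, highlight: str,
                        step: int = 5, slack: float = 2.0):
    """Return (start, end, score) of the corpus window closest to highlight.

    Hypothetical helper. `slack` scales the window length relative to the
    highlight, `step` trades accuracy for speed.
    """
    window = min(len(corpus), int(len(highlight) * slack))
    # SequenceMatcher caches data about seq2, so set the highlight once
    # and only swap seq1 for each candidate window.
    matcher = SequenceMatcher(autojunk=False)
    matcher.set_seq2(highlight)
    best = (0, window, 0.0)
    for start in range(0, max(1, len(corpus) - window + 1), step):
        matcher.set_seq1(corpus[start:start + window])
        score = matcher.ratio()
        if score > best[2]:
            best = (start, start + window, score)
    return best


# Example with the highlight from this issue, padded with filler text.
corpus = (
    "Lorem ipsum dolor sit amet. " * 5
    + r"For example, suppose \(\Delta W\) is the weight update for a "
    + r"weight matrix \(W \in \mathbb{R}^{A \times B}\)."
    + " Consectetur adipiscing elit. " * 5
)
highlight = ("For example, suppose ΔW is the weight update "
             "for a weight matrix W∈RA×B.")
start, end, score = find_best_substring(corpus, highlight)
print(corpus[start:end], score)
```

The cost grows with corpus length times the per-window comparison cost, which is why long articles get slow; a larger `step` or a coarse first pass followed by a fine one would reduce that, at the price of precision.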
When I finish this project, if I think it's worth it I'll come back to you to see if that's worth a mention in a blog post or whatever :)
In the meantime, although I still think my request is legit and someone might have a real need for more precise filter access in the API, I'll let you decide if you want to close this or not :)
Have a nice day!