FreeLanguageTools/vocabsieve

KOReader Import not displaying PDFs, even after correcting metadata.

Closed this issue · 7 comments

Describe the bug
The VocabSieve reader wiki (https://wiki.freelanguagetools.org/vocabsieve_reader_import) suggests correcting metadata to make unseen books appear. However, the KOReader Vocab Builder import does not display PDFs even after correcting the metadata to the target language.

To Reproduce
Steps to reproduce the behavior:

  1. Put the relevant KOReader/book files on your PC.
  2. Go to the PDF's .sdr folder and open the metadata file.
  3. Update the language fields to the target language ('es' in my case); see the sketch after this list.
  4. Use VocabSieve to import the KOReader data.
  5. The book does not appear.
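
For reference, step 3 as a script would look roughly like the sketch below. This is only an illustration: it assumes the sidecar file is named metadata.pdf.lua (the usual KOReader naming for a PDF) and that it already contains a ["language"] = "..." entry; the path shown is hypothetical.

# Minimal sketch of step 3: patch the language value in the KOReader sidecar
# metadata file. Assumes the file is metadata.pdf.lua and that a
# ["language"] = "..." entry already exists; if it does not, it would have to
# be added by hand. The path below is a hypothetical example.
import re
from pathlib import Path

sdr_metadata = Path(r"C:\Books\mybook.sdr\metadata.pdf.lua")  # hypothetical path

text = sdr_metadata.read_text(encoding="utf-8")
patched = re.sub(r'\["language"\]\s*=\s*"[^"]*"', '["language"] = "es"', text)
sdr_metadata.write_text(patched, encoding="utf-8")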

Expected behavior
See the book on KOReader import.

Desktop (please complete the following information):

  • OS: Windows 10
  • Vocabsieve version (if nightly, must be latest): 0.10.1

Additional context
I could probably use SQL to change the source book in the vocabulary_builder SQLite file to an .epub file with the correct metadata, if there is no solution to this issue.

Also, I asked about changing PDF metadata on the KOReader GitHub, and if I am understanding correctly, they said that language info is absent from KOReader PDF documents: koreader/koreader#10729

Is there any way at all to import Vocab Builder words from PDFs?

PDFs just aren't considered when scanning the metadata precisely because of the issues listed (no language). Does vocab builder even include the context successfully? If so, I can probably consider allowing it.

It saves the context, but has some issues with keeping spaces around the target word. I was able to get words into VocabSieve by going into the vocabulary_builder.sqlite3 file and changing the source to a book with metadata.

If you want to reproduce it, I did the following.

  1. Open a PDF and add words to the vocab builder in KOReader.
  2. Download the KOReader settings folder to your PC.
  3. Open vocabulary_builder.sqlite3 with 'DB Browser for SQLite' or a similar app.
  4. Change the title_id value in the vocabulary table from the PDF to a book with metadata. The list of books can be found in the title table. (Do this with SQL; see the sketch after this list.)
  5. Add extra spaces to the end of prev_content and the start of next_content values. (Also with SQL.)
  6. Save the changes and import the words into VocabSieve.
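
Concretely, step 4 looks roughly like this. It is only a sketch: the table and column names (vocabulary, title_id, title) are the ones mentioned above, the two id values are hypothetical placeholders, and the exact schema may differ between KOReader versions, so check it in DB Browser first.

# Rough sketch of step 4: point the PDF's vocab entries at a book that has
# proper metadata. Table/column names are taken from this thread; verify them
# against your own vocabulary_builder.sqlite3 before running anything.
import sqlite3

conn = sqlite3.connect("vocabulary_builder.sqlite3")
cur = conn.cursor()

# List the books so you can pick the ids you need (the title table holds them).
for row in cur.execute("SELECT * FROM title"):
    print(row)

PDF_TITLE_ID = 3   # hypothetical: id of the PDF entry
EPUB_TITLE_ID = 1  # hypothetical: id of a book with correct metadata

cur.execute(
    "UPDATE vocabulary SET title_id = ? WHERE title_id = ?",
    (EPUB_TITLE_ID, PDF_TITLE_ID),
)

conn.commit()
conn.close()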

oops I don't use GitHub often, didn't mean to do that

KOReader now (since the recent nightly builds) allows editing book metadata.
Custom metadata fields (including language) are saved not to the book itself but to a file in the book's sdr folder.
We can provide you with the storage location/format details to be implemented into the Import module, if you wish.

Sure, it would be great to document some details if you have them.
Does vocab builder actually work properly, though? This is the main issue I have. PDF is not an ebook format, and the text layer often has unexpected issues, like weird spacing, word breakage, etc. Is vocab builder able to handle that?

KOReader saves (1) book settings, highlights, notes; (2) custom metadata; (3) custom cover image in the sdr folder.
The name of the sdr folder is the book filename without its extension, with the sdr extension added.
Depending on the settings, the sdr folder can be located in the book folder or in the koreader/docsettings folder.
The custom metadata file name is custom_metadata.lua.
For example, for a book /foo/bar/book.pdf, the custom metadata file can be:
(1) /foo/bar/book.sdr/custom_metadata.lua
(2) koreader/docsettings/foo/bar/book.sdr/custom_metadata.lua
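
So an importer has two candidate locations to check. As a rough illustration (not VocabSieve's actual code), something like this would enumerate them for a given book path, with a hypothetical docsettings root:

# Sketch: candidate locations of custom_metadata.lua for a given book,
# following the two layouts described above. The docsettings root passed in
# is a hypothetical example.
from pathlib import Path

def candidate_custom_metadata_paths(book_path, docsettings_root):
    book = Path(book_path)
    sdr_name = book.stem + ".sdr"  # book filename without extension + ".sdr"

    # (1) sdr folder sitting next to the book itself
    beside_book = book.parent / sdr_name / "custom_metadata.lua"

    # (2) sdr folder mirrored under koreader/docsettings
    mirrored = (
        Path(docsettings_root)
        / book.parent.relative_to(book.anchor)
        / sdr_name
        / "custom_metadata.lua"
    )
    return [beside_book, mirrored]

for p in candidate_custom_metadata_paths("/foo/bar/book.pdf", "/koreader/docsettings"):
    print(p)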

The content of custom_metadata.lua is:

-- we can read Lua syntax here!
return {
    ["custom_props"] = {
        ["authors"] = "Dariusz Terefenko",
        ["language"] = "en",
        ["series"] = "Jazz",
        ["series_index"] = 3,
    },
    ["doc_props"] = {
        ["authors"] = "Terefenko, Dariusz;",
        ["language"] = "En",
        ["pages"] = 824,
        ["title"] = "Jazz Theory",
    },
}

The custom_props fields contain the metadata information edited by the user.
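
If it helps, here is a minimal sketch of pulling the user-edited language out of that file. It deliberately avoids a real Lua parser and just assumes the flat ["key"] = "value" layout shown above, preferring custom_props over doc_props:

# Sketch: read the user-edited language from custom_metadata.lua without a
# Lua parser, assuming the simple layout shown above. Returns None if the
# file is missing or no language entry is found.
import re
from pathlib import Path

def custom_language(custom_metadata_path):
    path = Path(custom_metadata_path)
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8")
    # Prefer the custom_props block (user edits); fall back to any language entry.
    block = re.search(r'\["custom_props"\]\s*=\s*\{(.*?)\}', text, re.DOTALL)
    scope = block.group(1) if block else text
    match = re.search(r'\["language"\]\s*=\s*"([^"]*)"', scope)
    return match.group(1) if match else None

print(custom_language("/foo/bar/book.sdr/custom_metadata.lua"))  # e.g. "en"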

PDF is not an ebook format, and the text layer often has unexpected issues, like weird spacing, word breakage, etc. Is vocab builder able to handle that?

In my experience, the main issue that the vocab builder has with PDFs is not saving the spaces or punctuation around a vocab word. For example, an exported sentence may look like this (the target vocab word here is 'line'):

"He hooked a large fish on thelineIt was like nothing the crew had seen before."

To fix this, I append extra spaces to the end of all prev_content values and to the start of all next_content values using SQL before importing PDF vocab (a rough sketch follows). Then, when reviewing cards in Anki, if I think a card is missing punctuation, I just edit it.
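
The padding itself is just two updates; roughly like this (same caveat as before: the column names come from my database and may differ between KOReader versions):

# Sketch of the space-padding workaround, using the prev_content/next_content
# column names mentioned above. Run it once before importing into VocabSieve.
import sqlite3

conn = sqlite3.connect("vocabulary_builder.sqlite3")
cur = conn.cursor()

# Append a space to prev_content and prepend one to next_content so the
# target word is no longer glued to its context.
cur.execute("UPDATE vocabulary SET prev_content = prev_content || ' '")
cur.execute("UPDATE vocabulary SET next_content = ' ' || next_content")

conn.commit()
conn.close()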

Occasionally, there are random page numbers in the middle of words, or other minor text issues, but the punctuation and space issues are the only ones that have happened consistently for me.