rufuspollock-okfn/bibserver

bibtex.py parser accents problems

vramiro opened this issue · 6 comments

While displaying the JSON generated by the bibtex.py parser, I got all the accents wrong (shifted by one position). For instance: `Eric instead of Èric (which corresponds to changing \u0301Eric to E\u0301ric)
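A minimal sketch of why the position matters, assuming Python 3 and the standard `unicodedata` module (the variable names are illustrative only). U+0301 is the combining acute accent: it modifies the character that precedes it, so it only composes with `E` when it comes after it.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"

# Accent emitted before the letter it is meant to modify (the reported output).
wrong = COMBINING_ACUTE + "Eric"
# Accent emitted after the base letter (the manually corrected output).
right = "E" + COMBINING_ACUTE + "ric"

# NFC composes a base letter followed by a combining mark into one code point.
composed_right = unicodedata.normalize("NFC", right)
# A leading combining mark has no base letter to attach to, so nothing changes.
composed_wrong = unicodedata.normalize("NFC", wrong)

print(repr(composed_right))
print(repr(composed_wrong))
```

With the second ordering the string normalizes to a single precomposed letter (U+00C9) followed by `ric`; with the first ordering the accent is left dangling in front of the word, which matches what the browsers display.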

I made a simple patch, but I'm not sure it will work in all cases.

In the string_subst(self, val) function change:

                    if key+1 < len(parts) and len(parts[key+1]) > 0:
                        parts[key+1] = parts[key+1][0:]

to

                    if key+1 < len(parts) and len(parts[key+1]) > 0:
                        ### Change order to display accents
                        parts[key] = parts[key] + parts[key+1][0]
                        parts[key+1] = parts[key+1][1:]

Am I missing something with my solution?

Hi,

I also use this library in one of my projects and I agree that this is a bug. However, I do not agree with the solution.

Your workaround is equivalent to replacing, in your BibTeX, e by e. And this is not correct.

I investigated the issue a bit, though not in depth. My understanding is the following:
The value can contain e or {e}. In both cases, by the time it reaches the section you mention, it has become `e; the {} are stripped.
Then, from the dictionary (unicode to latex), it selects the \' value, and this leads to a wrong accent position.

I guess there are at least two problems:

  • {} are stripped for special characters, and they should not be. One needs to find where they are removed [1] and how to fix that. This will fix the bug if the BibTeX is encoded with braces.
  • Accents might not have braces around them. IMHO, this case should not be handled by the \' value (for instance). I don't know whether it's better to add {} around the next character or to fix it in another way.

[1] As far as I understand, it does not come from the strip_braces calls in add_val. It probably happens earlier, in parse_record(), in the block starting with the comment # for each line in record

OK, I think I got it. It was easier than I thought.

Here is my diff.
Basically, I think the for loop is useless.
The original issue is caused by wrong indentation: after the first item in the dict was processed, the braces were removed from the string.

In my previous post, I was wrong. The first point I mentioned seems to be already supported.

diff --git a/parserscrapers_plugins/bibtex.py b/parserscrapers_plugins/bibtex.py
index cfea621..aa9d669 100755
--- a/parserscrapers_plugins/bibtex.py
+++ b/parserscrapers_plugins/bibtex.py
@@ -244,11 +244,8 @@ class BibTexParser(object):
             for k, v in self.unicode_to_latex.iteritems():
                 if v in val:
                     parts = val.split(str(v))
-                    for key,val in enumerate(parts):
-                        if key+1 < len(parts) and len(parts[key+1]) > 0:
-                            parts[key+1] = parts[key+1][0:]
                     val = k.join(parts)
-                val = val.replace("{","").replace("}","")
+            val = val.replace("{","").replace("}","")
         return val

     def add_val(self, val):

Let me know if everything is fine on your side.
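To make the indentation point concrete, here is a small sketch in Python 3. It assumes a hypothetical two-entry table (`TABLE`) whose LaTeX forms are braced; the real unicode_to_latex dict is much larger and its exact entries are not reproduced here. The function names are illustrative too.

```python
# Hypothetical (unicode, latex) pairs; only meant to illustrate the control flow.
TABLE = [("\u00e0", "{\\`a}"), ("\u00c9", "{\\'E}")]

def subst_buggy(val):
    for uni, latex in TABLE:
        if latex in val:
            val = uni.join(val.split(latex))
        # Braces are stripped on every iteration, so after the first entry
        # no later braced entry can match any more.
        val = val.replace("{", "").replace("}", "")
    return val

def subst_fixed(val):
    for uni, latex in TABLE:
        if latex in val:
            val = uni.join(val.split(latex))
    # Braces are stripped once, after all substitutions have been tried.
    val = val.replace("{", "").replace("}", "")
    return val

print(repr(subst_buggy("{\\'E}ric")))
print(repr(subst_fixed("{\\'E}ric")))
```

In the first version the braces around the accent are gone before the matching entry is reached, so the LaTeX command is left untouched; in the second version the entry still matches.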

No, it does not work for me. I keep getting `Eric instead of Èric.
I'm also getting & instead of &

Not sure if I made this clear, but for me the problem is with the browser display of the unicode json produced.

I see. Try {E} instead of E. I thought point number 1 was supported (based on some of my tests), but maybe it is not.

Thanks for the answer again!

I think I did not make myself clear, so here are all the cases:

  1. In my BibTeX I have normalized entries (with BibtexTools [1]). All my accents follow the form {\'E}ric
  2. The JSON produced by the parser translates this to \u0301Eric, which is displayed in Chrome/Safari/Firefox as `Eric (instead of Èric)
  3. To actually get Èric displayed, I changed \u0301Eric to E\u0301ric (so I'm not sure whether it's a parser issue or a display issue)

In LaTeX, {\'E}ric and \'{E}ric give the same output; I think the parser here does not treat them the same way.
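One possible way to treat the spellings alike, sketched in Python 3 with only the standard library. This is not what bibtex.py does: the `ACCENTS` table, the regular expression and the function name are all assumptions made for illustration, and only a handful of accent commands on plain ASCII letters are covered.

```python
import re
import unicodedata

# Hypothetical subset of LaTeX accent commands mapped to combining marks.
ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

ACCENT_CHARS = "".join(re.escape(c) for c in ACCENTS)

# Three spellings of the same thing: {\'E}, \'{E} and \'E.
ACCENT_RE = re.compile(
    r"\{\\([%s])([A-Za-z])\}|\\([%s])\{([A-Za-z])\}|\\([%s])([A-Za-z])"
    % (ACCENT_CHARS, ACCENT_CHARS, ACCENT_CHARS)
)

def latex_accents_to_unicode(text):
    def repl(match):
        # Exactly one alternative matched, so two groups are not None.
        accent, letter = [g for g in match.groups() if g is not None]
        # Base letter first, combining mark second, then compose.
        return unicodedata.normalize("NFC", letter + ACCENTS[accent])
    return ACCENT_RE.sub(repl, text)

for sample in ("{\\'E}ric", "\\'{E}ric", "\\'Eric"):
    print(repr(latex_accents_to_unicode(sample)))
```

The key design choice is that the combining mark is always appended after the letter before normalizing, so the accent can never end up in front of the character it belongs to.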

[1] http://strategoxt.org/Stratego/BibtexTools

Thanks for the details.

I did a quick search on the internet about the best encoding for accents. I
found something interesting here:
http://tex.stackexchange.com/questions/57743/how-to-write-a-and-other-umlauts-and-accented-letters-in-bibliography/57745#57745
It says that {\'E}ric is better than \'{E}ric.
Of course, the parser should handle all cases, and I agree, it does not.

I do not belong to this project, so this is only my own opinion.
In addition to the previous patch, I would add a new dict similar to unicode_to_latex, meant to contain translations for accents written like \'{E}.
Then, update unicode_to_latex with the correct accents (like {\'E}).
string_subst should iterate over the first dict and then the second one.

Regarding only bibtex.py, a single dict would be enough, because we iterate over values, not keys. But since it's a library, it can be used elsewhere in the other direction (this is the case in my project, for instance). A dict does not guarantee ordering.
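A rough sketch of the two-table idea in Python 3, using lists of pairs so that the iteration order is explicit rather than left to the dict. The names `PREFERRED`, `ALTERNATE`, `make_pairs` and `substitute_accents` are hypothetical, and only the acute accent on a few vowels is covered.

```python
import unicodedata

def make_pairs(template, accent_cmd, combining, letters):
    """Return [(latex_form, unicode_char), ...] for one accent and one spelling."""
    return [
        (template % (accent_cmd, letter),
         unicodedata.normalize("NFC", letter + combining))
        for letter in letters
    ]

LETTERS = "AEIOUaeiou"

# Preferred spelling, e.g. {\'E}
PREFERRED = make_pairs("{\\%s%s}", "'", "\u0301", LETTERS)
# Alternate spellings, e.g. \'{E} and \'E
ALTERNATE = (make_pairs("\\%s{%s}", "'", "\u0301", LETTERS)
             + make_pairs("\\%s%s", "'", "\u0301", LETTERS))

def substitute_accents(val):
    # Lists keep their order, so PREFERRED is always tried before ALTERNATE.
    for latex, uni in PREFERRED + ALTERNATE:
        val = val.replace(latex, uni)
    # Strip any remaining braces once, after all substitutions.
    return val.replace("{", "").replace("}", "")

print(repr(substitute_accents("{\\'E}ric and \\'{E}ric")))
```

Code that needs the reverse direction (Unicode to LaTeX) could read only `PREFERRED`, so it would always emit the recommended spelling.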

What do you think about this suggestion?