bjodah/chempy

Update README for improved chemical formula parser

bertiewooster opened this issue · 8 comments

I am working on updating README.rst for the improved formula parsing #205. A few questions regarding the updated parsing which no longer accepts malformed chemical formulas such as "Ch4"--ChemPy will now raise a ParseError, rather than simply stopping at the last valid element (that formula was previously parsed to C aka carbon):

  • Should we call this a breaking change? I'm thinking not, because it doesn't break any valid chemical formulas (that I'm aware of). Maybe just call it an improved parser...
  • Should we note the version as of which this change was made? If so, what do we plan to name the next version?
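
To illustrate the behavior change described above, here is a minimal toy sketch. It is not ChemPy's parser: the `SYMBOLS` subset, the `parse` function and this `ParseError` class are hypothetical stand-ins, and the lenient mode only mimics the old "stop at the last valid element" behavior.

```python
import re

# Illustrative subset of element symbols (an assumption; a real parser knows them all).
SYMBOLS = ["Ca", "Fe", "Mg", "C", "H", "N", "O"]  # two-letter symbols first
TOKEN = re.compile("(" + "|".join(SYMBOLS) + r")(\d*)")


class ParseError(ValueError):
    """Hypothetical stand-in for the error raised on a malformed formula."""


def parse(formula, strict=True):
    """Return {symbol: count}; raise in strict mode, stop early in lenient mode."""
    counts, pos = {}, 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            if strict:
                raise ParseError(f"cannot parse {formula!r} at position {pos}")
            break  # lenient: silently keep whatever parsed so far
        symbol, digits = match.groups()
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
        pos = match.end()
    return counts


print(parse("CH4"))                # {'C': 1, 'H': 4}
print(parse("Ch4", strict=False))  # {'C': 1}, the old silent truncation
```

In strict mode, `parse("Ch4")` raises instead of returning carbon alone, so valid formulas are unaffected and only malformed input changes behavior.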

Also, is it all right if I append to .gitignore

.vscode/

so that Visual Studio Code configuration files will be ignored?

I'm assuming there is no doctest for README.rst? So I'm manually testing docstrings by running them in a temporary doc_testing.py file, having forked from bjodah/chempy after @jeremyagray merged the parsing improvements.

When I try to run an example from chempy/util/tests/test_parsing.py:

from chempy import Substance
Substance.from_formula("Ca2.832Fe0.6285Mg5.395(CO3)6").composition

I get a KeyError:

Exception has occurred: KeyError
'.'
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 650, in <genexpr>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 650, in <lambda>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 545, in <lambda>
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 545, in _formula_to_format
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "/Users/jemonat/Projects/chempy/chempy/util/parsing.py", line 649, in formula_to_unicode
    return _formula_to_format(
  File "/Users/jemonat/Projects/chempy/chempy/chemistry.py", line 190, in from_formula
    unicode_name=formula_to_unicode(formula),
  File "/Users/jemonat/Projects/chempy/doc_testing.py", line 3, in <module>
    Substance.from_formula("Ca2.832Fe0.6285Mg5.395(CO3)6").composition

That test case got left out of the unicode and HTML sections, although it is included in the composition and latex sections. The good news is that it works for HTML, and it should be similarly fixable for unicode.
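
As a rough sketch of why HTML has no trouble here: a `<sub>` element can wrap any text, including a decimal point. The helper name `stoichiometry_to_html` is hypothetical and this is not ChemPy's implementation; it reuses the number regex visible in the traceback and ignores charges and other notation.

```python
import re


def stoichiometry_to_html(formula):
    """Wrap each integer or decimal count in <sub>...</sub>."""
    return re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", r"<sub>\1</sub>", formula)


print(stoichiometry_to_html("Ca2.832(CO3)6"))
# Ca<sub>2.832</sub>(CO<sub>3</sub>)<sub>6</sub>
```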

The bad news is that it appears impossible to represent in unicode: the standard apparently does not include subscript and superscript punctuation such as . and , even though they are needed to represent subscript and superscript decimals. It's not like two more characters would break unicode. The other solutions I've seen suggested for this problem are to hijack some other diacritical symbol that's good enough, or to use a space, but I dislike both because they're hacky, nonstandard, and possibly difficult to read.

I suppose a solution could be to fail early with a "unicode is broken" exception or use a regular decimal point until a better course of action presents itself. Suggestions welcome; I'll push something to patch it.

EDIT: I have a working "use a regular decimal point" fix; any better solution can be quickly dropped in its place.

@bertiewooster sure, just add .vscode to the .gitignore file, no worries there!

I'm fine with hijacking a unicode character that looks "approximately like a subscript point".

Thanks for the guidance, @bjodah. Will do; I'll add .vs to the .gitignore file too, just in case any contributor is using Visual Studio Professional.

Hopefully @jeremyagray also has the guidance needed to help push this across the finish line for a release!

Hi, just following up to check whether @jeremyagray can address the last remaining coding issue (that I'm aware of) before a release by him or @bjodah: hijacking a unicode character that looks "approximately like a subscript point". I can then incorporate that code change into my forked branch and finish updating the README. Thanks!

Checking in on this - my code is currently broken because apparently v0.8.3 still cannot accept non-integer stoichiometry?

You state: "The good news is that it works for HTML and it should be similarly fixable for unicode"

How can I force HTML or LaTeX so that I don't get the KeyError with unicode?

I think this branch addresses your problem, with the above caveats about the lack of a proper Unicode symbol. There's some more related discussion in #223.

Thanks @jeremyagray. I ended up reverting to a much older version of chempy to solve my immediate problems. Looking at the notes in #223, I kind of like the "" identifier for crystal water, but ".." is also fine IMO. It's easy enough to replace the ".." or "" symbols for plotting and reporting.