inukshuk/latex-decode

Issues with `~x` and `\,`


I've run into some problems with the "whitespace" test scenario:

Scenarios: Whitespace
| latex | unicode | description |
| x\\,x | xx | small space |
| x~x | x x | non-breaking space |
| ~x |  x | non-breaking space |

Most significantly, the last case is failing for me with this message:

.................................................................................................................................................................................F....

(::) failed steps (::)


expected: "x"
     got: " x"

(compared using ==)
 (RSpec::Expectations::ExpectationNotMetError)
./features/step_definitions/latex.rb:6:in `/^the result should be ('|")(.*)$/'
features/symbols.feature:20:7:in `the result should be 'x''

Failing Scenarios:
cucumber features/symbols.feature:20 # Scenario Outline: LaTeX to Unicode transformation

89 scenarios (1 failed, 88 passed)
182 steps (1 failed, 181 passed)
0m0.037s

It seems like Cucumber/Gherkin is dropping the leading U+00A0 No-Break Space from the expected output. I'm using Cucumber 4.1.0, which may have changed its behavior here, but I can't say for sure because I've never worked with Cucumber before. I've put the full build log at r97lyi9l6b0y974dxv16z6c02a0c9b-ruby-latex-decode-0.3.2-1.08cc2d4.drv.gz; use something like less -R to read it properly.

I can get the tests passing for the current behavior by replacing the scenario with:

  Scenario: Whitespace (small space)
    When I decode the string 'x\,x'
    Then the result should be 'x x'

  Scenario: Whitespace (non-breaking space)
    When I decode the string 'x~x'
    Then the result should be 'x x'

  Scenario: Whitespace (leading non-breaking space)
    When I decode the string '~x'
    Then the result should be ' x'

However, while making that change, I discovered another issue: \, is currently being converted to U+2009 Thin Space, but in LaTeX \, produces a non-breaking space, so I think it ought to be converted instead to U+202F Narrow No-Break Space. (For example, https://en.wikipedia.org/wiki/Non-breaking_space#Encodings gives \, as the TeX encoding of U+202F.) (Pedantically, since \, is a kern, I think it would be even better to have a non-breaking variant of U+2006 Six-Per-Em Space, but I don't know a way to achieve that.)

More broadly, I personally found it quite confusing to have these Unicode spaces appear literally in the examples. While the details vary, I found them hard to identify in editors and terminal emulators: at best I got some sort of highlighting to hint that an unusual space was present, but often they were visually indistinguishable from ordinary spaces. I haven't figured out enough about Cucumber to avoid that without more disruptive changes, though.
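
For inspecting such strings by hand, one option is to print everything outside printable ASCII as an explicit code point. This is only a debugging sketch; the `reveal` helper is hypothetical and not part of latex-decode:

```ruby
# Hypothetical debugging helper (not part of latex-decode): replace every
# character other than visible ASCII (U+0021..U+007E) with a <U+XXXX> marker,
# so that ordinary and unusual spaces can be told apart at a glance.
def reveal(string)
  string.each_char.map { |char|
    char.match?(/[!-~]/) ? char : format('<U+%04X>', char.ord)
  }.join
end

puts reveal("x\u00A0x")  # no-break space between the two x's
puts reveal("x\u2009x")  # thin space between the two x's
```

Plain ASCII spaces are marked as well (as `<U+0020>`), which is deliberate here: the point is that no kind of space stays invisible.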

Hi! It's very likely that I just got the \, conversion wrong. U+202F sounds like a better choice. I'm a few years removed from last hacking on this and your assessment seems much better than mine.

Also agree that putting the whitespace characters literally in the examples makes debugging hard (especially if Cucumber starts trimming away whitespace!). Adding extra characters around them makes sense as a quick improvement. To make the tests even easier to work with, we'd have to add dedicated step definitions for whitespace characters. For example, something like:

   When I decode the string '\,'
   Then the result should be a non-breaking space

Alternatively, we could add something more generic:

   When I decode the string '\,'
   Then the result should be U+202F

And we'd write the step definition using a regular expression for the actual code, so you could test for any Unicode character explicitly.
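
A minimal sketch of what such a generic step definition could look like. The helper name `codepoints_to_string`, the `@result` instance variable, and the exact step pattern are assumptions for illustration; the existing steps in `features/step_definitions/latex.rb` may store the decoded string under a different name.

```ruby
# Hypothetical helper: turn a spec such as "U+202F" or "U+0078 U+00A0 U+0078"
# into the UTF-8 string made of those code points.
def codepoints_to_string(spec)
  spec.scan(/U\+(\h{4,6})/).flatten.map { |hex| hex.to_i(16) }.pack('U*')
end

# Register the step only when loaded by Cucumber, so the helper above can
# also be tried out in plain Ruby.
if defined?(Cucumber)
  # Matches e.g.:  Then the result should be U+202F
  #           or:  Then the result should be U+0078 U+00A0 U+0078
  Then(/^the result should be ((?:U\+\h{4,6} ?)+)$/) do |spec|
    # ASSUMPTION: @result holds the decoded string set by the "When" step.
    expect(@result).to eq(codepoints_to_string(spec))
  end
end
```

Because the pattern requires the `U+` prefix rather than a quote character, it should not collide with the existing quoted-string step, and it would let a scenario spell out a leading or trailing space without any invisible character in the feature file.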