Strings starting with + are parsed as a list

Question

Strings starting with + are parsed as a list

amorison opened this issue 5 years ago · 24 comments

I have a file formatted this way

&iostuff
output_folder = +out
/

When parsed, ['iostuff']['output_folder'] equals ['+', 'out']. I guess this is done so that if the variable out is defined somewhere else, this list is replaced by the corresponding numerical value. But if it isn't defined, the string '+out' should be returned instead. Maybe a simple solution would be having a way to tell f90nml that some variables are strings and should be considered as such.

Answer 1 · 2019-10-27T11:48:51.000Z

I think it's just a straight-up bug. The tokenizer, which was originally written for Fortran parsing, is splitting +out into + and out, but it may well be that + has no special significance in namelists.

Either one would need to double back and combine these tokens, or just treat + as equivalent to an alphanumeric character.

You can always replace +out with '+out', but I'm guessing that you're dealing with a legacy namelist and may have too many instances of this to replace.

Answer 2 · 2019-10-27T11:52:31.000Z

I'm guessing that you're dealing with a legacy namelist and may have too many instances of this to replace.

Yes... It's actually an automatically generated namelist, and for some reason various fortran compilers do not print namelists in the same format...

If you have a raw idea of where I should look at in the code, I could try and write a PR.

Answer 3 · 2019-10-27T12:10:25.000Z

It looks like your originial guess was closer to the problem. We do need to parse + and - as tokens, to parse numbers and +/-Inf or NaN, but we also need to add a fallback to reconstitute the value as a string if it's not a number.

It should not be hard to fix this. I will do it sometime later today, but otherwise you're free to look in tokenizer.py and have a go if you can make sense of it.

Answer 4 · 2019-10-27T19:09:34.000Z

This patch will fix your issue:

--- a/f90nml/tokenizer.py
+++ b/f90nml/tokenizer.py
@@ -72,6 +72,11 @@ class Tokenizer(object):
                     self.characters = lookahead
                     self.prior_char = ieee_val[-1]
                     self.char = next(lookahead, '\n')
+                elif ieee_val.isalpha():
+                    # TODO: Parse a non-delimited string
+                    while not self.char.isspace():
+                        word += self.char
+                        self.update_chars()
                 else:
                     word = self.parse_numeric()

but it's assuming that non-delimited strings are effectively delimited by spaces, which is probably not correct although it does pass all of my current tests.

I'll consult the language standard and test a few different compilers to see how to handle this, but the very least we can get this case working without breaking anything else.

Answer 5 · 2019-10-27T19:22:23.000Z

Thanks a lot! Just tested it on my own file, I confirm it does fix the issue in my case. Do you plan on making a new release with that fix? Just to know if it's worth making a raw fix in my own app to get it working in the meantime.

Answer 6 · 2019-10-27T19:28:01.000Z

I'll get this fixed one way or another, though it may take a day or two to figure out how its supposed to truly work. (I can dream up one or two cases which I think will break this change).

But it's certainly a bug, and I'll get something pushed soon which fixes it. If you can wait a few days, then there shouldn't be any need to work around it.

Answer 7 · 2019-10-27T19:31:02.000Z

Perfect. I can work with hand editing of files in the meantime.

Thanks for the quick reaction!

Answer 8 · 2019-10-27T20:10:11.000Z

I looked into this some more, and it seems like non-delimited namelists may be forbidden now. From Note 10.36 of the 2008 standard:

A character sequence corresponding to a namelist input item of character type shall be delimited either withapostrophes or with quotes. The delimiter is required to avoid ambiguity between undelimited charactersequences and object name.

I also tried the following test program in gfortran 9.2 and got an error:

program strnml
    implicit none
    character(len=5) :: s
    namelist /test/ s

    open(5, file="test.nml")
    read(5, nml=test)

    print *, s
end program strnml

Input namelist test.nml:

&test
    s = abc
/

Error:

At line 7 of file strnml.f90 (unit = 5, file = 'test.nml')
Fortran runtime error: End of file

I don't mind supporting this case, particularly if there's a compiler which interprets it, but we may want to raise a warning if it comes up. Non-delimited strings could be considered a namelist error, and might be worth resolving them on the user end.

Answer 9 · 2019-10-27T22:39:31.000Z

I just tested your example as well. Strangely, I don't have the same error as you with gfortran 9.2. I have

At line 7 of file prgm.f90 (unit = 5, file = 'test.nml')
Fortran runtime error: Cannot match namelist object name abc

However, with ifort 17 (which is the compiler with have on our clusters, dating from 2017) I don't have any error and get the expected abc printed out. This makes sense since the ill-formed namelists were produced with a program compiled with ifort. So at least its behavior is consistent...

IMHO, I don't think it is f90nml's role to ensure files comply to the 2008 standard, especially when a compiler as common as ifort itself doesn't implement it properly.

Answer 10 · 2019-10-28T18:37:54.000Z

I can confirm that the example works for Intel Fortran, which is reason enough to support the case.

It also seems that spaces are acting as delimiters here, since the following namelist doesn't work (after increasing the length of s):

&test
    s = +abc +def
/

forrtl: severe (18): too many values for NAMELIST variable, unit 5, file [$my_path]/test.nml, line 2, position 17

I'll implement the space-delimiter version and hope that it catches these cases. I may add a flag to enable strict enforcement, but if so then I'll keep it disabled on default.

Answer 11 · 2019-10-30T15:30:25.000Z

I've sound some additional case which are currently broken, such as the following:

&test x = +1.345 /
&test x = +.345 /
&test s = .abc.def. /

and most of these issues are coming from the same place, which is that we're assuming non-delimited strings, and then happen to get lucky in most cases which don't involve tokens associated with other types.

If there's no rush here, then I'd like to set aside some extra time to sort all this out before pushing a quick fix to your particular case.

Answer 12 · 2019-10-30T15:40:39.000Z

I haven't had a chance to look into the parser but wouldn't merely checking whether float(<str>.lower().replace('d', 'e')) raises a ValueError cover all these cases?

Answer 13 · 2019-10-30T15:49:37.000Z

I think the main issue is that the string would never be assembled for testing, since the tokenizer would split the value into two separate strings.

There are other issues, for example +123+002 is a valid float in the namelist, although case like this are already resolved in the parser.

The first two examples here are actually simple one-line fixes, I only added them because they're very similar to your case. But any case where we lead with something like +, -, . or anything that would normally indicate something other than a string, is where we start to have problems.

I don't think it's too difficult to resolve this, but the tokenizer will need a small bit of restructuring.

Answer 14 · 2019-10-30T15:52:35.000Z

Oh OK got it.

In any case, sure, it's better that you take the time to get things right rather than merging a quick fix that you know to be flawed. Thanks for keeping me updated!

Answer 15 · 2019-10-31T15:36:16.000Z

The latest update appears to have fixed this case, and I've also added an example to the test suite.

The simple explanation is that I have removed a lot of the complexity in the tokenizer, which was still using much of the hard rules from its original use as a Fortran tokenizer, and we now glob together any characters that might appear in values. This will still fail on more exotic characters, but I think this is an improvement, and those cases should be easier to fix.

At some point, I'd like to add a warning about non-delimited strings, but that can wait for another time.

If you're able, could you try it out?

Answer 16 · 2019-10-31T17:22:47.000Z

The latest update works on this case

&iostuff
    output_folder = +out
/

However, there is another case on which it doesn't work

&iostuff
    output_folder = out/sub
/

f90nml cuts everything after the /, giving ['iostuff']['output_folder'] == 'out'. I suspect this is a similar issue. I hadn't noticed this case before...

Answer 17 · 2019-10-31T17:25:46.000Z

Can you check if that works in any Fortran compiler? The / token is usually a very hard stop in most namelist parsers. If I test something similar in Intel 16.0.3, then it assigns out to the value.

Answer 18 · 2019-10-31T17:43:48.000Z

Indeed, you're right, my apologies.

There is hence a bug in ifort, because this code

program strnml
    implicit none
    character(len=6) :: s
    namelist /test/ s

    s = 'op/out'
    open(5, file="test.nml", status='new')
        write(5,test)
    close(5)

    open(5, file="test.nml")
        read(5,test)
    close(5)

    print *, s
end program strnml

outputs op instead of the expected op/out and produces an ill-formed file such as the one in my previous comment.

I guess this isn't f90nml's role to handle such a corner case, then.

EDIT for reference: this bug can be worked around by opening the file with

open(5, file="test.nml", status='new', delim='quote')

Answer 19 · 2019-10-31T20:38:46.000Z

I see what you mean, Intel is outputting a non-delimited string, which is probably where these kind of namelists came from in the first place.

I could, for example, add a flag to force tokens to end on spaces and permit slashes inside of such tokens, but there's pretty obvious issues with that. Something like s = 'op / out' is probably just flat out uninterpretable.

Personally I'd rather not accommodate Intel on this, and I hope that maybe this can be reported to them and they can change it on their end. But I'm not entirely opposed to supporting these cases if there's a safe way to do so. Intel should not do this. But they do, and we have to deal with it :).

What's your feeling on this?

Answer 20 · 2019-11-01T15:08:15.000Z

I think it wouldn't be a good idea to support a case that is broken anyway. Supporting non-delimited strings as much as Intel but not more seems like the reasonable thing to do to me, so that f90nml's result is consistent with what you would get if you were to feed the namelist to a Fortran program.

I opened a thread on the Intel Fortran Compiler forum (not visible yet because it has to be validated by a moderator). I'll keep you informed if there are news on this side.

Answer 21 · 2019-11-01T15:10:34.000Z

Alright, then I will close this and do an update soon. Thanks for letting me know about the problem, and I'll keep a look for the issue on the forums.

Answer 22 · 2019-11-01T15:30:44.000Z

Thanks for fixing the original issue quickly! It's much appreciated.

Here is the thread:
https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/831685

Answer 23 · 2019-11-04T15:20:53.000Z

Thanks for submitting that to the forums, I think it was very instructive.

I've come to two conclusions:

Intel is doing the correct thing, but the correct thing is somewhat contradictory, namely that the default namelist output is something that is almost, but not quite, a namelist.
Anything which deviates from the standard need not conform to any rules, and can be considered UB; it may work, or it may not.

I've opened an issue on the new j3-fortran/fortran_proposals repo to explore changes to the Fortran standard.

j3-fortran/fortran_proposals#67

As for f90nml, we've probably dealt with this as best we can. At the least, I can now say with confidence that a namelist with non-delimited strings is not a namelist :).

Answer 24 · 2019-11-04T17:21:06.000Z

Thanks for the summary. That does seem like a problem in the standard itself... I agree that indeed there is nothing more to do on f90nml's side.