fsprojects/FSharpx.Extras

Lack of consistency + lack of awareness re. cultural processing of strings

Bellarmine-Head opened this issue · 7 comments

Looking at the code base as a whole, there seems to be a strong lack of consistency when it comes to dealing with cultural variation wrt the processing of strings. E.g. some people use s.ToUpper(), while others, perhaps knowing better, use s.ToUpperInvariant().

Some people use #if FX_NO_CULTURE_INFO_ARGS to determine whether to specify culture info and/or StringComparison args to standard .NET methods, while most others don't.

Also, there seems to be a complete lack of awareness of the part of many contributors about the whole issue of a) cultural vs non-cultural conversion of strings, and b) current culture vs invariant culture vs ordinal (with/without ignore case) comparison of strings.

As it stands, FSharpx is a library I would be fearful to use. Looking at the code, I know that I cannot predict the outcome of many operations when run on other-culture (e.g. non-English) machines.

It has been drummed into me over the last 10+ years the critical importance of always stating your cultural / ordinal intent when it comes to string conversion and comparison.

I would suggest a cleanup throughout the code, along these lines...

  1. For string conversions, always state your cultural (or non-cultural) intent. E.g. if you wish to convert a string to upper case, either use s.ToUpperInvariant() or the overload of ToUpper that takes a CultureInfo arg. Never use the s.ToUpper() overload.
  2. Avoid methods like s.Contains() which only work in a culture-sensitive and case-sensitive manner (use s.IndexOf instead).
  3. For string comparisons (e.g. s.StartsWith, s.EndsWith, s.Equals), always use an overload of the method that takes a StringComparison enum value, and carefully consider in each case whether your comparison should use the current culture, the invariant culture (regionless English) or - by far the most common choice in data processing - ordinal. Also give consideration to case-sensitivity.
  4. The use of #if FX_NO_CULTURE_INFO_ARGS should be deprecated in favour of always specifying culture info args (and the like).
  5. When a function is public (e.g. "startswith") and needs an extra arg (e.g. "comparisonType") to bring it into line, this will be a breaking change. But a worthwhile one, imo. All .NET developers, including F# developers, need to be aware of this stuff and to give it serious consideration at all times.

Hi @andrew-webb,

I agree this is a significant problem - we need a clear position on this issue and we need that to be mapped through to fixes in the code.

There is considerable discussion going on about quality guidelines for F# upstack components, we'd welcome your input, for example: http://fsharp.github.io/2014/09/19/fsharp-libraries.html. It is likely the profile of culture-specific processing needds to be raised more highly in that dicsussion. Please help us do that.

Cheers
Don

Hi Don,

I'll be happy to help.

After spotting this problem in FSharpx, I took a good look at the source of FSharp.Data and was pleased to see that it was far less of a problem there. Things are not perfect in FSharp.Data in that cultural intent isn't always specified explicitly, but I couldn't spot any real causes for concern. Perhaps after new guidelines are published there could be a cleanup of FSharp.Data to bring it into total alignment, but at this time I don't feel moved to create an issue about it.

All this brings back fond memories of 9 years ago when Dave Fetterman of the BCL Team wrote a paper entitled Strings in .NET 2.0. I got a kick out of finding that page again today, and seeing my enthusiastic comment from June 2005. Don't bother clicking on the link at the top to the (Word) document, because it's long gone. However, it lives on in this page:- Best Practices for Using Strings in the .NET Framework. Just about every word that Dave wrote in 2005 still applies today. Needless to say, I consider it essential reading for every .NET developer.

You will see that Best Practices article referenced from the MSDN pages of method overloads that should be avoided, e.g. String.EndsWith Method (String). Scroll to Notes to Callers in the Remarks section.

However, I don't blame developers for not noticing, because it isn't very obvious. What got me to notice was passing all code through FxCop / Code Analysis, and investigating what the cryptic message "use an overload that takes an IFormatProvider arg" was all about!

Being relatively new to F#, I'm not sure if FxCop / Code Analysis applies to F# code. And if not, what should be used instead? But it's the next thing on my list to check out.

I'll see if I can come up with some guidance that will help F# developers specifically. But the main part of my guidance will be: read, mark, learn and inwardly digest the guidance that Dave Fetterman came up with 9 years ago and which still applies to this day. Oh, and Michael Kaplan always had interesting things to say in Sorting it all Out. Here's just one little gem.

Yours,
Andrew

Things are not perfect in FSharp.Data in that cultural intent isn't always specified explicitly

Can you please create an issue there with the problems you found? We take care to always either explicitly pass InvariantCulture or use a parameter, if there's still any place using the .Net default of using the current culture it's a bug

Can you please create an issue there with the problems you found? We take care to always either explicitly pass InvariantCulture or use a parameter, if there's still any place using the .Net default of using the current culture it's a bug

Happy to do it. Here's the issue:- fsprojects/FSharp.Data#742

Thanks @andrew-webb, this is a very useful contribution.

We'd be grateful if you could submit a PR to https://github.com/fsharp/fsharp.github.io/blob/master/_posts/2014-09-19-fsharp-libraries.md to add this as a specific recommendation. We can discuss the exact recommendation on the PR thread.

Note that FSharp.Core and F# always uses Ordinal comparison by default for "compare", "=", "<>", "<", ">", Seq.sort, Seq.sortBy, HashIdentity.Structural, ComparisonIdentity.Structural and so on. This was a decision made in F# 1.0-2.0. Upstack F#-specific libraries should follow this design pattern, and explicitly call it out in documentation where necessary. If you could include this information in the guidance PR that would be great.

Cheers!
Don

@dsyme

We'd be grateful if you could submit a PR to https://github.com/fsharp/fsharp.github.io/blob/master/_posts/2014-09-19-fsharp-libraries.md to add this as a specific recommendation

Done.

Andrew