Make Data.String.CodePoints the default
hdgarrood opened this issue · 14 comments
@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.
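To make the difference concrete, here is a rough TypeScript sketch of the two ways of measuring and slicing a string. It assumes the UTF-16 representation that JavaScript strings use; the function names are hypothetical and only loosely mirror length and take in the two modules, they are not the library's implementation.

```typescript
// True when the code unit at index i starts a well-formed surrogate pair.
function startsSurrogatePair(s: string, i: number): boolean {
  const hi = s.charCodeAt(i);
  const lo = s.charCodeAt(i + 1); // NaN past the end, so the range check fails
  return hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff;
}

// Code-unit view: every UTF-16 code unit counts as one element.
function lengthCodeUnits(s: string): number {
  return s.length;
}

// Code-point view: a surrogate pair counts as a single element.
function lengthCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    if (startsSurrogatePair(s, i)) i++; // consume the low surrogate as well
    count++;
  }
  return count;
}

// Taking n code units can cut a surrogate pair in half.
function takeCodeUnits(n: number, s: string): string {
  return s.slice(0, n);
}

// Taking n code points keeps surrogate pairs intact.
function takeCodePoints(n: number, s: string): string {
  let end = 0;
  for (let taken = 0; taken < n && end < s.length; taken++) {
    end += startsSurrogatePair(s, end) ? 2 : 1;
  }
  return s.slice(0, end);
}

// "a", then U+1D306 (outside the Basic Multilingual Plane), then "b".
const sample = "a\uD834\uDF06b";

console.log(lengthCodeUnits(sample));  // 4
console.log(lengthCodePoints(sample)); // 3
console.log(takeCodeUnits(2, sample) === "a\uD834");         // true: lone high surrogate
console.log(takeCodePoints(2, sample) === "a\uD834\uDF06");  // true: whole character kept
```

Both length functions have the same type, so swapping one for the other compiles without complaint and only misbehaves on input outside the BMP.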
For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:
- In the next breaking release:
  - we create a module Data.String.CodeUnits, with the exact same exports as the current Data.String,
  - we add a notice at the very top of Data.String, detailing that the functions within currently operate on code units, not code points; that this will change in the next breaking release; and that you should very probably be using Data.String.CodePoints instead (unless you are sure you want to operate on code units, in which case you can use Data.String.CodeUnits)
- In the breaking release after that one:
  - change Data.String so that it re-exports everything from Data.String.CodePoints
  - remove the notices
  - consider deprecating the Data.String.CodePoints module, for removal in a subsequent breaking release?
I've taken a slightly different approach in the 0.12 branch as it is:
- Data.String and Data.String.NonEmpty only export functions that are codepoint/codeunit agnostic
- Data.String.CodeUnits now exists, with all the relevant functions moved there (likewise for NES)
- the .CodeUnits and .CodePoints modules re-export the agnostic stuff too
This means people will be forced to choose between CodeUnit and CodePoint functions at least, and it avoids the potential future problem of people not noticing the switch if Data.String changed which set of functions it re-exports.
So for most people the migration path now will just be replacing import Data.String with import Data.String.CodeXXX.
Sound good?
I'd prefer it to re-export the functions from Data.String.CodePoints from Data.String to have a sensible default for import Data.String. Codepoints and Codeunits are pretty technical terms that not everyone will know about and we should strive to make it easy to do the right thing (which means using CodePoints in this case).
@garyb While we're breaking things for 0.12, let's also make Data.Char.toUpper and toLower return a string. Sorry for derailing.
Sure! Have you got an example of where that happens? (just curious)
'ß'.toUpperCase()
"SS"
Nice, thanks!
I think we can just drop the Char module actually: the case alteration can be done in String form, so having a Char -> String version is only a minuscule ergonomic win. There are no other functions in that module now, since fromCharCode / toCharCode became toEnum / fromEnum instead.
I disagree with
fromCharCode -> toEnumWithDefault bottom top
There's just no good way to discover that.
Hmm maybe... although it can be documented somewhere.
fromCharCode was a lie, that function should always have been returning Maybe Char which is how toEnum works at least.
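As a rough illustration of the two behaviours being contrasted, here is a TypeScript sketch. It assumes a Char is a single UTF-16 code unit (codes 0 to 0xFFFF) and uses null in place of Nothing; charFromCode and charFromCodeClamped are hypothetical names standing in for a Maybe-returning conversion and a clamping one, not the library's code.

```typescript
// Assumed bounds of a Char: one UTF-16 code unit.
const MIN_CHAR_CODE = 0;
const MAX_CHAR_CODE = 0xffff;

// Maybe-style conversion: null plays the role of Nothing for out-of-range input.
function charFromCode(code: number): string | null {
  const inRange =
    Math.floor(code) === code && code >= MIN_CHAR_CODE && code <= MAX_CHAR_CODE;
  return inRange ? String.fromCharCode(code) : null;
}

// Clamping conversion: fall back to the lowest Char for input below the range
// and to the highest Char otherwise (integer input is assumed, as with Int).
function charFromCodeClamped(code: number): string {
  const c = charFromCode(code);
  if (c !== null) return c;
  return String.fromCharCode(code < MIN_CHAR_CODE ? MIN_CHAR_CODE : MAX_CHAR_CODE);
}

console.log(charFromCode(65));                      // "A"
console.log(charFromCode(0x1d306));                 // null
console.log(charFromCodeClamped(-1) === "\u0000");  // true
console.log(charFromCodeClamped(0x1d306) === "\uffff"); // true

// JavaScript's own String.fromCharCode silently truncates to 16 bits instead.
console.log(String.fromCharCode(0x1d306) === "\ud306"); // true
```

The last line shows the kind of silent truncation that a total Int -> Char function hides; the Maybe-style version makes the failure visible to the caller.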
Now would be a good time to address that and have it return a Maybe Char then, surely? I don’t think redundancy is necessarily bad.
I made the change to have Data.String reexport Data.String.CodePoints in: 1fbc4c0
My understanding of what we have now is:
- Data.String.Common contains functions which behave in the same way regardless of whether we are considering strings as sequences of code points or code units
- Data.String.Code{Points,Units} contain functions whose behaviour differs based on whether we are considering strings as sequences of code points or code units
- Data.String re-exports the entirety of Data.String.Common and Data.String.CodePoints
If this is correct, I'm happy and I think we can close this?
Yeah, that's right 👍