fortran-lang/stdlib

String handling routines

ivan-pi opened this issue Β· 78 comments

Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:

  • split - given a separator, splits the string into some form of array
  • upper/lower - convert a character string to all upper/lower case

The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:

I also found the following Fortran libraries targeting string handling:

It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations it seems like this is one of the things where the absence of the standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes" to borrow the quote of B. Stroustrup in a retrospective of the C++ language.

For comparison here are some links to descriptions of string handling functions in other programming languages:

Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric converions, comparisons, finding the occurence of string in a larger string, joining and splitting, regular expressions).

My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.

This is a great summary! I feel like most of what we need has already been done in these projects (and others), so mainly we need to just gather it all together. Some important things to decide:

  • what do we call the string class? (I vote string).
  • what do we call the various individual methods?
  • Are they going to be OO (s.lower()) or functional (lower(s))?

Should we base it on the ISO_VARYING_STRING module? If so, the class is VARYING_STRING and the procedures are functional (lower(s)).

Should we utilize intrinsic function names? like real(string, [kind, [status]]) and have it stop if the conversion fails and no status variable is provided?

functional-fortran implements several functions on strings:

Further, these functions (and their corresponding operators) are compatible with character strings: complement, empty, head, init, intersection, insert, last, reverse, set, sort, split, tail, and union.

(Caution: split in functional-fortran is not quite what's been discussed at j3-fortran repo. It merely splits the string in two and returns the first or second part)

Thanks for this initiative and listing the current landscape. I think we definitely want stdlib to have good string support.

(For conversion from real/integer numbers to strings, I implemented a function str to be used like this: https://github.com/certik/fortran-utils/blob/b43bd24cd421509a5bc6d3b9c3eeae8ce856ed88/tests/strings/test_str.f90, implemented here and here, so one can do things like "Number i = " // str(i) // ".".)

@ivan-pi do you want to go ahead and create a table of the basic subroutines and let's brainstorm how they should be named, to be consistent with other languages and/or the above various string implementations if possible. And also if they should be functions or subroutines and what arguments to accept.

@jacobwilliams is right about raising the question how to represent the string. We should start with that.

I would recommend (as usual) to have a lowest level API that operates on the standard Fortran (allocatable where appropriate) character. Then, have a higher level API that operates on a string type, and simply calls the lower level API. Regarding a name, see #26, it seems most people agree that the convention to name derived type is to append _t, so it would be string_t.

That way people can use these low level API routines right away. For example in my codes I do not need to modify any data structures and can start using it. The higher level string_t API can then be used by codes that choose to refactor them, or in new codes. If the syntax is not as nice, some people might opt for the lower level API anyway.

I would vote that the low level API be based on functions, pure and elemental where possible and appropriate. I would stick with the Fortran convention of optional status parameters where there is the possibility of things going wrong, and if one is not provided and something goes wrong it crashes. I have tended to use that convention in any routines that go from a string to some intrinsic like:

if (present(status)) then
    read(string, *) result
else
    read(string, *, iostat=status) result
end if

Honestly, I thought the ISO_VARYING_STRING standard did a great job of covering all of the intrinsic functions available for character(len=*) variables, and extending IO to work with that type (put, get, put_line). Aside from the strange interface for split I think it's a great starting point.

My plan was to go through the libraries above and create a table of the most commonly available routines in the next days.

I agree we should consider both low-level routines which work directly on strings of type character(len=*) and a high-level string_t type.

The book Fortran Tools for VAX/VMS and MS-DOS by Jones & Crabtree contains a description of a Fortran string-handling library. Interestingly, they decided to use null-terminated strings like in C, meaning they needed to build a separate set of functions from the intrinsic ones (concatenation operator // and length function). They later used these tools to develop a compiler for a subset of the Fortran language itself! Their conclusion about strings was:

Fortran is often maligned for its lack of facilities for character-oriented processing. ... The apparent deficiency of Fortran for string manipulation is primarily because of the methods traditionally used rather than because of a shortcoming of the language itself. The main shortcoming of Fortran for string handling is the lack of a standard library of routines for often-needed functions. As Fortran programmers we are faced with a choice: we either invest the up-front effort required to create our own standard library or we live with the continuing effort of hacking together a solution each time we are presented with similar problems.

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

I have linked three distinct implementations in the top post. The links from the gfortran compiler pages are dead as well as the link in Modern Fortran Explained by MCR.

Edit: An informal description of the iso_varying_string module for Varying Length Character Strings in Fortran can be found at: http://numat.net/fortran/is1539-2-99.html

@ivan-pi thanks. I like your plan. It looks like the iso_varying_string is in the "high-level" API category, as it operates on a VARYING_STRING derived type. Our low-level API would be similar, but operating directly on character(len=*).

Building the low-level API on character(len=*) variables will be problematic for some operations, since they can't be resized. The high-level API will need to call routines that operate on character(len=:),allocatable variables. So you may end up with two slightly different routines in some cases. So are there really three APIs?

  • character(len=*)
  • character(len=:),allocatable
  • string_t or whatever we call it

That seems complicated to me... but it would cover all the bases...

I think there are two possible APIs here: intrinsic and derived-type one.

For the intrinsic API, character(len=*) works well for input strings. If the function will return a string of known size, you return a charecter(len=something) string. If unknown, you return an allocated character(len=:), allocatable string. User doesn't need to know which one it is.

I also see the intrinsic one as the starting point. Higher-level (derived type) implementation is likely to use the intrinsic API internally.

My understanding, and somebody correct me if I don't have this quite right, is that the ISO_VARYING_STRING standard was created before character(len=:), allocatable (around 2001 I think?), but then when character(len=:), allocatable was added to the standard, it was supposed to function like variable length strings, and so the former was mostly abandoned. However, I have found most compilers to be buggy with their implementation. Memory leaks when used as the return from a function, failure to properly reallocate on assignment, false-positive warnings about accessing uninitialized memory, etc.

If allocatable character actually worked we wouldn't a new derived type for strings. You would just use the intrinsic type and move on. But I think as written in the standard, it probably will never truly work properly in all cases (especially as in read statements, since other allocatable arrays don't and aren't supposed to).

If there is a new type for strings, I don't think a lower level library or API should be exposed, and it should probably not be based on allocatable characters.

As I mentioned above, you can use this trick to return character(len=N) strings from functions. The downside is that the string operation gets executed twice --- once to compute the length in the pure procedure, and second time to actually return it. So we probably don't want to do it that way. What I was thinking is to do what @milancurcic suggested: use character(len=*) as well as character(len=N) where we can, and use character(len=:),allocatable to avoid doing the operation twice as I described above. And that's the low level API. Below I provide two examples: #69 (comment) and #69 (comment), to show one one would decide whether to expose character(len=*) or character(len=:), allocatable.

As @everythingfunctional mentioned, for example GFortran used to have huge problems with allocatable strings and leaked memory. The latest version has improved a lot. Given that this is standard Fortran, and stdlib is a standard library, I think it is ok if we depend on the standard, and if there are compiler bugs, we'll try to workaround them and ensure they are reported. Regarding read statements, see #14 that would handle that. I think we should at least try to create a consistent low level API, not give up without even trying. If it truly cannot be done, only then we'll have to do what you propose, and only expose the string_t type and report the bugs to compilers (and keep the list somewhere) and propose improvements to the language itself, so that it can be done in the future.

I thought you could only use intrinsic procedures in variable declaration statements. Learned something new. That's a neat trick, but like you said, not particularly efficient.

Let's discuss a simple example: upcase.

character(*)

Here is an implementation:

function upcase(s) result(t)
! Returns string 's' in uppercase  
character(*), intent(in) :: s
character(len(s)) :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

When the user wants to use it, he could do this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
allocate(character(len(s)) :: a)
a = upcase(s)
print *, a

which prints:

 Some string
 SOME STRING

The main disadvantage of this approach is that the user needs to know the size ahead of time. In this case he knows --- it's the same size as the original string. Although modern gfortran has reallocatable LHS turned on, so then just this works:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

So I think that would work for upcase.

character(:), allocatable

Here is the implementation using character(:), allocatable

function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(:), allocatable :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

It's still used like this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

But since this as an extra allocation inside upcase, I would think that in this case, the character(*) version is better.

Now let's discuss integer to string conversion, the two implementations:

character(*)

pure integer function str_int_len(i) result(sz)
! Returns the length of the string representation of 'i'
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: s
! If 's' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(s, '(i0)') i
sz = len_trim(s)
end function

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
character(len=str_int_len(i)) :: s
write(s, '(i0)') i
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

character(:), allocatable

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: tmp
character(:), allocatable :: s
! If 'tmp' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(tmp, '(i0)') i
s = trim(tmp)
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

Discussion

Unlike in the upcase (see previous comment), here the character(*) version is converting twice, so it is inefficient. The character(:), allocatable version just converts once, and so that would be the preferable API.

(Note: if we implement our own integer to string conversion algorithm, then we avoid the ugly MAX_STR thing and the need to call trim. The above implementation was reused from my codes, where I just use the Fortran intrinsic conversion as part of write so that I save code.)

Here is my proposal for the low level API:

  1. Use character(*) where possible and efficient (see the previous two comments for examples how to decide)
  2. Use character(:), allocatable otherwise
  3. For compilers that cannot compile the code (or leak memory): create a workaround subroutine with a different (less nice or less efficient) API and use that instead for those compilers only, and report the compiler bug and reference it in the code. As a community we have contacts to compiler vendors, and we can communicate this and help get this fixed. The long term goal would be to eventually have no workarounds in 3.

Unfortunately some compilers might leak memory or segfault when such strings are used in derived types. Ultimately, long term, the compilers must be fixed. That's why I think the above proposal is a good one for the long term. In the short term, if we want to provide strings to users that actually work in all today's compilers, it might be that the only way is to create a string_t type not based on allocatable strings, in which case one could still use the low level API with the workarounds 1., 2., and 3., but make a copy of the result into the derived type string_t that is internally represented differently, so that today's compilers do not leak memory. That would be less efficient than providing a separate string implementation, but it's only a short term issue anyway, until compilers catch up. (Alternatively we can have an efficient duplicate implementation based on the internal string_t representation directly if we want better performance until compilers catch up.)

Specifically for the case of integer to string conversion, you could also dynamically allocate a buffer for each integer kind and then trim the result into an allocatable character string:

    function integer_to_string2(i) result(res)
      character(len=:),allocatable :: res
      integer, intent(in) :: i
      character(len=range(i)+2) :: tmp
      write(tmp,'(i0)') i
      res = trim(tmp)
    end function

If we want to avoid internal I/O this function becomes something like

    function integer_to_string1(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        integer, parameter :: ibuffer_len = range(ival)+2
        character(len=ibuffer_len) :: buffer
        integer :: i, sign, n

        if (ival == 0) then
            str = '0'
            return
        end if

        sign = 1
        if (ival < 0) sign = -1

        n = abs(ival)
        buffer = ""

        i = ibuffer_len
        do while (n > 0)
            buffer(i:i) = char(mod(n,10) + ichar('0'))
            n = n/10
            i = i - 1
        end do
        if (sign == -1) then
            buffer(i:i) = '-'
            i = i - 1
        end if

        str = buffer(i+1:ibuffer_len)
    end function

For processing floating point values the functions are much more difficult to develop compared to those using internal read and write statements.

I did some keyword searchs in the list of popular Fortran projects. It seems that most projects use their own set of character conversion and string handling routines for stuff like reading input values from files, parsing command line options, defining settings, etc..

Here are the results of my search of some of the top projects:

Project # of "string" # of "character" # of Fortran files
ElmerFEM 248 1319 2076
WRF 306 966 1668
fds 16 28 41
quantum-Espresso 66 472 1516
fluidity 38 279 747
json-fortran 26 47 49
fortranlib 11 18 38
Nek5000 54 204 336
cp2k 439 1043 1132
nastran-95 85 551 1838
specfem3d 186 404 765
nwchem 323 2768 17214
gtk-fortran 59 77 92
cfl3d 14 216 397
shtools 2 20 113
arpack-ng 1 259 332

The second and third column measure the number of Fortran files that contain the keywords string or character, respectively. This includes both command statements and comments so it may be a bit misleading.

In one of the codebases I even found this comment:

    ! String parsing in Fortran
    ! is such a pain
    ! it's unreal

Casing

The purpose of these functions is to return of copy of a character string ( either character(len=*) or a derived string type) with the case converted . The common variants are uppercase, lowercase, and titlecase.

The libraries cited in the first post contain the following function prototypes:

! functional
function str_upper(str)
function str_lower(str)
function str_swapcase(str)
pure function ucase(input)
pure function lcase(input)
function str_lowercase(str)
function str_uppercase(str)
subroutine str_convert_to_lowercase(str)
subroutine str_convert_to_uppercase(str)
pure elemental function lowercase_string(str)
function uppercase(str)
function lowercase(str)

! object-oriented
procedure, pass(self) :: camelcase
procedure, pass(self) :: capitalize
procedure, pass(self) :: lower
procedure, pass(self) :: snakecase
procedure, pass(self) :: startcase
procedure, pass(self) :: upper
function vstring_tolower(this[,first,last])
function vstring_toupper(this[,first,last])
function vstring_totitle(this[,first,last])

Some versions will return a new string, while some work in place. In at least one of the functions, it did not convert the case of characters enclosed between quotation marks.

These are the similar functions available in other programming languages:

  • Python: capitalize, lower, swapcase, upper
  • Ruby: capitalize, swapcase, upcase, downcase
  • D: toLower, toLowerInPlace, toUpper, toUpperInPlace, asCapitalized, asLowerCase, asUpperCase
  • MATLAB: lower, upper
  • Julia: uppercase, lowercase, titlecase, uppercasefirst, lowercasefirst
  • C++ (Boost): to_upper, to_lower,
  • Rust: to_uppercase, to_ascii_uppercase, to_lowercase, to_ascii_lowercase, make_ascii_uppercase, make_ascii_lowercase

My top three name picks are:

  1. uppercase/lowercase/titlecase
  2. to_upper/to_lower/to_title
  3. upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

I'd like to add to the list of facilities here the overloaded operator * between integers and strings, so that you can do, like in Python:

print *, 3 * 'hello' ! prints 'hellohellohello'
print *, 'world' * 2 ! prints 'worldworld' 

It's easy to make and use. The only downside I can think of is a somewhat weird API when importing it:

use stdlib_experimental_strings, only: operator(*)

Yes, I have seen this kind of usage in one of the above mentioned libraries. I am not sure whether it is not perhaps better to promote the usage of the intrinsic repeat function. As the Zen of Python states: There should be one-- and preferably only one --obvious way to do it.

A benefit of repeat is precisely that you avoid the import statement.

Oops, I didn't know about repeat. Indeed it's the way to go so I withdraw my proposal. I need to brush up on my canonical Fortran. :)

It is really hard to have a day job and keep up with all these threads, so my apologies if I've missed something because I'm just skimming here. A few opinionated notes:

  • Ruby is my favorite language for string processing, and IMO is the best at it. If no one objects (especially @ivan-pi) I'll put links in the first post on this issue
  • I have focussed mostly on some basic string handling (expanded template here) in my ZstdFortranLib project which I started only days before this project started, so I'm of a mind to possibly abandon it or work to integrate some of it here. It has:
    • split: returns an array of characters where each entry is as long as the longest one
    • join: joins an array of characters
    • sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string
    • gsub: Replaces all occurrences with a substring
    • //: Overloaded to allow concatenation of all real, logical, integer kinds with all character kinds
    • to_i: Convert to integer (all kinds)
    • to_r: Convert to real (all kinds)
    • to_l: Convert to logical (all kinds)
    • to_s: Convert anything to a string (all kinds)
    • ANSI formatting stuff that can be turned on/off globally for coloring and styling terminal output (thanks to using one of @szaghi's projects, FACE I believe)

I need to look at the varying string and character array proposals in more detail.

FWIW, I personally prefer the Ruby Python OO approaches with methods because it will make import statements much simpler: Pull in the string class and you get all the methods along with the type/class declaration. Now some operators may need to be pulled in as well if you want to be able to concatenate a real (lhs) with a string (rhs, can't have a TBP operator to the left of the object IIRC).

I was thinking of starting a PR marrying my work on ZstdFortranLib with a UDT/Class approach rather than operating on raw character scalars and arrays which is awkward for things like split(). But now I need to catch up on the myriad of proposals and prior art, so don't hold your breath.

While there is an intrinsic implementation, repeat(), I still like the more concise syntax which has pretty clear meaning for anyone who has ever worked with languages like Python and Ruby. It would be nice if some of these syntactic sugar items were added to the standard rather than a standard library. But until then I would be happy with * overloaded for characters.

@zbeekman I am struggling with all the threads also, but that is good news. It means there is lots of momentum. If you can help us design a good low and high level API for strings (#69 (comment)), that would be great.

@certik k

Now let's discuss integer to string conversion, the two implementations:

I like your first one the most. With integers you can use some math to count up how many digits there are, and if you need a sign on the front, which completely removes the need to declare the max string length AND to do the IO twice. Instead you use integer and floating point math which (hopefully) will be reasonably quick. IIRC, I implemented something to do this in JSON-Fortran but I'll have to look for it.

Also, I don't mean to whine about not being able to keep up, and I agree that it's good, but it's hard to keep track of all the balls in the air.

After a brief search there are at least 3 ways to do this without performing the conversion to a string then counting digits:

  1. Iteration:
    len = 1
    if ( n < 0 ) len = 2
    do
      n = n / 10
      if (n == 0) exit
      len = len + 1
    end do
  2. Tail recursion (same algorithm as above: It will be optimized to code above by compiler or it will be slower if the compiler uses recursion)
  3. "One shot" method using log10
    len = floor( log10( real( abs( n ) ) ) + 1 )
    if ( n < 0 ) len = len + 1

I would guess that 1 is the fastest way to do this, but it may depend on the compiler and hardware. 3 has conversion to a real, then log10 is probably computed iteratively, and it is converted back to an int, so 1. may be faster despite the loop.

@zbeekman great idea. I suspect even faster would be 1. together with putting the digits into the string right away. And thus using the "allocatable" approach. In your approach, you compute the length of the string in 1., but then you have to do a similar loop again when you do the actual string conversion. But your approach is definitely a huge improvement, so we could us my first approach above also with this.

Why not keep going a little here:

elemental function int_str_len(n) result(res)
    integer, value :: n
    integer :: res

	res = merge(1, 2, i >= 0)
	do
        n = n / 10
        if ( n == 0 ) return
        res = res + 1
    end do
end function int_str_len

I suspect even faster would be 1. together with putting the digits into the string right away. And thus using the "allocatable" approach.

I prefer not working with allocatable character function results due to compiler bugs, but whether or not the compiler is implicitly generating a loop for write(s, '(i0)') i may be interesting from a performance perspective...

Doh! Ah, the dangers of drinking from the fire-hose of fortran-lang/stdlib: It appears @ivan-pi has essentially already figured all this out and implemented a similar version above.

See also #69 (comment) how I propose to handle the compiler bugs.

Ruby is my favorite language for string processing, and IMO is the best at it. If no one objects (especially @ivan-pi) I'll put links in the first post on this issue

Feel free to edit the first post. I only listed the languages that came off the top of my head.

Doh! Ah, the dangers of drinking from the fire-hose of fortran-lang/stdlib: It appears @ivan-pi has essentially already figured all this out and implemented a similar version above.

I adapted that version from a 1988 book on Fortran tools but took the "one-shot" approach to declare a sufficiently sized buffer. The range intrinsic essentially returns the value of floor(log10(huge(x)) converting the integer to a real first.

The range intrinsic essentially returns the value of floor(log10(huge(x)) converting the integer to a real first.

Yes I saw that. And then you cleverly work backwards from the temporary array local variable that holds the largest possible integer you could create. Pretty neat!

I think at the end of the day the allocating the right length string issue is where we'll want some benchmarking with different popular compilers. I'm not convinced that internal IO would be slower than this, but I haven't bench marked it and would happily be wrong!

In the book where I found this approach they say:

The reason for not using internal files is mainly to avoid Fortran I/O if possible on microcomputers with limited memory, with the hope that the host compiler won't include Fortran I/O in the executable image; it is also kind of nice to be able to do it on your own. In our tests of the IBM-PC progams that do not use Fortran I/O in the primitives, replacing versions of itoa and atoi using internal files with the ones shown here reduced the size of executable program by 21k bytes.

I cannot say whether this is true also for modern Fortran compilers or, if it even matters given the large amount of memory available to us today.

To follow @gronki's original format of:

  • name of the utility
  • short description
  • does it exist in other languages
  • proposed example of usage

Strip

  • Remove leading and trailing whitespace characters (null, horizontal tab, line feed, vertical tab, form feed, carriage return, space.)
  • Ruby: https://ruby-doc.org/core-2.6/String.html#method-i-strip
    "    hello    ".strip   #=> "hello"
    "\tgoodbye\r\n".strip   #=> "goodbye"
    "\x00\t\n\v\f\r ".strip #=> ""
    "hello".strip           #=> "hello"
  • Fortran:
    strip("\tgoodbye\r\n") ! "goodbye", where \t is a tab char, \n a new line etc.
    strip("   hello     ") ! "hello"
    strip("Hello")         ! "Hello"
    !! Or maybe for a string class
    s%strip() ! returns "hello" from "   hello   ", etc. as above

Chomp

  • Return string with default or optionally specified trailing record separator(s) removed. Think write(*,*) vs write(*,*, ADVANCE="NO").
  • Ruby: https://ruby-doc.org/core-2.6/String.html#method-i-chomp
    "hello".chomp                #=> "hello"
    "hello\n".chomp              #=> "hello"
    "hello\r\n".chomp            #=> "hello"
    "hello\n\r".chomp            #=> "hello\n"
    "hello\r".chomp              #=> "hello"
    "hello \n there".chomp       #=> "hello \n there"
    "hello".chomp("llo")         #=> "he"
    "hello\r\n\r\n".chomp('')    #=> "hello"
    "hello\r\n\r\r\n".chomp('')  #=> "hello\r\n\r"
    !! Or use a TBP on a string class with passed first argument
  • Fortran:
    chomp("hello")                   ! "hello"
    chomp("hello\n")                 ! "hello"
    chomp("hello\r\n")               ! "hello"
    chomp("hello\n\r")               ! "hello\n"
    chomp("hello\r")                 ! "hello"
    chomp("hello \n there")          ! "hello \n there"
    chomp("hello", sep="llo")        ! "he"
    chomp("hello\r\n\r\n", sep="")   ! "hello"
    chomp("hello\r\n\r\r\n", sep="") ! "hello\r\n\r"
    !! Or use a TBP on a string class with passed first argument

Split

  • Return an array of strings. This might be an array of raw characters padded out to the max length of any given element, or a ragged edge array of strings in a string class. Called without an argument or a single space remove leading and trailing whitespace and runs of multiple white space characters.
  • Ruby: https://ruby-doc.org/core-2.6/String.html#method-i-split
    " now's  the time ".split       #=> ["now's", "the", "time"]
    " now's  the time ".split(' ')  #=> ["now's", "the", "time"]
    "mellow yellow".split("ello")   #=> ["m", "w y", "w"]
    "1,2,,3,4,,".split(',')         #=> ["1", "2", "", "3", "4"]
  • Fortran:
    split(" now's  the time ")           ! ["now's", "the", "time"]
    split(" now's  the time ", sep=' ')  ! ["now's", "the", "time"]
    split("mellow yellow", sep="ello")   ! ["m", "w y", "w"]
    split("1,2,,3,4,,", sep=',')         ! ["1", "2", "", "3", "4"]

Join

  • Join an array of characters or string class using optional glue, like python, but I think the arguments are backwards in python for the object oriented case. i.e., list.join(glue) seems to make more sense to me.
  • Python: https://docs.python.org/3.8/library/stdtypes.html#str.join
    s = "-"
    s.join(['1','2','3','4']) # '1-2-3-4'
  • Fortran:
    join(['1','2','3','4'], glue='-') ! '1-2-3-4'
    join(['1','2','3','4'])           ! '1234'
    !! Or for an OO class
    s = split('this, that, the other', sep=',')
    ! this
    ! that
    ! the other
    s%join(',')                       ! 'this, that, the other'
    s%join()                          ! 'thisthatthe other'

Gsub

  • Global substitution. Same as @jacobwilliams' string_replace as far as I can tell. Eventually it would be nice to support regular expressions like Ruby, but, for now, we should start smaller and just make it behave like:
    sed 's/<pattern>/<replacement>/g'
  • Kinda Ruby, Python and sed. Behave like Python without the optional count argument
  • Fortran:
    gsub("hello", "l", "L") ! "heLLo"
    gsub("the quick brown fox jumped over the lazy dog", "the", "a")
        ! "a quick brown fox jumped over a lazy dog"
    !! Or OO approach
    s = "hello"
    s%sub("l","L")              ! "heLLo"

Sub

  • Replace first (or last) instance of one substring with another
  • Again, similarities with Python and Ruby, but not an exact match
  • Fortran:
    sub("hello", "l", "L") ! "heLlo"
    sub("hello", "l", "L", back=.true.) ! "helLo"
    sub("the quick brown fox jumped over the lazy dog", "the", "a")
        ! "a quick brown fox jumped over the lazy dog"
    sub("the quick brown fox jumped over the lazy dog", "the", "a", back=.true.)
        ! "the quick brown fox jumped over a lazy dog"
    !! Or OO approach
    s = "hello"
    s%sub("l","L")              ! "heLlo"
    s%sub("l","L", back=.true.) ! "helLo"

Center

  • Pad and center a string to a given width
  • Ruby: https://ruby-doc.org/core-2.6/String.html#method-i-center
    "hello".center(4)         #=> "hello"
    "hello".center(20)        #=> "       hello        "
    "hello".center(20, '123') #=> "1231231hello12312312"
  • Fortran:
    center("hello",4)        !  "hello"
    center("hello",20) .     ! "       hello        "
    center("hello",20,"123") ! "1231231hello12312312"
    !! Or OO approach with passed first variable as before
    s = "hello"
    s%center(20)       ! "       hello        "
    s%center(20,"123") ! "1231231hello12312312"

Conversion from character to real, integer, complex, logical, etc.

Integer, real, logical, complex conversion to character strings (ideally for all available kinds)

  • Convert using internal io with a read statement under the generic * edit descriptor
  • Like Ruby's to_s/to_str method and Python's .str()
  • Fortran:

Overloaded concatenation

  • Use to_int or whatever the names are to concatenate strings with integers, reals, complex and logical variables and handle the string conversion with the default rules
  • Other languages: probably, but don't know of good examples off the top of my head
  • Fortran:
    "file" // i // ".txt" ! "file1.txt" or "file999.txt" or "file-20.txt"
    100.0_rk // " degrees celsius is when water boils"
    ! "100.00 degrees celsius is when water boils"
    ! This should have a editable edit descriptor to control conversion, probably want to use a "g" edit descriptor as the default with sensible values
    "Test failed? " // .true. ! "Test failed? true"

Most Python and Ruby string methods with idiomatic, and straightforward-ish implementations in Fortran would be great to have. Cases where intrinsics are already present need not be re-written unless we create a string class vs a functional style. Some others have no obvious or straightforward implementation until more infrastructure (like regex handling) is in place. Regex handlers will almost certainly need to use an external library, unless someone knows how to write parsers and lexers using tools that emit Fortran.

I think and important issue for us is to decide:

  1. Functional only approach
  2. OO only approach
  3. Both

I don't think it would be that difficult to provide both a functional interface and an OO interface, and the OO interface could leverage the functional implementation in most places.

Also worth considering: Do we want basename, dirname, name_we, extname type of functions to be part of a string class or a file/os class/module?

Do we want basename, dirname, name_we, extname type of functions to be part of a string class or a file/os class/module?

That should be the OS module I would think.

I think and important issue for us is to decide:

1. Functional only approach

2. OO only approach

3. Both

It looks like 3. it should be, as we can all agree on that one and move on to actually implement this (as opposed to keep discussing whether to do 1. or 2.).

Responding to some of @zbeekman proposals

Strip

Doesn't trim already do exactly this? Why have two functions to do exactly the same thing?

Chomp

I think you need more examples about why this is needed. Is the idea not to remove trailing whitespace if it's not a record separator? I.e. chomp("hello \n") => "hello "

Split

There is some subtleties about how split works that need to be pointed out. In Python (and other languages?, I'd have to double check) the separator argument is taken as a list of possible separators, not as a pattern that must be found. (i.e. split("Hello ,World", ", ") => ["Hello", "World"]).

Also, should empty strings be included in the resulting array? I think probably not, or you'd get something you probably didn't intend from a standard use case like split("A, list, of, words", ", ") => ["A", "", "list", "", "of", "", "words"] due to the multiple separators next to each other.

I've got an implementation of split (splitAt) here that behaves like Python's.

Gsub and sub

The replace function from the ISO_VARYING_STRING module combines these two uses into one interface (replace(string, target, substring, every, back)). I don't know if that's better or worse, just that there is already a precedent for it.

Conversion from character to real, integer, complex, logical, etc.

What should happen if the string can't be converted? Most other languages throw exceptions that can be caught, but Fortran can't do that. Providing an optional iostat argument would seem to fit in with the Fortran "style".

Integer, real, logical, complex conversion to character strings (ideally for all available kinds)

The question really is what format should real and complex numbers take. Use the f format specifier or g, or something else? To what precision? What about trailing zeros? or rounding for stuff like 1.00000000001? I've got an implementation here that picks the shorter string from regular vs scientific notation, removes trailing zeros, and give the full precision available for that kind if a number of significant digits isn't specified, but I'm not certain that's what everyone would prefer.

Like others, I have only skimmed the postings, so I may have missed it, but has the issue of meaningful trailing blanks been discussed? Normally trailing blanks are just an inconvenience and I guess with the allocatable-length strings we have now, the issue is less pressing, but some consideration does seem useful.

@zbeekman mentioned in #69 (comment) parsers and lexers implemented in Fortran. Some years ago I adjusted the SQLite "lemon" parser generator so that it will emit Fortran code - https://sourceforge.net/p/flibs/svncode/HEAD/tree/trunk/src/lemon/. I merely adjusted the code generation parts of the original code and I admit not having used it much, but it could be a starting point. In the same Flibs project: Paul Fossati created an interface for the PCRE library for regexps.

has the issue of meaningful trailing blanks been discussed?

Great question, I don't think we explicitly discussed this yet. I was hoping that we will assume and require that there are no trailing blanks (both in the low level as well as the OO API), unless the user wants them there (but in that case "x" and "x " will be treated as different strings). In particular, no need to call trim all over.

Quite possibly the absence of any magic character - \0 for C-like languages and trailing blanks for Fortran - may automatically resolve the issue, but we will have to take care that such features/quirks do not sneak in ;).

@everythingfunctional: I'm not in favor of unnecessary duplication either, and this is just a personal wishlist to shape discussion

Doesn't trim already do exactly this? Why have two functions to do exactly the same thing?

No, trim only handles trailing blanks which are spaces unless I'm misunderstanding something. Strip handles leading and trailing whitespace of all forms.

Is the idea not to remove trailing whitespace if it's not a record separator?

Yes and to provide flexibility on what that separator might be, e.g., commas after all elements except the last one in a list of items.

There is some subtleties about how split works that need to be pointed out.

Agreed. I was quoting the way Ruby does it straight from Ruby's API docs, and I think the examples I posted spell out the semantics fairly well.

In my very opinionated opinion, Ruby is the best language for string handling, so I like APIs that match, where it makes sense, Rubys. But Python has good string handling too, certainly better than Fortran. Just proposing a way to do it, not saying we have to do it that way.

The replace function from the ISO_VARYING_STRING module combines these two uses into one interface (replace(string, target, substring, every, back)). I don't know if that's better or worse, just that there is already a precedent for it.

I haven't had a chance to look at ISO_VARYING_STRING yet. If people like it and there's broad support (and, even better an implementation) then I would happily defer to replace and any other useful functions/utilities. But since we're making something new, I wanted to share what I like from Ruby.

What should happen if the string can't be converted? Most other languages throw exceptions that can be caught, but Fortran can't do that. Providing an optional iostat argument would seem to fit in with the Fortran "style".

Very good question. Where a function is doing the converting it should have an iostat argument. Typically you can convert from one character set into utf8/unicode, going in the other direction requires a check that the characters being converted from unicode exist in the other character set. For concatenation, you can convert asymmetric concatenation to ISO_10646 if it's present.

The question really is what format should real and complex numbers take. Use the f format specifier or g, or something else? To what precision? What about trailing zeros? or rounding for stuff like 1.00000000001?

Yes, I implemented similar logic in JSON-Fortran (with input and help from @jacobwilliams) I think it makes sense to provide some sane defaults with a function to let the user specify their own default output format for reals, complex, integers, etc.

Paul Fossati created an interface for the PCRE library for regexps.

I was thinking that using this might be wise. But it introduces a (perhaps critical) external dependency, that may be an unpopular opinion.

has the issue of meaningful trailing blanks been discussed?

One of the nice things about most of the ruby APIs is that they give you some flexibility to change semantics for, e.g., repeated blanks, or other patterns. But a lot of this extended capability relies on a regex syntax, which, right now, is probably a non-starter.

In order to move this discussion forward --- it seems the implementations are (relatively) straightforward, the hard part is to agree on an API: a set of functions, their name, arguments for the low level API, and similarly for the high level API. If that's the case, @zbeekman, @ivan-pi do you want to start a document where we'll start discussing the function names and their arguments and functionality?

I don't know if the best way is to open a few draft PRs with the document (I assume there will be a few competing versions at first), or if there is another way to discuss a full document.

Sure. I need to make myself familiar with N1375 and N1379 And see how it sits with what I had anticipated doing. There's no point reinventing the wheel, so if ISO_VARYING_STRING appeals to me and seems congruent with the opinions I've heard in this thread so far it might be a good starting point. @everythingfunctional seems to already have an MIT licensed implementation, so that could even be a good starting point.

ISO_VARYING_STRING operates on a derived type, so that is the high level API. The low level API would operate on character string directly.

I was hoping to also have some document that compares different languages, like @ivan-pi did above, so that we can stay compatible with the API, whenever it makes sense.

I was hoping to also have some document that compares different languages, like @ivan-pi did above, so that we can stay compatible with the API, whenever it makes sense.

I realized there is a page for this on Wikipedia:
https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(string_functions)

As you might notice many of the string functions do not have a Fortran equivalent. Exactly those are the ones we should try and implement here.

I don't know if the best way is to open a few draft PRs with the document (I assume there will be a few competing versions at first), or if there is another way to discuss a full document.

By document are you implying simply a markdown file? (ideally these string function specifications would later become part of the documentation)

@ivan-pi the wikipedia page you posted is perfect, that's what I was hoping to create. Now when this is done, I think all we need is just some markdown document where we start listing the Fortran functions and their names and arguments, and we will consult the wikipedia page to see how other languages do that. Maybe even Wiki would be good enough to start. Do you want to start a wiki page at https://github.com/fortran-lang/stdlib/wiki ?

More basic is that I saw mention of not using internal I/O to do the conversion to a string. That is really important if you want your function to be safely called from an I/O function such as a WRITE, and to be able to use it from within a PURE function. The algorithms to do a floating point value as well as an INTEGER value can be found back as far as the 1968 version of "Software Tools" in Ratfor!

Did you avoid internal I/O in your library M_STRING? I never really thought about is internal I/O pure or not. But if that is really the case I would prefer to do our own string conversions routines.

And I like the name "center" for centering a string, but I would vote for the name ADJUSTC() for a function that works on CHARACTER variables to complete the set (ADJUSTL(), ADJUSTR(), ...).

Would you allow adjustc to have an optional width parameter or not? I like this name idea.

Is the main objective to create a specification for the Fortran standard or to create a working library in the Public Domain (or both?)

I think the initial purpose is to create a working library in the public domain (MIT or BSD license) which hopefully becomes standard in the feature. I guess a similar model is how some of the C++ Boost function libraries later become integrated into the language or the C++ standard library. You can read the related thread j3-fortran/fortran_proposals#104 for more information.

Re regexp: a couple of years ago I found a series of articles by Russ Cox about regular expression engines. The accompanying source code is not too difficult and I made a start rewriting it in Fortran. However, there is a flaw in my code and I never got around to correct it (the code is not too difficult but it is manipulating lists). I picked it up again. The result will not be a full-fledged RE engine Γ  la PCRE, but it will be Fortran only and it should be useable for not entirely trivial tasks.

I never really thought about is internal I/O pure or not. But if that is really the case I would prefer to do our own string conversions routines.

Based on the fact that I was able to write pure toString functions here, internal I/O is pure.

@urbanjost thanks for the post. As @ivan-pi replied, the goal of this stdlib effort is to provide a Fortran Standard Library, i.e., both a library and a specification. See my answer to a similar question. The license is MIT (https://github.com/fortran-lang/stdlib/blob/006bedafc0d40ff381da2bd4455f61b5e11fc2ee/LICENSE), and we will only depend on 3rd party code that is MIT or BSD style licensed. The way we plan to achieve our goal it to have a large community designing the API and a rigorous (high bar) process to get new features in as documented in our WORKFLOW document. And we have been coordinating with the Fortran Standards Committee (this effort started at the J3 committee repository at j3-fortran/fortran_proposals#104), and also we are planning to getting them involved in the step 5 in the workflow (at least informally). Our goal is to get a wide community agreement and acceptance to adopt stdlib as the Fortran Standard Library. We will continue working closely with the Fortran Standard Committee and coordinate with them. I can imagine many arrangements in the future, up to even the Fortran Standard itself specifying a Standard Library; but that is far in the future. Right now our job is to get the community to agree on the APIs and to provide specifications and implementations and to build a community around it.

(Update: we added the motivation into README: https://github.com/fortran-lang/stdlib#goals-and-motivation)

datetime-fortran used internal I/O in a pure function for a long time (since the beginning I think) and this built fine with gfortran and ifort.

https://github.com/wavebitscientific/datetime-fortran/blob/d4683303e6319b6380bbf7717164f7d8f18e0f0d/src/lib/mod_datetime.f90#L1288

After reading through this thread I found subtle issue with the proposed low-level API for character(len=*) variables.

My top three name picks are:

  1. uppercase/lowercase/titlecase
  2. to_upper/to_lower/to_title
  3. upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

Taking just the basic functionality mentioned by @ivan-pi here, I implemented a stdlib_character module as demonstration in #310, the issue becomes quickly apparent once you try to use both stdlib_character and stdlib_ascii in the same scope.

I created an exploratory implementation of a functional string handling at awvwgk/stdlib_string as fpm project. A non-fancy string type is implemented there, which basically provides the same functionality as a deferred length character but can be used in an elemental rather than a pure way. The idea is to have a scaffold for the string type in stdlib which can be extended later but already provides everything we are used to have from the deferred length character without the rough edges.

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

  • there is no assignment from string to character
    • reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS
  • all procedures return a fixed length character rather than a string instance
    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack
  • no support for get and put
    • reason: derived type IO is used instead

@awvwgk this would be the high level API that operates on the string_type type.

How would a low level API look? Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

The maybe function can also operate on character(len=:), allocatable it seems. So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

Bad example, the read_formatted procedure defines a user defined derived type input (see #312), which cannot be defined for character(len=:), allocatable since there is already an intrinsic formatted read transfer for character(len=*) types defined.

this would be the high level API that operates on the string_type type.

The idea so far was to provide the intrinsic low level API for a string type, on which later the high level API can be defined.

So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

Exactly, I wanted to explore a common basis of agreed on functions for a future high level string object. The minimal agreed on basis should be easily all the intrinsic procedures defined for character(len=*).

How would a low level API look?

I decided to pick the part of the high-level API that will have no overlap with a potential low-level API. This way the low level API can be explored separately, like in #310

The maybe function can also operate on character(len=:), allocatable it seems.

This one was chosen deliberately to be an internal implementation detail, i.e. it is not part of the public API.

Here is what I mean: awvwgk/stdlib_string#1

In that PR, I implemented a low level version of read_formatted called read_formatted0 that operates on character(len=:), allocatable. It seems to work.

@awvwgk this would be the high level API that operates on the string_type type.

@certik, the procedures in Sebastian's module are in fact equivalents of the intrinsic character procedures already available in Fortran:

    public :: len, len_trim, trim, index, scan, verify, repeat, adjustr, adjustl
    public :: lgt, lge, llt, lle, char, ichar, iachar
    public :: assignment(=)
    public :: operator(.gt.), operator(.ge.), operator(.lt.), operator(.le.)
    public :: operator(.eq.), operator(.ne.), operator(//)
    public :: write(formatted), write(unformatted)
    public :: read(formatted), read(unformatted)

The pull request #310 is the first to propose new procedures (reverse, to_title, to_upper, to_lower) which can operate both on the intrinsic character(len=*) variables, and in a later pull request also on a high-level string type (to be decided).

String processing in Fortran is not that bad, considering the number of procedures already there. If we could add casing, numeric to string convertors (and vice-versa), join and split, and perhaps a few more procedures, I think most usage cases would be covered.

@ivan-pi actually there is genuine new functionality, that I just extracted here: awvwgk/stdlib_string#1 (comment).

@certik I see, you are right the there is new functionality, the new_string_from_chars function extends beyond intrinsic functionality. I would suggest that I will submit it as a separate patch to the existing stdlib_ascii module and remove it from the stdlib_string_type until we have agreed on the low-level API. (new issue at #315)

Regarding the len functionality I find the argument is bit stretched, but I'm willing to follow it for the sake of the discussion. Eventually, the argument boils down to whether the utility function maybe should be exposed as public API. I think it is an implementation detail, because the specs as proposed do not define the internal representation of the character sequences, the string_type could use character(len=1), allocatable :: raw(:) instead of character(len=:), allocatable :: raw like in most iso_varying_string implementations.

The maybe, ok and err functionality from Rust come to mind here and would make a great addition for stdlib as well if we can successfully emulate this kind of behaviour. I don't feel confident that I got a good spin on this kind of features to make it stable enough for an actual addition to stdlib, therefore I don't want to force a maybe implementation for a character(len=:), allocatable yet.

However, I disagree on the low level API for user defined derived type input output, it is strictly a feature that can only be defined for a derived type but not an intrinsic and we won't be able to make use of it to safely read into a character(len=:), allocatable :: dlc with read(unit, *) dlc due to the construction of the Fortran standard.

The gist is, I don't want to introduce new functionality beyond the existing character(len=:), allocatable with the string_type on the first pass. Just one step at a time to allow better focus.

@awvwgk I just saw your comment, my comment here I think replies to yours: awvwgk/stdlib_string#1 (comment).

There is now also a branch at my stdlib fork. There is one really unfortunate thing here, GCC 7 and 8 do not support evaluation of user defined pure procedures in variable declarations. Adopting this string_type will inevitably drop support for GCC 7 and 8.

The solution is to adopt the iso_varying_string strategy to return a string_type instead of a fixed length character, which comes with its own problem that results from stdlib_string_type procedures now must be explicitly cast back to character form.

As promised in #320 (comment) I tried to devise an abstract base class (ABC) for an extendible string class. This one turned out much more difficult to design than a non-extendible functional string type, you can check the base class definition here:

https://github.com/awvwgk/stdlib_string/blob/string-class/src/stdlib_string_class.f90

The class is a bit more bloated than it has to be because I made it compatible with the intrinsic character type and the functional string type as well to ease testing.

One thing that turns out to be very difficult to account for are overloaded intrinsic procedures, you can find two implementation for each intrinsic procedure (except for the lexical comparison where I took a shortcut), one for the overloaded generic interface (len(string)) and a type bound implementation (string%get_len()), with the former invoking the latter. This was necessary to allow using the overloaded intrinsic procedure names while still relying on the runtime resolution of the type bound procedures from the object.

Another problem was returning a class polymorphic object from a procedure (operator(//) or trim), returning class(string_class), allocatable would force users to declare their string objects always as class polymorphic even if they want to use a specific implementation. Therefore, I decided to return a functional string_type instance instead and provide an assignment from string_type to a polymorphic string_class object to hide this fact effectively.

Since we have a whole lot of intrinsic character procedures implementing a string class based on this ABC can become tedious, therefore I designed the ABC to provide mock implementations based on the setter (assignment(=)) and getter (char(self)) functionality which can optionally be overwritten. Only the assignment from a character variable and the three char functionalities are actually deferred and must be provided in a minimal implementation.

While this is not a final specification yet, I wanted to share it as aid for discussion functional vs. object oriented implementation of a string in stdlib. From the above notes you might gather that a truly extendible string class could result in significant performance penalties for the user. Still there might be some value in having a string object available.

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

  • there is no assignment from string to character

    • reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS

If I understand things correctly, the assignment to character should be handled explicitly through the char function? I.e.

type(varying_string) :: varying   ! from iso_varying_string
type(string_type) :: nonfancy     ! from PR #320 
character(len=20) :: flc 
character(len=:), allocatable :: dlc

flc = varying ! works
dlc = varying ! fails, dlc needs to be allocated first

allocate(character(len=len(varying)) :: dlc)
dlc = varying ! works

flc = char(nonfancy) ! works
dlc = char(nonfancy) ! works
  • all procedures return a fixed length character rather than a string instance

    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

  • no support for get and put

    • reason: derived type IO is used instead

πŸ‘ This is better and more Fortranic IMO. put and get where borrowed from C.

  • all procedures return a fixed length character rather than a string instance

    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

None, because I had to reconsider this design choice due to missing compiler support.

Does the initial design choice (the one which breaks GCC 7 and 8 support) survive in any of the earlier commits on your private fork? I wonder if you could still pull it off, by moving the functions out of a module...

I still don't fully grasp how the implementation differed. Would for example the repeat(string, ncopies) accept a type(string_type), and use an overloaded pure len function to return a fixed-size result of size len(string)*ncopies?

In any case your pull request is a big step to make string-handling easier.

@ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way.

I updated my stdlib_string project with an abstract base class for a more object-oriented string implementation. As a demonstration of such a string_class I added @robertrueger's ftlString and @szaghi's StringiFor projects as examples to the repository, but based them on string_class rather than having them implement the intrinsic functions themselves. This could allow to make existing string libraries easily compatible with stdlib by allowing them to inherit from string_class (and they would also become compatible with each other).

Not sure if it was linked before, Clive Page wrote a nice summary about character types in Fortran: https://fortran.bcs.org/2015/suggestion_string_handling.pdf

There was also a thread over at the Fortran-FOSS programmers: Fortran-FOSS-Programmers/Fortran-202X-Proposals#4

A link was provided to a WG5 document, which talks about a print() function (page 9) similar to what we have now as to_string(): https://wg5-fortran.org/N1951-N2000/N1972.pdf