String handling routines #69

ivan-pi · 2020-01-02T19:19:33Z

Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:

split - given a separator, splits the string into some form of array
upper/lower - convert a character string to all upper/lower case

The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:

ISO_VARYING_STRING Module by Rich Townsend
iso_varying_string implementation by Brad Richardson (@everythingfunctional)
ISO_VARYING_STRING due to J.L.Schonfelder (author of the iso_varying_string proposal; the module dates back to 1998)

I also found the following Fortran libraries targeting string handling:

Strings For Fortran by Brad Richardson @everythingfunctional
Fortran Character String Utilities by George Benthien
M_STRINGS from Urban Jost @urbanjost (part of General-Purpose Fortran tools)
String_Functions by David Frank
StringiFor by Stefano Szaghi @szaghi
fortranString from @bceverly
fortran-string-utility-module by @tomedunn
fortran-string from @dongli
strings by @jchristopherson
fortran_libstring from @koiking213
fortran-strings by @eengl
flibs by @arjenmarkus contains several modules for handling strings
functional-fortran by @milancurcic implements several functions on strings
ZstdFortranLib by @zbeekman has some conversion to/from other intrinsic kinds, sub, gsub, split, join, and conversion on concatenation. WIP though
LibString, F77 string library by Giulio Vistoli & Alex Pedretti

It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations, this seems to be one of the areas where the absence of a standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes", to borrow a quote from B. Stroustrup's retrospective of the C++ language.

For comparison here are some links to descriptions of string handling functions in other programming languages:

Python: String Methods, String constants, custom string formatting, template strings
Ruby: String class & methods
D: std.string, std.utf, std.path, std.regex, std.ascii (see related issue Proposal for ascii #11), std.encoding, std.windows.charset, std.conv
C: C string handling on wikipedia
C++: std::string class, <string> header, C++ string handling on Wikipedia, Boost libraries for string and text processing
Julia: Strings, Common operations
MATLAB: Characters and Strings
Rust: Strings - Rust By Example, Primitive Type str, Struct std::string::String

Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric conversions, comparisons, finding occurrences of a string in a larger string, joining and splitting, regular expressions).

My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.
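As an illustration of how small these "easy" routines can be, here is a minimal sketch of prefix and suffix tests on character(len=*) arguments. The module and function names (demo_affix, starts_with, ends_with) are hypothetical and not taken from any of the libraries listed above; trailing blanks are treated as significant in this sketch, which is a design choice that would still need to be agreed on.

```fortran
module demo_affix
    implicit none
    private
    public :: starts_with, ends_with
contains

    !> True if `string` begins with `prefix` (hypothetical name).
    pure logical function starts_with(string, prefix) result(match)
        character(len=*), intent(in) :: string, prefix
        match = .false.
        ! A prefix longer than the string can never match.
        if (len(prefix) > len(string)) return
        match = string(1:len(prefix)) == prefix
    end function starts_with

    !> True if `string` ends with `suffix` (hypothetical name).
    pure logical function ends_with(string, suffix) result(match)
        character(len=*), intent(in) :: string, suffix
        integer :: offset
        match = .false.
        offset = len(string) - len(suffix)
        if (offset < 0) return
        ! Compare the last len(suffix) characters only.
        match = string(offset+1:) == suffix
    end function ends_with

end module demo_affix
```

With this sketch, starts_with("stdlib_ascii", "stdlib") is true and ends_with("module.f90", ".F90") is false, since the comparison is case-sensitive.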

jacobwilliams · 2020-01-02T19:27:57Z

This is a great summary! I feel like most of what we need has already been done in these projects (and others), so mainly we need to just gather it all together. Some important things to decide:

what do we call the string class? (I vote string).
what do we call the various individual methods?
Are they going to be OO (s.lower()) or functional (lower(s))?

everythingfunctional · 2020-01-02T19:32:25Z

Should we base it on the ISO_VARYING_STRING module? If so, the class is VARYING_STRING and the procedures are functional (lower(s)).

Should we reuse intrinsic function names, like real(string, [kind, [status]]), and have it stop if the conversion fails and no status variable is provided?

milancurcic · 2020-01-02T19:33:06Z

functional-fortran implements several functions on strings:

Further, these functions (and their corresponding operators) are compatible with character strings: complement, empty, head, init, intersection, insert, last, reverse, set, sort, split, tail, and union.

(Caution: split in functional-fortran is not quite what's been discussed at j3-fortran repo. It merely splits the string in two and returns the first or second part)

certik · 2020-01-02T19:37:41Z

Thanks for this initiative and listing the current landscape. I think we definitely want stdlib to have good string support.

(For conversion from real/integer numbers to strings, I implemented a function str to be used like this: https://github.com/certik/fortran-utils/blob/b43bd24cd421509a5bc6d3b9c3eeae8ce856ed88/tests/strings/test_str.f90, implemented here and here, so one can do things like "Number i = " // str(i) // ".".)

@ivan-pi do you want to go ahead and create a table of the basic subroutines and let's brainstorm how they should be named, to be consistent with other languages and/or the above various string implementations if possible. And also if they should be functions or subroutines and what arguments to accept.

certik · 2020-01-02T19:42:25Z

@jacobwilliams is right about raising the question how to represent the string. We should start with that.

I would recommend (as usual) to have a lowest level API that operates on the standard Fortran (allocatable where appropriate) character. Then, have a higher level API that operates on a string type, and simply calls the lower level API. Regarding a name, see #26, it seems most people agree that the convention to name derived type is to append _t, so it would be string_t.

That way people can use these low level API routines right away. For example in my codes I do not need to modify any data structures and can start using it. The higher level string_t API can then be used by codes that choose to refactor them, or in new codes. If the syntax is not as nice, some people might opt for the lower level API anyway.

everythingfunctional · 2020-01-02T20:05:41Z

I would vote that the low level API be based on functions, pure and elemental where possible and appropriate. I would stick with the Fortran convention of optional status parameters where there is the possibility of things going wrong, and if one is not provided and something goes wrong it crashes. I have tended to use that convention in any routines that go from a string to some intrinsic like:

if (present(status)) then
    read(string, *, iostat=status) result
else
    read(string, *) result
end if
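A minimal, self-contained sketch of this optional-status convention follows. The names demo_to_int and to_int are hypothetical, and the choice to halt with error stop when no status is supplied is an assumption about what "it crashes" should mean in practice.

```fortran
module demo_to_int
    implicit none
    private
    public :: to_int
contains

    !> Convert `string` to a default integer (hypothetical name).
    !> If `status` is present it receives the iostat value (0 on success);
    !> otherwise a failed conversion stops the program.
    function to_int(string, status) result(val)
        character(len=*), intent(in) :: string
        integer, intent(out), optional :: status
        integer :: val
        integer :: ios

        val = 0
        ! List-directed internal read; ios is nonzero on bad or empty input.
        read(string, *, iostat=ios) val

        if (present(status)) then
            status = ios
        else if (ios /= 0) then
            error stop "to_int: cannot convert string to integer"
        end if
    end function to_int

end module demo_to_int
```

A caller that wants to handle failure writes val = to_int(text, stat) and checks stat; a caller that omits stat accepts that invalid input terminates the program.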

Honestly, I thought the ISO_VARYING_STRING standard did a great job of covering all of the intrinsic functions available for character(len=*) variables, and extending IO to work with that type (put, get, put_line). Aside from the strange interface for split I think it's a great starting point.

ivan-pi · 2020-01-02T20:15:12Z

My plan was to go through the libraries above and create a table of the most commonly available routines in the next days.

I agree we should consider both low-level routines which work directly on strings of type character(len=*) and a high-level string_t type.

The book Fortran Tools for VAX/VMS and MS-DOS by Jones & Crabtree contains a description of a Fortran string-handling library. Interestingly, they decided to use null-terminated strings like in C, meaning they needed to build a separate set of functions from the intrinsic ones (concatenation operator // and length function). They later used these tools to develop a compiler for a subset of the Fortran language itself! Their conclusion about strings was:

Fortran is often maligned for its lack of facilities for character-oriented processing. ... The apparent deficiency of Fortran for string manipulation is primarily because of the methods traditionally used rather than because of a shortcoming of the language itself. The main shortcoming of Fortran for string handling is the lack of a standard library of routines for often-needed functions. As Fortran programmers we are faced with a choice: we either invest the up-front effort required to create our own standard library or we live with the continuing effort of hacking together a solution each time we are presented with similar problems.

certik · 2020-01-02T20:15:18Z

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

ivan-pi · 2020-01-02T20:17:03Z

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

I have linked three distinct implementations in the top post. The links from the gfortran compiler pages are dead as well as the link in Modern Fortran Explained by MCR.

Edit: An informal description of the iso_varying_string module for Varying Length Character Strings in Fortran can be found at: http://numat.net/fortran/is1539-2-99.html

certik · 2020-01-02T20:24:58Z

@ivan-pi thanks. I like your plan. It looks like the iso_varying_string is in the "high-level" API category, as it operates on a VARYING_STRING derived type. Our low-level API would be similar, but operating directly on character(len=*).

jacobwilliams · 2020-01-02T21:34:50Z

Building the low-level API on character(len=*) variables will be problematic for some operations, since they can't be resized. The high-level API will need to call routines that operate on character(len=:),allocatable variables. So you may end up with two slightly different routines in some cases. So are there really three APIs?

character(len=*)
character(len=:),allocatable
string_t or whatever we call it

That seems complicated to me... but it would cover all the bases...

milancurcic · 2020-01-02T21:54:55Z

I think there are two possible APIs here: intrinsic and derived-type one.

For the intrinsic API, character(len=*) works well for input strings. If the function returns a string of known size, you return a character(len=something) string. If the size is unknown, you return an allocated character(len=:), allocatable string. The user doesn't need to know which one it is.

I also see the intrinsic one as the starting point. Higher-level (derived type) implementation is likely to use the intrinsic API internally.

everythingfunctional · 2020-01-02T22:14:23Z

My understanding, and somebody correct me if I don't have this quite right, is that the ISO_VARYING_STRING standard was created before character(len=:), allocatable (around 2001 I think?), but then when character(len=:), allocatable was added to the standard, it was supposed to function like variable length strings, and so the former was mostly abandoned. However, I have found most compilers to be buggy with their implementation. Memory leaks when used as the return from a function, failure to properly reallocate on assignment, false-positive warnings about accessing uninitialized memory, etc.

If allocatable character actually worked we wouldn't need a new derived type for strings. You would just use the intrinsic type and move on. But I think as written in the standard, it probably will never truly work properly in all cases (especially in read statements, since other allocatable arrays don't and aren't supposed to).

If there is a new type for strings, I don't think a lower level library or API should be exposed, and it should probably not be based on allocatable characters.

certik · 2020-01-02T23:12:00Z

As I mentioned above, you can use this trick to return character(len=N) strings from functions. The downside is that the string operation gets executed twice --- once to compute the length in the pure procedure, and a second time to actually return it. So we probably don't want to do it that way. What I was thinking is to do what @milancurcic suggested: use character(len=*) as well as character(len=N) where we can, and use character(len=:),allocatable to avoid doing the operation twice as I described above. And that's the low level API. Below I provide two examples: #69 (comment) and #69 (comment), to show how one would decide whether to expose character(len=*) or character(len=:), allocatable.

As @everythingfunctional mentioned, for example GFortran used to have huge problems with allocatable strings and leaked memory. The latest version has improved a lot. Given that this is standard Fortran, and stdlib is a standard library, I think it is ok if we depend on the standard, and if there are compiler bugs, we'll try to workaround them and ensure they are reported. Regarding read statements, see #14 that would handle that. I think we should at least try to create a consistent low level API, not give up without even trying. If it truly cannot be done, only then we'll have to do what you propose, and only expose the string_t type and report the bugs to compilers (and keep the list somewhere) and propose improvements to the language itself, so that it can be done in the future.

everythingfunctional · 2020-01-02T23:29:07Z

I thought you could only use intrinsic procedures in variable declaration statements. Learned something new. That's a neat trick, but like you said, not particularly efficient.

certik · 2020-01-02T23:33:29Z

Let's discuss a simple example: upcase.

character(*)

Here is an implementation:

function upcase(s) result(t)
! Returns string 's' in uppercase  
character(*), intent(in) :: s
character(len(s)) :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

When the user wants to use it, he could do this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
allocate(character(len(s)) :: a)
a = upcase(s)
print *, a

which prints:

 Some string
 SOME STRING

The main disadvantage of this approach is that the user needs to know the size ahead of time. In this case he knows: it's the same size as the original string. However, modern gfortran has automatic (re)allocation of the left-hand side turned on by default, so the explicit allocate can be dropped and just this works:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

So I think that would work for upcase.

character(:), allocatable

Here is the implementation using character(:), allocatable

function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(:), allocatable :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

It's still used like this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

But since this has an extra allocation inside upcase, I would think that in this case, the character(*) version is better.

certik · 2020-01-02T23:49:42Z

Now let's discuss integer to string conversion, the two implementations:

character(*)

pure integer function str_int_len(i) result(sz)
! Returns the length of the string representation of 'i'
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: s
! If 's' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(s, '(i0)') i
sz = len_trim(s)
end function

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
character(len=str_int_len(i)) :: s
write(s, '(i0)') i
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

character(:), allocatable

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: tmp
character(:), allocatable :: s
! If 'tmp' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(tmp, '(i0)') i
s = trim(tmp)
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

Discussion

Unlike in the upcase (see previous comment), here the character(*) version is converting twice, so it is inefficient. The character(:), allocatable version just converts once, and so that would be the preferable API.

(Note: if we implement our own integer to string conversion algorithm, then we avoid the ugly MAX_STR thing and the need to call trim. The above implementation was reused from my codes, where I just use the Fortran intrinsic conversion as part of write so that I save code.)

certik · 2020-01-03T00:08:44Z

Here is my proposal for the low level API:

1. Use character(*) where possible and efficient (see the previous two comments for examples of how to decide).
2. Use character(:), allocatable otherwise.
3. For compilers that cannot compile the code (or leak memory): create a workaround subroutine with a different (less nice or less efficient) API and use that instead for those compilers only, and report the compiler bug and reference it in the code. As a community we have contacts to compiler vendors, and we can communicate this and help get this fixed. The long term goal would be to eventually have no workarounds in 3.

Unfortunately some compilers might leak memory or segfault when such strings are used in derived types. Ultimately, long term, the compilers must be fixed. That's why I think the above proposal is a good one for the long term. In the short term, if we want to provide strings to users that actually work in all today's compilers, it might be that the only way is to create a string_t type not based on allocatable strings, in which case one could still use the low level API with the workarounds 1., 2., and 3., but make a copy of the result into the derived type string_t that is internally represented differently, so that today's compilers do not leak memory. That would be less efficient than providing a separate string implementation, but it's only a short term issue anyway, until compilers catch up. (Alternatively we can have an efficient duplicate implementation based on the internal string_t representation directly if we want better performance until compilers catch up.)

ivan-pi · 2020-01-03T12:48:29Z

Specifically for the case of integer to string conversion, you could also size the buffer from the integer kind (range(i)+2 characters is enough for all digits plus a sign) and then trim the result into an allocatable character string:

    function integer_to_string2(i) result(res)
      character(len=:),allocatable :: res
      integer, intent(in) :: i
      character(len=range(i)+2) :: tmp
      write(tmp,'(i0)') i
      res = trim(tmp)
    end function

If we want to avoid internal I/O this function becomes something like

    function integer_to_string1(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        integer, parameter :: ibuffer_len = range(ival)+2
        character(len=ibuffer_len) :: buffer
        integer :: i, sign, n

        if (ival == 0) then
            str = '0'
            return
        end if

        sign = 1
        if (ival < 0) sign = -1

        n = abs(ival)
        buffer = ""

        i = ibuffer_len
        do while (n > 0)
            buffer(i:i) = char(mod(n,10) + ichar('0'))
            n = n/10
            i = i - 1
        end do
        if (sign == -1) then
            buffer(i:i) = '-'
            i = i - 1
        end if

        str = buffer(i+1:ibuffer_len)
    end function
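One caveat with the digit loop above: abs(ival) is not representable for the most negative integer of a kind (for example -2147483648 in a 32-bit integer), so that input is not handled correctly. The following is a hedged variant, not the code from any of the listed libraries, that accumulates digits from the non-positive value instead; it relies on the Fortran rule that mod(a, p) takes the sign of a. The names demo_int_to_str and int_to_str are hypothetical.

```fortran
module demo_int_to_str
    implicit none
    private
    public :: int_to_str
contains

    !> Convert a default integer to a string without internal I/O
    !> (hypothetical name; a sketch, not a tuned implementation).
    pure function int_to_str(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        ! range(ival)+1 digits at most, plus one character for the sign.
        character(len=range(ival)+2) :: buffer
        integer :: n, pos

        ! Work with the non-positive value: negating a positive number
        ! never overflows, whereas abs() of the most negative one does.
        n = ival
        if (n > 0) n = -n

        pos = len(buffer)
        do
            ! mod(n, 10) lies in -9..0 here, so subtract it to get the digit.
            buffer(pos:pos) = achar(iachar('0') - mod(n, 10))
            pos = pos - 1
            n = n / 10
            if (n == 0) exit
        end do

        if (ival < 0) then
            buffer(pos:pos) = '-'
            pos = pos - 1
        end if

        str = buffer(pos+1:)
    end function int_to_str

end module demo_int_to_str
```

Because the loop always runs at least once, zero needs no special case and yields "0".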

For processing floating point values the functions are much more difficult to develop compared to those using internal read and write statements.

ivan-pi · 2020-01-03T16:57:10Z

I did some keyword searches in the list of popular Fortran projects. It seems that most projects use their own set of character conversion and string handling routines for stuff like reading input values from files, parsing command line options, defining settings, etc.

Here are the results of my search of some of the top projects:

Project	# of "string"	# of "character"	# of Fortran files
ElmerFEM	248	1319	2076
WRF	306	966	1668
fds	16	28	41
quantum-Espresso	66	472	1516
fluidity	38	279	747
json-fortran	26	47	49
fortranlib	11	18	38
Nek5000	54	204	336
cp2k	439	1043	1132
nastran-95	85	551	1838
specfem3d	186	404	765
nwchem	323	2768	17214
gtk-fortran	59	77	92
cfl3d	14	216	397
shtools	2	20	113
arpack-ng	1	259	332

The second and third columns give the number of Fortran files that contain the keywords string or character, respectively. This includes both statements and comments, so it may be a bit misleading.

In one of the codebases I even found this comment:

    ! String parsing in Fortran
    ! is such a pain
    ! it's unreal

ivan-pi · 2020-01-03T18:01:21Z

Casing

The purpose of these functions is to return a copy of a character string (either character(len=*) or a derived string type) with the case converted. The common variants are uppercase, lowercase, and titlecase.

The libraries cited in the first post contain the following function prototypes:

! functional
function str_upper(str)
function str_lower(str)
function str_swapcase(str)
pure function ucase(input)
pure function lcase(input)
function str_lowercase(str)
function str_uppercase(str)
subroutine str_convert_to_lowercase(str)
subroutine str_convert_to_uppercase(str)
pure elemental function lowercase_string(str)
function uppercase(str)
function lowercase(str)

! object-oriented
procedure, pass(self) :: camelcase
procedure, pass(self) :: capitalize
procedure, pass(self) :: lower
procedure, pass(self) :: snakecase
procedure, pass(self) :: startcase
procedure, pass(self) :: upper
function vstring_tolower(this[,first,last])
function vstring_toupper(this[,first,last])
function vstring_totitle(this[,first,last])

Some versions return a new string, while some work in place. At least one of the functions does not convert the case of characters enclosed between quotation marks.

These are the similar functions available in other programming languages:

Python: capitalize, lower, swapcase, upper
Ruby: capitalize, swapcase, upcase, downcase
D: toLower, toLowerInPlace, toUpper, toUpperInPlace, asCapitalized, asLowerCase, asUpperCase
MATLAB: lower, upper
Julia: uppercase, lowercase, titlecase, uppercasefirst, lowercasefirst
C++ (Boost): to_upper, to_lower
Rust: to_uppercase, to_ascii_uppercase, to_lowercase, to_ascii_lowercase, make_ascii_uppercase, make_ascii_lowercase

My top three name picks are:

uppercase/lowercase/titlecase
to_upper/to_lower/to_title
upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

milancurcic · 2020-01-03T18:04:30Z

I'd like to add to the list of facilities here the overloaded operator * between integers and strings, so that you can do, like in Python:

print *, 3 * 'hello' ! prints 'hellohellohello'
print *, 'world' * 2 ! prints 'worldworld'

It's easy to make and use. The only downside I can think of is a somewhat weird API when importing it:

use stdlib_experimental_strings, only: operator(*)
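A minimal sketch of how such an overload could be written on top of the intrinsic repeat function is shown below. The module name demo_repeat_op and the two specific procedure names are hypothetical, and treating a negative count as zero copies is an assumption of this sketch.

```fortran
module demo_repeat_op
    implicit none
    private
    public :: operator(*)

    ! Extend * to integer/character operand pairs, for which the
    ! intrinsic operator has no meaning.
    interface operator(*)
        module procedure int_times_char, char_times_int
    end interface

contains

    !> n * s: `s` repeated n times (negative n gives an empty string).
    pure function int_times_char(n, s) result(r)
        integer, intent(in) :: n
        character(len=*), intent(in) :: s
        character(len=max(n, 0)*len(s)) :: r
        r = repeat(s, max(n, 0))
    end function int_times_char

    !> s * n: same result with the operands swapped.
    pure function char_times_int(s, n) result(r)
        character(len=*), intent(in) :: s
        integer, intent(in) :: n
        character(len=max(n, 0)*len(s)) :: r
        r = repeat(s, max(n, 0))
    end function char_times_int

end module demo_repeat_op
```

Both functions declare a result of known length, so no allocatable result is involved.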

ivan-pi · 2020-01-03T18:23:25Z

Yes, I have seen this kind of usage in one of the above mentioned libraries. I wonder whether it would be better to promote the usage of the intrinsic repeat function instead. As the Zen of Python states: There should be one-- and preferably only one --obvious way to do it.

A benefit of repeat is precisely that you avoid the import statement.

milancurcic · 2020-01-03T19:31:26Z

Oops, I didn't know about repeat. Indeed it's the way to go so I withdraw my proposal. I need to brush up on my canonical Fortran. :)

zbeekman · 2020-01-03T20:34:32Z

It is really hard to have a day job and keep up with all these threads, so my apologies if I've missed something because I'm just skimming here. A few opinionated notes:

Ruby is my favorite language for string processing, and IMO is the best at it. If no one objects (especially @ivan-pi) I'll put links in the first post on this issue
I have focussed mostly on some basic string handling (expanded template here) in my ZstdFortranLib project which I started only days before this project started, so I'm of a mind to possibly abandon it or work to integrate some of it here. It has:
- split: returns an array of characters where each entry is as long as the longest one
- join: joins an array of characters
- sub: from Ruby, replaces the first occurrence of a substring with a substitution, optional argument to replace the last occurrence
- gsub: Replaces all occurrences with a substring
- //: Overloaded to allow concatenation of all real, logical, integer kinds with all character kinds
- to_i: Convert to integer (all kinds)
- to_r: Convert to real (all kinds)
- to_l: Convert to logical (all kinds)
- to_s: Convert anything to a string (all kinds)
- ANSI formatting stuff that can be turned on/off globally for coloring and styling terminal output (thanks to using one of @szaghi's projects, FACE I believe)
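To make the split/join behaviour described in the list above concrete, here is a hedged sketch that is not taken from ZstdFortranLib: a split that returns an array whose entries are all as long as the longest field, and a join that trims each entry again. The names demo_split, split_padded and join_trimmed are hypothetical, the separator is assumed to be a single character, and split is written as a subroutine with an intent(out) argument to sidestep the allocatable-character function-result problems mentioned earlier in this thread.

```fortran
module demo_split
    implicit none
    private
    public :: split_padded, join_trimmed
contains

    !> Split `string` at every `sep`; all entries of `parts` share the
    !> length of the longest field and shorter fields are blank-padded.
    pure subroutine split_padded(string, sep, parts)
        character(len=*), intent(in) :: string
        character(len=1), intent(in) :: sep
        character(len=:), allocatable, intent(out) :: parts(:)
        integer :: i, k, n, start, maxlen

        ! First pass: count the fields and find the longest one.
        n = 1
        maxlen = 0
        start = 1
        do i = 1, len(string)
            if (string(i:i) == sep) then
                maxlen = max(maxlen, i - start)
                n = n + 1
                start = i + 1
            end if
        end do
        maxlen = max(maxlen, len(string) - start + 1)

        allocate(character(len=maxlen) :: parts(n))

        ! Second pass: copy each field; assignment pads with blanks.
        k = 1
        start = 1
        do i = 1, len(string)
            if (string(i:i) == sep) then
                parts(k) = string(start:i-1)
                k = k + 1
                start = i + 1
            end if
        end do
        parts(k) = string(start:)
    end subroutine split_padded

    !> Join the trimmed entries of `parts`, separated by `sep`.
    pure function join_trimmed(parts, sep) result(string)
        character(len=*), intent(in) :: parts(:)
        character(len=*), intent(in) :: sep
        character(len=:), allocatable :: string
        integer :: k

        string = ""
        do k = 1, size(parts)
            if (k > 1) string = string // sep
            string = string // trim(parts(k))
        end do
    end function join_trimmed

end module demo_split
```

The padding is the awkward part of working on raw character arrays: after splitting, a field that genuinely ended in blanks can no longer be told apart from a padded one, which is one argument for a derived string type.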

I need to look at the varying string and character array proposals in more detail.

FWIW, I personally prefer the Ruby/Python OO approach with methods because it will make import statements much simpler: pull in the string class and you get all the methods along with the type/class declaration. Some operators may need to be pulled in as well if you want to be able to concatenate a real (lhs) with a string (rhs; you can't have a TBP operator to the left of the object, IIRC).

I was thinking of starting a PR marrying my work on ZstdFortranLib with a UDT/Class approach rather than operating on raw character scalars and arrays which is awkward for things like split(). But now I need to catch up on the myriad of proposals and prior art, so don't hold your breath.

zbeekman · 2020-01-03T20:37:13Z

While there is an intrinsic implementation, repeat(), I still like the more concise syntax which has pretty clear meaning for anyone who has ever worked with languages like Python and Ruby. It would be nice if some of these syntactic sugar items were added to the standard rather than a standard library. But until then I would be happy with * overloaded for characters.

certik · 2020-01-03T20:42:17Z

@zbeekman I am struggling with all the threads also, but that is good news. It means there is lots of momentum. If you can help us design a good low and high level API for strings (#69 (comment)), that would be great.

zbeekman · 2020-01-03T20:45:47Z

@certik k

Now let's discuss integer to string conversion, the two implementations:

I like your first one the most. With integers you can use some math to count up how many digits there are, and if you need a sign on the front, which completely removes the need to declare the max string length AND to do the IO twice. Instead you use integer and floating point math which (hopefully) will be reasonably quick. IIRC, I implemented something to do this in JSON-Fortran but I'll have to look for it.

Also, I don't mean to whine about not being able to keep up, and I agree that it's good, but it's hard to keep track of all the balls in the air.

zbeekman · 2020-01-03T20:58:56Z

After a brief search there are at least 3 ways to do this without performing the conversion to a string then counting digits:

1. Iteration:

len = 1
if ( n < 0 ) len = 2
do
  n = n / 10
  if (n == 0) exit
  len = len + 1
end do

2. Tail recursion (same algorithm as above: it will either be optimized to the code above by the compiler, or it will be slower if the compiler uses recursion)

3. "One shot" method using log10

len = floor( log10( real( abs( n ) ) ) + 1 )
if ( n < 0 ) len = len + 1

I would guess that 1 is the fastest way to do this, but it may depend on the compiler and hardware. 3 has conversion to a real, then log10 is probably computed iteratively, and it is converted back to an int, so 1. may be faster despite the loop.
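The iterative variant can be wrapped into a pure function that leaves its argument untouched, as in the sketch below (the names demo_int_len and int_str_len are hypothetical). Two caveats about the log10 variant are worth keeping in mind when comparing: log10 is undefined for n = 0, and converting to default real can round values just below a power of ten upward, which would overcount the digits.

```fortran
module demo_int_len
    implicit none
    private
    public :: int_str_len
contains

    !> Number of characters needed to print `n` with the i0 edit
    !> descriptor, including a leading minus sign (hypothetical name).
    pure integer function int_str_len(n) result(length)
        integer, intent(in) :: n
        integer :: m

        ! One digit at least, plus one character for the sign.
        length = 1
        if (n < 0) length = 2

        ! Integer division truncates toward zero, so this also works
        ! for negative values without calling abs().
        m = n
        do
            m = m / 10
            if (m == 0) exit
            length = length + 1
        end do
    end function int_str_len

end module demo_int_len
```

Such a function could serve as the length expression in a declaration like character(len=int_str_len(i)), replacing the internal write used to measure the length in the str_int_len example earlier in the thread.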

awvwgk · 2021-02-04T21:56:45Z

After reading through this thread I found a subtle issue with the proposed low-level API for character(len=*) variables.

My top three name picks are:

uppercase/lowercase/titlecase

to_upper/to_lower/to_title

upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

Taking just the basic functionality mentioned by @ivan-pi here, I implemented a stdlib_character module as a demonstration in #310. The issue quickly becomes apparent once you try to use both stdlib_character and stdlib_ascii in the same scope.

awvwgk · 2021-02-05T21:35:19Z

I created an exploratory implementation of functional string handling at awvwgk/stdlib_string as an fpm project. A non-fancy string type is implemented there, which basically provides the same functionality as a deferred length character but can be used in an elemental rather than a pure way. The idea is to have a scaffold for the string type in stdlib which can be extended later but already provides everything we are used to having from the deferred length character without the rough edges.

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main differences from iso_varying_string are:

there is no assignment from string to character
- reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS
all procedures return a fixed length character rather than a string instance
- reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
- drawback: assigning the return value to a string might create a temporary variable on the stack
no support for get and put
- reason: derived type IO is used instead
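To illustrate what "elemental rather than pure" buys, here is a heavily reduced sketch of such a wrapper type. It is not the code from awvwgk/stdlib_string; the module name demo_string_type, the component name raw and the specific procedure names are assumptions made for illustration. It follows the points listed above: assignment is defined from character to string only, and char returns a fixed length character rather than a string instance.

```fortran
module demo_string_type
    implicit none
    private
    public :: string_type, len, char, assignment(=)

    !> Thin wrapper around a deferred length character.
    type :: string_type
        private
        character(len=:), allocatable :: raw
    end type string_type

    ! Extend the intrinsic generics so they also accept string_type.
    interface len
        module procedure len_string
    end interface
    interface char
        module procedure char_string
    end interface
    interface assignment(=)
        module procedure assign_string_char
    end interface

contains

    !> Elemental, so it applies to arrays of strings of differing length.
    elemental function len_string(string) result(length)
        type(string_type), intent(in) :: string
        integer :: length
        length = 0
        if (allocated(string%raw)) length = len(string%raw)
    end function len_string

    !> Return the content as a fixed length character.
    pure function char_string(string) result(chars)
        type(string_type), intent(in) :: string
        character(len=len(string)) :: chars
        chars = ""
        if (allocated(string%raw)) chars = string%raw
    end function char_string

    !> Assignment from character to string; the reverse is not defined.
    elemental subroutine assign_string_char(lhs, rhs)
        type(string_type), intent(inout) :: lhs
        character(len=*), intent(in) :: rhs
        lhs%raw = rhs
    end subroutine assign_string_char

end module demo_string_type
```

With this, an array such as type(string_type) :: words(2) can hold entries of different lengths, and len(words) returns the length of each entry, which a plain character array cannot express.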

certik · 2021-02-10T16:45:41Z

@awvwgk this would be the high level API that operates on the string_type type.

How would a low level API look? Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

The maybe function can also operate on character(len=:), allocatable it seems. So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

awvwgk · 2021-02-10T16:58:01Z

Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

Bad example, the read_formatted procedure defines a user defined derived type input (see #312), which cannot be defined for character(len=:), allocatable since there is already an intrinsic formatted read transfer for character(len=*) types defined.

this would be the high level API that operates on the string_type type.

The idea so far was to provide the intrinsic low level API for a string type, on which later the high level API can be defined.

So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

Exactly, I wanted to explore a common basis of agreed on functions for a future high level string object. The minimal agreed on basis should be easily all the intrinsic procedures defined for character(len=*).

How would a low level API look?

I decided to pick the part of the high-level API that will have no overlap with a potential low-level API. This way the low level API can be explored separately, like in #310

> The maybe function can also operate on character(len=:), allocatable it seems.

This one was chosen deliberately to be an internal implementation detail, i.e. it is not part of the public API.

certik · 2021-02-10T17:36:44Z

Here is what I mean: awvwgk/stdlib_string#1

In that PR, I implemented a low level version of read_formatted called read_formatted0 that operates on character(len=:), allocatable. It seems to work.
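For readers following along, here is a minimal sketch of what a low-level line reader on character(len=:), allocatable can look like. It is not the code from the linked PR: the name read_line, the 64-character buffer, and the scratch-file test are assumptions made for illustration. It uses non-advancing reads to grow the result chunk by chunk.

```fortran
module demo_read_line
  implicit none
  private
  public :: read_line   ! hypothetical name, not taken from the PR
contains
  !> Read one complete record from a formatted unit into a deferred-length character.
  subroutine read_line(unit, line, iostat)
    integer, intent(in) :: unit
    character(len=:), allocatable, intent(out) :: line
    integer, intent(out) :: iostat
    character(len=64) :: buffer   ! chunk size is an arbitrary choice
    integer :: nread

    line = ''
    do
      ! Non-advancing read: consumes at most len(buffer) characters of the record
      read(unit, '(a)', advance='no', iostat=iostat, size=nread) buffer
      if (iostat > 0 .or. is_iostat_end(iostat)) exit   ! error or end of file
      line = line // buffer(:nread)
      if (is_iostat_eor(iostat)) then   ! end of record: the whole line was consumed
        iostat = 0
        exit
      end if
    end do
  end subroutine read_line
end module demo_read_line

program demo
  use demo_read_line, only: read_line
  implicit none
  integer :: unit, stat
  character(len=:), allocatable :: line

  ! Write a record longer than the internal buffer, then read it back
  open(newunit=unit, status='scratch', action='readwrite')
  write(unit, '(a)') repeat('x', 100) // ' end'
  rewind(unit)
  call read_line(unit, line, stat)
  if (stat /= 0 .or. len(line) /= 104) error stop 'read_line failed'
  print '(a, i0)', 'characters read: ', len(line)
  close(unit)
end program demo
```

Note that this is an ordinary subroutine call; it cannot be hooked into read(unit, *) for an intrinsic character variable, which is the point raised about derived type I/O above.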

ivan-pi · 2021-02-10T17:43:38Z

> @awvwgk this would be the high level API that operates on the string_type type.

@certik, the procedures in Sebastian's module are in fact equivalents of the intrinsic character procedures already available in Fortran:

    public :: len, len_trim, trim, index, scan, verify, repeat, adjustr, adjustl
    public :: lgt, lge, llt, lle, char, ichar, iachar
    public :: assignment(=)
    public :: operator(.gt.), operator(.ge.), operator(.lt.), operator(.le.)
    public :: operator(.eq.), operator(.ne.), operator(//)
    public :: write(formatted), write(unformatted)
    public :: read(formatted), read(unformatted)

The pull request #310 is the first to propose new procedures (reverse, to_title, to_upper, to_lower) which can operate both on the intrinsic character(len=*) variables, and in a later pull request also on a high-level string type (to be decided).

String processing in Fortran is not that bad, considering the number of procedures already there. If we could add casing, numeric-to-string converters (and vice versa), join and split, and perhaps a few more procedures, I think most use cases would be covered.
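To make the kind of procedure meant here concrete, below is a minimal sketch of two such helpers operating directly on intrinsic character(len=*) values. The names to_upper and reverse are taken from the list above, but the bodies are my own illustration (ASCII only) and not the code proposed in #310.

```fortran
module demo_casing
  implicit none
  private
  public :: to_upper, reverse
contains
  !> Convert ASCII lowercase letters to uppercase; other characters are unchanged.
  pure function to_upper(string) result(upper)
    character(len=*), intent(in) :: string
    character(len=len(string)) :: upper
    integer, parameter :: offset = iachar('A') - iachar('a')
    integer :: i, code
    do i = 1, len(string)
      code = iachar(string(i:i))
      if (code >= iachar('a') .and. code <= iachar('z')) then
        upper(i:i) = achar(code + offset)
      else
        upper(i:i) = string(i:i)
      end if
    end do
  end function to_upper

  !> Return the characters of the input in reverse order.
  pure function reverse(string) result(reversed)
    character(len=*), intent(in) :: string
    character(len=len(string)) :: reversed
    integer :: i, n
    n = len(string)
    do i = 1, n
      reversed(n - i + 1:n - i + 1) = string(i:i)
    end do
  end function reverse
end module demo_casing

program demo
  use demo_casing, only: to_upper, reverse
  implicit none
  if (to_upper('Fortran 2008') /= 'FORTRAN 2008') error stop 'to_upper failed'
  if (reverse('abc') /= 'cba') error stop 'reverse failed'
  print '(a)', to_upper('stdlib'), reverse('stdlib')
end program demo
```

Because both functions return character(len=len(string)), they only rely on the intrinsic len in the declaration and need no string type at all.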

certik · 2021-02-10T18:52:55Z

@ivan-pi actually there is genuine new functionality, that I just extracted here: awvwgk/stdlib_string#1 (comment).

awvwgk · 2021-02-10T19:23:50Z

@certik I see, you are right that there is new functionality: the new_string_from_chars function extends beyond intrinsic functionality. I suggest that I submit it as a separate patch to the existing stdlib_ascii module and remove it from stdlib_string_type until we have agreed on the low-level API (new issue at #315).

Regarding the len functionality, I find the argument a bit stretched, but I'm willing to follow it for the sake of the discussion. Eventually, the argument boils down to whether the utility function maybe should be exposed as public API. I think it is an implementation detail, because the specs as proposed do not define the internal representation of the character sequences: the string_type could use character(len=1), allocatable :: raw(:) instead of character(len=:), allocatable :: raw like in most iso_varying_string implementations.
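A small sketch of why the internal representation can stay hidden: both layouts mentioned above can sit behind the same overloaded len. The type names string_a and string_b and the constructors new_a and new_b are hypothetical and exist only for this illustration.

```fortran
module demo_representation
  implicit none
  private
  public :: string_a, string_b, new_a, new_b, len

  !> Representation 1: a single deferred-length character
  type :: string_a
    private
    character(len=:), allocatable :: raw
  end type string_a

  !> Representation 2: an array of single characters
  type :: string_b
    private
    character(len=1), allocatable :: raw(:)
  end type string_b

  !> Extend the intrinsic generic; the intrinsic len stays available for characters
  interface len
    module procedure :: len_a, len_b
  end interface len
contains
  pure function new_a(chars) result(string)
    character(len=*), intent(in) :: chars
    type(string_a) :: string
    string%raw = chars
  end function new_a

  pure function new_b(chars) result(string)
    character(len=*), intent(in) :: chars
    type(string_b) :: string
    integer :: i
    allocate(string%raw(len(chars)))
    do i = 1, len(chars)
      string%raw(i) = chars(i:i)
    end do
  end function new_b

  elemental function len_a(string) result(length)
    type(string_a), intent(in) :: string
    integer :: length
    length = 0   ! an unallocated string counts as empty
    if (allocated(string%raw)) length = len(string%raw)
  end function len_a

  elemental function len_b(string) result(length)
    type(string_b), intent(in) :: string
    integer :: length
    length = 0
    if (allocated(string%raw)) length = size(string%raw)
  end function len_b
end module demo_representation

program demo
  use demo_representation
  implicit none
  type(string_a) :: a
  type(string_b) :: b
  a = new_a('stdlib')
  b = new_b('stdlib')
  if (len(a) /= 6 .or. len(b) /= 6) error stop 'unexpected length'
  print '(i0, 1x, i0)', len(a), len(b)
end program demo
```

The "unallocated counts as empty" branch is the part a maybe-style helper would factor out; whether that helper is public does not change what callers of len see.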

The maybe, ok and err functionality from Rust comes to mind here and would make a great addition to stdlib as well, if we can successfully emulate this kind of behaviour. I don't feel confident that I have a good enough handle on this kind of feature to make it stable enough for an actual addition to stdlib, therefore I don't want to force a maybe implementation for character(len=:), allocatable yet.

However, I disagree on the low-level API for user-defined derived type input/output. It is strictly a feature that can only be defined for a derived type, not for an intrinsic type, and because of how the Fortran standard is constructed we won't be able to use it to safely read into a character(len=:), allocatable :: dlc with read(unit, *) dlc.

The gist is, I don't want to introduce new functionality beyond the existing character(len=:), allocatable with the string_type on the first pass. Just one step at a time to allow better focus.

certik · 2021-02-10T20:57:37Z

@awvwgk I just saw your comment, my comment here I think replies to yours: awvwgk/stdlib_string#1 (comment).

awvwgk · 2021-02-13T10:38:26Z

There is now also a branch at my stdlib fork. There is one really unfortunate thing here, GCC 7 and 8 do not support evaluation of user defined pure procedures in variable declarations. Adopting this string_type will inevitably drop support for GCC 7 and 8.

The solution is to adopt the iso_varying_string strategy of returning a string_type instead of a fixed-length character. This comes with its own problem: results from stdlib_string_type procedures must now be explicitly cast back to character form.
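For context, this is a minimal sketch of the pattern in question: a function whose fixed-length character result is declared using a user-defined pure function (here the overloaded len) in the declaration. The type name string_t and the procedure bodies are assumptions for illustration, not the code in the branch; according to the comment above, it is this kind of declaration that GCC 7 and 8 reject.

```fortran
module demo_fixed_length_results
  implicit none
  private
  public :: string_t, len, char, repeat

  type :: string_t
    character(len=:), allocatable :: raw
  end type string_t

  interface len
    module procedure :: len_string
  end interface len

  interface char
    module procedure :: char_string
  end interface char

  interface repeat
    module procedure :: repeat_string
  end interface repeat
contains
  elemental function len_string(string) result(length)
    type(string_t), intent(in) :: string
    integer :: length
    length = 0
    if (allocated(string%raw)) length = len(string%raw)
  end function len_string

  pure function char_string(string) result(chars)
    type(string_t), intent(in) :: string
    ! The result length is computed by the user-defined len_string
    character(len=len(string)) :: chars
    if (allocated(string%raw)) then
      chars = string%raw
    else
      chars = ''
    end if
  end function char_string

  pure function repeat_string(string, ncopies) result(chars)
    type(string_t), intent(in) :: string
    integer, intent(in) :: ncopies
    ! Fixed-length result sized with the overloaded len in the declaration
    character(len=len(string)*ncopies) :: chars
    chars = repeat(char(string), ncopies)
  end function repeat_string
end module demo_fixed_length_results

program demo
  use demo_fixed_length_results
  implicit none
  type(string_t) :: string
  character(len=:), allocatable :: dlc

  string%raw = 'ab'
  dlc = repeat(string, 3)   ! plain character result, no cast needed
  if (len(dlc) /= 6 .or. dlc /= 'ababab') error stop 'repeat failed'
  print '(a)', dlc
end program demo
```

With the alternative strategy, repeat would return a string_t and the last assignment would need an explicit char(...) around it.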

awvwgk · 2021-02-16T23:23:06Z

As promised in #320 (comment) I tried to devise an abstract base class (ABC) for an extendible string class. This one turned out much more difficult to design than a non-extendible functional string type, you can check the base class definition here:

https://github.com/awvwgk/stdlib_string/blob/string-class/src/stdlib_string_class.f90

The class is a bit more bloated than it has to be because I made it compatible with the intrinsic character type and the functional string type as well to ease testing.

One thing that turns out to be very difficult to account for is overloaded intrinsic procedures. You can find two implementations for each intrinsic procedure (except for the lexical comparison, where I took a shortcut): one for the overloaded generic interface (len(string)) and a type-bound implementation (string%get_len()), with the former invoking the latter. This was necessary to allow using the overloaded intrinsic procedure names while still relying on the runtime resolution of the type-bound procedures from the object.

Another problem was returning a class polymorphic object from a procedure (operator(//) or trim): returning class(string_class), allocatable would force users to always declare their string objects as class polymorphic, even if they want to use a specific implementation. Therefore, I decided to return a functional string_type instance instead and provide an assignment from string_type to a polymorphic string_class object to hide this fact effectively.

Since we have a whole lot of intrinsic character procedures, implementing a string class based on this ABC can become tedious. Therefore I designed the ABC to provide mock implementations based on the setter (assignment(=)) and getter (char(self)) functionality, which can optionally be overridden. Only the assignment from a character variable and the three char functionalities are actually deferred and must be provided in a minimal implementation.

While this is not a final specification yet, I wanted to share it as an aid for discussing a functional vs. object-oriented implementation of a string in stdlib. From the above notes you might gather that a truly extendible string class could result in significant performance penalties for the user. Still, there might be some value in having a string object available.
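The structure described above can be sketched in a few lines. This is a heavily reduced illustration, not the linked base class: it keeps one deferred getter and one deferred setter, a default get_len built on the getter, and a generic len that forwards to the type-bound procedure. All names except string_class, len and get_len are hypothetical.

```fortran
module demo_string_class
  implicit none
  private
  public :: string_class, simple_string, len

  !> Abstract base: only the getter and the setter are deferred
  type, abstract :: string_class
  contains
    procedure(get_char_iface), deferred :: get_char
    procedure(set_char_iface), deferred :: set_char
    generic :: assignment(=) => set_char
    procedure :: get_len   ! default implementation, may be overridden
  end type string_class

  abstract interface
    pure function get_char_iface(self) result(chars)
      import :: string_class
      class(string_class), intent(in) :: self
      character(len=:), allocatable :: chars
    end function get_char_iface
    subroutine set_char_iface(self, chars)
      import :: string_class
      class(string_class), intent(inout) :: self
      character(len=*), intent(in) :: chars
    end subroutine set_char_iface
  end interface

  !> Overloaded intrinsic name, resolved at compile time
  interface len
    module procedure :: len_object
  end interface len

  !> Minimal concrete implementation: provides only the deferred procedures
  type, extends(string_class) :: simple_string
    character(len=:), allocatable :: raw
  contains
    procedure :: get_char => simple_get_char
    procedure :: set_char => simple_set_char
  end type simple_string
contains
  !> Default length, computed through the getter
  pure function get_len(self) result(length)
    class(string_class), intent(in) :: self
    integer :: length
    length = len(self%get_char())
  end function get_len

  !> Generic entry point: forwards to the type-bound procedure (runtime dispatch)
  pure function len_object(self) result(length)
    class(string_class), intent(in) :: self
    integer :: length
    length = self%get_len()
  end function len_object

  pure function simple_get_char(self) result(chars)
    class(simple_string), intent(in) :: self
    character(len=:), allocatable :: chars
    if (allocated(self%raw)) then
      chars = self%raw
    else
      chars = ''
    end if
  end function simple_get_char

  subroutine simple_set_char(self, chars)
    class(simple_string), intent(inout) :: self
    character(len=*), intent(in) :: chars
    self%raw = chars
  end subroutine simple_set_char
end module demo_string_class

program demo
  use demo_string_class
  implicit none
  type(simple_string) :: string
  string = 'stdlib'   ! uses the type-bound defined assignment
  if (len(string) /= 6) error stop 'unexpected length'
  print '(a, 1x, i0)', string%get_char(), len(string)
end program demo
```

The default get_len copies the whole character sequence just to measure it, which hints at the performance cost mentioned above; an implementation that cares can override get_len.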

ivan-pi · 2021-03-05T11:35:55Z

> The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are
>
> there is no assignment from string to character
>
> reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS

If I understand things correctly, the assignment to character should be handled explicitly through the char function? I.e.

    type(varying_string) :: varying   ! from iso_varying_string
    type(string_type) :: nonfancy     ! from PR #320
    character(len=20) :: flc
    character(len=:), allocatable :: dlc

    flc = varying ! works
    dlc = varying ! fails, dlc needs to be allocated first

    allocate(character(len=len(varying)) :: dlc)
    dlc = varying ! works

    flc = char(nonfancy) ! works
    dlc = char(nonfancy) ! works

> all procedures return a fixed length character rather than a string instance
>
> reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
>
> drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

> no support for get and put
>
> reason: derived type IO is used instead

👍 This is better and more Fortranic IMO. put and get were borrowed from C.

awvwgk · 2021-03-05T12:00:27Z

> > all procedures return a fixed length character rather than a string instance
> >
> > reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
> >
> > drawback: assigning the return value to a string might create a temporary variable on the stack
>
> Which procedures does this hold for?

None, because I had to reconsider this design choice due to missing compiler support.

ivan-pi · 2021-03-05T12:44:05Z

Does the initial design choice (the one which breaks GCC 7 and 8 support) survive in any of the earlier commits on your private fork? I wonder if you could still pull it off, by moving the functions out of a module...

I still don't fully grasp how the implementation differed. Would for example the repeat(string, ncopies) accept a type(string_type), and use an overloaded pure len function to return a fixed-size result of size len(string)*ncopies?

In any case your pull request is a big step to make string-handling easier.

awvwgk · 2021-03-05T12:48:09Z

@ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way.

gronki · 2021-03-05T14:56:43Z

I have recently learned that overloading the assignment operator is a mistake in most cases. For example, appending one element to an allocatable array using the notation string_array = [string_array, string("new_string")] will not work. Given this design flaw of the language, I'd argue that overloading assignment should be avoided at all costs.

Dominik
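A minimal sketch of the situation described in this comment, assuming a hypothetical str_t type with an overloaded type-to-type assignment. As far as I understand it, automatic reallocation of the left-hand side applies to intrinsic assignment only, so with an elemental defined assignment the array-constructor idiom ends up with non-conforming shapes. One workaround is to grow the array explicitly with move_alloc, as the hypothetical append below does.

```fortran
module demo_append
  implicit none
  private
  public :: str_t, new_str, append, assignment(=)

  type :: str_t
    character(len=:), allocatable :: raw
  end type str_t

  !> Overloaded type-to-type assignment (replaces intrinsic assignment for str_t)
  interface assignment(=)
    module procedure :: assign_str
  end interface assignment(=)
contains
  elemental subroutine assign_str(lhs, rhs)
    type(str_t), intent(inout) :: lhs
    type(str_t), intent(in) :: rhs
    if (allocated(rhs%raw)) then
      lhs%raw = rhs%raw
    else if (allocated(lhs%raw)) then
      deallocate(lhs%raw)
    end if
  end subroutine assign_str

  pure function new_str(chars) result(string)
    character(len=*), intent(in) :: chars
    type(str_t) :: string
    string%raw = chars
  end function new_str

  !> Grow the array by one element without relying on reallocation on assignment
  subroutine append(array, item)
    type(str_t), allocatable, intent(inout) :: array(:)
    type(str_t), intent(in) :: item
    type(str_t), allocatable :: tmp(:)
    integer :: n

    n = 0
    if (allocated(array)) n = size(array)
    allocate(tmp(n + 1))
    if (n > 0) tmp(:n) = array   ! shapes conform, so the elemental assignment is fine
    tmp(n + 1) = item
    call move_alloc(tmp, array)
  end subroutine append
end module demo_append

program demo
  use demo_append
  implicit none
  type(str_t), allocatable :: string_array(:)

  call append(string_array, new_str('old_string'))
  call append(string_array, new_str('new_string'))
  if (size(string_array) /= 2) error stop 'unexpected size'
  if (string_array(2)%raw /= 'new_string') error stop 'unexpected content'
  print '(a)', string_array(1)%raw, string_array(2)%raw
end program demo
```

If only character-to-string assignment is overloaded and string-to-string assignment stays intrinsic, the array-constructor idiom should not be affected in this way.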

awvwgk · 2021-03-11T22:31:15Z

I updated my stdlib_string project with an abstract base class for a more object-oriented string implementation. As a demonstration of such a string_class I added @robertrueger's ftlString and @szaghi's StringiFor projects as examples to the repository, but based them on string_class rather than having them implement the intrinsic functions themselves. This could make existing string libraries easily compatible with stdlib by letting them inherit from string_class (and they would also become compatible with each other).

ivan-pi · 2021-04-28T11:25:15Z

Not sure if it was linked before, Clive Page wrote a nice summary about character types in Fortran: https://fortran.bcs.org/2015/suggestion_string_handling.pdf

There was also a thread over at the Fortran-FOSS programmers: Fortran-FOSS-Programmers/Fortran-202X-Proposals#4

A link was provided to a WG5 document, which talks about a print() function (page 9) similar to what we have now as to_string(): https://wg5-fortran.org/N1951-N2000/N1972.pdf

certik mentioned this issue Jan 3, 2020: Proposal for high level I/O #14 (Open)
awvwgk mentioned this issue Feb 13, 2021: Implement non-fancy functional string type #320 (Merged)
ivan-pi mentioned this issue Feb 14, 2021: Extend stdlib_ascii logical functions to character strings #321 (Open)
agoetz mentioned this issue Mar 9, 2021: Failure to read input file merzlab/QUICK#127 (Closed)
This was referenced Mar 11, 2021: Provide abstract base class for a string object #333 (Closed)
This was referenced Mar 11, 2021: Functional vs. object-oriented string handling #334 (Open)
This was referenced Mar 12, 2021: Routines to convert real/complex values to character values #337 (Open)
This was referenced Mar 12, 2021: "Stable" naming conventions (when and how) #342 (Open)
This was referenced Mar 12, 2021: Implement strip and chomp as supplement to trim #343 (Merged)
awvwgk mentioned this issue Mar 27, 2021: Three-way comparison function for strings #364 (Open)
ivan-pi mentioned this issue Mar 27, 2021: Replace substring in character/string_type #366 (Open)
ivan-pi mentioned this issue Apr 10, 2021: Array of strings j3-fortran/fortran_proposals#24 (Open)
awvwgk mentioned this issue Apr 10, 2021: Add functions to convert integer/logical values to character values #336 (Merged)
