Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String handling routines #69

Open
ivan-pi opened this issue Jan 2, 2020 · 78 comments
Open

String handling routines #69

ivan-pi opened this issue Jan 2, 2020 · 78 comments
Labels
topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...

Comments

@ivan-pi
Copy link
Member

ivan-pi commented Jan 2, 2020

Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:

  • split - given a separator, splits the string into some form of array
  • upper/lower - convert a character string to all upper/lower case

The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:

I also found the following Fortran libraries targeting string handling:

It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations it seems like this is one of the things where the absence of the standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes" to borrow the quote of B. Stroustrup in a retrospective of the C++ language.

For comparison here are some links to descriptions of string handling functions in other programming languages:

Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric converions, comparisons, finding the occurence of string in a larger string, joining and splitting, regular expressions).

My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.

@jacobwilliams
Copy link
Member

This is a great summary! I feel like most of what we need has already been done in these projects (and others), so mainly we need to just gather it all together. Some important things to decide:

  • what do we call the string class? (I vote string).
  • what do we call the various individual methods?
  • Are they going to be OO (s.lower()) or functional (lower(s))?

@everythingfunctional
Copy link
Member

Should we base it on the ISO_VARYING_STRING module? If so, the class is VARYING_STRING and the procedures are functional (lower(s)).

Should we utilize intrinsic function names? like real(string, [kind, [status]]) and have it stop if the conversion fails and no status variable is provided?

@milancurcic
Copy link
Member

milancurcic commented Jan 2, 2020

functional-fortran implements several functions on strings:

Further, these functions (and their corresponding operators) are compatible with character strings: complement, empty, head, init, intersection, insert, last, reverse, set, sort, split, tail, and union.

(Caution: split in functional-fortran is not quite what's been discussed at j3-fortran repo. It merely splits the string in two and returns the first or second part)

@certik
Copy link
Member

certik commented Jan 2, 2020

Thanks for this initiative and listing the current landscape. I think we definitely want stdlib to have good string support.

(For conversion from real/integer numbers to strings, I implemented a function str to be used like this: https://github.com/certik/fortran-utils/blob/b43bd24cd421509a5bc6d3b9c3eeae8ce856ed88/tests/strings/test_str.f90, implemented here and here, so one can do things like "Number i = " // str(i) // ".".)

@ivan-pi do you want to go ahead and create a table of the basic subroutines and let's brainstorm how they should be named, to be consistent with other languages and/or the above various string implementations if possible. And also if they should be functions or subroutines and what arguments to accept.

@certik
Copy link
Member

certik commented Jan 2, 2020

@jacobwilliams is right about raising the question how to represent the string. We should start with that.

I would recommend (as usual) to have a lowest level API that operates on the standard Fortran (allocatable where appropriate) character. Then, have a higher level API that operates on a string type, and simply calls the lower level API. Regarding a name, see #26, it seems most people agree that the convention to name derived type is to append _t, so it would be string_t.

That way people can use these low level API routines right away. For example in my codes I do not need to modify any data structures and can start using it. The higher level string_t API can then be used by codes that choose to refactor them, or in new codes. If the syntax is not as nice, some people might opt for the lower level API anyway.

@everythingfunctional
Copy link
Member

I would vote that the low level API be based on functions, pure and elemental where possible and appropriate. I would stick with the Fortran convention of optional status parameters where there is the possibility of things going wrong, and if one is not provided and something goes wrong it crashes. I have tended to use that convention in any routines that go from a string to some intrinsic like:

if (present(status)) then
    read(string, *) result
else
    read(string, *, iostat=status) result
end if

Honestly, I thought the ISO_VARYING_STRING standard did a great job of covering all of the intrinsic functions available for character(len=*) variables, and extending IO to work with that type (put, get, put_line). Aside from the strange interface for split I think it's a great starting point.

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 2, 2020

My plan was to go through the libraries above and create a table of the most commonly available routines in the next days.

I agree we should consider both low-level routines which work directly on strings of type character(len=*) and a high-level string_t type.

The book Fortran Tools for VAX/VMS and MS-DOS by Jones & Crabtree contains a description of a Fortran string-handling library. Interestingly, they decided to use null-terminated strings like in C, meaning they needed to build a separate set of functions from the intrinsic ones (concatenation operator // and length function). They later used these tools to develop a compiler for a subset of the Fortran language itself! Their conclusion about strings was:

Fortran is often maligned for its lack of facilities for character-oriented processing. ... The apparent deficiency of Fortran for string manipulation is primarily because of the methods traditionally used rather than because of a shortcoming of the language itself. The main shortcoming of Fortran for string handling is the lack of a standard library of routines for often-needed functions. As Fortran programmers we are faced with a choice: we either invest the up-front effort required to create our own standard library or we live with the continuing effort of hacking together a solution each time we are presented with similar problems.

@certik
Copy link
Member

certik commented Jan 2, 2020

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 2, 2020

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

I have linked three distinct implementations in the top post. The links from the gfortran compiler pages are dead as well as the link in Modern Fortran Explained by MCR.

Edit: An informal description of the iso_varying_string module for Varying Length Character Strings in Fortran can be found at: http://numat.net/fortran/is1539-2-99.html

@certik
Copy link
Member

certik commented Jan 2, 2020

@ivan-pi thanks. I like your plan. It looks like the iso_varying_string is in the "high-level" API category, as it operates on a VARYING_STRING derived type. Our low-level API would be similar, but operating directly on character(len=*).

@jacobwilliams
Copy link
Member

Building the low-level API on character(len=*) variables will be problematic for some operations, since they can't be resized. The high-level API will need to call routines that operate on character(len=:),allocatable variables. So you may end up with two slightly different routines in some cases. So are there really three APIs?

  • character(len=*)
  • character(len=:),allocatable
  • string_t or whatever we call it

That seems complicated to me... but it would cover all the bases...

@milancurcic
Copy link
Member

I think there are two possible APIs here: intrinsic and derived-type one.

For the intrinsic API, character(len=*) works well for input strings. If the function will return a string of known size, you return a charecter(len=something) string. If unknown, you return an allocated character(len=:), allocatable string. User doesn't need to know which one it is.

I also see the intrinsic one as the starting point. Higher-level (derived type) implementation is likely to use the intrinsic API internally.

@everythingfunctional
Copy link
Member

My understanding, and somebody correct me if I don't have this quite right, is that the ISO_VARYING_STRING standard was created before character(len=:), allocatable (around 2001 I think?), but then when character(len=:), allocatable was added to the standard, it was supposed to function like variable length strings, and so the former was mostly abandoned. However, I have found most compilers to be buggy with their implementation. Memory leaks when used as the return from a function, failure to properly reallocate on assignment, false-positive warnings about accessing uninitialized memory, etc.

If allocatable character actually worked we wouldn't a new derived type for strings. You would just use the intrinsic type and move on. But I think as written in the standard, it probably will never truly work properly in all cases (especially as in read statements, since other allocatable arrays don't and aren't supposed to).

If there is a new type for strings, I don't think a lower level library or API should be exposed, and it should probably not be based on allocatable characters.

@certik
Copy link
Member

certik commented Jan 2, 2020

As I mentioned above, you can use this trick to return character(len=N) strings from functions. The downside is that the string operation gets executed twice --- once to compute the length in the pure procedure, and second time to actually return it. So we probably don't want to do it that way. What I was thinking is to do what @milancurcic suggested: use character(len=*) as well as character(len=N) where we can, and use character(len=:),allocatable to avoid doing the operation twice as I described above. And that's the low level API. Below I provide two examples: #69 (comment) and #69 (comment), to show one one would decide whether to expose character(len=*) or character(len=:), allocatable.

As @everythingfunctional mentioned, for example GFortran used to have huge problems with allocatable strings and leaked memory. The latest version has improved a lot. Given that this is standard Fortran, and stdlib is a standard library, I think it is ok if we depend on the standard, and if there are compiler bugs, we'll try to workaround them and ensure they are reported. Regarding read statements, see #14 that would handle that. I think we should at least try to create a consistent low level API, not give up without even trying. If it truly cannot be done, only then we'll have to do what you propose, and only expose the string_t type and report the bugs to compilers (and keep the list somewhere) and propose improvements to the language itself, so that it can be done in the future.

@everythingfunctional
Copy link
Member

I thought you could only use intrinsic procedures in variable declaration statements. Learned something new. That's a neat trick, but like you said, not particularly efficient.

@certik
Copy link
Member

certik commented Jan 2, 2020

Let's discuss a simple example: upcase.

character(*)

Here is an implementation:

function upcase(s) result(t)
! Returns string 's' in uppercase  
character(*), intent(in) :: s
character(len(s)) :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

When the user wants to use it, he could do this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
allocate(character(len(s)) :: a)
a = upcase(s)
print *, a

which prints:

 Some string
 SOME STRING

The main disadvantage of this approach is that the user needs to know the size ahead of time. In this case he knows --- it's the same size as the original string. Although modern gfortran has reallocatable LHS turned on, so then just this works:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

So I think that would work for upcase.

character(:), allocatable

Here is the implementation using character(:), allocatable

function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(:), allocatable :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

It's still used like this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

But since this as an extra allocation inside upcase, I would think that in this case, the character(*) version is better.

@certik
Copy link
Member

certik commented Jan 2, 2020

Now let's discuss integer to string conversion, the two implementations:

character(*)

pure integer function str_int_len(i) result(sz)
! Returns the length of the string representation of 'i'
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: s
! If 's' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(s, '(i0)') i
sz = len_trim(s)
end function

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
character(len=str_int_len(i)) :: s
write(s, '(i0)') i
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

character(:), allocatable

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: tmp
character(:), allocatable :: s
! If 'tmp' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(tmp, '(i0)') i
s = trim(tmp)
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

Discussion

Unlike in the upcase (see previous comment), here the character(*) version is converting twice, so it is inefficient. The character(:), allocatable version just converts once, and so that would be the preferable API.

(Note: if we implement our own integer to string conversion algorithm, then we avoid the ugly MAX_STR thing and the need to call trim. The above implementation was reused from my codes, where I just use the Fortran intrinsic conversion as part of write so that I save code.)

@certik
Copy link
Member

certik commented Jan 3, 2020

Here is my proposal for the low level API:

  1. Use character(*) where possible and efficient (see the previous two comments for examples how to decide)
  2. Use character(:), allocatable otherwise
  3. For compilers that cannot compile the code (or leak memory): create a workaround subroutine with a different (less nice or less efficient) API and use that instead for those compilers only, and report the compiler bug and reference it in the code. As a community we have contacts to compiler vendors, and we can communicate this and help get this fixed. The long term goal would be to eventually have no workarounds in 3.

Unfortunately some compilers might leak memory or segfault when such strings are used in derived types. Ultimately, long term, the compilers must be fixed. That's why I think the above proposal is a good one for the long term. In the short term, if we want to provide strings to users that actually work in all today's compilers, it might be that the only way is to create a string_t type not based on allocatable strings, in which case one could still use the low level API with the workarounds 1., 2., and 3., but make a copy of the result into the derived type string_t that is internally represented differently, so that today's compilers do not leak memory. That would be less efficient than providing a separate string implementation, but it's only a short term issue anyway, until compilers catch up. (Alternatively we can have an efficient duplicate implementation based on the internal string_t representation directly if we want better performance until compilers catch up.)

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 3, 2020

Specifically for the case of integer to string conversion, you could also dynamically allocate a buffer for each integer kind and then trim the result into an allocatable character string:

    function integer_to_string2(i) result(res)
      character(len=:),allocatable :: res
      integer, intent(in) :: i
      character(len=range(i)+2) :: tmp
      write(tmp,'(i0)') i
      res = trim(tmp)
    end function

If we want to avoid internal I/O this function becomes something like

    function integer_to_string1(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        integer, parameter :: ibuffer_len = range(ival)+2
        character(len=ibuffer_len) :: buffer
        integer :: i, sign, n

        if (ival == 0) then
            str = '0'
            return
        end if

        sign = 1
        if (ival < 0) sign = -1

        n = abs(ival)
        buffer = ""

        i = ibuffer_len
        do while (n > 0)
            buffer(i:i) = char(mod(n,10) + ichar('0'))
            n = n/10
            i = i - 1
        end do
        if (sign == -1) then
            buffer(i:i) = '-'
            i = i - 1
        end if

        str = buffer(i+1:ibuffer_len)
    end function

For processing floating point values the functions are much more difficult to develop compared to those using internal read and write statements.

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 3, 2020

I did some keyword searchs in the list of popular Fortran projects. It seems that most projects use their own set of character conversion and string handling routines for stuff like reading input values from files, parsing command line options, defining settings, etc..

Here are the results of my search of some of the top projects:

Project # of "string" # of "character" # of Fortran files
ElmerFEM 248 1319 2076
WRF 306 966 1668
fds 16 28 41
quantum-Espresso 66 472 1516
fluidity 38 279 747
json-fortran 26 47 49
fortranlib 11 18 38
Nek5000 54 204 336
cp2k 439 1043 1132
nastran-95 85 551 1838
specfem3d 186 404 765
nwchem 323 2768 17214
gtk-fortran 59 77 92
cfl3d 14 216 397
shtools 2 20 113
arpack-ng 1 259 332

The second and third column measure the number of Fortran files that contain the keywords string or character, respectively. This includes both command statements and comments so it may be a bit misleading.

In one of the codebases I even found this comment:

    ! String parsing in Fortran
    ! is such a pain
    ! it's unreal

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 3, 2020

Casing

The purpose of these functions is to return of copy of a character string ( either character(len=*) or a derived string type) with the case converted . The common variants are uppercase, lowercase, and titlecase.

The libraries cited in the first post contain the following function prototypes:

! functional
function str_upper(str)
function str_lower(str)
function str_swapcase(str)
pure function ucase(input)
pure function lcase(input)
function str_lowercase(str)
function str_uppercase(str)
subroutine str_convert_to_lowercase(str)
subroutine str_convert_to_uppercase(str)
pure elemental function lowercase_string(str)
function uppercase(str)
function lowercase(str)

! object-oriented
procedure, pass(self) :: camelcase
procedure, pass(self) :: capitalize
procedure, pass(self) :: lower
procedure, pass(self) :: snakecase
procedure, pass(self) :: startcase
procedure, pass(self) :: upper
function vstring_tolower(this[,first,last])
function vstring_toupper(this[,first,last])
function vstring_totitle(this[,first,last])

Some versions will return a new string, while some work in place. In at least one of the functions, it did not convert the case of characters enclosed between quotation marks.

These are the similar functions available in other programming languages:

  • Python: capitalize, lower, swapcase, upper
  • Ruby: capitalize, swapcase, upcase, downcase
  • D: toLower, toLowerInPlace, toUpper, toUpperInPlace, asCapitalized, asLowerCase, asUpperCase
  • MATLAB: lower, upper
  • Julia: uppercase, lowercase, titlecase, uppercasefirst, lowercasefirst
  • C++ (Boost): to_upper, to_lower,
  • Rust: to_uppercase, to_ascii_uppercase, to_lowercase, to_ascii_lowercase, make_ascii_uppercase, make_ascii_lowercase

My top three name picks are:

  1. uppercase/lowercase/titlecase
  2. to_upper/to_lower/to_title
  3. upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

@milancurcic
Copy link
Member

milancurcic commented Jan 3, 2020

I'd like to add to the list of facilities here the overloaded operator * between integers and strings, so that you can do, like in Python:

print *, 3 * 'hello' ! prints 'hellohellohello'
print *, 'world' * 2 ! prints 'worldworld' 

It's easy to make and use. The only downside I can think of is a somewhat weird API when importing it:

use stdlib_experimental_strings, only: operator(*)

@ivan-pi
Copy link
Member Author

ivan-pi commented Jan 3, 2020

Yes, I have seen this kind of usage in one of the above mentioned libraries. I am not sure whether it is not perhaps better to promote the usage of the intrinsic repeat function. As the Zen of Python states: There should be one-- and preferably only one --obvious way to do it.

A benefit of repeat is precisely that you avoid the import statement.

@milancurcic
Copy link
Member

milancurcic commented Jan 3, 2020

Oops, I didn't know about repeat. Indeed it's the way to go so I withdraw my proposal. I need to brush up on my canonical Fortran. :)

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

It is really hard to have a day job and keep up with all these threads, so my apologies if I've missed something because I'm just skimming here. A few opinionated notes:

  • Ruby is my favorite language for string processing, and IMO is the best at it. If no one objects (especially @ivan-pi) I'll put links in the first post on this issue
  • I have focussed mostly on some basic string handling (expanded template here) in my ZstdFortranLib project which I started only days before this project started, so I'm of a mind to possibly abandon it or work to integrate some of it here. It has:
    • split: returns an array of characters where each entry is as long as the longest one
    • join: joins an array of characters
    • sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string
    • gsub: Replaces all occurrences with a substring
    • //: Overloaded to allow concatenation of all real, logical, integer kinds with all character kinds
    • to_i: Convert to integer (all kinds)
    • to_r: Convert to real (all kinds)
    • to_l: Convert to logical (all kinds)
    • to_s: Convert anything to a string (all kinds)
    • ANSI formatting stuff that can be turned on/off globally for coloring and styling terminal output (thanks to using one of @szaghi's projects, FACE I believe)

I need to look at the varying string and character array proposals in more detail.

FWIW, I personally prefer the Ruby Python OO approaches with methods because it will make import statements much simpler: Pull in the string class and you get all the methods along with the type/class declaration. Now some operators may need to be pulled in as well if you want to be able to concatenate a real (lhs) with a string (rhs, can't have a TBP operator to the left of the object IIRC).

I was thinking of starting a PR marrying my work on ZstdFortranLib with a UDT/Class approach rather than operating on raw character scalars and arrays which is awkward for things like split(). But now I need to catch up on the myriad of proposals and prior art, so don't hold your breath.

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

While there is an intrinsic implementation, repeat(), I still like the more concise syntax which has pretty clear meaning for anyone who has ever worked with languages like Python and Ruby. It would be nice if some of these syntactic sugar items were added to the standard rather than a standard library. But until then I would be happy with * overloaded for characters.

@certik
Copy link
Member

certik commented Jan 3, 2020

@zbeekman I am struggling with all the threads also, but that is good news. It means there is lots of momentum. If you can help us design a good low and high level API for strings (#69 (comment)), that would be great.

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

@certik k

Now let's discuss integer to string conversion, the two implementations:

I like your first one the most. With integers you can use some math to count up how many digits there are, and if you need a sign on the front, which completely removes the need to declare the max string length AND to do the IO twice. Instead you use integer and floating point math which (hopefully) will be reasonably quick. IIRC, I implemented something to do this in JSON-Fortran but I'll have to look for it.

Also, I don't mean to whine about not being able to keep up, and I agree that it's good, but it's hard to keep track of all the balls in the air.

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

After a brief search there are at least 3 ways to do this without performing the conversion to a string then counting digits:

  1. Iteration:
    len = 1
    if ( n < 0 ) len = 2
    do
      n = n / 10
      if (n == 0) exit
      len = len + 1
    end do
  2. Tail recursion (same algorithm as above: It will be optimized to code above by compiler or it will be slower if the compiler uses recursion)
  3. "One shot" method using log10
    len = floor( log10( real( abs( n ) ) ) + 1 )
    if ( n < 0 ) len = len + 1

I would guess that 1 is the fastest way to do this, but it may depend on the compiler and hardware. 3 has conversion to a real, then log10 is probably computed iteratively, and it is converted back to an int, so 1. may be faster despite the loop.

@awvwgk
Copy link
Member

awvwgk commented Feb 4, 2021

After reading through this thread I found subtle issue with the proposed low-level API for character(len=*) variables.

My top three name picks are:

  1. uppercase/lowercase/titlecase
  2. to_upper/to_lower/to_title
  3. upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

Taking just the basic functionality mentioned by @ivan-pi here, I implemented a stdlib_character module as demonstration in #310, the issue becomes quickly apparent once you try to use both stdlib_character and stdlib_ascii in the same scope.

@awvwgk
Copy link
Member

awvwgk commented Feb 5, 2021

I created an exploratory implementation of a functional string handling at awvwgk/stdlib_string as fpm project. A non-fancy string type is implemented there, which basically provides the same functionality as a deferred length character but can be used in an elemental rather than a pure way. The idea is to have a scaffold for the string type in stdlib which can be extended later but already provides everything we are used to have from the deferred length character without the rough edges.

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

  • there is no assignment from string to character
    • reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS
  • all procedures return a fixed length character rather than a string instance
    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack
  • no support for get and put
    • reason: derived type IO is used instead

@certik
Copy link
Member

certik commented Feb 10, 2021

@awvwgk this would be the high level API that operates on the string_type type.

How would a low level API look? Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

The maybe function can also operate on character(len=:), allocatable it seems. So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

@awvwgk
Copy link
Member

awvwgk commented Feb 10, 2021

Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

Bad example, the read_formatted procedure defines a user defined derived type input (see #312), which cannot be defined for character(len=:), allocatable since there is already an intrinsic formatted read transfer for character(len=*) types defined.

this would be the high level API that operates on the string_type type.

The idea so far was to provide the intrinsic low level API for a string type, on which later the high level API can be defined.

So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

Exactly, I wanted to explore a common basis of agreed on functions for a future high level string object. The minimal agreed on basis should be easily all the intrinsic procedures defined for character(len=*).

How would a low level API look?

I decided to pick the part of the high-level API that will have no overlap with a potential low-level API. This way the low level API can be explored separately, like in #310

The maybe function can also operate on character(len=:), allocatable it seems.

This one was chosen deliberately to be an internal implementation detail, i.e. it is not part of the public API.

@certik
Copy link
Member

certik commented Feb 10, 2021

Here is what I mean: awvwgk/stdlib_string#1

In that PR, I implemented a low level version of read_formatted called read_formatted0 that operates on character(len=:), allocatable. It seems to work.

@ivan-pi
Copy link
Member Author

ivan-pi commented Feb 10, 2021

@awvwgk this would be the high level API that operates on the string_type type.

@certik, the procedures in Sebastian's module are in fact equivalents of the intrinsic character procedures already available in Fortran:

    public :: len, len_trim, trim, index, scan, verify, repeat, adjustr, adjustl
    public :: lgt, lge, llt, lle, char, ichar, iachar
    public :: assignment(=)
    public :: operator(.gt.), operator(.ge.), operator(.lt.), operator(.le.)
    public :: operator(.eq.), operator(.ne.), operator(//)
    public :: write(formatted), write(unformatted)
    public :: read(formatted), read(unformatted)

The pull request #310 is the first to propose new procedures (reverse, to_title, to_upper, to_lower) which can operate both on the intrinsic character(len=*) variables, and in a later pull request also on a high-level string type (to be decided).

String processing in Fortran is not that bad, considering the number of procedures already there. If we could add casing, numeric to string convertors (and vice-versa), join and split, and perhaps a few more procedures, I think most usage cases would be covered.

@certik
Copy link
Member

certik commented Feb 10, 2021

@ivan-pi actually there is genuine new functionality, that I just extracted here: awvwgk/stdlib_string#1 (comment).

@awvwgk
Copy link
Member

awvwgk commented Feb 10, 2021

@certik I see, you are right the there is new functionality, the new_string_from_chars function extends beyond intrinsic functionality. I would suggest that I will submit it as a separate patch to the existing stdlib_ascii module and remove it from the stdlib_string_type until we have agreed on the low-level API. (new issue at #315)

Regarding the len functionality I find the argument is bit stretched, but I'm willing to follow it for the sake of the discussion. Eventually, the argument boils down to whether the utility function maybe should be exposed as public API. I think it is an implementation detail, because the specs as proposed do not define the internal representation of the character sequences, the string_type could use character(len=1), allocatable :: raw(:) instead of character(len=:), allocatable :: raw like in most iso_varying_string implementations.

The maybe, ok and err functionality from Rust come to mind here and would make a great addition for stdlib as well if we can successfully emulate this kind of behaviour. I don't feel confident that I got a good spin on this kind of features to make it stable enough for an actual addition to stdlib, therefore I don't want to force a maybe implementation for a character(len=:), allocatable yet.

However, I disagree on the low level API for user defined derived type input output, it is strictly a feature that can only be defined for a derived type but not an intrinsic and we won't be able to make use of it to safely read into a character(len=:), allocatable :: dlc with read(unit, *) dlc due to the construction of the Fortran standard.

The gist is, I don't want to introduce new functionality beyond the existing character(len=:), allocatable with the string_type on the first pass. Just one step at a time to allow better focus.

@certik
Copy link
Member

certik commented Feb 10, 2021

@awvwgk I just saw your comment, my comment here I think replies to yours: awvwgk/stdlib_string#1 (comment).

@awvwgk
Copy link
Member

awvwgk commented Feb 13, 2021

There is now also a branch at my stdlib fork. There is one really unfortunate thing here, GCC 7 and 8 do not support evaluation of user defined pure procedures in variable declarations. Adopting this string_type will inevitably drop support for GCC 7 and 8.

The solution is to adopt the iso_varying_string strategy to return a string_type instead of a fixed length character, which comes with its own problem that results from stdlib_string_type procedures now must be explicitly cast back to character form.

@awvwgk
Copy link
Member

awvwgk commented Feb 16, 2021

As promised in #320 (comment) I tried to devise an abstract base class (ABC) for an extendible string class. This one turned out much more difficult to design than a non-extendible functional string type, you can check the base class definition here:

https://github.com/awvwgk/stdlib_string/blob/string-class/src/stdlib_string_class.f90

The class is a bit more bloated than it has to be because I made it compatible with the intrinsic character type and the functional string type as well to ease testing.

One thing that turns out to be very difficult to account for are overloaded intrinsic procedures, you can find two implementation for each intrinsic procedure (except for the lexical comparison where I took a shortcut), one for the overloaded generic interface (len(string)) and a type bound implementation (string%get_len()), with the former invoking the latter. This was necessary to allow using the overloaded intrinsic procedure names while still relying on the runtime resolution of the type bound procedures from the object.

Another problem was returning a class polymorphic object from a procedure (operator(//) or trim), returning class(string_class), allocatable would force users to declare their string objects always as class polymorphic even if they want to use a specific implementation. Therefore, I decided to return a functional string_type instance instead and provide an assignment from string_type to a polymorphic string_class object to hide this fact effectively.

Since we have a whole lot of intrinsic character procedures implementing a string class based on this ABC can become tedious, therefore I designed the ABC to provide mock implementations based on the setter (assignment(=)) and getter (char(self)) functionality which can optionally be overwritten. Only the assignment from a character variable and the three char functionalities are actually deferred and must be provided in a minimal implementation.

While this is not a final specification yet, I wanted to share it as aid for discussion functional vs. object oriented implementation of a string in stdlib. From the above notes you might gather that a truly extendible string class could result in significant performance penalties for the user. Still there might be some value in having a string object available.

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 5, 2021

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

  • there is no assignment from string to character

    • reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS

If I understand things correctly, the assignment to character should be handled explicitly through the char function? I.e.

type(varying_string) :: varying   ! from iso_varying_string
type(string_type) :: nonfancy     ! from PR #320 
character(len=20) :: flc 
character(len=:), allocatable :: dlc

flc = varying ! works
dlc = varying ! fails, dlc needs to be allocated first

allocate(character(len=len(varying)) :: dlc)
dlc = varying ! works

flc = char(nonfancy) ! works
dlc = char(nonfancy) ! works
  • all procedures return a fixed length character rather than a string instance

    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

  • no support for get and put

    • reason: derived type IO is used instead

👍 This is better and more Fortranic IMO. put and get where borrowed from C.

@awvwgk
Copy link
Member

awvwgk commented Mar 5, 2021

  • all procedures return a fixed length character rather than a string instance

    • reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
    • drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

None, because I had to reconsider this design choice due to missing compiler support.

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 5, 2021

Does the initial design choice (the one which breaks GCC 7 and 8 support) survive in any of the earlier commits on your private fork? I wonder if you could still pull it off, by moving the functions out of a module...

I still don't fully grasp how the implementation differed. Would for example the repeat(string, ncopies) accept a type(string_type), and use an overloaded pure len function to return a fixed-size result of size len(string)*ncopies?

In any case your pull request is a big step to make string-handling easier.

@awvwgk
Copy link
Member

awvwgk commented Mar 5, 2021

@ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way.

@gronki
Copy link

gronki commented Mar 5, 2021 via email

@awvwgk
Copy link
Member

awvwgk commented Mar 11, 2021

I updated my stdlib_string project with an abstract base class for a more object-oriented string implementation. As a demonstration of such a string_class I added @robertrueger's ftlString and @szaghi's StringiFor projects as examples to the repository, but based them on string_class rather than having them implement the intrinsic functions themselves. This could allow to make existing string libraries easily compatible with stdlib by allowing them to inherit from string_class (and they would also become compatible with each other).

@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 28, 2021

Not sure if it was linked before, Clive Page wrote a nice summary about character types in Fortran: https://fortran.bcs.org/2015/suggestion_string_handling.pdf

There was also a thread over at the Fortran-FOSS programmers: Fortran-FOSS-Programmers/Fortran-202X-Proposals#4

A link was provided to a WG5 document, which talks about a print() function (page 9) similar to what we have now as to_string(): https://wg5-fortran.org/N1951-N2000/N1972.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...
Projects
None yet
Development

No branches or pull requests