Extend stdlib_ascii logical functions to character strings #321

Open
awvwgk opened this issue Feb 14, 2021 · 12 comments
Labels
idea Proposition of an idea and opening an issue to discuss it topic: strings String processing

Comments

@awvwgk
Member

awvwgk commented Feb 14, 2021

Currently most routines provided by stdlib_ascii are only applicable to single characters. The conversion routines (to_lower and to_upper) were straightforward to extend to character strings (see #310). For the logical functions it is a bit more complicated.

For each of the logical functions (is_ascii, ...) two possible reductions apply: any and all. To follow the Fortran naming of the verify and scan intrinsics, one could implement three versions of each logical function:

  • single character version: is_ascii
  • fixed length character version with any reduction: verify_ascii (empty string returns .false.)
  • fixed length character version with all reduction: scan_ascii (empty string returns .true.)

Alternatively, to be more in line with the verify and scan function, the index of the first matching character could be returned, which makes it easy to create the respective logical by a simple comparison.
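
A minimal sketch of the two reductions, using a locally defined digit predicate as a stand-in for a stdlib_ascii is_* function. The names is_digit_char, all_digit and any_digit are hypothetical and chosen only for illustration; the point is how the empty string behaves under each reduction.

```fortran
program reduction_sketch
  implicit none

  ! "12a" contains digits, but not only digits.
  if (.not. any_digit("12a")) error stop "any_digit failed"
  if (all_digit("12a")) error stop "all_digit failed"

  ! Empty input: `all` over nothing is true, `any` over nothing is false.
  if (.not. all_digit("")) error stop "all_digit on empty string"
  if (any_digit("")) error stop "any_digit on empty string"

  print *, "reductions behave as described"

contains

  ! Hypothetical stand-in for a single-character is_* function.
  pure logical function is_digit_char(c)
    character(len=1), intent(in) :: c
    is_digit_char = iachar(c) >= iachar("0") .and. iachar(c) <= iachar("9")
  end function is_digit_char

  ! `all` reduction of the predicate over every character.
  pure logical function all_digit(string)
    character(len=*), intent(in) :: string
    integer :: i
    all_digit = all([logical :: (is_digit_char(string(i:i)), i = 1, len(string))])
  end function all_digit

  ! `any` reduction of the predicate over every character.
  pure logical function any_digit(string)
    character(len=*), intent(in) :: string
    integer :: i
    any_digit = any([logical :: (is_digit_char(string(i:i)), i = 1, len(string))])
  end function any_digit

end program reduction_sketch
```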

@awvwgk awvwgk added topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... idea Proposition of an idea and opening an issue to discuss it labels Feb 14, 2021
@aslozada
Member

It may be interesting to apply the is_blank function to delete multiple white spaces in character strings, e.g.,

replace the string

Some white space     in      this text should be     deleted.

with

Some white space in this text should be deleted.

Could this composite functionality be contained in a standard library?

@ivan-pi
Member

ivan-pi commented Feb 14, 2021

@aslozada, thanks for the suggestion. I think the functionality you are interested in could be achieved with a combination of split/join functions. These have been discussed previously here: #69 (comment). A separate issue exists for split already: #241. If this new proposal by @awvwgk was accepted you could also use the scan_blank or related functionality to achieve what you describe. I would however suggest opening a new issue to brainstorm and agree on the precise behavior. Currently it is not clear to me if you want to remove blanks in-place, or a function returning a new string.


I never realized that verify and scan are just all and any in disguise, with the added functionality of findloc on top. Great way of understanding these two functions. 👍

Effectively the functions you are proposing are just specializations of the intrinsic verify and scan functions where the set dummy variable is fixed? Is that correct? For example verify_alpha is equal to verify(string, letters[, back[,kind]]), where letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"? The different fixed character sets can be presented nicely as a table.
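
The specialization described above could look roughly like the following sketch. The wrapper name verify_alpha comes from this discussion; its exact interface here (an optional back argument, no kind argument) is an assumption made for illustration, not an existing stdlib routine.

```fortran
program verify_alpha_sketch
  implicit none
  character(len=*), parameter :: letters = &
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

  ! First non-letter in "abc1d2" is "1" at position 4.
  if (verify_alpha("abc1d2") /= 4) error stop "forward search"
  ! Searching from the back, the last non-letter is "2" at position 6.
  if (verify_alpha("abc1d2", back=.true.) /= 6) error stop "backward search"
  ! Only letters: the result is zero.
  if (verify_alpha("Fortran") /= 0) error stop "letters only"

  print *, "verify_alpha sketch ok"

contains

  ! Hypothetical wrapper: the intrinsic verify with the set fixed to letters.
  pure integer function verify_alpha(string, back) result(pos)
    character(len=*), intent(in) :: string
    logical, intent(in), optional :: back
    if (present(back)) then
      pos = verify(string, letters, back)
    else
      pos = verify(string, letters)
    end if
  end function verify_alpha

end program verify_alpha_sketch
```

The kind argument of the intrinsic is deliberately left out of this sketch.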

I might be mistaken, but I believe that within the limits of standard Fortran there is no way to replicate the kind argument of the intrinsic functions. (I do kind of doubt the actual usefulness of this argument; I have never had the need to verify or scan a string exceeding the range of default integers).

Would it be possible to auto-generate these procedures? Maintaining them manually seems very tedious.


Edit: Sometimes I long for the (obsolescent) Fortran 77 statement functions. Then one could easily implement verify_alpha as verify_alpha(string) = verify(string,letters) at the point of use. Perhaps a new construct roughly equivalent to

associate(verify_alpha(string,back=.false.) => verify(string,letters,back))

could return some sort of lambda expression or procedure pointer equivalent. It could save some interface typing...

@awvwgk
Member Author

awvwgk commented Feb 14, 2021

Effectively the functions you are proposing are just specializations of the intrinsic verify and scan functions where the set dummy variable is fixed? Is that correct?

Basically, yes, but I would implement them by iterating over the string using the corresponding is_* function. This could be almost fully automated with fypp using two templates. But we can also try using scan and verify internally to see which approach is faster; I would be very intrigued to see whether either approach has an overhead or whether they end up even.
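
A minimal sketch of the loop-based variant, assuming the function returns an index like the intrinsic verify does. The names is_digit_char and verify_digit are hypothetical; the local predicate stands in for the stdlib_ascii is_digit function so that the example is self-contained. The check subroutine compares the result against the intrinsic with a fixed set.

```fortran
program loop_based_sketch
  implicit none
  character(len=*), parameter :: digits = "0123456789"

  call check("2021")
  call check("12a4")
  call check("")
  print *, "loop-based result matches the intrinsic verify"

contains

  ! Hypothetical stand-in for a single-character is_* function.
  pure logical function is_digit_char(c)
    character(len=1), intent(in) :: c
    is_digit_char = iachar(c) >= iachar("0") .and. iachar(c) <= iachar("9")
  end function is_digit_char

  ! Index of the first character that is not a digit, or 0 if there is none.
  pure integer function verify_digit(string) result(pos)
    character(len=*), intent(in) :: string
    integer :: i
    pos = 0
    do i = 1, len(string)
      if (.not. is_digit_char(string(i:i))) then
        pos = i
        exit  ! stop at the first mismatch
      end if
    end do
  end function verify_digit

  ! Abort if the loop-based result differs from the intrinsic.
  subroutine check(string)
    character(len=*), intent(in) :: string
    if (verify_digit(string) /= verify(string, digits)) then
      error stop "mismatch with intrinsic verify"
    end if
  end subroutine check

end program loop_based_sketch
```

This only checks that the two approaches agree; it says nothing about which one is faster.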

I might be mistaken, but I believe that within the limits of standard Fortran there is no way to replicate the kind argument of the intrinsic functions

You are right, there is currently no way to replicate this functionality with Fortran. I would vote for ignoring it as well.

@aslozada
Member

[...]Currently it is not clear to me if you want to remove blanks in-place, or a function returning a new string.[...]

Thanks, @ivan-pi,

the idea is to delete extra blank spaces in a character string.

I have read about combining procedures to perform this task, but my comment and question are about using an is_blank extension to obtain the same result, and whether such a procedure can be considered functionality of a standard library.

@awvwgk
Member Author

awvwgk commented Feb 14, 2021

@aslozada Proposing a function like compact that implements the described behaviour would be more appropriate than extending the is_* functionality of the stdlib_ascii module with this behaviour. The implementation of a potential compact function might rely on the is_blank function from stdlib_ascii or use the split / join combination internally to get the final compact string.
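
A rough sketch of what such a compact function might do, under several assumptions: only the space character counts as blank here (the stdlib is_blank function also accepts tabs), runs of blanks are collapsed to a single blank rather than removed, and a new string is returned instead of modifying the input in place. None of this is an agreed interface.

```fortran
program compact_sketch
  implicit none
  character(len=*), parameter :: text = &
    "Some white space     in      this text should be     deleted."
  character(len=*), parameter :: expected = &
    "Some white space in this text should be deleted."
  character(len=:), allocatable :: squeezed

  squeezed = compact(text)
  if (len(squeezed) /= len(expected)) error stop "unexpected length"
  if (squeezed /= expected) error stop "unexpected content"
  print "(a)", squeezed

contains

  ! Hypothetical compact: collapse each run of spaces into a single space.
  pure function compact(string) result(res)
    character(len=*), intent(in) :: string
    character(len=:), allocatable :: res
    character(len=len(string)) :: buffer
    logical :: previous_blank
    integer :: i, n

    n = 0
    previous_blank = .false.
    do i = 1, len(string)
      if (string(i:i) == " ") then
        if (previous_blank) cycle  ! skip repeated blanks
        previous_blank = .true.
      else
        previous_blank = .false.
      end if
      n = n + 1
      buffer(n:n) = string(i:i)
    end do
    res = buffer(1:n)
  end function compact

end program compact_sketch
```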

@ivan-pi
Member

ivan-pi commented Feb 14, 2021

Basically, yes, but I would implement them by iterating over the string using the corresponding is_* function.

Indeed this implementation has the lowest memory footprint. For the scan you are also able to exit the loop on the first .false. value.

But we can also try using scan and verify internally to see which approach is faster; I would be very intrigued to see whether either approach has an overhead or whether they end up even.

I would guess that internally the verify and scan intrinsics use the set argument to generate a custom bit-mask which then calls into the C standard library which already contains a static lookup table for ASCII characters. I've got an equivalent lookup table implementation here: https://github.com/ivan-pi/fortran-ascii/blob/master/fortran_ascii_bit.f90
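
For illustration only, a table-lookup predicate can be sketched as below. This is not the implementation of the intrinsics, the C standard library, or the linked repository; the names digit_table and is_digit_lookup are hypothetical, and a plain logical array is used instead of a bit-mask.

```fortran
program lookup_table_sketch
  implicit none
  ! One entry per ASCII code 0..127; .true. marks the digits "0".."9".
  logical :: digit_table(0:127)

  digit_table = .false.
  digit_table(iachar("0"):iachar("9")) = .true.

  if (.not. is_digit_lookup("7")) error stop "7 should be a digit"
  if (is_digit_lookup("x")) error stop "x should not be a digit"
  if (is_digit_lookup(" ")) error stop "blank should not be a digit"
  print *, "lookup table sketch ok"

contains

  ! Classify a character with a single table access.
  logical function is_digit_lookup(c)
    character(len=1), intent(in) :: c
    integer :: code
    code = iachar(c)
    is_digit_lookup = .false.
    ! Guard against codes outside the ASCII range before indexing.
    if (code >= 0 .and. code <= 127) is_digit_lookup = digit_table(code)
  end function is_digit_lookup

end program lookup_table_sketch
```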

@ivan-pi
Member

ivan-pi commented Feb 14, 2021

I might be mistaken, but I believe that within the limits of standard Fortran there is no way to replicate the kind argument of the intrinsic functions

You are right, there is currently no way to replicate this functionality with Fortran. I would vote for ignoring it as well.

I recall there was a similar precedent in the development of the actual Fortran standard, where the kind argument was added to an intrinsic only later, in a revision.

@aman-godara
Member

Alternatively, to be more in line with the verify and scan function, the index of the first matching character could be returned, which makes it easy to create the respective logical by a simple comparison.

Please correct me if I understood it wrong.
verify_* returns the index of the first character which doesn't meet requirement *, whereas scan_* returns the index of the first character which meets requirement *.
Based on this assumption, both verify_* and scan_* will return 0 (or -1) for an empty string (since Fortran uses 1-based indexing) to indicate that the function couldn't find the requested index.

  • fixed length character version with any reduction: verify_ascii (empty string returns .false.)
  • fixed length character version with all reduction: scan_ascii (empty string returns .true.)

Shouldn't we say that when verify_* returns 0 it should be interpreted as .true. indicating that all characters in the string meet the requirement *?

@ivan-pi
Member

ivan-pi commented Apr 3, 2021

I believe the any/all analogy in the original post is actually the opposite way round. From the description of the intrinsic routines:

  • verify: Verifies that all the characters in STRING belong to the set of characters in SET.
  • scan: Scans a STRING for any of the characters in a SET of characters.

The intrinsic verify routine will return the position of the first character which is not in SET. If all characters of STRING are found in SET, the result is zero. So if verify_<set> == 0 is .true. the string is only composed of characters in <set> (or the tested string is empty).

The intrinsic scan routine returns the position of the first character of STRING that is in SET. If no character of SET is found in STRING, the result is zero. So if scan_<set> == 0 is .false., it means that some character from <set> has been found.
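
The return values described above can be checked directly with the intrinsics. The sketch below uses a locally defined digits set; the test strings are made up for illustration.

```fortran
program intrinsic_semantics
  implicit none
  character(len=*), parameter :: digits = "0123456789"

  ! verify: position of the first character NOT in the set, 0 if there is none.
  if (verify("2021", digits) /= 0) error stop "verify: digits only"
  if (verify("12a4", digits) /= 3) error stop "verify: first non-digit"
  if (verify("", digits) /= 0) error stop "verify: empty string"

  ! scan: position of the first character that IS in the set, 0 if there is none.
  if (scan("abc", digits) /= 0) error stop "scan: no digit present"
  if (scan("ab3d", digits) /= 3) error stop "scan: first digit"
  if (scan("", digits) /= 0) error stop "scan: empty string"

  print *, "verify and scan behave as described"
end program intrinsic_semantics
```

So verify(string, set) == 0 plays the role of the all reduction, and scan(string, set) > 0 plays the role of the any reduction.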

@aman-godara
Member

So if verify_<set> == 0 is .true. the string is only composed of characters in <set> (or the tested string is empty).

Yeah, I actually deleted my last comment because I realised that I interpreted them wrong and also posted a new one. I agree with what you have said.

So if scan_<set> == 0 is .false., it means that some character from <set> has been found.

I want to make it clearer for me that when scan_<set> == 0 it implies that all the characters in the input belong to the <set>?

@ivan-pi
Member

ivan-pi commented Apr 3, 2021

I want to make it clearer for me that when scan_<set> == 0 it implies that all the characters in the input belong to the <set>?

No, it implies the opposite. When scan_<set> == 0 is .true. it means no character of SET is found in STRING (or the string is empty to begin with). If scan_<set> .gt. 0 then a character from SET has been found.

@ivan-pi
Member

ivan-pi commented Apr 25, 2021

At least for the most common character sequences (letters and digits) we already provide the constants (uppercase, lowercase, ...) which can be used with the intrinsics. I imagine a good tutorial or user guide would be enough to nudge users towards using the pattern:

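program lowercase_input
  use stdlib_ascii, only: lowercase, uppercase, to_lower
  implicit none
  character(len=32) :: input

  read(*,*) input

  ! convert to lowercase only if an uppercase letter is present
  if (scan(input, uppercase) > 0) then
    input = to_lower(input)
  end if
end program lowercase_input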

For comprehension I think a name like any_upper (or any_is_upper) would be better than scan or verify.
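
A minimal sketch of such descriptively named wrappers, assuming they return logicals and are built on the intrinsics. The name any_upper is the one suggested above; all_upper is added here only as a hypothetical counterpart, and the uppercase constant is defined locally so the example does not depend on stdlib_ascii.

```fortran
program any_upper_sketch
  implicit none
  character(len=*), parameter :: uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

  if (.not. any_upper("Fortran")) error stop "any_upper: mixed case"
  if (any_upper("fortran")) error stop "any_upper: lowercase only"
  if (any_upper("")) error stop "any_upper: empty string"

  if (.not. all_upper("FORTRAN")) error stop "all_upper: uppercase only"
  if (all_upper("Fortran")) error stop "all_upper: mixed case"
  if (.not. all_upper("")) error stop "all_upper: empty string"

  print *, "any_upper and all_upper sketch ok"

contains

  ! True if at least one character is an uppercase letter.
  pure logical function any_upper(string)
    character(len=*), intent(in) :: string
    any_upper = scan(string, uppercase) > 0
  end function any_upper

  ! True if no character falls outside the uppercase letters.
  pure logical function all_upper(string)
    character(len=*), intent(in) :: string
    all_upper = verify(string, uppercase) == 0
  end function all_upper

end program any_upper_sketch
```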

@awvwgk awvwgk added topic: strings String processing and removed topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... labels Sep 18, 2021
4 participants