-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add char8_t mode #529
base: master
Are you sure you want to change the base?
Add char8_t mode #529
Conversation
char8 mode, enabled using the PUGIXML_CHAR8_MODE macro, uses C++20 char8_t instead of char for the UTF-8 interface. This makes use of pugixml safer when char is otherwise used for the system codepage. Stream-based methods received an additional overload, since the char overload may be used to represent arbitrary bytes, and the char8_t overload may be used by string streams. An additional typedef u8char_t, which represents the type pugixml uses for a UTF-8 code unit, was added for the conversion functions. Most changes had to be done in the test code. Representing raw bytes as string literals does not work for UTF-8 literals, since hex escape codes are interpreted as a Unicode character. Affected places either received a branch with a u8 literal or use a new RAW() macro which smuggles in UTF-8 code points using chars.
Just a question on the last note. (The rest sound good to me, thanks) |
@@ -217,6 +217,8 @@ PUGI__NS_BEGIN | |||
|
|||
#ifdef PUGIXML_WCHAR_MODE | |||
return wcslen(s); | |||
#elif defined(PUGIXML_CHAR8_MODE) | |||
return strlen(reinterpret_cast<const char*>(s)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is strlength not implemented in terms of std::char_traits<pugi::char_t>::length
, like you did for the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char_traits
are only available when compiling with STL-support. The tests don't have that restriction.
Is there any chance of this PR ever progressing? |
While the changes in the PR itself feel reasonable, I'd like to better understand how widely this will be used. This adds a new compilation mode; for thorough testing it would require increasing number of tested build configurations by 50%. In some sense it's fine, in some other sense it's a notable increase in complexity. It's a C++20 feature so it's not obvious how widely available it would be - I would be more comfortable with some indication from multiple people about this being very valuable in their work. I think something like this could be merged if there's sufficient demand for a new mode. If there's any way to reduce the differences in tests that would also be nice - this doesn't matter that much for the PR itself, but when adding new tests the wchar_t mode is already a problem because people tend to forget STR macros, so if any of the new macros could be replaced somehow that would be nice. It's not a big deal if they are necessary. |
# endif | ||
inline const char8_t* char_cast(const char* bytes) | ||
{ | ||
ALIASING_BARRIER(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this addressing a specific problem? It's really hard to tell without any comments.
I guess in a sense C++20 is a mixed blessing - the GHA builds will get noticeably longer but AppVeyor builds are much longer and are dominated by all old VS versions that won't support this anyway. Probably the main thing that would make me comfortable with merging this would be more indication that this is going to be helpful for various users. |
A little related although that one needs a separate fix to also avoid the footgun in regular mode: #378 |
For one, it would make pugixml future proof and help adoption of On Windows the situation is worse since
|
Overview
This PR adds a new mode to pugixml which uses C++20
char8_t
instead ofchar
for the UTF-8 interface. I've dubbed it char8 mode and it can be enabled using thePUGIXML_CHAR8_MODE
option/macro.Motivation
Using pugixml in a mixed ANSI/Unicode Windows codebase is quite errorprone.
char
is already used for the system codepage and it can easily happen that a UTF-8-as-char string slips through without conversion. I've foundchar8_t
it to be a huge help in migrating stuff to UTF-8.Implementation
Not much has to be changed. Since pugixml already calls C string functions with UTF-8, we can do the same with
char8_t
by just casting in a few places.Stream-based methods received an additional overload, since the
char
overload may be used to represent arbitrary bytes, and thechar8_t
overload may be used by string streams.An additional typedef
u8char_t
, which represents the type pugixml uses for a UTF-8 code unit, was added for the conversion functions.Most changes had to be done in the test code. Representing raw bytes as string literals does not work for UTF-8 literals, since hex escape codes are interpreted as a Unicode character (and replaced by multiple bytes). Affected places either received a branch with a u8 literal or use a new
RAW()
macro which smuggles in UTF-8 code points using chars.Notes
PUGIXML_CHAR8_MODE
modifies the UTF-8 interface, which meansPUGIXML_WCHAR_MODE
takes precedence if both are defined (no diagnostic).char8_t
file paths should be added as well?