-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ragel 7.0.0.12 fails to generate valid code for harbuzz 2.7.4 on PPC64 (big endian) #61
Comments
Thanks @KungFuJesus. I'm not quite sure how to debug this, without access to a ppc64 machine. Does gentoo have any common ones for this purpose? |
I'm not quite sure if they do. Adding to this complexity is the fact that the PPC64 machine I'm using is explicitly big endian, whereas a lot of the modern variants of PPC64 are bi-endian and tend to prefer little endian. Given that the code generator is just vanilla C and not emitting anything like inline assembly, I'd say the issue is likely endianness. If you need to throw me a patch to test, I should able to. It might even be possible that the bug is endian independent but little endian happens to mask it (this issue resembles an issue with substrings where an index is cut short somewhere). I've seen unioned types with mismatch sizes create similar issues like this, where the compiler lobs off the wrong bytes when the union'd type gets aliased. |
when i read the gentoo bug report a couple days ago, it appeared the issue might also happen on x86_64, fwiw. anyone tested that already ? |
I have an x64 machine and have not been able to reproduce it on there, so it's almost certainly endianness related. I don't know if it's a compiler bug or not, but I've observed when you have mismatched union types, the code that gets generated when casted to the larger type seems to drop the wrong bytes. One such example I can point to directly: I could see this interpreting a 32 bit integer as 0 by taking the wrong half of the word. |
in that case the best approach might be if you try to git bisect the ragel commits between 6.10 and master on your machine, looking at the specific commit that introduced the issue would probably allow an educated guess where the culprit is. what makes this a lot harder than it should be is that ragel depends on colm which is in a separate repo. an alternative might be to try to get an adelie linux iso for ppc64be and boot it in qemu. there's also the GCC buildfarm developers can get access to which has a huge variety of build hosts. |
Not sure it helps or not but in Gentoo the colm ebuild is separated from ragel, and is fixed in both builds to version 0.13.0.7. I should be able to keep that variable constant. |
Ok, so I did a bisect (for some of the commits the tree was in a very strange state, seemingly just wiki documentation), but I finally arrived at what I think is the troublesome commit: |
I should also add that's the first revision where the result for that line became uncompilable. I noticed some form of string corruption in form of the spacing around the equals disappear in a revision before that one (which maybe is the real source of the bug). Maybe after work I could bisect to find when that started occurring. |
Hi @KungFuJesus thank you for the investigation! A lot of the time people don't follow up on these issues, so I was happy to see some activity here. And yup, the ragel codebase has gone through some upheaval between 6 and 7. It's a bit of a long story :) So that commit is the one that defaults to the colm-based frontend. I'm suspecting the bug is purely in colm. Would you be able to run the colm test suite on master? Depending on the result of that, I should also be able to craft some colm programs that exercise the bug. But want to see the test results first. I can also imagine some hacks to ragel that might reveal the bug, but need to leave that for tomorrow. (I'm PST). |
|
Hmm, colm.d directory appears to be missing under test. |
Oh it's in the colm repo, I'll check that out. I'm using the distro provided ebuild for that, but I'm sure I can grab the exact version from a tag and try out the test. Except...the one version that the distro provides (0.13.0.7) seems to have no tag...hmm. I'll try the version just before it and hope they are close enough. |
Ok, so there wasn't a colm.d directory but I did do runtests, attached is the output. |
Here's the output from the master branch, with surprisingly one segfault: Here are the resulting diffs for those: |
Looks as though the segfault is due to a linked list corruption that happens prior to a "print_kid" call:
|
Oh looks like something is badly broken somewhere. I'm suspecting there is something in the parsing that isn't friendly to big-endian. It's resulting in corrupt trees that sometimes don't print properly and sometimes segfault. I'm wondering if there are other big-endian machines we can try this on. I used to use university machines to do that testing. Fastest way to get this fixed is for me to get on a machine where it exhibits. |
Well...to get a little bit more diversity, perhaps, I can give you an account to a Solaris 11 based SPARC-T4 I have running. However, it's quite loud and has the added complexity that usually ensues when trying to build packages on Solaris, no matter how few dependencies (usually some frustration in autotools/automake/autoconf + GCC). Perhaps I can try to see if it builds on it, first, then I can get you a shell on it |
as mentioned there's GCC compile farm. https://gcc.gnu.org/wiki/CompileFarm but you could also use a cheap mips device like a router, or qemu-system-mips running aboriginal linux system images, for example. |
I'll register with the compile farm, thanks for the tip! |
As for the MIPS route, they may not produce all endianness symptoms with big endian, as I believe a lot of embedded stuff is 32 bit. The linked issue above from sm64ex does not effect ppc32 big endian (not that I think we're seeing that issue here, just an example). |
Just got access to the GCC compile farm and the problem exhibits on both SPARC and PPC64, exactly as you have shown above in the testcase output log. |
@KungFuJesus @rofl0r Fixed the bug over in colm, and I confirmed that it fixes the issue generating the harfbuzz source. I will release a new version of ragel 7.0.3 that expects the version of colm with the fix. Thank you both for your work reporting and investigating this! |
for the record, this is the commit fixing it: adrian-thurston/colm@472a890 |
Ahh, so a similar class of issue to the automatic type casting and aliasing with non-uniform union members. Basically it's lobbing off the wrong half of the word. I do sort of wonder if this is a gray area in the C specification or if this a very platform specific behavior defined by the ABI (similar to how all characters are assumed to be unsigned rather than signed on POWER). |
Thank you for the work here, all. FWIW, I'm happy (on behalf of Gentoo) to try facilitate access to exotic architectures if it's needed for upstreams! |
this was just plain and simple UB (undefined behaviour). actually the current typecast the fix uses smells like UB too... but then i didn't read the actual type definitions. |
@rofl0r there is a problem if we try to push/pop something bigger in size than (void*). Perhaps an debug-mode assert somewhere would be nice defensive move. |
So it appears that 6.10 works fine, but 7.0.0.12 seems to drop a "= 1" after a variable declaration. Please see this gentoo bug report:
https://bugs.gentoo.org/763621
The text was updated successfully, but these errors were encountered: