LWJGL Forum

Programming => Bug Reports / RFE => Topic started by: cpw on January 30, 2019, 02:12:14

Title: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 30, 2019, 02:12:14
Hi
So this is quite the problem I seem to have. It seems 99% likely to be an issue with LWJGL 3.1.6 at this point - I can't try newer because I'm tied to what the game is using.

I am cpw, one of the forge platform developers for Minecraft. For the past few weeks, I've been tackling the upgrade to 1.13 of Minecraft, and I am repeatedly encountering JVM aborts and game crashes, with a wide variety of different behaviours.

I've encountered an actual "Stack smash" error on one occasion and numerous SEGV/SIGBUS and SEGV/SIGABRT errors, and I've captured multiple core dumps (the only commonality is some REALLY weird thread traces in them). I've also seen unmodifiable objects that are suddenly null, and methods that fail with an NPE at the end of the method (no functional code?!).

The computer I'm running on has passed all memtests I've thrown at it, and is quite capable of playing any other graphically intensive game going, including older versions of Minecraft, so although I initially thought "hardware (failing memory)" was the problem, I think I can safely discount that. It seems to me that the 1.13 update, with my rather wider-scoped use of threading, has caused LWJGL to overwrite parts of the stack somehow? I've read about the debugging of the new MemoryUtil functions, and turning on debugging seems to reduce the incidence of the issue somewhat - but it does not completely remove it. Furthermore, no errors are reported by any of the debugging I have enabled, and yet the crashes persist. Perhaps your reporting functions aren't durable to a JVM SEGV?

What I would like is guidance on how I can possibly debug this problem. I'm happy to try building local copies with additional debugging enabled, but I am not familiar with the Memory code.

For a view of what's happening, I would recommend my recent twitch streams - it happens pretty frequently, usually accompanied by a bout of swearing. This evening I dedicated an hour long stream to investigating this, with no real results. https://twitch.tv/cpw. I've created a collection, where I will gather highlights showing the crashes occurring (so they don't get deleted by twitch after 30 days): https://www.twitch.tv/collections/_20aCTl-fhUW6g

Thanks for your time!
cpw
verification:
echo -n "I am cpw on the lwjgl forums, I signed up on 29 Jan 2019" | sha256sum
https://twitter.com/voxcpw/status/1090432302528806912
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on January 30, 2019, 10:37:34
Hey cpw,

I'm sorry to hear about the troubles you're having. I must say though that it's unlikely this kind of crashing is caused by LWJGL itself. The code in LWJGL is so simple that there's not much room for weird bugs. It doesn't spawn internal threads or anything like that, and even utilities like the memory leak detection are the most stupid simple and inefficient code you can think of (this is all on purpose btw). The two places with some kind of sophistication are a) the shared library loader, which has matured over time and is considered stable and working as expected and b) the nasty business with OpenGL current context tracking and ThreadLocalUtil (read the comments in that class for details).

There were a couple of known issues with LWJGL 3.1.6 that have been resolved in subsequent releases but, if Minecraft was affected, there would be many more reports about them. So, the most likely cause would be one of: 1. a bug in one of the libraries Minecraft uses, 2. a bug in Minecraft itself, or 3. something wrong with the user's machine/environment. We can reasonably assume that it's not 2 (more users would be affected) or 3 (you don't have issues with other applications).

That leaves us with 1. Before going deeper into this, my first guess would be that you're hitting a bug in jemalloc. LWJGL 3.1.6 ships with jemalloc 5.0.1, which has caused troubles in many applications. LWJGL 3.2.0 updated jemalloc to the much more stable 5.1.0. Some quick things you could try:

- Run Minecraft with -Dorg.lwjgl.system.allocator=system
- Delete/move the jemalloc shared library so that LWJGL can't find it (equivalent to the first solution, LWJGL will fall back to the system allocator automatically)
- Build jemalloc 5.1.0 or current head and replace the shared library that comes with Minecraft. (you could also download a recent build from https://www.lwjgl.org/browse/nightly/linux/x64)
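For cases where editing the launch command is awkward (for example an IDE run configuration shared by a team), the first option can also be sketched in code. This is only a minimal sketch: the class and method names are hypothetical, and it assumes that LWJGL reads the org.lwjgl.system.allocator property when its memory classes are first initialized, so the property has to be set before any LWJGL class is touched.

```java
// Hypothetical helper, not part of LWJGL or Minecraft.
final class AllocatorOverride {
    // Property name taken from the -D flag suggested above.
    static final String ALLOCATOR_KEY = "org.lwjgl.system.allocator";

    /** Selects the system allocator unless the launcher already passed -D for this key. */
    static String preferSystemAllocator() {
        if (System.getProperty(ALLOCATOR_KEY) == null) {
            System.setProperty(ALLOCATOR_KEY, "system");
        }
        return System.getProperty(ALLOCATOR_KEY);
    }

    public static void main(String[] args) {
        // Assumption: this runs before any LWJGL class has been loaded.
        System.out.println(ALLOCATOR_KEY + "=" + preferSystemAllocator());
        // ... start the game from here
    }
}
```

An explicit -D flag on the command line still wins, because the helper only fills in the property when it is absent.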
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on January 30, 2019, 10:43:08
If that doesn't help, next step would be visiting https://builds.shipilev.net/ and downloading a fastdebug JDK build. Triggering a crash with it may provide more useful information.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 31, 2019, 00:24:10
Thanks for those tips. I wholeheartedly agree that this is 100% a corner case - I know I'm in a teeny tiny minority, daring to run Minecraft on Linux, but it's worked for 8 years, why stop now?

I'm not super familiar with the new memory architecture that is being used in LWJGL 3, so I think your analysis of jemalloc seems highly plausible. Certainly, one of the "big enhancements" I've been adding is that all modloading is MUCH more parallel than previously - concurrency bugs were always my top thought, but somehow LWJGL, or its associated libs, seemed the likely culprit.

What I will try is moving to the system allocator first and foremost. I gather the jemalloc allocator is of value, for performance reasons, so I will probably try the 5.1 version of same as well, and let you know outcomes.

Finally, if neither of those works, I'll grab that JDK.

Thanks again for the feedback.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: illy on January 31, 2019, 01:37:23
Wanted to chime in and say I have been able to recreate cpw's setup and can confirm that this is a bug I am running into. I'll try your suggestions and come back with updates.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 31, 2019, 03:38:34
Hi, so attached to this post are a couple of hs_err files from tonight's stream where we looked into your suggestions. 4364 is captured using fastdebug, 14558 is captured using oracle j8u201. None of your suggestions seemed to have any effect - the game persistently crashes whenever it is run from any kind of development environment (gradle or intellij). I even upgraded to LWJGL 3.2.1 (latest?) which didn't stop the occurrence of problems.

As you can also see, this problem isn't exclusive to my computer - Illy above has the same problem on Linux. Over the next few days, I am going to try to get a cohort of Linux gamers to give this a go in the desktop client, to see if the problem exists there as well.

I am honestly at a loss as to how we can further debug this issue at present. Nothing seems to have had a significant effect.

Also attached is a "stack smash" error I managed to get one time, the other day.

Thanks for listening. I hope we can figure out a way to figure out what's going on here.

cpw
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on January 31, 2019, 13:48:45
Quick update: I'm able to reproduce this with MinecraftForge (branch 1.13-pre), JDK 8u201, Ubuntu 18.10.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: jojomodding on January 31, 2019, 15:16:32
I can also reproduce this, using updated Arch Linux, Java 8 and Intel iGPU.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on January 31, 2019, 15:52:43
Workaround that seems to eliminate the crash for me: net.minecraft.client.MainWindow.java:297, replace GLFW.glfwWaitEventsTimeout(d0 - d1) with GLFW.glfwPollEvents().

I don't fully understand the reason behind this yet (using GLFW.glfwWaitEvents() without a timeout crashes too), but could you please check if this change makes a difference for you?
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 31, 2019, 15:58:24
I'll do that and report back. That's really fascinating.

I've gotten widespread evidence that the game, as launched from the native launcher, does not generally seem to crash with this problem, but anyone running any variant of the same code in an IDE or launched from gradle, is easily able to reproduce this problem. This again is really curious evidence.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 31, 2019, 16:16:24
Just got this, with the change (see in the window above). It seems that it didn't help..  :'(
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: ichttt on January 31, 2019, 17:04:19
I found a way to reproduce this every time:
Add -Xcomp to the JVM flags to force every class to be compiled. It is slow AF, but it always crashed on
Code: [Select]
C  [liblwjgl.so+0x21d2a]  Java_org_lwjgl_system_JNI_callPV__JIIIIJZ+0x1a
Full header here: https://pastebin.com/eXiwYJRW
Tested in a dev environment and in a compiled production environment; this flag always causes this crash.
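For readers unfamiliar with JNI symbol names: the crashing frame names the native method callPV in org.lwjgl.system.JNI, and the characters after the double underscore encode the argument types of that overload. A minimal sketch of decoding that suffix follows. The class is hypothetical and only handles primitive descriptors, which is all this particular symbol contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading hs_err frames; handles primitive descriptors only.
final class JniSymbolDecoder {
    /** Decodes the argument descriptor that follows "__" in a JNI symbol name. */
    static List<String> decodeArgs(String symbol) {
        int sep = symbol.lastIndexOf("__");
        if (sep < 0) {
            throw new IllegalArgumentException("no overload suffix in " + symbol);
        }
        List<String> types = new ArrayList<>();
        for (char c : symbol.substring(sep + 2).toCharArray()) {
            switch (c) {
                case 'Z': types.add("boolean"); break;
                case 'B': types.add("byte"); break;
                case 'C': types.add("char"); break;
                case 'S': types.add("short"); break;
                case 'I': types.add("int"); break;
                case 'J': types.add("long"); break;
                case 'F': types.add("float"); break;
                case 'D': types.add("double"); break;
                default:
                    throw new IllegalArgumentException("unsupported descriptor char: " + c);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [long, int, int, int, int, long, boolean]
        System.out.println(decodeArgs("Java_org_lwjgl_system_JNI_callPV__JIIIIJZ"));
    }
}
```

So the frame is the callPV overload taking (long, int, int, int, int, long, boolean), which narrows down which generated binding was executing at the time of the crash.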
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: ichttt on January 31, 2019, 18:22:08
Well, never mind that last response. Just found out that this is a bug with the compiler fixed in j10, and using -XX:-CriticalJNINatives prevents the issue.
Using -XX:-CriticalJNINatives together with -Xcomp fixes the issue for me.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on January 31, 2019, 22:39:04
Some research updates. The vanilla launcher does not trigger this crash, seemingly ever. It seems to be exclusive to development type environments, where the game is launched from either an IDE or gradle.

It seems to be much harder (impossible) to trigger the crash if G1GC is enabled, instead of the default. But that's not definitive yet.

We've been running with -verbose:jni and -Xcheck:jni enabled. We can see that whenever it crashes, a small cluster of the same JNI methods are being bound.
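Since the crash depends on how the JVM is launched (vanilla launcher versus IDE or gradle) and possibly on the collector in use, it can help to dump both from inside the process and compare the output across environments. The following is a minimal sketch using only the standard java.lang.management API. The class name is hypothetical, and it would need to be called from somewhere early in startup.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Forge, Minecraft or LWJGL.
final class JvmLaunchReport {
    /** Collects the Java version, the JVM arguments and the active collectors. */
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        lines.add("java.version=" + System.getProperty("java.version"));
        // Flags such as -Xss, -Xcomp or -XX:+UseG1GC show up here if the launcher passed them.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            lines.add("jvm-arg " + arg);
        }
        // The collector names reveal which GC the JVM actually selected.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add("collector " + gc.getName());
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Diffing this output between a launcher run and a gradle or IDE run shows which flags and which collector actually differ, instead of relying on what each launcher is believed to pass.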
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on February 01, 2019, 08:59:43
Quote from: ichttt on January 31, 2019, 18:22:08
Just found out that this is a bug with the compiler fixed in j10, and using -XX:-CriticalJNINatives prevents the issue.

LWJGL needs to support JDK 8, so it skips critical natives for functions affected by JDK-8167409 (https://bugs.openjdk.java.net/browse/JDK-8167409) on Linux & macOS. Unfortunately it missed certain functions, they will be fixed in 3.2.2. This issue is unrelated to the crash we're investigating though.

Quote from: cpw on January 31, 2019, 22:39:04
Some research updates. The vanilla launcher does not trigger this crash, seemingly ever. It seems to be exclusive to development type environments, where the game is launched from either an IDE or gradle.
It seems to be much harder (impossible) to trigger the crash if G1GC is enabled, instead of the default. But that's not definitive yet.
We've been running with -verbose:jni and -Xcheck:jni enabled. We can see that whenever it crashes, a small cluster of the same JNI methods are being bound.

Using G1GC or SerialGC didn't help, I'm still getting crashes.

The only thing that completely eliminates the issue for me is what I said above, replacing glfwWaitEvents with glfwPollEvents. With that change, I cannot reproduce the crash anymore. Neither with the IntelliJ run configuration, nor with Gradle's forge:runclient from the terminal.

Quote from: cpw on January 31, 2019, 16:16:24
Just got this, with the change (see in the window above). It seems that it didn't help..  :'(

I have a feeling that the stack smashing crash may also be a separate issue. When it happens, it happens very early in the program execution, whereas the "usual" crashes happen after the engine has finished loading (after the splash screen). It may have something to do with how IntelliJ launches the application, has it ever happened to you when launched from Gradle?
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: cpw on February 01, 2019, 16:59:54

Quote from: spasi on February 01, 2019, 08:59:43
LWJGL needs to support JDK 8, so it skips critical natives for functions affected by JDK-8167409 (https://bugs.openjdk.java.net/browse/JDK-8167409) on Linux & macOS. Unfortunately it missed certain functions, they will be fixed in 3.2.2. This issue is unrelated to the crash we're investigating though.
Are you sure it doesn't affect it? It seems that running with JDK 10 has made the problem completely disappear as well. So maybe there is a relationship here? JDK 10 has that issue fixed, as I understand it.

Quote from: spasi on February 01, 2019, 08:59:43
Using G1GC or SerialGC didn't help, I'm still getting crashes.
Interesting, I could not recreate the issue at all with it. Something about the vanilla launcher seems to prevent it. Perhaps -Xss (seems unlikely, and fiddling didn't change outcomes as far as I can tell).

Quote from: spasi on February 01, 2019, 08:59:43
The only thing that completely eliminates the issue for me is what I said above, replacing glfwWaitEvents with glfwPollEvents. With that change, I cannot reproduce the crash anymore. Neither with the IntelliJ run configuration, nor with Gradle's forge:runclient from the terminal.
That's curious. I tried that, and it didn't eliminate it for me. It still crashed, less frequently, but still crashing.

Quote from: spasi on February 01, 2019, 08:59:43
I have a feeling that the stack smashing crash may also be a separate issue. When it happens, it happens very early in the program execution, whereas the "usual" crashes happen after the engine has finished loading (after the splash screen). It may have something to do with how IntelliJ launches the application, has it ever happened to you when launched from Gradle?

I have had it happen twice. It seems to be at the point where LWJGL is trying to load its code for me, which is why I felt it was not uncorrelated. It is a lot rarer as a problem, though.

Anyway, for right now, my workaround is to use J10, for running the game. It fixes the problem for me, as far as I can tell. 20+ runs without this crash seems pretty definitive. We still build against J8, and J8 is the default setup everyone will be using, it's just a dev-time workaround for now.
Title: Re: Probable stack smash in LWJGL 3.1.6 on Linux with J8
Post by: spasi on February 01, 2019, 18:16:09
Quote from: cpw on February 01, 2019, 16:59:54
Are you sure it doesn't affect it? It seems that running with JDK 10 has made the problem completely disappear as well. So maybe there is a relationship here? JDK 10 has that issue fixed, as I understand it.

If it was affecting anything, simply running with -XX:-CriticalJNINatives (on JDK 8) would eliminate the crashes. Also, G1GC is the default GC since JDK 9, so maybe JDK 10 isn't crashing for the same reason JDK 8 with -XX:+UseG1GC isn't crashing for you.

So, we still haven't identified a universal fix for this. The issue seemingly goes away when the performance characteristics of the execution change (-Xcomp, poll vs wait, G1GC vs parallel, etc), which suggests a nasty race somewhere. But I've no idea what to blame (IntelliJ/Gradle? Minecraft/Forge? LWJGL? The JVM?).