Probable stack smash in LWJGL 3.1.6 on Linux with J8

Started by cpw, January 30, 2019, 02:12:14

cpw

Hi
So this is quite the problem I seem to have. It seems 99% likely to be an issue with LWJGL 3.1.6 at this point - I can't try newer because I'm tied to what the game is using.

I am cpw, one of the forge platform developers for Minecraft. For the past few weeks, I've been tackling the upgrade to 1.13 of Minecraft, and I am repeatedly encountering JVM aborts and game crashes, with a wide variety of different behaviours.

I've encountered an actual "Stack smash" error on one occasion, numerous SEGV/SIGBUS and SEGV/SIGABRT errors, captured multiple core dumps (the only commonality is some REALLY weird thread traces in them), unmodifiable objects that are suddenly null, and methods that fail with NPE at the end of the method (no functional code?!).

The computer I'm running on has passed all memtests I've thrown at it, and is quite capable of playing any other graphically intensive game going, including older versions of Minecraft, so although I initially thought "hardware (failing memory)" as the problem, I think I can safely discount that. It seems to me that the 1.13 update, with my rather wider scoped use of threading, seems to have caused LWJGL to overwrite parts of the stack somehow? I've read about the debugging of the new MemoryUtil functions, and turning on debugging seems to reduce the incidence of the issue somewhat - but it does not completely remove it. Furthermore, no errors seem to be reported by any of the debugging I have enabled, and yet the error persists. Perhaps your reporting functions aren't durable to a JVM SEGV?

What I would like is guidance on how I can possibly debug this problem. I'm happy to try building local copies with additional debugging enabled, but I am not familiar with the Memory code.

For a view of what's happening, I would recommend my recent twitch streams - it happens pretty frequently, usually accompanied by a bout of swearing. This evening I dedicated an hour long stream to investigating this, with no real results. https://twitch.tv/cpw. I've created a collection, where I will gather highlights showing the crashes occurring (so they don't get deleted by twitch after 30 days): https://www.twitch.tv/collections/_20aCTl-fhUW6g

Thanks for your time!
cpw
verification:
echo -n "I am cpw on the lwjgl forums, I signed up on 29 Jan 2019" | sha256sum
https://twitter.com/voxcpw/status/1090432302528806912

spasi

Hey cpw,

I'm sorry to hear about the troubles you're having. I must say though that it's unlikely this kind of crashing is caused by LWJGL itself. The code in LWJGL is so simple that there's simply not much room for weird bugs. It doesn't spawn internal threads or anything like that, and even utilities like the memory leak detection are the most stupidly simple and inefficient code you can think of (this is all on purpose btw). The two places with some kind of sophistication are a) the shared library loader, which has matured over time and is considered stable and working as expected, and b) the nasty business with OpenGL current context tracking and ThreadLocalUtil (read the comments in that class for details).

There were a couple of known issues with LWJGL 3.1.6 that have been resolved in subsequent releases but, if Minecraft was affected, there would be many more reports about them. So, the most likely causes would be: 1. a bug in one of the libraries Minecraft uses, 2. a bug in Minecraft itself, or 3. something wrong with the user's machine/environment. We can reasonably assume that it's not 2 (more users would be affected) or 3 (you don't have issues with other applications).

That leaves us with 1. Before going deeper into this, my first guess would be that you're hitting a bug in jemalloc. LWJGL 3.1.6 ships with jemalloc 5.0.1, which has caused troubles in many applications. LWJGL 3.2.0 updated jemalloc to the much more stable 5.1.0. Some quick things you could try:

- Run Minecraft with -Dorg.lwjgl.system.allocator=system
- Delete/move the jemalloc shared library so that LWJGL can't find it (equivalent to the first solution, LWJGL will fall back to the system allocator automatically)
- Build jemalloc 5.1.0 or current head and replace the shared library that comes with Minecraft. (you could also download a recent build from https://www.lwjgl.org/browse/nightly/linux/x64)
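
For anyone who cannot easily change the launch command, the first option can also be sketched in code. This is only a sketch under assumptions: the property key is the one from the flag above, the class name `AllocatorSelection` is made up for illustration, and it assumes LWJGL reads the property during its own initialization, so the call would have to run before any LWJGL class is touched.

```java
// Hypothetical helper, not part of LWJGL or Minecraft.
final class AllocatorSelection {
    // Same key as the -Dorg.lwjgl.system.allocator=system flag above.
    static final String ALLOCATOR_PROPERTY = "org.lwjgl.system.allocator";

    // Sets the property and returns the value now in effect.
    // Assumption: this runs before LWJGL initializes its allocator.
    static String selectSystemAllocator() {
        System.setProperty(ALLOCATOR_PROPERTY, "system");
        return System.getProperty(ALLOCATOR_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(ALLOCATOR_PROPERTY + "=" + selectSystemAllocator());
    }
}
```

Passing the flag on the command line remains the safer route, since it cannot be affected by class loading order.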

spasi

If that doesn't help, next step would be visiting https://builds.shipilev.net/ and downloading a fastdebug JDK build. Triggering a crash with it may provide more useful information.

cpw

Thanks for those tips. I wholeheartedly agree that this is 100% a corner case - I know I'm in a teeny tiny minority, daring to run Minecraft on Linux, but it's worked for 8 years, why stop now?

I'm not super familiar with the new memory architecture that is being used in LWJGL 3, so I think your analysis of jemalloc seems highly plausible. Certainly, one of the "big enhancements" I've been adding is that all modloading is MUCH more parallel than previously - concurrency bugs were always my top thought, but somehow LWJGL, or its associated libs, seemed likely the culprit.

What I will try is moving to the system allocator first and foremost. I gather the jemalloc allocator is of value, for performance reasons, so I will probably try the 5.1 version of same as well, and let you know outcomes.

Finally, if neither of those works, I'll grab that JDK.

Thanks again for the feedback.

illy

Wanted to chime in and say I have been able to recreate cpw's setup and can confirm that this is a bug I am running into. I'll try your suggestions and come back with updates.

cpw

Hi, so attached to this post are a couple of hs_err files from tonight's stream where we looked into your suggestions. 4364 is captured using fastdebug, 14558 is captured using Oracle j8u201. None of your suggestions seemed to have any effect - the game persistently crashes whenever it is run from any kind of development environment (gradle or intellij). I even upgraded to LWJGL 3.2.1 (latest?), which didn't stop the occurrence of problems.

As you can also see, this problem isn't exclusive to my computer either - Illy above has the same problem on Linux as well. I am going to try and get a cohort of Linux gamers to give this a go in the desktop client as well, over the next few days, to see if the problem exists there as well.

I am honestly at a loss as to how we can further debug this issue at present. Nothing seems to have had a significant effect.

Also attached is a "stack smash" error I managed to get one time, the other day.

Thanks for listening. I hope we can figure out a way to figure out what's going on here.

cpw

spasi

Quick update: I'm able to reproduce this with MinecraftForge (branch 1.13-pre), JDK 8u201, Ubuntu 18.10.

jojomodding

I can also reproduce this, using updated Arch Linux, Java 8 and Intel iGPU.

spasi

Workaround that seems to eliminate the crash for me: net.minecraft.client.MainWindow.java:297, replace GLFW.glfwWaitEventsTimeout(d0 - d1) with GLFW.glfwPollEvents().

I don't fully understand the reason behind this yet (using GLFW.glfwWaitEvents() without a timeout crashes too), but could you please check if this change makes a difference for you?

cpw

I'll do that and report back. That's really fascinating.

I've gotten widespread evidence that the game, as launched from the native launcher, does not generally seem to crash with this problem, but anyone running any variant of the same code in an IDE or launched from gradle, is easily able to reproduce this problem. This again is really curious evidence.

cpw

Just got this, with the change (see in the window above). It seems that it didn't help..  :'(

ichttt

I found a way to reproduce this every time:
Add -Xcomp to the JVM flags to force every class to be compiled. It is slow AF, but it always crashed on
C  [liblwjgl.so+0x21d2a]  Java_org_lwjgl_system_JNI_callPV__JIIIIJZ+0x1a

Full header here: https://pastebin.com/eXiwYJRW
Tested in a dev environment and in a compiled production environment, this flag always causes this.

ichttt

Well, nevermind that last response. Just found out that this is a bug with the compiler fixed in j10, and using -XX:-CriticalJNINatives prevents the issue.
So using -XX:-CriticalJNINatives together with -Xcomp fixes the issue for me.

cpw

Some research updates. The vanilla launcher does not trigger this crash, seemingly ever. It seems to be exclusive to development type environments, where the game is launched from either an IDE or gradle.

It seems to be much harder (impossible) to trigger the crash if G1GC is enabled, instead of the default. But that's not definitive yet.

We've been running with -verbose:jni and -Xcheck:jni enabled. We can see that whenever it crashes, a small cluster of the same JNI methods are being bound.
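
Since the crash seems to depend on how the game is launched, it may help to dump the JVM arguments from each environment and compare them. A minimal sketch, using only the standard `java.lang.management` API; the class name `JvmFlagReport` and the list of flags to look for are illustrative choices (taken from flags mentioned in this thread, plus the standard G1 switch), not anything Forge or LWJGL provides.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper for comparing launcher, IDE and Gradle runs.
final class JvmFlagReport {
    // Flags discussed in this thread; extend as needed.
    static final List<String> FLAGS_OF_INTEREST = Arrays.asList(
            "-Xcomp", "-Xcheck:jni", "-verbose:jni",
            "-XX:-CriticalJNINatives", "-XX:+UseG1GC");

    // Returns the flags of interest that appear in the given JVM arguments,
    // in the order of FLAGS_OF_INTEREST.
    static List<String> present(List<String> jvmArgs) {
        List<String> found = new ArrayList<>();
        for (String flag : FLAGS_OF_INTEREST) {
            if (jvmArgs.contains(flag)) {
                found.add(flag);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to the main class).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("all JVM args = " + jvmArgs);
        System.out.println("flags of interest = " + present(jvmArgs));
    }
}
```

Running the same report from the vanilla launcher, from IntelliJ and from Gradle would show at a glance whether the environments differ in GC selection or JNI-related flags.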

spasi

Quote from: ichttt on January 31, 2019, 18:22:08
Just found out that this is a bug with the compiler fixed in j10, and using -XX:-CriticalJNINatives prevents the issue.

LWJGL needs to support JDK 8, so it skips critical natives for functions affected by JDK-8167409 on Linux & macOS. Unfortunately it missed certain functions, they will be fixed in 3.2.2. This issue is unrelated to the crash we're investigating though.

Quote from: cpw on January 31, 2019, 22:39:04
Some research updates. The vanilla launcher does not trigger this crash, seemingly ever. It seems to be exclusive to development type environments, where the game is launched from either an IDE or gradle.

It seems to be much harder (impossible) to trigger the crash if G1GC is enabled, instead of the default. But that's not definitive yet.

We've been running with -verbose:jni and -Xcheck:jni enabled. We can see that whenever it crashes, a small cluster of the same JNI methods are being bound.

Using G1GC or SerialGC didn't help, I'm still getting crashes.

The only thing that completely eliminates the issue for me is what I said above, replacing glfwWaitEvents with glfwPollEvents. With that change, I cannot reproduce the crash anymore. Neither with the IntelliJ run configuration, nor with Gradle's forge:runclient from the terminal.

Quote from: cpw on January 31, 2019, 16:16:24
Just got this, with the change (see in the window above). It seems that it didn't help..  :'(

I have a feeling that the stack smashing crash may also be a separate issue. When it happens, it happens very early in the program execution, whereas the "usual" crashes happen after the engine has finished loading (after the splash screen). It may have something to do with how IntelliJ launches the application. Has it ever happened to you when launched from Gradle?