Probable stack smash in LWJGL 3.1.6 on Linux with J8

Started by cpw, January 30, 2019, 02:12:14

Previous topic - Next topic

cpw

Quote from: spasi on February 01, 2019, 08:59:43

LWJGL needs to support JDK 8, so it skips critical natives for functions affected by JDK-8167409 on Linux & macOS. Unfortunately it missed certain functions, they will be fixed in 3.2.2. This issue is unrelated to the crash we're investigating though.
Are you sure it doesn't affect it? It seems that running with JDK 10 has made the problem completely disappear as well. So maybe there is a relationship here? JDK 10 has that issue fixed, as I understand it.

Quote from: spasi on February 01, 2019, 08:59:43
Using G1GC or SerialGC didn't help, I'm still getting crashes.
Interesting, I could not recreate the issue at all with it. Something about the vanilla launcher seems to prevent it. Perhaps -Xss (seems unlikely, and fiddling didn't change outcomes as far as I can tell).

Quote from: spasi on February 01, 2019, 08:59:43
The only thing that completely eliminates the issue for me is what I said above, replacing glfwWaitEvents with glfwPollEvents. With that change, I cannot reproduce the crash anymore. Neither with the IntelliJ run configuration, nor with Gradle's forge:runclient from the terminal.
That's curious. I tried that, and it didn't eliminate it for me. It still crashed, less frequently, but still crashing.

Quote from: spasi on February 01, 2019, 08:59:43
I have a feeling that the stack smashing crash may also be a separate issue. When it happens, it happens very early in the program execution, whereas the "usual" crashes happen after the engine has finished loading (after the splash screen). It may have something to do with how IntelliJ launches the application, has it ever happened to you when launched from Gradle?

I have had it happen twice. It seems to be at the point where LWJGL is trying to load it's code for me, which is why I felt it was not uncorrelated. It is a lot rarer, as a problem though.

Anyway, for right now, my workaround is to use J10, for running the game. It fixes the problem for me, as far as I can tell. 20+ runs without this crash seems pretty definitive. We still build against J8, and J8 is the default setup everyone will be using, it's just a dev-time workaround for now.

spasi

Quote from: cpw on February 01, 2019, 16:59:54Are you sure it doesn't affect it? It seems that running with JDK 10 has made the problem completely disappear as well. So maybe there is a relationship here? JDK 10 has that issue fixed, as I understand it.

If it was affecting anything, simply running with -XX:-CriticalJNINatives ( on JDK 8 ) would eliminate the crashes. Also, G1GC is the default GC since JDK 9, so maybe JDK 10 isn't crashing for the same reason JDK8+UseG1GC isn't crashing for you.

So, we still haven't identified a universal fix for this. The issue seemingly goes away when the performance characteristics of the execution change (-Xcomp, poll vs wait, G1GC vs parallel, etc), which suggests a nasty race somewhere. But I've no idea what to blame (IntelliJ/Gradle? Minecraft/Forge? LWJGL? The JVM?).

princec

Thought I'd better join in on this thread as it now seems I've got the same issue with 3.1.6.

So far, no combination of switches has eliminated the problem - it comes and goes randomly. Today, everything just stopped working, after several days of no problems at all. This is on Windows 10 / OpenJDK 11.

Cas :)

CoDi

Are there any debug builds of LWJGL's native libraries or jemalloc to possibly get more verbose crash logs?

Also, I've had some success locating memory corruption / use-after-free crashes with Microsoft Application Verifier. I never tried to use it with a Java application though.

princec

Ok, riddle me this - what's wrong with this piece of code:
try (GLFWImage.Buffer imageBuffer = GLFWImage.create(images.length)) {
			for (GLFWImage image : images) {
				imageBuffer.put(image);
			}
			imageBuffer.flip();
			glfwSetWindowIcon(window, imageBuffer);
			for (GLFWImage image : images) {
				image.free();
			}
		}


Cas :)

princec

Well, I shall let on:
java.lang.IllegalStateException: The memory address specified is not being tracked
	at org.lwjgl.system.MemoryManage$DebugAllocator.untrack(MemoryManage.java:192)
	at org.lwjgl.system.MemoryManage$DebugAllocator.free(MemoryManage.java:153)
	at org.lwjgl.system.MemoryUtil.nmemFree(MemoryUtil.java:254)
	at org.lwjgl.system.CustomBuffer.free(CustomBuffer.java:63)
	at org.lwjgl.system.NativeResource.close(NativeResource.java:20)

with
-Dorg.lwjgl.util.DebugAllocator=true
-Dorg.lwjgl.util.DebugAllocator.internal=true

Strikes me as being incorrect; a NativeResource such as GLFWImage.Buffer implements AutoClosable.

Cas :)

princec

What's interesting is the game stops crashing when I move that GLFWImage.Buffer out into a static and don't allocate it in a try-with-resources (and thus don't attempt to free it).

Cas :)

spasi

Quote from: princec on February 22, 2019, 16:45:13Ok, riddle me this - what's wrong with this piece of code:
try (GLFWImage.Buffer imageBuffer = GLFWImage.create(images.length)) {
			for (GLFWImage image : images) {
				imageBuffer.put(image);
			}
			imageBuffer.flip();
			glfwSetWindowIcon(window, imageBuffer);
			for (GLFWImage image : images) {
				image.free();
			}
		}

edit: removed previous reply because I misunderstood the code.

The problem is that GLFWImage.create uses ByteBuffer.allocateDirect to do the allocation. I.e. it's memory tracked by the JVM/GC, and should not be used in a try-with-resources block, you cannot free it explicitly. If you change it to GLFWImage.malloc, it will work.

From previous reply: Also, you don't need a Java array of GLFWImages AND a GLFWImage.Buffer. It's extra allocations and data copies for no reason. For example you could do this:

ByteBuffer icon16;
ByteBuffer icon32;
try {
    icon16 = ioResourceToByteBuffer("lwjgl16.png", 2048);
    icon32 = ioResourceToByteBuffer("lwjgl32.png", 4096);
} catch (Exception e) {
    throw new RuntimeException(e);
}

try (GLFWImage.Buffer icons = GLFWImage.malloc(2)) {
    ByteBuffer pixels16 = Objects.requireNonNull(stbi_load_from_memory(icon16, w, h, comp, 4));
    icons
        .get(0)
        .width(w.get(0))
        .height(h.get(0))
        .pixels(pixels16);

    ByteBuffer pixels32 = Objects.requireNonNull(stbi_load_from_memory(icon32, w, h, comp, 4));
    icons
        .get(1)
        .width(w.get(0))
        .height(h.get(0))
        .pixels(pixels32);

    glfwSetWindowIcon(window, icons);

    stbi_image_free(pixels32);
    stbi_image_free(pixels16);
}

spasi

Hey Cas,

Did my reply above help explain what was happening? Have the crashes disappeared completely now? If yes, could you also please try LWJGL 3.2.1 or even the current 3.2.2 snapshot? I'm losing sleep over this thing and want to make sure there's nothing seriously wrong with the library.

princec

I will try with 3.2.2 tonight. Although I would be pleasantly surprised if the problems went away I am rather sceptical, as one of the first things I did to try and nail down the root cause of the bug was to remove the icon setting code, and it didn't help.

Regarding that bit of code... it does and doesn't help. To my mind I'm using the API in the correct way: GLFWImage.Buffer implements Autoclosable, therefore, it must be usable in a try-with-resources block. If this is not the case, it must not implement Autoclosable. Even if this is explicitly stated in the Javadoc, it still fails because code itself can't read Javadocs and does what it is supposed to do when confronted by interfaces with a specific contract. The API as it stands (and anywhere else this may be occurring) is simply asking for big trouble, and looks like it's found it too.

Even so - there should be actual runtime exception checks that are always on for management of this sort of thing in LWJGL I think. Correctness IMO always trumps performance. If I were worried about pure performance, I'd be using C.

Cas :)

spasi

Quote from: princec on February 25, 2019, 10:09:57
Regarding that bit of code... it does and doesn't help. To my mind I'm using the API in the correct way: GLFWImage.Buffer implements Autoclosable, therefore, it must be usable in a try-with-resources block. If this is not the case, it must not implement Autoclosable. Even if this is explicitly stated in the Javadoc, it still fails because code itself can't read Javadocs and does what it is supposed to do when confronted by interfaces with a specific contract. The API as it stands (and anywhere else this may be occurring) is simply asking for big trouble, and looks like it's found it too.

Even so - there should be actual runtime exception checks that are always on for management of this sort of thing in LWJGL I think. Correctness IMO always trumps performance. If I were worried about pure performance, I'd be using C.

I totally get your point of view and the concerns you raise have been carefully considered... years ago.

The chosen solution for correctness was the DebugAllocator. The exception you were seeing ("The memory address specified is not being tracked") was saying that someone tried to free a pointer that was never allocated by an explicit allocation API. It also triggers on a double-free. It is not enabled by default because of the performance impact. The current implementation uses a global ConcurrentHashMap from Long to Allocation, where Allocation holds the allocation size, thread ID and stacktrace where the allocation happened. Even with the new stack walking API in Java 9, this is very expensive. Best we could do is make the DebugAllocator implementation configurable and have an alternative implementation (per-thread state, primitive collections, no call stack tracking) that might be more reasonable to have always enabled during development (e.g. enabled-by-default in Debug mode, without a separate switch).

One might think that doing some kind of tracking or detection per buffer/struct would be ideal. First of all, buffers come from the JDK and there's no way to add extra state, so immediately we don't have a general solution, we're left with structs only. Most importantly though, this is virtually an impossible problem to solve: The backing buffer might be sliced/duplicated. A Struct might have come from a StructBuffer. The StructBuffer itself might be sliced/duplicated. We'd have to recurse through the chain, doing instanceof checks and then we'd still need to examine private NIO buffer data. And we'd still not be 100% sure about ownership, we'd have to consider the API used: We just got a pointer from an API, are we responsible for freeing it or will the API take care of it? It becomes too much. And it's not like C does anything better about this, tracking pointer ownership is the developer's responsibility. Rust is the only language with robust ownership tracking, but we can't expect LWJGL to be comparable to Rust, can we?

Anyway, I think getting used to the API more will help. Quick reminder of how the "what to use for allocation" algorithm goes:

1. Is it a short-lived, small-sized allocation? Use the MemoryStack.
    * Anything allocated via the MemoryStack must NOT be explicitly freed.
    * You manipulate the stack instead, a pop will free anything allocated after the last push. This can be done either automatically in a try-with-resources block, or by calling push/pop explicitly.
    * Yes, the MemoryStack is AutoCloseable too and close() delegates to pop().

2. Is it a long-lived and/or big allocation and it has a clearly defined life-cycle? Use the explicit allocation/deallocation API.
    * If it's called malloc/calloc/realloc, it's an explicit allocation that must be freed manually.
    * This is where you'd use try-with-resources to do the free safely/automatically.

3. Weird life-cycle or too much hassle to track liveness? Use BufferUtils.
    * If it starts with "create", it uses ByteBuffer.allocateDirect under the hood. It must NOT be explicitly freed.
    * Same semantics as every allocation in an LWJGL 2-based application, the allocation will be freed by the GC.

princec

Then I think that it had better not implement Autoclosable if it is not actually closeable (and doubly so if it causes a crash!) The actual returned implementation could implement it, and I could conceivably cast it to Autoclosable to get try-with-resources to look a bit prettier. I don't think it is at all sensible to break the contract of an API, especially not one that's actually now built-in to the JLS.

So the way around it is, for things that are created with calloc/malloc, have the returned things implement something like
interface Calloced extends AutoCloseable {
   default void free() { close(); }
}


and have things created with BufferUtils simply... not implement that interface. The abstract base class of the allocated object clearly then can't implement a free() or close() method, so you'll have to generate specialised concrete subclasses, eg GLFWImageCalloced or somesuch. A bit of a mouthful maybe. But look what a mess it gets everything in to when it gets accidentally used wrong, which it will I imagine quite a lot, and the symptoms are every bit as unhelpful as a C program exploding rendering the use of Java a bit pointless when one of the main reasons to use it is program correctness.

It'll all be moot when Valhalla turns up anyway so I can't help but wondering if this is perhaps just rearranging deckchairs on the Titanic anyway, saving a few cycles here and there for a very tiny subset of programs that will ever exist before it's all replaced by something more straightforward anyway in due course.

I know I should have stuck my oar in years ago but I was too busy at the time... anyway, back to bug reproduction...

Cas :)

princec

With 3.2.1, and tweaked icon code... currently not seeing a crash. So that's hopeful.

Cas :)

spasi

Quote from: princec on February 25, 2019, 21:23:50So the way around it is, for things that are created with calloc/malloc, have the returned things implement something like
interface Calloced extends AutoCloseable {
   default void free() { close(); }
}


and have things created with BufferUtils simply... not implement that interface. The abstract base class of the allocated object clearly then can't implement a free() or close() method, so you'll have to generate specialised concrete subclasses, eg GLFWImageCalloced or somesuch. A bit of a mouthful maybe. But look what a mess it gets everything in to when it gets accidentally used wrong, which it will I imagine quite a lot, and the symptoms are every bit as unhelpful as a C program exploding rendering the use of Java a bit pointless when one of the main reasons to use it is program correctness.

I'd much rather remove AutoCloseable from structs in LWJGL 3.3. As I tried to explain, there's no scheme we could come up with that will cover all cases. It's going to be ugly and a waste of time for minimal gain.

Quote from: princec on February 25, 2019, 21:23:50It'll all be moot when Valhalla turns up anyway so I can't help but wondering if this is perhaps just rearranging deckchairs on the Titanic anyway, saving a few cycles here and there for a very tiny subset of programs that will ever exist before it's all replaced by something more straightforward anyway in due course.

Valhalla is going to help with value types, but what we're waiting for is Panama (and of course its integration with Valhalla). You'd still need to pass a pointer to the struct data to native code, how this ends up being represented in Panama is going to matter... a lot. Is it going to be zero-copy? Is it going to be as convenient as in C? The current prototype uses Scope in try-with-resources for this and is still not close to providing satisfying answers.

AutoCloseable is not about performance, it's just a convenience. It did come into the picture because of a performance concern (using malloc/calloc instead of BufferUtils), but that's justified on its own. We're talking much more than a few cycles and having instances that do not automatically escape (which is what happens with ByteBuffer.allocateDirect).

Btw, not sure if you noticed already, not all structs are AutoCloseable. Only those that are "mallocable" and used as input to some API. So, GLFWImage is AutoCloseable, but GLFWVidMode is not (only returned by GLFW).

Finally, one more thing I'd like to ask: how did you end up using try-with-resources when you first wrote this? My guess is, you didn't really inspect LWJGL's source code (AutoCloseable is hidden behind the NativeResource interface), but it was Eclipse that warned you about it? Not solving the real problem, but does it help if you disable these warnings?

- Code style -> Resource not managed via try-with-resources (1.7 or higher)
- Potential programming problems -> Potential resource leak

(they were called like that back in 2016, not sure about now)

Quote from: princec on February 25, 2019, 21:32:00With 3.2.1, and tweaked icon code... currently not seeing a crash. So that's hopeful.

Great, thanks!

princec

I agree it's not worth changing course with LWJGL3.x at this point, and wait-and-see what Valhalla/Panama bring to the table before doing another ground-up redesign for LWJGL4.

I tend to keep most warning enabled in Eclipse - and indeed I'd never have thought to put it in try-with-resources if Eclipse hadn't warned me about it. But consider an API (of admittedly obscure function) that consumes AutoCloseables and closes them without regard to knowledge of what it is actually closing - the contract of AutoCloseable says in no uncertain terms how it should be behaving under any circumstances and here's a circumstance where it clearly isn't following the party line. Given the amount of hassle it's clearly already caused - and the sheer embuggerance of tracking it down by the nature of the blowup it caused - it would definitely be best if the close() method was indeed safely wrapped up with a check and a proper exception that are explicitly turned off by a flag to gain performance, rather than turned on by a flag to gain debugging.

There could be many more such instances of this going awry in the wild and it just results in more support, more headaches, more hair loss, and buggier software, so if there was one thing I would like to have fixed in LWJGL, it's this, if not for me then for the hundreds of programmers and customers who come after me and stumble across the same behaviour (I really like the new API otherwise!)

Cas :)