[SOLVED] Troubleshoot strange heap corruption

Started by tlf30, June 22, 2022, 20:47:19

tlf30

Hello,

I am running into an issue where the JVM hard-crashes with exit code 0xC0000374 (STATUS_HEAP_CORRUPTION) and no error output at all, not even an hs_err_pid log.
The application crashes on every run: sometimes just seconds after starting, other times after several minutes.
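Because this crash writes no hs_err file, the exit code is the main clue. If the client is started from another Java process (a launcher, for example), a small helper like the one below can turn the raw exit code into something readable. This is a sketch: the class name and the short table of NTSTATUS names are hypothetical choices, and it assumes `Process.exitValue()` passes the Windows status through as a signed 32-bit int, so 0xC0000374 shows up as -1073740940.

```java
import java.util.Map;

/**
 * Decodes a child process exit code into the Windows NTSTATUS value it represents.
 * Sketch only: the status names below are a small hand-picked subset.
 */
public class ExitCodeDecoder {

    // A few NTSTATUS values that commonly appear when a JVM dies hard on Windows.
    private static final Map<Integer, String> KNOWN = Map.of(
            0xC0000005, "STATUS_ACCESS_VIOLATION",
            0xC0000374, "STATUS_HEAP_CORRUPTION",
            0xC00000FD, "STATUS_STACK_OVERFLOW");

    /** Formats the exit code as unsigned hex, e.g. -1073740940 becomes "0xC0000374". */
    public static String toHex(int exitCode) {
        return String.format("0x%08X", exitCode);
    }

    /** Returns a readable description such as "0xC0000374 (STATUS_HEAP_CORRUPTION)". */
    public static String describe(int exitCode) {
        String name = KNOWN.get(exitCode);
        return name == null ? toHex(exitCode) : toHex(exitCode) + " (" + name + ")";
    }

    public static void main(String[] args) {
        // The status arrives as a signed int, so heap corruption is a negative number.
        int exitCode = 0xC0000374;
        System.out.println(exitCode);           // -1073740940
        System.out.println(describe(exitCode)); // 0xC0000374 (STATUS_HEAP_CORRUPTION)
    }
}
```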

Occasionally I get a different crash, preceded by a strange validation error:
[2022-06-22 12:25:36] [SEVERE ] [validation] Validation Error: [ VUID-vkCmdDrawIndexed-None-02859 ] Object 0: handle = 0x1c6a7ff99c0, name = Frame 2 command buffer, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x3f56950000000224, name = Shader pipeline 'phong-shader' for mesh 'meshes[0]-48, type = VK_OBJECT_TYPE_PIPELINE; | MessageID = 0x93e69b0a | vkCmdDrawIndexed(): VkPipeline 0x3f56950000000224[Shader pipeline 'phong-shader' for mesh 'meshes[0]-48] doesn't set up VK_DYNAMIC_STATE_VIEWPORT|VK_DYNAMIC_STATE_SCISSOR, but it calls the related dynamic state setting commands The Vulkan spec states: There must not have been any calls to dynamic state setting commands for any state not specified as dynamic in the VkPipeline object bound to the pipeline bind point used by this command, since that pipeline was bound (https://vulkan.lunarg.com/doc/view/1.2.176.1/windows/1.2-extensions/vkspec.html#VUID-vkCmdDrawIndexed-None-02859) 


This is followed by an EXCEPTION_ACCESS_VIOLATION:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffeabdf162e, pid=1780, tid=26764
#
# JRE version: OpenJDK Runtime Environment Temurin-18+36 (18.0+36) (build 18+36)
# Java VM: OpenJDK 64-Bit Server VM Temurin-18+36 (18+36, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, windows-amd64)
# Problematic frame:
# V  [jvm.dll+0x1162e]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\Trevor\Desktop\outside\dist\client_dist\hs_err_pid1780.log


The full error report is here: https://gist.github.com/tlf30/6629105179ce6dea56312484c2c45de4. It is even more puzzling, as the stack trace in it shows only a single internal JVM native call.

Yet, I am always setting the dynamic state:
VkPipelineDynamicStateCreateInfo dynamicState = VkPipelineDynamicStateCreateInfo.calloc(stack)
                    .sType$Default()
                    .pDynamicStates(stack.ints(
                                    VK_DYNAMIC_STATE_VIEWPORT,
                                    VK_DYNAMIC_STATE_SCISSOR
                            )
                    );
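A toy model of the rule quoted in the validation message helps show why the error looks impossible here. This is a sketch, not the validation layer's implementation, and the enum is a hypothetical stand-in for VK_DYNAMIC_STATE_VIEWPORT and VK_DYNAMIC_STATE_SCISSOR. The rule compares the states a pipeline declared as dynamic with the vkCmdSet* calls recorded after binding it; with both states declared, as in the snippet above, there is nothing to report, so the message implies the layer saw a pipeline with neither state declared.

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Toy model of VUID-vkCmdDrawIndexed-None-02859: after a pipeline is bound, only the
 * states it declared as dynamic may be set with vkCmdSet* commands.
 * Not the validation layer's implementation.
 */
public class DynamicStateCheck {

    /** Stand-ins for VK_DYNAMIC_STATE_VIEWPORT and VK_DYNAMIC_STATE_SCISSOR. */
    public enum DynamicState { VIEWPORT, SCISSOR }

    /** Returns the states that were set by commands but not declared dynamic by the pipeline. */
    public static Set<DynamicState> violations(Set<DynamicState> declared, Set<DynamicState> setByCommands) {
        Set<DynamicState> bad = EnumSet.noneOf(DynamicState.class);
        for (DynamicState s : setByCommands) {
            if (!declared.contains(s)) {
                bad.add(s);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Set<DynamicState> recorded = EnumSet.of(DynamicState.VIEWPORT, DynamicState.SCISSOR);

        // Pipeline created as in the snippet above: both states dynamic, nothing to report.
        System.out.println(violations(EnumSet.of(DynamicState.VIEWPORT, DynamicState.SCISSOR), recorded)); // []

        // What the error message implies the layer saw: no dynamic states at all.
        System.out.println(violations(EnumSet.noneOf(DynamicState.class), recorded)); // [VIEWPORT, SCISSOR]
    }
}
```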


This leads me to believe that memory corruption is indeed occurring. I have enabled all of LWJGL's debug options, and no errors or warnings are logged:
Configuration.DEBUG.set(true);
Configuration.DEBUG_FUNCTIONS.set(true);
Configuration.DEBUG_STREAM.set(true);
Configuration.DEBUG_LOADER.set(true);
Configuration.DEBUG_STACK.set(true);
Configuration.DEBUG_MEMORY_ALLOCATOR.set(true);
Configuration.DISABLE_CHECKS.set(false);
Configuration.DISABLE_FUNCTION_CHECKS.set(false);


I'm running JDK 18 from Adoptium. I have tried both 18.0.1+10 and 18+36.
I am running an NVIDIA GTX 1080 Ti (driver 516.40, the latest at the time of this post) and have also tested on a mobile NVIDIA M2200.
Tested on Windows 10 and 11.
I have tested with both LWJGL 3.3.1 and 3.3.2-SNAPSHOT.

Does anyone have any ideas on how to troubleshoot this?
I have been inspecting everything in RenderDoc, but so far it shows no errors or warnings, and there are no validation errors when running with VK_LAYER_KHRONOS_validation.

Any help is greatly appreciated.
Thank you,
Trevor

tlf30

OK, for anyone else who runs into something similar: I found the issue. It was caused by using a directly allocated PointerBuffer instead of allocating it with MemoryUtil.
It was very difficult to track down because the faulty code was in how I mapped VkBuffers, and that code ran thousands of times without any visible problem.
To find it, I used the LWJGLX debug agent. Even though this is a Vulkan project, the agent still checks buffer usage, and it reported that a buffer had been freed that was not tracked by LWJGL. This was amazing, as it pointed me directly at the buffer in question, and the fix was simple.
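Why this shows up as heap corruption rather than a Java exception: one plausible reading, which is an assumption since the thread does not show the offending code, is that the PointerBuffer came from a JVM-managed direct allocation (for example `org.lwjgl.BufferUtils.createPointerBuffer`) but was later released with `MemoryUtil.memFree`, so the native `free()` received memory the JVM owns. Nothing fails at that moment; the heap's bookkeeping is silently damaged, and the process dies later with 0xC0000374 or an unrelated-looking access violation. The toy model below (hypothetical class and method names, not the agent's code) shows the kind of check that catches this: record every explicit allocation and refuse to free anything unrecorded.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of allocation tracking: every explicit malloc is recorded, and a free()
 * of an address that was never recorded is rejected instead of reaching the heap.
 * Addresses are simulated counters, not real memory.
 */
public class TrackingAllocator {

    private final Map<Long, Long> live = new HashMap<>(); // address -> size in bytes
    private long nextAddress = 0x1000;

    /** Simulates an explicit allocation (the MemoryUtil-style path). */
    public long malloc(long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        long address = nextAddress;
        nextAddress += size;
        live.put(address, size);
        return address;
    }

    /** Simulates free(): only addresses handed out by malloc() are accepted. */
    public void free(long address) {
        if (live.remove(address) == null) {
            throw new IllegalStateException(String.format(
                    "free() of untracked address 0x%X: not allocated by this allocator", address));
        }
    }

    /** Number of allocations that have not been freed yet. */
    public int liveCount() {
        return live.size();
    }

    public static void main(String[] args) {
        TrackingAllocator alloc = new TrackingAllocator();

        // Correct pairing: allocated and freed through the same allocator.
        long tracked = alloc.malloc(8);
        alloc.free(tracked);

        // Mismatched pairing: this address stands in for a buffer the JVM owns.
        long jvmOwned = 0x7F000000L;
        try {
            alloc.free(jvmOwned);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The same check also catches a double free, since the first `free()` removes the address from the map.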