Direct buffer putting

jakethesnake · October 16, 2018, 08:23:53

I'm dynamically rendering a 2D scene in my app, using VBOs that are streamed each render pass. Everything is working great, except that my main bottleneck is putting data into the direct ByteBuffer that is created via LWJGL.

The app peaks at ~30000 drawn sprites. That's 30000 invocations of the following:

buffer = BufferUtils.createByteBuffer(BUFFER_SIZE);

void render(TextureCoords t, TextureCoords to, int x1, int x2, int y1, int y2, COLOR color, OPACITY opacity) {

		// Each vertex is 16 bytes: position, texture coords (t), second texture coords (to), RGBA.
		// Vertex 1: (x1, y2)
		buffer.putShort((short) x1).putShort((short) y2);
		buffer.putShort(t.x1()).putShort(t.y2());
		buffer.putShort(to.x1()).putShort(to.y2());
		buffer.put(color.red()).put(color.green()).put(color.blue()).put(opacity.get());
		// Vertex 2: (x2, y2)
		buffer.putShort((short) x2).putShort((short) y2);
		buffer.putShort(t.x2()).putShort(t.y2());
		buffer.putShort(to.x2()).putShort(to.y2());
		buffer.put(color.red()).put(color.green()).put(color.blue()).put(opacity.get());
		// Vertex 3: (x1, y1)
		buffer.putShort((short) x1).putShort((short) y1);
		buffer.putShort(t.x1()).putShort(t.y1());
		buffer.putShort(to.x1()).putShort(to.y1());
		buffer.put(color.red()).put(color.green()).put(color.blue()).put(opacity.get());
		// Vertex 4: (x2, y1)
		buffer.putShort((short) x2).putShort((short) y1);
		buffer.putShort(t.x2()).putShort(t.y1());
		buffer.putShort(to.x2()).putShort(to.y1());
		buffer.put(color.red()).put(color.green()).put(color.blue()).put(opacity.get());

		// One more sprite (4 vertices, 64 bytes) written.
		count++;
	}

I'm calling this 30000 times, 60 times per second and it eats roughly 30% of the capacity of my thread. This is fast, don't get me wrong, but I'm wondering if it can be made faster.

I've tried batching my vertices in an array in JVM memory and then putting them all into the buffer at once, but that didn't help.

I'm curious about bounds checks and endian conversions, as I believe the endian conversion in particular can be quite expensive.

Any tips on how to speed things up?

Cornix · October 16, 2018, 08:59:51

Individual puts are much more costly than bulk operations.
Consider writing to a regular array and calling the put method with the array as an argument. The system will do a much more efficient memory copy with a lower overhead.

The other option is to reduce the amount of data you have to re-submit at every render pass. Perhaps there are things you can keep stored in a separate VBO which never (or rarely) have to change.
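For readers following along, the array approach described above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not code from the thread: the class name, the capacity constants, and the stage/flushToDirect helpers are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

final class BulkPutSketch {
    // Hypothetical sizes, chosen only for illustration.
    static final int SHORTS_PER_SPRITE = 8;
    static final int MAX_SPRITES = 1024;

    // Staging array on the Java heap, allocated once and reused every frame.
    static final short[] staging = new short[MAX_SPRITES * SHORTS_PER_SPRITE];

    // Direct buffer in native byte order, viewed as shorts.
    static final ShortBuffer direct = ByteBuffer
            .allocateDirect(staging.length * Short.BYTES)
            .order(ByteOrder.nativeOrder())
            .asShortBuffer();

    static int used; // number of shorts currently staged

    // Cheap array store; no buffer state is touched here.
    static void stage(short value) {
        staging[used++] = value;
    }

    // One bulk copy of everything staged so far, then reset for the next frame.
    static void flushToDirect() {
        direct.clear();
        direct.put(staging, 0, used);
        direct.flip(); // position = 0, limit = number of shorts copied
        used = 0;
    }

    public static void main(String[] args) {
        stage((short) 7);
        stage((short) 8);
        flushToDirect();
        System.out.println("shorts ready for upload: " + direct.remaining());
    }
}
```

Note that the data is written twice with this scheme (once into the array, once into the direct buffer), so whether it wins over individual puts depends on the workload.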

jakethesnake · October 16, 2018, 09:17:05

Thank you. I've already tried the array approach and, to my surprise, didn't get any notable performance boost. I haven't seen the source code, but I have read somewhere that the bulk put simply iterates over the single-element put method. That might help HotSpot, but I think the original put method has already been optimized by HotSpot.

And regarding the amount of data, there is nothing I can make static. I have been wondering whether I could put only one vertex and have OpenGL generate the other three in a shader, but I don't know if this is possible or how it could be done. This would reduce the size of a quad from 64 bytes to 25 bytes.

spasi · October 16, 2018, 09:39:09

- BufferUtils.createByteBuffer: do not use this, it's very inefficient. See Memory management in LWJGL 3 for details. Switch to memAlloc/memFree and try to reuse the buffers if possible. If the data is small enough, you may also want to try MemoryStack.

- Switch to separate buffers per vertex attribute. This will let you use typed NIO buffers instead of ByteBuffer (i.e. ShortBuffer.put is often more efficient than ByteBuffer.putShort). Interleaving vertex data does not have a performance advantage on modern GPUs and generally complicates things. Also, you cannot easily drop vertex attributes when you don't need them (e.g. when doing a geometry-only pass).

- Do not use relative indexing when reading/writing from/to buffers. It's not terrible, but keep in mind that relative indexing mutates the buffer instance (the current .position() is updated on every put/get) and that can have a negative effect on performance. You also have to worry about flip/reset/etc, which is error-prone.

- Bulk put/get is indeed more efficient, it is mapped to memcpy in almost all code paths. The individual put/get in a loop is just the reference implementation. The problem is that you have to pay the price of putting data to a Java array first, so it costs double the bandwidth. Even more so if you have to allocate the Java array every time.

- The endian conversion when writing to direct buffers does not cost anything (the JDK uses Unsafe to do it).

- If you write clean put/get loops, the bounds check cost is negligible.
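Two of the suggestions in the list above (separate buffers per attribute, and absolute instead of relative indexing) could look roughly like this. It is a sketch under assumptions: the class, field, and method names are made up, only positions and colors are shown, and plain ByteBuffer.allocateDirect is used so the example runs on the JDK alone. In an LWJGL 3 application the buffers would presumably come from MemoryUtil.memAlloc/memAllocShort and be released with memFree, as recommended above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

final class SplitAttributeSketch {
    static final int MAX_SPRITES = 1024;   // hypothetical capacity
    static final int VERTS_PER_SPRITE = 4;

    // One buffer per attribute: positions as shorts (x, y), colors as bytes (r, g, b, a).
    final ShortBuffer positions = ByteBuffer
            .allocateDirect(MAX_SPRITES * VERTS_PER_SPRITE * 2 * Short.BYTES)
            .order(ByteOrder.nativeOrder())
            .asShortBuffer();
    final ByteBuffer colors = ByteBuffer
            .allocateDirect(MAX_SPRITES * VERTS_PER_SPRITE * 4);

    int count; // sprites written this frame

    void sprite(short x1, short y1, short x2, short y2, byte r, byte g, byte b, byte a) {
        // Absolute puts: the index is computed here and position() is never mutated.
        int p = count * VERTS_PER_SPRITE * 2; // index in shorts
        positions.put(p, x1).put(p + 1, y2);
        positions.put(p + 2, x2).put(p + 3, y2);
        positions.put(p + 4, x1).put(p + 5, y1);
        positions.put(p + 6, x2).put(p + 7, y1);

        int c = count * VERTS_PER_SPRITE * 4; // index in bytes
        for (int v = 0; v < VERTS_PER_SPRITE; v++, c += 4) {
            colors.put(c, r).put(c + 1, g).put(c + 2, b).put(c + 3, a);
        }
        count++;
    }

    public static void main(String[] args) {
        SplitAttributeSketch s = new SplitAttributeSketch();
        s.sprite((short) 10, (short) 20, (short) 30, (short) 40,
                (byte) 1, (byte) 2, (byte) 3, (byte) 4);
        System.out.println("sprites written: " + s.count);
    }
}
```

Because the position never moves, there is no flip/reset bookkeeping; the number of elements to upload has to be derived from count instead.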

jakethesnake · October 16, 2018, 10:33:24

Thanks. I should clarify that the buffer is allocated outside of the method in question and is not a performance issue. It is allocated only once, reused throughout the lifecycle of the application, and is the only buffer used to house the 30000*4 vertices. What do you mean the endianness is free? Even if it's done through Unsafe, it still needs to be done, right? Also, I tried the Java array approach and it didn't give me a boost that justified the extra code and memory consumption. I didn't look carefully, but it couldn't have been more than a few percent. I might try separate buffers though, but I doubt it will yield the boost I'm looking for.

I think I'm going to try generating my triangles from a point using a geometry shader. I just found out it can be done:

https://learnopengl.com/Advanced-OpenGL/Geometry-Shader

That way I can go from 64 bytes worth of puts per method call to 28.
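The exact per-point layout isn't given in the thread. One plausible layout that adds up to 28 bytes is to send each rectangle once (position, texture, second texture, four shorts each) plus one RGBA color, and let the geometry shader expand the corners. The sketch below writes such a record with absolute puts; the class name, method name, and field order are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PointSpriteSketch {
    // 3 rectangles * 4 shorts * 2 bytes + 4 color bytes = 28 bytes per sprite.
    static final int BYTES_PER_SPRITE = 3 * 4 * Short.BYTES + 4;

    // Writes one sprite record at slot `index`; the field order is an assumed layout.
    static void putSprite(ByteBuffer buf, int index,
            short x1, short y1, short x2, short y2,
            short tx1, short ty1, short tx2, short ty2,
            short ox1, short oy1, short ox2, short oy2,
            byte r, byte g, byte b, byte a) {
        int o = index * BYTES_PER_SPRITE;
        buf.putShort(o, x1).putShort(o + 2, y1).putShort(o + 4, x2).putShort(o + 6, y2);
        buf.putShort(o + 8, tx1).putShort(o + 10, ty1).putShort(o + 12, tx2).putShort(o + 14, ty2);
        buf.putShort(o + 16, ox1).putShort(o + 18, oy1).putShort(o + 20, ox2).putShort(o + 22, oy2);
        buf.put(o + 24, r).put(o + 25, g).put(o + 26, b).put(o + 27, a);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(2 * BYTES_PER_SPRITE)
                .order(ByteOrder.nativeOrder());
        putSprite(buf, 0,
                (short) 0, (short) 0, (short) 16, (short) 16,
                (short) 0, (short) 0, (short) 1, (short) 1,
                (short) 0, (short) 0, (short) 1, (short) 1,
                (byte) -1, (byte) -1, (byte) -1, (byte) -1);
        System.out.println("bytes per sprite: " + BYTES_PER_SPRITE);
    }
}
```

Compared with the original render method, this is 13 shorts/bytes groups fewer per sprite: each coordinate is written once instead of twice, and the color once instead of four times.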

spasi · October 16, 2018, 11:15:02

Quote from: jakethesnake on October 16, 2018, 10:33:24
What do you mean the endianness is free? Even if it's done through Unsafe, it still needs to be done, right?

Like bulk get/put, there's often the misconception that the reference implementation is what actually happens at runtime. This is not the case. It may look like there's always an endianness flip when going from Java to native and vice-versa, but in practice it is never necessary when working with direct buffers. The reason is that, even though Java bytecode is big-endian, the JVM stores all data in-memory in the native byte order (i.e. little-endian on x86/64 CPUs). When you do a .putInt(<java int>), there's no byte-reversal going on, Unsafe will write the integer directly.

Exceptions: 1. if you change the buffer's order to != ByteOrder.nativeOrder() 2. if you read/write data from unaligned offsets on architectures that do not support unaligned memory access (e.g. ARM).
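The first exception is easy to trip over with plain JDK buffers, because ByteBuffer.allocateDirect returns a big-endian buffer by default, regardless of the platform. The small sketch below (class and method names are hypothetical) shows how to request native order and check it; LWJGL's own allocation helpers are understood to return native-ordered buffers already, but that is worth verifying against the LWJGL documentation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
    // Direct buffer whose byte order matches the CPU, so multi-byte puts need no byte swap.
    static ByteBuffer nativeDirect(int bytes) {
        return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }

    static boolean isNativeOrder(ByteBuffer buf) {
        return buf.order() == ByteOrder.nativeOrder();
    }

    public static void main(String[] args) {
        ByteBuffer plain = ByteBuffer.allocateDirect(4); // BIG_ENDIAN by default
        ByteBuffer nat = nativeDirect(4);
        System.out.println("platform order:       " + ByteOrder.nativeOrder());
        System.out.println("plain allocateDirect: " + plain.order());
        System.out.println("nativeDirect:         " + nat.order());
    }
}
```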

Quote from: jakethesnake on October 16, 2018, 10:33:24
Also, I tried the java array approach and it didn't give me the boost justifying the extra code / memory consumption. I didn't look carefully, but it couldn't have been more than a few percent.

Yes, as I said, it's very rarely worth it. It usually pays off only when you're doing computations on arrays and the data is already in there.

jakethesnake · October 16, 2018, 18:43:36

I can confirm that using a geometry shader to transform a point into a triangle strip with 4 vertices cut this bottleneck by 50%.

That was to be expected, since the number of puts was halved. I'm very happy with that.
