Buffer allocations in glMap[Named]Buffer

Started by Kai, December 28, 2009, 11:41:22

Previous topic - Next topic

Kai

Hi,

while profiling my LWJGL application, I recognized that most of the allocations come from
static inline jobject safeNewBufferCached(JNIEnv *env, void *p, int size, jobject old_buffer)

in "common_tools.h".

I heavily use the buffer mapping functions, for instance "glMapBuffer", to update uniform blocks in a shader each frame via Uniform Buffer Objects, or vertex data in VBOs.
For the mapping of the client memory to "Java memory" ByteBuffers are used. The methods "glMapBuffer" or "glMapNamedBuffer" therefore accept an additional parameter "old_buffer" that holds a previously allocated ByteBuffer in which the mapped memory will be held, so that no new direct buffer needs to  be allocated.

Up to now, I gave "null" as this "old_buffer" parameter, which (naturally) resulted in the allocation of new direct buffers each time the mapping operations were invoked.
Now, I wanted to use a cached ByteBuffer in my application. I have the need to use many differently sized buffers, because of different VBO sizes/uniform block sizes. If I supply my cached ByteBuffer as the "old_buffer" parameter to the mapping functions, the function "safeNewBufferCached" in "common_tools.h" checks whether the capacity of that buffer is exactly equal to the requested result_size (which is the recorded size of the (Vertex) Buffer Object) that LWJGL records internally and is queried via "GLChecks.getNamedBufferObjectSize").
So, when giving my cached ByteBuffer as the "old_buffer" parameter, I have to make sure that it has the same capacity as the Buffer Object that I want to map, but these can each differ in size.

To the point: It would be nice if either:
a) the "result_size" parameter of the mapBuffer functions are exposed as an additional parameter to the LWJGL methods; or

b) the function "safeNewBufferCached" (in "common_tools.h") does not check for exact capacity, but for minimal capacity
So instead of the check for equal capacity:
if (old_buffer_address == p && capacity == size)
			return old_buffer;

the check should look like:
if (old_buffer_address == p && capacity >= size)
			return old_buffer;

This would dramatically improve performance due to no new buffer allocations.

Personally, I would prefer b), because it keeps the interfaces as they are and it allows to reuse cached ByteBuffers that have a greater capacity than needed (greater than the size of the VBO) so that no new direct buffer needs to be created with that exact capacity.

I hope, you get my point.

spasi

Unfortunately, you can't use MapBuffer like that. You can't give an arbitrary pointer (a ByteBuffer in LWJGL) to the driver and ask it to map the buffer object there. MapBuffer does the opposite, you ask the driver to map a buffer object and it returns the pointer to you (which is exposed as a ByteBuffer in LWJGL, the result of a JNI NewDirectByteBuffer call). This means that different MapBuffer calls return different pointers, so new ByteBuffers are allocated. These ByteBuffers are direct, they wrap the mapped memory the driver is exposing so you aren't wasting any mem. Except of course for the internal fields in the java.nio.DirectByteBuffer implementation (I guess around 32 bytes per instance) and then there's the overhead of garbage collecting if you're doing too many MapBuffer calls per frame.

So, the problem isn't only the capacity check, but mainly the old_buffer_address check. Mapping different uniform blocks or VBOs would return a different pointer for each one.

There are a couple of options to improve the situation:

a) Cache the mapped ByteBuffer for specific VBOs or uniform blocks. In theory, mapping the same VBO multiple times (once per frame for example) should return the same pointer. It depends on how you're using the buffer object, the usage hints you've specified and the GL implementation of course, but it should work like that in most scenarios. It might not even work for the first few frames and start working when the driver detects what you're doing with the buffer object. Anyway, for frequently mapped objects I'd cache the MapBuffer result and reuse it. It has to be a ByteBuffer per buffer object though, you can't share it.

b) Don't use MapBuffer. Tbh, when I was working on Marathon I don't think I had used VBO mapping more than once. I think I dropped it after a while too. Especially for uniform buffer objects I'd use BufferSubData instead, I think it will be much faster. Mapped buffers are too sensitive to implementation details, it's quite hard to get the driver to behave and provide the expected performance. Also, doing it in Java is kinda awkward (any API that returns void pointers is bound to be).

While investigating this, I came across 2 other issues:

a) The glMapBufferRange implementation was exposing result_size (which isn't even needed actually). I've committed a fix for this, the signature should be correct now.

b) GLChecks.get(Named)BufferObjectSize which is called on every glMapBuffer call, is actually doing a GL readback every time (a glGetBufferParameter to retrieve the buffer object size). I wouldn't expect this to cause a pipeline flush like other readbacks, but you never know, it might actually be the reason you're getting bad performance with many MapBuffer calls per frame. The problem is, I'm not really sure how to avoid calling it. In glMapBufferRange's case, the driver checks the offset and length parameters so there's no risk exposing those to the client. glMapBuffer on the other hand maps the entire buffer object data store and since there's no driver check, you can't really trust the client to pass the correct "result_size" parameter (if it were exposed). Some options:

1) Leave it like that.

2) Do a glGetBufferParameter only when the old_buffer argument is null. Subsequent calls to glMapBuffer with non-null old_buffer argument will assume that the user knows what they're doing and skip the capacity check (will only check the ByteBuffer's pointer). The problem with this is that the buffer object may be re-initialized between glMapBuffer calls, with a different size. In that case the cached old_buffer wouldn't match. On the other hand, re-initializing the buffer object would also result in a different pointer returned from the next glMapBuffer call, forcing a ByteBuffer reallocation anyway, regardless of size. Can we really be sure of that though?

3) Track buffer object sizes inside LWJGL. Hmm. I don't really feel comfortable doing that, it really sounds like something the engine on top of LWJGL should be doing. It would be unnecessary overhead for users that don't need mapping and it's also not trivial to implement when we have multiple shared contexts etc.

4) Expose the "result_size" argument. This would be the simplest option with the best performance, but it will cause native crashes when used incorrectly.

I would appreciate any feedback on this matter.

Kai

Thanks for you elaborate answer!

I want to explain my situation to you in more depth:

I currently track all buffer object's sizes myself in the engine, so that no incorrect sized cached buffers are no problem for me.
I also keep a separate instance of a cached ByteBuffer for each buffer object.

QuoteMapping different uniform blocks [...] would return a different pointer for each one.
Actually, I did not happen to think about that issue. You are right, of course. For different uniform blocks, caching one single ByteBuffer for one UBO would not make sense.

a)
QuoteIn theory, mapping the same VBO multiple times (once per frame for example) should return the same pointer.
Yes, I could verify that with on my side. Of course, one cannot be sure that it will always be the same address for the whole application's lifetime.

b)
QuoteDon't use MapBuffer.
After profiling a while and reading the relevant OpenGL specs, I noticed that with standard glMapBuffer or with glMapBufferRegion (with synchronization) OpenGL flushes the pipeline up to the point where there are no more usages of the mapped buffer before actually mapping the buffer. That gave me a 1 ms. stall for each invocation of the mapping functions.
QuoteEspecially for uniform buffer objects I'd use BufferSubData instead, I think it will be much faster.
I most certainly can confirm that, now!

In the engine, I wanted to support both methods but since mapping now seems to be such a heavy operation, I might leave it and stick with bufferData.

Quote
While investigating this, I came across 2 other issues:

a) The glMapBufferRange implementation was exposing result_size (which isn't even needed actually). I've committed a fix for this, the signature should be correct now.
Hm, that too bad, because I used this method with the explicit "result_size" parameter to circumvent for the missing parameter in the normal glMapBuffer method :-). As a result, I had a ByteBuffer with an initial capacity of 0, and whenever I wanted to map a given region, I first checked if the region size is greater than the cached ByteBuffer's capacity (in which case I would give "null" as the parameter to glMapBufferRange). Then, I used the result of glMapBufferRange as the new cached ByteBuffer and whenever I wanted to map a region (with the same offset as before) but with a smaller size, I could reuse that cached buffer.
But while writing the above paragraph, I am seeing the problem that you mentioned before: The buffer's address need not be the same and so, caching might not be useful, because the cache would never be used and always be recreated.

b)

Quote2) Do a glGetBufferParameter only when the old_buffer argument is null. Subsequent calls to glMapBuffer with non-null old_buffer argument will assume that the user knows what they're doing and skip the capacity check (will only check the ByteBuffer's pointer). The problem with this is that the buffer object may be re-initialized between glMapBuffer calls, with a different size. In that case the cached old_buffer wouldn't match. On the other hand, re-initializing the buffer object would also result in a different pointer returned from the next glMapBuffer call, forcing a ByteBuffer reallocation anyway, regardless of size. Can we really be sure of that though?
Okay, that seems to be tough. It seems we have no choice but to readback the size of the buffer object even though the client supplied an oldBuffer as a parameter, because either or both the addresses or the sizes do not match.

Quote3) Track buffer object sizes inside LWJGL.
I, too feel uncomfortable with that, since there could be multiple unforeseeable sources of errors. I think, LWJGL should be a simple and fast! OpenGL/OpenAL wrapper.

QuoteExpose the "result_size" argument.
This would be the best solution for me, because I know the size of a buffer at every time, because I do not allow direct OpenGL method calls, rather than through higher-level classes. But, for clients that do allow OpenGL calls and do not track the buffer sizes themselves, this would mean that they have to readback the buffer size from OpenGL themselves.

Conclusion:

Do not have LWJGL track any state that the client could track as well. If the client track the size of a buffer object in the application logic, then they know what they're doing. If not, they have to query that from OpenGL (which LWJGL was doing until now).
But because LWJGL was doing the size query, there was no chance that a client, who track the size themself, could improve performance by not querying the size from OpenGL again via LWJGL.
So I would like to have the ability to specify the "result_size" as a parameter to have the best performance possible. Even though the "oldBuffer" was not specified and left "null", the client should also have to set "result_size" without query of the buffer size via LWJGL, then.

spasi

I think I'll add additional methods. The existing methods will work as before, but the new ones will accept an explicit result size.

Kai

QuoteI think I'll add additional methods. The existing methods will work as before, but the new ones will accept an explicit result size.
That's fine by me.

spasi

Sorry for the long delay, this has been committed now. I named the result size argument "length", so that it matches the same argument in glMapBufferRange. I also improved the documentation on all buffer object mapping functions.

Happy new year!

Kai

Happy new year to you, too!

As always: Thanks for your support!