Batch rendering and model matrices - Looking for advice

Started by Lugen, July 23, 2015, 19:13:47

Previous topic - Next topic

Lugen

So I'm making a 2D engine which must be able to draw meshes with an arbitrary amount of vertices.
Each "object" being drawn has its own transformation (model) matrix.

I want to be able to render 10,000 objects while maintaining a high frame rate. This amount should be more than enough for my needs.

I'm making fairly good progress. Vertices and indices are put together in buffers and flushed whenever it's time to switch texture binding.

What I'm missing is applying the model matrices.

I'm not sure how to approach this. I did some reading around on whatever I could find but couldn't seem to find any particular solution for this scenario other than doing all the matrix calculation on the CPU. For me it boils down to these questions.

1. Is it possible to put the matrices in a buffer (Uniform Buffer perhaps?) and have the vertex shader somehow pick the correct matrix to apply to whichever vertex it is processing at the time?

2. Is there some other way to solve this so that all matrix calculation is still being performed on the GPU?

3. Or might I just as well do all the model matrix calculations on the CPU, transforming the vertex data before putting it in the buffer? Generally speaking, would I lose any performance gains that the GPU could potentially give?

abcdef

The general way is to either

1) have a uniform in your shader to represent the model matrix and then update the value of this uniform for every "object" you draw. You would also have uniforms for the projection matrix and the view matrix, but since those are the same for all objects there is no need to update them per object.

2) multiply your projection matrix, view matrix and model matrix on the CPU per object and then update a single uniform in the shader.
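Option 2 amounts to one matrix multiplication chain per object on the CPU. As a rough sketch of the idea (hand-rolled, column-major to match OpenGL's convention; the class and method names here are made up for illustration, not from the thread):

```java
// Sketch of option 2: compose projection * view * model on the CPU once
// per object, then upload the single result as one uniform (e.g. via
// glUniformMatrix4fv). Matrices are column-major float[16], matching
// OpenGL's default layout.
public class MatrixUtil {
    public static float[] mul(float[] a, float[] b) {
        float[] r = new float[16];
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                r[col * 4 + row] = sum;
            }
        }
        return r;
    }

    public static float[] translation(float x, float y, float z) {
        float[] m = identity();
        m[12] = x; m[13] = y; m[14] = z; // last column holds the translation
        return m;
    }

    public static float[] identity() {
        float[] m = new float[16];
        m[0] = m[5] = m[10] = m[15] = 1f;
        return m;
    }
}
```

Per frame you would then compute `mvp = mul(mul(projection, view), model)` for each object and upload it as the shader's single matrix uniform.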

Cornix

10,000 2D sprites really isn't as much as one would think. I have fairly bad hardware and I am able to render 10,000 sprites in immediate mode at 60 FPS without problems.
I'd say you're trying to optimize too early. I doubt you will run into any problems no matter which way you choose.

I personally would update positions on the CPU side and put everything into a VBO. I wouldn't bother to specify a model matrix for a 2D sprite.

elect

Quote from: Lugen on July 23, 2015, 19:13:47
So I'm making a 2D engine which must be able to draw meshes with an arbitrary amount of vertices.
Each "object" being drawn has its own transformation (model) matrix.

I want to be able to render 10,000 objects while maintaining a high frame rate. This amount should be more than enough for my needs.

I'm making fairly good progress. Vertices and indices are put together in buffers and flushed whenever it's time to switch texture binding.

What I'm missing is applying the model matrices.

I'm not sure how to approach this. I did some reading around on whatever I could find but couldn't seem to find any particular solution for this scenario other than doing all the matrix calculation on the CPU. For me it boils down to these questions.

1. Is it possible to put the matrices in a buffer (Uniform Buffer perhaps?) and have the vertex shader somehow pick the correct matrix to apply to whichever vertex it is processing at the time?

2. Is there some other way to solve this so that all matrix calculation is still being performed on the GPU?

3. Or might I just as well do all the model matrix calculations on the CPU, transforming the vertex data before putting it in the buffer? Generally speaking, would I lose any performance gains that the GPU could potentially give?

Which version are you targeting? How often are you changing the model matrix per object?

Lugen

Thanks for your replies!

Quote from: abcdef on July 24, 2015, 08:01:05
The general way is to either

1) have a uniform in your shader to represent the model matrix and then update the value of this uniform for every "object" you draw. You would also have uniforms for the projection matrix and the view matrix, but since those are the same for all objects there is no need to update them per object.

2) multiply your projection matrix, view matrix and model matrix on the CPU per object and then update a single uniform in the shader.

I see. Alternative 1 is the way I had it before, which still meant a state change for each object. Looks like I'll go with transforming the vertices on the CPU before putting them in the batch. It seems like the least complicated way to do it.

Quote from: Cornix on July 24, 2015, 08:22:24
10,000 2D sprites really isn't as much as one would think. I have fairly bad hardware and I am able to render 10,000 sprites in immediate mode at 60 FPS without problems.
I'd say you're trying to optimize too early. I doubt you will run into any problems no matter which way you choose.

I personally would update positions on the CPU side and put everything into a VBO. I wouldn't bother to specify a model matrix for a 2D sprite.

If there's a general method for making things run much faster, I figure I might as well learn how to do it. Also, at one point I was rendering 10,000 "sprites" below 60 FPS; I would guess updating the model matrix uniform per object had something to do with it.
How would you specify the transform? As I mentioned, my renderer needs to be able to handle arbitrary meshes as well; that's why I'm giving a matrix to every object.

Quote from: elect on July 24, 2015, 09:15:46
Which version are you targeting? How often are you changing the model matrix per object?

Version of OpenGL? The console tells me I'm using 4.4.0. The target for the engine itself is desktop, but honestly I haven't given this any thought; I'm not putting any version constraints on this project anyway.
As for the second question, I'm not sure what you mean. In terms of sending the model matrix to the shader, I did that once per object per frame.

Kai

The thing with "methods for making things run much faster" is: Your desired solution must meet certain restrictions to be able to use those methods.
And implementing them can be anything but trivial.

One example of such a restriction/constraint: Your 10,000 sprites are all the same base "model" (probably a quad) with only differences in how each "instance" is being affinely transformed.
In this case you can use instancing with a big Uniform Buffer Object to hold each instance's matrix. This requires only one GL call to upload the UBO and only one draw call to draw all 10,000 instances at once.
You can then select the right matrix via gl_InstanceID in your shader.

Additional constraint: Every instance needs to have a separate texture or one of a few possible textures.
Solution: For this to work with instancing you could either use a texture atlas with per-instance texture coordinates (which you could also specify in the UBO) to select the right part of the atlas for each instance,
or use array textures, with each layer being a different texture, and encode the texture to use as per-instance data inside the UBO (a simple integer).

You see that there is a wide variety of possibilities. The first thing you need to do is state your constraints/requirements as explicitly as possible,
and then search for methods that can enhance performance while still meeting those constraints.
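The CPU side of the UBO-plus-instancing approach could look roughly like the following sketch (the class name and the std140 shader block in the comment are illustrative assumptions, not code from the thread):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Sketch of the CPU side of the UBO approach: pack one 4x4 model matrix
// per instance into a single FloatBuffer, ready for one glBufferData /
// glBufferSubData upload. In the vertex shader the matching matrix is
// then selected with gl_InstanceID, e.g.:
//
//   layout(std140) uniform InstanceData { mat4 model[1024]; };
//   ... gl_Position = projection * view * model[gl_InstanceID] * position;
public class InstanceBatch {
    public static FloatBuffer pack(float[][] matrices) {
        // 16 floats per mat4; an std140 mat4 array is laid out as four
        // vec4 columns per element, which a tightly packed column-major
        // float[16] per matrix already matches.
        FloatBuffer buf = ByteBuffer
                .allocateDirect(matrices.length * 16 * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (float[] m : matrices) {
            buf.put(m); // each m is a column-major float[16]
        }
        buf.flip();
        return buf;
    }
}
```

The returned buffer would then be uploaded once per frame, followed by a single glDrawElementsInstanced call for all instances sharing the base mesh.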

elect

Quote from: Lugen on July 24, 2015, 12:10:45
Version of OpenGL? The console tells me I'm using 4.4.0. Target for the engine itself is Desktop, but honestly I haven't given this any thought. I'm not putting any version constraints on this project anyway.
Second question, not sure what you mean. In terms of sending the model matrix to the shader I did that once per object per frame.

Use mapBufferRange with the persistent and coherent flags and make your buffer at least three times as big; read this
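The "three times as big" part refers to a ring buffer: with GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT the buffer stays mapped, and the CPU writes each frame into one of three sections so it never overwrites data the GPU may still be reading. A minimal sketch of just that bookkeeping (class and method names are made up; the fence synchronization you would also need, via glFenceSync/glClientWaitSync, is omitted):

```java
// Ring-buffer bookkeeping for a persistently mapped buffer split into
// three sections, one per frame in flight.
public class TripleBuffer {
    private final long sectionSize;
    private long frame;

    public TripleBuffer(long sectionSize) {
        this.sectionSize = sectionSize;
    }

    // Byte offset into the (3 * sectionSize)-byte buffer where this
    // frame's vertex data should be written.
    public long currentOffset() {
        return (frame % 3) * sectionSize;
    }

    public void nextFrame() {
        frame++;
    }
}
```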


Lugen

Quote from: Kai on July 24, 2015, 12:30:18
The thing with "methods for making things run much faster" is: Your desired solution must meet certain restrictions to be able to use those methods.
And implementing them can be anything but trivial.

One example of such a restriction/constraint: Your 10,000 sprites are all the same base "model" (probably a quad) with only differences in how each "instance" is being affinely transformed.
In this case you can use instancing with a big Uniform Buffer Object to hold each instance's matrix in. This would require only one GL call to upload the UBO and only one draw call to draw all 10,000 instances at once.
You can then select the right matrix via gl_InstanceID in your shader.

Additional constraint: Every instance needs to have a separate texture or one of a few possible textures.
Solution: For this to work with instancing you could either use a texture atlas and per-instance texture coordinates, which you could also specify in the UBO, to select the right texture part for the right instance.
Or you could use array textures with each layer being a different texture and then encode the texture to be used inside the UBO as per-instance data (as a simple integer).

You see that there is a wide variety of possibilities. The first thing you need to set are your constraints/requirements that you have as explicitly as possible.
And then search for methods that can enhance the performance while still meeting your constraints.

Interesting read. I'll try to sum up my requirements/constraints.

1. First of all, it might be useful to mention that I'm making my system data-oriented. Anything that can be drawn uses the same type of render component, which is not really a component but a component manager, storing all the vertex data, textures and matrices for each "object".

2. The game will not have grid-based graphics.

3. The models will vary, both in proportions and in number of vertices, so I guess instancing is out of the question?

4. Any texture can be assigned to any model, but I plan to use atlases as much as possible. Objects will be sorted by depth, and the buffer is flushed whenever the next object to draw uses a different texture.

5. I don't care too much about pushing the number of draw calls down to the minimum, as long as it's much lower than the number of objects being drawn; I'm thinking half the amount.

That's what I can think of right now; hope that gives some clarity. While I'm at it, which term do you prefer I use, "draw call" or "state change"? As I understand it, a draw call can refer to whenever you call a glBind function or send uniform data to a shader.

Kai

Quote
While I'm at it, which term do you prefer I use "draw call" or "state change"? As I understand it a draw call can refer to whenever you call a glBind-function or send uniform data to a shader.
No. A "draw call" is by definition/convention a call that "draws" something. :)
There are actually very few draw call commands. Among them are glDrawArrays(), glDrawElements(), glDrawArraysInstanced() and glDrawElementsInstanced().
Some extensions also add additional "draw call" commands with other parameters.
Draw calls are classified as such in OpenGL because they are where drivers usually have to do most of the work: only then does the driver really have all the information it needs to issue an optimized command stream that makes the GPU perform work (i.e. "render stuff on the screen").
The rest of the OpenGL commands just change some state somewhere in the big OpenGL state machine.

Lugen

Quote from: Kai on July 24, 2015, 14:26:23
Quote
While I'm at it, which term do you prefer I use "draw call" or "state change"? As I understand it a draw call can refer to whenever you call a glBind-function or send uniform data to a shader.
No. A "draw call" is by definition/convention a call that "draws" something. :)
There are very few draw call commands actually. Among them glDrawArrays(), glDrawElements(), glDrawArraysInstanced() and glDrawElementsInstanced().
Some extensions also add additional "draw call" commands with other parameters.
Draw calls are classified as such in OpenGL because they are where drivers usually have to do most of the work, because it is only then that the driver has really all information available to issue an optimized command stream to the GPU to make the GPU perform work (i.e. "render stuff on the screen").
The rest of the OpenGL commands just change some state somewhere in the big OpenGL state machine.

Thanks, that clarifies things a lot.

Lugen

Thought I'd show some progress. I went with simply applying the model matrices to the vertices before putting them in the vertex buffer. The projection and view matrices are applied in the vertex shader.
Did some testing and I think I'm in a pretty good place to move on. Note that the FPS shown is from the first second after program start; after that it doubled, and in the top example it kept increasing as more sprites moved outside the picture.
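The pre-transform step described above can be sketched like this (a hypothetical helper, not code from the thread: translation, rotation and uniform scale applied on the CPU before the positions are written into the batch VBO):

```java
// Applies a 2D model transform (scale, then rotation, then translation)
// to an interleaved array of (x, y) positions on the CPU, producing the
// pre-transformed vertices that go into the batch's vertex buffer.
public class Transform2D {
    public final float tx, ty, angle, scale;

    public Transform2D(float tx, float ty, float angle, float scale) {
        this.tx = tx; this.ty = ty; this.angle = angle; this.scale = scale;
    }

    // Writes the transformed (x, y) pairs into out, same layout as xy.
    public void apply(float[] xy, float[] out) {
        float c = (float) Math.cos(angle);
        float s = (float) Math.sin(angle);
        for (int i = 0; i < xy.length; i += 2) {
            float x = xy[i] * scale;
            float y = xy[i + 1] * scale;
            out[i]     = x * c - y * s + tx;
            out[i + 1] = x * s + y * c + ty;
        }
    }
}
```

With this approach only the projection and view matrices remain as shader uniforms, exactly as described above.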


Kai

So the relevant properties of your scene are:
- it is triangles you render
- the static triangles' geometries do not change
- you do not have to update the (static) triangles' vertices
- there is currently only a very limited number of textures being used

Provided this, you should do the following to gain optimal performance:
- do *not* pre-transform the triangles on the CPU
- don't do any frustum culling (if you currently do any)
- use instancing (glVertexAttribDivisor, glDrawArraysInstanced)
- store the model transformation as translation(x, y), rotation(angle) and scaling(uniform) factors in the form of a vec4 for each triangle as a per-instance vertex attribute (use glVertexAttribDivisor) in a buffer object (you don't have to upload whole 4x4 matrices for what you do. A single 4-element vector suffices)
- have a vertex attribute in a VBO which contains the position of only a single triangle (you really only need a VBO containing a single triangle)
- transform each triangle instance using the compact vec4 transformation representation in a vertex shader
- group your draw calls by texture used
- issue a texture bind and an instanced draw call for each group (from your image I count two of them)

This should easily give you 300 FPS and above. Rendering 10,000 triangles really is nothing for a GPU. :)
Since you are neither vertex transform-bound nor fillrate-bound, you are very likely CPU/driver-bound by the number of GL calls you make. So: Decrease the number of state changes and draw calls -> increase FPS. :)
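The "group your draw calls by texture" step above might look like this on the CPU (a sketch with made-up Sprite and class names; each resulting group then becomes one texture bind plus one instanced draw call):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Groups instances by texture id so that each group can be rendered with
// a single glBindTexture followed by one instanced draw call.
public class DrawGrouper {
    public static class Sprite {
        public final int textureId;
        public Sprite(int textureId) { this.textureId = textureId; }
    }

    public static Map<Integer, List<Sprite>> groupByTexture(List<Sprite> sprites) {
        // LinkedHashMap keeps the first-seen texture order stable.
        Map<Integer, List<Sprite>> groups = new LinkedHashMap<>();
        for (Sprite s : sprites) {
            groups.computeIfAbsent(s.textureId, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }
}
```

For each group you would then bind the texture, upload that group's per-instance data, and issue one glDrawArraysInstanced call.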

Lugen

Quote from: Kai on July 24, 2015, 23:19:39
So the relevant properties of your scene are:
- it is triangles you render
- the static triangles' geometries do not change
- you do not have to update the (static) triangles vertices
- there is currently only a very limited number of textures being used

Hm, perhaps my choice of graphics was a bit misleading. I am still rendering rectangles/squares, 2 triangles each, and I draw the buffer with glDrawElements(GL_TRIANGLES, etc).
In this test none of them are static; they are all moving. But in the game it is most likely that most of the geometry will be static.
One of my requirements is to also draw arbitrary meshes along with the "sprites"; that's why I'm not using any shortcuts for those, if there happen to be any.

Quote
Provided this, you should do the following to gain optimal performance:
- *not* pre-transforming the triangles on the CPU
- don't do any frustum culling (if you do any)
- use instancing (glVertexAttribDivisor, glDrawArraysInstanced)
I haven't learned how to do frustum culling yet, so I'm not using it. In what way would it be better not to use it?
In the game there will be perhaps a 75/25 mix of instanced meshes and generated ones (platforms). I guess combining the two would make things overly complicated, rather than just having everything be "unique". It seems to run fast enough anyway.

Quote
- store the model transformation as translation(x, y), rotation(angle) and scaling(uniform) factors in the form of a vec4 for each triangle as a per-instance vertex attribute (use glVertexAttribDivisor) in a buffer object (you don't have to upload whole 4x4 matrices for what you do. A single 4-element vector suffices)
- have a vertex attribute in a VBO which contains the position of only a single triangle (you really only need a VBO containing a single triangle)
- transform each triangle instance using the compact vec4 transformation representation in a vertex shader
- group your draw calls by texture used
- issue a texture bind and an instanced draw call for each group (from your image I count two of them)
My objects need to have a z-position as well, which will eventually have perspective scaling applied in order to create the parallax effect. I would also like to have non-uniform scaling on my objects, as well as other effects like shearing in the future, so I assume using a matrix is the most convenient way to go?
The z-sorting goes before texture sorting, then. If I understand instancing right (mind you, I haven't tried it), I could still have everything in one buffer, and the textures would be collected in an array and picked for the right triangle?

Quote
This should easily give you 300 FPS and above. Rendering 10,000 triangles really is nothing for a GPU. :)
Since you are neither vertex transform-bound nor fillrate-bound, you are very likely CPU/driver-bound by the number of GL calls you make. So: Decrease the number of state changes and draw calls -> increase FPS. :)
20,000 triangles are being rendered. Again, I guess the graphics I picked were misleading; I like triangles too much. Still, yes, you're right that that's nothing for the GPU. The bottleneck in this case is most likely all that object handling on the CPU.

Thanks a bunch for all the info! Some good reference to have.

Kai

Quote from: Lugen on July 25, 2015, 11:49:17
In this test none of them are static, they are all moving, but in the game it is most likely that most of the geometry will be static.
Nah, that is not what I mean by "static". :)
Perhaps more formally correct, by "static meshes" I mean:
- the number of vertices comprising that mesh does not change
- the topology of the mesh does not change (i.e. how vertices are connected to triangles)
It has nothing to do with the position/orientation/scaling/shearing relative to the "viewer", or whichever transformation you are applying to all of the mesh's vertices. That does not affect whether or not you can use instancing.
The worst case is when your geometries are dynamic (the opposite of static), where you have to recompute the topology of your mesh every frame; think of a triangulation of metaballs, for example.

Quote from: Lugen on July 25, 2015, 11:49:17
My objects need to have a z-position as well, which eventually will have perspective scaling applied in order to make the parallax effect. Also I would like to have non uniform-scaling on my objects as well as do any other effects like shearing in the future so I assume using a matrix is the most convenient way to go?
If you want the whole package then a 4x4 matrix would be necessary, yes.

Quote from: Lugen on July 25, 2015, 11:49:17
The z-sorting goes before texture sorting then. If I understand instancing (mind i haven't tried it) right then I could still have everything in one buffer and the textures would be collected in an array and picked for the right triangle?
For simple meshes you could probably get away without CPU-side z-sorting. Z-sorting is mainly useful if you are fillrate-bound (e.g. you use a very complex fragment shader) or need to perform blending for transparency.
As for instancing: Generally it is useful if you want to render the same static geometry many times with only slight variations, such as different model transformations and different textures (which you can hold in an array texture).
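The depth-then-texture ordering Lugen described earlier could be expressed as a comparator (a sketch with made-up names, assuming larger z means farther from the camera, which is what back-to-front blending needs):

```java
import java.util.Comparator;

// Sort order for a blended sprite batch: primary key is depth (back to
// front, as required for correct alpha blending), secondary key is the
// texture id, so runs of equal textures can be flushed in one draw call.
public class SpriteOrder {
    public static class Sprite {
        public final float z;
        public final int textureId;
        public Sprite(float z, int textureId) { this.z = z; this.textureId = textureId; }
    }

    public static final Comparator<Sprite> BACK_TO_FRONT =
            Comparator.<Sprite>comparingDouble(s -> -s.z)   // larger z (farther) first
                      .thenComparingInt(s -> s.textureId);  // then group by texture
}
```

Sorting the batch with this comparator before flushing keeps blending correct while still minimizing texture binds within each depth run.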

Lugen

Quote from: Kai on July 25, 2015, 12:03:58
Quote from: Lugen on July 25, 2015, 11:49:17
In this test none of them are static, they are all moving, but in the game it is most likely that most of the geometry will be static.
Nah, that is not what I mean by "static". :)
Perhaps more formally correct, by "static meshes" I mean:
- the number of vertices comprising that mesh does not change
- the topology of the mesh does not change (i.e. how vertices are connected to triangles)
It has nothing to do with the position/orientation/scaling/shearing relative to the "viewer", or whichever transformation you are applying to all of the mesh's vertices. That does not affect whether or not you can use instancing.
The worst-case scenario is when your geometries are dynamic (opposite of static), where you have to compute the topology of your mesh every frame. Think about some triangulation of metaballs for example.
Gotcha, then you're correct. I have no plans to modify meshes during gameplay. In the future I want an editor for making polygonal shapes, but in that case I would just create a new mesh whenever it gets edited.
The only actual mesh I have right now is the batch itself. It is declared in this fashion:

VBO = glGenBuffers();
glBindBuffer(GL_ARRAY_BUFFER, VBO);
glBufferData(GL_ARRAY_BUFFER, size, GL_DYNAMIC_DRAW);


Quote
Quote from: Lugen on July 25, 2015, 11:49:17
The z-sorting goes before texture sorting then. If I understand instancing (mind i haven't tried it) right then I could still have everything in one buffer and the textures would be collected in an array and picked for the right triangle?
For simple meshes you could probably get away without CPU-side z-sorting. Z-sorting is only useful if you are fillrate-bound (i.e. use a very complex fragment shader) or need to perform blending for transparency.
I assume I am fillrate-bound then, since 8-bit transparency on the sprites is a must. The only alternative I can think of is using the z-buffer, but then everything would need slightly offset depth positions in order to avoid z-fighting.

Quote
As for instancing: Generally it is useful if you want to render the same static geometry many times with only slight variations, such as different model transformations and different textures (which you can hold in an array texture).
Sounds like something I should look into. I suppose any unique mesh can be its own instance as well, so that everything can be drawn together more easily.