Vertex Buffer Objects performance problems

Niels · October 27, 2004, 21:20:05

Hi guys, been a while

I've been looking at LWJGL a lot over the past 14 days, brushing up on my very rusty OGL skills. I've run into one (well two) really weird problem(s):

Using vertex buffer objects is consistently slower than immediate mode (approx. 3x slower), and display lists were 10x slower (this, however, may have been because I was using multiple textures in the same list; I'm not going to use DLs anyway, so I never gave it much additional thought).

VBOs, however, are really annoying me. A boiled-down version of my immediate-mode code looks like this:

private void renderMesh()
{
    for (int f = 0; f < m_aFaces.length; f++)
    {
        Face face = m_aFaces[f];
        GL11.glBegin(GL11.GL_TRIANGLES);
        // One flat normal per face, followed by its three corners.
        GL11.glNormal3f(face.nx, face.ny, face.nz);
        GL11.glVertex3f(face.v0x, face.v0y, face.v0z);
        GL11.glVertex3f(face.v1x, face.v1y, face.v1z);
        GL11.glVertex3f(face.v2x, face.v2y, face.v2z);
        GL11.glEnd();
    }
}

My VBO code (again in a trimmed-down version) looks like this:

GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
GL11.glEnableClientState(GL11.GL_NORMAL_ARRAY);
ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, m_iVertVBO);
GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0);
ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, m_iNormVBO);
GL11.glNormalPointer(GL11.GL_FLOAT, 0, 0);

for (int f = 0; f < m_aFaces.length; f++)
{
    Face face = m_aFaces[f];
    GL11.glBegin(GL11.GL_TRIANGLES);
    GL11.glArrayElement(face.v0);
    GL11.glArrayElement(face.v1);
    GL11.glArrayElement(face.v2);
    GL11.glEnd();
}

GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);
GL11.glDisableClientState(GL11.GL_NORMAL_ARRAY);

The VBO is created like this:

FloatBuffer vertbuffer = BufferUtils.createFloatBuffer(3*verts.length);

for (int j = 0, i = 0; i < verts.length; i++)
{
    vertbuffer.put(j++, verts[i].x);
    vertbuffer.put(j++, verts[i].y);
    vertbuffer.put(j++, verts[i].z);
}

IntBuffer temp = BufferUtils.createIntBuffer(1);
ARBVertexBufferObject.glGenBuffersARB(temp);

int iVBO = temp.get(0);

ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, iVBO);
ARBVertexBufferObject.glBufferDataARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, verts.length*3*4, vertbuffer, ARBVertexBufferObject.GL_STATIC_READ_ARB);

I'm aware that glArrayElement is not the fastest way to use buffers, but I need multiple texcoord per vertex so I don't really have a choice (AFAIK?). I'm on a 2.5GHz Intel using a GF4 Ti4400 card, newest drivers (2 days ago).

Any ideas?

tomb · October 28, 2004, 00:15:20

You want to use glDrawArrays or glDrawElements. Draw as many triangles as possible with as few calls as possible. I don't see why you can't just unpack a whole mesh and draw it with one call.

What you're doing is just abusing VBOs ;)
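A minimal sketch of what "unpacking a whole mesh" could look like on the CPU side, assuming positions are stored as packed x,y,z floats and each face holds three vertex indices. The class and method names are hypothetical, not part of LWJGL; the resulting array would be uploaded once and drawn with a single glDrawArrays call over faces.length * 3 vertices.

```java
// Hypothetical helper; the names and the data layout are assumptions for illustration.
final class MeshUnpacker {
    /**
     * Expands indexed triangles into a flat x,y,z array with three vertices per face,
     * so the whole mesh can be submitted in one draw call.
     *
     * @param positions packed x,y,z per vertex
     * @param faces     one {v0, v1, v2} index triple per triangle
     */
    static float[] unpack(float[] positions, int[][] faces) {
        float[] out = new float[faces.length * 3 * 3];
        int o = 0;
        for (int[] face : faces) {
            for (int v : face) {
                out[o++] = positions[3 * v];
                out[o++] = positions[3 * v + 1];
                out[o++] = positions[3 * v + 2];
            }
        }
        return out;
    }
}
```

The cost is memory: every face corner gets its own copy of the vertex, whether or not it is shared.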

Niels · October 28, 2004, 07:24:33

Like I said: I know...

Still, even if I'm not using VBOs in the fastest possible way, I fail to see why it would be 3x slower than immediate mode. I'd think the worst case would be identical performance. (I would imagine that if glArrayElement were intentionally slower than immediate mode, they wouldn't have bothered implementing it.)

I'll try unpacking the mesh to see if it makes any significant difference, though I really hate the idea of a massive vertex overhead just because someone didn't realize that the vast majority of real-world cases require more than one texture coordinate per vertex.

But I'm still looking for an answer to my original question...

Niels · October 28, 2004, 08:27:02

Just a thought (and excuse my ignorance, I should probably read the VBO spec):

Could it be that the HW transforms and lights the vertices of a VBO only when the matrices (projection or object) change between glArrayElement calls, so that issuing multiple glArray-type calls without changing the matrices is very fast, but re-using the same VBO for multiple instances of the same mesh is very slow?

I am of course doing the latter.

The alternative would be to create one VBO each frame for the entire scene and then use this for each pass.

Or should I just put the crack pipe down and go read the spec?

Niels · October 28, 2004, 09:22:12

Side question:

Why would GL_STATIC_DRAW_ARB be 3x slower still than GL_STATIC_READ_ARB?

I am not reading data from GL, as you can see from my code. DRAW should be the correct hint, and there is no reason why READ would be faster, esp. 3x faster. But it is. (In summary DRAW is 9x slower than immediate, READ is 3x slower).

AND, when I say 3x slower, I am talking about actual framerate. Since pushing vertices to HW is only part of the frametime, the actual isolated performance difference must be much larger.
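To put a rough number on that last point, here is a back-of-the-envelope sketch. It assumes that everything in the frame other than vertex submission costs the same in both cases, and the submission share is a guess rather than a measurement. The class and method names are made up for illustration.

```java
// Hypothetical estimate; submitFraction is an assumed value, not something measured in this thread.
final class FrameTimeEstimate {
    /**
     * @param frameSlowdown  measured ratio of frame times (3.0 for "3x slower")
     * @param submitFraction assumed share of the original frame spent submitting vertices, 0 < f <= 1
     * @return implied slowdown of vertex submission alone, if nothing else in the frame changed
     */
    static double isolatedSlowdown(double frameSlowdown, double submitFraction) {
        // Rest of the frame, expressed in units of the original frame time.
        double unchanged = 1.0 - submitFraction;
        return (frameSlowdown - unchanged) / submitFraction;
    }
}
```

For example, if submission were half of the original frame, a 3x drop in framerate would imply submission itself became about 5x slower; at a quarter of the frame it would be about 9x.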

princec · October 28, 2004, 09:50:37

Niels, most rendering does require multiple texture coordinates. You have to specify each of them interleaved and then render with various offsets, changing the offset for the texture coords each time but leaving the vertex coords and colour coords the same.

And make sure you've packed your vertices into even multiples of 32 bytes, so that AGP copes nicely with them.

Cas

Niels · October 28, 2004, 10:16:40

Not sure I understand.

As far as I can tell, there is a 1:1 mapping between all arrays. E.g. glArrayElement has only one parameter, from which GL must find all the information for a given vertex (i.e. x, y and z; nx, ny and nz; and tu, tv and possibly tw).

So one vertex has exactly one (tu,tv) pair. If I need multiple (tu,tv) pairs, I must also have multiple vertices, right?

I just scanned all my meshes, and none of them has a single shared vertex for which (tu,tv) is constant for all faces sharing that vertex.
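A minimal sketch of that kind of scan, assuming each face corner carries a position index and its own (tu,tv) pair. It emits one output vertex per distinct combination, so a vertex is duplicated only where its texture coordinates actually differ between faces. The class, field, and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the per-corner input layout is an assumption for illustration.
final class VertexWelder {
    /** Unique (position index, tu, tv) combinations plus one index per face corner. */
    static final class Result {
        final List<float[]> vertices = new ArrayList<>(); // each entry: {positionIndex, tu, tv}
        final List<Integer> indices = new ArrayList<>();
    }

    /**
     * @param cornerPos position index for each face corner
     * @param cornerUv  tu,tv for each face corner (2 floats per corner)
     */
    static Result weld(int[] cornerPos, float[] cornerUv) {
        Result r = new Result();
        Map<String, Integer> seen = new HashMap<>();
        for (int c = 0; c < cornerPos.length; c++) {
            float u = cornerUv[2 * c];
            float v = cornerUv[2 * c + 1];
            // Exact bit comparison: corners merge only when all attributes are identical.
            String key = cornerPos[c] + "/" + Float.floatToIntBits(u) + "/" + Float.floatToIntBits(v);
            Integer idx = seen.get(key);
            if (idx == null) {
                idx = r.vertices.size();
                seen.put(key, idx);
                r.vertices.add(new float[] {cornerPos[c], u, v});
            }
            r.indices.add(idx);
        }
        return r;
    }
}
```

If no two corners ever agree, the result has one vertex per corner, which is the fully unpacked case.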

---

To get 32 bytes per vertex I'd need to include normal as well as (tu,tv) for each vertex, interleaved. So if this is the reason why VBOs are slow, you're effectively saying that without texturing and per-vertex lighting VBOs can't be used.

Sorry, but while I understand why the things you mention can affect performance, I completely fail to see why not doing it would be slower than immediate mode.

Niels · October 28, 2004, 10:43:28

For the record:

Tried interleaving data and it made ~5% difference (from 9.8 to 10.3 fps).

Tried padding to get 32 bytes per vertex and it made no difference whatsoever.

Tried various combinations of STATIC/DYNAMIC/STREAM and READ/COPY/DRAW. Apparently STATIC/DYNAMIC/STREAM makes no difference, while READ is faster than COPY which is faster than DRAW to the tune of 10/5/3 fps respectively. Go figure.

spasi · October 28, 2004, 11:55:40

Hi Niels,

What you're doing is a complete misuse of the OpenGL API.

1. This is a really bad way of drawing triangles:

for (int f = 0; f < m_aFaces.length; f++)
{
    Face face = m_aFaces[f];
    GL11.glBegin(GL11.GL_TRIANGLES);
    GL11.glArrayElement(face.v0);
    GL11.glArrayElement(face.v1);
    GL11.glArrayElement(face.v2);
    GL11.glEnd();
}

Even if you can't use index arrays (I can't see why not), why are you calling begin/end for each triangle? Put them outside the loop! These two things make your app 100% CPU limited.

2. Why are you using two VBOs, one for positions and one for normals? Not good (should not work at all).

3. You *should* use STATIC_DRAW. I'm not sure why it's faster with STATIC_READ (the data is probably kept on system mem), but I'd guess it's because of the way you draw your mesh.

I hope this helps.

spasi · October 28, 2004, 12:17:22

Fastest way of drawing triangles:

1. Put all your mesh data in a VBO (use STATIC_DRAW).
2. The data should be interleaved (and optionally padded).
3. Use an index VBO for your indices (ELEMENT_ARRAY_BUFFER).
4. Indices should be unsigned shorts.
5. Use glDrawRangeElements for rendering.

If everything goes well (depends on the implementation), both mesh data and indices will be on non-system memory and your CPU will have minimal/no work to do.
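A rough sketch of steps 2 to 4 using plain java.nio, assuming one position, one normal and one texcoord pair per vertex (8 floats, i.e. 32 bytes). The class and method names are hypothetical; uploading the two buffers (for instance with glBufferDataARB) and the draw call itself are not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.ShortBuffer;

// Hypothetical helper; the 8-float vertex layout is an assumption for illustration.
final class InterleavedMesh {
    static final int FLOATS_PER_VERTEX = 8; // x,y,z, nx,ny,nz, tu,tv = 32 bytes

    /** Interleaves per-vertex arrays into one direct buffer: position, normal, texcoord. */
    static FloatBuffer interleave(float[] pos, float[] nrm, float[] uv) {
        int count = pos.length / 3;
        FloatBuffer buf = ByteBuffer.allocateDirect(count * FLOATS_PER_VERTEX * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (int i = 0; i < count; i++) {
            buf.put(pos, 3 * i, 3).put(nrm, 3 * i, 3).put(uv, 2 * i, 2);
        }
        buf.flip(); // ready for reading from position 0
        return buf;
    }

    /** Packs indices as unsigned shorts; every value must fit in 0..65535. */
    static ShortBuffer indices(int[] idx) {
        ShortBuffer buf = ByteBuffer.allocateDirect(idx.length * 2)
                .order(ByteOrder.nativeOrder())
                .asShortBuffer();
        for (int i : idx) {
            if (i < 0 || i > 0xFFFF) {
                throw new IllegalArgumentException("index out of unsigned short range: " + i);
            }
            buf.put((short) i);
        }
        buf.flip();
        return buf;
    }
}
```

Unsigned short indices cap a single buffer at 65536 addressable vertices, so larger meshes would have to be split.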

Niels · October 28, 2004, 12:21:20

Spasi, thx for the suggestions, but you are not reading my info:

1) I know. The mistake is in the post, not the code (you obviously couldn't know that), but putting it outside the loop does not change anything. I CAN use an index array, but at the cost of having ~6 times as many vertices (to get a unique tu,tv for each vertex - see above).

2) See above - I've changed it to one, at no performance gain (well, 5%).

3) Agree, but a x3 difference?

I'll post again in an hour or so when I know how well glDrawElements performs. (If it's finally faster, someone needs to update the spec to state that immediate mode is preferred over glArrayElement - it's not exactly obvious that this should be the case, IMHO).

princec · October 28, 2004, 12:49:09

Niels, what you should be doing is packing vertex data thus:

x,y,z,nx,ny,nz,tx0,ty0,tx1,ty1,tx2,ty2,tx3,ty3

You write all the data just once, linearly, from start to finish, every frame, optionally skipping data that has not changed.

You then position your vertex pointer at offset 0 in the buffer, normal pointer at offset 12 bytes in the buffer, and texture coords at 24 bytes. Draw your triangles in one huge fast call using glDrawRangeElements. Use an index array, not lots of little calls to glArrayElement. The driver cannot optimise glArrayElement and most likely doesn't even attempt to as they only optimise common usage paths using the most efficient techniques.

Now change textures and change your texture coord pointer to point at buffer offset by 32 bytes. Draw. You do not change the vertex or normal pointers, nor write any data twice.

Now change textures and change your texture coord pointer to point at buffer offset by 40 bytes. Draw.

And finally offset by 48 bytes. Draw.

This is the only efficient and correct way to do what you want to do. And it's blindingly fast normally.
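A rough sketch of the byte arithmetic behind those offsets, assuming 4-byte floats and exactly the 14-float layout above. The class and constant names are invented for illustration.

```java
// Hypothetical constants for the layout x,y,z,nx,ny,nz,tx0,ty0,tx1,ty1,tx2,ty2,tx3,ty3.
final class VertexLayout {
    static final int FLOAT_BYTES = 4;
    static final int TEXCOORD_SETS = 4;

    static final int POSITION_OFFSET = 0;
    static final int NORMAL_OFFSET = 3 * FLOAT_BYTES;    // 12: after x,y,z
    static final int TEXCOORD0_OFFSET = 6 * FLOAT_BYTES; // 24: after x,y,z,nx,ny,nz

    /** Bytes from the start of one vertex to the start of the next. */
    static final int STRIDE = (3 + 3 + 2 * TEXCOORD_SETS) * FLOAT_BYTES;

    /** Byte offset of texcoord set 0..3 within a vertex. */
    static int texCoordOffset(int set) {
        if (set < 0 || set >= TEXCOORD_SETS) {
            throw new IllegalArgumentException("no such texcoord set: " + set);
        }
        return TEXCOORD0_OFFSET + set * 2 * FLOAT_BYTES;
    }
}
```

With this layout the stride is 56 bytes, and that value (not 0) would be passed as the stride argument of each pointer call. Since 56 is not a multiple of 32, following the earlier padding advice would mean padding each vertex to 64 bytes.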

Cas

Niels · October 28, 2004, 13:17:21

Cas,

Makes sense now, but your suggestion makes two (in my case) incorrect assumptions:

1) A vertex only changes texture coordinate if it is shared by faces using different textures. Not generally true, though I guess I could accept that constraint if I had to.

2) A constant, small number of texture coords per vertex. In your example, 4 different texture coords per vertex. I generally have 6, sometimes more, sometimes much less. Allocating, say, 8 per vertex for worst-case coverage is a lot of overhead.

I guess I need to limit the mappings I allow (currently I grab whatever 3DS Max throws at me).

----

Anyway, I've gone ahead and done what I was going to do eventually anyway:

1) Separate the mesh in chunks for each material to reduce state changes (I still need a proper shader tree, but that's next).

2) Accepted a 600% increase in vertices and unpacked the mesh so that each vertex exists once for each face using it. This allows me to incorporate unique texture vertices for each face and provides a way to have different normals for each face (necessary for complex meshes that are part smooth-shaded, part flat-shaded).

3) Interleaved all float data in one VBO and put unsigned short face-vertex indices in another. I now use glDrawRangeElements to draw each chunk.

And performance is up x2 compared to immediate mode. Finally!
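For reference, a minimal sketch of the chunking in step 1, assuming each face has an integer material id. It reorders faces so that faces sharing a material are contiguous and records one face range per material; the names are hypothetical and this is not the actual loader code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; material ids per face are an assumption for illustration.
final class MaterialChunks {
    /** A contiguous run of faces that share one material. */
    static final class Chunk {
        final int material;
        final int firstFace;
        final int faceCount;

        Chunk(int material, int firstFace, int faceCount) {
            this.material = material;
            this.firstFace = firstFace;
            this.faceCount = faceCount;
        }
    }

    /**
     * @param faceMaterial material id of each face
     * @param order        output, same length as faceMaterial: order[k] is the original
     *                     index of the face that ends up in slot k
     * @return one chunk per material, in order of first appearance
     */
    static List<Chunk> build(int[] faceMaterial, int[] order) {
        Map<Integer, List<Integer>> byMaterial = new LinkedHashMap<>();
        for (int f = 0; f < faceMaterial.length; f++) {
            byMaterial.computeIfAbsent(faceMaterial[f], k -> new ArrayList<>()).add(f);
        }
        List<Chunk> chunks = new ArrayList<>();
        int next = 0;
        for (Map.Entry<Integer, List<Integer>> e : byMaterial.entrySet()) {
            chunks.add(new Chunk(e.getKey(), next, e.getValue().size()));
            for (int f : e.getValue()) {
                order[next++] = f;
            }
        }
        return chunks;
    }
}
```

Each chunk then needs one material/texture bind and one draw call covering its face range.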

Thanks for all the input. I still find it odd that glArrayElement is worse than strict immediate mode, but I'll chalk it up as a learning experience.

Cas, you're probably right about vendors optimizing for common paths - I just wish they'd remove the others then, or at least not recommend using them (I've found the red book to be terrible in this respect).

Niels · October 28, 2004, 16:52:19

Here is another funny one:

I would expect glDrawRangeElements to work like this (though I find the mix of indices and byte offsets rather ugly). I.e. from and to are face indices whereas size and offset are in bytes :

GL12.glDrawRangeElements( GL11.GL_TRIANGLES, firstfaceidx, lastfaceidx, 3*facecount, GL11.GL_UNSIGNED_SHORT, 0);

This, however, appears to draw faces twice (?? very strange ??)

I played around a bit and eventually came up with this instead:

GL11.glDrawElements( GL11.GL_TRIANGLES, 3*(lastfaceidx-firstfaceidx+1), GL11.GL_UNSIGNED_SHORT, 2*3*firstfaceidx);

Which appears to work correctly, but is a horrible mix of different types of offsets. I.e. size is in number of shorts, offset is in bytes, and I use the regular glDrawElements rather than glDrawRangeElements.

(I define correctness by comparing generated images to my previous immediate mode code - Using glDrawRangeElements draws a very dark image suggesting multiple texture modulation passes).

Does anyone know exactly which methods accept

1) Byte offsets
2) Machine type index (number of shorts)
3) Primitive index (number of faces)
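For the glDrawElements case, the unit conversion can be sketched as below, assuming GL_TRIANGLES with GL_UNSIGNED_SHORT indices and an index buffer bound as the element array: the count argument is a number of indices, and the last argument is a byte offset into the bound index buffer. The class and method names are hypothetical.

```java
// Hypothetical helper converting an inclusive face range into glDrawElements arguments.
final class DrawRange {
    static final int INDICES_PER_FACE = 3; // GL_TRIANGLES
    static final int BYTES_PER_INDEX = 2;  // GL_UNSIGNED_SHORT

    /** Number of indices (not faces, not bytes) covering faces firstFace..lastFace inclusive. */
    static int indexCount(int firstFace, int lastFace) {
        return INDICES_PER_FACE * (lastFace - firstFace + 1);
    }

    /** Byte offset into the bound index buffer at which firstFace starts. */
    static long byteOffset(int firstFace) {
        return (long) BYTES_PER_INDEX * INDICES_PER_FACE * firstFace;
    }
}
```

Keeping the two conversions in named methods at least confines the mix of units to one place.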

tomb · October 28, 2004, 17:50:52

First of all, have a look at the documentation: http://oss.sgi.com/projects/ogl-sample/registry/EXT/draw_range_elements.txt

"start" and "end", which you have called "firstfaceidx" and "lastfaceidx", are the indices of the first and last vertex, i.e. the minimum and maximum values of the indices you're about to draw. It's all confusing, and I'm not sure whether I don't understand the docs or I don't understand your code.
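In other words, the two range arguments bound the vertex index values, not the faces. A minimal sketch of computing them for a run of unsigned short indices, with hypothetical names:

```java
// Hypothetical helper; computes the "start" and "end" values for glDrawRangeElements.
final class IndexRange {
    /**
     * Returns {min, max} of the vertex indices in idx[offset .. offset + count - 1],
     * reading each short as an unsigned value.
     */
    static int[] minMax(short[] idx, int offset, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = offset; i < offset + count; i++) {
            int v = idx[i] & 0xFFFF; // Java shorts are signed; mask to get 0..65535
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new int[] {min, max};
    }
}
```

For a static mesh these values can be computed once per chunk when the index buffer is built.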

As for which methods accept what, have a look at the docs:
http://developer.3dlabs.com/glmanpage_index.htm
http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/gl/
http://oss.sgi.com/projects/ogl-sample/registry/
