LWJGL 3 - Why does drawing sprites take so much performance?

Started by mrdlink, March 03, 2018, 11:26:50

Previous topic - Next topic

mrdlink

I'm wondering how drawing simple geometry with textures can eat up so much performance (below 60 fps). Even my decent graphics card (GTX 960) can "only" draw up to 1000 sprites smoothly. The textures I'm using are all power-of-two textures and don't exceed 512x512. I'm even filtering with GL_NEAREST only.
The sprites themselves are randomly sized, so I'm not drawing 1000 fullscreen quads, which wouldn't be a realistic use case.

I'm drawing my sprites batched, meaning I have one dynamic vertex buffer and a static index buffer. I update the vertex buffer once per frame with glBufferSubData and then draw everything with `glDrawElements`. I have about 5 different textures, each bound once per frame, resulting in 5 draw calls. For rendering I'm using a single shader which is bound once when the application starts.
So per frame I have 5 texture bindings, 5 draw calls, and one vertex buffer update, which is not really that much.

I profiled my program: updating the vertex buffers and all the geometry takes about 10% of the frame time, while swapping the buffers takes up the remaining 90%.

So I'm asking: how can big AAA games render scenes with millions of vertices if drawing pixels is such a time-consuming task? I know there is a lot of optimization in their code, but still.

spasi

Try using glColorMask/glDepthMask to disable framebuffer writes. This will tell you if you're bound by framebuffer bandwidth, or if the problem is something else.
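A minimal sketch of that diagnostic (plain LWJGL GL11 calls; how you wire it into your render loop is up to you):

```kotlin
import org.lwjgl.opengl.GL11.glColorMask
import org.lwjgl.opengl.GL11.glDepthMask

// Diagnostic only: with all framebuffer writes masked off, the rasterizer still
// runs but no color/depth memory traffic happens. A large speedup here points
// at framebuffer bandwidth (overdraw), not at vertex processing.
fun setFramebufferWrites(enabled: Boolean) {
    glColorMask(enabled, enabled, enabled, enabled)
    glDepthMask(enabled)
}
```

This needs a live OpenGL context, so run it inside your existing frame loop.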

mrdlink

OK. I first set everything to false in both calls. Nothing appears on screen, obviously, and the program runs twice as fast. Setting any of the color mask channels back to true drops performance to where it was before.
But I guess the buffer updating alone already costs a lot: with all writes masked off, just doing the updates drops me from about 1500 fps to 500 fps. Enabling the drawing then only halves it again.

KaiHH

Drawing as few as 1000 quads does not warrant updating any vertex buffers dynamically every frame. I'm guessing you do that to implement viewport culling, so you only submit and draw sprites contained in the viewport.
You will likely be far better off just drawing all of them.
Your graphics card should _easily_ be able to draw 10,000 quads at above 100Hz.
If you still want to perform culling on the CPU, do you use a hierarchical spatial acceleration structure such as a quadtree or do you just linearly iterate over all your sprites and check whether each of them is visible?
If your sprites' positions are static then definitely use a quadtree.
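For reference, here is a minimal quadtree sketch supporting the viewport query described above. The `AABB` and `QuadTree` types are hypothetical illustrations, not taken from the thread's code:

```kotlin
// Axis-aligned bounding box; a sprite's screen rectangle.
data class AABB(val x: Float, val y: Float, val w: Float, val h: Float) {
    fun intersects(o: AABB) =
        x < o.x + o.w && x + w > o.x && y < o.y + o.h && y + h > o.y
    fun contains(o: AABB) =
        o.x >= x && o.y >= y && o.x + o.w <= x + w && o.y + o.h <= y + h
}

class QuadTree(val bounds: AABB, private val capacity: Int = 8) {
    private val items = mutableListOf<AABB>()
    private var children: Array<QuadTree>? = null

    fun insert(box: AABB) {
        // Push the box into the first child that fully contains it.
        children?.firstOrNull { it.bounds.contains(box) }?.let { it.insert(box); return }
        items.add(box)
        if (children == null && items.size > capacity) subdivide()
    }

    fun query(view: AABB, out: MutableList<AABB> = mutableListOf()): MutableList<AABB> {
        if (!bounds.intersects(view)) return out // prune whole off-screen subtrees
        items.filterTo(out) { it.intersects(view) }
        children?.forEach { it.query(view, out) }
        return out
    }

    private fun subdivide() {
        val hw = bounds.w / 2
        val hh = bounds.h / 2
        val c = arrayOf(
            QuadTree(AABB(bounds.x,      bounds.y,      hw, hh), capacity),
            QuadTree(AABB(bounds.x + hw, bounds.y,      hw, hh), capacity),
            QuadTree(AABB(bounds.x,      bounds.y + hh, hw, hh), capacity),
            QuadTree(AABB(bounds.x + hw, bounds.y + hh, hw, hh), capacity))
        children = c
        // Move items down into any child that fully contains them.
        val it = items.iterator()
        while (it.hasNext()) {
            val box = it.next()
            val child = c.firstOrNull { q -> q.bounds.contains(box) }
            if (child != null) { child.insert(box); it.remove() }
        }
    }
}
```

`query(viewport)` then returns only the sprites that can be visible, instead of testing all of them linearly.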

Quote: "I'm even filtering with GL_NEAREST only."
That can actually worsen performance! Without mipmaps, minified sprites force the texture-fetching hardware to jump around your full 512x512 texture to retrieve texels, resulting in bad cache coherence.
Definitely use mipmapping, with at least GL_NEAREST_MIPMAP_NEAREST.

mrdlink

A static vbo performs the same. I already tried with GL_STATIC_DRAW and setting it up once. So right now I'm not sorting anything, just drawing stuff.

Cornix

You must be doing something wrong somewhere. Even on my 8-year-old machine I can render millions of 2D textured quads (2 triangles per quad) with simple immediate mode (glBegin/glEnd). We can help you much better if you simplify your code as much as possible and then show us exactly what you are doing. Try to keep everything in a single Java file with no logic other than the rendering (and perhaps texture loading).

mrdlink

Well, simplified I'm setting up and rendering everything like this:
val shaderProgram = ShaderProgram("assets/default.vert", "assets/default.frag")
val texture = Texture("assets/libgdx-logo.png")
val texture2 = Texture("assets/libgdx-logo-mipmap.png")
val sprite = BufferSprite(texture)

val vertexData = MemoryUtil.memAllocFloat(8 * 4 * 8192)

fun setup() {
    glEnable(GL_TEXTURE_2D) // was glEnable(GL_TEXTURE), an invalid enum; with shaders this enable is a no-op anyway
    glColorMask(true, true, true, true)
    glDepthMask(false)

    //Setup vertex buffer and index buffer
    stackPush().use { stack ->
            val indices = MemoryUtil.memAllocInt(6 * 8192)
            for(i in 1..8192)
                indices.put(intArrayOf(
                        0 + 4*(i-1), 1 + 4*(i-1), 2 + 4*(i-1), 1 + 4*(i-1), 3 + 4*(i-1), 2 + 4*(i-1)
                ))
            indices.flip()

            val vao = stack.mallocInt(1)
            glGenVertexArrays(vao)
            glBindVertexArray(vao.get(0))

            val vbos = stack.mallocInt(2)
            glGenBuffers(vbos)
            glBindBuffer(GL_ARRAY_BUFFER, vbos.get(0))
            glBufferData(GL_ARRAY_BUFFER, vertexData, GL_DYNAMIC_DRAW)
            glEnableVertexAttribArray(0)
            glEnableVertexAttribArray(1)
            glEnableVertexAttribArray(2)
            glVertexAttribPointer(0, 2, GL_FLOAT, false, 8 * sizeof(Float), 0)
            glVertexAttribPointer(1, 4, GL_FLOAT, false, 8 * sizeof(Float), 2.toLong() * sizeof(Float))
            glVertexAttribPointer(2, 2, GL_FLOAT, false, 8 * sizeof(Float), 6.toLong() * sizeof(Float))

            glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vbos.get(1))
            glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices, GL_STATIC_DRAW)
            MemoryUtil.memFree(indices) // indices were heap-allocated with memAllocInt, free after upload
     }

    //Setup projection matrix
    glUseProgram(shaderProgram.program)
    stackPush().use { stack ->
        val mat = stack.mallocFloat(16)
        projView.get(mat)
        val loc = glGetUniformLocation(shaderProgram.program, "u_projView")
        glUniformMatrix4fv(loc, false, mat)
    }
}

fun flush() {
    if(vertexData.position() == 0) return // nothing batched this frame
    vertexData.flip()
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertexData)
    // 8 floats per vertex, 4 vertices per sprite, 6 indices per sprite
    glDrawElements(GL_TRIANGLES, 6 * vertexData.limit() / (8 * 4), GL_UNSIGNED_INT, 0)
    vertexData.clear()
}

fun draw(sprite: BufferSprite) {
    vertexData.put(sprite.vertexData)
}

fun render() {
    glClear(GL_COLOR_BUFFER_BIT)

    texture.bind()
    for(i in 1..500)
        draw(sprite)
    flush()

    texture2.bind()
    for(i in 1..500)
        draw(sprite)
    flush()
}

//Texture loading
class Texture(file: String) {

    val handle: Int

    val width: Int
    val height: Int

    init {
        val stack = stackPush()
            val w = stack.mallocInt(1)
            val h = stack.mallocInt(1)
            val bpp = stack.mallocInt(1)
            stbi_set_flip_vertically_on_load(true)
            val image = stbi_load(file, w, h, bpp, 4)
                    ?: throw RuntimeException("Failed to load a texture file!" + System.lineSeparator() + stbi_failure_reason())

            width = w.get()
            height = h.get()

            val textureID = stack.mallocInt(1)
            glGenTextures(textureID)
            handle = textureID.get()

            glBindTexture(GL_TEXTURE_2D, handle)
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE)
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE)
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST)
                glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST)
                glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, image)
                glGenerateMipmap(GL_TEXTURE_2D)
            glBindTexture(GL_TEXTURE_2D, 0)

            stbi_image_free(image)
        stack.pop()
    }

    fun bind(unit: Int = 0) {
        glActiveTexture(GL_TEXTURE0 + unit)
        glBindTexture(GL_TEXTURE_2D, handle)
    }
}

//Sprite wrapping vertex data
class BufferSprite(val texture: Texture) {

    companion object {
        val VERTEXDATA_FLOAT_SIZE = 8 * 4
        val VERTEXDATA_BYTE_SIZE = 8 * 4 * 4
    }

    val vertexData: FloatBuffer = MemoryUtil.memAllocFloat(VERTEXDATA_FLOAT_SIZE) // not private: draw() reads it

    init {
        vertexData.put(floatArrayOf(
                0f,0f, 1f,1f,1f,1f, 0f,0f,
                texture.width.toFloat(),0f, 1f,1f,1f,1f, 1f,0f,
                0f,texture.height.toFloat(), 1f,1f,1f,1f, 0f,1f,
                texture.width.toFloat(),texture.height.toFloat(), 1f,1f,1f,1f, 1f,1f
        ))
        vertexData.clear()
    }
}

Cornix

If I read your code correctly, you are rebuilding your buffer each frame: writing all the floats, sending them to the GPU, and then clearing the buffer again. If that is the case it is no wonder you get bad performance. It's not your GPU that is the problem, it is the work your CPU needs to do to fill the buffer and send it over. Try filling your buffer just once and never clearing it. How does your performance change?

mrdlink

The reason I need to rebuild the buffer every frame is that I want to transform the sprites, so I need to update the data and send it to the GPU every frame.
Taking glDrawElements out of the rendering loop raises the performance back to a reasonable framerate, so the drawing itself must be the problem. By the way, CPU usage while running the above is about 5-8%, while GPU load peaks at 100%.
On the other hand, if I make a static buffer (GL_STATIC_DRAW) and set the data once in the setup method, it performs the same. But why?

Cornix

Perhaps because of the buffer update, especially because it is a glBufferSubData call instead of a glBufferData call. Moving data from the client side to the server side is expensive, and with glBufferSubData the driver has to wait for the previous frame to finish rendering before the buffer can be overwritten. With glBufferData it can instead allocate a completely new data store without waiting for reads from the previous one to finish. It might be even better to keep a few VBOs around and cycle between them: write to VBO 1 while rendering from VBO 2, and so on.
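A sketch of that orphaning idea in LWJGL terms (assuming the target buffer is already bound; GL15-level calls):

```kotlin
import java.nio.FloatBuffer
import org.lwjgl.opengl.GL15.*

// Orphaning: re-specifying the data store lets the driver hand back fresh
// memory instead of stalling until the GPU finishes reading last frame's data.
fun uploadOrphaned(target: Int, sizeBytes: Long, data: FloatBuffer) {
    glBufferData(target, sizeBytes, GL_DYNAMIC_DRAW) // orphan the old store
    glBufferSubData(target, 0, data)                 // fill the new one
}
```

This also needs a live OpenGL context; `sizeBytes` should match the full capacity allocated at setup.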

mrdlink

I already tried different methods of updating VBOs, including buffer orphaning via glBufferData and double/triple buffering. I even tried persistent, immutable buffers, using glBufferStorage and glMapBufferRange to map GPU memory directly. But the load on CPU and GPU stays the same. It really bugs me that I can't seem to find a solution. Updating 128 KB of data per frame really shouldn't be a problem.
I even tried immediate mode rendering with glBegin/glEnd recently, but that changes nothing either.
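For completeness, the persistent-mapping variant mentioned here looks roughly like this (GL 4.4 glBufferStorage; `SIZE_BYTES` is a placeholder for the batch capacity, and a real implementation needs fence syncs):

```kotlin
import java.nio.ByteBuffer
import org.lwjgl.opengl.GL15.GL_ARRAY_BUFFER
import org.lwjgl.opengl.GL30.GL_MAP_WRITE_BIT
import org.lwjgl.opengl.GL30.glMapBufferRange
import org.lwjgl.opengl.GL44.*

const val SIZE_BYTES = 8L * 4 * 8192 * 4 // placeholder: 8 floats * 4 verts * 8192 sprites * 4 bytes

fun mapPersistent(): ByteBuffer {
    val flags = GL_MAP_WRITE_BIT or GL_MAP_PERSISTENT_BIT or GL_MAP_COHERENT_BIT
    glBufferStorage(GL_ARRAY_BUFFER, SIZE_BYTES, flags)        // immutable storage
    return glMapBufferRange(GL_ARRAY_BUFFER, 0, SIZE_BYTES, flags)!!
    // Mapped once for the buffer's lifetime; write vertices here each frame,
    // but use glFenceSync/glClientWaitSync before reusing a region the GPU
    // may still be reading from.
}
```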

Maybe it's a JVM problem?

spasi

Could you also post the main loop and shader code? Or, preferably, a complete program (with inline shader code) that we can run and maybe reproduce the issue.

mrdlink

Here is the complete project. It's an IntelliJ IDEA project written in Kotlin. You can use the source code in Eclipse too, but you have to install the Kotlin plugin.

spasi

Performance is limited by framebuffer bandwidth. You're rendering a thousand 200x250 images one on top of the other, which may not sound like a lot, but if you do the math:

200x250 pixels x 1000 images x 4 bytes ~= 190 MB per frame. My GTX 970 renders the scene at 950 fps, so 190 MB x 950 ~= 176 GB/s, which is awfully close to the theoretical bandwidth limit of this GPU (196 GB/s).
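That arithmetic, spelled out as a quick sanity check with the same numbers:

```kotlin
// Framebuffer traffic estimate: pixels per sprite * sprite count * bytes per pixel.
val bytesPerFrame = 200L * 250 * 1000 * 4                          // = 200,000,000 bytes
val mbPerFrame = bytesPerFrame / (1024.0 * 1024.0)                 // ~190.7 MB written per frame
val gbPerSecond = bytesPerFrame * 950 / (1024.0 * 1024.0 * 1024.0) // at 950 fps: ~176.9 GB/s
```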

This workload is not realistic. It's massive overdraw and real rendering applications do everything they can to minimize it (depth-testing, rendering opaque surfaces before transparent ones, occlusion culling, etc). The performance you're seeing has nothing to do with the vertex data updates or render call submissions (though you could do a lot about that too).

I would also highly recommend switching measurements from fps to ms/frame. (source 1, source 2)
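The reason ms/frame is the better unit: frame costs add linearly in milliseconds, while fps is a reciprocal. A tiny illustration:

```kotlin
// fps is a reciprocal, so equal fps drops represent very different costs.
fun msPerFrame(fps: Double) = 1000.0 / fps

// Dropping from 1500 fps to 1000 fps costs the same ~0.33 ms per frame as
// dropping from 60 fps to ~58.8 fps; measuring in ms makes that obvious.
```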