GLSL skinning shader performance issues

Started by plummew, February 21, 2008, 23:32:05


plummew

Hi,

I've recently knocked together a (matrix palette) skinning vertex shader to replace the Java-based skinning I was using.

I fully expected a hefty performance boost when I switched to using a shader running from VBO data but this just hasn't materialised.

The only data I send to the card each frame is the interpolated final matrix palette (as a uniform array of 4x4 matrices). Everything else is in VBOs: vertices, texture coordinates, normals, and palette indices/weights.

When I profile my code it turns out that the call to send the final matrix array:

ARBShaderObjects.glUniformMatrix4ARB(ResourceFactory.FINAL_MATRICES_LOCATION, false, object.finalMatrixBuffer);

is using a ridiculous amount of CPU.

Does anyone out there have experience of using LWJGL and shaders for skinning and have any pointers/gotchas that I might have overlooked?

(Oh and I'm running an Nvidia GeForce Go 7900 GS - so it definitely supports hardware shaders etc.)

Cheers.

plummew

This gets more puzzling:

I've been experimenting with ARBShaderObjects.glUniform methods and have found some amazing differences in performance.

Take the following code, which supplies a uniform vec4 as a FloatBuffer:

FloatBuffer buf = BufferUtils.createFloatBuffer(4);
ARBShaderObjects.glUniform4ARB(loc, buf);

Now take functionally equivalent code, which supplies the uniform vec4 as 4 floats:

FloatBuffer buf = BufferUtils.createFloatBuffer(4);
ARBShaderObjects.glUniform4fARB(loc, buf.get(0), buf.get(1),buf.get(2),buf.get(3));

The second method runs 50 times faster than the first.

So my problem seems to lie with either the LWJGL implementations of ARBShaderObjects.glUniform4ARB, ARBShaderObjects.glUniformMatrix4ARB etc. or the underlying native methods. (Getting beyond the realms of my knowledge now)

So can anyone suggest an approach for resolving this problem?

Do I need to delve into the native methods being used here?

Any help much appreciated.


spasi

Quote from: plummew on February 21, 2008, 23:32:05
ARBShaderObjects.glUniformMatrix4ARB(ResourceFactory.FINAL_MATRICES_LOCATION, false, object.finalMatrixBuffer);

This should be fast, but I can't tell what might be wrong without more info.

I use glUniform4fv (I upload 3 rows per matrix, not 4) from GL20 for hundreds of characters and it's very fast.
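For anyone wondering what uploading 3 rows per matrix might look like, here is a minimal sketch using only java.nio. It assumes each bone matrix is an affine transform stored row-major in a float[16] whose last row is (0, 0, 0, 1), so only the first three rows need to be sent. The names PalettePacker and packRows are made up for illustration and are not the actual code described above. The direct buffer is allocated by hand; in LWJGL, BufferUtils.createFloatBuffer would do the same job.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class PalettePacker {

    /** Number of floats uploaded per bone: 3 rows of 4 floats. */
    static final int FLOATS_PER_BONE = 12;

    /**
     * Copies the first three rows of each row-major 4x4 matrix into one
     * direct FloatBuffer, ready to be sent as an array of vec4.
     */
    static FloatBuffer packRows(float[][] matrices) {
        FloatBuffer buf = ByteBuffer
                .allocateDirect(matrices.length * FLOATS_PER_BONE * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (float[] m : matrices) {
            // Rows 0..2 only; row 3 is assumed to be (0, 0, 0, 1).
            buf.put(m, 0, FLOATS_PER_BONE);
        }
        buf.flip(); // position = 0, limit = number of floats written
        return buf;
    }

    public static void main(String[] args) {
        float[] identity = {
            1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            0, 0, 0, 1
        };
        FloatBuffer buf = packRows(new float[][] { identity, identity });
        // With a GL context this would be followed by something like:
        // GL20.glUniform4(loc, buf);
        System.out.println(buf.remaining() / 4 + " vec4s packed");
    }
}
```

On the shader side this would pair with a vec4 array holding three entries per bone; that declaration is likewise an assumption, not something shown in this thread.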

plummew

Spasi,

Thanks for responding.

I'm not sure what extra info would help - my second post simplifies matters to a single difference: using a FloatBuffer or using 4 floats.

I've taken a look at other examples of GLSL compiling and linking and I'm doing nothing unusual:
        shaderID = ARBShaderObjects.glCreateShaderObjectARB(ARBVertexShader.GL_VERTEX_SHADER_ARB);
        ARBShaderObjects.glShaderSourceARB(shaderID, source);
        ARBShaderObjects.glCompileShaderARB(shaderID);
        pID = ARBShaderObjects.glCreateProgramObjectARB();
        ARBShaderObjects.glAttachObjectARB(pID, shaderID);
        ARBShaderObjects.glLinkProgramARB(pID);
then later...
        final int loc = ARBShaderObjects.glGetUniformLocationARB(pID, buf); // where buf contains name "boneMatrices"
then later...
        ARBShaderObjects.glUseProgramObjectARB(ResourceFactory.paletteShader.pID);
        FloatBuffer buf = BufferUtils.createFloatBuffer(4);
// This one runs as fast as I'd expect:
//        ARBShaderObjects.glUniform4fARB(ResourceFactory.FINAL_MATRICES_LOCATION, buf.get(0), buf.get(1),buf.get(2),buf.get(3));
// this one runs very slowly:
        ARBShaderObjects.glUniform4ARB(ResourceFactory.FINAL_MATRICES_LOCATION, buf);

        ARBShaderObjects.glUseProgramObjectARB(0);

and my shader declaration:
uniform vec4 boneMatrices;

I was using ARBVertexShader.glGetUniformLocationARB instead of ARBShaderObjects.glGetUniformLocationARB at first but when I changed it, it made no difference.

Does the process detailed above match the one you're using? Are you performing any relevant shader setup that I'm not?

It's encouraging that you have no problems with uniforms as it means I 'should' be able to resolve this but I'm at a loss to know what the problem might be.

Can I ask what Gfx card you're using?

One thing I could try is to convert all my shader calls to use the GL20 calls you're using. Bit of a desperate measure I know...

Any further thoughts appreciated.

Fool Running

Are you creating a new FloatBuffer each time you call it? I can't tell by looking at the code snippets you posted.

EDIT: I just thought of something else... Are you flip()ing the buffer before you send it? (if you used the put methods to add stuff to the buffer its position will be wrong).
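To illustrate the flip() point, here is a small sketch using only java.nio. As far as I know, LWJGL's buffer-taking methods read from the buffer's current position up to its limit, so a buffer that was filled with put() and never flipped appears to hold nothing. FlipDemo and visibleFloats are made-up names for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class FlipDemo {

    /** Returns how many floats a consumer reading position..limit would see. */
    static int visibleFloats(boolean flip) {
        FloatBuffer buf = ByteBuffer.allocateDirect(4 * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        buf.put(1f).put(2f).put(3f).put(4f); // position is now 4, limit is 4
        if (flip) {
            buf.flip(); // limit = old position (4), position = 0
        }
        return buf.remaining(); // limit - position
    }

    public static void main(String[] args) {
        System.out.println("without flip: " + visibleFloats(false)); // 0
        System.out.println("with flip:    " + visibleFloats(true));  // 4
    }
}
```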

plummew

"Are you creating a new FloatBuffer each time you call it? " - Not in my code proper. I'm doing it in the test code I knocked up just because it was convenient.
But it doesn't matter anyway as the performance difference I'm measuring is in the single call to  ARBShaderObjects.glUniform4fARB(ResourceFactory.FINAL_MATRICES_LOCATION, buf.get(0), buf.get(1),buf.get(2),buf.get(3));
versus
ARBShaderObjects.glUniform4ARB(ResourceFactory.FINAL_MATRICES_LOCATION, buf);
The former (4 floats) being > 50 times faster than the latter (FloatBuffer size 4)

"Are you flip()ing the buffer" - yep.

I'm currently converting to use GL20 rather than ARB functions. I don't expect this to make a difference though.

If I get no joy, I'll try posting on a jogl forum or see if there are any issues that might offer a clue on the Nvidia developer site.

plummew

... and for anyone who's interested...

Converting from ARB to GL20 made no difference.

Well, that's me clueless.

spasi

This doesn't make sense. I just tested my code on a character with 22 bones; populating the matrix buffer AND calling GL20.glUniform4 doesn't take more than 2000 nanoseconds.

I'd make an isolated test case if I were you. This might be a shader/compiler issue (shader getting recompiled?).

Quote from: plummew on February 22, 2008, 22:11:06
uniform vec4 boneMatrices;

Is this a vec4 array or a mat4 array in your normal code? If it's a matrix, try expanding it to 4 vec4 each and adjusting your code accordingly. Then try using glUniform4 instead of glUniformMatrix4.

Edit: forgot to reply to this:

Quote from: plummew on February 22, 2008, 22:11:06
Can I ask what Gfx card you're using?

This is on a 8800 GTX atm, but I've been using the same code since I had an NV30 with ARBShaderObjects and OpenGL 1.4. Works fine on ATI cards too.

plummew

Spasi,

Thanks again for the response.

I think the isolated test case is good advice, I'll knock it up this evening.

I'll create a simple class that compiles/links and uses a trivial vertex shader.
I'll add an elapsed time check and code a simple uniform population.

If I still get the same performance difference, I'll post the full code here.

plummew

OK,

I knocked up a test program. (See below for code)

It's a single static class, run via main method.

Shader is stored as a String to keep matters simple.

I've run this with VM options: -Xms512m -Xmx512m

And I consistently find the non-FloatBuffer variant runs 10 times faster than the FloatBuffer variant.
Here are my results:

Running with GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3))
Average elapsed in nanosecs:1964.0

Running with GL20.glUniform4(loc, buf)
Average elapsed in nanosecs:19258.0


I let it run for a few thousand iterations to give the hotspot time to compile before summing for an average elapsed time.

I've also tried altering the shader to accept an array of 40 vec4 and the code to pass in a FloatBuffer of 160 floats.
This ran in much the same time as the "GL20.glUniform4(loc, buf)" call above (approx. 19000 nanosecs)

It's also interesting to note that this test code shows only a 10-fold difference in performance, rather than the 50-fold difference I see when the same call runs inside my code proper.

Spasi,

1) Is there any chance you could look through this code and see if anything jumps out at you as odd or problematic?
2) Do you initialise/process your vertex shaders in the same way as this test code or do you do anything significantly different?

3) It would also be interesting to see what kind of results you get when you run this test code - is that possible?

import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import org.lwjgl.BufferUtils;
import org.lwjgl.LWJGLException;
import org.lwjgl.opengl.Display;
import org.lwjgl.opengl.DisplayMode;
import org.lwjgl.opengl.GL20;

public class Test {
    
    private static long timeStart = 0;
    private static long timeEnd = 0;

    private static int GL_FALSE = 0;

    private static IntBuffer pBuf = BufferUtils.createIntBuffer(1);
    private static ByteBuffer fBuf = BufferUtils.createByteBuffer(100);
    private static ByteBuffer source;

    private static int shaderID;
    private static int programID;
    
    private static final long NUM_ITERATIONS = 1000000L;
    private static long sum = 0L;

    public static void main(String[] args) {

        try {
            Display.destroy();
            Display.setDisplayMode(new DisplayMode(800, 600));
            Display.setFullscreen(false);
            Display.create();
        } catch (LWJGLException e) {
            System.exit(10);
        }
        
        String s = "uniform vec4 testUniform; void main() { gl_Position = gl_ModelViewProjectionMatrix * testUniform * gl_Vertex;}";
        source = BufferUtils.createByteBuffer(s.length());
        source.put(s.getBytes());
        source.flip();
        shaderID = GL20.glCreateShader(GL20.GL_VERTEX_SHADER);
        GL20.glShaderSource(shaderID, source);
        GL20.glCompileShader(shaderID);
        GL20.glGetShader(shaderID, GL20.GL_COMPILE_STATUS, pBuf);

        if (pBuf.get(0) == GL_FALSE) {
            System.exit(102);
        }
        programID = GL20.glCreateProgram();
        GL20.glAttachShader(programID, shaderID);
        GL20.glLinkProgram(programID);
        GL20.glGetProgram(programID, GL20.GL_LINK_STATUS, pBuf);

        if (pBuf.get(0) == GL_FALSE) {
            System.exit(103);
        }
        
        int loc = getUniformLocation("testUniform");
        FloatBuffer buf = BufferUtils.createFloatBuffer(4);
        buf.put(1).put(2).put(3).put(4);
        buf.flip();

        for(int i = 0; i < NUM_ITERATIONS; i++) {
            timeStart = System.nanoTime();            

// I'm fast
//            GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3));
// I'm slow
            GL20.glUniform4(loc, buf);

            timeEnd = System.nanoTime();

            GL20.glUseProgram(0);
            if(i > 9999) { // Give hotspot time to compile code
                sum += timeEnd - timeStart;
            }
        }
        
        System.out.println("Average elapsed in nanosecs:" + (float)(sum / (NUM_ITERATIONS - 9999)));
    }

    
    private static int getUniformLocation(String name) {
        fBuf.clear();

        int length = name.length();

        char[] charArray = new char[length];
        name.getChars(0, length, charArray, 0);

        for ( int i = 0; i < length; i++ )
            fBuf.put((byte)charArray[i]);
        fBuf.put((byte)0); // Must be null-terminated.
        fBuf.flip();

        // Look the location up once; -1 means the uniform is not active in the program.
        int location = GL20.glGetUniformLocation(programID, fBuf);

        if ( location == -1 )
            throw new IllegalArgumentException("The uniform \"" + name + "\" does not exist in the Shader Program.");

        return location;
    }

    
}

spasi

Quote from: plummew
1) Is there any chance you could look through this code and see if anything jumps out at you as odd or problematic?

Everything looks fine, except your benchmark code. Timing a single function call is never going to be reliable. I usually poll the timer outside the loop, like this:

{
    // Warm-up
    test(10000, loc, buf);
    // Benchmark
    sum = test(NUM_ITERATIONS, loc, buf);
}

private static long test(final long iterations, final int loc, final FloatBuffer buf) {
    timeStart = System.nanoTime();

    for ( long i = 0; i < iterations; i++ ) {
// I'm fast
        //GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3));
// I'm slow
        GL20.glUniform4(loc, buf);
    }
    timeEnd = System.nanoTime();

    return timeEnd - timeStart;
}
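The same pattern can be tried without a GL context. Below is a generic sketch of it: warm up first, then read the timer once before and once after the whole loop, and divide by the iteration count. LoopTimer, time and averageNanos are hypothetical names, and the workload here is a trivial stand-in for the glUniform4 call. Going through a Runnable adds a little overhead of its own, so treat the absolute numbers as rough.

```java
final class LoopTimer {

    /** Times `iterations` calls of `work` with one timer read before and one after. */
    static long time(long iterations, Runnable work) {
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            work.run();
        }
        return System.nanoTime() - start;
    }

    /** Warms up first, then returns the average nanoseconds per call. */
    static double averageNanos(long warmup, long iterations, Runnable work) {
        time(warmup, work); // give HotSpot a chance to compile the loop
        return (double) time(iterations, work) / iterations;
    }

    public static void main(String[] args) {
        float[] sink = new float[1];
        // Stand-in workload; in the test program this would be the glUniform4 call.
        Runnable work = () -> sink[0] += 1f;
        System.out.println("Average elapsed in nanosecs: "
                + averageNanos(10_000, 1_000_000, work));
    }
}
```

Dividing as a double keeps the fractional part of the average, which matters once a call takes well under a microsecond.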


Quote from: plummew
2) Do you initialise/process your vertex shaders in the same way as this test code or do you do anything significantly different?

I do it the same way.

Quote from: plummew
3) It would also be interesting to see what kind of results you get when you run this test code - is that possible?

Using your benchmark code:

Running with GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3))
Average elapsed in nanosecs: 405.0
Running with GL20.glUniform4(loc, buf)
Average elapsed in nanosecs: 470.0

Using my benchmark code:

Running with GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3))
Average elapsed in nanosecs: 74.6
Running with GL20.glUniform4(loc, buf)
Average elapsed in nanosecs: 132.2

Run on an Intel Q6600.

The above numbers look consistent with what I would expect, given that glUniform4 is doing some extra work (at least on the Java side). I also tried making testUniform an array. With only 10 vec4s, a loop of glUniform4f calls is 3 times slower than using glUniform4 to update them in one go, which is again what you would expect.

It all sounds like a driver issue on your side.

plummew


I'll confirm the driver issue possibility by running code on a desktop rather than my usual laptop.

I'll try updating my Dell driver (if one is available).

Otherwise I'll just live with the performance problem for development purposes.

I'll post again if I ever definitively get to the bottom of this.

Thanks again for the input.

plummew

Spasi,

For info:

Using your benchmark code:

Running with GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3))
Average elapsed in nanosecs: 78.0
Running with GL20.glUniform4(loc, buf)
Average elapsed in nanosecs: 24456.0

Only 313 times slower...

plummew

...and one driver update later:

Running with GL20.glUniform4f(loc, buf.get(0), buf.get(1), buf.get(2), buf.get(3))
Average elapsed in nanosecs: 76.0
Running with GL20.glUniform4(loc, buf)
Average elapsed in nanosecs: 210.0

So fixed.

Groovy.