Awesome, looks great!
Just a heads-up, though: CLKernel.setArgSize(int index, long size) and its low-level equivalent, clSetKernelArg(CLKernel kernel, int arg_index, long arg_value_arg_size), crash when called on the trunk version. The problem has been fixed in the OpenGL ES branch, so you'd better switch to that branch if you want to use OpenCL's work-group local memory (which you may need in order to optimize the implementation for the GPU).
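
In case it helps once you're on the fixed branch: for a local-memory argument, OpenCL expects only a size in bytes (no buffer), so the value passed as `size` is typically the work-group size times the element size. Here is a rough sketch of that arithmetic. Note that `LocalMemSize` and `bytesFor` are made-up names for illustration, not part of the binding, and the work-group size of 64 is just an assumed example value.

```java
// Hypothetical helper (not part of the OpenCL binding): computes the byte
// size of a work-group local buffer, i.e. the value to pass as `size`.
final class LocalMemSize {
    private LocalMemSize() {}

    static long bytesFor(int workGroupSize, int bytesPerElement) {
        if (workGroupSize <= 0 || bytesPerElement <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Widen to long before multiplying so large sizes cannot overflow int.
        return (long) workGroupSize * bytesPerElement;
    }
}
```

With a CLKernel instance called kernel and a local float buffer at (say) argument index 2, the call would look something like kernel.setArgSize(2, LocalMemSize.bytesFor(64, Float.BYTES)); the index and the variable name are assumptions about your kernel, so adjust them to match its signature.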