I'm using LWJGL from Scala to enqueue OpenCL kernels for fast array operations. This is not for a game; I'm training neural nets and trying to speed the training up. When I run training on my MacBook, the kernel "enqueue" call accounts for about 0.5% of total run time, but on a Linux machine with NVIDIA graphics cards it accounts for about 15%. I've also compared enqueue times in LWJGL with enqueue times in C, and the LWJGL ones are sometimes significantly longer, e.g. 300,000 ns vs. 20,000 ns (comparing worst-case times).
I've tried a number of different ways of measuring the timing and tweaked some LWJGL configuration settings, but I still can't figure out why the "enqueue" call is sometimes slow.
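For context, per-call timing of this kind can be sketched roughly as follows. This is a minimal illustration, not the exact code used: `EnqueueTiming`, `measure`, and `summarize` are hypothetical names, and the `Runnable` is a placeholder for the real enqueue call (for example `clEnqueueNDRangeKernel`). It runs a warm-up phase first so JIT compilation is less likely to show up in the samples, then reports min, median, 99th percentile, and max so that occasional slow calls stand out from the typical case.

```java
import java.util.Arrays;

public class EnqueueTiming {
    /** Summary of per-call latencies, all in nanoseconds. */
    public static final class Stats {
        public final long min, median, p99, max;
        Stats(long min, long median, long p99, long max) {
            this.min = min; this.median = median; this.p99 = p99; this.max = max;
        }
    }

    /** Runs op `warmup` times untimed, then times `iterations` calls individually. */
    public static Stats measure(Runnable op, int warmup, int iterations) {
        if (iterations <= 0) throw new IllegalArgumentException("iterations must be > 0");
        for (int i = 0; i < warmup; i++) op.run();   // give the JIT a chance to compile op
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - t0;
        }
        return summarize(samples);
    }

    /** Sorts a copy of the samples and picks min, median, nearest-rank p99, and max. */
    public static Stats summarize(long[] samples) {
        long[] s = samples.clone();
        Arrays.sort(s);
        int n = s.length;
        int p99Index = (99 * n + 99) / 100 - 1;      // ceil(0.99 * n) - 1 in integer math
        return new Stats(s[0], s[n / 2], s[p99Index], s[n - 1]);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace with the real enqueue call,
        // e.g. a lambda wrapping clEnqueueNDRangeKernel(...).
        Runnable fakeEnqueue = () -> Math.sqrt(System.nanoTime());
        Stats st = measure(fakeEnqueue, 10_000, 100_000);
        System.out.printf("min=%d ns median=%d ns p99=%d ns max=%d ns%n",
                st.min, st.median, st.p99, st.max);
    }
}
```

Timing only the call itself captures host-side submission cost; since enqueueing is asynchronous, kernel execution time is not included unless the queue is explicitly waited on (for example with `clFinish`).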
Does anyone have any tips for tracking down the cause of this slowness?