SLI multicasting with multisampled FBOs


#1

Hi there, this is probably a very “niche” question. I am trying to use the NVIDIA SLI multicast extension (GL_NV_gpu_multicast) with Cinder, to get better performance in VR. It seems to be working fine, and I’m definitely seeing performance improvements (so far around 1.7x frame rate improvement in my particular scenario, with two GTX 1080s). However, I’ve run into a mysterious and frustrating thing…

I’m rendering (for VR) to two Cinder FBOs, normally with 4 samples each. When SLI multicast is available, I render the left eye into GPU 0’s left FBO, and simultaneously the right eye into GPU 1’s “left” FBO. Then, when the rendering is done (after making a single set of draw calls to draw both eyes!), I need to blit the contents of GPU 1’s “left” FBO back to GPU 0 as its “right”. This is required before submitting the two eye textures to OpenVR (or even just to draw on screen, say). The transfer is done with a call to glMulticastBlitFramebufferNV() (which is like the regular glBlitFramebuffer() call, but lets you explicitly transfer between GPUs).

What I’m seeing is that the right eye (after transferring back to GPU 0) is not multisampled. It is all aliased, as if the resolve didn’t happen properly (or rather as if rendering was done without multisampling, since you can’t visualize an MS buffer directly). I have called fbo->resolveTextures(), and before that bindFramebuffer(), which ensures the FBO is marked as “dirty” so that the resolve actually happens.

Strangely, my exact same code works glitch-free if I use 8 or 16 samples in the FBOs, and also works if I turn on CSAA sampling. But it fails (the right eye is aliased) when I have 2 or 4 samples. Of course, it also works fine (both eyes aliased) when no multisampling is enabled.

I know it’s a long shot, but has anyone any thoughts on what might be wrong? I’ve tried all kinds of combinations of things, but keep seeing the same result. Maybe it’s even a driver bug, but somehow I doubt that.

Thanks,
Glen.


#2

Just for the record, I finally managed to sort this out (after first writing a pure OpenGL program to try to repro it, then finding the pure OGL program didn’t exhibit the bug! :wink: ). Ah, so it was time to start stepping into the details of what Cinder is actually doing.

To do the multi-GPU stuff I want, I kind of need to “wrangle” a Cinder FBO. But Cinder kind of protects you from the underlying details, keeping both multisampled and resolved versions in a single Fbo instance (normally a good thing!). But I need to transfer the resolved version from one GPU to the other.

So in the end it wasn’t really complicated, but I was confused by the various “intricacies”, such as that simply calling Fbo::bindFramebuffer() also marks the framebuffer as dirty (needing multisample resolve). It does that even if you just bind it for reading with GL_READ_FRAMEBUFFER. (Devs: maybe that should be changed, only flagging it as dirty for GL_DRAW_FRAMEBUFFER or GL_FRAMEBUFFER bindings?)
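
To make that suggestion concrete, here is a minimal sketch of the rule I have in mind. It is not Cinder code: the function name bindingNeedsResolve is hypothetical, and only the three numeric values are the standard OpenGL binding-target enums.

```cpp
#include <cstdint>

// Standard OpenGL framebuffer binding targets (values as in the GL headers).
constexpr std::uint32_t kReadFramebuffer = 0x8CA8; // GL_READ_FRAMEBUFFER
constexpr std::uint32_t kDrawFramebuffer = 0x8CA9; // GL_DRAW_FRAMEBUFFER
constexpr std::uint32_t kFramebuffer     = 0x8D40; // GL_FRAMEBUFFER (read + draw)

// Hypothetical policy: a binding only makes a later multisample resolve
// necessary if it allows drawing into the framebuffer.
constexpr bool bindingNeedsResolve(std::uint32_t target) {
    return target == kDrawFramebuffer || target == kFramebuffer;
}
```

With a rule like this, binding an Fbo purely as the source of a blit would leave its dirty state untouched.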

Also, I needed more granular control of the multisample vs resolved “sub” framebuffers within one Cinder Fbo object, in particular to explicitly bind the resolved version. Luckily there is an Fbo::getResolveId() that lets you do everything directly, with raw OGL calls. For a while I was also confused because I thought Fbo::getId() was returning the resolved framebuffer, but of course in the multisample case Fbo::getId() and Fbo::getMultisampleId() both return the same value, namely the multisample framebuffer. (Yes, this is mentioned in the header…I just missed it the first time! :wink: )

Anyhow – it’s very cool, now everything works, double GPU action now, even when multisampling is active.

Thanks,
Glen.


#3

Thanks for sharing your insights, Glen!


#4

Hi there,

I wanted to mention that there’s been a bit of thought around how ci::gl::Fbo could be made more flexible and intuitive, as you can see on this list. I’m not sure if his work in progress addresses your needs, but when the time comes I’m sure it’d be great to get feedback or suggestions on how to make things like what you’re doing easier.

cheers,
Rich


#5

Thanks for pointing me to that. Just in case it’s useful to him, I added a comment there.


#6

Can you share a little code for how to transfer from one GPU to the other?
Thank you


#7

Hi totalgee,
How do I select one of the GPUs to perform a given task?


#8

Hi there. Yes, it’s a bit complicated. I really suggest you look at the description of the NVIDIA OpenGL extension:
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_gpu_multicast.txt

But in quick summary (from what I recall – it’s been a while), to select a GPU (or several) to perform subsequent rendering commands, you would use:

void glRenderGpuMaskNV(bitfield mask);

With (e.g.) 0x1 as mask for the first GPU, 0x2 for the second, 0x3 for both.

And to transfer data (a framebuffer at the end of rendering) from one GPU to the other (e.g. to get the right eye’s rendering back to the left-eye GPU to submit to a VR compositor), you would use:

void glMulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                    int srcX0, int srcY0, int srcX1, int srcY1,
                                    int dstX0, int dstY0, int dstX1, int dstY1,
                                    bitfield mask, enum filter);

Before doing the blit, however, you’d likely need to wait for the other GPU to finish, so you use:

void glMulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

You also need to WaitSync after doing the blit, before submitting to the VR compositor (or using the transferred buffer in some other way).

I also found I needed to have the environment variable GL_NV_GPU_MULTICAST set to 1 before doing any OpenGL stuff, for the extension to even show up as available when queried.

The extension (link shown above) has some example code, but it can be a bit tricky to get it working properly. One potential point of confusion is that some methods take a GPU mask, others take the index (0-based) of the GPU.
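
One way to keep the two apart is to convert explicitly wherever a mask is needed. A small sketch follows; the helper names are mine and are not part of the extension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bitmask selecting a single GPU from its 0-based index,
// e.g. index 0 -> 0x1, index 1 -> 0x2.
inline std::uint32_t gpuMaskFromIndex(unsigned gpuIndex) {
    assert(gpuIndex < 32); // a 32-bit mask can address at most 32 GPUs
    return std::uint32_t{1} << gpuIndex;
}

// Hypothetical helper: bitmask selecting the first gpuCount GPUs,
// e.g. 2 GPUs -> 0x3.
inline std::uint32_t allGpusMask(unsigned gpuCount) {
    assert(gpuCount <= 32);
    return gpuCount == 32 ? 0xFFFFFFFFu
                          : (std::uint32_t{1} << gpuCount) - 1u;
}
```

Calls such as glRenderGpuMaskNV() would then receive gpuMaskFromIndex(i) or allGpusMask(n), while the srcGpu and dstGpu arguments of the blit receive the plain index.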

Obviously, you don’t want to render the same thing on both GPUs. So for example in the stereo rendering case, you want the same geometry and uniforms to be sent to both GPUs, but the viewing/projection matrices to be different. You do this by setting the uniform buffer that contains the camera transform (at least) to have a “per GPU” storage (GL_PER_GPU_STORAGE_BIT_NV or’ed into the flags for glBufferStorage()). And then to set different data to each “version” of the UBO, you’d use the extension method:

void glMulticastBufferSubDataNV(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);
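
As a rough sketch of the stereo case: GPU 0 (mask 0x1) gets the left-eye matrix and GPU 1 (mask 0x2) gets the right-eye matrix, both written to the same offset of the same buffer. The names Mat4, MulticastBufferSubData and uploadEyeMatrices are made up for illustration; the callback only mirrors the parameter order of glMulticastBufferSubDataNV so the sketch compiles without GL headers, and the buffer is assumed to have been created with GL_PER_GPU_STORAGE_BIT_NV as described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

using Mat4 = std::array<float, 16>;

// Hypothetical stand-in with the same parameter order as
// glMulticastBufferSubDataNV(gpuMask, buffer, offset, size, data).
using MulticastBufferSubData =
    std::function<void(std::uint32_t gpuMask, std::uint32_t buffer,
                       std::ptrdiff_t offset, std::ptrdiff_t size,
                       const void* data)>;

// Writes a different view-projection matrix into each GPU's copy of the UBO.
inline void uploadEyeMatrices(const MulticastBufferSubData& upload,
                              std::uint32_t cameraUbo,
                              const Mat4& leftViewProj,
                              const Mat4& rightViewProj) {
    const auto size = static_cast<std::ptrdiff_t>(sizeof(Mat4));
    upload(0x1, cameraUbo, 0, size, leftViewProj.data());  // GPU 0: left eye
    upload(0x2, cameraUbo, 0, size, rightViewProj.data()); // GPU 1: right eye
}
```

The shaders and draw calls stay identical on both GPUs; only the contents of the per-GPU buffer differ.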

There is more you can do with it; this is just one use case. Again, look at the extension documentation.

Hope that helps a bit,
Glen.