Optimized Texture Upload

I am using a Texture3D to optimize my application’s drawing speed. Most of my drawing is a single instanced call with rectangles cropped from that texture.

I now would like to optimize updating the content in that texture. New images are loaded on the fly and slotted into their own z-layer of the 3d texture.

My first and current working approach was just to use Texture3d::update(surface, level) in a separate thread. This avoids blocking the CPU with the initial image load/transfer, which is great, but can introduce some stuttering due to (I believe) the GPU’s implicit synchronization behavior.

This update method is simple and easy to understand:

_texture->update(*surface, index);

I would like to move to a faster approach and am trying (at the recommendation of Ryan Bartley) to use a PBO and a glFenceSync object so I can upload the data without hitting the GPU’s implicit synchronization. I have successfully written code that uploads to a PBO and uses a client fence to signal when that upload is complete (and there are no hiccups caused by this).

This upload happens in code that is essentially as follows:

auto *pixels = static_cast<ColorA8u*>(_pixel_buffer->mapBufferRange(offset, imageByteSize, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT));
// might need to correct for channel order (BGRA?)
std::memcpy(pixels, surface->getData(), surface->getRowBytes() * surface->getHeight());

_pixel_buffer->unmap();
// create a fence so CPU doesn't proceed on this thread until the mapped buffer writing is complete
auto fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
auto result = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
auto ready = [&result] {
    return (result == GL_ALREADY_SIGNALED) || (result == GL_CONDITION_SATISFIED);
};

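while (!ready()) {
    if (result == GL_WAIT_FAILED) {
        auto err = gl::getError();
        CI_LOG_E("GL error waiting on fence: " << gl::getErrorString(err));
        break; // a failed wait will not recover, so stop polling instead of spinning forever
    }
    result = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
}
glDeleteSync(fence); // release the sync object once it is no longer needed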

That copies the data over to the GPU and seems to be working fine (though I have no way to look at the result). Unfortunately, I am unsure how to make the Texture3d’s data correspond to that in the Pbo.
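One detail in the upload code above is that the buffer range is mapped with imageByteSize while the memcpy length is getRowBytes() * getHeight(). Those two values only agree when the surface rows carry no padding. The following is a minimal sketch of a row-by-row copy that packs rows tightly into the mapped range. It assumes 4 bytes per pixel (RGBA8, as in the thread); copyRowsTight and its parameters are hypothetical names, not part of Cinder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies `height` rows of `width` RGBA8 pixels from a source whose rows start
// `srcRowBytes` apart into a tightly packed destination (for example a mapped
// PBO range). Returns the number of bytes written to `dst`.
inline std::size_t copyRowsTight(std::uint8_t* dst, const std::uint8_t* src,
                                 std::size_t width, std::size_t height,
                                 std::size_t srcRowBytes)
{
    const std::size_t tightRowBytes = width * 4; // assumption: 4 bytes per pixel
    assert(srcRowBytes >= tightRowBytes);

    if (srcRowBytes == tightRowBytes) {
        // No padding between rows, so a single memcpy covers the whole image.
        std::memcpy(dst, src, tightRowBytes * height);
    } else {
        // Padded rows: copy only the pixel bytes of each row and skip the padding.
        for (std::size_t y = 0; y < height; ++y) {
            std::memcpy(dst + y * tightRowBytes, src + y * srcRowBytes, tightRowBytes);
        }
    }
    return tightRowBytes * height;
}
```

The returned byte count is what the mapped range needs to hold. An alternative that keeps the padded layout would be to describe the stride to OpenGL with glPixelStorei(GL_UNPACK_ROW_LENGTH, ...) before the texture copy; which is preferable likely depends on how the source images are laid out.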

To set the texture data, I tried passing my Pbo to Texture::Format::setIntermediatePbo(), but that only seems to be (optionally) used at initial texture construction time. There is no Texture3d::update(PboRef) method to try like there is for Texture2D. Furthermore, I think it would be ideal if I could just write directly to the texture’s pixel data without going through another copy step on the GPU. That would both avoid doubling the amount of data on the GPU and adding more hidden synchronization points.

Is there some way I can just get the buffer of data used by my Texture3d, then map and write directly to it? Do I need a Pbo for that operation? If so, how might I tell the texture to only update from a portion of the Pbo? And (fingers crossed) will that texture update cause the initial synchronization hiccups that this whole thing is meant to avoid?

I have treated Vbos as 1-dimensional textures in the past, which gives me hope that I can do something similar for this data upload. I’m mostly not sure how to specify the buffers at the moment.

Looking at Cinder’s source, I found that Texture2d::update(Pbo) binds the Pbo and uses glTexSubImage* to copy the data from the Pbo to the texture. I’m going to try that approach with the raw OpenGL commands for my 3d texture and see how it goes.

Copying over via glTexSubImage3D makes the images appear correctly. Unfortunately, it also reintroduces stuttering, as feared. So I’m no better off than the _texture->update(surface, index) at this point, with significant additional user code. It would be great to have some non-synchronized method to update the texture’s data, but I’m not sure what it would be.

To copy the data, we use glTexSubImage3D:

{ // copy into texture (unfortunately, this reintroduces the stuttering)
    gl::ScopedBuffer buffer_scope(_pixel_buffer);
    gl::ScopedTextureBind texture_scope(_texture);
    glTexSubImage3D(_texture->getTarget(), 0, 0, 0, index, _texture->getWidth(), _texture->getHeight(), 1, GL_RGBA, GL_UNSIGNED_BYTE, (GLvoid*)offset);
}

Bummed that this just reintroduces the stutter. Any alternative ways to update the texture buffer really appreciated. I suppose I could try having two textures, but would need to eventually upload all images to both textures, so that would add quite a bit of complexity (and memory usage).

I’ve never async uploaded to a 3D texture before, but it seems to me that even though you know you aren’t uploading to a part of the texture that is currently in use, the driver doesn’t understand that. Therefore, it’s disregarding your unsynchronized call. Specifically, that enum is just a hint. Basically, it seems that the Texture3d is being treated as a whole resource rather than as parts. How much of the total texture memory are you using at any one time? There are some Apple-specific APIs that also allow you to flush only a specific range of a buffer after a map, which could also help in letting the driver know to get out of your way. I’m not at my computer right now but I’ll be back soon and can link to it.

Hello,

I don’t have experience with uploading to a 3D texture either, but keep in mind that with PBOs you can achieve asynchronous uploads only on the CPU side. The GPU side is controlled by the driver. PBOs basically enable fast DMA copy operations on GPU memory, in your case unpacking pixel data from the PBO to the currently bound texture. You fill the buffer with data, but still have to make sure that data gets into your texture. For this, you use the glTexSubImage* family of functions, with the last parameter set to 0 (while a PBO is bound, it is treated as a byte offset into that buffer rather than as a pointer). OpenGL will then copy the pixel data from the currently bound buffer (PBO) and not from client memory. This function returns immediately and leaves the actual copying to the driver.
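To make the role of that last parameter concrete, here is a small sketch of how a byte offset for one slice could be computed and passed through the pointer-typed argument. It assumes tightly packed RGBA8 slices stored back to back in the PBO; sliceByteOffset and asBufferOffset are hypothetical helper names, not OpenGL or Cinder functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset of slice `index` inside a PBO that stores tightly packed
// RGBA8 slices one after another (assumption: 4 bytes per pixel).
inline std::size_t sliceByteOffset(std::size_t index, std::size_t width, std::size_t height)
{
    assert(width > 0 && height > 0);
    return index * width * height * 4;
}

// While a buffer is bound to GL_PIXEL_UNPACK_BUFFER, the final argument of
// glTexSubImage* carries a byte offset into that buffer, not a client pointer.
// This converts the offset into the pointer type the function signature expects.
inline const void* asBufferOffset(std::size_t byteOffset)
{
    return reinterpret_cast<const void*>(static_cast<std::uintptr_t>(byteOffset));
}

// Intended use (needs a GL context, shown only as a comment):
//   glTexSubImage3D(target, 0, 0, 0, index, w, h, 1, GL_RGBA, GL_UNSIGNED_BYTE,
//                   asBufferOffset(sliceByteOffset(index, w, h)));
```

With a PBO that holds only a single slice, the offset is simply 0, which matches the "last parameter = 0" form described above.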

The following is an example of double-buffered upload to textures using PBOs. Note the use of glBufferData() to avoid stalling while glMapBuffer() waits to return a valid pointer.

// be sure to unbind any unpack buffer before start          
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

glBindTexture(GL_TEXTURE_2D, gl_tex);

if (pbo_supported)
{
static int tex_fill_index = 0;
int pbo_fill_index = 0;

// tex_fill_index is used to copy pixels from a PBO to a texture object
// pbo_fill_index is used to update pixels in a PBO
tex_fill_index = (tex_fill_index + 1) % 2;
pbo_fill_index = (tex_fill_index + 1) % 2;

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboIds[tex_fill_index]);

// don't use pointer for uploading data (last parameter = 0), data will come from bound PBO
// the 8th argument is the pixel data type, not a byte count; GL_UNSIGNED_BYTE is assumed here
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, player->getWidth(), player->getHeight(), gl_format, GL_UNSIGNED_BYTE, 0);

// bind PBO to update pixel values
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboIds[pbo_fill_index]);

// map the buffer object into client's memory
// Note that glMapBuffer() causes sync issue.
// If GPU is working with this buffer, glMapBuffer() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferData() with NULL pointer before glMapBuffer().
// If you do that, the previous data in PBO will be discarded and
// glMapBuffer() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
// http://www.songho.ca/opengl/gl_pbo.html
glBufferData(GL_PIXEL_UNPACK_BUFFER, player->getBytesPerFrame(), 0, GL_STREAM_DRAW);

// map pointer for memcpy from our frame buffer
GLubyte* ptr = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
if (ptr)
{
	memcpy(ptr, player->getBufferPtr(), player->getBytesPerFrame());
	glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
// when PBOs are not supported, fall back to traditional texture upload
else
{
    // the 8th argument is the pixel data type, not a byte count; GL_UNSIGNED_BYTE is assumed here
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, player->getWidth(), player->getHeight(), gl_format, GL_UNSIGNED_BYTE, player->getBufferPtr());
}

glBindTexture(GL_TEXTURE_2D, 0);
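The index rotation in the example above can be looked at in isolation. This is a minimal sketch of the same two-buffer ping-pong with no OpenGL calls, so the swap logic can be checked on its own; PingPong is a hypothetical name.

```cpp
#include <cassert>

// Tracks which of two PBOs feeds the texture copy and which one the CPU fills.
struct PingPong {
    int texIndex = 0; // PBO the texture copy reads from this frame
    int pboIndex = 1; // PBO the CPU writes into this frame

    // Swap roles, mirroring the arithmetic used in the example above.
    void advance()
    {
        texIndex = (texIndex + 1) % 2;
        pboIndex = (texIndex + 1) % 2;
        assert(texIndex != pboIndex); // the two roles never share a buffer
    }
};
```

Each frame the buffer that was just filled becomes the source for the texture copy, while the other one is orphaned and refilled.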

Thanks @vjacobs and @ryanbartley for your responses.

Running my upload on a secondary thread already takes care of the CPU-blocking issues (that code is omitted above for brevity and clarity). So the PBO usage is a way to hopefully gain faster copying performance to the actual texture. Unfortunately, the “fast” DMA copy operation from PBO to Texture still isn’t very fast (with 512x512 slices) and introduces stuttering.

Most examples I have seen—including the one you posted from songho.ca—using a PBO assume a single streaming image source (e.g. a video) that can easily use ping-ponged textures for display. In my case, I want to avoid ping-ponging images because my texture is 512x512x512 and duplication of that data would be quite expensive.

That gives me an idea: I will try making a smaller PBO (the size of a single slice) and updating the texture from that. That way, the GPU may not be locking as much memory for the duration of the DMA copy.
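For a sense of scale, the memory figures behind this can be worked out directly. The sketch below assumes 4 bytes per pixel (RGBA8, as in the glTexSubImage3D call earlier in the thread) and ignores mipmaps and any driver-side padding, so the real footprint may differ.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBytesPerPixel = 4;          // assumption: RGBA8
constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Bytes needed for one w x h slice.
constexpr std::uint64_t sliceBytes(std::uint64_t w, std::uint64_t h)
{
    return w * h * kBytesPerPixel;
}

// Bytes needed for a full w x h x d volume.
constexpr std::uint64_t volumeBytes(std::uint64_t w, std::uint64_t h, std::uint64_t d)
{
    return sliceBytes(w, h) * d;
}

static_assert(sliceBytes(512, 512) == 1 * kMiB, "one 512x512 slice is 1 MiB");
static_assert(volumeBytes(512, 512, 512) == 512 * kMiB, "the full texture is 512 MiB");
static_assert(2 * volumeBytes(512, 512, 512) == 1024 * kMiB, "a ping-ponged pair would be 1 GiB");
```

Under those assumptions a slice-sized PBO stages 1 MiB at a time, whereas a second full-size texture would add another 512 MiB.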

Present status:

  • Upload working
  • glTexSubImage* is too slow; introduces stuttering
  • Using a 512x512x1 PBO (instead of 512^3) for buffering doesn’t reduce stuttering

UPDATE:
Using a smaller PBO does not help, though it still works as before.

So the problem remains as before: “fast” DMA copying on the GPU is still slow and is probably introducing implicit synchronization to the program which is causing stuttering.

Is there a method short of double-buffering a texture to avoid paying that synchronization price? (I haven’t yet tested whether double-buffering solves the issue, either.)

For the record, I am potentially using the texture for rendering at the same time I’m updating it, but I’m only ever using memory that isn’t in the process of being updated (I have a marker for “clean” regions that can be drawn and only use those). I agree, Ryan, that it seems like the GPU wants to treat the texture monolithically and won’t let me do things that I know are okay.

Here’s the article I was talking about. How many of the textures would you be using in 1 draw call?

Also, are you mapping the buffer on a different thread, through a shared context?

Hey Ryan. I have yet to read that article, but yes, I am mapping the buffer on a different thread. Here’s the setup:

Main Thread:
Uses Texture3D in instanced draw calls for 10 to 1000 elements at a time.

Secondary Thread:
Shared OpenGL Context
Updates Texture3D

I have some atomic data members I am using to manage what z-indices of the texture are available to draw and what their dimensions are. All that stuff is working well.
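The actual bookkeeping code is not shown in the thread, so the following is only a sketch of one way such per-layer atomic markers could look. LayerTable, LayerState and the method names are hypothetical. Note that the atomics only order the CPU side: finishUpload would presumably be called only after the GL fence for that layer's copy has signaled.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class LayerState : std::uint8_t { Free = 0, Uploading = 1, Ready = 2 };

// One atomic state per z-layer, shared between the upload and draw threads.
template <std::size_t N>
class LayerTable {
public:
    // Upload thread: claim a free layer. Fails if it is in use or being written.
    bool beginUpload(std::size_t layer)
    {
        LayerState expected = LayerState::Free;
        return _states[layer].compare_exchange_strong(
            expected, LayerState::Uploading,
            std::memory_order_acquire, std::memory_order_relaxed);
    }

    // Upload thread: publish the layer once its data is known to be complete.
    void finishUpload(std::size_t layer)
    {
        assert(_states[layer].load(std::memory_order_relaxed) == LayerState::Uploading);
        _states[layer].store(LayerState::Ready, std::memory_order_release);
    }

    // Draw thread: only layers in the Ready state are safe to sample.
    bool isDrawable(std::size_t layer) const
    {
        return _states[layer].load(std::memory_order_acquire) == LayerState::Ready;
    }

    // Draw thread: hand a layer back so it can be reused for a new image.
    bool release(std::size_t layer)
    {
        LayerState expected = LayerState::Ready;
        return _states[layer].compare_exchange_strong(
            expected, LayerState::Free,
            std::memory_order_acq_rel, std::memory_order_relaxed);
    }

private:
    std::array<std::atomic<LayerState>, N> _states{}; // value-initialized: every layer starts Free
};
```

A scheme like this keeps the draw thread away from layers that are mid-upload, but it cannot tell the driver that the accesses are disjoint, which is consistent with the stuttering described above.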

There is a little more complication, of course, but the only parts touching the texture are those two threads. I suspect OpenGL just won’t let the Update/Draw conflict with each other through coarse means (even though I know they don’t through fine control).

The note about using TEXTURE_RECTANGLE being required for DMA in there is surprising. Perhaps it isn’t possible to get DMA support for TEXTURE_2D_ARRAY. That goes at least part way to explain why the glTexSubImage remains slow with a PBO.

I’m guessing this kind of stuff is where Vulkan will start coming in handy/replacing OpenGL. It’s a little frustrating having the GPU bog things down when you know you are safely writing and using your data.