Asynchronous multi-threaded PBO texture loading memory usage


#1

Hi,

I’m writing a texture loader application and have hit a bit of a brick wall. This is possibly linked to my other question:

Actual GPU memory usage vs glDeleteXXXX?

But I’m using a PBO now, so I’m thinking it warrants a question in itself.

After further investigation (i.e. giving up :)), I went back to a single-threaded loader where I could load and unload images according to the camera frustum, and I could observe the allocated video memory increase and decrease accordingly. For example, an image of 959x1440 took about 13MB of GPU memory, so memory usage started at 30MB with nothing in view (presumably some system memory), went to 43MB when the image was loaded and in view, and dropped back to around 30MB when it was unloaded and out of view again. This memory behavior was like clockwork.

This seemed fine and gave me a good reference, but since it was single-threaded it suffered, as expected, from unacceptable stuttering, so I tried various methods of moving the texture loading onto a second thread. All of them were based on examples I found (mainly from paul.houx - thanks Paul) and my reading on GL contexts etc.

All my attempts work from a visual point of view, but I cannot satisfy myself that they work from a memory usage point of view. I’m not sure whether I am right about being wrong, or I just have a wrong understanding. Either way – there are way too many wrongs in there! :slight_smile:

From what I gather the best way to do this is to use a PBO in a background thread, so I implemented that as per below pseudo code:

void setup()
{
    mTextureLoaderRequests = new ConcurrentCircularBuffer<TextureLoadRequest*>(TEXTURE_LOADER_REQUEST_Q_SIZE);
    mTextureLoaderBackgroundCtx = gl::Context::create(gl::context());
    mPBO = gl::Pbo::create(GL_PIXEL_UNPACK_BUFFER, 5000 * 5000 * 4, nullptr, GL_STATIC_DRAW);
    mTextureLoaderThread = make_shared<thread>(bind(&imageApp::textureLoaderThreadFn, this, mTextureLoaderBackgroundCtx));
}

void textureLoaderThreadFn(gl::ContextRef context)
{
    ci::ThreadSetup threadSetup;
    context->makeCurrent();

    while (true)
    {
        TextureLoadRequest *textureLoaderRequest = nullptr;
        while (mTextureLoaderRequests->tryPopBack(&textureLoaderRequest))
        {
            // Load the image into CPU memory.
            auto surface = loadImage(loadFile(textureLoaderRequest->mFilename));

            // Upload to the GPU using the Pbo.
            auto fmt = gl::Texture2d::Format().intermediatePbo(mPBO);
            auto texture = gl::Texture2d::create(surface, fmt); // TODO: check size of PBO >= size of texture

            // Create a fence so we know when the upload has finished.
            auto fence = gl::Sync::create();

            // Read that this may be needed - TODO: is it?
            glFlush();

            // Now busy-wait on the fence.
            while (true)
            {
                auto status = fence->clientWaitSync(GL_SYNC_FLUSH_COMMANDS_BIT, 0L);
                if (status == GL_CONDITION_SATISFIED || status == GL_ALREADY_SIGNALED)
                    break;
            }

            // Assign the texture so the main thread can access it.
            {
                std::lock_guard<std::mutex> lk(mImageMutex);
                mImage = texture;
            }
        }
    }
}

void draw()
{
    std::lock_guard<std::mutex> lk(mImageMutex);
    gl::draw(mImage /* ... */);
}

Visually, when I run this, the application feels super smooth and things load very fast; however, the memory usage does not behave as in the single-threaded version.

Multithreaded PBO vs Single Thread Memory Usage

If I initially run the application with 0 images loaded, then bring 1 image into view, record the memory usage, bring it out again and repeat, I get the following memory usage (for both the single-threaded and multi-threaded PBO versions):

Multithreaded PBO OpenGL Contexts

For the multi-threaded PBO case, in CodeXL (AMD’s GPU debugging tool, which integrates with Visual Studio), I can also see the following structures for the GL contexts:

So starting with nothing in view:

0 allocations – 0 images loaded

Then bringing 1 image into view:

1 allocations – 1 images loaded

If I then move the image out of view again, so that the image is unloaded, the GL context tree looks as per 0 allocations – 0 images loaded.

I’ve set breakpoints on all Texture creations, and it is not created by my application, so Texture 1 (512x512, as seen in “0 allocations – 0 images loaded”) is being created by something, but I have no idea what.

Singlethreaded PBO OpenGL Contexts

For the single-threaded case, CodeXL shows the following structures for the GL contexts:

So starting with nothing in view:

0 allocations – 0 images loaded

Much cleaner and as expected: no textures are present.

1 allocations – 1 images loaded

As expected, 1 texture is present.

Conclusion

I can see in CodeXL that the PBO is 97MB in size (roughly matching the 5000 * 5000 * 4 byte allocation), so I would expect a larger memory usage, but I cannot make sense of what I am seeing. It seems the graphics driver is freeing GPU memory, but not in the amounts expected, and is allocating more than expected. Does my usage of PBO/multi-threaded texture loading look OK? Am I right to use the single-threaded version as a reference with regard to memory usage?

I’ve also tried not using a PBO, but the memory misbehavior seems identical.

Appreciate any help or pointers,
thanks! - Laythe


#2

Hi,

it’s cool to see how far you went down the rabbit hole. Looking at your (pseudo-)code and memory readings, here are a few things that came to mind:

  • You’re creating the Pbo on the main thread, using the main context. I’d advise creating it on the loader thread after making the shared context current. That way you can be sure it’s available.
  • It seems more than one Pbo is created, judging by the jumps in used memory of roughly 100MB each. It could be that the driver is creating copies for you (a.k.a. buffer renaming) because the Pbo is still in use somehow. This could be caused by checking the fence on the loader thread instead of on the main thread. There needs to be a bit of time between creating the fence and checking it, which is why in my own code I let the main thread check the fence. However, I’m not sure if this is much different from your code, as I have never debugged this matter as extensively as you have.
  • You’re not checking all fence conditions. The wait might fail or time out, and you should handle those situations.

-Paul

Edit: you may indeed want to add Pbo support only after making sure your threaded loading actually behaves like it should. The Pbo is only a way to optimize loading even more.


#3

Hi Paul,

OK, I will move the PBO to the loader thread. I’m also going to try to see what the effects of multiple threads (contexts) are by having multiple texture loaders running at the same time.

It sounds like I should try it the other way round, i.e. your method, but I’m not sure I understand what you mean.

At the moment,

Worker thread:

while(1) 
{ 
    texture = load image
    set fence
    wait for clientWaitSync to return signalled
    take image lock
    assign image = texture
}

Main thread:

while(1) 
{ 
    do normal cinder stuff; 
    take image lock
    if (image!= nullptr)
        gl::draw(image...)
}

How could the above be adapted to reflect what you suggest - do you mean something like :

Worker thread:

while(1) 
{ 
    texture = load image
    set fence
    take image lock
    assign image = texture
}

Main thread:

while(1) 
{ 
    do normal cinder stuff; 
    take image lock
    if (clientWaitSync == signalled &&
        image != nullptr)
    {
        gl::draw(image...)
    }
}

I came across glWaitSync but I don’t fully understand its use yet, and I’m currently trying to get my head around the syncing - is it of any use to me here, maybe?

Thanks - Laythe


#4

I mean something like:

Loader thread:

{
    make context current for this loader thread
    create Pbo
    while running {
        pop path from queue
        load texture (using Pbo is optional)
        set fence
        pass structure containing both fence and texture to main thread
    }
}

Main thread:

{
    pop structure containing both fence and texture
    if available {
        check fence status
        if failed or expired, queue texture for loading again
        if succeeded and signaled, add texture to available textures
    }

    draw available texture(s)
}

A fence is basically a little flag that you create and then insert into the stream of commands sent to the GPU. The GPU performs all commands in sequence, and when it encounters the fence it updates the fence’s state: it sets a value that tells you whether all commands in front of the fence have been fully processed. If they have, we know the texture has been uploaded and we can start using it. For more detailed information, see this little tutorial.

-Paul

P.S.: locks (mutexes) are relatively slow, so try to avoid them where possible. In your case, you only really need mutexes when accessing the queues that you use to pass data from one thread to the other. The queues could be simple vectors, but in the example above (passing data to the loader thread and vice versa), a more efficient container is the ConcurrentCircularBuffer.


#5

Thanks for the info Paul - I will digest.

Once again, muchos gracias! - Laythe


#6

Hi,

in your code, you do need a mutex to control access to the texture, because it is used directly in both threads. But this means you have to take the lock every time you want to draw it, which slows or even blocks the render thread unnecessarily. So instead, use a (concurrent) container to pass the texture between threads and only take the lock whenever you access the container. That way, you never need to lock the texture when drawing it, only just once when removing it from the container. And when using a ConcurrentCircularBuffer, the lock is handled for you and you don’t have to worry about concurrent access.

To check the fence status, do this on the main thread:

if( mResults.tryPopBack( result ) ) { 
    auto status = result.fence->clientWaitSync( GL_SYNC_FLUSH_COMMANDS_BIT, 0L );
    switch( status ) {
        case GL_CONDITION_SATISFIED:
        case GL_ALREADY_SIGNALED:
            // Add texture to list of available textures. No lock required.
            mTextures.push_back( result.texture );
            break;
        case GL_WAIT_FAILED:
            // Queue texture for loading.
            mQueue.pushFront( result.path );
            break;
        case GL_TIMEOUT_EXPIRED:
            // Retry next frame.
            mResults.pushFront( result );
            break;
    }
}

Here, result is a structure of your own making, and mResults is a ConcurrentCircularBuffer that is fed by the loader thread. mQueue is a ConcurrentCircularBuffer, too, and mTextures is a simple std::vector.

In the NVIDIA sample, they explain that it isn’t enough to use synchronization on the CPU side (in this case using Windows-only events in combination with WaitForSingleObject and SetEvent); you also need to use a GPU fence. But you knew that already.

-Paul


#7

Awesome! I will try this. Thanks - Laythe


#8

Hi.

I’ve implemented your suggestions and also tried doing the wait on the main thread (with and without a PBO). While they all work visually and are smooth, I was still getting my memory problems (GL texture count = 0, but still high video memory used by the process), so I created a small test program to reproduce the high memory use:

Apologies for the longish code dump (and please excuse the code standard - I’ve been iterating on this code a lot). I hope someone spots an obvious boo-boo (the single-threaded version works fine):

#include "cinder/app/App.h"
#include "cinder/app/RendererGl.h"
#include "cinder/gl/gl.h"
#include "cinder/ConcurrentCircularBuffer.h"
#include "cinder/gl/Texture.h"

using namespace ci;
using namespace ci::app;
using namespace std;

#define NUM_TEXTURES        10

// app to loader thread messages
#define REQUEST_LOAD_TEXTURE        0
#define REQUEST_UNLOAD_TEXTURE      1

// loader to app thread messages
#define RESPONSE_LOAD_TEXTURE        0

// app requests texture loader thread either load or unload
class TextureRequest
{
public:
    int mMessageID;
    int mTextureIndex;
    std::string         mFilename;  
};

// texture loader responds to the app with a texture tied to a texture index
class TextureResponse 
{
public:
    int mMessageID;
    int mTextureIndex;
    cinder::gl::TextureRef mTexture;
};

class TextureTestApp : public App
{
public:
    static void prepareSettings(cinder::app::AppMsw::Settings *settings);

    void setup() override;
    void update() override;
    void draw() override;
    void keyDown(KeyEvent event) override;
    void textureLoaderThreadFn(gl::ContextRef context);
    void processTextureLoadRequest(TextureRequest *textureRequest);
    void processTextureUnloadRequest(TextureRequest *textureRequest);
    
    cinder::ConcurrentCircularBuffer<TextureRequest*>  *mTextureRequestMessages;
    cinder::ConcurrentCircularBuffer<TextureResponse*> *mTextureResponseMessages;
    cinder::gl::TextureRef                              mTextures[NUM_TEXTURES];
    std::thread                                        *mThread;
    std::atomic<bool>                                   mThreadShouldQuit;
    cinder::gl::ContextRef                              mThreadCtx;
}; 

void TextureTestApp::prepareSettings(Settings *settings)
{
    settings->setConsoleWindowEnabled();
}

void TextureTestApp::setup()
{
    for (int i = 0; i < NUM_TEXTURES; i++)
    {
        mTextures[i] = nullptr;
    }    
    mTextureRequestMessages = new ConcurrentCircularBuffer<TextureRequest*>(1000);
    mTextureResponseMessages = new ConcurrentCircularBuffer<TextureResponse*>(1000);
    mThreadCtx = gl::Context::create(gl::context());
    mThreadShouldQuit = false;
    console() << "texture loader thread 1 starting..." << std::endl;
    mThread = new thread(bind(&TextureTestApp::textureLoaderThreadFn, this, mThreadCtx));    
}

void TextureTestApp::update()
{    
    // assign available textures
    TextureResponse *textureResponse;
    if (mTextureResponseMessages->tryPopBack(&textureResponse))
    {
        mTextures[textureResponse->mTextureIndex] = textureResponse->mTexture;
        console() << "Main thread received and assigned texture = " << textureResponse->mTextureIndex << std::endl;
        delete textureResponse;
    }
}

void TextureTestApp::keyDown(KeyEvent event)
{
    if (event.getChar() == 'c')
    {
        try
        {
            for (int i = 0; i < NUM_TEXTURES; i++)
            {
                TextureRequest *tr = new TextureRequest();
                tr->mFilename = "C:\\test images\\tt.jpg";
                tr->mMessageID = REQUEST_LOAD_TEXTURE;
                tr->mTextureIndex = i;
                mTextureRequestMessages->pushFront(tr);
                console() << "load requests sent - " << i << std::endl;
            }
        }
        catch (Exception &exc)
        {
            console() << "failed to send create texture requests." << std::string(exc.what());
        }
    }
    if (event.getChar() == 'd')
    {
        try
        {
            for (int i = 0; i < NUM_TEXTURES; i++)
            {
                TextureRequest *tr = new TextureRequest();
                tr->mMessageID = REQUEST_UNLOAD_TEXTURE;
                tr->mTextureIndex = i;
                mTextureRequestMessages->pushFront(tr);
                console() << "unload requests sent - " << i << std::endl;
            }
        }
        catch (Exception &exc)
        {
            console() << "failed to send unload texture requests." << std::string(exc.what());
        }
    }
}

void TextureTestApp::draw()
{
    gl::clear(Color(0.5f, 0.5f, 0.5f));
    gl::enableAlphaBlending();

    for (int i = 0; i < NUM_TEXTURES; i++)
    {
        if (mTextures[i] != nullptr)
        {
            gl::draw(mTextures[i], Rectf(mTextures[i]->getBounds()).getCenteredFit(getWindowBounds(), true).scaledCentered(0.85f));
        }
    }
}

//-----------------------------------------------------------------------------------------
// TEXTURE THREAD 1
void TextureTestApp::textureLoaderThreadFn(gl::ContextRef context)
{
    TextureRequest *textureRequest = nullptr;
    ci::ThreadSetup threadSetup;
    context->makeCurrent();    
    while (!mThreadShouldQuit)
    {
        while (mTextureRequestMessages->tryPopBack(&textureRequest))
        {
            if (mThreadShouldQuit)
                break;
     
            if (textureRequest->mMessageID == REQUEST_LOAD_TEXTURE)
            {
                processTextureLoadRequest(textureRequest);
            }
            else if (textureRequest->mMessageID == REQUEST_UNLOAD_TEXTURE)
            {
                processTextureUnloadRequest(textureRequest);
            }
            delete textureRequest;
        }
    }
}

void TextureTestApp::processTextureLoadRequest(TextureRequest *textureRequest)
{
    try
    {
        // load texture
        cinder::gl::TextureRef texture = cinder::gl::Texture::create(loadImage(textureRequest->mFilename));
    
        // Create a fence so we know when the upload has finished.
        auto fence = gl::Sync::create();
        
        // Now check the fence.
        bool waiting = true;
        while (waiting)
        {
            auto status = fence->clientWaitSync(GL_SYNC_FLUSH_COMMANDS_BIT, 0L);
            switch (status)
            {
                case GL_CONDITION_SATISFIED:
                case GL_ALREADY_SIGNALED:
                {
                    // OK to continue - got texture.
                    waiting = false;
                    break;
                }
                case GL_WAIT_FAILED:
                {
                    // TODO: reschedule the image - add a retry count to the request and retry a few times before giving up.
                    console() << "THREAD: ERROR -> clientWaitSync = GL_WAIT_FAILED" << std::endl;
                    waiting = false;
                    break;
                }
                case GL_TIMEOUT_EXPIRED:
                {
                    // Not signalled yet - wait again.
                    break;
                }
            }
        }

        // Pass structure containing both fence and texture to main thread
        TextureResponse *tr = new TextureResponse();
        tr->mMessageID = RESPONSE_LOAD_TEXTURE;
        tr->mTextureIndex = textureRequest->mTextureIndex; // tie request texture id to response texture id
        tr->mTexture = texture;
        mTextureResponseMessages->pushFront(tr);
        console() << "created texture index " << tr->mTextureIndex << std::endl;
    }
    catch (Exception &exc)
    {
        console() << "failed to create texture." << std::string(exc.what());
    }
}

void TextureTestApp::processTextureUnloadRequest(TextureRequest *textureRequest)
{
    try
    {
        mTextures[textureRequest->mTextureIndex].reset();
        mTextures[textureRequest->mTextureIndex] = nullptr;
        console() << "deleted texture index " << std::to_string(textureRequest->mTextureIndex) << std::endl;        
    }
    catch (Exception &exc)
    {
        console() << "failed to delete texture." << std::string(exc.what());
    }
}

CINDER_APP(TextureTestApp, RendererGl(RendererGl::Options().msaa(4)), &TextureTestApp::prepareSettings)

All the app does is load 10 images when you press ‘C’ and remove them when you press ‘D’. I’ve tried a single-threaded version of this and it is fine, with memory being cleared accordingly, but the moment I involve multiple contexts/threads I seem to get these memory issues.

I wondered whether I need to protect the texture that the main thread is drawing from being unloaded, so to rule that out I changed the code so that the main thread does the delete, but the memory issues still remain.

Appreciate any help
Thanks - Laythe


#9

I modified the above program such that pressing ‘c’ loads the images and ‘d’ unloads them and also destroys the context. Once the shared context is destroyed, I see the GPU memory go back down, i.e.:

auto sharedContextPlatformData = dynamic_pointer_cast<cinder::gl::PlatformDataMsw> (context->getPlatformData());
::wglDeleteContext(sharedContextPlatformData->mGlrc);

Therefore I can only deduce that the context is not releasing the memory. Since my GL texture count is 0, it follows that this must be happening after the glDeleteTexture calls. But then I am at a loss as to what circumstances could arise in which a Cinder GL texture count of 0 does not actually reflect released GPU memory.

Cheers,
Laythe


#10

Hi.
I’ve narrowed the problem down to the texture unloads occurring on the worker thread. If I delete the texture from the main thread instead, the memory gets deallocated properly. Deleting there is cheaper than loading/creating the texture there, but I imagine it would still cause stutter on the main thread if there is a large amount of texture data.

I am now thinking the failure to actually release the texture memory may be because the render thread is drawing the texture at the same time. In that case I would have expected glGetError() to report errors, but it doesn’t - very puzzling - however it is the only explanation that sort of makes sense. I’m not sure how to go about handling that yet - is there an example of where this kind of thing is done?

I have also read titbits saying I need to rebind textures when changing OpenGL object state. Does this apply here, or does Cinder do this under the hood?

Cheers - Laythe


#11

Just to follow up: it seems the cost of calling glDeleteTextures from the main thread is negligible (on the CPU side), resulting in no noticeable stuttering even with large amounts of texture data. I guess an asynchronous API is a double-edged sword after all.

I managed to find another PC to test on, and the memory behavior is still incorrect when unloading on the worker thread. This confirms my belief that my syncing for the texture unload must be bad. Room for improvement, I suppose, which I can get to after a break from multi-threaded OpenGL resource management :slight_smile:

Thanks - Laythe


#12

Hey, I’d love to look at your code but haven’t found the time yet. Maybe in a few days.