Question

Differences between Microsoft and Nvidia examples of DirectX12 CPU/GPU synchronization

I'm trying to understand the CPU/GPU synchronization in DirectX 12, but there are some things that confuse me. Here is the sample code from Microsoft's HelloFrameBuffering example:

// Prepare to render the next frame.
void D3D12HelloFrameBuffering::MoveToNextFrame()
{
    // Schedule a Signal command in the queue.
    const UINT64 currentFenceValue = m_fenceValues[m_frameIndex];
    ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), currentFenceValue));

    // Update the frame index.
    m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();

    // If the next frame is not ready to be rendered yet, wait until it is ready.
    if (m_fence->GetCompletedValue() < m_fenceValues[m_frameIndex])
    {
        ThrowIfFailed(m_fence->SetEventOnCompletion(m_fenceValues[m_frameIndex], m_fenceEvent));
        WaitForSingleObjectEx(m_fenceEvent, INFINITE, FALSE);
    }

    // Set the fence value for the next frame.
    m_fenceValues[m_frameIndex] = currentFenceValue + 1;
}

My question is, why do we update the m_frameIndex before checking if the fence has reached the expected fence value? This means we use the fence value of a different framebuffer, which is not the same value we used in the Signal() call. This seems a bit strange to me.

I also checked out Nvidia's sample code:

struct FrameContext
{
    ComPtr<ID3D12CommandAllocator> m_allocator;
    ComPtr<ID3D12CommandAllocator> m_computeAllocator;
    ComPtr<ID3D12Fence>            m_fence;
    uint64_t                       m_fenceValue = 0;
};

void DeviceResources::MoveToNextFrame()
{
    FrameContext* ctx = &m_frameContext[m_frameIndex];
    DX::ThrowIfFailed(m_commandQueue->Signal(ctx->m_fence.Get(), ctx->m_fenceValue));
    m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
    if (ctx->m_fence->GetCompletedValue() < ctx->m_fenceValue)
    {
        DX::ThrowIfFailed(ctx->m_fence->SetEventOnCompletion(ctx->m_fenceValue, m_fenceEvent.Get()));
        WaitForSingleObjectEx(m_fenceEvent.Get(), INFINITE, false);
    }
    ctx->m_fenceValue++;
}

As we can see, they use the same fence value for the Signal() call and for the comparison with GetCompletedValue(). Could someone help me understand the pros and cons of these two approaches?


1 Jan 1970

Solution

D3D12HelloFrameBuffering::MoveToNextFrame confused me too. I came up with my own scheme for tracking frames.

The trick is to treat it like a ring (circular) buffer problem. You work in the natural numbers and then project them modulo FrameCount (i.e. when you index into your ring buffer you use buffer[index % FrameCount], but you only ever increment with index += 1).

You have two indices, but I use sentinels, i.e. indices that point one past the end.

sentinel_gpu_completed which you get from ID3D12Fence::GetCompletedValue().

sentinel_cpu_record which you manage yourself.

The key is to ensure that sentinel_cpu_record - sentinel_gpu_completed < FrameCount

for(;;)
{
    // start recording a frame
    auto current_record_index = sentinel_cpu_record.fetch_add(1);
    // if gpu lags too many frames wait
    auto sentinel_gpu_completed = m_fence->GetCompletedValue();
    if (not (sentinel_cpu_record - sentinel_gpu_completed < FrameCount))
    {
        winrt::check_hresult(m_fence->SetEventOnCompletion(sentinel_cpu_record - (FrameCount - 1), m_fenceEvent.get()));
        WaitForSingleObject(m_fenceEvent.get(), INFINITE);
    }
    // after waiting this Command Allocator is safe to use
    auto commandAllocator = m_commandAllocators[current_record_index % FrameCount].get();
    // submit render commands for frame
}

I should add that this is probably not safe if the counters overflow. I'm not sure about the other approaches.
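To make the overflow concern concrete, here is a minimal sketch in plain C++ (no D3D12 involved). The helper names frames_in_flight, slot and slot_sequence_survives_wrap are hypothetical and only illustrate the arithmetic. It assumes the counters are unsigned 64-bit integers, as fence values are. The subtraction itself survives a wrap, but index % FrameCount only continues smoothly across a wrap when FrameCount is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: frames recorded by the CPU but not yet completed by the GPU.
// Unsigned subtraction wraps modulo 2^64, so the difference stays correct even
// if cpu_record has wrapped past zero and gpu_completed has not yet.
inline std::uint64_t frames_in_flight(std::uint64_t cpu_record, std::uint64_t gpu_completed)
{
    return cpu_record - gpu_completed;
}

// Hypothetical helper: ring buffer slot for a monotonically increasing index.
inline std::uint64_t slot(std::uint64_t index, std::uint64_t frame_count)
{
    assert(frame_count > 0);
    return index % frame_count;
}

// True if the slot sequence continues without a jump when the index wraps
// from 2^64 - 1 back to 0. This holds only when frame_count divides 2^64,
// i.e. when frame_count is a power of two.
inline bool slot_sequence_survives_wrap(std::uint64_t frame_count)
{
    const std::uint64_t last = UINT64_MAX;
    const std::uint64_t wrapped = last + 1; // well-defined unsigned wrap to 0
    return (slot(last, frame_count) + 1) % frame_count == slot(wrapped, frame_count);
}
```

For example, slot_sequence_survives_wrap(2) is true while slot_sequence_survives_wrap(3) is false. As far as I know, the fence itself compares values by magnitude, so a wrapped value would still break the wait. In practice a 64-bit counter incremented once per frame would take hundreds of millions of years to wrap at 1000 frames per second, so this is mostly a theoretical concern.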

2024-07-10

Tom Huntington

Solution

First of all, I want to thank you guys for answering my question! After reading the DX12 documentation and searching for more information, I think I now understand how Microsoft's sample works.

Update Frame Index Before Fence Value Comparison

Microsoft's code updates the frame index before the fence value comparison because it only needs to make sure that rendering to the framebuffer at the new index (the one we are about to render to) has finished. Microsoft uses an array to store a fence value for each framebuffer, so we know which fence value was used in the Signal() call for a specific framebuffer.

Therefore, Microsoft's sample code is more efficient than Nvidia's. Nvidia's code, as quoted, ensures that rendering on the current framebuffer (f) is finished before the CPU moves on to the PopulateCommandList() call for the next framebuffer (f+1), so the CPU cannot record a new frame while the GPU is still rendering the previous one.
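The difference can be traced without a GPU. The following is a rough sketch in plain C++ that replays only the bookkeeping of the two MoveToNextFrame() versions quoted in the question. The names Step, microsoft_style and nvidia_style_as_quoted are hypothetical. It assumes that the back buffer index advances round-robin and that the Microsoft sample starts with the current frame's fence value at 1 and the others at 0; neither assumption is visible in the quoted code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One CPU iteration: the value passed to Signal() and the value the CPU
// then waits for before it records the next frame.
struct Step {
    std::uint64_t signaled;
    std::uint64_t waited_for;
};

// Microsoft-style bookkeeping: one fence, one stored value per back buffer.
// Assumed: initial values {1, 0, ...} and a round-robin back buffer index.
std::vector<Step> microsoft_style(unsigned frame_count, unsigned frames)
{
    assert(frame_count > 0);
    std::vector<std::uint64_t> fence_values(frame_count, 0);
    unsigned frame_index = 0;
    fence_values[frame_index] = 1;

    std::vector<Step> steps;
    for (unsigned i = 0; i < frames; ++i) {
        const std::uint64_t current = fence_values[frame_index]; // Signal(current)
        frame_index = (frame_index + 1) % frame_count;           // next back buffer
        steps.push_back({current, fence_values[frame_index]});   // wait target
        fence_values[frame_index] = current + 1;                 // value for next frame
    }
    return steps;
}

// Nvidia-style bookkeeping as quoted: ctx is captured before the index
// update, so the wait uses the same context and value as the Signal().
std::vector<Step> nvidia_style_as_quoted(unsigned frame_count, unsigned frames)
{
    assert(frame_count > 0);
    std::vector<std::uint64_t> ctx_values(frame_count, 0);
    unsigned frame_index = 0;

    std::vector<Step> steps;
    for (unsigned i = 0; i < frames; ++i) {
        std::uint64_t& value = ctx_values[frame_index]; // ctx->m_fenceValue
        frame_index = (frame_index + 1) % frame_count;
        steps.push_back({value, value});                // signal and wait match
        ++value;
    }
    return steps;
}
```

With frame_count = 2, microsoft_style signals 1, 2, 3, 4 while waiting for 0, 1, 2, 3, so the CPU may run one frame ahead of the GPU. In nvidia_style_as_quoted the wait target always equals the value just signaled, so the CPU waits for the frame it has just submitted.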

Multiple ID3D12Fence or Just One

Nvidia's code uses a separate ID3D12Fence object and fence value for each framebuffer, whereas Microsoft uses the same fence object for every framebuffer. In my opinion, it doesn't matter whether you use multiple ID3D12Fence objects or just one, as long as you never reuse a fence value that has already been passed to Signal() for a given ID3D12Fence object. Otherwise the fence value comparison won't work anymore. This is one of the reasons why we should record the fence values in our program.

Framebuffer Race Condition

This is the last question that I am still not sure about. I am wondering what happens when the rendering of the next framebuffer finishes faster than the current one. For example:

framebuffer[ 0 ] : fence value is 5.
framebuffer[ 1 ] : fence value is 6.

After rendering framebuffer 1, we call Signal() and tell the GPU that once it has finished its current job (which is the rendering of framebuffer 1), it should set the fence to 6. Then, we update the frame index to 0, and pick the fence value of framebuffer 0, which is 5.

Because we want to start rendering framebuffer 0, we need to make sure that the previous work the GPU was doing is finished. We call GetCompletedValue() to get the current fence value. If the return value is less than 5, it means that framebuffer 0 is not done yet.

But what if framebuffer 1 finishes faster than framebuffer 0? If so, does GetCompletedValue() return 6 instead of 5, causing a synchronization problem because it mistakenly indicates that the rendering on framebuffer 0 is finished when it actually is not?

There is another question on StackOverflow, and I read the answers. It says that calling ID3D12CommandQueue::ExecuteCommandLists guarantees that the first command list finishes before the GPU executes the second one.

If that is true, then there is nothing to worry about: I can trust the command queue to execute command lists in call order, ensuring that the second command list does not finish before the first. But if it is not, then I think there is a good reason to create a different ID3D12Fence object for each framebuffer and manage them separately.
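Here is a toy model of that reasoning in plain C++, not real D3D12. SimQueue and fence_history_for_two_frames are hypothetical names. The model assumes that a single command queue retires its operations in submission order, so a Signal() takes effect only after everything submitted before it on that queue. It says nothing about work submitted on several queues.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy single-queue model: operations retire strictly in FIFO order (assumed).
struct SimQueue {
    struct Op {
        bool is_signal;       // true: fence signal, false: rendering work
        std::uint64_t value;  // fence value for signals, unused for work
    };

    std::deque<Op> pending;
    std::uint64_t fence_completed = 0; // what GetCompletedValue() would return

    void execute_frame() { pending.push_back({false, 0}); }
    void signal(std::uint64_t v) { pending.push_back({true, v}); }

    // The "GPU" retires the oldest pending operation.
    // Returns false when nothing is left.
    bool step()
    {
        if (pending.empty()) return false;
        const Op op = pending.front();
        pending.pop_front();
        if (op.is_signal) fence_completed = op.value;
        return true;
    }
};

// Replays the scenario from the text: framebuffer 0 signals 5, framebuffer 1
// signals 6. Records the fence value after every GPU step.
std::vector<std::uint64_t> fence_history_for_two_frames()
{
    SimQueue q;
    q.execute_frame(); q.signal(5); // framebuffer 0
    q.execute_frame(); q.signal(6); // framebuffer 1

    std::vector<std::uint64_t> history;
    while (q.step()) history.push_back(q.fence_completed);
    assert(history.size() == 4);
    return history;
}
```

Under that assumption the fence goes through 0, 5, 5, 6: it can only read 6 after framebuffer 0's work and its signal of 5 have already retired, so a single fence shared by all framebuffers is enough on one queue.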

Hopefully, I didn't get it wrong. If there is any mistake, please tell me. I would greatly appreciate it!

2024-07-11

ThIsJaCk