Question

What is actually a Queue family in Vulkan?

I am currently learning Vulkan, right now I am just taking apart each command and inspecting the structures to try to understand what they mean.

Right now I am analyzing QueueFamilies, for which I have the following code:

vector<vk::QueueFamilyProperties> queue_families = device.getQueueFamilyProperties();
for(auto &q_family : queue_families)
{
    cout << "Queue number: "  + to_string(q_family.queueCount) << endl;
    cout << "Queue flags: " + to_string(q_family.queueFlags) << endl;
}

This produces this output:

Queue number: 16
Queue flags: {Graphics | Compute | Transfer | SparseBinding}
Queue number: 1
Queue flags: {Transfer}
Queue number: 8
Queue flags: {Compute}

So, naively I understand it like this:

There are 3 Queue families, one queue family has 16 queues, all capable of graphics, compute, transfer, and sparse binding operations (no idea what the last 2 are)

Another has 1 queue, capable only of transfer (whatever that is)

The final one has 8 queues capable of compute operations.

What is each queue family? I understand it's where we send execution commands like drawing and swapping buffers, but that is a rather broad explanation; I would like a more knowledgeable answer with more details.

What are the two extra flags, Transfer and SparseBinding?

And finally, why do we have/need multiple command queues?


Solution


To understand queue families, you first have to understand queues.

A queue is something you submit command buffers to, and command buffers submitted to a queue are executed in order[1] relative to each other. Command buffers submitted to different queues are unordered relative to each other unless you explicitly synchronize them with VkSemaphore. You can only submit work to a queue from one thread at a time, but different threads can submit work to different queues simultaneously.

Each queue can only perform certain kinds of operations. Graphics queues can run graphics pipelines started by vkCmdDraw* commands. Compute queues can run compute pipelines started by vkCmdDispatch*. Transfer queues can perform transfer (copy) operations from vkCmdCopy*. Sparse binding queues can change the binding of sparse resources to memory with vkQueueBindSparse (note this is an operation submitted directly to a queue, not a command in a command buffer). Some queues can perform multiple kinds of operations. In the spec, every command that can be submitted to a queue has a "Command Properties" table that lists what queue types can execute the command.

A queue family just describes a set of queues with identical properties. So in your example, the device supports three kinds of queues:

  • One kind can do graphics, compute, transfer, and sparse binding operations, and you can create up to 16 queues of that type.

  • Another kind can only do transfer operations, and you can only create one queue of this kind. Usually, this is for asynchronously DMAing data between host and device memory on discrete GPUs, so transfers can be done concurrently with independent graphics/compute operations.

  • Finally, you can create up to 8 queues that are only capable of compute operations.
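The scan over getQueueFamilyProperties() implied above can be turned into a small selection helper. The following is only an illustrative sketch: it uses minimal stand-in types for vk::QueueFlagBits and vk::QueueFamilyProperties so the logic can run without a Vulkan device; in a real program you would iterate the result of device.getQueueFamilyProperties() instead.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Minimal stand-ins for vk::QueueFlagBits / vk::QueueFamilyProperties,
// so the selection logic can be demonstrated without a Vulkan device.
enum QueueFlagBits : uint32_t {
    Graphics      = 0x1,
    Compute       = 0x2,
    Transfer      = 0x4,
    SparseBinding = 0x8,
};

struct QueueFamilyProperties {
    uint32_t queueFlags = 0;
    uint32_t queueCount = 0;
};

// Index of the first family whose flags cover every bit in `required`.
std::optional<uint32_t> findQueueFamily(
        const std::vector<QueueFamilyProperties>& families,
        uint32_t required) {
    for (uint32_t i = 0; i < families.size(); ++i)
        if ((families[i].queueFlags & required) == required)
            return i;
    return std::nullopt;
}
```

For the device in the question, asking for Graphics | Compute yields family 0, while asking for a flag the device does not expose yields std::nullopt.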

Some queues might only correspond to separate queues in the host-side scheduler, other queues might correspond to actual independent queues in hardware. For example, many GPUs only have one hardware graphics queue, so even if you create two VkQueues from a graphics-capable queue family, command buffers submitted to those queues will progress through the kernel driver's command buffer scheduler independently but will execute in some serial order on the GPU. But some GPUs have multiple compute-only hardware queues, so two VkQueues for a compute-only queue family might actually proceed independently and concurrently all the way through the GPU. Vulkan doesn't expose this.

The bottom line: decide how many queues you can usefully use, based on how much concurrency you have. For many apps, a single "universal" queue is all they need. More advanced ones might have one graphics+compute queue, a separate compute-only queue for asynchronous compute work, and a transfer queue for asynchronous DMA. Then map what you'd like onto what's available; you may need to do your own multiplexing, e.g. on a device that doesn't have a compute-only queue family, you might create multiple graphics+compute queues instead, or serialize your async compute jobs onto your single graphics+compute queue yourself.
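That multiplexing decision can be sketched in code too. The helper below uses the same kind of stand-in types as before (the function name is hypothetical, not a Vulkan API): it prefers a dedicated compute-only family for async compute and falls back to a compute-capable universal family when none exists.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

enum QueueFlagBits : uint32_t { Graphics = 0x1, Compute = 0x2 };

struct QueueFamilyProperties {
    uint32_t queueFlags = 0;
    uint32_t queueCount = 0;
};

// Prefer a dedicated compute-only family for async compute; otherwise fall
// back to any compute-capable family, where the application must serialize
// its "async" compute jobs alongside graphics work itself.
std::optional<uint32_t> pickAsyncComputeFamily(
        const std::vector<QueueFamilyProperties>& families) {
    for (uint32_t i = 0; i < families.size(); ++i)
        if ((families[i].queueFlags & Compute) &&
            !(families[i].queueFlags & Graphics))
            return i;                       // dedicated compute-only family
    for (uint32_t i = 0; i < families.size(); ++i)
        if (families[i].queueFlags & Compute)
            return i;                       // fallback: graphics+compute
    return std::nullopt;
}
```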

[1] Oversimplifying a bit. They start in order, but are allowed to proceed independently after that and complete out of order. Independent progress of different queues is not guaranteed though. I'll leave it at that for this question.

Answered 2019-03-21

Solution


A Queue is a thing that accepts Command Buffers containing operations of a given type (given by the family flags). The commands submitted to a Queue have a Submission Order and are therefore subject to synchronization by Pipeline Barriers, Subpass Dependencies, and Events (across queues, a Semaphore or better has to be used).

There's one trick: COMPUTE and GRAPHICS queues can always implicitly accept TRANSFER workloads, even if the QueueFamilyProperties do not list TRANSFER (see the Note below the Specification of VkQueueFlagBits).
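That implicit capability can be expressed as a tiny normalization step. A sketch with stand-in flag values (the real ones are VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT, and VK_QUEUE_TRANSFER_BIT from the Vulkan headers):

```cpp
#include <cstdint>

// Stand-in flag values; the real ones are VK_QUEUE_*_BIT.
constexpr uint32_t GraphicsBit = 0x1;
constexpr uint32_t ComputeBit  = 0x2;
constexpr uint32_t TransferBit = 0x4;

// Per the note under VkQueueFlagBits in the spec: reporting GRAPHICS or
// COMPUTE implies TRANSFER support, even when the TRANSFER bit is not set.
constexpr uint32_t effectiveQueueFlags(uint32_t reported) {
    if (reported & (GraphicsBit | ComputeBit))
        reported |= TransferBit;
    return reported;
}
```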

Transfer is for Copy and Blit commands. Sparse binding is something like paging: it allows multiple Memory handles to be bound to a single Image, and it allows different memory to be re-bound later too.

In the Specification, below each given vkCmd* command, the "Supported Queue Types" are always listed.

A Queue Family is a group of Queues that have a special relationship with each other. Some things are restricted to a single Queue Family, such as Images (their ownership has to be explicitly transferred between Queue Families) or Command Pools (a Command Pool creates Command Buffers only for consumption by the given Queue Family and no other). Theoretically, an exotic device could expose more Queue Families with the same Flags.

That's pretty much everything the Vulkan Specification guarantees. See an Issue with this at KhronosGroup/Vulkan-Docs#569


Some additional points from vendor-specific materials:

The GPUs have asynchronous Graphics Engine(s), Compute Engine(s), and Copy/DMA Engine(s). The Graphics and Compute Engines do, of course, contend for the same Compute Units of the GPU.

They usually have only one Graphics Frontend, which is a bottleneck for graphics operations, so there is no point in using more than one Graphics Queue.

There are two modes of operation for Compute: Synchronous Compute (exposed via a GRAPHICS|COMPUTE family) and Async Compute (exposed via a COMPUTE-only family). The first is a safe choice. The second can give about 10 % more performance, but is trickier and requires more effort. The AMD article suggests always implementing the first as a baseline.

There can theoretically be as many Compute Queues as there are Compute Units on the GPU, but AMD argues there is no benefit to more than two Async Compute Queues and exposes only that many. NVIDIA seems to expose the full number.

The Copy/DMA Engines (exposed as the TRANSFER-only family) are primarily intended for CPU⇄GPU transfers. They usually do not achieve full throughput for a GPU-internal copy. So, barring some driver magic, the async TRANSFER-only family should be used for CPU⇄GPU transfers (to reap the async property: graphics work can proceed unhindered next to it). For GPU-internal copies, it is in most cases better to use the GRAPHICS|TRANSFER family.
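Following that advice, picking a family for CPU⇄GPU uploads can be sketched as a preference scan. As before, this uses stand-in types and a hypothetical helper name rather than real Vulkan API calls:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

enum QueueFlagBits : uint32_t { Graphics = 0x1, Compute = 0x2, Transfer = 0x4 };

struct QueueFamilyProperties {
    uint32_t queueFlags = 0;
    uint32_t queueCount = 0;
};

// For CPU<->GPU staging copies, prefer a dedicated TRANSFER-only family
// (the DMA engine), so uploads run asynchronously next to graphics work;
// otherwise fall back to any transfer-capable family (GRAPHICS or COMPUTE
// imply transfer support even when the bit is not listed).
std::optional<uint32_t> pickUploadFamily(
        const std::vector<QueueFamilyProperties>& families) {
    for (uint32_t i = 0; i < families.size(); ++i) {
        uint32_t f = families[i].queueFlags;
        if ((f & Transfer) && !(f & (Graphics | Compute)))
            return i;  // dedicated DMA family
    }
    for (uint32_t i = 0; i < families.size(); ++i) {
        uint32_t f = families[i].queueFlags;
        if (f & (Transfer | Graphics | Compute))
            return i;  // explicit or implicit transfer capability
    }
    return std::nullopt;
}
```

On the device from the question this picks family 1 (the TRANSFER-only one); on a device with only a universal family, it falls back to family 0.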

Answered 2019-03-21