A context cannot be shared by multiple threads. Each thread must have its own context, otherwise all threads will crash immediately. Thus your description of a barrel processor is completely contrary to reality.

When threads are implemented only in software, without hardware support, you have what is called coarse-grained multithreading. In this case, a CPU core executes one thread until that thread must wait for a long time, e.g. for the completion of some I/O operation. Then the operating system switches the context from the stalled thread to another thread that is ready to run, by saving all registers used by the old thread and restoring the registers of the new thread from the values that were saved when the new thread last ran.

Such multithreading is coarse-grained, because saving and restoring the registers is expensive so it cannot be done often.
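To make the save/restore step concrete, here is a minimal Python sketch of a software context switch. Everything in it is an illustrative assumption: the `Core` and `Thread` classes, the 4-entry register file and the `switch_to` name are invented for this example and do not correspond to any real operating system API.

```python
from dataclasses import dataclass, field

NUM_REGS = 4  # hypothetical tiny register file


@dataclass
class Thread:
    name: str
    # Memory area where the OS keeps this thread's registers while it is not running.
    saved_regs: list = field(default_factory=lambda: [0] * NUM_REGS)


class Core:
    """A core with a single set of registers, so only one context fits at a time."""

    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.current = None

    def switch_to(self, thread):
        # Save the outgoing thread's registers to memory ...
        if self.current is not None:
            self.current.saved_regs = list(self.regs)
        # ... and restore the values the incoming thread had when it last ran.
        self.regs = list(thread.saved_regs)
        self.current = thread


core = Core()
a, b = Thread("A"), Thread("B")

core.switch_to(a)
core.regs[0] = 111   # thread A computes something
core.switch_to(b)    # A stalls on I/O, so the OS switches to B
core.regs[0] = 222   # thread B computes something
core.switch_to(a)    # A is ready again
regs_seen_by_a = list(core.regs)
print(regs_seen_by_a)  # [111, 0, 0, 0]
```

Both copies in `switch_to` touch the whole register file, and on a real CPU that file is far larger than four entries, which is why this kind of switch is only worth doing on long stalls.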

When hardware assists context-switching, by being able to store internally in the CPU core multiple sets of registers, i.e. multiple thread contexts, then you can have FGMT (fine-grained multithreading). In the earliest CPUs with FGMT the switching of the thread contexts was done after each executed instruction, but in all more recent CPUs or GPUs with FGMT the context switching can be done after each clock cycle.

Barrel processors are a subset of the FGMT processors, the simplest and the least efficient of them. Barrel processors are now only of historical interest. Nobody has made barrel processors during the last decades. In barrel processors, the threads are switched in round robin, i.e. in a fixed order. You cannot choose the next thread to run. This wastes clock cycles, because the next thread in the fixed order may be stalled, waiting for some event, so nothing can be done during its allocated clock cycle.
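The wasted cycles can be shown with a toy simulation. This is only a sketch under simplifying assumptions: one issue slot per clock cycle, a hypothetical workload of 4 threads where thread 1 is stalled for the whole run, and a made-up `run` function that counts cycles in which some instruction was issued.

```python
def run(ready, cycles, barrel):
    """Count useful cycles. ready[t][c] is True if thread t can issue at cycle c."""
    n = len(ready)
    useful = 0
    nxt = 0  # thread whose turn comes next in the rotation
    for c in range(cycles):
        if barrel:
            # Fixed order: the slot belongs to thread `nxt`, ready or not.
            if ready[nxt][c]:
                useful += 1
            nxt = (nxt + 1) % n
        else:
            # Flexible FGMT: scan from `nxt` and take the first ready thread.
            for k in range(n):
                t = (nxt + k) % n
                if ready[t][c]:
                    useful += 1
                    nxt = (t + 1) % n
                    break
    return useful


# Hypothetical workload: 4 threads, thread 1 is stalled for the whole run.
CYCLES = 8
ready = [[True] * CYCLES for _ in range(4)]
ready[1] = [False] * CYCLES

barrel_useful = run(ready, CYCLES, barrel=True)
fgmt_useful = run(ready, CYCLES, barrel=False)
print(barrel_useful, fgmt_useful)  # 6 8
```

The barrel variant loses the two cycles that fall on the stalled thread, while the variant that may skip a thread keeps the execution units busy in every cycle.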

The name "barrel", introduced with the CDC 6600 in 1964, refers to the similarity with the barrel of a revolver: you can rotate it by one position, bringing the next thread into place for execution, but you cannot jump over a thread to reach some arbitrary position.

What is switched in a barrel CPU at each clock cycle between threads is not a context, i.e. not the registers, but the execution units of the CPU, which become attached to the context of the current thread, i.e. to its registers. For each thread there is a distinct set of registers, storing the thread context.
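A small sketch of that idea, with invented sizes (3 hardware threads, 2 registers each) and a hypothetical `alu_increment` standing in for the shared execution units:

```python
NUM_THREADS = 3
NUM_REGS = 2

# One register set per hardware thread, all resident in the core at once.
reg_files = [[0] * NUM_REGS for _ in range(NUM_THREADS)]


def alu_increment(regs):
    """The single shared execution unit: it works on whichever register set it is handed."""
    regs[0] += 1


for cycle in range(7):
    tid = cycle % NUM_THREADS      # the barrel rotation picks the thread
    alu_increment(reg_files[tid])  # attach the ALU to that thread's registers; nothing is copied

print(reg_files)  # [[3, 0], [2, 0], [2, 0]]
```

No register is saved or restored anywhere in the loop. Only the index `tid` changes from cycle to cycle, which is what makes a switch on every clock cycle affordable.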

The descriptions of the internal architecture of GPUs are extremely confusing, because NVIDIA has chosen to replace in its documentation all the words that have been used for decades when describing CPUs with different words, for no apparent reason other than obfuscating the GPU architecture. AMD has followed NVIDIA and created a third set of architectural terms, mapped one to one to those of NVIDIA, but using yet other words, for maximum confusion.

For instance, NVIDIA calls "warp" what in a CPU is called "thread". What NVIDIA calls "thread" is what in a CPU is called "vector lane" or "SIMD lane". What NVIDIA calls "streaming multiprocessor" is what in a CPU is called "core".

Both GPUs and CPUs are made of multiple cores, which can execute programs in parallel.

Each core can execute multiple threads, which share the same execution units. For executing multiple threads, most if not all GPUs use FGMT, while most modern CPUs use SMT (Simultaneous Multithreading).

Unlike FGMT, SMT can exist only on superscalar processors, i.e. processors that can initiate the execution of multiple instructions in the same clock cycle. Only in that case can instructions from distinct threads also be initiated in the same clock cycle.
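A toy model of the issue stage shows why the issue width matters. This is a sketch only: the `issue` function and its round-robin way of filling slots are assumptions made for illustration, and real SMT cores use more elaborate selection policies.

```python
def issue(ready_counts, width):
    """Pick up to `width` instructions for one clock cycle.

    ready_counts[t] is how many instructions thread t could issue this cycle.
    Returns the list of thread ids that fill the issue slots.
    """
    slots = []
    remaining = list(ready_counts)
    while len(slots) < width and any(remaining):
        # Hand out slots round-robin among the threads that still have work.
        for t, n in enumerate(remaining):
            if n > 0 and len(slots) < width:
                slots.append(t)
                remaining[t] -= 1
    return slots


# Hypothetical cycle: thread 0 has 1 ready instruction, thread 1 has 3.
ready_counts = [1, 3]
scalar_slots = issue(ready_counts, width=1)  # scalar core: one instruction per cycle
smt_slots = issue(ready_counts, width=4)     # 4-wide superscalar core
print(scalar_slots, smt_slots)  # [0] [0, 1, 1, 1]
```

With a width of 1 only one thread can be served per cycle, so switching between cycles (FGMT) is the most that can be done. With a width of 4 the same cycle holds instructions from both threads, which is the defining property of SMT.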

Some GPUs may be able to initiate 2 instructions per clock cycle, but only when certain conditions are met. The descriptions of such GPUs are typically very vague, so it may be impossible to determine whether those 2 instructions may come from different threads, i.e. from different warps in the NVIDIA terminology.