The challenge the research lab is facing is most likely due to communication overhead between multiple GPUs. This is a common issue when implementing model parallelism in situations where a model, such as a large transformer-based language model, exceeds the memory capacity of a single GPU.
When training very large models, it's common to distribute parts of the model across multiple GPUs to utilize more aggregate memory and computational power. Although this allows the model to be trained without running out of memory, it introduces complexity in how the different parts of the model communicate with each other.
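As a rough illustration of this kind of split, here is a minimal PyTorch sketch, assuming a machine with at least two visible GPUs; the layer sizes and the `TwoGPUModel` name are hypothetical:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Hypothetical model whose two halves live on different GPUs."""
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # The layer-1 activation must be copied from GPU 0 to GPU 1 here;
        # this device-to-device transfer is the communication cost discussed below.
        x = x.to("cuda:1")
        return self.part2(x)

model = TwoGPUModel()
out = model(torch.randn(32, 1024))  # gradients also cross the GPU boundary in backward()
```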
Communication Overhead: In a distributed setup where layers or parts of the model are split across different GPUs, the GPUs must communicate frequently to pass data, such as activations during the forward pass and gradients during backpropagation. This communication typically travels over high-speed links, but it can still introduce latency that hampers overall speed and efficiency.
Example of Communication: Suppose one GPU computes the output of layer 1, and another GPU needs that output to compute layer 2. The output must be transmitted over the interconnect or network connecting the GPUs, which is generally much slower than memory access within a single GPU.
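To see this cost concretely, one could time the cross-GPU copy against the computation it feeds. The sketch below is only illustrative: it assumes two visible GPUs (`cuda:0`, `cuda:1`) and arbitrary tensor sizes, and the actual numbers depend heavily on the interconnect (PCIe vs. NVLink) and the GPUs involved:

```python
import time
import torch

def sync_all():
    # Wait for all queued work on both GPUs so the timings are meaningful.
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")

x = torch.randn(64, 4096, device="cuda:0")    # pretend this is layer 1's output on GPU 0
w = torch.randn(4096, 4096, device="cuda:1")  # layer 2's weights on GPU 1

sync_all()
t0 = time.perf_counter()
x1 = x.to("cuda:1")        # the cross-GPU transfer
sync_all()
t1 = time.perf_counter()
y = x1 @ w                 # layer 2's computation on GPU 1
sync_all()
t2 = time.perf_counter()

print(f"transfer: {(t1 - t0) * 1e3:.2f} ms, compute: {(t2 - t1) * 1e3:.2f} ms")
```

Depending on the interconnect and tensor sizes, the copy can take as long as, or longer than, the computation it enables, which is exactly the waiting described under Impact below.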
Impact: This overhead can slow down the training process because GPUs may spend significant time waiting for data from each other rather than performing computations.
Solutions: Optimizing a model-parallel training setup often involves improving communication efficiency between the GPUs, for example by using specialized interconnects (e.g., NVLink), reducing the frequency of communication, or compressing data to reduce the amount transferred.
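As one small example of the last idea, activations could be cast to half precision before the copy and restored afterwards, halving the bytes transferred at the cost of some precision. This is only a sketch of the general idea, not the mechanism any particular framework uses:

```python
import torch

x = torch.randn(64, 4096, device="cuda:0")   # float32 activation on GPU 0

# Cast to float16 before the cross-GPU copy, halving the bytes moved,
# then restore float32 on the destination GPU (a lossy trade-off).
x_fp16 = x.half()
x_on_gpu1 = x_fp16.to("cuda:1").float()

full_bytes = x.nelement() * x.element_size()            # 64 * 4096 * 4 bytes
sent_bytes = x_fp16.nelement() * x_fp16.element_size()  # 64 * 4096 * 2 bytes
print(f"sent {sent_bytes} bytes instead of {full_bytes} bytes")
```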
While inefficient memory usage, insufficient computational resources, and model architecture limitations can also pose challenges, in this specific context—distributing model parts across multiple GPUs—communication overhead is the most likely critical issue.