You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git
Factors Affecting AI Training Time
In deep learning training, the calculation of training time involves multiple factors, including the number of epochs, global batch size, micro batch size, and the number of computing devices, among others. Below is a basic formula illustrating the relationship between these parameters (note that this is just a basic illustrative formula, mainly explaining proportional and inversely proportional relationships; actual training may require considering more factors):
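One way to write this basic relationship, using the terms defined below (a rough reconstruction for illustration rather than an exact formula), is:

$$
\text{Training Time} \approx \text{Epochs} \times \frac{\text{Total Number of Samples}}{\text{Global Batch Size}} \times \text{Time per Step}
$$

The Number of Devices enters indirectly: in data-parallel training it scales the Global Batch Size (together with the micro batch size and gradient accumulation steps), and it also influences the Time per Step.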
In this formula:
- Epochs refer to the number of times the model processes the entire training dataset.
- Total Number of Samples is the total number of samples in the training dataset.
- Global Batch Size is the total number of data samples processed in each training iteration.
- Time per Step is the time required for each training iteration, which depends on hardware performance, model complexity, optimization algorithms, and other factors.
- Number of Devices is the number of computing devices used for training, such as the number of GPUs.
This formula provides a basic framework, but please note that the actual training time may be influenced by many other factors, including I/O speed, network latency (for distributed training), CPU-GPU communication speed, the frequency of hardware failures during GPU training, etc. Therefore, this formula can only serve as a rough estimate, and the actual training time may vary.
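For a quick back-of-the-envelope calculation, the relationship can be expressed in a few lines of Python. The example numbers below are made up for illustration, not measurements:

```python
def estimate_training_time(epochs, total_samples, global_batch_size, time_per_step_s):
    """Rough training-time estimate following the basic formula above."""
    steps_per_epoch = total_samples / global_batch_size
    total_steps = epochs * steps_per_epoch
    return total_steps * time_per_step_s  # seconds


# Hypothetical example values, for illustration only.
seconds = estimate_training_time(epochs=3, total_samples=1_000_000,
                                 global_batch_size=512, time_per_step_s=0.8)
print(f"Estimated training time: {seconds / 3600:.1f} hours")
```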
Detailed explanations
The training time of a deep learning model is determined by multiple factors, including but not limited to the following:
- Number of Epochs: An epoch means that the model has processed the entire training dataset once. The more epochs, the more data the model needs to process, and thus the longer the training time.
- Global Batch Size: The global batch size is the total number of data samples processed in each training iteration. The larger the global batch size, the more data is processed in each iteration, which may reduce the number of iterations required per epoch, potentially shortening the total training time. However, if the global batch size is too large, it may lead to memory overflow.
- Micro Batch Size: The micro batch size refers to the number of data samples processed by each computing device in each training iteration. The larger the micro batch size, the more data each device processes per iteration, which may improve computational efficiency and thus shorten training time. However, if the micro batch size is too large, it may lead to memory overflow.
- Hardware Performance: The performance of the computing devices used (such as CPUs, GPUs) will also affect training time. More powerful devices can perform computations faster, thereby shortening training time.
- Model Complexity: The complexity of the model (such as the number of layers, number of parameters, etc.) will also affect training time. The more complex the model, the more computations are required, and thus the longer the training time.
- Optimization Algorithm: The optimization algorithm used (such as SGD, Adam, etc.) and hyperparameter settings like learning rate will also affect training time.
- Parallel Strategy: The use of parallel computing strategies such as data parallelism, model parallelism, etc., will also affect training time.
There are many factors that determine the length of training time, and they need to be considered comprehensively based on the specific training task and environment.
So, in this formula, suppose we take the following example values:
batch_size = 10 # Batch size
total_num = 1000 # Total number of training data
When training one batch of data and updating the gradient once (gradient accumulation steps = 1):
train_steps = total_num / batch_size = 1000 / 10 = 100
This means there are 100 steps per epoch, and the gradient update steps are also 100.
When the memory is insufficient to support a batch size of 10, we can use gradient accumulation to reduce the size of each micro-batch. Suppose we set the gradient accumulation steps to 2:
gradient_accumulation_steps = 2
micro_batch_size = batch_size / gradient_accumulation_steps = 10 / 2 = 5
This means that for each gradient update, we accumulate data from 2 micro-batches, with each micro-batch size being 5. This reduces memory pressure, but the data size per gradient update remains 10 data points.
Result:
- The number of training steps per epoch (train_steps) remains 100, because the total amount of data and the effective batch size per update are unchanged.
- The number of gradient updates also remains 100, because each update accumulates 2 micro-batches of 5 samples, i.e., still 10 samples per update (see the quick check below).
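A quick arithmetic check of these numbers, using the example values above:

```python
total_num = 1000
batch_size = 10                       # global batch size per gradient update
gradient_accumulation_steps = 2
micro_batch_size = batch_size // gradient_accumulation_steps     # 5

gradient_updates_per_epoch = total_num // batch_size              # 100
micro_batch_passes_per_epoch = total_num // micro_batch_size      # 200 forward/backward passes

print(gradient_updates_per_epoch, micro_batch_passes_per_epoch)   # 100 200
```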
It is important to note that when using gradient accumulation, each training step handles the accumulation of gradients from multiple micro-batches, which may slightly increase the computation time per step. Therefore, if memory is sufficient, it is better to increase the batch size to reduce the number of gradient accumulations. When memory is insufficient, gradient accumulation is an effective method.
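As a concrete illustration of the mechanism, here is a minimal PyTorch-style sketch of gradient accumulation. The model, loss function, and data below are placeholders chosen for this example, not part of the original article:

```python
import torch

gradient_accumulation_steps = 2          # accumulate 2 micro-batches per optimizer step

# Placeholder model, optimizer, loss, and data (200 micro-batches of 5 samples = 1000 samples).
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
dataloader = [(torch.randn(5, 16), torch.randn(5, 1)) for _ in range(200)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = loss_fn(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches one batch of 10 samples.
    (loss / gradient_accumulation_steps).backward()

    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                 # one gradient update per 2 micro-batches
        optimizer.zero_grad()
```

Each optimizer step still corresponds to 10 samples, so the gradient estimate matches the larger batch size, while each forward/backward pass only needs memory for 5 samples.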
The global batch size significantly impacts the training effectiveness of the model. Generally, a larger global batch size provides more accurate gradient estimates, aiding model convergence. However, it also increases memory pressure on each device. If memory resources are limited, using a large global batch size may not be feasible.
In such cases, gradient accumulation can be used. By training with a smaller micro-batch size on each device, we reduce memory pressure while maintaining a large global batch size for accurate gradient estimates. This allows training large models on limited hardware resources without sacrificing the global batch size.
In summary, gradient accumulation is a trade-off strategy to balance global batch size and training effectiveness when memory resources are limited.
So, if we look at these two formulas:
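A rough reconstruction of the two formulas referred to here, assuming a data-parallel setup (the exact notation in the original may differ):

$$
\text{Training Time} \approx \text{Epochs} \times \frac{\text{Total Number of Samples}}{\text{Global Batch Size}} \times \text{Time per Step}
$$

$$
\text{Global Batch Size} = \text{Micro Batch Size} \times \text{Gradient Accumulation Steps} \times \text{Number of Devices}
$$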
The larger the global batch size, the shorter the total training time, provided that no OOM (Out of Memory) errors occur and the GPU's computational power is not yet fully utilized (so that the time per step does not grow in proportion to the batch size).
The Relationship Between Data Parallelism and Batch Size