1. Reduce the batch size.
2. Split each batch into smaller chunks and process them one at a time, for example:
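A minimal sketch of chunked processing (assuming a tensor batch and a PyTorch model; `MICRO_BATCH` and `forward_in_chunks` are illustrative names, shown here for a no-grad evaluation pass):

```python
import torch

MICRO_BATCH = 8  # assumed chunk size; tune it to your GPU memory

def forward_in_chunks(model, inputs):
    """Run a large batch through the model one small chunk at a time."""
    outputs = []
    with torch.no_grad():  # e.g. during evaluation, no gradients are stored
        for chunk in inputs.split(MICRO_BATCH, dim=0):
            outputs.append(model(chunk))
    return torch.cat(outputs, dim=0)
```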
3. If the model can run with at least batch size = 1, look into gradient accumulation, which increases the effective batch size without requiring more GPU memory:
```python
optimizer = ...
NUM_ACCUMULATION_STEPS = ...

for epoch in range(...):
    for idx, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss and normalize it by the number of accumulation steps
        loss = loss_fn(outputs, labels)
        loss = loss / NUM_ACCUMULATION_STEPS

        # Back-propagate; gradients accumulate across iterations
        loss.backward()

        # Update the weights every NUM_ACCUMULATION_STEPS batches,
        # or at the end of the dataloader
        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            optimizer.step()
            optimizer.zero_grad()
```
That's all it takes!
- We normalize the loss by the number of gradient accumulation steps, so the accumulated gradient matches what a single large batch would produce.
- We only step the optimizer every NUM_ACCUMULATION_STEPS batches (or at the end of the dataloader), so the effective batch size becomes batch_size × NUM_ACCUMULATION_STEPS.
4. If even batch size = 1 does not fit in GPU memory, there is gradient checkpointing: it saves memory by recomputing intermediate activations during the backward pass instead of storing them, but it comes at the cost of some extra compute, so training becomes somewhat slower.
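In PyTorch this is available through torch.utils.checkpoint. A minimal sketch (the `Net` module and layer sizes are made up for illustration) where activations inside the wrapped blocks are recomputed during backward instead of being stored:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.block2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.head = torch.nn.Linear(1024, 10)

    def forward(self, x):
        # Activations inside the checkpointed blocks are not kept for backward;
        # they are recomputed when gradients are needed, trading compute for memory.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)
```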
5. Use the huggingface/accelerate library, which handles much of this boilerplate (device placement, gradient accumulation, optional mixed precision) for you.
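A rough sketch of the usual Accelerate training loop (assuming `model`, `optimizer`, `dataloader`, `loss_fn`, and `num_epochs` are already defined elsewhere):

```python
from accelerate import Accelerator

# Accelerate moves the model and data to the right device and can also
# take care of gradient accumulation for you.
accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # Inside this context, optimizer.step() only applies an update
        # every gradient_accumulation_steps batches.
        with accelerator.accumulate(model):
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
```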