Out of memory! How can I reduce CUDA memory usage while training a network?

1. Reduce the batch size. Activation memory grows roughly linearly with the batch size, so halving it is usually the first thing to try, for example:
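A minimal sketch (the toy dataset and the sizes are just placeholders): lowering batch_size in the DataLoader directly lowers how many samples are held in GPU memory at once.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# If batch_size=64 triggers an OOM ...
# loader = DataLoader(dataset, batch_size=64, shuffle=True)
# ... try a smaller value:
loader = DataLoader(dataset, batch_size=16, shuffle=True)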

2. Split each batch into smaller chunks and run the forward/backward pass on one chunk at a time, as sketched below.
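A minimal sketch of point 2, assuming model, loss_fn, optimizer and loader already exist and that loss_fn averages over its batch: each batch is split into chunks, so peak activation memory depends only on the chunk size.

CHUNK_SIZE = 8  # assumed value, pick whatever fits in memory

for inputs, labels in loader:
    optimizer.zero_grad()
    for in_chunk, lb_chunk in zip(inputs.split(CHUNK_SIZE), labels.split(CHUNK_SIZE)):
        outputs = model(in_chunk)
        loss = loss_fn(outputs, lb_chunk)
        # Scale so the accumulated gradients match a full-batch backward pass
        (loss * in_chunk.size(0) / inputs.size(0)).backward()
    optimizer.step()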

3. If the model can at least run with batch size 1, look into gradient accumulation, which increases the effective batch size without requiring more GPU memory:


optimizer = ...
NUM_ACCUMULATION_STEPS = ...

for epoch in range(...):
    for idx, sample in enumerate(dataloader):
        inputs, labels = sample

        # Forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = loss_fn(outputs, labels)

        # Normalize the loss so the accumulated gradient matches a large batch
        loss = loss / NUM_ACCUMULATION_STEPS

        # Back-propagation; gradients accumulate across iterations
        loss.backward()

        # Update the weights every NUM_ACCUMULATION_STEPS batches
        # (or at the end of the dataloader), then reset the gradients
        if ((idx + 1) % NUM_ACCUMULATION_STEPS == 0) or (idx + 1 == len(dataloader)):
            optimizer.step()
            optimizer.zero_grad()

That's all it takes!
  1. We normalize the loss by the number of gradient accumulation steps, so the accumulated gradient matches what a single large batch would produce.
  2. We only call optimizer.step() every NUM_ACCUMULATION_STEPS batches (or at the end of the dataloader), and reset the gradients with optimizer.zero_grad() right after the update.

4. If even batch size 1 does not fit in GPU memory, try gradient (activation) checkpointing: intermediate activations are recomputed during the backward pass instead of being stored, which lowers memory usage at the cost of extra computation time. A sketch follows.
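A minimal sketch using torch.utils.checkpoint (the two-block toy model and its sizes are arbitrary): the checkpointed segment does not keep its intermediate activations and recomputes them during the backward pass.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        x = self.block1(x)
        # block2's activations are recomputed in the backward pass
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

model = Net()
out = model(torch.randn(4, 512, requires_grad=True))
out.sum().backward()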

5. Use the huggingface/accelerate library, which takes care of device placement, mixed precision, and gradient accumulation with only small changes to the training loop, as in the sketch below.
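A minimal sketch of a loop ported to huggingface/accelerate, assuming model, optimizer, loss_fn and dataloader already exist; the values for mixed_precision and gradient_accumulation_steps are just examples.

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    with accelerator.accumulate(model):
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()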
