cuda out of memory #10

Open
wlhgtc opened this issue Mar 24, 2018 · 6 comments

wlhgtc commented Mar 24, 2018

Thanks for your code, it has helped me a lot. I tried to write my own version, but I ran into a problem.
When I rewrite the loss function as follows:
```python
import torch
import torch.nn as nn
from torch.autograd import Variable


class Custom_Loss(nn.Module):
    def __init__(self):
        super(Custom_Loss, self).__init__()

    def loss_function(self, data, labels):
        loss = Variable(torch.zeros(1))
        for d, l in zip(data, labels):
            loss -= torch.log(d[l]).cpu()
        loss /= data.size(0)
        return loss

    def forward(self, p1, p2, S, E):
        """
        N is the batch size and T is the context length.

        :param p1: A (N, T) tensor with the probability of choosing each word as the answer start
        :param p2: A (N, T) tensor with the probability of choosing each word as the answer end
        :param S: A tensor with each query's start position
        :param E: A tensor with each query's end position
        :return: Loss of the BiDAF model
        """
        l1 = self.loss_function(p1, S)
        l2 = self.loss_function(p2, E)
        loss = l1 + l2
        return loss
```

I get the error "cuda out of memory". I have checked my code but could not find the reason. Can you help me?

jojonki commented Mar 24, 2018

@wlhgtc Thank you for your report. I am not sure about the exact cause, but "cuda out of memory" means your GPU has run out of memory. Is your GPU memory sufficient? If you accumulate Variables directly, this problem can happen, because every summed Variable keeps its autograd graph alive. If so, .detach() or .data may be helpful.
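For example, something like this when tracking the running loss (a rough sketch only; `model`, `criterion`, and `optimizer` here are placeholders for your own objects, not code from this repo):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(10, 2)                       # stand-in model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for step in range(100):
    inputs = Variable(torch.randn(32, 10))     # fake batch
    targets = Variable(torch.LongTensor(32).random_(0, 2))

    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep only the Python number. Accumulating `loss` itself would hold
    # every batch's autograd graph (and its GPU buffers) in memory.
    running_loss += loss.data[0]               # on PyTorch >= 0.4: loss.item()
```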

wlhgtc commented Mar 25, 2018

@jojonki Thanks for your reply. I spent the whole day debugging my code, testing the model layer by layer (with the backward and optimizer steps commented out). The "out of memory" error occurs when I compute the similarity matrix S (S = W[H; U; H ∘ U]) with a batch size of 60 (as in the paper), but I see your config uses 20. The model runs fine with a batch size of 30.
With batch size 30 I also noticed that memory usage starts at about 9 GB, then drops to about 7.2 GB and stays steady. I don't know how you handle the data. I use the torchtext package, which automatically pads each batch's contexts to the longest context in that batch. I think some batches contain contexts that are so long that memory runs out.
So I wonder whether you pad the context per batch the same way, and why you set the batch size to 20?
By the way, I use a GTX 1080 Ti with PyTorch 0.3.
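For reference, here is a rough sketch of why that step is so memory-hungry (the sizes below are made up for illustration, not the actual config of this repo): the concatenated tensor behind S has shape (N, T, J, 6d), so it grows with both the batch size and the context length.

```python
import torch

# Assumed sizes: batch N, context length T, query length J, 2*hidden = d2
N, T, J, d2 = 60, 400, 30, 200

H = torch.randn(N, T, d2)   # context encoding
U = torch.randn(N, J, d2)   # query encoding
w = torch.randn(3 * d2)     # trilinear weight vector

# Expand so every (context word, query word) pair can be concatenated.
H_exp = H.unsqueeze(2).expand(N, T, J, d2)
U_exp = U.unsqueeze(1).expand(N, T, J, d2)
cat = torch.cat([H_exp, U_exp, H_exp * U_exp], dim=3)  # (N, T, J, 3*d2)

S = cat.matmul(w)           # (N, T, J) similarity matrix

# The intermediate `cat` alone holds N*T*J*3*d2 floats:
print(cat.numel() * 4 / 1024**3, "GiB")  # ~1.6 GiB at these sizes
```

And that is just the forward pass; autograd keeps such intermediates around for the backward pass as well.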

Vimos commented Jun 25, 2018

Is the memory increasing in your case? Mine runs out of memory in the middle of training.

```
[20180625-174613] Epoch 0 74.2%, loss_p1: 3.338, loss_p2: 2.325
p1 acc: 9.000% (6077/65000), p2 acc: 10.000% (6521/65000)
 75%|████████████████████████████████████████████▋               | 3266/4379 [07:33<02:34,  7.21it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 237, in <module>
    train(model, train_data, optimizer, ema, start_epoch=args.start_epoch)
  File "main.py", line 153, in train
    (loss_p1+loss_p2).backward()
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58
```
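One way to check whether allocated memory really grows between batches is to log the allocator counters every few hundred steps (a small sketch, not code from this repo; these functions exist in PyTorch 0.4+ as far as I know):

```python
import torch

def log_gpu_memory(step):
    # Print the allocator's view of currently used GPU memory in MiB.
    if torch.cuda.is_available():
        used = torch.cuda.memory_allocated() / 1024**2
        peak = torch.cuda.max_memory_allocated() / 1024**2
        print("step %d: allocated %.0f MiB (peak %.0f MiB)" % (step, used, peak))

# If the "allocated" number climbs steadily over steps, some tensor
# (often an accumulated loss) is still attached to the autograd graph.
```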

wlhgtc commented Jun 25, 2018

@Vimos Some contexts have length > 500. You'd better cap them at a fixed length (for me, 300).
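If you load the data with (legacy) torchtext, one way to do that is the `fix_length` argument of `Field`, which pads or truncates every context to the same length instead of the batch maximum (a sketch only; the field names and tokenizer here are placeholders):

```python
from torchtext import data  # legacy torchtext API (0.2/0.3 era)

# Pad/truncate every context to 300 tokens instead of the longest in the batch.
CONTEXT = data.Field(sequential=True, tokenize=str.split, fix_length=300)
QUESTION = data.Field(sequential=True, tokenize=str.split)
```

With `fix_length` set, a single very long context can no longer blow up the memory of its whole batch.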

Vimos commented Jun 25, 2018

@wlhgtc Thanks for the advice.
If I keep the default length, I have to drop to a smaller batch size of 10, which still requires 7709 MiB of memory.

wlhgtc commented Jun 26, 2018

If the memory usage stays steady, it's fine. It's a large model.
