Thijsvanede/DeepLog

CUDA Out of Memory - example_hdfs.py

Closed · 1 comment

Hi, I'm trying to run `example_hdfs.py` with the provided logs and I get this error:
```
RuntimeError                              Traceback (most recent call last)

in ()
     33 y_pred_normal, confidence = deeplog.predict(
     34     X = X_test,
---> 35     k = 3, # Change this value to get the top k predictions (called 'g' in DeepLog paper, see Figure 6)
     36 )
     37

5 frames

/usr/local/lib/python3.7/dist-packages/deeplog/deeplog.py in predict(self, X, y, k, variable, verbose)
    105         """
    106         # Get the predictions
--> 107         result = super().predict(X, variable=variable, verbose=verbose)
    108         # Get the probabilities from the log probabilities
    109         result = result.exp()

/usr/local/lib/python3.7/dist-packages/torchtrain/module.py in predict(self, X, batch_size, variable, verbose, **kwargs)
    205             X_ = X[batch:batch+batch_size]
    206             # Add prediction
--> 207             result.append(self(X_))
    208
    209         # Concatenate result and return

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/deeplog/deeplog.py in forward(self, X)
     63
     64         # Perform LSTM layer
---> 65         out, hidden = self.lstm(X, (hidden, state))
     66         # Perform output layer
     67         out = self.out(out[:, -1, :])

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    760         if batch_sizes is None:
    761             result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
--> 762                               self.dropout, self.training, self.bidirectional, self.batch_first)
    763         else:
    764             result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.34 GiB already allocated; 3.75 MiB free; 13.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

This error means that your GPU does not have enough memory to run the example. You can run the code on your CPU instead (though this will be a lot slower). Try removing or commenting out this part:

```python
if torch.cuda.is_available():
    # Set deeplog to device
    deeplog = deeplog.to("cuda")

    # Set data to device
    X_train        = X_train       .to("cuda")
    y_train        = y_train       .to("cuda")
    X_test         = X_test        .to("cuda")
    y_test         = y_test        .to("cuda")
    X_test_anomaly = X_test_anomaly.to("cuda")
    y_test_anomaly = y_test_anomaly.to("cuda")
```
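Instead of deleting the block, you can also select the device once and move everything through a single variable, so the same script runs on GPU when one is available and falls back to CPU otherwise. A minimal, self-contained sketch of that pattern (the `LSTM` module and random input here are stand-ins for the `deeplog` object and tensors from the example script):

```python
import torch

# Pick a device at runtime: use the GPU if one is available, otherwise the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the DeepLog model and input data from example_hdfs.py:
# move both to the chosen device with .to(device), exactly as the example
# does with the hard-coded "cuda" string.
model = torch.nn.LSTM(input_size=4, hidden_size=8).to(device)
x = torch.randn(10, 1, 4).to(device)  # (seq_len, batch, input_size)

out, _ = model(x)
print(out.shape)  # torch.Size([10, 1, 8])
```

With this pattern you only change `device` in one place (e.g., force `device = "cpu"`) when the GPU is too small.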

Alternatively, you can reduce the input and output size of DeepLog for this example. These parameters tell DeepLog how many different log event types to expect. We set them to 300 because that was the number of log types in our largest dataset, but for the HDFS dataset you can use a lower number (e.g., 30, though please double-check whether this is enough). With a smaller input and output size, the model itself becomes smaller and might fit on your GPU.
To do so, please change the following part of the example:

```python
# Create DeepLog object
deeplog = DeepLog(
    input_size  = 30, # Number of different events to expect, I think 30 should be enough but please check
    hidden_size = 64, # Hidden dimension, we suggest 64
    output_size = 30, # Number of different events to expect, I think 30 should be enough but please check
)
```