baidu/Senta

Out of memory error on GPU 0: how should I fix this?

nxbnxb opened this issue · 1 comment

PaddlePaddle installation check output

fluid.install_check.run_check()
Running Verify Fluid Program ...
W1022 11:53:24.008435 13978 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W1022 11:53:24.010244 13978 device_context.cc:244] device: 0, cuDNN Version: 7.6.
Your Paddle Fluid works well on SINGLE GPU or CPU.
I1022 11:53:24.924460 13978 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
W1022 11:53:26.624143 13978 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
I1022 11:53:26.624205 13978 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I1022 11:53:26.624480 13978 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I1022 11:53:26.624694 13978 parallel_executor.cc:315] Cross op memory reuse strategy is enabled, when build_strategy.memory_optimize = True or garbage collection strategy is disabled, which is not recommended
I1022 11:53:26.624883 13978 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now

cat env.sh
set -x
# Add the CUDA library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# Add the cuDNN library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# NCCL must be downloaded first; then add the NCCL library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/nccl/lib:$LD_LIBRARY_PATH
# If FLAGS_sync_nccl_allreduce is 1, cudaStreamSynchronize(nccl_stream) is called in allreduce_op_handle; this mode can give better performance in some cases
export FLAGS_sync_nccl_allreduce=1
# Fraction of the total available GPU memory to allocate as the memory chunk, range [0,1]
export FLAGS_fraction_of_gpu_memory_to_use=1
# Select which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# Whether to use the garbage collection strategy to optimize the network's memory usage; <0 disables it, >=0 enables it
export FLAGS_eager_delete_tensor_gb=1.0
# Whether to use the fast garbage collection strategy
export FLAGS_fast_eager_deletion_mode=1
# Percentage of variable memory released by the garbage collection strategy, range [0.0, 1.0]
export FLAGS_memory_fraction_of_eager_deletion=1
# Set the fluid path
export PATH=fluid=/usr/local/lib/python3.7/site-packages/paddle/include/paddle/fluid:$PATH
# Set python
alias python=/usr/bin/python3
set +x
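
For comparison, the memory-related settings from env.sh can also be applied from Python before Paddle is imported. This is only a sketch: the helper name `paddle_memory_env` is hypothetical, the value 0.5 is an arbitrary example rather than a recommended setting, and it assumes Paddle reads its `FLAGS_*` variables from the environment at import time.

```python
import os


def paddle_memory_env(fraction, devices):
    """Build the environment variables for a Paddle run (hypothetical helper).

    fraction: share of GPU memory to pre-allocate, in (0, 1].
    devices:  iterable of GPU indices to expose to the process.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return {
        # Same flag as in env.sh, but below 1 so the whole card is not claimed up front.
        "FLAGS_fraction_of_gpu_memory_to_use": str(fraction),
        # Same meaning as CUDA_VISIBLE_DEVICES=0,1 in env.sh.
        "CUDA_VISIBLE_DEVICES": ",".join(str(d) for d in devices),
    }


env = paddle_memory_env(0.5, [0, 1])
# Assumption: this must run before `import paddle` for the flags to take effect.
os.environ.update(env)
```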

Error message
Out of memory error on GPU 0. Cannot allocate 90.250244MB memory on GPU 0, available memory is only 25.062500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please try one of the following suggestions:
    1. Decrease the batch size of your model.
    2. FLAGS_fraction_of_gpu_memory_to_use is 1.00 now, please set it to a higher value but less than 1.0.
      The command is export FLAGS_fraction_of_gpu_memory_to_use=xxx.

at (/paddle/paddle/fluid/memory/detail/system_allocator.cc:151)
F1021 21:29:34.217520 13068 exception_holder.h:37] std::exception caught,


C++ Call Stacks (More useful to developers):

0 paddle::memory::detail::GPUAllocator::Alloc(unsigned long*, unsigned long)
1 paddle::memory::detail::BuddyAllocator::RefillPool(unsigned long)
2 paddle::memory::detail::BuddyAllocator::Alloc(unsigned long)
3 void* paddle::memory::legacy::Alloc<paddle::platform::CUDAPlace>(paddle::platform::CUDAPlace const&, unsigned long)
4 paddle::memory::allocation::NaiveBestFitAllocator::AllocateImpl(unsigned long)
5 paddle::memory::allocation::Allocator::Allocate(unsigned long)
6 paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
7 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
8 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
9 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
10 paddle::framework::Tensor::mutable_data(paddle::platform::Place, paddle::framework::proto::VarType_Type, unsigned long)
11 paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
12 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
13 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
14 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
15 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
16 paddle::framework::details::ComputationOpHandle::RunImpl()
17 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
18 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
19 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
20 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
21 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const
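
To make the numbers in the error message easier to read, here is a small sketch that pulls the requested and available sizes out of the message and computes the gap. The function name `oom_shortfall` and the regular expression are illustrative and only match the message format shown above.

```python
import re

# Matches the wording of the error message quoted in this issue.
OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<want>[\d.]+)MB memory on GPU (?P<gpu>\d+), "
    r"available memory is only (?P<free>[\d.]+)MB"
)


def oom_shortfall(message):
    """Return (gpu, requested_mb, free_mb, shortfall_mb), or None if no match."""
    m = OOM_PATTERN.search(message)
    if m is None:
        return None
    want = float(m.group("want"))
    free = float(m.group("free"))
    return int(m.group("gpu")), want, free, want - free


msg = ("Out of memory error on GPU 0. Cannot allocate 90.250244MB memory "
       "on GPU 0, available memory is only 25.062500MB.")
result = oom_shortfall(msg)
print(result)
```

With only about 25 MB free when the MatMul kernel asks for about 90 MB, GPU 0 was already almost entirely in use at that point.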

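I'm hitting the same error, and I'm using four GPUs. It happens in two situations: first, when calling the model to predict on a large amount of data; second, during finetuning when the training set is large. How can this be solved? I haven't found anywhere to reduce the model batch size; the configuration file does not allow changing it.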
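
For the prediction case, one possible workaround is to split the input into smaller chunks on the caller side and run them one at a time. The sketch below is generic Python, not Senta's API: `predict_fn` is a hypothetical stand-in for whatever function runs the model on a list of inputs, and the batch size is an arbitrary example.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict_fn, items, batch_size):
    """Run `predict_fn` on one chunk at a time and collect the results in order."""
    results = []
    for batch in iter_batches(items, batch_size):
        results.extend(predict_fn(batch))
    return results


# Stand-in for a real model call (assumption): returns one value per input text.
def fake_predict(batch):
    return [len(text) for text in batch]


texts = ["good", "bad", "fine", "ok", "great"]
outputs = predict_in_batches(fake_predict, texts, batch_size=2)
```

Only one chunk is passed to the model at a time, so peak memory should depend on `batch_size` rather than on the total number of inputs.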