FastAI seems very slow compared to "vanilla" Flux
When I try to train a simple ResNet on the CIFAR10 dataset, FastAI seems very slow compared to Flux (≈ 9-19 times slower).
It seems it could be a garbage collector problem, because with Flux I can use a batch size of 512, while with FastAI I can't exceed 128 without getting an out-of-memory error.
FastAI code:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
method = ImageClassificationSingle(blocks)
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Flux code:
using Flux
using Flux: DataLoader, onehotbatch
using Augmentor
using MLDatasets
using ParameterSchedulers
using ParameterSchedulers: Scheduler
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
normpip = SplitChannels() |> PermuteDims(3, 2, 1) |> ConvertEltype(Float32)
labels = CIFAR10.classnames() .|> Symbol
function datasets(batchsize)
    train = let
        x = CIFAR10.traintensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.trainlabels())
        DataLoader((x, y), batchsize = batchsize, shuffle = true, partial = false)
    end
    test = let
        x = CIFAR10.testtensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.testlabels())
        DataLoader((x, y), batchsize = batchsize)
    end
    train, test
end
function minibatch(x, y)
    h, w, n = size(x)
    xb = Array{Float32}(undef, w, h, 3, n)
    augmentbatch!(CPUThreads(), xb, x, normpip)
    yb = onehotbatch(y, labels)
    xb, yb
end
function train!(model, optimiser, nepochs)
    loss_hist = []
    loss(x, y) = Flux.crossentropy(model(x), y)
    ps = params(model)
    for e in 1:nepochs
        # Training phase
        tloss = 0
        trainmode!(model)
        for (x, y) ∈ train
            x, y = minibatch(x, y) |> gpu
            gs = gradient(ps) do
                l = loss(x, y)
                tloss += l
                l
            end
            Flux.Optimise.update!(optimiser, ps, gs)
        end
        tloss /= length(train)
        # Validation phase
        testmode!(model)
        vloss = 0
        for (x, y) ∈ test
            x, y = minibatch(x, y) |> gpu
            vloss += loss(x, y)
        end
        vloss /= length(test)
        push!(loss_hist, (tloss, vloss))
    end
    loss_hist
end
train, test = datasets(16)
nepochs = 5
s = Triangle(λ0 = 1f-5, λ1 = 1f-3, period = nepochs * length(train))
opt = Scheduler(s, Descent())
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0) |> gpu
@time train!(model, opt, nepochs)
Results on an RTX 2080 Ti:
FastAI:
1841.008685 seconds (3.92 G allocations: 212.561 GiB, 59.59% gc time, 0.00% compilation time)
Flux:
98.444806 seconds (106.49 M allocations: 16.643 GiB, 3.58% gc time, 2.58% compilation time)
Results on a Quadro P5000:
FastAI:
1574.714976 seconds (3.92 G allocations: 212.473 GiB, 11.08% gc time)
Flux:
177.416636 seconds (105.55 M allocations: 16.639 GiB, 2.05% gc time, 1.42% compilation time)
Thanks for the report!
I think the most likely culprit is the image size: ImageClassificationSingle defaults to resizing to (128, 128), while the CIFAR10 source images are significantly smaller (32×32). That should make for a 16x (128² / 32²) theoretical difference. The re(/up)sizing doesn't make much sense for CIFAR10, but it is the default since many image datasets have differing image sizes.
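Just to spell out where the 16x comes from (assuming preprocessing and forward-pass cost scale roughly with pixel count):
# pixels per image at the default resize vs. the native CIFAR10 size
(128 * 128) / (32 * 32)  # = 16.0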
Could you let me know if the FastAI code runs faster without the upsizing, i.e. using
method = ImageClassificationSingle(blocks, size=(32, 32))
Thank you for the suggestion, that was the problem. I should have read the documentation of ImageClassificationSingle more carefully.
Results on RTX 2080 Ti:
159.958922 seconds (3.91 G allocations: 147.793 GiB, 24.01% gc time)
Results on Quadro P5000:
228.792806 seconds (3.90 G allocations: 147.644 GiB, 23.16% gc time)
(I just discovered FastAI.jl, and I love it)
More of a documentation error then, not on your part 😉
That still seems pretty slow, but it is probably due to the amount of garbage collection happening and the fact that the dataset isn't preloaded (FastAI.jl assumes out-of-memory datasets by default). You may get another speedup if you preload the batches once, since I think they will fit into memory. To do so, construct the Learner and then, before training, do:
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
If you try it out, let me know if that improves the allocation issues!
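For a sense of scale, the encoded CIFAR10 training split should easily fit in RAM (rough estimate, assuming 32×32×3 Float32 images after encoding):
nimages, h, w, c = 50_000, 32, 32, 3           # CIFAR10 training split
bytes = nimages * h * w * c * sizeof(Float32)  # encoded image data
bytes / 1024^3                                 # ≈ 0.57 GiB (labels are negligible)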
I will try that. During these experiments, Volatile GPU-Util is not always above 90% (most of the time it is, but it can quickly drop to 0%). I suspect I need to optimize the data loading part.
GPU utilization tends to be a bit lower during validation, since the GPU has less work to do relative to the data loading. If you try the above suggestion, the data loading overhead should be effectively zero.
Collecting the data into RAM clearly reduces the garbage collection time (from 21% to 3%). Unfortunately, the training process is completely broken: the training loss decreases quickly while the validation loss is unchanged. It looks like overfitting after one epoch, which is very surprising.
With collecting data:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:33
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 0.66604 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 1.0 │ 2.30902 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 0.00658 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 2.31039 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.00178 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 2.31698 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.00102 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 2.32021 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.00087 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 2.32081 │
└─────────────────┴───────┴─────────┘
173.158782 seconds (110.26 M allocations: 13.375 GiB, 2.65% gc time, 0.01% compilation time)
Without collecting data:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:43
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:03
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 2.31503 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼────────┤
│ ValidationPhase │ 1.0 │ 1.5084 │
└─────────────────┴───────┴────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 1.32031 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 1.28183 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.96906 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 0.97488 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.62234 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 0.83441 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.42819 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 0.80053 │
└─────────────────┴───────┴─────────┘
223.763756 seconds (3.90 G allocations: 147.653 GiB, 21.48% gc time, 0.07% compilation time)
Hm, try passing buffered = false to methodlearner. I think collect reuses the same buffer since buffered = true by default, so what I wrote doesn't work for collecting the whole dataset.
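To illustrate the pitfall with a generic sketch (plain Julia, not FastAI.jl code): collecting an iterator that reuses one buffer just gives you many references to the same array, so every collected "batch" ends up holding whatever was written last.
buf = zeros(Int, 2)
batches = (copyto!(buf, [i, i]) for i in 1:3)  # each step fills and returns the same `buf`
collected = collect(batches)
collected[1] === collected[3]  # true — just three references to one array
collected[1]                   # [3, 3]: every "batch" shows the last write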
If that fixes training, it looks like you'll be at Flux.jl speeds (which you really should be, since that part is basically the same).
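For completeness, the construction I have in mind would look roughly like this (untested sketch, just the earlier snippet with buffered=false added):
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        buffered=false,    # each batch gets its own array, so collect works
                        model=model,
                        optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)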
Yeah!!! It works!!!
Quadro P5000: 177.265435 seconds (110.25 M allocations: 13.374 GiB, 3.17% gc time)
RTX 2080 Ti: 91.993256 seconds (111.44 M allocations: 13.404 GiB, 4.17% gc time)
Great! I'll close this issue then.