FastAI seems very slow compared to "vanilla" Flux
When I try to train a simple ResNet on the CIFAR10 dataset, FastAI seems very slow compared to Flux (≈ 9-19 times slower).
It seems it could be a garbage collector problem, because with Flux I can use a batch size of 512, while with FastAI I can't exceed 128 without getting an out-of-memory error.
FastAI code:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
method = ImageClassificationSingle(blocks)
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Flux code:
using Flux
using Flux: DataLoader, onehotbatch
using Augmentor
using MLDatasets
using ParameterSchedulers
using ParameterSchedulers: Scheduler
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
normpip = SplitChannels() |> PermuteDims(3, 2, 1) |> ConvertEltype(Float32)
labels = CIFAR10.classnames() .|> Symbol
function datasets(batchsize)
    train = let
        x = CIFAR10.traintensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.trainlabels())
        DataLoader((x, y), batchsize = batchsize, shuffle = true, partial = false)
    end
    test = let
        x = CIFAR10.testtensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.testlabels())
        DataLoader((x, y), batchsize = batchsize)
    end
    train, test
end
function minibatch(x, y)
    h, w, n = size(x)
    xb = Array{Float32}(undef, w, h, 3, n)
    augmentbatch!(CPUThreads(), xb, x, normpip)
    yb = onehotbatch(y, labels)
    xb, yb
end
function train!(model, optimiser, nepochs)
    loss_hist = []
    loss(x, y) = Flux.crossentropy(model(x), y)
    ps = params(model)
    for e in 1:nepochs
        # Training phase
        tloss = 0
        trainmode!(model)
        for (x, y) ∈ train
            x, y = minibatch(x, y) |> gpu
            gs = gradient(ps) do
                l = loss(x, y)
                tloss += l
                l
            end
            Flux.Optimise.update!(optimiser, ps, gs)
        end
        tloss /= length(train)
        # Validation phase
        testmode!(model)
        vloss = 0
        for (x, y) ∈ test
            x, y = minibatch(x, y) |> gpu
            vloss += loss(x, y)
        end
        vloss /= length(test)
        push!(loss_hist, (tloss, vloss))
    end
    loss_hist
end
train, test = datasets(16)
nepochs = 5
s = Triangle(λ0 = 1f-5, λ1 = 1f-3, period = nepochs * length(train))
opt = Scheduler(s, Descent())
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0) |> gpu
@time train!(model, opt, nepochs)
Results on an RTX 2080 Ti:
FastAI:
1841.008685 seconds (3.92 G allocations: 212.561 GiB, 59.59% gc time, 0.00% compilation time)
Flux:
98.444806 seconds (106.49 M allocations: 16.643 GiB, 3.58% gc time, 2.58% compilation time)
Results on a Quadro P5000:
FastAI:
1574.714976 seconds (3.92 G allocations: 212.473 GiB, 11.08% gc time)
Flux:
177.416636 seconds (105.55 M allocations: 16.639 GiB, 2.05% gc time, 1.42% compilation time)
Thanks for the report!
I think the most likely culprit is the image size: ImageClassificationSingle defaults to resizing to (128, 128), while the CIFAR10 source images are significantly smaller (32×32). That should make for a 16x (128² / 32²) theoretical difference. The re(/up)sizing doesn't make much sense for CIFAR10, but it is the default since many image datasets have differing image sizes.
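Just to spell out where the 16x comes from (assuming preprocessing and forward-pass cost scale roughly with pixel count):
# pixels per image at the default resize vs. the native CIFAR10 size
(128 * 128) / (32 * 32)  # = 16.0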
Could you let me know if the FastAI code runs faster without the upsizing, i.e. using
method = ImageClassificationSingle(blocks, size=(32, 32))
Thank you for the suggestion, that was the problem. I should have read the documentation of ImageClassificationSingle more carefully.
Results on RTX 2080 Ti:
159.958922 seconds (3.91 G allocations: 147.793 GiB, 24.01% gc time)
Results on Quadro P5000:
228.792806 seconds (3.90 G allocations: 147.644 GiB, 23.16% gc time)
(I just discovered FastAI.jl, and I love it)
More of a documentation error then, not on your part 😉
That still seems pretty slow, but it is probably due to the amount of garbage collection happening and the fact that the dataset isn't preloaded (FastAI.jl assumes out-of-memory datasets by default). You may get another speedup if you preload the batches once, since I think they will fit into memory. To do so, construct the Learner and then, before training, do:
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
If you try it out, let me know if that improves the allocation issues!
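For a sense of scale, the encoded CIFAR10 training split should easily fit in RAM (rough estimate, assuming 32×32×3 Float32 images after encoding):
nimages, h, w, c = 50_000, 32, 32, 3           # CIFAR10 training split
bytes = nimages * h * w * c * sizeof(Float32)  # encoded image data
bytes / 1024^3                                 # ≈ 0.57 GiB (labels are negligible)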
I will try that. During these experiments, Volatile GPU-Util is not always above 90% (most of the time it is, but it can quickly drop to 0%). I suspect I need to optimize the data loading part.
GPU utilization tends to be a bit lower during validation, since the GPU has less work to do relative to the data loading. If you try the above suggestion, the data loading overhead should be effectively zero.
Collecting the data into RAM clearly reduces the garbage collection time (from 21% to 3%). Unfortunately, the training process is completely broken: the training loss decreases quickly while the validation loss is unchanged. It looks like overfitting after one epoch, which is very surprising.
With collecting data:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:33
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 0.66604 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 1.0 │ 2.30902 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 0.00658 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 2.31039 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.00178 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 2.31698 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.00102 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 2.32021 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.00087 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 2.32081 │
└─────────────────┴───────┴─────────┘
173.158782 seconds (110.26 M allocations: 13.375 GiB, 2.65% gc time, 0.01% compilation time)
Without collecting data:
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")
data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:43
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:03
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 2.31503 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼────────┤
│ ValidationPhase │ 1.0 │ 1.5084 │
└─────────────────┴───────┴────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 1.32031 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 1.28183 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.96906 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 0.97488 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.62234 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 0.83441 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.42819 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 0.80053 │
└─────────────────┴───────┴─────────┘
223.763756 seconds (3.90 G allocations: 147.653 GiB, 21.48% gc time, 0.07% compilation time)
Hm, try passing buffered = false to methodlearner. I think collect reuses the same buffer since buffered = true by default, so what I wrote doesn't work for collecting the whole dataset.
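To illustrate the pitfall with a generic sketch (plain Julia, not FastAI.jl code): collecting an iterator that reuses one buffer just gives you many references to the same array, so every collected "batch" ends up holding whatever was written last.
buf = zeros(Int, 2)
batches = (copyto!(buf, [i, i]) for i in 1:3)  # each step fills and returns the same `buf`
collected = collect(batches)
collected[1] === collected[3]  # true — just three references to one array
collected[1]                   # [3, 3]: every "batch" shows the last write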
If that fixes training, it looks like you'll be at Flux.jl speeds (which you really should be, since that part is basically the same).
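For completeness, the construction I have in mind would look roughly like this (untested sketch, just the earlier snippet with buffered=false added):
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        buffered=false,    # each batch gets its own array, so collect works
                        model=model,
                        optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)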
Yeah!!! It works!!!
Quadro P5000: 177.265435 seconds (110.25 M allocations: 13.374 GiB, 3.17% gc time)
RTX 2080 Ti: 91.993256 seconds (111.44 M allocations: 13.404 GiB, 4.17% gc time)
Great! I'll close this issue then.