huggingface/transformers.js

[WebGPU] zero-shot-classification model Xenova/nli-deberta-v3-xsmall not accelerated by WebGPU

martin-ada-adam opened this issue · 2 comments

System Info

transformers.js v3 alphas 12–19, web app, latest stable Chrome

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Embedding is significantly accelerated by WebGPU in the v3 alpha, but zero-shot-classification is quite slow when using WebGPU.
It seems the GPU is barely used; q8 on WASM is much faster, which is strange.

Reproduction

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0-alpha.19';

async function execute() {
  // Load the zero-shot classification pipeline on the chosen backend.
  const classifierMulti = await pipeline(
    'zero-shot-classification',
    'Xenova/nli-deberta-v3-xsmall',
    // { dtype: 'q8', device: 'wasm' }
    // { dtype: 'fp16', device: 'wasm' }
    { dtype: 'fp16', device: 'webgpu' },
  );

  // Time a single classification call.
  const startTime = Date.now();
  const result = await classifierMulti(
    'Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.',
    ['failure', 'database', 'mobile phone'],
    { multi_label: true },
  );
  const timeSpent = Date.now() - startTime;

  console.log('result', result);
  console.log('time:', timeSpent);
}

execute();

Codepen example: https://codepen.io/martin-adam/pen/eYqJPpy

Do note that shaders are compiled on the first run, so could you measure later execution calls to the model?

Thank you.
I have checked it; the second and subsequent pipeline calls are about 30% faster.

For comparison, embedding accelerated by WebGPU fp16 is about 6x faster than WASM q8.

In my case (the codepen example), zero-shot-classification using WebGPU fp16 is about 3x slower than WASM q8, which is strange:

  • WebGPU fp16: about 700 ms
  • WASM q8: about 230 ms
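
For reference, here is a minimal sketch of how the warm runs could be timed (the helper name `benchmark` and the run count are my own choices, not from the repro above): one warm-up call excludes shader compilation, then the timing is averaged over several subsequent calls.

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0-alpha.19';

const text = 'Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.';
const labels = ['failure', 'database', 'mobile phone'];

async function benchmark(options, runs = 5) {
  const classifier = await pipeline('zero-shot-classification', 'Xenova/nli-deberta-v3-xsmall', options);

  // Warm-up: on WebGPU the first call includes shader compilation.
  await classifier(text, labels, { multi_label: true });

  // Average over several warm calls for a steadier number.
  const start = performance.now();
  for (let i = 0; i < runs; ++i) {
    await classifier(text, labels, { multi_label: true });
  }
  console.log(JSON.stringify(options), ((performance.now() - start) / runs).toFixed(1), 'ms/call');
}

await benchmark({ dtype: 'q8', device: 'wasm' });
await benchmark({ dtype: 'fp16', device: 'webgpu' });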

Maybe it is because of the warnings I see in the console:
"�[0;93m2024-10-04 15:12:32.393400 [W:onnxruntime:, constant_folding.cc:268 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'�[m"
"�[0;93m2024-10-04 15:12:32.485300 [W:onnxruntime:, constant_folding.cc:268 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'�[m"
"�[0;93m2024-10-04 15:12:32.494100 [W:onnxruntime:, constant_folding.cc:268 ApplyImpl] Could not find a CPU kernel and hence can't constant fold ReduceMean node '/deberta/encoder/LayerNorm/ReduceMean'�[m"
"�[0;93m2024-10-04 15:12:32.705900 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.�[m"
"�[0;93m2024-10-04 15:12:32.706200 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.�[m"