Inference results for the same model are inconsistent between WebGPU and WASM
System Info
"@huggingface/transformers": "^3.0.0-alpha.5"
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
When I perform NER inference using WebGPU, the results vary across different users' computers, and the WebGPU results also differ from the WASM results. The only code change between the two runs is the device option, switched between "webgpu" and "wasm".
For some models there is no difference, for example Xenova/bert-base-multilingual-cased-ner-hrl.
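To be explicit, this is the only difference between the two configurations; a minimal sketch (for brevity it references the Hub id directly, whereas I actually load my local ONNX conversion; the full code is under Reproduction below):

import { pipeline } from '@huggingface/transformers';

// Identical task, model, and dtype; only the execution backend differs.
const model_id = 'Isotonic/distilbert_finetuned_ai4privacy_v2';
const nerWebGPU = await pipeline('token-classification', model_id, { dtype: 'fp16', device: 'webgpu' });
const nerWasm = await pipeline('token-classification', model_id, { dtype: 'fp16', device: 'wasm' });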
Reproduction
First, I converted the model to ONNX format as follows: on the v3 branch of transformers.js, I ran python -m scripts.convert --quantize --model_id Isotonic/distilbert_finetuned_ai4privacy_v2. Then I used the following code for model loading and inference:
import { pipeline, env } from '@huggingface/transformers';

env.allowLocalModels = true;
env.backends.onnx.wasm.numThreads = 1;

export class PipelineSingleton {
    static task = 'token-classification';
    // Path to the locally converted ONNX model
    static model = '/Isotonic/distilbert_finetuned_ai4privacy_v2';
    static instance = null;

    static async getInstance(progress_callback = null) {
        if (this.instance === null) {
            this.instance = pipeline(this.task, this.model, {
                progress_callback,
                dtype: "fp16",
                device: "webgpu", // switching this to "wasm" is the only change between the two runs
            });
        }
        return this.instance;
    }
}
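The inference call itself is essentially the following (simplified from the app code):

const ner = await PipelineSingleton.getInstance();
const output = await ner(
    "Anuj Joshi - Founder (May 2020) Over 22+ experience in channel space building various Route To Markets for global giants like Amazon, IBM & Autodesk,"
);
console.log(output); // empty with device: "webgpu"; non-empty entity list with device: "wasm"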
With device set to "webgpu", performing inference on the text "Anuj Joshi - Founder (May 2020) Over 22+ experience in channel space building various Route To Markets for global giants like Amazon, IBM & Autodesk," extracts no entities. However, if the device is changed to "wasm", entities are extracted.