huggingface/transformers.js

(V3) WebGPU - await extractor() broken - Error: Session already started

Closed this issue · 5 comments

System Info

I'm on an M3 Max in Chrome, using Scrimba as a quick test environment.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Your normal code works fine: https://v2.scrimba.com/s0lmm0qh1q

However, for some reason, awaiting the extractor pipeline appears broken: if you call extractor(text) twice in a row without a timeout in between, it crashes.

(screenshot: `Error: Session already started`)

With device "wasm" I do not encounter this issue.

Reproduction

  • Repeated calls with a 5000ms pause work:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Function to compute embeddings and log them
async function computeEmbeddings() {
  const texts = ['Hello world!', 'This is an example sentence.'];
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(embeddings.tolist());
}

// Call the computeEmbeddings function every 5000ms
setInterval(computeEmbeddings, 5000);

  • Repeated calls with a 0ms pause do not work; await seems to fail:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Function to compute embeddings and log them
async function computeEmbeddings() {
  const texts = ['Hello world!', 'This is an example sentence.'];
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(embeddings.tolist());
}

// Call the computeEmbeddings function every 0ms
setInterval(computeEmbeddings, 0);

In this case, what do you think the expected behaviour should be? 👀 Should each invocation be added to a queue?

The example may not make much sense from an application point of view; it was just the quickest way of demonstrating that the await statement with webgpu doesn't actually wait for the extractor's output.

With wasm it does, so an endless loop works fine without a queue, as each individual run finishes before the next one starts.

That behavior difference is what this issue is about. Please correct me if I got the logic wrong somewhere.

To simplify the implementation, ONNX Runtime Web doesn't allow starting a new session run while the previous one hasn't finished (https://github.com/microsoft/onnxruntime/blob/8c5336449d2279c0394cc4a742d336d7f3bd4124/onnxruntime/wasm/pre-jsep.js#L110). So if you trigger more than one session.run() in parallel (without await), an error like this ("Session already started") may (though not always) be seen.
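
A minimal sketch of such a queue on the application side (my own illustration, not a transformers.js API): each call is chained onto the previous one, so at most one session run is in flight at a time. It assumes the `extractor` pipeline created in the snippets above.

// Serialize extractor calls with a promise chain: at most one
// session.run() is in flight at any time.
let last = Promise.resolve();

function queuedExtract(texts, options) {
  // Chain onto the previous call, whether it resolved or rejected.
  const next = last
    .catch(() => {})                        // keep the chain alive after a failure
    .then(() => extractor(texts, options)); // run only after the previous call settles
  last = next;
  return next;
}

// Concurrent invocations are now safe; they execute one after another.
queuedExtract(['Hello world!'], { pooling: 'mean', normalize: true });
queuedExtract(['Another example sentence.'], { pooling: 'mean', normalize: true });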

Thanks @gyagp! OK, so the problem in my JS code was that I was circumventing the await logic with async JS code. Since I want to call the extractor function twice in a row, I need to rewrite the function and always await completion. Calling sequentially works:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Compute embeddings
const texts = ['Hello world!', 'This is an example sentence.'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings.tolist());

const embeddings2 = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings2.tolist());

Considering my previous code, the setInterval approach simply did not make sense, as it forced multiple concurrent extractor sessions, which then obviously failed. Instead, one simply needs to await each run before starting the next:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

const texts = ['Hello world!', 'This is an example sentence.'];

// Define an async function to handle the extraction logic
async function extractEmbeddings(index) {
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(`Embeddings from iteration ${index + 1}:`, embeddings.tolist());
}

// Run the function 100 times in sequence
async function runExtractions() {
  for (let i = 0; i < 100; i++) {
    await extractEmbeddings(i);
  }
}

runExtractions();

Sorry for the noise, but I was confused why wasm seemingly worked (opening several concurrent sessions) and webgpu didn't.

As a side note, there are still these error messages in the example on https://v2.scrimba.com/s0lmm0qh1q:

(screenshot: console error messages)

You may ignore these error messages for now, given that you don't encounter correctness or performance issues.
In my last reply I just said "not always" and didn't jump into the implementation details, to avoid complexity and confusion. Actually, if ONNX Runtime internally doesn't reach any async point (like a memory copy between device and host), such calls without await still work, because there are never multiple live sessions at the same time. This is very model- and implementation-dependent, so for simplicity we always suggest using await.
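
To make that concrete, here is a small illustrative sketch (my addition, not from the thread) that starts two runs without awaiting the first, reusing the `extractor` pipeline from the snippets above. On WebGPU the second run may reject with "Session already started", depending on whether the runtime hits an async point.

// Illustrative only: start two runs concurrently (no await in between).
const texts = ['Hello world!', 'This is an example sentence.'];
const results = await Promise.allSettled([
  extractor(texts, { pooling: 'mean', normalize: true }),
  extractor(texts, { pooling: 'mean', normalize: true }),
]);
for (const r of results) {
  if (r.status === 'rejected') {
    // Depending on the model and backend, this may log "Session already started".
    console.warn('Concurrent run failed:', r.reason);
  }
}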