huggingface/transformers.js

(V3) WebGPU - await extractor() broken - Error: Session already started

Closed this issue · 5 comments

System Info

I'm on an M3 Max in Chrome, using Scrimba as a quick test environment.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Your normal code works fine: https://v2.scrimba.com/s0lmm0qh1q

However, for some reason, awaiting the extractor pipeline appears broken: if you call extractor(text) twice in a row without a timeout in between, it crashes.

(screenshot: `Error: Session already started`)

With device "wasm" I do not encounter this issue.

Reproduction

  • Repeated calls with a 5000ms pause work:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Function to compute embeddings and log them
async function computeEmbeddings() {
  const texts = ['Hello world!', 'This is an example sentence.'];
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(embeddings.tolist());
}

// Call the computeEmbeddings function every 5000ms
setInterval(computeEmbeddings, 5000);

  • Repeated calls with a 0ms pause do not work; await seems to fail:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Function to compute embeddings and log them
async function computeEmbeddings() {
  const texts = ['Hello world!', 'This is an example sentence.'];
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(embeddings.tolist());
}

// Call the computeEmbeddings function every 0ms
setInterval(computeEmbeddings, 0);

In this case, what do you think the expected behaviour should be? 👀 Should each invocation be added to a queue?

The example may not make much sense from an application point of view; it was just the quickest way of demonstrating that the await statement with webgpu doesn't actually wait for the extractor's output.

With wasm it does, so an endless loop works fine without a queue, as each individual run finishes before the next one starts.

That behavior difference is what this issue is about. Please correct me if I got the logic wrong somewhere.

To simplify the implementation, ONNX Runtime Web doesn't allow starting a new session run while the previous one hasn't finished (https://github.com/microsoft/onnxruntime/blob/8c5336449d2279c0394cc4a742d336d7f3bd4124/onnxruntime/wasm/pre-jsep.js#L110). So if you trigger more than one session.run() in parallel (without await), an error like this ("Session already started") may (though not always) be seen.
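
A minimal sketch of such a queue on the application side (my own illustration, not a transformers.js API): each call is chained onto the previous one, so at most one session run is in flight at a time. It assumes the `extractor` pipeline created in the snippets above.

// Serialize extractor calls with a promise chain: at most one
// session.run() is in flight at any time.
let last = Promise.resolve();

function queuedExtract(texts, options) {
  // Chain onto the previous call, whether it resolved or rejected.
  const next = last
    .catch(() => {})                        // keep the chain alive after a failure
    .then(() => extractor(texts, options)); // run only after the previous call settles
  last = next;
  return next;
}

// Concurrent invocations are now safe; they execute one after another.
queuedExtract(['Hello world!'], { pooling: 'mean', normalize: true });
queuedExtract(['Another example sentence.'], { pooling: 'mean', normalize: true });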

Thanks @gyagp! OK, so the problem in my JS code was that I was circumventing the await logic with async JS code. Since I want to call the extractor function twice in a row, I need to rewrite the function and always await completion. Calling sequentially works:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

// Compute embeddings
const texts = ['Hello world!', 'This is an example sentence.'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings.tolist());

const embeddings2 = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings2.tolist());

Considering my previous code, the setInterval approach simply did not make sense, as it forced multiple concurrent extractor sessions, which then obviously failed. Instead, one simply needs to await each run before starting the next:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  dtype: 'fp32',
  device: 'webgpu', // <- Run on WebGPU
});

const texts = ['Hello world!', 'This is an example sentence.'];

// Define an async function to handle the extraction logic
async function extractEmbeddings(index) {
  const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(`Embeddings from iteration ${index + 1}:`, embeddings.tolist());
}

// Run the function 100 times in sequence
async function runExtractions() {
  for (let i = 0; i < 100; i++) {
    await extractEmbeddings(i);
  }
}

runExtractions();

Sorry for the noise, but I was confused why wasm seemingly worked (opening several concurrent sessions) and webgpu didn't.

As a side note, there are still these error messages in the example on https://v2.scrimba.com/s0lmm0qh1q:

(screenshot: console error messages)

You may ignore these error messages for now, given that you don't encounter correctness or performance issues.
In my last reply I just said "not always" and didn't jump into the implementation details, to avoid complexity and confusion. Actually, if ONNX Runtime internally doesn't reach any async point (like a memory copy between device and host), such calls without await still work, because there are never multiple live sessions at the same time. This is very model- and implementation-dependent, so for simplicity we always suggest using await.
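
To make that concrete, here is a small illustrative sketch (my addition, not from the thread) that starts two runs without awaiting the first, reusing the `extractor` pipeline from the snippets above. On WebGPU the second run may reject with "Session already started", depending on whether the runtime hits an async point.

// Illustrative only: start two runs concurrently (no await in between).
const texts = ['Hello world!', 'This is an example sentence.'];
const results = await Promise.allSettled([
  extractor(texts, { pooling: 'mean', normalize: true }),
  extractor(texts, { pooling: 'mean', normalize: true }),
]);
for (const r of results) {
  if (r.status === 'rejected') {
    // Depending on the model and backend, this may log "Session already started".
    console.warn('Concurrent run failed:', r.reason);
  }
}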