Minimal Captioning tool for Zoom

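About this tool

This is a simple tool that displays speech-recognized text as Zoom closed captions. Once speech recognition starts, the tool shows the in-progress text in the web browser as it is recognized and, each time an utterance is finalized, sends the final text to Zoom as a closed caption. By capturing the browser window and bringing it into the meeting as a virtual camera, you can share the in-progress text in near real time. Note that this is a simple tool that implements only minimal functionality; it is not intended to be run as a public service for general users. See TODO.md for future plans.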
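The split between in-progress and final text described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the class `CaptionRelay` and the callbacks `show` and `send_caption` are hypothetical names, and the two handler methods stand in for whatever interim and final result events the speech recognizer exposes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionRelay:
    """Routes interim text to the display and final text to a caption sender."""

    show: Callable[[str], None]          # hypothetical: updates the browser view
    send_caption: Callable[[str], None]  # hypothetical: posts one caption to Zoom

    def on_recognizing(self, text: str) -> None:
        # Interim hypothesis: it may still change, so only display it.
        self.show(text)

    def on_recognized(self, text: str) -> None:
        # Final result for one utterance: display it and send it exactly once.
        text = text.strip()
        if not text:
            return  # skip empty results, e.g. silence
        self.show(text)
        self.send_caption(text)


# Usage: lists stand in for the browser view and the Zoom caption endpoint.
shown: List[str] = []
sent: List[str] = []
relay = CaptionRelay(show=shown.append, send_caption=sent.append)
relay.on_recognizing("hello")
relay.on_recognizing("hello wor")
relay.on_recognized("Hello world.")
```

Only the final text reaches `send_caption`, which mirrors why the in-progress text has to be shared through the virtual camera rather than through Zoom captions.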

sequenceDiagram
    actor H as Zoom host
    participant O as OBS Studio
    participant B as Web browser
    participant HZA as Host Zoom app
    participant A as Speech Service
    participant Z as Zoom service
    participant PZA as Participant Zoom app
    actor P as Zoom participant

    A->>H: Get the Speech Service settings
    HZA->>H: Get the Zoom API token
    H->>B: Enter the Speech Service settings and the Zoom API token
    H->>O: Configure and start the virtual camera
    HZA--)B: Receive the Zoom audio
    B--)A: Send the received audio
    A--)B: Send the in-progress text
    B--)O: Capture the in-progress text
    O--)HZA: Send the in-progress text as virtual camera video
    HZA--)Z: Send
    Z--)PZA: Deliver the video
    PZA--)P: In-progress text (as virtual camera video)
    A--)B: Send the final text
    B--)Z: Send the final text as a closed caption
    Z--)PZA: Deliver the closed caption
    PZA--)P: Final text (as a closed caption)

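Preparation

Web server: A web server is required to use this tool. Glitch, a web service for easily building web applications, makes it simple to get started (instructions).
Microsoft Cognitive Services Speech Service subscription: To get started with Speech Service, see "Try the Speech service for free".
Audio input path from the Zoom app to the browser:
- On macOS: First, install the 2-channel version of BlackHole and select [BlackHole 2ch] under [Speaker] in the Zoom app (if you also want to monitor the audio through the built-in speakers or similar at the same time, you need the [Multi-Output Device] described in this article). Next, select the same [BlackHole 2ch] in the browser's audio input settings (chrome://settings/content/microphone in Chrome). The Zoom app's audio is then routed into the web browser.
Virtual camera: To bring the browser screen capture into a Zoom meeting or webinar as a virtual camera, install and configure a live-streaming tool such as OBS Studio.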

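Usage

Enter the Subscription Key and Service Region of your Speech Service.
In the Zoom meeting or webinar, have the host obtain the API token (instructions), then enter that API token.
Press the [Start speech recognition] button to start speech recognition.

From then on, each time recognition of an utterance completes, the text is posted to Zoom as a closed caption. You can also post captions manually, without speech recognition, by typing into the [Caption] field and pressing the [Post caption] button.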

References

Tested environment

Google Chrome 99.0.4844.83 (macOS 12.3)

Libraries

About this tool

A tiny tool for displaying speech-recognized text as closed captions in Zoom. Once you start speech recognition, your web browser continuously shows the text as it is being recognized, and the tool submits the final text to Zoom as a closed caption each time an utterance is recognized. By capturing the web browser screen and bringing it into a meeting as a virtual camera, you can share the in-progress text in near real time. Please note that this is a small tool with minimal functionality and is not intended for use as a public service. See TODO.md for future plans.

References

Libraries

sequenceDiagram
    actor H as Zoom host
    participant O as OBS Studio
    participant B as Web Browser
    participant S as Web Server
    participant A as Azure Speech Service
    participant Z as Zoom
    participant ZA as Zoom App
    actor P as Zoom participant

    H->>O: Start the virtual camera
    H->>B: Input the Speech Service API key
    H->>B: Input the Zoom API token
    H->>B: Start recognition
    B->>A: Start continuous recognition async
    loop
        B--)+A: Audio signal
        A--)B: Recognizing text
        O--)B: Capture the recognizing text
        O--)Z: Show the recognizing text (as camera images)
        Z--)ZA: Recognizing text (as camera images)
        ZA--)P: Recognizing text (as camera images)
        A->>-B: Recognized text
        B->>S: Submit the recognized text + Zoom API key
        S->>+Z: Ask the seq number of the last successful send
        Z->>-S: The seq number
        S->>+Z: Submit the recognized text with the seq number
        Z->>-ZA: Recognized text as closed captions
        ZA--)P: Recognized text as closed captions
    end
    H->>B: Stop recognition
    B->>A: Stop continuous recognition async
    H->>O: Stop the virtual camera
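
The web server steps in the diagram (ask for the last seq number, then submit the text with a seq number) could look roughly like the sketch below. It is an illustration under stated assumptions, not this tool's server code: the function names are hypothetical, the API token is assumed to be the caption URL issued by Zoom, the last seq number is assumed to be available by appending `/seq` to that URL's path, and the next caption is assumed to use the last seq number plus one.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def seq_url(api_token: str) -> str:
    """URL for asking for the last accepted seq (assumed: '/seq' appended to the path)."""
    parts = urlsplit(api_token)
    return urlunsplit(parts._replace(path=parts.path.rstrip("/") + "/seq"))


def caption_url(api_token: str, seq: int, lang: str = "en-US") -> str:
    """URL for posting one caption; seq and lang are appended as query parameters."""
    sep = "&" if urlsplit(api_token).query else "?"
    return f"{api_token}{sep}seq={seq}&lang={lang}"


def post_caption(api_token: str, text: str, lang: str = "en-US") -> int:
    """Fetch the last seq, post the caption with the next one, and return that seq."""
    with urlopen(seq_url(api_token)) as res:
        last_seq = int(res.read().decode().strip())
    req = Request(
        caption_url(api_token, last_seq + 1, lang),
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urlopen(req):
        pass  # a non-2xx response raises urllib.error.HTTPError
    return last_seq + 1


# Usage with a placeholder token (example.invalid is not a real endpoint).
token = "https://example.invalid/closedcaption?id=1&ns=abc"
print(seq_url(token))
print(caption_url(token, 4))
```

Asking for the last seq number before each submission keeps the counter correct even if the server restarts or an earlier submission failed, at the cost of one extra request per caption.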

kotobuki/tiny-captioning-tool-for-zoom