An RFB proxy that enables WebSockets and audio.
This crate proxies a TCP Remote Framebuffer (RFB) server connection and exposes a WebSocket endpoint, translating between the two. It can optionally enable audio using the Replit Audio RFB extension if the `--enable-audio` flag is passed or the `VNC_ENABLE_EXPERIMENTAL_AUDIO` environment variable is set to a non-empty value.
Since this is a proxy, you'll need to have an RFB server running already. TigerVNC is a good option:
Xvnc --SecurityTypes={None,VNCAuth} --rfbport=5901 --localhost :1
Now `rfbproxy` can run:
cargo run -- [--enable-audio] [--address=0.0.0.0:5900] [--rfb-server=127.0.0.1:5901]
rfbproxy uses a proposed extension to the RFB protocol to negotiate and transmit encoded audio. Sending encoded audio is the main difference from the pre-existing QEMU Audio messages.
This registers the following pseudo-encodings:
Number | Name |
---|---|
0x52706C41 | Replit Audio Pseudo-encoding |
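A client advertises a pseudo-encoding in the usual RFB way, by listing its number in a SetEncodings message. The Rust sketch below (Rust because rfbproxy is a Rust crate) builds that message using the SetEncodings layout from RFC 6143: message-type 2, one padding byte, a U16 count, then one big-endian S32 per encoding. The names `set_encodings` and `REPLIT_AUDIO_PSEUDO_ENCODING` are hypothetical and are not taken from rfbproxy.

```rust
/// Replit Audio pseudo-encoding number. Its big-endian bytes spell "RplA".
const REPLIT_AUDIO_PSEUDO_ENCODING: i32 = 0x5270_6C41;

/// Builds an RFB SetEncodings client message (RFC 6143): U8 message-type 2,
/// one padding byte, a big-endian U16 count, then one big-endian S32 per encoding.
fn set_encodings(encodings: &[i32]) -> Vec<u8> {
    let mut msg = Vec::with_capacity(4 + 4 * encodings.len());
    msg.push(2); // message-type: SetEncodings
    msg.push(0); // padding
    msg.extend_from_slice(&(encodings.len() as u16).to_be_bytes());
    for encoding in encodings {
        msg.extend_from_slice(&encoding.to_be_bytes());
    }
    msg
}
```

For example, `set_encodings(&[0, REPLIT_AUDIO_PSEUDO_ENCODING])` advertises the Raw encoding (0) together with audio support.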
A client that supports this encoding is indicating that it is able to receive an encoded audio data stream. If a server wishes to send encoded audio data, it will send a pseudo-rectangle with the following contents:
No. of bytes | Type | Description |
---|---|---|
2 | U16 | version |
2 | U16 | number-of-codecs |
2 * number-of-codecs | U16 array | codecs |
The supported codecs are as follows:
Codec | Description |
---|---|
0 | Opus codec, WebM container |
1 | MP3 codec, MPEG-1 container |
After receiving this notification, clients may optionally use the Replit Audio Client Message.
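The pseudo-rectangle body could be decoded as in the following sketch. It assumes `body` holds the bytes that follow the standard 12-byte RFB rectangle header and that the U16 fields are big-endian like the rest of RFB; `Codec`, `AudioCapabilities` and `parse_capabilities` are hypothetical names.

```rust
/// Codecs listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    /// 0: Opus codec in a WebM container.
    OpusWebM,
    /// 1: MP3 codec in an MPEG-1 container.
    Mp3,
    /// Any codec number this client does not recognize.
    Unknown(u16),
}

impl From<u16> for Codec {
    fn from(value: u16) -> Self {
        match value {
            0 => Codec::OpusWebM,
            1 => Codec::Mp3,
            other => Codec::Unknown(other),
        }
    }
}

/// Contents of the Replit Audio pseudo-rectangle.
#[derive(Debug, PartialEq, Eq)]
struct AudioCapabilities {
    version: u16,
    codecs: Vec<Codec>,
}

/// Parses version, number-of-codecs and the codec array (all big-endian U16).
/// Returns None if `body` is shorter than number-of-codecs implies.
fn parse_capabilities(body: &[u8]) -> Option<AudioCapabilities> {
    let read_u16 = |offset: usize| -> Option<u16> {
        let bytes = body.get(offset..offset + 2)?;
        Some(u16::from_be_bytes([bytes[0], bytes[1]]))
    };
    let version = read_u16(0)?;
    let count = read_u16(2)? as usize;
    let codecs = (0..count)
        .map(|i| read_u16(4 + 2 * i).map(Codec::from))
        .collect::<Option<Vec<_>>>()?;
    Some(AudioCapabilities { version, codecs })
}
```

Unknown codec numbers are kept rather than rejected, so a client can simply ignore codecs it does not implement when choosing one for the Start Encoder message.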
This registers the following message types:
Number | Name |
---|---|
245 | Replit Audio Client Message |
This message may only be sent if the client has previously received a FramebufferUpdate that confirms support for the intended message-type. Every Replit Audio Client Message begins with a standard header:
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | | submessage-type |
2 | U16 | | payload-length |
This header is followed by payload-length bytes of data whose format is determined by the submessage-type. The possible values for submessage-type and their associated minimum versions are:
Submessage Type | Minimum version | Description |
---|---|---|
0 | 0 | Start Encoder |
1 | 0 | Frame Request |
2 | 0 | Start Continuous Updates |
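Every client submessage shares the 4-byte header above, so framing a payload can be factored out. A minimal Rust sketch, assuming payload-length is big-endian as elsewhere in RFB; `ClientSubmessage` and `frame_client_message` are hypothetical names:

```rust
/// RFB message-type shared by every Replit Audio Client Message.
const REPLIT_AUDIO_MESSAGE_TYPE: u8 = 245;

/// Client submessage types, numbered as in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientSubmessage {
    StartEncoder = 0,
    FrameRequest = 1,
    StartContinuousUpdates = 2,
}

impl ClientSubmessage {
    /// Minimum extension version for each submessage (all 0 in the table).
    fn min_version(self) -> u16 {
        match self {
            ClientSubmessage::StartEncoder
            | ClientSubmessage::FrameRequest
            | ClientSubmessage::StartContinuousUpdates => 0,
        }
    }
}

/// Prefixes `payload` with the standard header: U8 message-type,
/// U8 submessage-type and a big-endian U16 payload-length.
fn frame_client_message(sub: ClientSubmessage, payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u16::MAX as usize, "payload too long for a U16 length");
    let mut msg = Vec::with_capacity(4 + payload.len());
    msg.push(REPLIT_AUDIO_MESSAGE_TYPE);
    msg.push(sub as u8);
    msg.extend_from_slice(&(payload.len() as u16).to_be_bytes());
    msg.extend_from_slice(payload);
    msg
}
```

A client could compare `min_version()` with the version field of the pseudo-rectangle before sending a submessage; that this is how the minimum versions are meant to be checked is an assumption.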
The Start Encoder submessage allows the client to ask the server to start audio capture with the provided configuration:
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 0 | submessage-type |
2 | U16 | 6 | payload-length |
1 | U8 | | enabled |
1 | U8 | | channels |
2 | U16 | | codec |
2 | U16 | | kbytes_per_sec |
After invoking this operation, the client will receive a Replit Audio Server Start Encoder Message with the result of the operation.
Valid values for the enabled field are 0, which disables/stops the audio encoder, and 1, which starts the audio encoder. Valid values for the channels field are 1 (Mono audio) and 2 (Stereo audio). Valid values for the codec field are the ones sent by the server in the Replit Audio Pseudo-encoding pseudo-rect. Valid values for the kbytes_per_sec field are codec-dependent. The Opus codec achieves good performance with 32, whereas the MP3 codec might require 128 for a comparable experience.
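Putting the Start Encoder fields together, a client might build and sanity-check the message as follows. This is a sketch: `start_encoder` and `StartEncoderError` are hypothetical, `offered_codecs` stands for the codec list received in the pseudo-rectangle, big-endian U16 fields are assumed, and kbytes_per_sec is passed through unchecked because its valid range is codec-dependent.

```rust
/// Reasons a requested configuration is rejected before sending (hypothetical).
#[derive(Debug, PartialEq, Eq)]
enum StartEncoderError {
    InvalidChannels(u8),
    CodecNotOffered(u16),
}

/// Builds a complete Start Encoder message: header (245, 0, payload-length 6)
/// followed by enabled, channels, codec and kbytes_per_sec.
fn start_encoder(
    enabled: bool,
    channels: u8,
    codec: u16,
    kbytes_per_sec: u16,
    offered_codecs: &[u16],
) -> Result<Vec<u8>, StartEncoderError> {
    // Only mono (1) and stereo (2) are valid.
    if channels != 1 && channels != 2 {
        return Err(StartEncoderError::InvalidChannels(channels));
    }
    // The codec must be one the server listed in the pseudo-rectangle.
    if !offered_codecs.contains(&codec) {
        return Err(StartEncoderError::CodecNotOffered(codec));
    }
    let mut msg = vec![245, 0, 0, 6]; // message-type, submessage-type, payload-length
    msg.push(enabled as u8);
    msg.push(channels);
    msg.extend_from_slice(&codec.to_be_bytes());
    msg.extend_from_slice(&kbytes_per_sec.to_be_bytes());
    Ok(msg)
}
```

For example, `start_encoder(true, 2, 0, 32, &[0, 1])` requests stereo Opus/WebM at the rate of 32 suggested above.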
The Frame Request submessage allows the client to request a single audio frame from the server. The length of an audio frame is codec-dependent, but is typically between 5 and 40 milliseconds. Each frame is encoded with the parameters chosen by the Start Encoder message. The client MUST send a Start Encoder message and receive the server's acknowledgement that the chosen parameters are valid before sending this message.
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 1 | submessage-type |
2 | U16 | 0 | payload-length |
After invoking this operation, the client will receive a Replit Audio Server Frame Response Message with the encoded audio frame in the corresponding container format.
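Because a Frame Request is only valid after an acknowledged Start Encoder message, a client may want to guard it with a small piece of state. The `AudioClient` type below is a hypothetical sketch, not part of rfbproxy:

```rust
/// Hypothetical client-side state for the Frame Request precondition.
#[derive(Debug, Default)]
struct AudioClient {
    /// True once the server's Start Encoder response arrived with enabled = 1.
    encoder_acknowledged: bool,
}

impl AudioClient {
    /// Records the enabled byte of the server's Start Encoder response.
    fn on_start_encoder_response(&mut self, enabled: u8) {
        self.encoder_acknowledged = enabled == 1;
    }

    /// Returns the 4-byte Frame Request message, or None while no encoder
    /// configuration has been acknowledged.
    fn frame_request(&self) -> Option<[u8; 4]> {
        if self.encoder_acknowledged {
            // message-type 245, submessage-type 1, payload-length 0
            Some([245, 1, 0, 0])
        } else {
            None
        }
    }
}
```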
The Start Continuous Updates submessage allows the client to ask the server to send audio frames continuously. Compared to requesting frames individually, this saves bandwidth and halves the audio latency incurred by the TCP stack. The length of an audio frame is codec-dependent, but is typically between 5 and 40 milliseconds. Each frame is encoded with the parameters chosen by the Start Encoder message. The client MUST send a Start Encoder message and receive the server's acknowledgement that the chosen parameters are valid before sending this message.
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 2 | submessage-type |
2 | U16 | 0 | payload-length |
After invoking this operation, the client will receive a Replit Audio Server Start Continuous Updates Message with the result of the operation. If the operation was successful, that message will be followed by Replit Audio Server Frame Response Message messages, each containing an encoded audio frame in the corresponding container format.
Once audio frames start being sent continuously, the stream can be stopped by sending a Start Encoder message with the enabled field set to 0. Due to inherent race conditions in the protocol, the client may still receive further Replit Audio Server Frame Response Message messages after disabling the encoder, but once the server acknowledges receipt of the Start Encoder message, no further audio frames will be sent.
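Continuous updates start with a fixed 4-byte message and stop with a Start Encoder message whose enabled field is 0. The hypothetical `StreamState` sketch below models the race just described: frames are still expected while the stop is pending, and only the server's acknowledgement ends the stream.

```rust
/// Start Continuous Updates: message-type 245, submessage-type 2, payload-length 0.
const START_CONTINUOUS_UPDATES: [u8; 4] = [245, 2, 0, 0];

/// Hypothetical client-side view of a continuous stream being stopped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StreamState {
    Streaming,
    /// Start Encoder with enabled = 0 sent, not yet acknowledged.
    Stopping,
    /// The server acknowledged the Start Encoder message.
    Stopped,
}

impl StreamState {
    /// Transition after the client sends Start Encoder with enabled = 0.
    fn stop_sent(self) -> StreamState {
        match self {
            StreamState::Streaming => StreamState::Stopping,
            other => other,
        }
    }

    /// Transition when the server's Start Encoder response arrives.
    fn ack_received(self) -> StreamState {
        match self {
            StreamState::Stopping => StreamState::Stopped,
            other => other,
        }
    }

    /// In-flight frames may still arrive while Stopping; only after the
    /// acknowledgement are further frames unexpected.
    fn expects_frames(self) -> bool {
        self != StreamState::Stopped
    }
}
```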
This registers the following message types:
Number | Name |
---|---|
245 | Replit Audio Server Message |
This message may only be sent if the client has previously sent a Replit Audio Client Message that confirms support for the intended message-type. Every Replit Audio Server Message begins with a standard header:
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | | submessage-type |
2 | U16 | | payload-length |
This header is followed by payload-length bytes of data whose format is determined by the submessage-type. The possible values for submessage-type and their associated minimum versions are:
Submessage Type | Minimum version | Description |
---|---|---|
0 | 0 | Start Encoder |
1 | 0 | Frame Response |
2 | 0 | Start Continuous Updates |
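On the receiving side, a client that dispatches on the RFB server message-type byte can read the rest of a Replit Audio Server Message generically. In this sketch the caller is assumed to have already consumed the 245 byte, payload-length is assumed to be big-endian, and all names are hypothetical. Because every message carries payload-length, the sketch can also consume submessage types it does not recognize.

```rust
use std::io::{self, Read};

/// Server submessage types from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServerSubmessage {
    StartEncoder,
    FrameResponse,
    StartContinuousUpdates,
    Unknown(u8),
}

/// A Replit Audio Server Message with its header decoded.
#[derive(Debug, PartialEq, Eq)]
struct ServerMessage {
    submessage: ServerSubmessage,
    payload: Vec<u8>,
}

/// Reads submessage-type, payload-length and the payload itself.
/// Assumes the message-type byte (245) was already read by the caller.
fn read_server_message<R: Read>(reader: &mut R) -> io::Result<ServerMessage> {
    let mut header = [0u8; 3]; // submessage-type + big-endian U16 payload-length
    reader.read_exact(&mut header)?;
    let submessage = match header[0] {
        0 => ServerSubmessage::StartEncoder,
        1 => ServerSubmessage::FrameResponse,
        2 => ServerSubmessage::StartContinuousUpdates,
        other => ServerSubmessage::Unknown(other),
    };
    let len = u16::from_be_bytes([header[1], header[2]]) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(ServerMessage { submessage, payload })
}
```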
The server's Start Encoder submessage is a response to the Replit Audio Client Start Encoder Message, and acknowledges receipt of and/or support for the requested configuration:
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 0 | submessage-type |
2 | U16 | 1 | payload-length |
1 | U8 | | enabled |
If the parameters in the Replit Audio Client Start Encoder Message were valid and the server was able to successfully start an audio capture session, the value of enabled will be 1. Otherwise it will be 0.
After receiving this message with enabled set to 1, the client can send other Replit Audio Client Message messages.
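Interpreting the server's Start Encoder response amounts to checking the header and the single enabled byte. A hypothetical sketch that takes the complete message, header included:

```rust
/// Returns Some(true) if the encoder was started, Some(false) if the request
/// was rejected, and None if `msg` is not a well-formed Start Encoder response.
fn parse_start_encoder_response(msg: &[u8]) -> Option<bool> {
    match msg {
        // message-type 245, submessage-type 0, payload-length 1, then enabled
        [245, 0, 0, 1, enabled] => match *enabled {
            0 => Some(false),
            1 => Some(true),
            _ => None,
        },
        _ => None,
    }
}
```

Only after `Some(true)` should a client go on to send Frame Request or Start Continuous Updates.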
The Frame Response submessage contains the data for a single audio frame, wrapped in the codec's associated container format. The length of an audio frame is codec-dependent, but is typically between 5 and 40 milliseconds. The frame is encoded with the parameters chosen by the Start Encoder message. This submessage is a response to either the Replit Audio Client Frame Request Message or the Replit Audio Client Start Continuous Updates Message.
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 1 | submessage-type |
2 | U16 | 4 + data-length | payload-length |
4 | U32 | | timestamp |
data-length | U8 array | | data |
The most significant bit of timestamp indicates whether the audio frame contains a start-of-stream header or is otherwise a keyframe, which clients can use for seeking. Servers SHOULD send keyframes every few seconds or minutes to allow clients to re-synchronize with the stream. The 31 least significant bits of timestamp contain the number of milliseconds elapsed since the first audio frame captured in the session after the server acknowledged the Start Encoder message. data SHOULD be a self-contained audio frame, and all the audio frames should be concatenable into a valid audio stream. Furthermore, dropping a non-keyframe SHOULD NOT cause the client to de-synchronize, and SHOULD be recoverable by inserting silence for the duration of the dropped frame.
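The timestamp layout can be decoded with two bit masks. This sketch parses a Frame Response payload (the bytes after the 4-byte header) and assumes the U32 timestamp is big-endian; `AudioFrame` and `parse_frame` are hypothetical names.

```rust
/// Decoded Frame Response payload (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct AudioFrame<'a> {
    /// Most significant bit: start-of-stream header or keyframe.
    keyframe: bool,
    /// Lower 31 bits: milliseconds since the session's first captured frame.
    millis: u32,
    /// Self-contained audio frame in the codec's container format.
    data: &'a [u8],
}

const KEYFRAME_BIT: u32 = 1 << 31;

/// Splits a payload into the U32 timestamp and data (data-length = payload-length - 4).
fn parse_frame<'a>(payload: &'a [u8]) -> Option<AudioFrame<'a>> {
    if payload.len() < 4 {
        return None;
    }
    let (ts, data) = payload.split_at(4);
    let timestamp = u32::from_be_bytes([ts[0], ts[1], ts[2], ts[3]]);
    Some(AudioFrame {
        keyframe: (timestamp & KEYFRAME_BIT) != 0,
        millis: timestamp & !KEYFRAME_BIT,
        data,
    })
}
```

For example, a timestamp of 0x800003E8 marks a keyframe captured 1000 ms after the first frame of the session.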
The server's Start Continuous Updates submessage is a response to the Replit Audio Client Start Continuous Updates Message. It acknowledges receipt of that message and signals to the client that the server will send Replit Audio Server Frame Response Message messages continuously.
No. of bytes | Type | [Value] | Description |
---|---|---|---|
1 | U8 | 245 | message-type |
1 | U8 | 2 | submessage-type |
2 | U16 | 1 | payload-length |
1 | U8 | | enabled |
enabled will be set to 1 if the stream of Replit Audio Server Frame Response Message messages is about to start. enabled will be set to 0 if the client had not sent a Start Encoder message beforehand, or if there was any other problem starting the stream. If an error occurs at any later point, or if the client sends a Start Encoder message with the enabled field set to 0, the server will send an additional Replit Audio Server Start Continuous Updates Message with enabled set to 0 after sending the last audio frame.
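The lifecycle of a continuous stream, as signalled by the server's Start Continuous Updates responses, fits a small state machine. This is a hypothetical sketch; it treats every enabled = 0 response as the end of the stream, whether the request was refused up front or a running stream has delivered its last frame.

```rust
/// Server events relevant to a continuous stream (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StreamEvent {
    /// Start Continuous Updates response (submessage-type 2) with its enabled byte.
    ContinuousUpdates { enabled: u8 },
    /// A Frame Response (submessage-type 1).
    Frame,
}

/// Client view of the continuous stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stream {
    Requested,
    Active,
    Ended,
}

impl Stream {
    fn next(self, event: StreamEvent) -> Stream {
        match (self, event) {
            // enabled = 1: frames will follow.
            (Stream::Requested, StreamEvent::ContinuousUpdates { enabled: 1 }) => Stream::Active,
            // enabled = 0: refused, or sent after the last frame of a running stream.
            (_, StreamEvent::ContinuousUpdates { enabled: 0 }) => Stream::Ended,
            // Frames and any other responses leave the state unchanged.
            (state, _) => state,
        }
    }
}
```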