tursodatabase/libsql

SQLD replica is super slow (due to blocking)?

mattatjff opened this issue · 0 comments

I've got a test setup of LibSQL SQLD running as a primary and a replica, each in its own Docker container (same image). The docker-compose.yml is as follows:

version: "3"
services:
  lsql-1:
    image: ghcr.io/tursodatabase/libsql-server:latest
    platform: linux/amd64
    container_name: lsql-1
    restart: "no"
    ports:
      - "8051:8051"
    expose:
      - 5001
    volumes:
      - ./data/1:/var/lib/sqld
    environment:
      - SQLD_NODE=primary
      - SQLD_HTTP_AUTH=basic:YWRtaW46YWRtaW4=
      - SQLD_GRPC_LISTEN_ADDR=0.0.0.0:5001
      - SQLD_HTTP_LISTEN_ADDR=0.0.0.0:8051


  lsql-2:
    image: ghcr.io/tursodatabase/libsql-server:latest
    platform: linux/amd64
    container_name: lsql-2
    restart: "no"
    ports:
      - "8052:8052"
    expose:
      - 5002
    volumes:
      - ./data/2:/var/lib/sqld
    environment:
      - SQLD_NODE=replica
      - SQLD_HTTP_AUTH=basic:YWRtaW46YWRtaW4=
      - SQLD_PRIMARY_URL=http://lsql-1:5001/
      - SQLD_GRPC_LISTEN_ADDR=0.0.0.0:5002
      - SQLD_HTTP_LISTEN_ADDR=0.0.0.0:8052
networks:
  lsql:
    driver: bridge

This is running on a Minisforum EM780 with 32 GB of RAM. Specs can be found here: https://store.minisforum.com/products/minisforum-em680

Using a test script that performs a number of basic read/write operations (visible here: https://github.com/hiraeth-php/turso/blob/master/test/index.php), I get wildly disparate execution times depending on whether it runs against the primary or the replica. The replica is more than an order of magnitude slower: over 1 second, versus about 76 ms on the primary. Here are the results from timing both runs:

Running on the Primary
________________________________________________________
Executed in   75.91 millis    fish           external
   usr time   26.64 millis    1.46 millis   25.17 millis
   sys time   15.02 millis    0.63 millis   14.38 millis

Running on the Replica
________________________________________________________
Executed in    1.16 secs      fish           external
   usr time   39.42 millis  833.00 micros   38.59 millis
   sys time   10.52 millis    0.00 micros   10.52 millis
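The gap can be made explicit by subtracting each run's CPU time (usr + sys, "external" column, i.e. the client process) from its wall-clock time. This minimal sketch just redoes that arithmetic on the figures above; fish rounds the replica's wall time to 1.16 s, so the replica result is approximate.

```python
# Figures copied from the fish `time` output above, in milliseconds.
# The replica's wall time (1.16 s) is rounded by fish, so treat it as approximate.
runs = {
    "primary": {"wall": 75.91, "usr": 26.64, "sys": 15.02},
    "replica": {"wall": 1160.0, "usr": 39.42, "sys": 10.52},
}


def cpu_ms(run):
    """CPU time the client process itself consumed (user + system)."""
    return run["usr"] + run["sys"]


def waiting_ms(run):
    """Wall-clock time not accounted for by the client's own CPU time."""
    return run["wall"] - cpu_ms(run)


for name, run in runs.items():
    print(f"{name}: cpu={cpu_ms(run):.2f} ms, waiting={waiting_ms(run):.2f} ms")
# → primary: cpu=41.66 ms, waiting=34.25 ms
# → replica: cpu=49.94 ms, waiting=1110.06 ms
```

In both runs the client burns a similar amount of CPU; almost all of the replica's extra wall time is time spent idle, waiting on responses.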

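The replica run takes over 1 second of wall-clock time, yet its usr and sys times (about 39 ms and 11 ms) are close to the primary's (about 27 ms and 15 ms). Because those CPU times belong to the client process, roughly 1.1 seconds of the replica run is spent waiting rather than computing, which suggests the replica is doing a lot of blocking while waiting on and/or syncing with the primary. None of this should really be network latency, as both instances are just two Docker containers on the same machine talking to one another.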
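To narrow down which operations block, one option is to time individual statements against each node instead of the whole script. This is a sketch under assumptions: it assumes sqld accepts Hrana-over-HTTP requests at /v2/pipeline and the basic-auth token from SQLD_HTTP_AUTH (base64 of "admin:admin"), and the probe table and statements are hypothetical placeholders for whatever the PHP test script actually runs.

```python
import json
import time
import urllib.request

# Assumption: both nodes expose Hrana over HTTP at /v2/pipeline and accept
# the basic-auth token configured via SQLD_HTTP_AUTH in docker-compose.yml.
AUTH = "Basic YWRtaW46YWRtaW4="
NODES = {"primary": "http://localhost:8051", "replica": "http://localhost:8052"}


def pipeline_body(sql):
    """One execute request followed by close, so each call is self-contained."""
    return {"requests": [{"type": "execute", "stmt": {"sql": sql}}, {"type": "close"}]}


def post_pipeline(base_url, sql, timeout=10.0):
    """POST a single statement to a node and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v2/pipeline",
        data=json.dumps(pipeline_body(sql)).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": AUTH},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def time_statements(send, base_url, statements):
    """Return (sql, elapsed_ms) pairs, sending one statement per request."""
    timings = []
    for sql in statements:
        start = time.perf_counter()
        send(base_url, sql)
        timings.append((sql, (time.perf_counter() - start) * 1000.0))
    return timings


def main():
    # Hypothetical statements on a hypothetical "probe" table.
    statements = [
        "CREATE TABLE IF NOT EXISTS probe (id INTEGER PRIMARY KEY, v TEXT)",
        "INSERT INTO probe (v) VALUES ('x')",
        "SELECT COUNT(*) FROM probe",
    ]
    for name, url in NODES.items():
        for sql, ms in time_statements(post_pipeline, url, statements):
            print(f"{name:8} {ms:8.2f} ms  {sql}")
```

Calling main() with both containers up prints one line per statement per node. If only the writes are slow on the replica, that would point at write forwarding to the primary; if reads are slow too, the replica may be blocking on replication before answering.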