Poor REST performance vs. eXist

Question

IanDavey opened this issue 4 years ago · 9 comments

I have been benchmarking different types of requests to the RESTful API for both Fusion and eXist, using the latest USLM versions of titles 1 (a short one) and 42 (the longest one) of the US Code. Each of these tests had 100 repetitions, and both eXist and Fusion were identically configured on the same VM.

First, I tested a simple PUT of a document to /exist/rest/db/test:

usc01.xml eXist:
    Minimum: 0.43300747871398926 s
    Maximum: 1.388643503189087 s
    Mean:    0.6491883111000061 s
    Median:  0.6234605312347412 s
    Stddev:  0.15249426666857233 s
usc01.xml Fusion:
    Minimum: 0.5359818935394287 s
    Maximum: 2.7659928798675537 s
    Mean:    1.2113053154945375 s
    Median:  1.1449862718582153 s
    Stddev:  0.47974903573925054 s
usc42.xml eXist:
    Minimum: 35.580822229385376 s
    Maximum: 54.87896108627319 s
    Mean:    41.51840656995773 s
    Median:  41.32154309749603 s
    Stddev:  4.195144231762311 s
usc42.xml Fusion:
    Minimum: 188.74798727035522 s
    Maximum: 1077.4066643714905 s
    Mean:    404.10182027816774 s
    Median:  320.5594769716263 s
    Stddev:  191.0485587590224 s

Next, I tested GET:

usc01.xml eXist:
    Minimum: 0.03500247001647949 s
    Maximum: 0.2610175609588623 s
    Mean:    0.05440183639526367 s
    Median:  0.0410001277923584 s
    Stddev:  0.04378882484633538 s
usc01.xml Fusion:
    Minimum: 0.5689961910247803 s
    Maximum: 0.7550196647644043 s
    Mean:    0.6137096118927002 s
    Median:  0.6049911975860596 s
    Stddev:  0.030363894050856537 s
usc42.xml eXist:
    Minimum: 13.213871717453003 s
    Maximum: 15.583616971969604 s
    Mean:    13.67682737827301 s
    Median:  13.544020891189575 s
    Stddev:  0.40090442662817194 s
usc42.xml Fusion:
    Minimum: 264.0210030078888 s
    Maximum: 293.1509966850281 s
    Mean:    272.6948435425758 s
    Median:  269.29898858070374 s
    Stddev:  7.9907570945070505 s

Then, I POSTed some XQuery to get a specific section of a document:

1 U.S.C. 1 eXist:
    Minimum: 0.004994392395019531 s
    Maximum: 0.011027336120605469 s
    Mean:    0.006702215671539307 s
    Median:  0.0060079097747802734 s
    Stddev:  0.0012504552254434967 s
1 U.S.C. 1 Fusion:
    Minimum: 0.00799870491027832 s
    Maximum: 0.03599071502685547 s
    Mean:    0.010678362846374512 s
    Median:  0.009988903999328613 s
    Stddev:  0.0033822034856866566 s
42 U.S.C. 2000e eXist:
    Minimum: 0.012010574340820312 s
    Maximum: 0.04998445510864258 s
    Mean:    0.016210291385650635 s
    Median:  0.014001250267028809 s
    Stddev:  0.005966727038671187 s
42 U.S.C. 2000e Fusion:
    Minimum: 0.02099442481994629 s
    Maximum: 0.06598377227783203 s
    Mean:    0.026210627555847167 s
    Median:  0.02399766445159912 s
    Stddev:  0.006877559298661334 s

(The XQuery I used was adapted from a large block we use for a client, which does a fair amount of pre- and post-processing. The most relevant piece is //*[@identifier=$identifier], where identifier is an indexed attribute.)
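For readers who want to reproduce the lookup without the redacted pre- and post-processing, here is a minimal sketch of how the request body could be built. It reuses the `<query>` envelope and USLM namespace from the script posted later in this thread. The `build_query` helper and the example identifier `/us/usc/t1/s1` are assumptions for illustration, not the query that was actually benchmarked.

```python
EXIST_NS = "http://exist.sourceforge.net/NS/exist"
USLM_NS = "http://xml.house.gov/schemas/uslm/1.0"


def build_query(identifier: str) -> str:
    """Wrap an attribute lookup in the <query> envelope POSTed to the REST API.

    Hypothetical helper: the real benchmark query is redacted in the thread.
    """
    # Reject characters that would break the XQuery string literal or the CDATA section.
    if '"' in identifier or "&" in identifier or "]]>" in identifier:
        raise ValueError("identifier contains characters that are unsafe to inline")
    xquery = (
        f'declare default element namespace "{USLM_NS}";\n'
        f'declare variable $identifier := "{identifier}";\n'
        "//*[@identifier = $identifier]"
    )
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f'<query xmlns="{EXIST_NS}" cache="no">\n'
        f"    <text><![CDATA[\n{xquery}\n    ]]></text>\n"
        "</query>\n"
    )


if __name__ == "__main__":
    # The identifier below is an assumed example value.
    print(build_query("/us/usc/t1/s1"))
```

The resulting string would be sent with something like `requests.post(url, data=body.encode("utf-8"), headers={"Content-Type": "application/xml", ...})` against the same `/exist/rest/db/test` collection used for the other tests.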

Finally, I ran a DELETE (PUTting the document back between repetitions, but only timing the DELETE):

1 U.S.C. eXist:
    Minimum: 0.4656083583831787 s
    Maximum: 1.5618107318878174 s
    Mean:    0.6475286889076233 s
    Median:  0.5726385116577148 s
    Stddev:  0.2086682498874921 s
1 U.S.C. Fusion:
    Minimum: 2.4692959785461426 s
    Maximum: 4.827873945236206 s
    Mean:    3.485269944667816 s
    Median:  3.424243688583374 s
    Stddev:  0.5818232474143926 s
42 U.S.C. eXist:
    Minimum: 18.95114755630493 s
    Maximum: 26.387588024139404 s
    Mean:    21.3156515455246 s
    Median:  21.073147296905518 s
    Stddev:  1.509445739774873 s
42 U.S.C. Fusion:
    Minimum: 1957.4530036449432 s
    Maximum: 2816.1365151405334 s
    Mean:    2314.854492249489 s
    Median:  2259.181742668152 s
    Stddev:  224.80505757124126 s

It appears that Fusion's running time grows faster than eXist's with respect to document size across the board. Several of Xcential's use cases involve handling large-ish documents, so we currently can't recommend Fusion over eXist to clients. That is unfortunate, because in other POST XQuery tests, such as generating a small top-level outline for all USC titles, Fusion is faster.
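One way to make the scaling claim concrete is to divide Fusion's mean time by eXist's for the small and the large input of each test. The sketch below does that using only the means reported above. The `MEANS` table layout and the `slowdown` helper are illustrative names, not part of the original benchmark script.

```python
# Mean timings in seconds, copied from the results above as (eXist, Fusion).
# "small" is title 1 (or 1 U.S.C. 1); "large" is title 42 (or 42 U.S.C. 2000e).
MEANS = {
    "PUT": {
        "small": (0.6491883111000061, 1.2113053154945375),
        "large": (41.51840656995773, 404.10182027816774),
    },
    "GET": {
        "small": (0.05440183639526367, 0.6137096118927002),
        "large": (13.67682737827301, 272.6948435425758),
    },
    "POST": {
        "small": (0.006702215671539307, 0.010678362846374512),
        "large": (0.016210291385650635, 0.026210627555847167),
    },
    "DELETE": {
        "small": (0.6475286889076233, 3.485269944667816),
        "large": (21.3156515455246, 2314.854492249489),
    },
}


def slowdown(exist_s: float, fusion_s: float) -> float:
    """Return how many times slower Fusion was than eXist."""
    return fusion_s / exist_s


# Map each operation to its (small, large) slowdown factors.
ratios = {
    op: (slowdown(*sizes["small"]), slowdown(*sizes["large"]))
    for op, sizes in MEANS.items()
}

if __name__ == "__main__":
    for op, (small, large) in ratios.items():
        print(f"{op:<7} small: {small:7.2f}x   large: {large:7.2f}x")
```

On these means, the gap widens most for whole-document operations: DELETE goes from roughly 5x to roughly 109x, and PUT from roughly 2x to roughly 10x. The POST query ratio stays near 1.6x for both sections.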

Answer 1 · 2021-01-25T14:57:41.000Z

@IanDavey Could you provide a link to the data and, if possible, the scripts used for these tests?

Answer 2 · 2021-01-25T14:59:50.000Z

The data can be found here:

https://uscode.house.gov/download/download.shtml

(you want the XML link)

Answer 3 · 2021-01-25T15:04:38.000Z

I kept modifying the script for each test, but the current version (with certain items redacted) is:

#!/usr/bin/env python

import base64
import requests
import statistics
import time

REPETITIONS = 100
EXIST_AUTH = 'Basic ' + base64.b64encode(b'*****:*******').decode()


# Request body for the POST XQuery test; not used by the DELETE variant of timed_send below.
XQUERY = '''<?xml version="1.0" encoding="utf-8"?>
<query xmlns="http://exist.sourceforge.net/NS/exist" cache="no">
    <text><![CDATA[
            declare default element namespace "http://xml.house.gov/schemas/uslm/1.0";

            declare boundary-space preserve;

            (# exist:batch-transaction #) {
                (: redacted :)
            }
    ]]></text>
</query>
'''

def timed_send(fn, port):
    # Time only the DELETE request; the PUT afterwards is not included in the result.
    start = time.time()
    response = requests.delete(f'http://localhost:{port}/exist/rest/db/test/{fn}', headers={'Authorization': EXIST_AUTH})
    assert response.status_code < 300, response.reason + '\n' + response.text
    result = response.text
    end = time.time()
    # Restore the document so the next repetition has something to delete.
    with open(fn, 'rb') as f:
        response = requests.put(f'http://localhost:{port}/exist/rest/db/test/{fn}', f, headers={'Content-Type': 'application/xml', 'Authorization': EXIST_AUTH})
        assert response.status_code < 300, response.reason + '\n' + response.text
    return end - start

def output_stats(label, dataset):
    print(f'{label}:')
    print(f'\tMinimum: {min(dataset)} s')
    print(f'\tMaximum: {max(dataset)} s')
    print(f'\tMean:    {statistics.mean(dataset)} s')
    print(f'\tMedian:  {statistics.median(dataset)} s')
    if REPETITIONS > 1: print(f'\tStddev:  {statistics.stdev(dataset)} s')

usc01_exist, usc01_fusion, usc42_exist, usc42_fusion = [], [], [], []
for _ in range(REPETITIONS):
    usc01_exist.append(timed_send('usc01.xml', 8080))
    usc01_fusion.append(timed_send('usc01.xml', 4059))
    usc42_exist.append(timed_send('usc42.xml', 8080))
    usc42_fusion.append(timed_send('usc42.xml', 4059))

output_stats('1 U.S.C. eXist', usc01_exist)
output_stats('1 U.S.C. Fusion', usc01_fusion)
output_stats('42 U.S.C. eXist', usc42_exist)
output_stats('42 U.S.C. Fusion', usc42_fusion)
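Since the script had to be edited by hand for each HTTP method, a reusable harness might look like the sketch below. It is a variation on the script above, not the version that produced the posted numbers: `time_calls` and `summarize` are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(action: Callable[[], object],
               repetitions: int,
               reset: Callable[[], object] = lambda: None) -> List[float]:
    """Time `action` once per repetition; run the untimed `reset` after each call."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
        # Untimed cleanup, e.g. PUT the document back after a timed DELETE.
        reset()
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Compute the same statistics the original script prints."""
    stats = {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
    }
    # statistics.stdev needs at least two data points.
    if len(durations) > 1:
        stats["stdev"] = statistics.stdev(durations)
    return stats


if __name__ == "__main__":
    # Stand-in workload; a real run would pass lambdas that wrap requests calls.
    print(summarize(time_calls(lambda: sum(range(10_000)), repetitions=5)))
```

For the DELETE test, `action` would wrap the `requests.delete(...)` call and `reset` the `requests.put(...)` call. For PUT or GET, only `action` is needed.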

Answer 4 · 2021-01-25T15:19:04.000Z

Hi @IanDavey, can you tell me which version of FusionDB you tested this against? If you are not using a nightly build, note that, as recently discussed with Grant, the nightly builds should perform much better than Alpha 3.

Answer 5 · 2021-01-25T15:24:08.000Z

It was Alpha 3. That makes sense. I'll retest with the nightly.

Answer 6 · 2021-01-25T15:26:06.000Z

Thanks, @IanDavey, much appreciated.

Answer 7 · 2021-01-25T17:17:32.000Z

@adamretter Just to confirm — is the latest nightly from 11/25? That's what's showing up on the link you sent us.

Answer 8 · 2021-01-25T17:31:24.000Z

@IanDavey Yes, that's right. We have been working on a change that has taken more engineering than expected, but we hope to push it in the next few days, after which there will be a new nightly.

Answer 9 · 2021-11-02T17:42:26.000Z

@adamretter Just checking in — has there been anything new on this?