yardstiq/quantum-benchmarks

Improving Qiskit Aer benchmarks

Closed this issue · 3 comments

Hello @Roger-luo, I am a developer of Qiskit Aer and was recently shown your rather nice benchmark repo. I have some suggestions for how the qiskit benchmarks could be improved, since I feel they are under-representing the simulator.

Suggestions:

  • When you transpile the circuit in qiskit you need to include the backend so that it compiles to the native basis gates of the simulator, otherwise it will unroll all single-qubit gates to u3 gates.

  • You shouldn't be using the statevector simulator for benchmarks; rather, you should be using the qasm_simulator. The statevector simulator has a lot of overhead in serializing the statevector via JSON, whereas the qasm simulator does not (you can still ask for a snapshot of the statevector in the qasm simulator). This overhead has been improved somewhat in our next release by replacing JSON with Pybind11, but it still under-represents the simulator if you are interested in timing how fast it applies gates.

  • The qasm simulator has numerous options for method and parallelization that you may want to explicitly configure. Eg:

    • It supports multiple simulation methods (eg statevector, Clifford stabilizer, density matrix, mps) so if you want to specifically benchmark the statevector method you can specify that.
    • By default it will truncate a circuit to remove all non-active qubits, so you would want to disable that optimization to benchmark gate times.
    • You can set the maximum OpenMP threads to use as 1 to disable parallelization.
  • How you report the time taken depends on what you are trying to benchmark. Aer includes a lot of overhead in its result data output, so if you are trying to profile the time of a single gate, you can get a more accurate measure by excluding the result serialization. The different ways of timing include:

    • The full wall-clock time as measured by Python for calling backend.run
    • The run-only time measured in Python (accessible from Result.time_taken). This excludes the time spent initializing and validating the Python Result object from the simulator's output Python dict.
    • The full C++ execution time excluding conversion to Python objects (accessible from Result.metadata['time_taken']). This excludes the C++ -> Py result conversion overhead.
    • The C++ circuit-only execution time (accessible from Result.results[0].time_taken). This excludes any overhead for validation and configuration settings in the C++ simulator, and any Py -> C++ conversion.

Depending on what you are trying to show in the benchmarks, different timings matter more. I would argue that for the gate-level benchmarks you should show the C++ times, but for circuit-level benchmarks that include results you would actually use, I would show the Python time.
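To illustrate the snapshot point from the list above, here is a hedged sketch of requesting a statevector snapshot from the qasm simulator instead of using the statevector simulator. It assumes the Aer snapshot extensions of the 0.3/0.4 era, so the import path and result layout may differ in other versions; the function name is illustrative.

```python
def snapshot_example(num_qubits=3):
    """Sketch: read the final statevector back from the qasm_simulator
    via a snapshot, rather than using the statevector_simulator.

    Assumes the Aer snapshot extensions (circa Aer 0.3/0.4). Imports are
    kept inside the function so this file can be loaded without qiskit.
    """
    from qiskit import QuantumCircuit, Aer, transpile, assemble
    import qiskit.providers.aer.extensions  # registers QuantumCircuit.snapshot_statevector

    qc = QuantumCircuit(num_qubits)
    qc.x(0)
    qc.snapshot_statevector('final')  # record the state as a labeled snapshot

    backend = Aer.get_backend('qasm_simulator')
    qobj = assemble(transpile(qc, backend), shots=1)
    result = backend.run(qobj).result()
    # Snapshots are stored per label; the exact layout may vary across Aer versions
    return result.data(0)['snapshots']['statevector']['final']
```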

If you like I could put in a PR to this repo to make some of the suggested changes, but below I've included a code snippet applying these suggestions to a manual implementation of your X-gate benchmark:

import numpy as np
from qiskit import *
import time
import matplotlib.pyplot as plt

def native_execute(circuit, backend, backend_options):
    experiment = transpile(circuit, backend)  # Transpile to simulator basis gates
    qobj = assemble(experiment, shots=1)  # Set execution shots to 1
    start = time.time()
    result = backend.run(qobj, backend_options=backend_options).result()
    stop = time.time()
    time_py_full = stop - start  # Total execution time in python
    time_py_run = result.time_taken  # Python measured run-only time, excluding initialization and validation of the Python result object
    time_cpp_full = result.metadata['time_taken']  # C++ measured total execution time excluding conversion of C++ results to Py results, and Py qobj to C++ qobj
    time_cpp_expr = result.results[0].time_taken  # C++ measured execution time of a single circuit (i.e. state init and gate application; excludes other setup overhead for config options etc.)
    return time_py_full, time_py_run, time_cpp_full, time_cpp_expr

def benchmark_x(qubit_range, samples, backend_options=None):
    
    backend = Aer.get_backend('qasm_simulator')

    ts_py_full = np.zeros(len(qubit_range))
    ts_py_run = np.zeros(len(qubit_range))
    ts_cpp_full = np.zeros(len(qubit_range))
    ts_cpp_exp = np.zeros(len(qubit_range))

    for i, nq in enumerate(qubit_range):
        qc = QuantumCircuit(nq)
        qc.x(0)

        t_py_full = 0
        t_py_run = 0
        t_cpp_full = 0
        t_cpp_exp = 0

        for _ in range(samples):
            t0, t1, t2, t3  = native_execute(qc, backend, backend_options)
            t_py_full += t0
            t_py_run += t1
            t_cpp_full += t2
            t_cpp_exp += t3

        # Average time in ns
        ts_py_full[i] = 1e9 * t_py_full / samples
        ts_py_run[i] = 1e9 * t_py_run / samples
        ts_cpp_full[i] = 1e9 * t_cpp_full / samples
        ts_cpp_exp[i] = 1e9 * t_cpp_exp / samples
    
    return ts_py_full, ts_py_run, ts_cpp_full, ts_cpp_exp


# Benchmark: X gate on qubit-0
backend_options = {
    # Force Statevector method so stabilizer (clifford) simulator isn't used
    "method": "statevector",
    
    # Disable parallelization
    "max_parallel_threads": 1,
    
    # Stop simulator truncating to 1-qubit circuit simulations
    "truncate_enable": False,  
}  

nqs = list(range(5, 26))
ts_py_full1, ts_py_run1, ts_cpp_full1, ts_cpp_expr1 = benchmark_x(nqs, 1000, backend_options)

plt.semilogy(nqs, ts_py_full1, 'o-', label='Python (full)')
plt.semilogy(nqs, ts_py_run1, 's-', label='Python (run-only)')
plt.semilogy(nqs, ts_cpp_full1, '^-', label='C++ (full)')
plt.semilogy(nqs, ts_cpp_expr1, 'd-', label='C++ (experiment-only)')
plt.legend()
plt.grid()
plt.savefig('aer_x_qasm_sv.pdf')

Here is an example of running the above on my laptop:

aer_x_qasm_sv.pdf

Hi @chriseclectic Thanks for your comment! It'd be nice if you could include the suggested changes in a PR so I could help review them (since I'm still not sure which kinds of options you would like to put into the qiskit benchmark), and I think we could also include multiple benchmarks for qiskit. Please feel free to open a PR first, and I can help you edit it.

In principle, as a demonstration of actual running time for simulation in practice, I think we should use the user interface as much as possible. It's fair to benchmark through C++ if the C++ interface is an official API for users, but if the interface is in Python, we should in principle benchmark the Python interface (through the standard Python benchmark framework pytest-benchmark). For single-gate benchmarks, yes, this is a test of the implementation of each instruction: it shows whether certain acceleration tricks, e.g. SIMD, are applied, and whether the simulation algorithm is correct. But we currently do not have a C++ benchmark setup. This was also discussed in QuEST's benchmark review: #5

This is what we do for the other frameworks, and qiskit is the only exception at the moment. I had to write a custom execute function since I was not familiar with the Qiskit simulation backend, and I found that the user interface spawns a task that shows up as constant time in pytest-benchmark measurements.

Regarding the stabilizer simulator: is this a fair comparison with the other simulators? Since the benchmark was mainly made for variational circuits (at least at the moment), all the other frameworks are benchmarking full-amplitude simulation. Maybe we should only compare stabilizer simulators against other stabilizer simulators?

I'd love to have more professional benchmark scripts from developers themselves for sure (which was what we did for other frameworks). Thanks!

I'm getting a PR ready that will update to the correct simulator backend, and will leave the timing as you currently have it set up with the native_execute function. We don't expose our C++ API directly, so going through Python is fine; the Python overhead will just appear as a constant run-time at low qubit numbers. Given how pytest works, I think the native_execute function you have is the best approach, since it bypasses the async job model for Qiskit providers.

As part of the PR I also enabled the QCBM circuit benchmarks, since these are supported. The "native" gate set of the simulator doesn't matter, since the Qiskit compiler handles conversion to supported basis gates (e.g. Rx -> u3 and Rz -> u1). Another point is that we just released an update to Qiskit Aer (version 0.4) a few days ago, which means the current native_execute function will no longer work (the internal method was renamed from _format_qobj_str to _format_qobj). This new release also includes our first version of a GPU-enabled simulator. Currently the GPU-enabled simulator is only available for Linux and can be installed separately with pip install qiskit-aer-gpu. I can also add this to the benchmark scripts for the QCBM circuit.

One question about the benchmarks, and in particular the GPU benchmarks: are you running the other configurations as single-precision or double-precision? We support both options for both CPU and GPU (the default is double-precision).
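For reference, a hedged sketch of the backend options these two points could translate to. "statevector_gpu" is an assumed method name for the Aer 0.4 GPU simulator, and the "precision" key is likewise an assumption that may not exist in every release, so treat both keys as version-dependent.

```python
# Hypothetical backend options; both keys are version-dependent assumptions.
gpu_backend_options = {
    "method": "statevector_gpu",  # assumed GPU statevector method (requires qiskit-aer-gpu on Linux)
    "precision": "single",        # assumed option name for the single- vs double-precision choice
}
```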

With regards to the stabilizer simulator, I agree it's not fair to compare it to a statevector simulator, because it can only simulate Clifford circuits. However, since our simulator will choose it automatically if the input circuit is Clifford, you need to explicitly specify running on the statevector method.

Here is the updated circuit benchmark for qiskit run on a server with a P100 GPU. Note I didn't re-run any of the other simulator benchmarks, just used the existing data in the repo.

pcircuit_relative