awestlake87/pyo3-asyncio

Is it possible to return a future without acquiring gil?

Opened this issue · 10 comments

๐Ÿ› Question

Hi, I am trying to read a file in Rust and return an awaitable to Python. I am able to use the sync function and return the response to Python, but when I implement it as a non-blocking function, it results in slower execution due to the locking and unlocking of the GIL.

Is it possible to do this without acquiring the GIL?

Here is my code snippet

use pyo3::prelude::*;
use pyo3::types::PyString;
use tokio::fs;

#[pyfunction]
pub fn async_static_files(py: Python, file_name: String) -> PyResult<PyObject> {
    pyo3_asyncio::tokio::into_coroutine(py, async move {
        // Read the file on the tokio runtime, without holding the GIL.
        let contents = fs::read(file_name).await.unwrap();
        let foo = String::from_utf8_lossy(&contents);
        // Re-acquire the GIL only to build the Python return value.
        Ok(Python::with_gil(|py| {
            let x = PyString::new(py, &foo);
            x.to_object(py)
        }))
    })
}

It's not possible to create and resolve a future without first involving Python (i.e. acquiring the GIL), because Python is single-threaded. This is a limitation of CPython at the moment.

Thank you @ShadowJonathan! Got it. 😄

@ShadowJonathan, sorry, I closed the issue by mistake. I was wondering if it would be possible (through a clever bypass, maybe) to not acquire the GIL every time, and instead acquire it only once?

What do you mean? Can you elaborate on the behaviour you're thinking about?

E.g. here, we have to acquire the GIL every time we call the function below:

    pyo3_asyncio::tokio::into_coroutine(py, async move {
        let contents = fs::read(file_name).await.unwrap();
        let foo = String::from_utf8_lossy(&contents);
        Ok(Python::with_gil(|py| {
            let x = PyString::new(py, &foo);
            x.to_object(py)
        }))
    })

Would it be possible for us to acquire it only once globally and then share it across function calls?

No, the GIL should only be held when you want to execute Python code through PyO3. Holding it for the entire time means that other Python code can't run at all.

The GIL is designed to be locked and released really quickly. You might want to reexamine the original premise that your code is running slowly because of the GIL. It might be something else that's causing your application to slow down.
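
As an illustration, here's a minimal sketch of the intended pattern (not code from this thread; the function name and error handling are my own): hold the GIL only at the Python boundary, and release it with Python::allow_threads while pure-Rust work runs so other Python threads can make progress.

use pyo3::exceptions::PyIOError;
use pyo3::prelude::*;

#[pyfunction]
fn read_file_blocking(py: Python, file_name: String) -> PyResult<String> {
    // allow_threads releases the GIL while the closure runs, so other
    // Python threads can make progress; it is re-acquired on return.
    py.allow_threads(move || {
        std::fs::read_to_string(&file_name)
            .map_err(|e| PyIOError::new_err(e.to_string()))
    })
}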

@awestlake87, I think I may have miscommunicated what I meant by "slowly". I meant it was performing slower than the synchronous counterpart, i.e. the one using the fs crate.

Here are the performance stats below:

➜  ~ oha -n 10000 http://localhost:5000/test_async_python
Summary:
  Success rate:	1.0000
  Total:	6.5560 secs
  Slowest:	0.0677 secs
  Fastest:	0.0088 secs
  Average:	0.0327 secs
  Requests/sec:	1525.3293

  Total data:	2.32 MiB
  Size/request:	243 B
  Size/sec:	361.97 KiB

Response time histogram:
  0.005 [64]   |
  0.011 [205]  |■
  0.016 [830]  |■■■■■■■
  0.021 [2518] |■■■■■■■■■■■■■■■■■■■■■■
  0.027 [3511] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.032 [1756] |■■■■■■■■■■■■■■■■
  0.037 [631]  |■■■■■
  0.043 [331]  |■■■
  0.048 [92]   |
  0.054 [57]   |
  0.059 [5]    |

Latency distribution:
  10% in 0.0246 secs
  25% in 0.0283 secs
  50% in 0.0323 secs
  75% in 0.0364 secs
  90% in 0.0416 secs
  95% in 0.0460 secs
  99% in 0.0547 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0115 secs, 0.0042 secs, 0.0150 secs
  DNS-lookup:	0.0001 secs, 0.0000 secs, 0.0005 secs

Status code distribution:
  [200] 10000 responses
➜  ~ oha -n 10000 http://localhost:5000/test
Summary:
  Success rate:	1.0000
  Total:	4.0756 secs
  Slowest:	0.0624 secs
  Fastest:	0.0031 secs
  Average:	0.0203 secs
  Requests/sec:	2453.6563

  Total data:	2.32 MiB
  Size/request:	243 B
  Size/sec:	582.26 KiB

Response time histogram:
  0.005 [129]  |
  0.011 [977]  |■■■■■■■
  0.016 [4325] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.021 [2264] |■■■■■■■■■■■■■■■■
  0.026 [1185] |■■■■■■■■
  0.032 [655]  |■■■■
  0.037 [244]  |■
  0.042 [126]  |
  0.047 [0]    |
  0.053 [1]    |
  0.058 [94]   |

Latency distribution:
  10% in 0.0135 secs
  25% in 0.0157 secs
  50% in 0.0184 secs
  75% in 0.0235 secs
  90% in 0.0304 secs
  95% in 0.0341 secs
  99% in 0.0443 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0034 secs, 0.0030 secs, 0.0040 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 10000 responses
➜  ~ oha -n 10000 http://localhost:5000/test_sync
Summary:
  Success rate:	1.0000
  Total:	3.6844 secs
  Slowest:	0.0591 secs
  Fastest:	0.0019 secs
  Average:	0.0184 secs
  Requests/sec:	2714.1283

  Total data:	2.32 MiB
  Size/request:	243 B
  Size/sec:	644.08 KiB

Response time histogram:
  0.005 [57]   |
  0.010 [2079] |■■■■■■■■■■■■■■■
  0.015 [4429] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.019 [495]  |■■■
  0.024 [365]  |■■
  0.029 [1342] |■■■■■■■■■
  0.034 [713]  |■■■■■
  0.039 [254]  |■
  0.044 [168]  |■
  0.048 [57]   |
  0.053 [41]   |

Latency distribution:
  10% in 0.0108 secs
  25% in 0.0119 secs
  50% in 0.0139 secs
  75% in 0.0264 secs
  90% in 0.0321 secs
  95% in 0.0362 secs
  99% in 0.0454 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0022 secs, 0.0011 secs, 0.0029 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 10000 responses

And here are the three otherwise-identical Python route handlers that call them:

@app.get("/test")
async def test():
    import os
    path = os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), "index.html"))
    return await async_static_files(path)

@app.get("/test_sync")
async def test_sync():
    import os
    path = os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), "index.html"))
    return static_file(path)

@app.get("/test_async_python")
async def test_async_python():
    import os
    path = os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), "index.html"))
    return await async_static_files_python(path)

Surprisingly, the test_sync route is the fastest here. The async file reading using tokio is faster than the async file reading in Python, but it was a bit surprising to me that the tokio implementation was slower than the sync one.

Oh, and here is the implementation of async_static_files_python:

async def async_static_files_python(filename):
    async with aiofiles.open(filename, mode='r') as f:
        contents = await f.read()
    return contents

Python async/await is not necessarily faster than sync code. Essentially, performance in Python usually boils down to how much Python it has to run and how thin the FFI layer is. Here's a more detailed explanation.

There's no magic wand for performance. Context matters a lot. Python is backed by a lot of native code already, so replacing parts of it with Rust may not make it any faster.

One thing I will say about your async_static_files function is that it's making at least 3 copies of the data: the first is the original file contents, the second is the UTF-8-decoded String foo, and the third is the PyString x. The reason I can tell is that each of these variables is passed by reference when the next copy is created. If you can find a way to pass them by value, then the buffer could potentially be reused instead of cloned which might make things faster.
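
For example, here's a hedged sketch of the by-value idea (adapting the snippet above; not code from this thread). String::from_utf8 consumes the Vec<u8>, so when the file is already valid UTF-8 the read buffer is reused for the String rather than copied, leaving only the final copy into the Python str:

let contents = fs::read(file_name).await.unwrap();
// from_utf8 takes the Vec by value: no copy on the happy path, falling
// back to a lossy (copying) decode only on invalid UTF-8.
let foo = String::from_utf8(contents)
    .unwrap_or_else(|e| String::from_utf8_lossy(e.as_bytes()).into_owned());
// PyString::new still copies once into the Python heap; that copy is
// unavoidable when the return type is str.
Ok(Python::with_gil(|py| PyString::new(py, &foo).to_object(py)))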

then the buffer could potentially be reused instead of cloned which might make things faster.

Thank you for the explanation @awestlake87 . Which buffer are you talking about ?

The internal memory buffer of the object: a low-level representation of that memory, which can then be reused efficiently in place.
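
If the caller can accept bytes instead of str, a further hypothetical variant (a sketch, not code from this thread) skips the UTF-8 decode copy entirely:

use pyo3::types::PyBytes;

// No String is ever built, so the decode copy disappears; PyBytes::new
// performs the single remaining copy into the Python heap.
Ok(Python::with_gil(|py| PyBytes::new(py, &contents).to_object(py)))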