[BUG] Memory growth when using PyGWalker with Streamlit
Opened this issue · 3 comments
Describe the bug
I observe RAM growth when using PyGWalker with the Streamlit framework. RAM usage grows on every page reload (on every app run). When using Streamlit without PyGWalker, RAM usage remains constant (flat, does not grow). It seems that memory is never released. This was observed indirectly: we tracked the growth locally (see the reproduction below), and we also see the same issue in an Azure web app, where RAM usage never declines.
To Reproduce
We tracked down the issue with an isolated Streamlit app that uses PyGWalker and memory_profiler (run with python -m streamlit run app.py):
# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
from memory_profiler import profile
from pygwalker.api.streamlit import StreamlitRenderer
@profile
def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    render = StreamlitRenderer(df)
    render.explorer()
app()
Observed output for a few consecutive reloads from the browser (press R, rerun):
Line #    Mem usage    Increment  Occurrences   Line Contents
    13    302.6 MiB     23.3 MiB            1   render.explorer()
    13    315.4 MiB     23.3 MiB            1   render.explorer()
    13    325.8 MiB     23.3 MiB            1   render.explorer()
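Note that memory_profiler's Increment column is the change relative to the previous line within a single run, so the growth between reruns has to be read from the Mem usage column instead. A small sketch of that arithmetic, using only the three readings above (the helper name rerun_deltas is made up for illustration):

```python
# Mem usage (MiB) reported at render.explorer() for three consecutive reruns.
readings_mib = [302.6, 315.4, 325.8]

def rerun_deltas(readings):
    """Return the growth between consecutive readings, rounded to 0.1 MiB."""
    return [round(later - earlier, 1) for earlier, later in zip(readings, readings[1:])]

deltas = rerun_deltas(readings_mib)
total_growth = round(readings_mib[-1] - readings_mib[0], 1)
print(deltas)        # [12.8, 10.4]
print(total_growth)  # 23.2
```

So in this sample the process grew by roughly 10 to 13 MiB per rerun, about 23 MiB over the two reruns shown.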
Expected behavior
RAM usage should remain at a constant level between app reruns.
Screenshots
The first screenshot shows user activity peaks (which cause CPU usage) and growing RAM usage (memory set).
The second screenshot shows memory profiling of the debug app.
Versions
streamlit 1.38.0
pygwalker 0.4.9.3
memory_profiler (latest)
python 3.9.10
browser: chrome 128.0.6613.138 (Official Build) (64-bit)
Tested locally on Windows 11
Thanks for your support!
Update
It seems I may have misinterpreted my observations. I continued to track the production app and did some more testing, and the results point away from PyGWalker, contrary to what I originally thought (possibly towards the Azure web app or other issues in our production code). I will run local tests with the memory profiler to see how it behaves over time, to rule out this observation as well.
I'm sorry for the disturbance. I will continue debugging with the new evidence.
Production app observations
A health endpoint has been added to our production version, and we now observe strange memory behaviour even without opening the PyGWalker explorer (PyGWalker was still imported as a package). The health check opens an empty Streamlit page every 5 minutes, and over the last 24 hours RAM usage grew gradually (in the image you can see used memory approaching 500 MB, without spikes, at a constant rate of increase related to the health calls).
Sample app deployment
I also tested a sample app deployment on Azure to rule out Azure resource virtualization issues, but the results did not confirm the original hypothesis.
Without PyGWalker
Sample app without PyGWalker on Azure
# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
import streamlit as st
def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    st.table(df)
app()
With PyGWalker
A sample app with PyGWalker was also deployed to Azure (it has been running for a few hours now). However, it behaves as expected and releases memory when objects are destroyed, which makes me think that the problem with our production version lies somewhere else.
Sample app with PyGWalker on Azure
import numpy as np
np.random.seed(seed=1)
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer
def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()
Hi @ChrnyaevEK, thanks for your feedback.
Please use the latest pygwalker version and try caching StreamlitRenderer; this may avoid the memory growth.
from pygwalker.api.streamlit import StreamlitRenderer
import pandas as pd
import streamlit as st
@st.cache_resource
def get_pyg_renderer() -> "StreamlitRenderer":
    df = pd.read_csv("xxx")
    return StreamlitRenderer(df)
renderer = get_pyg_renderer()
renderer.explorer()
There are several reasons why pygwalker memory grows:
- StreamlitRenderer(df) will parse the dataframe and infer the data types.
- render.explorer() will render the UI using an HTML iframe (version 0.4.9.8 uses a Streamlit custom component to render the pygwalker UI; the Streamlit component has optimized this part of the memory overhead).
- For data calculation communication, the calculated data needs to complete HTTP communication through the customized tornado endpoint (this will also be optimized in future versions).
In the near future, pygwalker will optimize the user experience of the Streamlit component. Thank you again for your feedback.
Hi @longxiaofei ! Thanks for your attention.
Caching
I'm afraid caching is not an option in this case: our data changes with every request, so the cached function would have to look more like this:
@st.cache_resource
def get_pyg_renderer(key: str) -> "StreamlitRenderer":
    df = pd.read_csv(key)
    ...
which is basically equivalent to no cache at all. ttl and max_entries will not help either.
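To illustrate the point about per-request keys, here is a rough analogy that uses the standard library's functools.lru_cache in place of st.cache_resource. This is an assumption made for illustration: the two caches are different implementations, and only the idea of looking results up by argument is shared. The function load_renderer and its string return value are hypothetical stand-ins for building a renderer.

```python
from functools import lru_cache

@lru_cache(maxsize=3)  # stand-in for st.cache_resource(max_entries=3)
def load_renderer(key: str) -> str:
    # Hypothetical placeholder for reading the data and building a renderer.
    return f"renderer for {key}"

# Every request brings a new key, so every call is a cache miss.
for request_id in range(10):
    load_renderer(f"data_{request_id}.csv")

info = load_renderer.cache_info()
print(info.hits, info.misses, info.currsize)  # 0 10 3
```

The entry limit only bounds how many results are kept at once; it does not let any request reuse earlier work.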
I did however test this approach and I'm still facing the same strange behavior.
import numpy as np
import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer
@st.cache_resource(max_entries=3, ttl=20)
def get_render(key: int):
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    return StreamlitRenderer(df)
def app():
    render = get_render(np.random.randint(1, 100))
    render.explorer()
app()
Running this app locally (Windows, as described in the first message, with pygwalker 0.4.9.3, since this is our production version) results in constantly growing memory (it occasionally seems to release an insignificant amount of memory, but usage does not return to its initial level).
RAM used by python process with streamlit server with cached pygwalker render
Other local tests
I also tested a few other code snippets locally to confirm that memory is eventually released, but it seems that it is not.
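As a complementary check that is independent of Streamlit, Python-level allocations can be compared across repeated calls with the standard library's tracemalloc. This is only a sketch under stated assumptions, not the method used for the graphs in this thread: the helper traced_after_each_run and the two workloads are hypothetical, and tracemalloc only sees memory allocated through Python's allocators, so memory held by native code or reflected only in the process RSS will not show up.

```python
import gc
import tracemalloc

def traced_after_each_run(func, runs=5):
    """Call func repeatedly and record traced Python memory (bytes) after each call."""
    tracemalloc.start()
    try:
        samples = []
        for _ in range(runs):
            func()
            gc.collect()  # drop unreachable objects before measuring
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()

# Hypothetical workloads, for demonstration only.
leaked = []

def leaky():
    leaked.append(bytearray(1_000_000))  # reference is kept, so it is never freed

def clean():
    bytearray(1_000_000)  # freed as soon as the call returns

leaky_growth = traced_after_each_run(leaky)
clean_growth = traced_after_each_run(clean)
print("leaky:", leaky_growth[-1] - leaky_growth[0], "bytes")
print("clean:", clean_growth[-1] - clean_growth[0], "bytes")
```

A steadily rising series suggests that references are being retained between runs, while a flat series suggests the growth seen in the process graph comes from somewhere tracemalloc cannot see.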
Bare Streamlit
Code
import numpy as np
import pandas as pd
import streamlit as st
def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    st.dataframe(df)
app()
Debug sequence
streamlit server start (python -m streamlit run ...) - 12:25 (memory increase due to initial object initialization)
restart (R) - 12:27 (memory increased)
restart (R) - 12:28 (memory increased)
restart (R) - 12:29 (memory increased)
restart (R) - 12:30 (memory increased)
restart (R) - 12:31 (memory did not react)
page close - 12:32 (memory decreased, but not to initial level)
stop - 12:58 (before stop a few slight memory decreases were observed without any external trigger)
Total test time: ~30min
Graph
See attached PDF
debug.pdf
Streamlit with PyGWalker
Code
import numpy as np
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer
def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()
Debug sequence
start - 13:09
restart - 13:11 (significant memory increase)
restart - 13:12 (memory increase)
restart - 13:13 (memory increase)
restart - 13:14 (memory increase)
restart - 13:15 (memory increase)
page close - 13:16 (memory decrease, not to initial values)
stop - 13:40 (no memory decrease observed)
Graph
See attached PDF
debug.pdf, same as above
Conclusions so far
Apps with and without PyGWalker both hold on to memory. PyGWalker allocates memory on every rerun, while bare Streamlit seems to eventually saturate (it may stop allocating a noticeable amount of memory).
There is no issue opening multiple Streamlit apps without PyGWalker, but as soon as PyGWalker is used we run out of memory (even with caching). This seems to be confirmed both locally and on Azure.
I still suspect an issue with PyGWalker on Streamlit (maybe PyGWalker just misuses Streamlit's caching mechanisms). Could you please check for steady memory growth when running a minimal PyGWalker app locally?
Thanks!