Pathy.exists() check might impact performance due to partial startswith check
yaelmi3 commented
env: python3.10, tested with GS
Consider the following case:
Pathy("gs://bucket/blob-not-there")
In this case we first check whether the exact blob exists. If it doesn't, we go on to check for a partial match against all files in the bucket using startswith. This introduces 2 possible issues:
- For a bucket with a large number of blobs (in our case, a bucket with hundreds of thousands of blobs), this check might take unreasonably long
- If there is a prefix match, `exists` will return `True`, but the match might not be the blob we are referring to
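Both issues can be shown without a real bucket. Below is a minimal sketch that models the described behavior on a plain list of blob names. `exists_with_prefix_fallback` is a hypothetical stand-in written for this illustration, not Pathy's actual implementation, and the blob names are made up.

```python
def exists_with_prefix_fallback(blob_names, key):
    """Hypothetical model of the behavior described above (not Pathy's code).

    Returns (found, scanned), where scanned counts the names visited
    by the prefix fallback.
    """
    # Step 1: exact match on the blob name.
    if key in blob_names:
        return True, 0
    # Step 2: fallback that walks the listing looking for a prefix match.
    scanned = 0
    for name in blob_names:
        scanned += 1
        if name.startswith(key):
            return True, scanned
    return False, scanned


# Made-up bucket contents: many unrelated blobs plus one with a similar name.
blobs = [f"data/file-{i}" for i in range(100_000)] + ["blob-not-there-but-similar"]

# Issue 2: reports True even though "blob-not-there" itself does not exist.
print(exists_with_prefix_fallback(blobs, "blob-not-there"))  # (True, 100001)

# Issue 1: a miss has to walk every name in the listing.
print(exists_with_prefix_fallback(blobs, "missing"))  # (False, 100001)
```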
Possible solutions
- Avoid looking for blob prefix
- Add a flag to `exists`, something like `exact_match`
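The flag option could look roughly like the sketch below. It works on a plain collection of blob names. The standalone `exists` function and its `exact_match` parameter are hypothetical and only illustrate the proposal; they are not Pathy's current signature.

```python
def exists(blob_names, key, exact_match=False):
    """Hypothetical sketch of the proposed flag (not Pathy's API)."""
    # An exact hit is always a match.
    if key in blob_names:
        return True
    # With exact_match=True, stop here and skip the prefix scan entirely.
    if exact_match:
        return False
    # Otherwise fall back to the prefix check described in this issue.
    return any(name.startswith(key) for name in blob_names)


blobs = {"blob-not-there-but-similar", "data/file-1"}

print(exists(blobs, "blob-not-there"))                    # True
print(exists(blobs, "blob-not-there", exact_match=True))  # False
print(exists(blobs, "data/file-1", exact_match=True))     # True
```

Defaulting `exact_match` to `False` in this sketch keeps the existing behavior for current callers, while callers with very large buckets can opt out of the scan.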
justindujardin commented
@yaelmi3 thanks for providing this review/analysis! 🙇
Could you construct a performance test that measures how slow it is and compare it with your suggested change? I can run it on all the cloud providers to get a sense of the impact if you write a script that works with the local-mode implementation.