Cannot run open_soma() from Europe server
Alex2975 opened this issue · 15 comments
Dear Authors,
Thank you so much for developing this tool. I tried to run open_soma() from servers in Europe, but I kept getting the following error. If I run open_soma() from US servers, I do not get this error. Could you please share some insights? I need to run it on the servers in Europe.
File "tiledb/libtiledb.pyx", line 3706, in tiledb.libtiledb.object_type
File "tiledb/libtiledb.pyx", line 348, in tiledb.libtiledb.check_error
File "tiledb/libtiledb.pyx", line 342, in tiledb.libtiledb._raise_ctx_err
File "tiledb/libtiledb.pyx", line 327, in tiledb.libtiledb._raise_tiledb_error
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/__schema/' and delimiter '/'[Error Type: 99] [HTTP Response Code: -1] : curlCode: 28, Timeout was reached
I just tried to replicate this on an AWS instance running in eu-north-1, but did not see this error. Here's what I did:
mamba create -yn cellxgene-census "python=3.11"
conda activate cellxgene-census
pip install ipython cellxgene-census
ipython
import cellxgene_census
census = cellxgene_census.open_soma()
census
The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.
<Collection 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/' (open for 'r') (2 items)
'census_info': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_info' (unopened)
'census_data': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data' (unopened)>
@Alex2975, is this roughly similar to what you did? And is it intermittent? I was able to at least connect to this back when I was in Germany, but that was on an institutional connection.
It would also be great if you could report some library versions here. You can do this by running:
import cellxgene_census, session_info
session_info.show(html=False, dependencies=True)
And paste the output here like:
-----
IPython 8.25.0
cellxgene_census 1.14.0
session_info 1.0.0
-----
aiobotocore 2.13.0
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.3.1
anndata 0.10.7
asttokens NA
attr 23.2.0
attrs 23.2.0
botocore 1.34.106
certifi 2024.06.02
charset_normalizer 3.3.2
cython_runtime NA
dateutil 2.9.0.post0
decorator 5.1.1
executing 2.0.1
frozenlist 1.4.1
fsspec 2024.5.0
h5py 3.11.0
idna 3.7
jedi 0.19.1
jmespath 1.0.1
llvmlite 0.42.0
multidict 6.0.5
natsort 8.4.0
numba 0.59.1
numpy 1.26.4
packaging 24.0
pandas 2.2.2
parso 0.8.4
prompt_toolkit 3.0.45
pure_eval 0.2.2
pyarrow 16.1.0
pyarrow_hotfix NA
pygments 2.18.0
pytz 2024.1
requests 2.32.3
s3fs 2024.5.0
scipy 1.13.1
six 1.16.0
somacore 1.0.11
stack_data 0.6.3
tiledb 0.29.0
tiledbsoma 1.11.3
traitlets 5.14.3
typing_extensions NA
urllib3 2.2.1
wcwidth 0.2.13
wrapt 1.16.0
yarl 1.9.4
-----
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-6.8.0-1008-aws-x86_64-with-glibc2.39
-----
Session information updated at 2024-06-03 17:34
Thank you for getting back to me so quickly, @ivirshup .
I followed your instructions, and still got the same timeout error.
File "tiledb/libtiledb.pyx", line 3706, in tiledb.libtiledb.object_type
File "tiledb/libtiledb.pyx", line 348, in tiledb.libtiledb.check_error
File "tiledb/libtiledb.pyx", line 342, in tiledb.libtiledb._raise_ctx_err
File "tiledb/libtiledb.pyx", line 327, in tiledb.libtiledb._raise_tiledb_error
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/__schema/' and delimiter '/'[Error Type: 99] [HTTP Response Code: -1] : curlCode: 28, Timeout was reached
Here is the session info:
session_info.show(html=False, dependencies=True)
cellxgene_census 1.14.0
session_info 1.0.0
aiobotocore 2.13.0
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.3.1
anndata 0.10.7
attr 23.2.0
attrs 23.2.0
botocore 1.34.106
certifi 2024.06.02
charset_normalizer 3.3.2
cython_runtime NA
dateutil 2.9.0.post0
frozenlist 1.4.1
fsspec 2024.5.0
h5py 3.11.0
idna 3.7
jmespath 1.0.1
llvmlite 0.42.0
multidict 6.0.5
natsort 8.4.0
numba 0.59.1
numpy 1.26.4
packaging 24.0
pandas 2.2.2
pyarrow 16.1.0
pyarrow_hotfix NA
pytz 2024.1
requests 2.32.3
s3fs 2024.5.0
scipy 1.13.1
six 1.16.0
somacore 1.0.11
tiledb 0.29.0
tiledbsoma 1.11.3
typing_extensions NA
urllib3 2.2.1
wrapt 1.16.0
yarl 1.9.4
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
I also tried: census = cellxgene_census.open_soma(mirror='s3-eu-north-1')
But I got this error:
.../python3.11/site-packages/cellxgene_census/_open.py", line 224, in open_soma
raise ValueError("Mirror not found.")
ValueError: Mirror not found.
Ah yeah, there aren't actually any mirrors up yet.
For the continued failures, is it possible there's a firewall on your end?
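One quick way to check the firewall hypothesis is a plain TCP probe to the S3 endpoint on port 443. The sketch below uses only the standard library; the helper name `can_reach` is hypothetical, and the host name is an assumption based on the standard virtual-hosted S3 endpoint form for the bucket in the traceback. Note that on a network that only allows traffic through an HTTP proxy, a direct probe like this can fail even when proxied access would work.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens the socket;
        # DNS failures, refusals, and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("cellxgene-census-public-us-west-2.s3.us-west-2.amazonaws.com")` returning False on the Europe servers but True on the US servers would point toward the network rather than the library.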
Yes, there is a firewall on the servers. Do you think that could cause the error? Could it be a timeout or an access error? If it is a timeout, how can I increase the waiting time?
That would definitely cause the error. It may just always block the connection, but it looks like the connection just takes a while for you.
Could you try:
import s3fs
fs = s3fs.S3FileSystem()
fs.ls("s3://cellxgene-census-public-us-west-2")
If this also doesn't work, you would probably need to ask your IT team about this.
Could you also confirm by trying this on a different network without the firewall?
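On the earlier question about increasing the waiting time: a minimal sketch, assuming that open_soma() accepts a `tiledb_config` dictionary and that the TileDB S3 settings `vfs.s3.connect_timeout_ms`, `vfs.s3.request_timeout_ms`, and `vfs.s3.connect_max_tries` apply to this connection. The helper names and the chosen values are hypothetical, and none of this will help if the firewall blocks the connection outright.

```python
from typing import Any, Dict


def s3_timeout_config(
    connect_ms: int = 60_000,
    request_ms: int = 60_000,
    max_tries: int = 10,
) -> Dict[str, Any]:
    """Build a TileDB config dict with more generous S3 timeouts (values are illustrative)."""
    return {
        "vfs.s3.connect_timeout_ms": connect_ms,
        "vfs.s3.request_timeout_ms": request_ms,
        "vfs.s3.connect_max_tries": max_tries,
    }


def open_census_with_timeouts(**timeouts: int):
    """Open the Census with the relaxed timeouts (requires network access)."""
    # Imported lazily so that building the config does not require the package.
    import cellxgene_census

    return cellxgene_census.open_soma(tiledb_config=s3_timeout_config(**timeouts))
```

Calling `open_census_with_timeouts()` and still seeing `curlCode: 28` would suggest the requests never get through at all, rather than being merely slow.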
Thank you, @ivirshup. When I ran fs.ls as you described, without the firewall, I got the error:
PermissionError: Access Denied.
When I ran aws s3 ls with --no-sign-request, I did get results back (the same with or without the firewall):
aws s3 ls --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/
PRE 2023-05-15/
PRE 2023-07-25/
PRE 2023-10-30/
PRE 2023-12-04/
PRE 2023-12-06/
PRE 2023-12-15/
PRE 2024-04-29/
PRE 2024-05-06/
PRE 2024-05-13/
PRE 2024-05-20/
PRE 2024-05-27/
2023-12-13 10:28:59 190 mirrors.json
2024-05-28 07:11:43 3642 release.json
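One possible explanation for the difference, not confirmed in this thread: `aws s3 ls --no-sign-request` sends unsigned requests, while `s3fs.S3FileSystem()` without arguments may sign requests with whatever credentials it finds locally, which can produce Access Denied on a public bucket. The sketch below is the s3fs equivalent of the unsigned CLI call, using `anon=True`; the helper name `list_public_prefix` and the optional `fs` parameter are hypothetical conveniences for this example.

```python
from typing import Any, List, Optional

BUCKET_PREFIX = "cellxgene-census-public-us-west-2/cell-census/"


def list_public_prefix(path: str = BUCKET_PREFIX, fs: Optional[Any] = None) -> List[str]:
    """List `path` on S3 using unsigned (anonymous) requests."""
    if fs is None:
        # Imported lazily; anon=True tells s3fs not to sign requests,
        # similar in spirit to --no-sign-request in the AWS CLI.
        import s3fs

        fs = s3fs.S3FileSystem(anon=True)
    return fs.ls(path)
```

If `list_public_prefix()` returns the same prefixes as the CLI output above, the earlier PermissionError came from local credentials rather than from the network.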
Hm. That's odd. And you're definitely not passing any other arguments here, and consistently get a timeout? I may ping a couple more people to see if there's something they recognize here.
And cellxgene_census.open_soma() still doesn't work without the firewall?
Could you also show the full traceback? It should have enough to see the line you called before getting this error.
@ivirshup and @pablo-gar, thank you so much for helping me. I still cannot access open_soma() from the Europe cluster that I use, but I am currently accessing it from a USA cluster. We are internally investigating the network proxy connections to see if anything is blocked from inside. Please close this issue if you would. I am all good for now calling the API from the USA side. Thank you.
@pablo-gar, may I ask you a question regarding Geneformer? Could you please share, if possible, the differences between the datasets used for training Geneformer and the datasets listed in cellxgene_census? Are the Geneformer training datasets all included in cellxgene_census, or do they only partially overlap? Thank you so much.
Also, @pablo-gar, could you please comment on when the new LTS data release will happen? The current LTS is from 2023-12-15. You mentioned that a new version will be released and that the "normalized" expression data (X_name="normalized") will be included in it. Will that be released soon? Thank you very much for the help.
If you are referring to the pre-trained Geneformer model, it was pre-trained with Genecorpus. I understand there is some non-significant level of overlap between Census and Genecorpus, but I recommend you reach out directly to the Geneformer developers for more details.
The new LTS was published this week; you can access it via census_version = "stable" or census_version = "2024-07-01".
I'm closing this ticket since the original issue doesn't seem to be a problem with our API
Thank you very much, @pablo-gar .