Cannot run open_soma() from Europe server

Question

Alex2975 opened this issue 7 months ago · 15 comments

Dear Authors,

Thank you so much for developing this tool. I tried to call open_soma() from Europe servers, but I kept getting the following error. When I run open_soma() from US servers, the error does not occur. Could you please share some insights? I need it to run on the Europe servers.

File "tiledb/libtiledb.pyx", line 3706, in tiledb.libtiledb.object_type
File "tiledb/libtiledb.pyx", line 348, in tiledb.libtiledb.check_error
File "tiledb/libtiledb.pyx", line 342, in tiledb.libtiledb._raise_ctx_err
File "tiledb/libtiledb.pyx", line 327, in tiledb.libtiledb._raise_tiledb_error
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/__schema/' and delimiter '/'[Error Type: 99] [HTTP Response Code: -1] : curlCode: 28, Timeout was reached

Answer 1 · 2024-06-03T17:36:17.000Z

I just tried to replicate this on an AWS instance running on eu-north-1, but did not see this error. Here's what I did:

mamba create -yn cellxgene-census "python=3.11"
conda activate cellxgene-census
pip install ipython cellxgene-census
ipython

import cellxgene_census
census = cellxgene_census.open_soma()
census

The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.

<Collection 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/' (open for 'r') (2 items)
    'census_info': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_info' (unopened)
    'census_data': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data' (unopened)>

@Alex2975, is this roughly similar to what you did? And is it intermittent? I was able to at least connect to this back when I was in Germany, but that was on an institutional connection.

It would also be great if you could report some library versions here. You can do this by running:

import cellxgene_census, session_info
session_info.show(html=False, dependencies=True)

And paste the output here like:

-----
IPython             8.25.0
cellxgene_census    1.14.0
session_info        1.0.0
-----
aiobotocore         2.13.0
aiohttp             3.9.5
aioitertools        0.11.0
aiosignal           1.3.1
anndata             0.10.7
asttokens           NA
attr                23.2.0
attrs               23.2.0
botocore            1.34.106
certifi             2024.06.02
charset_normalizer  3.3.2
cython_runtime      NA
dateutil            2.9.0.post0
decorator           5.1.1
executing           2.0.1
frozenlist          1.4.1
fsspec              2024.5.0
h5py                3.11.0
idna                3.7
jedi                0.19.1
jmespath            1.0.1
llvmlite            0.42.0
multidict           6.0.5
natsort             8.4.0
numba               0.59.1
numpy               1.26.4
packaging           24.0
pandas              2.2.2
parso               0.8.4
prompt_toolkit      3.0.45
pure_eval           0.2.2
pyarrow             16.1.0
pyarrow_hotfix      NA
pygments            2.18.0
pytz                2024.1
requests            2.32.3
s3fs                2024.5.0
scipy               1.13.1
six                 1.16.0
somacore            1.0.11
stack_data          0.6.3
tiledb              0.29.0
tiledbsoma          1.11.3
traitlets           5.14.3
typing_extensions   NA
urllib3             2.2.1
wcwidth             0.2.13
wrapt               1.16.0
yarl                1.9.4
-----
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-6.8.0-1008-aws-x86_64-with-glibc2.39
-----
Session information updated at 2024-06-03 17:34

Answer 2 · 2024-06-03T19:05:33.000Z

Thank you for getting back to me so quickly, @ivirshup .
I followed your instructions, and still got the same timeout error.

File "tiledb/libtiledb.pyx", line 3706, in tiledb.libtiledb.object_type
File "tiledb/libtiledb.pyx", line 348, in tiledb.libtiledb.check_error
File "tiledb/libtiledb.pyx", line 342, in tiledb.libtiledb._raise_ctx_err
File "tiledb/libtiledb.pyx", line 327, in tiledb.libtiledb._raise_tiledb_error
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/__schema/' and delimiter '/'[Error Type: 99] [HTTP Response Code: -1] : curlCode: 28, Timeout was reached

Here is the session info:

session_info.show(html=False, dependencies=True)

cellxgene_census 1.14.0
session_info 1.0.0

aiobotocore 2.13.0
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.3.1
anndata 0.10.7
attr 23.2.0
attrs 23.2.0
botocore 1.34.106
certifi 2024.06.02
charset_normalizer 3.3.2
cython_runtime NA
dateutil 2.9.0.post0
frozenlist 1.4.1
fsspec 2024.5.0
h5py 3.11.0
idna 3.7
jmespath 1.0.1
llvmlite 0.42.0
multidict 6.0.5
natsort 8.4.0
numba 0.59.1
numpy 1.26.4
packaging 24.0
pandas 2.2.2
pyarrow 16.1.0
pyarrow_hotfix NA
pytz 2024.1
requests 2.32.3
s3fs 2024.5.0
scipy 1.13.1
six 1.16.0
somacore 1.0.11
tiledb 0.29.0
tiledbsoma 1.11.3
typing_extensions NA
urllib3 2.2.1
wrapt 1.16.0
yarl 1.9.4

Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17

Answer 3 · 2024-06-03T19:09:22.000Z

I also tried: census = cellxgene_census.open_soma(mirror='s3-eu-north-1')
But I got this error:
.../python3.11/site-packages/cellxgene_census/_open.py", line 224, in open_soma
raise ValueError("Mirror not found.")
ValueError: Mirror not found.

Answer 4 · 2024-06-03T21:07:53.000Z

Ah yeah, there aren't actually any mirrors up yet.

For the continued failures, is it possible there's a firewall on your end?
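
A curlCode 28 with HTTP response code -1 means the client never received an HTTP response, which is consistent with a firewall silently dropping packets. A minimal sketch for checking raw TCP reachability from the affected machine follows. It assumes the regional endpoint is s3.us-west-2.amazonaws.com on port 443 (inferred from the bucket name, not stated in this thread), and probe_tcp is a hypothetical helper, not part of any library.

```python
import socket


def probe_tcp(host, port=443, timeout=5.0):
    """Try to open a TCP connection and report what happened.

    Returns "reachable", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # No answer at all within the timeout: typical of a firewall
        # that drops packets instead of rejecting them.
        return "timeout"
    except OSError as exc:
        # Covers DNS failures, connection refused, unreachable network.
        return f"error: {exc}"


def probe_census_endpoint():
    # Assumed endpoint for a bucket hosted in us-west-2.
    return probe_tcp("s3.us-west-2.amazonaws.com", 443, timeout=10.0)
```

If probe_census_endpoint() returns "timeout" on the Europe servers but "reachable" elsewhere, that points at the network path rather than at cellxgene_census. A proxy that only allows HTTP traffic through a configured proxy host would also show up as a failure here.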

Answer 5 · 2024-06-03T21:12:28.000Z

Yes, there is a firewall on the servers. Do you think that could be causing the error? Would it show up as a timeout or as an access error? If it is a timeout, how can I increase the waiting time?

Answer 6 · 2024-06-03T21:31:08.000Z

That would definitely cause the error. The firewall may just always block the connection, but from your side it just looks like the connection takes a while.

Could you try:

import s3fs

fs = s3fs.S3FileSystem()
fs.ls("s3://cellxgene-census-public-us-west-2")

If this also doesn't work, you would probably need to ask your IT team about this.

Could you also confirm by trying this on a different network without the firewall?
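
On the earlier question of increasing the waiting time: the S3 timeouts are TileDB settings, and a sketch of raising them is shown here. It assumes that open_soma() accepts a tiledb_config dictionary and forwards it to TileDB. The vfs.s3.* keys are TileDB configuration parameters; the helper names and the chosen values are hypothetical. Longer timeouts only help if the connection is slow, not if the firewall blocks it outright.

```python
def s3_timeout_config(connect_timeout_ms=60_000, request_timeout_ms=60_000,
                      max_tries=10, proxy_host=None, proxy_port=None):
    """Build a TileDB config dict with more patient S3 settings.

    The values are illustrative, not recommendations from the Census team.
    """
    config = {
        "vfs.s3.connect_timeout_ms": connect_timeout_ms,
        "vfs.s3.request_timeout_ms": request_timeout_ms,
        "vfs.s3.connect_max_tries": max_tries,
    }
    # Optional: route S3 traffic through an institutional proxy.
    if proxy_host is not None:
        config["vfs.s3.proxy_host"] = proxy_host
    if proxy_port is not None:
        config["vfs.s3.proxy_port"] = proxy_port
    return config


def open_census_patiently(census_version="2023-12-15", **timeout_kwargs):
    # Imported lazily so the config helper works without the package.
    import cellxgene_census

    # Assumption: open_soma() takes tiledb_config and passes it to TileDB.
    return cellxgene_census.open_soma(
        census_version=census_version,
        tiledb_config=s3_timeout_config(**timeout_kwargs),
    )
```

For example, open_census_patiently(connect_timeout_ms=120_000) would wait up to two minutes per connection attempt, under the assumptions above.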

Answer 7 · 2024-06-04T01:39:22.000Z

Thank you, @ivirshup. When I ran fs.ls as you described, without the firewall, I got the error:
PermissionError: Access Denied.
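
One plausible reading of this Access Denied, not confirmed in this thread, is that s3fs signed the request with credentials found on the machine instead of connecting anonymously. A sketch of the anonymous variant follows. It relies on the anon=True option of s3fs.S3FileSystem and uses the cell-census/ prefix of the bucket; census_builds and open_anonymous_fs are hypothetical helper names.

```python
import re

# Bucket and prefix taken from the error message in this thread.
CENSUS_PREFIX = "cellxgene-census-public-us-west-2/cell-census"


def open_anonymous_fs():
    # Imported lazily; anon=True sends unsigned requests, which is
    # what public buckets expect.
    import s3fs

    return s3fs.S3FileSystem(anon=True)


def census_builds(fs, prefix=CENSUS_PREFIX):
    """Return the build-date folders (YYYY-MM-DD) listed under prefix."""
    names = [path.rstrip("/").rsplit("/", 1)[-1] for path in fs.ls(prefix)]
    return sorted(n for n in names if re.fullmatch(r"\d{4}-\d{2}-\d{2}", n))
```

Calling census_builds(open_anonymous_fs()) on the affected machine performs an unsigned listing through s3fs. If that succeeds while the plain fs.ls call fails, local AWS credentials are the likely cause of the PermissionError.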

Answer 8 · 2024-06-04T01:41:10.000Z

When I ran aws s3 ls with --no-sign-request, I did get results back (the output was the same with or without the firewall):

aws s3 ls --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/

                       PRE 2023-05-15/
                       PRE 2023-07-25/
                       PRE 2023-10-30/
                       PRE 2023-12-04/
                       PRE 2023-12-06/
                       PRE 2023-12-15/
                       PRE 2024-04-29/
                       PRE 2024-05-06/
                       PRE 2024-05-13/
                       PRE 2024-05-20/
                       PRE 2024-05-27/

2023-12-13 10:28:59 190 mirrors.json
2024-05-28 07:11:43 3642 release.json

Answer 9 · 2024-06-04T18:07:15.000Z

Hm. That's odd. And you're definitely not passing any other arguments here, and consistently get a timeout? I may ping a couple more people to see if there's something they recognize here.

And cellxgene_census.open_soma() still doesn't work without the firewall?

Could you also show the full traceback? It should have enough to see the line you called before getting this error.

Answer 10 · 2024-06-16T15:29:10.000Z

@Alex2975 seems like you are able to access Census now, is that correct?

see #1195

Answer 11 · 2024-06-17T16:14:31.000Z

@ivirshup and @pablo-gar, thank you so much for helping me. I still cannot call open_soma() from the Europe cluster that I use, but I am currently able to access it from the USA cluster. We are internally investigating the network proxy connections to see if anything is blocked from inside. Please close this issue if you would. I am all good for now calling the API from the USA side. Thank you.

Answer 12 · 2024-07-10T22:12:03.000Z

@pablo-gar, may I ask you a question regarding Geneformer? Could you please share, if possible, the differences between the datasets used for training Geneformer and the datasets listed in cellxgene_census? Are the Geneformer training datasets all included in cellxgene_census, or do they only partially overlap? Thank you so much.

Also, @pablo-gar, could you please comment on when the new LTS data release will happen? The current LTS is from 2023-12-15. You mentioned that a new version will be released and that the X_name="normalized" expression data will be part of it. Will that be released soon? Thank you very much for the help.

Answer 13 · 2024-07-12T01:43:35.000Z

@Alex2975

If you are referring to the pre-trained Geneformer model, it was pre-trained with Genecorpus. I understand there is some non-significant level of overlap between Census and Genecorpus, but I recommend you reach out directly to the Geneformer developers for more details.

The new LTS was published this week. You can access it via census_version = "stable" or census_version = "2024-07-01".

Answer 14 · 2024-07-12T01:44:10.000Z

I'm closing this ticket since the original issue doesn't seem to be a problem with our API

Answer 15 · 2024-07-12T02:10:47.000Z

Thank you very much, @pablo-gar .
