skrub-data/skrub

Failing fetch ken embeddings test


This test has started failing, saying it does not find the footer of the downloaded parquet file.

I cannot reproduce this locally; I wonder whether the file is only partly downloaded, but it is odd that it only seems to happen in the Python 3.11 environment.
edit: actually it happens randomly across the several macOS CI runs, either during the first or the second fetch_ken_embeddings call in the test, and while loading varying parts of the parquet table. That would confirm it is a failing download. We should make the download more robust, compare a checksum so we get a better error message when it fails, and mock downloads in the tests. In the meantime I would suggest that we skip this test, as it is bound to be flaky and is currently causing CI to fail in all PRs that are unrelated to KEN embeddings. WDYT @jovan-stojanovic ?
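The checksum comparison mentioned above could look roughly like this. A minimal stdlib-only sketch; the helper names `sha256_of` and `check_download` are illustrative, not skrub's actual API:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks
    so large parquet files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_sha256):
    """Raise a clear, actionable error when the downloaded file is
    truncated or corrupted, instead of a cryptic 'missing footer' error."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise OSError(
            f"Corrupted download for {path}: expected SHA-256 "
            f"{expected_sha256}, got {actual}. Please retry the download."
        )
```

A partially downloaded file would then fail with an explicit "corrupted download" message rather than a parquet parsing error deep inside pyarrow.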

In any case, we should avoid tests that rely on network resources and should mock the downloads, especially in tests that download large files. So we need a quick fix for this test, but in the medium term we should mock downloads throughout the main test suite.
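The mocking pattern could look like the sketch below, using `unittest.mock.patch.object` to swap the network call for a tiny in-memory fixture. The `KenFetcher` class and `first_entity` function are hypothetical stand-ins for illustration, not skrub's real code:

```python
from unittest import mock


class KenFetcher:
    """Stand-in for a dataset fetcher that normally downloads large
    parquet files over the network (hypothetical, for illustration)."""

    def fetch(self):
        raise RuntimeError("network access should not happen in unit tests")


def first_entity(fetcher):
    """Code under test: consumes whatever table the fetcher returns,
    unaware of whether the data came from the network or a fixture."""
    table = fetcher.fetch()
    return table["Entity"][0]


fetcher = KenFetcher()
# In the test, replace the network call with a tiny in-memory fixture:
fixture = {"Entity": ["wikidata:Q1"], "X0": [0.1]}
with mock.patch.object(fetcher, "fetch", return_value=fixture):
    assert first_entity(fetcher) == "wikidata:Q1"
# Outside the `with` block the original (network) implementation is restored,
# so an accidental real download would fail loudly in the test environment.
```

This keeps the test suite fast and deterministic, and a single separate, explicitly marked test (or a nightly job) can still exercise the real download path.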

The KEN embedding tests download 864M of data and take over a minute to run locally, so we probably want to start with those.
A good part of the data seems to be the embedding type ID (file 39266300), which is 652M; @jovan-stojanovic do you know if this is one of the files that were recently updated on figshare?

I am also wondering about the way files are downloaded: the remote file is opened with pyarrow, then chunks of it are loaded into pandas dataframes and written to separate files with pandas.to_parquet.
However, in all cases the whole file is downloaded, and when using the skrub fetcher the whole file is loaded into memory. So why do we use this approach rather than streaming the download into a single local file?
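For reference, streaming the download into a single local file needs only the stdlib; a minimal sketch (the function name and chunk size are illustrative, not skrub's API):

```python
import shutil
import urllib.request


def stream_download(url, dest_path, chunk_size=1 << 20):
    """Stream a remote file to disk in fixed-size chunks, so peak memory
    stays around `chunk_size` regardless of the file's total size."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest_path
```

This avoids both the pyarrow round-trip through pandas dataframes and holding the full 652M file in memory; the parquet file can then be opened lazily from disk afterwards.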