apache/arrow

[R] read_parquet from s3 is slow and often flakey

1beb opened this issue · 14 comments

1beb commented

Describe the bug, including details regarding any error messages, version, and platform.

R version 4.3.0
Ubuntu 20.04.6

Sometimes it fails, other times it hangs, but it always takes longer than the equivalent read using aws.s3, and it's unclear why. Here are some examples of the errors I see; otherwise it's just a significant delay, as if it's hitting a timeout somewhere and retrying:

Error: IOError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
Error in url(file, open = "rb") : URL scheme unsupported by this method

Reprex:

The two scripts below, in my example, point to a small 5 MB file. I can run this on a machine with arrow 8.0.0 or a machine with arrow 12.0.0 and the result is the same (a significant delay with s3:// addresses).

library(arrow)
library(dotenv)
library(tictoc)

load_dot_env()

tic(); df <- read_parquet("s3://somefile.parquet"); toc()
# 70.211 sec elapsed

vs.

library(arrow)
library(aws.s3)
library(dotenv)
library(tictoc)

load_dot_env()

load_aws <- function(f) {
    read_parquet(
        get_object(
            object = f,
            bucket = "s3://bucket-name",
            region = "eu-west-1"
        )
    )
}

tic(); df <- load_aws("s3://somefile.parquet"); toc()
# 1.394 sec elapsed

On the same machines I can run a similar test in Python and the delay is not there, which leads me to believe it is something specific to the R implementation. Happy to help debug! Tell me what you need.

Component(s)

R

Thanks for the report. One difference between your arrow and aws.s3 code is that the arrow code first has to find the region(s) the object exists in, so it will always be a bit slower. But 70 sec is extreme. Can you add ?region=eu-west-1 to your S3 URI, which I think will skip the step where we try to find the region? Like:

read_parquet("s3://arrow-s3-testing-eu-west-1/starwars.parquet?region=eu-west-1")

That said, the below error seems like something else:

Error in url(file, open = "rb") : URL scheme unsupported by this method

I first tested to make sure read_parquet was working for me, which it was, but I noticed you're in eu-west-1, so I made a bucket in eu-west-1, tried again, and got the above error. I knew I hadn't set a bucket ACL and thought that might be related, so I stepped through it in the debugger and found the issue: read_parquet is eating an AWS authentication error in this code:

arrow/r/R/io.R

Lines 253 to 260 in 8b6688a

file <- tryCatch(
  {
    fs_and_path <- FileSystem$from_uri(file)
    fs_and_path$fs$OpenInputFile(fs_and_path$path)
  },
  error = function(e) {
    MakeRConnectionInputStream(url(file, open = "rb"))
  }
)

> fs_and_path$fs$OpenInputFile(fs_and_path$path)
Error: IOError: When reading information for key 'starwars.parquet' in bucket 'arrow-s3-testing-eu-west-1': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

Once I set an ACL the error went away and reads were again fast (1-2 sec). So I guess my question is why you might be intermittently getting an error that read_parquet is eating, and what is that error? Unfortunately #35398 isn't merged just yet, but once it is it might help. No matter what we find here, it would be good to tweak the code so errors aren't eaten.
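
For what it's worth, one hypothetical shape such a tweak could take (a sketch only, not necessarily what #36726 does) is to fall back to url() only for schemes that url() can actually handle, and otherwise re-signal the original filesystem error:

file <- tryCatch(
  {
    fs_and_path <- FileSystem$from_uri(file)
    fs_and_path$fs$OpenInputFile(fs_and_path$path)
  },
  error = function(e) {
    # Only fall back to a base-R connection for schemes url() understands;
    # otherwise surface the original error (e.g. an AWS ACCESS_DENIED)
    # instead of eating it.
    if (grepl("^(http|https|ftp)://", file)) {
      MakeRConnectionInputStream(url(file, open = "rb"))
    } else {
      stop(e)
    }
  }
)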

1beb commented

Can confirm that this was wildly faster when I set the parameter on the S3 path. Very good catch! A general comment here is that I am setting an .env var for AWS_DEFAULT_REGION along with the access key ID and secret key, so perhaps trying the default region first would be useful. If you are familiar with where this happens, I'm happy to submit a PR.

What happens when you remove AWS_DEFAULT_REGION, keep just the access/secret keys in your .env, and try the S3 URI as you did initially (i.e. without ?region)? I'm curious whether AWS_DEFAULT_REGION is interacting badly with our region-guessing code.

1beb commented

It's the same (delay). I have two environments, only one of which has AWS_DEFAULT_REGION set, and it was slow on both without the ?region= parameter.

Okay, that's good to know. The region lookup code is triggered in the underlying Arrow C++ implementation around https://github.com/amoeba/arrow/blob/8b6688acca6d14ea69c533c150b2cc05a1403f91/cpp/src/arrow/filesystem/s3fs.cc#L343 and below, but I'm not sure what change we actually want here yet. It's the weekend, so others will probably see this next week. Despite what we've learned here, timeouts and 70 sec delays are still very strange.

1beb commented

Oh, apologies, I don't have much experience with that language; I thought the check might happen in R. Strange that it behaves so differently depending on the SDK.

Are you accessing a public file? My guess is that you're running into the default AWS credentials checker. By default, even if the file is public, AWS is going to try and figure out what credentials you are using before making any request.

Unfortunately, if you don't have a credentials file stored somewhere that it expects (e.g. ~/.aws/credentials), then it will start trying some unpleasant alternatives. The main problematic one is a request to 169.254.169.254, a special link-local IP address that EC2 instances are configured to serve instance metadata from (e.g. in case your EC2 instance has credentials assigned to it). Depending on your system's networking configuration, it may take some time for this request to fail.

So, for example, if I delete my credentials file, turn on AWS trace logging, and run your test, the first option takes 13 seconds and my logs are filled with messages like this:

[INFO] 2023-06-13 17:28:26.917 ProcessCredentialsProvider [140191537006528] Failed to find credential process's profile: default
[TRACE] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Checking HOME for the home directory.
[DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Environment value for variable HOME is /home/pace
[DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Home directory is missing the final / appending one to normalize
[DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Final Home Directory is /home/pace/
[DEBUG] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Loading token from: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
[DEBUG] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Preparing to load token from: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
[INFO] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Unable to open token file on path: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
[TRACE] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Access token for SSO not available
[DEBUG] 2023-06-13 17:28:26.917 InstanceProfileCredentialsProvider [140191537006528] Checking if latest credential pull has expired.
[INFO] 2023-06-13 17:28:26.917 InstanceProfileCredentialsProvider [140191537006528] Credentials have expired attempting to re-pull from EC2 Metadata Service.
[TRACE] 2023-06-13 17:28:26.917 EC2MetadataClient [140191537006528] Getting default credentials for ec2 instance from http://169.254.169.254
[TRACE] 2023-06-13 17:28:26.917 EC2MetadataClient [140191537006528] Retrieving credentials from http://169.254.169.254/latest/meta-data/iam/security-credentials
[TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Making request to http://169.254.169.254/latest/meta-data/iam/security-credentials
[TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Including headers:
[TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] host: 169.254.169.254
[TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] user-agent: aws-sdk-cpp/1.10.13 Linux/5.19.0-43-generic x86_64 GCC/10.4.0
[DEBUG] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Attempting to acquire curl connection.
[INFO] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Connection has been released. Continuing.
[DEBUG] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Returning connection handle 0x563be150dc20
[DEBUG] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Obtained connection handle 0x563be150dc20
[DEBUG] 2023-06-13 17:28:26.918 CURL [140191537006528] (Text)   Trying 169.254.169.254:80...
[DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) After 1000ms connect time, move on!
[DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) connect to 169.254.169.254 port 80 failed: Connection timed out
[DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) Connection timeout after 1001 ms
[DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) Closing connection 0
[ERROR] 2023-06-13 17:28:27.919 CurlHttpClient [140191537006528] Curl returned error code 28 - Timeout was reached

There is a way to specify anonymous credentials when creating an S3 filesystem (which should prevent the SDK from trying to recalculate credentials on each request), but I don't know the R library well enough to know how to plumb that through to something like read_parquet.
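
For reference, a rough sketch of what that plumbing could look like from the R side, assuming a hypothetical public bucket and object key (S3FileSystem$create() accepts anonymous and region arguments, and the resulting path can be handed to read_parquet()):

library(arrow)

# Sketch: skip the credential-provider chain for a public object by creating
# the filesystem explicitly with anonymous credentials and a pinned region.
# "some-public-bucket/somefile.parquet" is a placeholder.
fs <- S3FileSystem$create(anonymous = TRUE, region = "eu-west-1")
df <- read_parquet(fs$path("some-public-bucket/somefile.parquet"))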

1beb commented

I am accessing a file that is not public. Credentials are loaded via the environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID. It works much better now that I've added the ?region parameter.

Ok. I hadn't noticed the default region was configured for me in ~/.aws/config (not ~/.aws/credentials). If I remove that file then I see a delay even when I have credentials. It appears to be the same basic idea: AWS is attempting to get the region from the EC2 metadata address and, depending on how your network is configured, it may take a long time to realize that doesn't work:

[TRACE] 2023-06-13 18:55:34.572 EC2MetadataClient [139814276814784] Getting current region for ec2 instance
[TRACE] 2023-06-13 18:55:34.572 EC2MetadataClient [139814276814784] Retrieving credentials from http://169.254.169.254/latest/meta-data/placement/availability-zone
[TRACE] 2023-06-13 18:55:34.572 CurlHttpClient [139814276814784] Making request to http://169.254.169.254/latest/meta-data/placement/availability-zone
[TRACE] 2023-06-13 18:55:34.572 CurlHttpClient [139814276814784] Including headers:
[TRACE] 2023-06-13 18:55:34.572 CurlHttpClient [139814276814784] host: 169.254.169.254
[TRACE] 2023-06-13 18:55:34.573 CurlHttpClient [139814276814784] user-agent: aws-sdk-cpp/1.10.13 Linux/5.19.0-43-generic x86_64 GCC/10.4.0
[TRACE] 2023-06-13 18:55:34.573 CurlHttpClient [139814276814784] x-aws-ec2-metadata-token: 
[DEBUG] 2023-06-13 18:55:34.573 CurlHandleContainer [139814276814784] Attempting to acquire curl connection.
[INFO] 2023-06-13 18:55:34.573 CurlHandleContainer [139814276814784] Connection has been released. Continuing.
[DEBUG] 2023-06-13 18:55:34.573 CurlHandleContainer [139814276814784] Returning connection handle 0x5615e85f8ff0
[DEBUG] 2023-06-13 18:55:34.573 CurlHttpClient [139814276814784] Obtained connection handle 0x5615e85f8ff0
[DEBUG] 2023-06-13 18:55:34.573 CURL [139814276814784] (Text)   Trying 169.254.169.254:80...
[DEBUG] 2023-06-13 18:55:35.107 CURL [139814276814784] (Text) connect to 169.254.169.254 port 80 failed: No route to host
[DEBUG] 2023-06-13 18:55:35.107 CURL [139814276814784] (Text) Failed to connect to 169.254.169.254 port 80 after 533 ms: Couldn't connect to server

it would be good to tweak the code so errors aren't eaten

Good catch there @amoeba; I've opened #36726 to fix this.

Adding ?region did not help; I'm still experiencing timeouts. I'm trying to read a large directory of Parquet files but only bring back the one file that matches a particular ID, and I'm using arrow to do so.

My S3 URI looks like this: s3://archetypetx-bucket1/workflow_data/outputs/pubchem/latest/mcrpc/results_summary?region=us-east-1, but I can see in the shinyapps.io logs that the region parameter is not being retained. I've also set AWS_DEFAULT_REGION in my .Renviron file. Thank you for any help.

Error in compute.arrow_dplyr_query: IOError: Could not open Parquet input source 'archetypetx-bucket1/workflow_data/outputs/pubchem/latest/mcrpc/results_summary/partition=10/c81b17e8f4e94be19905c47ffee281ca-0.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
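
One variation that might be worth trying (a sketch only; the filter below is a placeholder for the actual ID match) is to pin the region on an explicit filesystem object rather than via the URI query string, and open the prefix as a dataset:

library(arrow)
library(dplyr)

# Sketch: the region is set on the filesystem itself, so it cannot be dropped
# along with the ?region= query string; open_dataset() then handles the
# partitioned layout.
bucket <- s3_bucket("archetypetx-bucket1", region = "us-east-1")
ds <- open_dataset(bucket$path("workflow_data/outputs/pubchem/latest/mcrpc/results_summary"))

result <- ds %>%
  filter(partition == 10) %>%  # placeholder; filter on your actual ID column
  collect()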

@tcash21 If you haven't yet, I'd recommend trying aws.s3::s3read_using() with FUN = arrow::read_parquet. I spent a lot of time trying to debug this, but arrow for R still proved too unreliable in the cloud because of its region and file-path handling (it would often time out and eat errors). I've been following this thread hoping for a clean fix, but for now aws.s3 has proven more reliable.
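
For a single object, that pattern looks roughly like this (bucket, key, and region below are placeholders; the region can usually also be supplied via opts if it isn't picked up from the environment):

library(aws.s3)
library(arrow)

# Sketch: let aws.s3 handle the S3 transfer and hand the downloaded file to arrow.
df <- s3read_using(
  FUN = arrow::read_parquet,
  object = "path/to/somefile.parquet",
  bucket = "bucket-name",
  opts = list(region = "eu-west-1")
)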

@JerePlum99 thank you, I'm trying this, but does it work with partitioned Parquet files? I'm not querying one specific object, so it's throwing an error at me. I've resorted to using duckdb for now instead, but it's much slower than arrow.

pc <- aws.s3::s3read_using(bucket = base_url, object = "partition*/*.parquet", FUN = arrow::read_parquet)

@tcash21 I'm a bit rusty, but aws.s3 isn't very actively maintained and I don't think it has the more sophisticated partitioning logic. Sorry this doesn't solve your issue; I can confirm it's not just you, and there doesn't appear to be a reliable workaround within the current arrow package version.
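
If it helps, a rough sketch of a manual workaround under those constraints (hypothetical bucket/prefix names, and assuming the combined data fits in memory): list the keys under the prefix with aws.s3, read each file with arrow, and bind the results:

library(aws.s3)
library(arrow)
library(dplyr)

# Sketch: emulate the partitioned read by listing keys under a prefix and
# reading each Parquet file individually. Note that partition values encoded
# in the key path (e.g. partition=10) are not recovered as columns here.
bucket <- "bucket-name"
prefix <- "results_summary/"

keys <- get_bucket_df(bucket = bucket, prefix = prefix, max = Inf)$Key
keys <- keys[grepl("\\.parquet$", keys)]

pc <- keys %>%
  lapply(function(k) s3read_using(FUN = read_parquet, object = k, bucket = bucket)) %>%
  bind_rows()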