ex-aws/ex_aws_s3

Support for S3 Select

Ivor opened this issue · 10 comments

Ivor commented

Do you guys plan to support S3 Select any time soon?

I've hacked at it a bit and thought it was working until I started validating what I get back. Then I realised the response is chunked and streamed, and to be honest it's over my head at this stage.

It would be fantastic if you could add this functionality.
The API could simply pass through the request XML and expression. The challenging part is parsing the response.

The project is much appreciated, with or without this. Thanks!

I'd welcome a PR for it, but I don't have time in the foreseeable future to do so myself, sorry.

Ivor commented

Understood. I will scratch around a bit and see if there are bits of the existing library that I can reuse. The Download module seems useful, although the S3 Select request is a POST while the download file request is a GET.

If you have any bigger-picture perspective or tips to share, that would be appreciated, but I will see what I can do either way.

Again, much appreciated, the library is very very useful as it is.

@Ivor Did you proceed with this?

Ivor commented

I played around a bit but ended up not using it. The operation below worked when passed to ExAws.request(operation), as long as there were only a few records. The response can be split on the end-of-line character and each line then parsed as JSON. However, the streaming/chunking aspect escaped me, so this failed on bigger record sets.

%ExAws.Operation.S3{
  body: build_xml(expression),
  bucket: "select-bucket-store",
  headers: %{},
  http_method: :post,
  params: %{},
  parser: &ExAws.Utils.identity/1,
  path: "#{path}?select&select-type=2", # path to the S3 object
  resource: "",
  service: :s3,
  stream_builder: nil
}

I suspect the only useful part here is that I embedded the query (expression) in correctly formatted XML and that I added select&select-type=2 to the path. Besides that, I think this is just a normal S3 request. I might have needed to build a stream_builder to deal with bigger data sets.

The XML that I built looked like this:

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<SelectRequest>
  <Expression>#{expression}</Expression>
  <ExpressionType>SQL</ExpressionType>
  <InputSerialization>
    <JSON>
      <RecordDelimiter>\n</RecordDelimiter>
      <Type>DOCUMENT</Type>
    </JSON>
  </InputSerialization>
  <OutputSerialization>
    <JSON>
      <RecordDelimiter>\n</RecordDelimiter>
    </JSON>
  </OutputSerialization>
  <RequestProgress>
    <Enabled>FALSE</Enabled>
  </RequestProgress>
</SelectRequest>"
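
One thing the XML above does not handle is a SQL expression containing characters such as < or &, which would make the request body invalid XML. Here is a minimal sketch of a build_xml/1 that escapes the expression first. The module name SelectRequestSketch and the escape/1 helper are hypothetical names for illustration, not part of ex_aws_s3:

```elixir
defmodule SelectRequestSketch do
  # Hypothetical helper: escapes the five XML special characters so that an
  # expression such as "... where s.age < 30" does not break the request body.
  def escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
    |> String.replace("'", "&apos;")
  end

  # Builds the same SelectRequest document as above, with the expression escaped.
  # As in the original string, \n is an Elixir escape, so the record delimiter
  # sent to S3 is a literal newline character.
  def build_xml(expression) do
    """
    <?xml version="1.0" encoding="UTF-8"?>
    <SelectRequest>
      <Expression>#{escape(expression)}</Expression>
      <ExpressionType>SQL</ExpressionType>
      <InputSerialization>
        <JSON>
          <RecordDelimiter>\n</RecordDelimiter>
          <Type>DOCUMENT</Type>
        </JSON>
      </InputSerialization>
      <OutputSerialization>
        <JSON>
          <RecordDelimiter>\n</RecordDelimiter>
        </JSON>
      </OutputSerialization>
      <RequestProgress>
        <Enabled>FALSE</Enabled>
      </RequestProgress>
    </SelectRequest>
    """
  end
end
```

The ampersand has to be replaced first, otherwise the entities produced by the later replacements would be escaped a second time.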

Hope this helps :)

@Ivor thank you, I'll have a stab at the stream_builder :)

@madshargreave any luck with this?

Does any other resource within ex_aws/ex_aws_s3 include a Transfer-Encoding header with chunked as its value in the response?

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTSelectObjectAppendix.html

Trying to do a simple request/response gives me back this as a header:

     {"Transfer-Encoding", "chunked"}

I am working on this.
EventStream decoding plus async chunk streaming is the hard part. I'm studying how boto does this. I got the request working.

UPDATE on 2/11/2023 -

  • The response from S3 comes with the headers {"Transfer-Encoding", "chunked"} and {"Content-Type", "application/octet-stream"}.
  • Each chunk is encoded in AWS's EventStream format (a lot of binary decoding).
  • Unlike get_object, this request does not accept the HTTP Range header.
  • This means we have to stream chunks as they arrive, with an unknown total size.
  • We'll know when to stop based on the EventStream metadata.
  • I'll have to modify ExAws.Request.Hackney to enable true streaming (this needs :hackney.stream_body, and :hackney.request without the :with_body option).
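
For anyone following along, the chunk-by-chunk reading described in the list above could be wrapped in a lazy stream roughly like this. It is only a sketch: ChunkStreamSketch and read_fun are hypothetical names, and read_fun is assumed to return {:ok, binary}, :done or {:error, reason}, which is the shape :hackney.stream_body/1 returns. It is not the code from any PR:

```elixir
defmodule ChunkStreamSketch do
  # Turns a "give me the next chunk" function into a lazy Elixir Stream.
  # `read_fun` is a hypothetical zero-arity function; with hackney it could be
  #   fn -> :hackney.stream_body(client_ref) end
  # where client_ref comes from :hackney.request/5 called without :with_body.
  def stream(read_fun) do
    Stream.resource(
      # No state is needed beyond the closure itself.
      fn -> :ok end,
      fn state ->
        case read_fun.() do
          {:ok, chunk} -> {[chunk], state}
          :done -> {:halt, state}
          {:error, reason} -> raise "chunk stream failed: #{inspect(reason)}"
        end
      end,
      # Nothing to clean up in this sketch.
      fn _state -> :ok end
    )
  end
end
```

Because the total size is unknown, the stream simply ends when the reader reports :done. Chunk boundaries are not guaranteed to line up with EventStream message boundaries, so whatever consumes this stream would need to buffer partial messages.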

UPDATE on 3/11/2023 -

  • I got chunked octet response streaming working. On to decoding EventStream!

    EDIT: cc @bernardd LMK if this sounds good. I'm still working on this. I think I can get a working PR pretty soon.
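
For reference, the EventStream framing mentioned above is: a 12-byte prelude (total length, headers length, CRC32 of those first 8 bytes, all big-endian), then the headers, then the payload, then a trailing CRC32 over everything before it. Below is a minimal decoding sketch under a few assumptions: EventStreamSketch is a hypothetical module name, only string-typed headers (value type 7) are handled, and malformed input simply raises. It is not the code from the PRs:

```elixir
defmodule EventStreamSketch do
  # Splits a buffer into the complete messages it contains plus leftover bytes,
  # so the leftover can be prepended to the next chunk that arrives.
  def decode(buffer, acc \\ [])

  def decode(<<total::32, headers_len::32, prelude_crc::32, rest::binary>> = buffer, acc)
      when byte_size(buffer) >= total do
    # 12 bytes of prelude plus 4 bytes of trailing message CRC.
    payload_len = total - headers_len - 16

    <<headers::binary-size(headers_len), payload::binary-size(payload_len),
      message_crc::32, tail::binary>> = rest

    # Both checksums are plain CRC32; a mismatch raises a MatchError here.
    ^prelude_crc = :erlang.crc32(<<total::32, headers_len::32>>)
    ^message_crc = :erlang.crc32(binary_part(buffer, 0, total - 4))

    message = %{headers: parse_headers(headers, %{}), payload: payload}
    decode(tail, [message | acc])
  end

  # Not enough bytes for a full message yet: return what we have so far.
  def decode(rest, acc), do: {Enum.reverse(acc), rest}

  # Header layout: 1-byte name length, name, 1-byte value type,
  # then (for strings) a 2-byte value length and the value.
  defp parse_headers(<<>>, acc), do: acc

  defp parse_headers(
         <<name_len::8, name::binary-size(name_len), 7::8, value_len::16,
           value::binary-size(value_len), rest::binary>>,
         acc
       ) do
    parse_headers(rest, Map.put(acc, name, value))
  end
end
```

A consumer would look at the ":event-type" header of each message: Records messages carry the query output in the payload, and an End message signals that the result is complete.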

Hi @avinayak I handed off maintenance of this and other ExAws libraries many years ago to @bernardd

@bernardd I have PRs up for this in this repo and in ex_aws:
#236
ex-aws/ex_aws#1012