[SIP-150] Use existing Athena Presto functionality for large downloads from S3
Motivation
Superset's inbuilt download functionality is slow for larger files: attempting to download a 100,000-row CSV from an AWS Athena (Presto) database took just shy of 9 minutes from start to completion.
However, this can be vastly sped up with little extra overhead for Superset (for Presto DB users). Athena can be configured to automatically persist query results to S3; it takes only seconds to run the query and save the CSV to an S3 bucket.
By starting the query through the Athena API and then polling for the output location of the CSV, that URL can be returned to the user immediately, without streaming the raw results. In the real-world use case we had, the 8 min 45 s download time was reduced to 11 seconds.
Proposed Change
The proposed change adds functionality to return only the existing CSV file's output_location to the user when they request a download of a chart backed by an Athena Presto database.
The change will be protected by feature flags so it is only turned on for Presto DB users, who will need to set environment variables for
the AWS region, Athena workgroup, and Athena DB name; example below from a .env file:
SUPERSET_REGION=eu-west-1
SUPERSET_WORKGROUP=superset-etl
SUPERSET_ATHENA_DB=my_superset_db
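The "start the query, poll, return the output location" flow described in the motivation can be sketched with boto3. This is a hypothetical illustration, not code from the PR: the function name and the `client` injection point are ours; the boto3 calls (`start_query_execution`, `get_query_execution`) are the standard Athena API, and the environment variables are the ones proposed above.

```python
import os
import time


def run_athena_query(sql: str, client=None) -> str:
    """Start a query via the Athena API and return the S3 OutputLocation
    of the result file that Athena persists automatically.

    Hypothetical sketch; ``client`` can be injected for testing.
    """
    if client is None:  # build a real client from the proposed env vars
        import boto3  # imported lazily so the module loads without boto3

        client = boto3.client("athena", region_name=os.environ["SUPERSET_REGION"])

    execution = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": os.environ.get("SUPERSET_ATHENA_DB", "")},
        WorkGroup=os.environ.get("SUPERSET_WORKGROUP", "primary"),
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        info = client.get_query_execution(QueryExecutionId=query_id)
        state = info["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query finished in state {state}")
    # Athena has already written the CSV here; no raw rows cross the wire.
    return info["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
```

Because only the OutputLocation string is returned, the result size no longer matters to Superset: the heavy lifting stays inside AWS.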
Two feature flags are added: one to enable the S3 download functionality and one to hide the existing default CSV/XLSX options (the S3 download is faster than the default download). From featureFlags.ts:
DownloadCSVFromS3 = 'DOWNLOAD_CSV_FROM_S3',
ShowDefaultCSVOptions = 'SHOW_DEFAULT_CSV_OPTIONS',
The option will appear in the right-click context menu.
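If the proposal is accepted, Presto/Athena users would opt in through `superset_config.py` in the usual `FEATURE_FLAGS` way. The flag names come from the proposal above; whether `SHOW_DEFAULT_CSV_OPTIONS` should be toggled off when the S3 flow is on is our assumption:

```python
# superset_config.py -- opting in to the proposed S3 download flow (sketch)
FEATURE_FLAGS = {
    # Return the S3 presigned URL instead of building the file in Superset.
    "DOWNLOAD_CSV_FROM_S3": True,
    # Hide the slower default CSV/XLSX export options (assumed pairing).
    "SHOW_DEFAULT_CSV_OPTIONS": False,
}
```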
New or Changed Public Interfaces
Reuses the existing data endpoint used by the default and full CSV/XLSX downloads.
Passes in result_location, a new parameter specifying whether the exported file is built within Superset (current export) or fetched from S3 (new flow for Presto Athena).
Changes the model to add output_location, the returned presigned URL that identifies the file inside the S3 bucket.
All other changes in PR code changes.
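Turning the stored output_location into the presigned URL handed back to the user could look roughly like this. The helper names are ours, not the PR's; `generate_presigned_url` is the standard boto3 S3 call:

```python
import os


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key/parts.csv' into ('bucket', 'key/parts.csv')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def presign_output_location(output_location: str, expires: int = 300) -> str:
    """Return a time-limited presigned GET URL for Athena's result file."""
    import boto3  # lazy import so the module loads without boto3 installed

    bucket, key = split_s3_uri(output_location)
    s3 = boto3.client("s3", region_name=os.environ["SUPERSET_REGION"])
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```

A short expiry keeps the result bucket effectively private: the URL works for the requesting user and then goes stale.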
New dependencies
No new dependencies.
Migration Plan and Compatibility
No DB changes
Rejected Alternatives
Using a Lambda to fetch the file from S3 and return it through a proprietary application; rejected because users are already using Superset, and it makes sense to let them download large files through the Superset UI.
This is similar to a reverted PR here: #29164
In a perfect world, installing the Athena driver/dbapi would magically enable this Save to S3 option when configured. This probably requires a plugin architecture where the Athena plugin would offer a Save to S3 export plugin as a dependency... but this is a long way off.
Agree, but setting an output location for query runs is not default functionality; it has to be configured at the DB/workgroup level in AWS. I think this would be outside the remit of Superset.
The plugin would need to query the Athena API to check that this is set, or we would need a checkbox/field in the UI at the DB connection level when the connection is added.
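Checking whether a workgroup already enforces an output location is a single Athena API call (`get_work_group`); a hedged sketch, with the function name ours:

```python
def workgroup_output_location(workgroup: str, client=None):
    """Return the workgroup's configured S3 OutputLocation, or None if unset.

    Hypothetical helper; ``client`` can be injected for testing.
    """
    if client is None:
        import boto3  # lazy import so the module loads without boto3

        client = boto3.client("athena")

    config = client.get_work_group(WorkGroup=workgroup)["WorkGroup"]["Configuration"]
    return config.get("ResultConfiguration", {}).get("OutputLocation")
```

A connection-level UI check could call this once when the connection is saved and warn the user if it returns None.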
OutputLocation
The location in Amazon S3 where your query and calculation results are stored, such as s3://path/to/query/bucket/. To run the query, you must specify the query results location in one of two ways: for individual queries, using this setting (client-side), or in the workgroup, using WorkGroupConfiguration. If neither is set, Athena issues an error that no output location is provided. If workgroup settings override client-side settings, the query uses the settings specified for the workgroup. See WorkGroupConfiguration:EnforceWorkGroupConfiguration.
As it is, once an Athena output location has been set, queries against Athena tables automatically persist their results in the desired format (CSV, Avro, Parquet, ORC, JSON, delimited text).
https://docs.aws.amazon.com/athena/latest/ug/creating-databases-prerequisites.html
It makes sense to use what is already available and download the file that has already been created in S3, rather than fetch the raw results, process them into a dataframe, write them out to a file format, and return the file through the API.
Whether this PR should be expanded to download the other available file types, if a different format has been set, is a good question.