/aws-glue-astra-loader

How to load data from AWS S3 into Astra DB/Cassandra using AWS Glue

Primary Language: Python

AWS Glue/PySpark for loading data into Astra DB

Environment:

  • Spark 3 or 4
  • Dataframes (not RDD)
  • Source data stored on S3.

I started with the Awesome-Astra example for Glue (https://awesome-astra.github.io/docs/pages/data/explore/awsglue/). However, it does not provide a model for loading data into Astra DB; it covers the other direction.

I reused the security setup and secret management from the Awesome-Astra page, and only added the Astra DB token to AWS Secrets Manager.

Python Script

The script is simple. Main points to note:

  • Handle the Secrets Manager connection for accessing the Astra DB token.
  • Format specification.
  • Loading the S3 source file (CSV format).
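The three points above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the secret key name astra_token, the function names, and the bundle_file argument are assumptions made for the example. The format string org.apache.spark.sql.cassandra and the spark.cassandra.* property names come from the Spark Cassandra Connector. The boto3 import is deferred so the helpers can be exercised outside Glue.

```python
import json


def get_astra_token(secret_name, region_name, client=None):
    """Read the Astra DB token from AWS Secrets Manager.

    Assumes the secret is a JSON document with a (hypothetical)
    "astra_token" key.
    """
    if client is None:
        import boto3  # available in the Glue runtime
        client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    secret = json.loads(response["SecretString"])
    return secret["astra_token"]


def astra_connection_conf(token, bundle_file):
    """Spark Cassandra Connector settings for a token-based Astra DB login."""
    return {
        # bundle_file is an assumed name for the secure connect bundle
        "spark.cassandra.connection.config.cloud.path": bundle_file,
        "spark.cassandra.auth.username": "token",  # literal string for token auth
        "spark.cassandra.auth.password": token,
    }


def load_csv_to_astra(spark, s3_path, keyspace, table):
    """Read a CSV file from S3 as a DataFrame and append it to a table."""
    df = spark.read.option("header", "true").csv(s3_path)
    (
        df.write.format("org.apache.spark.sql.cassandra")  # format specification
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

In a Glue job, the settings returned by astra_connection_conf would be supplied to the Spark session (for example through the job's --conf parameter) before load_csv_to_astra is called with the session obtained from the Glue context.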

Job Details

Here we can set the versions to be used. Pay attention to the Spark version.

Files

For this case, the script needs two files to be stored in an S3 bucket:

These files should be referenced on the "Job Details" page:

Job parameters

--conf
spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions

and

--packages
com.datastax.spark:spark-cassandra-connector_2.12:3.5.0

It is possible to add the Astra credentials to the --conf parameter if needed.
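One way to pass several Spark settings through a single --conf job parameter is to chain them inside its value, separating each setting with " --conf ". The helper below is a small sketch of that convention; the function name chain_spark_conf and the placeholder token are hypothetical, and the chaining trick itself is a commonly used workaround rather than something this repository documents.

```python
def chain_spark_conf(settings):
    """Join Spark settings into one value for the Glue --conf job parameter."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


settings = {
    "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
    "spark.cassandra.auth.username": "token",  # literal string for token auth
    "spark.cassandra.auth.password": "<ASTRA_DB_TOKEN>",  # placeholder, not a real token
}

# Paste the printed string as the value of the --conf job parameter.
conf_value = chain_spark_conf(settings)
print(conf_value)
```

Keeping the token in Secrets Manager and reading it from the script avoids storing it in plain text in the job parameters, so this form is better suited to quick tests.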

Execution

Monitor the execution on the "Runs" screen.
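The same information can also be read programmatically. The sketch below uses the boto3 Glue client's get_job_runs call; the function name latest_run_states and the job name in the usage line are hypothetical, and the client is injectable so the helper can be tried without AWS access.

```python
def latest_run_states(job_name, client=None, max_results=5):
    """Return (run id, state) pairs for the most recent runs of a Glue job."""
    if client is None:
        import boto3  # requires AWS credentials and a default region
        client = boto3.client("glue")
    response = client.get_job_runs(JobName=job_name, MaxResults=max_results)
    return [(run["Id"], run["JobRunState"]) for run in response["JobRuns"]]
```

For example, latest_run_states("my-astra-loader-job") would list the latest runs with states such as RUNNING, SUCCEEDED, or FAILED.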