AWS Glue/PySpark for loading data into Astra

Environment:

Started with the Awesome-Astra example for Glue (https://awesome-astra.github.io/docs/pages/data/explore/awsglue/). However, that example does not provide a model for loading data into Astra DB; it covers the other direction.

Python Script

The script is simple. Main points to note:

Here we can set the versions to be used. Pay particular attention to the Spark version.
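As a rough illustration of the kind of script meant here, the following is a minimal sketch of a PySpark job that reads a CSV file from S3 and appends it to an Astra table through the Spark Cassandra Connector. The bucket path, keyspace, table, bundle file name and credentials are all hypothetical placeholders, and the sketch assumes the secure connect bundle has been shipped with the job so that it can be referenced by its bare file name.

```python
# Minimal sketch, not the original script. All names below are placeholders.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def astra_session_conf(bundle_file, client_id, client_secret):
    """Connector settings that point Spark at an Astra database."""
    return {
        # Bundle file name, assuming the file is distributed with the job.
        "spark.cassandra.connection.config.cloud.path": bundle_file,
        "spark.cassandra.auth.username": client_id,
        "spark.cassandra.auth.password": client_secret,
    }


def astra_write_options(keyspace, table):
    """Options identifying the target table for the DataFrame writer."""
    return {"keyspace": keyspace, "table": table}


def write_to_astra(df, keyspace, table):
    """Append the rows of a DataFrame to an existing Astra table."""
    (
        df.write.format(CASSANDRA_FORMAT)
        .options(**astra_write_options(keyspace, table))
        .mode("append")
        .save()
    )


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("glue-to-astra")
    conf = astra_session_conf(
        "secure-connect-mydb.zip",  # hypothetical bundle name
        "<client-id>",              # placeholder credentials
        "<client-secret>",
    )
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Hypothetical input location; the target table must already exist.
    df = spark.read.option("header", "true").csv("s3://my-bucket/input/")
    write_to_astra(df, "my_keyspace", "my_table")
```

In a real job, `main()` would be called at the end of the script. The DataFrame column names need to match the columns of the target table.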

For this case, the script needs two files to be stored in an S3 bucket:

spark-cassandra-connector-assembly_2.12-3.3.0.jar
- Downloaded from: https://search.maven.org/artifact/com.datastax.spark/spark-cassandra-connector-assembly_2.12/3.3.0/jar
- The Spark and Scala versions used in AWS Glue must match the ones this DataStax Spark Cassandra Connector release was built for.
- Use the assembly version, which has all the dependencies included in it.
Secure connect bundle
- Generated on the Astra dashboard and uploaded to S3
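
The version-matching requirement can be checked mechanically from the jar name. The sketch below is only a rule-of-thumb helper: it assumes the connector's major.minor version tracks the Spark major.minor version it was built for, and the Spark and Scala values passed in are examples that should be taken from the Glue version actually selected for the job.

```python
import re

# Matches names such as spark-cassandra-connector-assembly_2.12-3.3.0.jar
ARTIFACT_RE = re.compile(
    r"spark-cassandra-connector(?:-assembly)?"
    r"_(?P<scala>\d+\.\d+)-(?P<version>\d+\.\d+\.\d+)\.jar$"
)


def parse_connector_jar(jar_name):
    """Return (scala_version, connector_version) parsed from the jar name."""
    match = ARTIFACT_RE.search(jar_name)
    if match is None:
        raise ValueError(f"Unrecognised connector jar name: {jar_name}")
    return match.group("scala"), match.group("version")


def is_compatible(jar_name, spark_version, scala_version):
    """Rule of thumb: same Scala version and same major.minor as Spark."""
    jar_scala, jar_version = parse_connector_jar(jar_name)
    same_scala = jar_scala == scala_version
    same_line = jar_version.split(".")[:2] == spark_version.split(".")[:2]
    return same_scala and same_line


jar = "spark-cassandra-connector-assembly_2.12-3.3.0.jar"
# Example runtime values; replace with those of the chosen Glue version.
print(is_compatible(jar, spark_version="3.3.0", scala_version="2.12"))
```

With the example values above the check passes; a job running an older Spark line, such as 3.1.x, would fail it and would need a different connector release.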

These files should be referenced on the "Job Details" page:

--conf
spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions

and

--packages
com.datastax.spark:spark-cassandra-connector_2.12:3.5.0

It is possible to add the Astra credentials to the --conf parameter if needed.
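
One way to pass several Spark settings, credentials included, is to chain them inside the value of a single --conf job parameter. The sketch below builds such a value; it assumes Glue accepts only one --conf key and that additional settings can be appended in the value separated by " --conf ", which is a commonly used workaround rather than something confirmed here. The credential values are placeholders.

```python
def build_glue_conf_value(settings):
    """Join Spark settings into one value for a single --conf job parameter.

    Assumption: extra settings are chained as 'k1=v1 --conf k2=v2'.
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


settings = {
    "spark.sql.extensions":
        "com.datastax.spark.connector.CassandraSparkExtensions",
    "spark.cassandra.auth.username": "<client-id>",      # placeholder
    "spark.cassandra.auth.password": "<client-secret>",  # placeholder
}
print(build_glue_conf_value(settings))
```

The printed string would go in the value field next to the --conf key.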

Monitor the execution on the "Runs" screen.
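
The same run information can also be read programmatically. The sketch below is a small helper around the Glue `get_job_runs` API; it takes an already created client (for example `boto3.client("glue")`) as a parameter, and the job name is a hypothetical placeholder.

```python
def latest_run_states(glue_client, job_name, limit=5):
    """Return (run id, state) pairs for the most recent runs of a Glue job."""
    response = glue_client.get_job_runs(JobName=job_name, MaxResults=limit)
    return [
        (run["Id"], run["JobRunState"])
        for run in response.get("JobRuns", [])
    ]


def main():
    # Imported here so the helper can be exercised without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    # "load-into-astra" is a hypothetical job name.
    for run_id, state in latest_run_states(glue, "load-into-astra"):
        print(run_id, state)
```

Calling `main()` requires AWS credentials with permission to read the job's runs.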