Primary language: Python. License: Apache License 2.0.

pyspark-cassandra

Utilities and examples to assist in working with Cassandra and PySpark.

Currently contains an updated and more robust example of using a SparkContext's newAPIHadoopRDD to read from Cassandra 2.1 and an RDD's saveAsNewAPIHadoopDataset to write to it. The example also demonstrates the use of CQL collections: lists, sets and maps.
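As a rough illustration of the read/write pattern described above, here is a minimal sketch. It is not the code shipped in this repository. The configuration keys follow the Cassandra example bundled with Spark. The helper names (read_conf, write_conf, read_table, write_rows) are made up for illustration, and the input format, key/value classes and converter class names are left as parameters because they depend on the Cassandra version and on what the uberjar provides.

```python
# Sketch only: helper names are hypothetical, and the Thrift port and
# partitioner below are assumed defaults that may differ on your cluster.

def read_conf(host, keyspace, table):
    """Build the Hadoop job configuration for reading one table."""
    return {
        "cassandra.input.thrift.address": host,
        "cassandra.input.thrift.port": "9160",
        "cassandra.input.keyspace": keyspace,
        "cassandra.input.columnfamily": table,
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
    }


def write_conf(host, keyspace, table, columns):
    """Build the Hadoop job configuration for writing the given columns."""
    assignments = ", ".join("%s = ?" % c for c in columns)
    return {
        "cassandra.output.thrift.address": host,
        "cassandra.output.thrift.port": "9160",
        "cassandra.output.keyspace": keyspace,
        "cassandra.output.partitioner.class": "Murmur3Partitioner",
        "cassandra.output.cql": "UPDATE %s.%s SET %s" % (keyspace, table, assignments),
        "mapreduce.output.basename": table,
        "mapreduce.outputformat.class": "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
        "mapreduce.job.output.key.class": "java.util.Map",
        "mapreduce.job.output.value.class": "java.util.List",
    }


def read_table(sc, input_format, key_class, value_class,
               key_converter, value_converter, conf):
    """Read a table into an RDD; all class names are supplied by the caller."""
    return sc.newAPIHadoopRDD(
        input_format, key_class, value_class,
        keyConverter=key_converter,
        valueConverter=value_converter,
        conf=conf)


def write_rows(rdd, key_converter, value_converter, conf):
    """Write an RDD of (key, values) pairs using the output configuration."""
    rdd.saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=key_converter,
        valueConverter=value_converter)
```

A driver would pass its SparkContext to read_table together with read_conf(host, "test", "users"), and pass an RDD of (key, values) pairs to write_rows together with the matching write_conf. The converter classes must be on the classpath, which is what the uberjar is for.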

Work on proper integration with the DataStax Cassandra Spark Connector is in progress.

Building

You'll need Maven in order to build the uberjar required for the examples.

mvn clean package

This creates an uberjar at target/pyspark-cassandra-<version>-SNAPSHOT.jar.

Using with PySpark

spark-submit --driver-class-path /path/to/pyspark-cassandra.jar myscript.py ...

Using examples

pip install -r requirements.txt

Then run examples either directly with spark-submit, or use the run_script.py utility.

Running the PySpark Cassandra Hadoop Example

The example can first create the schema it requires via:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py init test

The init command creates the keyspace and table and inserts sample data. Here "test" is the name of the keyspace. A users table is created in this keyspace with two sample users, so that there is something to read.

Afterwards, you can run:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test

This runs a sample PySpark driver program that reads the existing values in the users table and then writes two new users to it.