/JupyterSparkGraphFrames

Jenkins pipeline poc

Primary LanguageDockerfileApache License 2.0Apache-2.0

JupyterSparkGraphFrames Docker Image

Available on DockerHub: https://hub.docker.com/r/tomerlevi/jupyter-spark-graphframes

This image includes Jupyter notebook server including Spark and GraphFrames lib. The image is based on the official jupyter/all-spark-notebook

Running:

docker run -p 8888:8888 -p 4040:4040 -v ~:/home/jovyan/workspace --name jupyter tomerlevi/jupyter-spark-graphframes

Running providing custom Spark options:

docker run -p 8888:8888 -p 4040:4040 -v ~:/home/jovyan/workspace -e SPARK_OPTS='--spark.executor.memory=1.5g --spark.driver.memory=0.5g' --name jupyter tomerlevi/jupyter-spark-graphframs

Follow my blogpost for more details: https://medium.com/fundbox-engineering/overview-d3759e83969c

Example code (python):

from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import SparkSession
from graphframe import *

spark = SparkSession.builder.master("local").appName("Spark-GraphFrames").getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/graphframes_cps') 

# Create DataFrames manually for testing purposes
from pyspark.sql import Row

v = spark.createDataFrame([
    Row(id='1',label='fedex'), 
    Row(id='2',label='mcDonalds'), 
    Row(id='3',label='verizon'), 
    Row(id='4',label='facebook'), 
    Row(id='5',label='googleAds'),
])

e = spark.createDataFrame([ 
    ('1', '2', 'friend'), 
    ('3', '2', 'friend'),
    ('4', '5', 'friend')
],['src', 'dst', 'type'])

# Now lets construct the graph
g = GraphFrame(v,e)

connected = g.connectedComponents()
connected.show()