This section describes the steps used to set up the Neo4j ecosystem.
- Install/configure Apache Spark (for Scala)
- https://intellipaat.com/blog/tutorial/spark-tutorial/downloading-spark-and-getting-started/
- Apache Spark 2.4.5, pre-built for Hadoop 2.7 (ships with Scala 2.11.12)
- Scala 2.11.12 (the GCP Dataproc 1.4-ubuntu18 image uses Spark 2.4.5 with Scala 2.11.12)
- Apache Maven (for compiling Scala): https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/spark_building.html#building
- create the standard directory structure for Maven projects
- update pom.xml to include dependencies (usually an iterative process: re-run the following command after each change)
- run
mvn clean install
- (update) also run
mvn assembly:assembly -DdescriptorId=jar-with-dependencies
to build a jar that bundles the dependencies (specifically, the BigQuery connector; a sketch of reading from BigQuery follows this section); to see which dependencies were included, run
mvn dependency:tree
- use this to copy the jar to a GCP bucket:
gsutil cp {file_to_copy} gs://{bucket_name}/{location_to_save_to}
- Examples: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples
- Run standalone cluster: https://supergloo.com/spark-scala/apache-spark-cluster-run-standalone/
- More on standalone: https://spark.apache.org/docs/latest/spark-standalone.html
- Has some info on setting SparkConf: https://mbonaci.github.io/mbo-spark/
spark-submit --class WordCount --master spark://zeus:7077 target/sparkwordcount-0.0.1.jar
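The submit command above names a WordCount class; a minimal sketch of what such a job could look like (hedged: the class body, app name, and input path are illustrative, not the actual course code):

    import org.apache.spark.sql.SparkSession

    // Minimal word-count job matching the spark-submit example above.
    // The input path comes from args(0) and is a placeholder for real data.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("WordCount").getOrCreate()
        val counts = spark.sparkContext
          .textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.take(20).foreach(println)
        spark.stop()
      }
    }

Build it with the mvn commands above and pass the resulting jar to spark-submit.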
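And since the assembly jar exists mainly to bundle the BigQuery connector, a hedged sketch of reading a table (the table name is a placeholder, and exact options vary across spark-bigquery-connector versions):

    import org.apache.spark.sql.SparkSession

    // Hedged sketch: read a BigQuery table through the spark-bigquery-connector
    // bundled into the assembly jar. "project.dataset.table" is a placeholder.
    object BigQueryRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("BigQueryRead").getOrCreate()
        val df = spark.read
          .format("bigquery")
          .option("table", "project.dataset.table")
          .load()
        df.show(10)
        spark.stop()
      }
    }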
- Install/configure Docker
- (used Option 1) https://phoenixnap.com/kb/how-to-install-docker-on-ubuntu-18-04
- Add yourself to the docker group so docker can be run without sudo:
sudo usermod -aG docker $USER
(log out and back in for the group change to take effect)
- Use the Neo4j docker run command (I added this to a script)
- https://neo4j.com/developer/docker-run-neo4j/
- {PROJECT_DIR}/scripts/create_neo4j_docker.sh
- Use
docker ps -a
to check the status of the container (running or exited)
- Created start/stop scripts in [course_dir]/scripts
- Access the DB in the browser at localhost:7474
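localhost:7474 serves the Neo4j Browser over HTTP; applications connect over Bolt, default port 7687. A small connectivity check using the Neo4j Java driver from Scala (hedged: the credentials are placeholders for whatever create_neo4j_docker.sh sets, and the import path is for the 4.x driver; the 1.x driver uses org.neo4j.driver.v1):

    import org.neo4j.driver.{AuthTokens, GraphDatabase}

    // Smoke test that the dockerized Neo4j answers over Bolt.
    // User/password are placeholders; match the docker script's settings.
    object Neo4jPing {
      def main(args: Array[String]): Unit = {
        val driver = GraphDatabase.driver(
          "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
        val session = driver.session()
        try {
          val ok = session.run("RETURN 1 AS ok").single().get("ok").asInt()
          println(s"Neo4j responded: $ok")
        } finally {
          session.close()
          driver.close()
        }
      }
    }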
- Connect Spark to Neo4j
https://spark.apache.org/docs/latest/
Followed these instructions (a hedged Spark-to-Neo4j connector sketch follows these steps):
- Create a new project and enable the Compute Engine API *
- Set up the local GCP project:
- if configured from env vars (which mine is):
export CLOUDSDK_CORE_PROJECT=eecs-e6895-edu
Note: add this export to ~/.bashrc to persist it
- if configured via gcloud:
gcloud config set project eecs-e6895-edu
- https://neo4j.com/developer/neo4j-cloud-google-image/
- https://neo4j.com/google-cloud-resources/
- https://cloud.google.com/sdk/install
- list neo4j images:
gcloud compute images list --project launcher-public | grep neo4j
reference: https://community.neo4j.com/t/neo4j-3-5-1-added-to-google-cloud-platform-cluster-and-single-node-community-and-enterprise/4174/3
- https://medium.com/neo4j/running-neo4j-on-google-cloud-6592c1b4e4e5
I created scripts in [course_dir]/scripts
- Running the cluster locally: https://spark.apache.org/docs/latest/spark-standalone.html
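The connector sketch promised above: a hedged example of pulling rows out of Neo4j into Spark with the neo4j-spark-connector 2.4.x line (the one built for Spark 2.4 / Scala 2.11). The Cypher query, bolt URL, and password are assumptions, and the config keys changed in later connector versions:

    import org.apache.spark.sql.SparkSession
    import org.neo4j.spark.Neo4j

    // Hedged example: run a Cypher query against the dockerized Neo4j and
    // load the results as a Spark RDD of Rows. Credentials are placeholders.
    object SparkNeo4jRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("SparkNeo4jRead")
          .config("spark.neo4j.bolt.url", "bolt://localhost:7687")
          .config("spark.neo4j.bolt.user", "neo4j")
          .config("spark.neo4j.bolt.password", "password")
          .getOrCreate()

        val neo = Neo4j(spark.sparkContext)
        val rows = neo.cypher("MATCH (n) RETURN n.name AS name LIMIT 10").loadRowRdd
        rows.collect().foreach(println)
        spark.stop()
      }
    }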
Next steps:
- Using GraphQL in Cloud Run (docker container)
- Build a custom docker container with our DB
- Self-healing graph DB using clusters?
- Look into caching: https://spark.apache.org/docs/latest/quick-start.html#caching (see the sketch below)
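From the quick-start page: cache() marks a dataset to be kept in memory once the first action computes it, so repeated actions skip re-reading the source. A small sketch (the file path is a placeholder):

    import org.apache.spark.sql.SparkSession

    // cache() is lazy: the data is materialized by the first action (count)
    // and served from memory on the second. The path is a placeholder.
    object CacheDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
        val lines = spark.read.textFile("data/README.md").cache()
        println(lines.count())
        println(lines.count())
        spark.stop()
      }
    }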
- Psychology: https://ipip.ori.org/newPublications.htm