8 weekly training sessions covering the fundamentals of the Neo4j graph database
- Seeing the big picture
- Best practices
- Secret sauces
- For non-developers (30 mins): informal, enjoyable, real-life example
- For developers (30 mins): hands-on & reproducible
- Step-by-step guidance to help attendees perform the tasks needed for an end-to-end showcase.
- Providing an approach to dealing with a real-life example using a set of selected technologies and tools, properly constructed and tested.
- Identifying pieces that can be reused later in related projects.
- Simple to reproduce so every attendee can self-repeat the hands-on whenever and wherever needed.
- Documented well enough to be easy to use.
- One session per week
- A few online slides
- Most materials are reproducible Docker containers, GitHub repos, etc.
- An example of Rumor spreading: JOIN-ing an unbounded number of SQL tables is discussed to showcase the need for handling graph data. What do traditional SQL databases offer for these cases, and how can it be solved with Neo4j?
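To make the JOIN problem concrete: finding everyone a rumor can reach requires one self-join per hop in SQL, so unbounded depth means an unbounded number of JOINs, while in a graph it is a single traversal. A minimal Python sketch (toy in-memory data, names made up for illustration):

```python
from collections import deque

# Who-told-whom edges. In SQL this is a single "knows" table, and finding
# everyone a rumor can reach needs one self-JOIN per hop - unbounded depth
# means an unbounded number of JOINs. As a graph it is a plain traversal.
TOLD = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["frank"],
    "frank": [],
}

def rumor_reach(start):
    """Return the set of people the rumor reaches from `start`, at any depth."""
    seen, queue = {start}, deque([start])
    while queue:
        person = queue.popleft()
        for neighbor in TOLD.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

print(sorted(rumor_reach("alice")))  # everyone reachable, regardless of depth
```

In Cypher the same traversal is a single variable-length pattern, e.g. `MATCH (p:Person {name: 'alice'})-[:TOLD*]->(q) RETURN DISTINCT q.name` (the `Person`/`TOLD` schema here is an assumption for illustration).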
- Three business cases of why, how, and with what results organizations such as the US Army, Monsanto, and the International Consortium of Investigative Journalists (ICIJ) transformed some of their data, processes, and applications to Neo4j.
- A brief discussion of what the Neo4j ecosystem consists of, how to get more information, and how to start actively using it for your business cases.
- A dataset of James Bond movies is collected by harvesting data from the IMDB site (www.imdb.com) and merging it with previously gathered data from Wikipedia. This dataset contains basic information about the movies: IMDB URL, movie name, year of release, synopsis, IMDB votes, directors, and actors.
- The dataset is sent through a data processing pipeline consisting of two Stanford NLP taggers - Part-of-Speech and Named Entity - to extract key phrases and named entities from the synopses.
- All entities are persisted into a graph managed by a local Dockerized Neo4j container, created using the neo4j-algo-apoc GitHub repo.
- A few Cypher queries are used to showcase how to aggregate data, find relationships, and detect similarities in the dataset.
- How can this graph database be enriched, and what purposes might it serve?
- Get the dataset:
  - harvesting data (movie, directors, actors, rating, votes, ...) from imdb.com with scrapper-0.1:imdb as a micro service
  - reusing existing data from Wikipedia from a tab-separated file
- Process the harvested data through a data pipeline called pipeline-0.1:imdb
  - stanford-nlp-3.9.2:pos: Stanford NLP Part-of-Speech tagger docker (as micro service)
  - stanford-nlp-3.9.2:ner: Stanford NLP Named Entity tagger docker (as micro service)
- Store the entities and their relationships in a storage called neo4j-3.5.5:algo-apoc, which is a Neo4j docker with the APOC and ALGO libraries.
- Visualization and queries with the Neo4j browser, answering a few questions:
  - What does the meta graph look like?
  - What are the representative features of some movies?
  - What is the average number of votes per movie for an actor?
  - Which actors and directors participated in the most film productions?
  - Which actors may have worked with the same directors many times (strong influence)?
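One of the questions above - the average number of votes per movie for an actor - can be sketched in plain Python before translating it to Cypher. The sample rows below are made up for illustration, not real IMDB numbers:

```python
# Tiny stand-in for the movie graph: (movie, votes, actors).
# Sample values are invented, not real IMDB data.
MOVIES = [
    ("Dr. No", 150_000, ["Sean Connery"]),
    ("Goldfinger", 180_000, ["Sean Connery"]),
    ("GoldenEye", 240_000, ["Pierce Brosnan"]),
]

def avg_votes_per_movie(actor):
    """Average vote count over the movies the given actor appears in."""
    votes = [v for _, v, actors in MOVIES if actor in actors]
    return sum(votes) / len(votes) if votes else 0.0

print(avg_votes_per_movie("Sean Connery"))  # 165000.0
```

A Cypher equivalent, assuming an `(:Actor)-[:ACTED_IN]->(:Movie)` shape (an assumption about the showcase's model), would be `MATCH (a:Actor {name: $name})-[:ACTED_IN]->(m:Movie) RETURN avg(m.votes)`.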
- Objective: recommending natural next jobs from 1M+ job ads.
- Approach:
  - Suggesting jobs based on collaborative-based filtering - the top most common transitions from a job
  - Using a standard occupation classification system with a tree-like structure and clusters of verified job titles serving as anchor points for matching users' and job ads' job titles. Matching job titles based on content-based filtering using Natural Language Processing with ML models for sentence tagging to extract key phrases from job titles such as software developer, project manager, ...
  - Improving recommendation quality by using geographical information about users' and job ads' locations
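The "top most common transitions from a job" idea can be sketched in a few lines of Python over user job histories (the real sessions work on 1M+ ads; the histories and titles below are invented):

```python
from collections import Counter

# Hypothetical user job histories, ordered from past to present.
HISTORIES = [
    ["junior developer", "software developer", "senior developer"],
    ["junior developer", "software developer", "project manager"],
    ["software developer", "senior developer"],
    ["qa analyst", "software developer"],
]

def top_transitions(from_job, k=2):
    """Most common next jobs observed right after `from_job` in any history."""
    nxt = Counter(
        history[i + 1]
        for history in HISTORIES
        for i in range(len(history) - 1)
        if history[i] == from_job
    )
    return nxt.most_common(k)

print(top_transitions("software developer"))
```

In the graph, each history becomes a chain of job nodes, and the same counts fall out of aggregating over a single relationship pattern instead of scanning every history.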
- Datasets/APIs/toolkits:
  - Kaggle Job Recommendation Challenge contributed by CareerBuilder.
  - Standard Occupation Classifications (SOC) 2010 from the US Bureau of Labor Statistics (BLS) and the O*NET organization for occupations, job titles, tech skills, and tools used.
  - Stanford Core Natural Language Processing - Stanford CoreNLP version 3.9.2 for Part-of-Speech tagging job titles, turning them into key phrases.
  - Natural Language Toolkit - NLTK version 3.4.3 for lemmatization and stemming of English words.
  - OpenStreetMap - OSM server for geocoding locations' coordinates.
- Technology framework:
  - Neo4j graph database
  - Docker containers as micro services:
    - Neo4j CE version 3.5.5, exposed via both HTTP (7474) and Bolt (7687) interfaces
    - Stanford 3.9.2 POS Tagger, exposed via socket (8001)
    - NLTK wrapped by the Python bottle web framework and the waitress WSGI server
  - Neo4j browser for query executions
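The NLTK micro service above wraps NLTK with bottle and waitress; the shape of such a wrapper can be sketched with only the standard library, using `wsgiref` in place of waitress and a toy suffix-stripper in place of NLTK's stemmers (the port and parameter names are assumptions):

```python
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server  # stdlib stand-in for waitress

def toy_stem(word):
    """Toy stand-in for NLTK stemming: strip a few common English suffixes."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def app(environ, start_response):
    # e.g. GET /?word=developers  ->  "develop"
    params = parse_qs(environ.get("QUERY_STRING", ""))
    word = params.get("word", [""])[0]
    body = toy_stem(word).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# To run it:       make_server("", 8002, app).serve_forever()
# Production-ish:  waitress.serve(app, port=8002)
```

The real service would swap `toy_stem` for an NLTK stemmer/lemmatizer and serve `app` with waitress; the WSGI interface stays the same either way.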
- Objective: provide simple statistics for school survey response rate
- Approach:
- Importing the current dataset of registrations and survey responses
- Providing simple statistics
- Using the special neo4j-admin import feature for large data imports at high performance.
- Preparing data for import - normalizing entities, relationships, headers, etc.
- Using apoc.export.csv.query to export data in csv format.
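`neo4j-admin import` expects CSV files with typed headers: `:ID` and `:LABEL` columns for node files, and `:START_ID`, `:END_ID`, `:TYPE` for relationship files. The preparation step can be sketched with the stdlib csv module; the field names below are invented for a registrations-style dataset:

```python
import csv
import io

def write_nodes(fh, rows):
    """Write a node CSV for neo4j-admin import (":ID"/":LABEL" headers)."""
    writer = csv.writer(fh)
    writer.writerow(["learnerId:ID", "name", ":LABEL"])
    for learner_id, name in rows:
        writer.writerow([learner_id, name, "Learner"])

def write_rels(fh, rows):
    """Write a relationship CSV (":START_ID"/":END_ID"/":TYPE" headers)."""
    writer = csv.writer(fh)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for learner_id, offering_id in rows:
        writer.writerow([learner_id, offering_id, "REGISTERED_FOR"])

buf = io.StringIO()  # a real run would write to learners.csv on disk
write_nodes(buf, [("l1", "Ada"), ("l2", "Grace")])
print(buf.getvalue())
```

The files would then be loaded with something like `neo4j-admin import --nodes=learners.csv --relationships=registrations.csv` (flag shapes as in the 3.5 tooling; check against the version in use).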
- Objective: connecting various apps into the Kafka - Neo4j streaming framework, reducing connectivity complexity, and enabling data consuming for different scenarios.
- Approach:
- Integration of Kafka with Neo4j
- Showcasing ability using different clients (command-line, Java, Python, HTTP)
- Creating and running a complete Kafka cluster on a single machine.
- Testing connectivity and data producing/consuming features for various clients.
- Showcasing Neo4j just-in-time data warehousing.
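The single-machine setup can be wired together with a docker-compose file along these lines; the image tags and the neo4j-streams environment settings are assumptions to verify against the versions used in the session:

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:5.2.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:5.2.1
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  neo4j:
    image: neo4j:3.5.5
    ports: ["7474:7474", "7687:7687"]
    environment:
      # read by the neo4j-streams plugin, if installed in the container
      NEO4J_kafka_bootstrap_servers: kafka:9092
```

Command-line, Java, Python, and HTTP clients then all talk to the same broker on `localhost:9092`, which is what keeps the connectivity complexity down.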
- Objectives:
  - Demonstrate the capabilities of visualization to gain insights through discovery.
  - Introduce different visualization tools and how to use them.
- Approach:
- CSPS registration/survey dataset
- Browser
- Tools:
  - GraphXR cloud, JS scripts from the browser connecting to a local Neo4j instance
  - Gephi standalone app, streaming data from a Neo4j instance
  - Node.js app with Neovis.js, performing Cypher queries against a backend Neo4j instance.
- Objectives:
- Demonstrate capabilities of exploring the graph visually
- Using virtual graph to group, delegate, refactor subgraphs
- Showing capabilities with tabular data
- Approach:
- CSPS registration/survey dataset
- Browser
- Tools:
  - Exploration with pure Cypher
  - Using virtual nodes and relationships
  - Using Harmonic Centrality and Louvain Community Detection
- No presentation for this session, only hands-on.
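Of the two algorithms above, Harmonic Centrality is easy to sketch outside Neo4j: a node's score is the sum of 1/d(u, v) over all other reachable nodes v. A toy Python version (graph data invented; the session itself would call the ALGO library procedures, e.g. something like `algo.closeness.harmonic.stream`):

```python
from collections import deque

# Toy undirected graph as adjacency sets (hypothetical data).
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def distances_from(source):
    """BFS shortest-path distances (in hops) from `source`."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in GRAPH[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def harmonic_centrality(node):
    """Sum of 1/d(node, other) over all other reachable nodes."""
    dist = distances_from(node)
    return sum(1.0 / d for other, d in dist.items() if other != node)

print(harmonic_centrality("c"))  # 3.0 - the best-connected node here
```

Unreachable nodes simply contribute nothing to the sum, which is why harmonic centrality (unlike plain closeness) behaves well on disconnected graphs.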
Modern tools for building full stack apps
GRANDstack is a combination of technologies that work together to enable developers to build data intensive full stack applications. The components of GRANDstack are:
- GraphQL - A new paradigm for building APIs, GraphQL is a way of describing data and enabling clients to query it.
- React - A JavaScript library for building component based reusable user interfaces.
- Apollo - A suite of tools that work together to create great GraphQL workflows.
- Neo4j Database - The native graph database that allows you to model, store, and query your data the same way you think about it: as a graph.
- Tools:
- React with Apollo
- Neo4j-GraphQL database plugin
- neo4j-graphql-js (& Neo4j Javascript Bolt driver)
- GraphiQL
- No presentation for this session, only hands-on.
- CSPS data consists of course/offering/registration/learner/survey/etc. entities
- GRANDstack is a combination of technologies that work together to enable developers to build data-intensive full stack applications.
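With neo4j-graphql-js, the GraphQL schema drives the generated Cypher, so modeling the CSPS entities largely means writing type definitions. A minimal hypothetical schema (type, field, and relationship names are invented for illustration; `@relation` is the neo4j-graphql-js directive mapping a field to a Neo4j relationship):

```graphql
type Course {
  name: String!
  offerings: [Offering] @relation(name: "OFFERING_OF", direction: "IN")
}

type Offering {
  startDate: String
  course: Course @relation(name: "OFFERING_OF", direction: "OUT")
  registrations: [Learner] @relation(name: "REGISTERED_FOR", direction: "IN")
}

type Learner {
  name: String!
  registrations: [Offering] @relation(name: "REGISTERED_FOR", direction: "OUT")
}
```

From a schema like this, neo4j-graphql-js auto-generates query and mutation resolvers backed by Cypher over the Bolt driver, which is what makes the GRANDstack loop (React + Apollo on top, Neo4j underneath) so short.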