
End-to-end solution for migrating CSV data into a Neo4j graph, using an LLM for the data discovery and graph data modeling stages.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Neo4j Runway

Neo4j Runway is a Python library that simplifies the process of migrating your relational data into a graph. It provides tools that abstract communication with OpenAI to run discovery on your data and generate a data model, as well as tools to generate ingestion code and load your data into a Neo4j instance.

Key Features

  • Data Discovery: Harness OpenAI LLMs to provide valuable insights from your data
  • Graph Data Modeling: Utilize OpenAI and the Instructor Python library to create valid graph data models
  • Code Generation: Generate ingestion code to easily load your data
  • Data Ingestion: Load your data using Runway's built-in implementation of PyIngest, Neo4j's popular ingestion tool
  • Exploratory Data Analysis: Run analytics over your graph to discover potential data quality issues

Requirements

Runway uses Graphviz to visualize data models. To use this feature, install Graphviz.

You'll need a Neo4j instance to fully utilize Runway. Start up a free cloud-hosted Aura instance or download the Neo4j Desktop app.

Get Running in Minutes

Follow the steps below, or check out any of the Neo4j Runway end-to-end examples.

pip install neo4j-runway

Now let's walk through a basic example.

Here we import the modules we'll be using.

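from neo4j_runway import Discovery, GraphDataModeler, PyIngest, UserInput
from neo4j_runway.code_generation import PyIngestConfigGenerator
from neo4j_runway.llm.openai import OpenAIDiscoveryLLM, OpenAIDataModelingLLM
# load_local_files is used below; this import path may vary by Runway version
from neo4j_runway.utils.data import load_local_files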

Discovery

Now we...

  • Define a general description of our data
  • Provide brief descriptions of the columns of interest
  • Provide any use cases we'd like our data model to address
  • Load our CSV via Runway's load_local_files function

data_directory = "../../../data/countries/"

data_dictionary = {
                'id': 'unique id for a country.',
                'name': 'the country name.',
                'phone_code': 'country area code.',
                'capital': 'the capital of the country.',
                'currency_name': "name of the country's currency.",
                'region': 'primary region of the country.',
                'subregion': 'subregion location of the country.',
                'timezones': 'timezones contained within the country borders.',
                'latitude': 'the latitude coordinate of the country center.',
                'longitude': 'the longitude coordinate of the country center.'
                }

use_cases = [
        "Which region contains the most subregions?",
        "What currencies are most popular?",
        "Which countries share timezones?"
    ]

data = load_local_files(data_directory=data_directory,
                        data_dictionary=data_dictionary,
                        general_description="This is data on countries and their attributes.",
                        use_cases=use_cases,
                        include_files=["countries.csv"])
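
Before loading, it can help to confirm that the data dictionary keys match the CSV header. The following is a minimal sketch using only the standard library; compare_columns is a hypothetical helper (not part of Runway), and the sample CSV text is made up for illustration.

```python
import csv
import io


def compare_columns(csv_text, data_dictionary):
    """Compare CSV header columns with data dictionary keys.

    Returns (missing, undescribed):
      missing      - dictionary keys that are absent from the CSV header
      undescribed  - CSV columns that have no description in the dictionary
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = set(header)
    keys = set(data_dictionary)
    return sorted(keys - columns), sorted(columns - keys)


# Hypothetical sample: a header plus one row, not the real countries.csv
sample_csv = "id,name,capital,notes\n1,Afghanistan,Kabul,\n"
sample_dictionary = {
    "id": "unique id for a country.",
    "name": "the country name.",
    "region": "primary region of the country.",
}

missing, undescribed = compare_columns(sample_csv, sample_dictionary)
print("Missing from CSV:", missing)
print("Not described:", undescribed)
```

A non-empty "missing" list usually points to a typo in the data dictionary, while undescribed columns are simply columns the LLM will receive no description for.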

We can also preview our CSV data before running any processes.

data.tables[0].dataframe.head()
   id  name            phone_code  capital    currency_name   region   subregion        timezones                                          latitude    longitude
0  1   Afghanistan     93          Kabul      Afghan afghani  Asia     Southern Asia    [{zoneName:'Asia\/Kabul',gmtOffset:16200,gmtOf...  33.000000   65.0
1  2   Aland Islands   +358-18     Mariehamn  Euro            Europe   Northern Europe  [{zoneName:'Europe\/Mariehamn',gmtOffset:7200,...  60.116667   19.9
2  3   Albania         355         Tirana     Albanian lek    Europe   Southern Europe  [{zoneName:'Europe\/Tirane',gmtOffset:3600,gmt...  41.000000   20.0
3  4   Algeria         213         Algiers    Algerian dinar  Africa   Northern Africa  [{zoneName:'Africa\/Algiers',gmtOffset:3600,gm...  28.000000   3.0
4  5   American Samoa  +1-684      Pago Pago  US Dollar       Oceania  Polynesia        [{zoneName:'Pacific\/Pago_Pago',gmtOffset:-396...  -14.333333  -170.0
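
The timezones column in the preview holds a string with unquoted keys and escaped slashes, so it is not valid JSON as shown. As a rough sketch, the zone names can be pulled out with a regular expression. The sample cell below is an abbreviated, hand-written stand-in (the preview truncates the real values), and zone_names is a hypothetical helper, not a Runway function.

```python
import re

# Matches the value of each zoneName entry, e.g. zoneName:'Asia\/Kabul'
ZONE_NAME = re.compile(r"zoneName:'([^']+)'")


def zone_names(raw_cell):
    """Return the zone names found in a raw timezones cell, with escaped slashes restored."""
    return [name.replace("\\/", "/") for name in ZONE_NAME.findall(raw_cell)]


# Abbreviated sample; the real cells contain more fields than shown here
cell = r"[{zoneName:'Asia\/Kabul',gmtOffset:16200}]"
print(zone_names(cell))
```

Extracting the zone names like this makes it easier to check by hand which countries share a timezone, independently of whatever the data model later does with the column.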

Next we initialize our discovery and data modeling LLMs. Runway defaults to GPT-4o; here we specify the model versions explicitly. The OpenAI API key is defined in an environment variable.

llm_disc = OpenAIDiscoveryLLM(model_name='gpt-4o-mini-2024-07-18', model_params={"temperature": 0})
llm_dm = OpenAIDataModelingLLM(model_name='gpt-4o-2024-05-13', model_params={"temperature": 0.5})
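
Since credentials come from environment variables, a quick check up front can save a failed run later. This is a small sketch, assuming the key is stored as OPENAI_API_KEY (the name the OpenAI Python client reads by default); the NEO4J_* names are the ones this README uses in the code generation step. missing_env_vars is a hypothetical helper.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# OPENAI_API_KEY is an assumed variable name; adjust to your setup
required = [
    "OPENAI_API_KEY",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_URI",
    "NEO4J_DATABASE",
]

missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```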

And we run discovery on our data.

disc = Discovery(llm=llm_disc, data=data)
disc.run(show_result=True, notebook=True)

Preliminary Analysis of Country Data

Overall Data Characteristics:

  1. Data Size: The dataset contains 250 entries (countries) and 10 attributes.
  2. Data Types: The attributes include integers, floats, and objects (strings). The presence of both numerical and categorical data allows for diverse analyses.
  3. Missing Values:
    • capital: 5 missing values (2% of the data)
    • region: 2 missing values (0.8% of the data)
    • subregion: 3 missing values (1.2% of the data)
    • Other columns have no missing values.

Important Features:

  1. id: Unique identifier for each country. It is uniformly distributed from 1 to 250.
  2. name: Each country has a unique name, which is crucial for identification.
  3. phone_code: There are 235 unique phone codes, indicating that some countries share the same code. This could be relevant for understanding regional telecommunications.
  4. capital: The capital city is a significant attribute, but with 5 missing values, it may require attention during analysis.
  5. currency_name: There are 161 unique currencies, with the Euro being the most common (35 occurrences). This suggests a potential clustering of countries using the same currency, which could be relevant for economic analyses.
  6. region: There are 6 unique regions, with Africa having the highest frequency (60 countries). This could indicate a need to explore regional characteristics further.
  7. subregion: 22 unique subregions exist, with the Caribbean being the most frequent (28 occurrences). This suggests that some regions have more subdivisions than others.
  8. timezones: The dataset contains 245 unique timezones, indicating that many countries share timezones. This could be useful for understanding global time coordination.

Use Case Insights:

  1. Regions and Subregions: To determine which region contains the most subregions, we can analyze the region and subregion columns. The region with the highest number of unique subregions will be identified.
  2. Popular Currencies: The currency_name column can be analyzed to find the most frequently occurring currencies, highlighting economic ties between countries.
  3. Shared Timezones: The timezones column can be examined to identify countries that share the same timezone, which may have implications for trade, communication, and travel.

Conclusion:

The dataset provides a rich source of information about countries, their geographical locations, and economic attributes. The most important features for analysis include region, subregion, currency_name, and timezones, as they directly relate to the use cases outlined. Addressing the missing values in capital, region, and subregion will also be essential for a comprehensive analysis.
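
The first two use-case checks described in the discovery output can be reproduced by hand as a sanity check. The sketch below uses only the five rows shown in the preview earlier, so its results say nothing about the full dataset; it just illustrates the aggregation.

```python
from collections import Counter, defaultdict

# The five preview rows, reduced to the columns needed here
rows = [
    {"name": "Afghanistan", "currency_name": "Afghan afghani", "region": "Asia", "subregion": "Southern Asia"},
    {"name": "Aland Islands", "currency_name": "Euro", "region": "Europe", "subregion": "Northern Europe"},
    {"name": "Albania", "currency_name": "Albanian lek", "region": "Europe", "subregion": "Southern Europe"},
    {"name": "Algeria", "currency_name": "Algerian dinar", "region": "Africa", "subregion": "Northern Africa"},
    {"name": "American Samoa", "currency_name": "US Dollar", "region": "Oceania", "subregion": "Polynesia"},
]

# Use case 1: count unique subregions per region, skipping missing values
subregions_by_region = defaultdict(set)
for row in rows:
    if row["region"] and row["subregion"]:
        subregions_by_region[row["region"]].add(row["subregion"])

subregion_counts = {region: len(subs) for region, subs in subregions_by_region.items()}
top_region = max(subregion_counts, key=subregion_counts.get)

# Use case 2: count how many countries use each currency
currency_counts = Counter(row["currency_name"] for row in rows)

print(top_region, subregion_counts[top_region])
print(currency_counts.most_common(3))
```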

Data Modeling

We can now use our Discovery object to provide context to the LLM for data model generation. Notice that we don't need to pass our actual data to the modeler, just insights we've gathered so far.

gdm = GraphDataModeler(llm=llm_dm, discovery=disc)

We may now generate our first graph data model.

gdm.create_initial_model()

If we have Graphviz installed, we can take a look at our model.

gdm.current_model.visualize()

countries-first-model.png

Our data model seems to address the three use cases we'd like answered:

  • Which region contains the most subregions?
  • What currencies are most popular?
  • Which countries share timezones?

If we would like the data model modified, we can ask the LLM to make changes.

gdm.iterate_model(corrections="Create a Capital node from the capital property.")
gdm.current_model.visualize()

countries-second-model.png
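
To make the idea of a graph data model concrete, here is a plain-Python sketch of nodes and relationships with one simple consistency check. These dataclasses are illustrative only and are not Runway's model classes. The actual model depends on the LLM's output, so apart from Capital (requested above), the labels and relationship types below are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # a single label per node
    key: str                        # property assumed to be unique per node
    properties: list = field(default_factory=list)


@dataclass
class Relationship:
    type: str
    source: str                     # label of the start node
    target: str                     # label of the end node


def dangling_relationships(nodes, relationships):
    """Return relationships whose source or target label is not a defined node."""
    labels = {node.label for node in nodes}
    return [r for r in relationships if r.source not in labels or r.target not in labels]


# Hypothetical model for the countries data
nodes = [
    Node("Country", "id", ["name", "phone_code", "latitude", "longitude"]),
    Node("Capital", "name"),
    Node("Region", "name"),
    Node("Subregion", "name"),
    Node("Currency", "name"),
]
relationships = [
    Relationship("HAS_CAPITAL", "Country", "Capital"),
    Relationship("IN_SUBREGION", "Country", "Subregion"),
    Relationship("IN_REGION", "Subregion", "Region"),
    Relationship("USES_CURRENCY", "Country", "Currency"),
    Relationship("IN_TIMEZONE", "Country", "TimeZone"),  # TimeZone is deliberately undefined
]

for rel in dangling_relationships(nodes, relationships):
    print(f"{rel.type}: {rel.source} -> {rel.target} references an undefined node")
```

Here the check flags IN_TIMEZONE because no TimeZone node is defined, which is the kind of gap worth correcting through another modeling iteration.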

Code Generation

We can now use our data model to generate some ingestion code.

import os

gen = PyIngestConfigGenerator(data_model=gdm.current_model,
                              username=os.environ.get("NEO4J_USERNAME"),
                              password=os.environ.get("NEO4J_PASSWORD"),
                              uri=os.environ.get("NEO4J_URI"),
                              database=os.environ.get("NEO4J_DATABASE"),
                              file_directory=data_directory,
                              source_name="countries.csv")

pyingest_yaml = gen.generate_config_string()
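
For intuition about what ingestion code does, the sketch below builds a batched Cypher MERGE statement for a single node type. It is not the output of Runway's generator, and merge_node_cypher is a hypothetical helper; the generated config may structure its queries differently.

```python
def merge_node_cypher(label, key, properties):
    """Build a parameterised Cypher statement that MERGEs one node per row of $rows.

    The label and property names are interpolated into the query text, so they
    must come from a trusted source such as the data model, never from user input.
    """
    lines = [
        "UNWIND $rows AS row",
        f"MERGE (n:{label} {{{key}: row.{key}}})",
    ]
    if properties:
        assignments = ", ".join(f"n.{prop} = row.{prop}" for prop in properties)
        lines.append(f"SET {assignments}")
    return "\n".join(lines)


print(merge_node_cypher("Country", "id", ["name", "capital"]))
```

Merging on the key property and setting the remaining properties afterwards keeps the statement idempotent, so re-running the load does not create duplicate nodes.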

Ingestion

We will use the generated PyIngest yaml config to ingest our data into our Neo4j instance.

PyIngest(config=pyingest_yaml, verbose=False)

We can also save this as a .yaml file and use it with the original PyIngest.

gen.generate_config_yaml(file_name="countries.yaml")

Here's a snapshot of our new graph!

countries-graph.png

Graph Exploratory Data Analysis

Runway offers a module for running analytics over an existing graph to gain insights, such as finding isolated nodes and ranking nodes by degree.

See the Neo4j Runway examples for a demonstration of the GraphEDA module.
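
The two analyses mentioned above are easy to illustrate on a toy edge list. This sketch uses plain Python rather than the GraphEDA module, and the nodes and edges are made up for the example.

```python
from collections import Counter


def node_degrees(nodes, edges):
    """Count how many relationships touch each node; nodes without any get 0."""
    degrees = Counter({node: 0 for node in nodes})
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


# Toy graph: Algeria has no relationships, so it is isolated
nodes = ["Aland Islands", "Albania", "Europe", "Afghanistan", "Asia", "Algeria"]
edges = [
    ("Aland Islands", "Europe"),
    ("Albania", "Europe"),
    ("Afghanistan", "Asia"),
]

degrees = node_degrees(nodes, edges)
isolated = sorted(node for node, degree in degrees.items() if degree == 0)
top = degrees.most_common(1)

print("Isolated nodes:", isolated)
print("Highest degree:", top)
```

Isolated nodes often indicate missing values in the source data, such as rows with an empty region.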

Limitations

Runway is currently in beta and under rapid development. Please raise GitHub issues and provide feedback on any features you'd like. The following are some of the current limitations:

  • More complex data modeling is under development
  • Nodes may only have a single label
  • Only uniqueness and key constraints are supported
  • Only OpenAI models may be used at this time
  • Runway only supports ingesting local files, though it supports code generation for other ingest methods