Systems and Methods for Big and Unstructured Data - Delivery #1 - AA 2021/2022 - Prof. Marco Brambilla
This project addresses the scenario of building a system for managing the COVID-19 pandemic in a specific country, focusing on the data level. We designed and implemented a Neo4j data structure that supports contact tracing in order to monitor the spread of the virus.
- ⚙ System requirements️
- 🚀 Setup instructions
- 📜 Report
- 👨💻 Usage
- 🗄️ Database dump
- 📊 Diagrams
- 📷 Relationships Visualizations
- 💡 About database population scripts
- 📝 License
- Python 3.8 or higher (only if you want to perform manual load from CSVs)
- Neo4j database
- Python modules in requirements.txt (only if you want to perform manual load from CSVs)
git clone https://github.com/pablogiaccaglia/neo4j-covid-tracing
cd neo4j-covid-tracing/
From the project's directory run the following commands:
pip install -r requirements.txt
This operation is advised only if you want full control over how the data is collected and generated, since populating the database takes a long time, as stated here.
The first step is to move the CSV files from the import folder into the corresponding Neo4j import folder, whose location depends on the operating system and on how Neo4j was installed.
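The copy step can also be scripted. The following is a minimal sketch, assuming the CSV files sit directly inside a single source folder; the function name `copy_csvs` and both folder arguments are hypothetical and must be adapted to your own Neo4j installation.

```python
import shutil
from pathlib import Path
from typing import List


def copy_csvs(source_dir: Path, neo4j_import_dir: Path) -> List[Path]:
    """Copy every .csv file from source_dir into neo4j_import_dir.

    Both paths are assumptions: pass the project's import folder and the
    import folder of your own Neo4j installation.
    """
    neo4j_import_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_file in sorted(source_dir.glob("*.csv")):
        target = neo4j_import_dir / csv_file.name
        # copy2 overwrites an older file with the same name
        shutil.copy2(csv_file, target)
        copied.append(target)
    return copied
```

Because files with the same name are overwritten, the same helper can be re-run after editing a CSV to replace the older version in the Neo4j import folder.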
Next, you need the connection details for the Neo4j database.
As you can see in the main method of the `main.py` file, a `CovidGraphHandler` object is created in the following way:
handler = CovidGraphHandler("URI", "USER", "PASSWORD")
The data passed to the class constructor is used in the `__init__` method to establish a connection through a driver:
self.driver = GraphDatabase.driver(uri, auth = (user, password), max_connection_lifetime = 1000)
Different settings can be specified by changing that line of code. More info available here
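To avoid hard-coding credentials in `main.py`, the connection details can be read from the environment. This is only a sketch: the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helpers `load_settings` and `open_driver` are hypothetical and are not part of the project. The driver call mirrors the one shown above.

```python
import os
from typing import Mapping, NamedTuple


class Neo4jSettings(NamedTuple):
    uri: str
    user: str
    password: str


def load_settings(env: Mapping[str, str] = os.environ) -> Neo4jSettings:
    """Read connection details from environment variables (hypothetical names)."""
    return Neo4jSettings(
        uri=env.get("NEO4J_URI", "bolt://localhost:7687"),  # default local Bolt address
        user=env.get("NEO4J_USER", "neo4j"),
        password=env["NEO4J_PASSWORD"],  # no default: raises KeyError if missing
    )


def open_driver(settings: Neo4jSettings):
    """Create a driver with the same arguments used in the project's constructor."""
    # Imported lazily so the settings helper works without the neo4j package.
    from neo4j import GraphDatabase

    return GraphDatabase.driver(
        settings.uri,
        auth=(settings.user, settings.password),
        max_connection_lifetime=1000,
    )
```

The resulting values can then be passed to the handler, for example `CovidGraphHandler(settings.uri, settings.user, settings.password)`.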
After this step, all you need to do is execute the main method and wait for the routine to complete.
The Python code manipulates several CSV files, which can be found in different versions inside the datasets folders. If you want to make further changes to them, make sure to replace the older version with the new one inside the Neo4j import folder. Detailed information on the manipulation process that led to the final state of the database can be found in the Report.
If you don't want to use Python or install the required dependencies, you can quickly start using the database by loading the dump available here. The following section shows how to do so.
This section describes how to restore a database dump in a live Neo4j deployment.
A database dump can be loaded to a Neo4j instance using the `load` command of `neo4j-admin`.
The `neo4j-admin load` command loads a database from an archive created with the `neo4j-admin dump` command.
Alternatively, `neo4j-admin load` can accept dump from standard input, enabling it to accept input from `neo4j-admin dump` or another source.
The command can be run from an online or an offline Neo4j DBMS.
If you are replacing an existing database, you have to shut it down before running the command.
If you are not replacing an existing database, you must create the database (using `CREATE DATABASE` against the `system` database) after the load operation finishes.
`neo4j-admin load` must be invoked as the `neo4j` user to ensure the appropriate file permissions.
neo4j-admin load --from=<archive-path>
[--verbose]
[--expand-commands]
[--database=<database>]
[--force]
[--info]
Option | Default | Description |
---|---|---|
`--from` | | Path to archive created with the `neo4j-admin dump` command. |
`--verbose` | | Enable verbose output. |
`--expand-commands` | | Allow command expansion in config value evaluation. |
`--database` | `neo4j` | Name for the loaded database. |
`--force` | | Replace an existing database. |
`--info` | | Print meta-data information about the archive file, such as file count, byte count, and format of the load file. |
The following is an example of how to load the dump of the `neo4j` database created in the section Back up an offline database, using the `neo4j-admin load` command.
When replacing an existing database, you have to shut it down before running the command.
bin/neo4j-admin load --from=/dumps/neo4j/neo4j-<timestamp>.dump --database=neo4j --force
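If the restore is driven from Python, the same invocation can be assembled as an argument list. This is a sketch only: `build_load_command` is a hypothetical helper, it covers just the `--from`, `--database`, `--force` and `--verbose` options from the syntax above, and the path to `neo4j-admin` is an assumption.

```python
import shlex
from typing import List, Optional


def build_load_command(
    archive_path: str,
    database: Optional[str] = None,
    force: bool = False,
    verbose: bool = False,
    admin_bin: str = "bin/neo4j-admin",  # assumed location, relative to the Neo4j home
) -> List[str]:
    """Return the argv list for a `neo4j-admin load` call, without running it."""
    cmd = [admin_bin, "load", f"--from={archive_path}"]
    if database is not None:
        cmd.append(f"--database={database}")
    if force:
        cmd.append("--force")
    if verbose:
        cmd.append("--verbose")
    return cmd


def as_shell_string(cmd: List[str]) -> str:
    """Render the argv list as a shell-quoted string, for logging or review."""
    return shlex.join(cmd)
```

The list can be handed to `subprocess.run(cmd, check=True)` when the script runs as the `neo4j` user. Building a list instead of a single string avoids shell quoting problems with unusual archive paths.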
Unless you are replacing an existing database, you must create the database (using `CREATE DATABASE` against the `system` database) after the load operation finishes.
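The `CREATE DATABASE` step can also be issued through the Python driver by opening a session on the `system` database. The sketch below assumes an already created driver object; `create_database` is a hypothetical helper, and the name check is a deliberately conservative assumption rather than the full Neo4j naming rule.

```python
import re

# Conservative check (an assumption, stricter than Neo4j may require):
# starts with a letter, then letters, digits, dots or dashes, 3 to 63 characters.
_DB_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9.-]{2,62}$")


def create_database(driver, name: str) -> str:
    """Run CREATE DATABASE against the system database and return the query used."""
    if not _DB_NAME.match(name):
        raise ValueError(f"unsupported database name: {name!r}")
    query = f"CREATE DATABASE `{name}` IF NOT EXISTS"
    # Administrative commands must be sent to the system database.
    with driver.session(database="system") as session:
        session.run(query).consume()
    return query
```

Validating the name before interpolating it keeps the statement safe even if the name comes from user input.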
When using the `load` command to seed a Causal Cluster, and a previous version of the database exists, you must delete it (using `DROP DATABASE`) first.
Alternatively, you can stop the Neo4j instance and unbind it from the cluster using `neo4j-admin unbind` to remove its cluster state data.
If you fail to DROP or unbind before loading the dump, that database’s store files will be out of sync with its cluster state, potentially leading to logical corruptions.
For more information, see Seed a cluster from a database backup (online).
WENT TO | TOOK |
---|---|
RECEIVED | PART OF |
MET | LOCATED |
LIVES WITH | LIVES IN |
The creation script, which can be executed by invoking the `populateDatabase` method of the `CovidGraphHandler` class located in `main.py`, takes approximately 6 hours to complete.
It creates the following inside the Neo4j database:
- 12014 nodes:
  - 5000 Person nodes
  - 4883 Place nodes
  - 2123 City nodes
  - 4 Vaccine nodes
  - 3 Test nodes
  - 1 Country node
- 296682 directed (593364 undirected) relationships:
  - 2123 directed (4246 undirected) PART OF relationships
  - 1139 directed (2278 undirected) LOCATED relationships
  - 3752 directed (7504 undirected) RECEIVED relationships
  - 6537 directed (13074 undirected) TOOK relationships
  - 8441 directed (16882 undirected) LIVES WITH relationships
  - 5000 directed (10000 undirected) LIVES IN relationships
  - 119651 directed (239302 undirected) MET relationships
  - 150040 directed (300080 undirected) WENT TO relationships
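After populating the database or restoring the dump, the node counts listed above can be checked with a few Cypher `count` queries. The following is a minimal sketch: it assumes the node labels in the database are spelled exactly as in the list (`Person`, `Place`, and so on), and the helper names are hypothetical. The `session` argument is expected to be a Neo4j driver session.

```python
from typing import Dict, Tuple

# Node counts as stated in this README; label spelling is an assumption.
EXPECTED_NODE_COUNTS: Dict[str, int] = {
    "Person": 5000,
    "Place": 4883,
    "City": 2123,
    "Vaccine": 4,
    "Test": 3,
    "Country": 1,
}


def count_nodes(session, label: str) -> int:
    """Return the number of nodes carrying the given label."""
    record = session.run(f"MATCH (n:`{label}`) RETURN count(n) AS c").single()
    return record["c"]


def find_mismatches(
    session, expected: Dict[str, int] = EXPECTED_NODE_COUNTS
) -> Dict[str, Tuple[int, int]]:
    """Map each label whose count differs to a (expected, actual) pair."""
    mismatches = {}
    for label, want in expected.items():
        got = count_nodes(session, label)
        if got != want:
            mismatches[label] = (want, got)
    return mismatches
```

An empty result from `find_mismatches` means every label matches the expected count. The same pattern can be extended to relationship types with `MATCH ()-[r:TYPE]->() RETURN count(r)`, once the exact type names used in the database are known.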
Information on how the data was produced can be found in the report.
This file is part of "Neo4j Covid Tracing Database".
"Neo4j Covid Tracing Database" is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
"Neo4j Covid Tracing Database" is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program (LICENSE.txt). If not, see http://www.gnu.org/licenses/