Nebula Importer is a CSV import tool for Nebula Graph. It reads data from CSV files and inserts it into Nebula Graph.
Before you start Nebula Importer, ensure:
- Nebula Graph is deployed
- Schema is created
Currently, there are three ways to deploy Nebula Graph. The quickest way is using `docker-compose`.
After completing the configuration of the YAML file and the preparation of the (to be imported) CSV data file, you can use this tool to batch write to Nebula Graph.
Nebula Importer is compiled with Go 1.13 or later, so make sure that Go is installed on your system. The installation and configuration tutorial is referenced here.
Use `git` to clone the repository to the local machine, go to the `nebula-importer/` directory, and run `make build`.
$ git clone https://github.com/vesoft-inc/nebula-importer.git
$ cd nebula-importer
$ make build
$ ./nebula-importer --config /path/to/yaml/config/file
`--config` is used to pass in the path to the YAML configuration file.
With Docker, you don't have to install Go locally. Pull Nebula Importer's Docker image to import data. The only thing to do is to mount the local configuration file and the CSV data files into the container as follows:
$ docker run --rm -ti \
--network=host \
-v {your-config-file}:{your-config-file} \
-v {your-csv-data-dir}:{your-csv-data-dir} \
vesoft/nebula-importer
--config {your-config-file}
- `{your-config-file}`: Replace with the absolute path of the local YAML configuration file.
- `{your-csv-data-dir}`: Replace with the absolute path of the local CSV data files.

Note: It is recommended to use relative paths in `files.path`. But if you use a local absolute path, carefully check that it matches the path mapped into Docker.
Nebula Importer reads the CSV files to be imported and the Nebula Graph Server information through the YAML configuration file. Here's an example of the configuration file and the CSV file. For detailed descriptions of the configuration file, see the following sections.
version: v1
description: example
removeTempFiles: false
- `version` is a required parameter that indicates the configuration file's version, the default version is `v1`.
- `description` is an optional parameter that describes the configuration file.
- `removeTempFiles` is an optional parameter that specifies whether to remove the generated temporary log and data files, default value: `false`.
- `clientSettings` takes care of all the Nebula Graph related configurations.
clientSettings:
retry: 3
concurrency: 10
channelBufferSize: 128
space: test
connection:
user: user
password: password
address: 192.168.8.1:3699,192.168.8.2:3699
postStart:
commands: |
UPDATE CONFIGS storage:wal_ttl=3600;
UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
afterPeriod: 8s
preStop:
commands: |
UPDATE CONFIGS storage:wal_ttl=86400;
UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
- `clientSettings.retry` is an optional parameter that specifies the number of retries for failed nGQL statements in the Nebula Graph client.
- `clientSettings.concurrency` is an optional parameter that specifies the concurrency of the Nebula Graph client, i.e. the number of connections to the Nebula Graph Server, the default value is 10.
- `clientSettings.channelBufferSize` is an optional parameter that specifies the buffer size of the cache queue for each Nebula Graph client, the default value is 128.
- `clientSettings.space` is a required parameter that specifies which `space` the data will be imported into. Do not import data into multiple spaces at one time for performance's sake.
- `clientSettings.connection` is a required parameter that contains the `user`, `password` and `address` information of the Nebula Graph Server.
- `clientSettings.postStart` is an optional parameter that describes scripts to run after connecting to the Nebula Graph Server:
  - `clientSettings.postStart.commands` defines some commands to run after connecting to the Nebula Graph Server.
  - `clientSettings.postStart.afterPeriod` defines the waiting period between running the above commands and inserting data into the Nebula Graph Server.
- `clientSettings.preStop` is an optional parameter that describes scripts to run before disconnecting from the Nebula Graph Server:
  - `clientSettings.preStop.commands` defines some commands to run before disconnecting from the Nebula Graph Server.
The log and data file related configurations are:

- `logPath`: Optional. Specifies the log path when importing data, the default path is `/tmp/nebula-importer-{timestamp}.log`.
- `files`: Required. An array type to configure different CSV files. You can also import data from an HTTP link by putting the link in the file path.
logPath: ./err/test.log
files:
- path: ./edge.csv
failDataPath: ./err/edge.csv
batchSize: 100
type: csv
csv:
withHeader: false
withLabel: false
delimiter: ","
One CSV file can only store one type of vertex or edge. Vertices and edges of different schemas should be stored in different files.
- `path`: Required. Specifies the path where the CSV data file is stored. If a relative path is used, it is joined with the directory of the current configuration file.
- `failDataPath`: Required. Specifies the file to which failed insert data is output; the error data is appended to this file so that it can be fixed and re-imported later.
- `batchSize`: Optional. Specifies the batch size of the inserted data, the default value is 128.
- `type` & `csv`: Required. Specifies the file type. Currently, only CSV is supported. You can specify whether the CSV file includes the header and the insert/delete labels.
  - `withHeader`: The default value is `false`, the format of the header is described below.
  - `withLabel`: The default value is `false`, the format of the label is described below.
  - `delimiter`: Optional. The delimiter to separate different columns, the default value is `","`.
- `schema`: Required. Describes the metadata of the current data file. `schema.type` has only two values: `vertex` and `edge`.
  - When type is specified as `vertex`, details should be described in the `vertex` field.
  - When type is specified as `edge`, details should be described in the `edge` field.
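Putting these options together, a complete `files` entry might look like the following sketch (the file names and the HTTP URL are illustrative):

```yaml
files:
  # path can also be an HTTP link, e.g. https://example.com/student.csv
  - path: ./student.csv
    failDataPath: ./err/student.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: ","
    schema:
      type: vertex
```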
schema:
type: vertex
vertex:
tags:
- name: student
props:
- name: name
type: string
- name: age
type: int
- name: gender
type: string
`schema.vertex` is a required parameter that describes the schema information, such as tags, of the inserted vertex. Since one vertex can contain several tags, different tags should be given in the `schema.vertex.tags` array.

Each tag contains the following two properties:

- `name`: The tag's name.
- `props`: The tag's properties. Each property contains the following two fields:
  - `name`: The property name, the same as the tag property in Nebula Graph.
  - `type`: The property type. Currently `bool`, `int`, `float`, `double`, `timestamp` and `string` are supported.
Note: The order of properties in the above props must be the same as that of the corresponding data in the CSV data file.
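Because `schema.vertex.tags` is an array, a vertex that carries several tags lists one entry per tag. A sketch with a hypothetical second tag `person`:

```yaml
schema:
  type: vertex
  vertex:
    tags:
      - name: student
        props:
          - name: name
            type: string
          - name: age
            type: int
      # hypothetical second tag on the same vertex
      - name: person
        props:
          - name: gender
            type: string
```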
schema:
type: edge
edge:
name: choose
withRanking: false
props:
- name: grade
type: int
`schema.edge` is a required parameter that describes the schema information of the inserted edge. Each edge contains the following three properties:

- `name`: The edge's name.
- `withRanking`: Specifies whether the edge has a `rank` value, used to distinguish edges with the same edge type, source vertex, and destination vertex.
- `props`: Same as the tag properties above. Note that the property order here must be the same as that of the corresponding data in the CSV data file.
For details of all the configurations, please refer to Configuration Reference.
Usually, you can add some descriptions in the first row of the CSV file to specify each column's type. If `csv.withHeader` is set to `false`, the CSV file contains only the data (no description row). Examples of vertices and edges are as follows:
Take tag `course` for example:
101,Math,3,No5
102,English,6,No11
The first column is the vertex ID, the following three columns are the properties, corresponding to course.name, course.credits and building.name in the configuration file (see `vertex.tags.props`).
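To match the rows above, the `props` in the configuration file have to be declared in the same order as the CSV columns. A sketch of the assumed `course` and `building` tags:

```yaml
schema:
  type: vertex
  vertex:
    tags:
      - name: course
        props:
          - name: name     # column 2: Math
            type: string
          - name: credits  # column 3: 3
            type: int
      - name: building
        props:
          - name: name     # column 4: No5
            type: string
```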
Take edge type `choose` for example:
200,101,5
200,102,3
The first two columns are the source vertex ID and the destination vertex ID; the third is the property, corresponding to choose.likeness in the configuration file. (If ranking is included, the third column is the rank, and the properties follow the ranking column.)
Two CSV data formats will be supported in the future. For now, please use the first format, which has no header line in the CSV data file.
If `csv.withHeader` is set to `true`, the first row of the CSV file is the header. The format of each column is `<tag_name/edge_name>.<prop_name>:<prop_type>`:
- `<tag_name/edge_name>` is the name of the vertex tag or edge type.
- `<prop_name>` is the property name.
- `<prop_type>` is the property type. It can be `bool`, `int`, `float`, `double`, `string` or `timestamp`, the default type is `string`.
In the above `<prop_type>` field, the following keywords have special semantics:

- `:VID` is the vertex ID.
- `:SRC_VID` is the source vertex VID.
- `:DST_VID` is the destination vertex VID.
- `:RANK` is the rank of the edge.
- `:IGNORE` indicates this column will be ignored.
- `:LABEL` indicates the column that marks the insert (+) or delete (-) operation.
If the CSV file contains a header, the importer parses the schema of each row according to the header and ignores the `props` in the YAML file.
Take vertex course as example:
:LABEL,:VID,course.name,building.name:string,:IGNORE,course.credits:int
+,"hash(""Math"")",Math,No5,1,3
+,"uuid(""English"")",English,"No11 B\",2,6
The `:LABEL` column takes the values `+` and `-`, indicating an insert (+) or delete (-) operation.
The `:VID` column accepts common integer values (such as `123`) as well as the two built-in functions `hash` and `uuid`, which automatically calculate the VID of the generated vertex, e.g. `"hash(""Math"")"` and `"uuid(""English"")"`.

Note that the double quotes (`"`) must be escaped in the CSV file. For example, `hash("Math")` should be written as `"hash(""Math"")"`.
course.name,:IGNORE,course.credits:int
Math,1,3
English,2,6
`:IGNORE` specifies a column that you want to ignore when importing data. All columns except the `:LABEL` column can be in any order. Thus, for a large CSV file, you can flexibly select the columns you need by setting the header.
Because a vertex can contain multiple tags, the tag name must be included in the header of the specified column (for example, it must be `course.credits`, rather than the abbreviated `credits`).
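For example, assuming the same `course` tag, the following reordered header describes the same data; the importer resolves each column by its header rather than by its position:

```csv
course.credits:int,:VID,:IGNORE,course.name
3,101,1,Math
6,102,2,English
```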
Take edge `follow` for example:
:DST_VID,follow.likeness:double,:SRC_VID,:RANK
201,92.5,200,0
200,85.6,201,1
In the preceding example, the source vertex of the edge is `:SRC_VID` (in column 3), the destination vertex is `:DST_VID` (in column 1), the property on the edge is `follow.likeness:double` (in column 2), and the ranking field is `:RANK` (in column 4; the default value is 0 if you do not specify it).
- `+` means inserting
- `-` means deleting

As with vertices, you can specify a label in the edge CSV file.
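A hypothetical `follow` edge file with a label column might look as follows; the `-` row deletes the edge inserted by the first row:

```csv
:LABEL,:SRC_VID,:DST_VID,follow.likeness:double
+,200,201,92.5
-,200,201,92.5
```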
- Summary statistics of response
- Write error log and data
- Configure file
- Concurrent request to Graph server
- Create space and tag/edge automatically
- Configure retry option for Nebula client
- Support edge rank
- Support label for add/delete(+/-) in first column
- Support column header in the first line
- Support vid partition
- Support multi-tags insertion in vertex
- Provide docker image and usage
- Make header adapt to props order defined in the schema of the configuration file
- Handle string column in an elegant way
- Update concurrency and batch size online
- Count duplicate vids
- Support VID generation automatically
- Output logs to file