Nebula Importer is a CSV import tool for Nebula Graph. It reads data from local CSV files and imports it into Nebula Graph.
Before you start Nebula Importer, make sure that:
- Nebula Graph is deployed.
- A schema, composed of a space, tags, and edge types, is created, as in the sketch below.
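As a hedged sketch, the schema behind the `student` and `choose` examples used later in this document could be created with nGQL like the following (the space name and VID type are assumptions; if you generate VIDs with `function: hash`, the space needs `vid_type = INT64`):

```ngql
CREATE SPACE IF NOT EXISTS test (vid_type = FIXED_STRING(32));
USE test;
CREATE TAG IF NOT EXISTS student(name string, age int, gender string, phone string, email string, address string);
CREATE EDGE IF NOT EXISTS choose(grade int);
```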
Currently, there are three methods to deploy Nebula Graph. The quickest way is to use Docker Compose.

NOTE: The RPC protocols (i.e., Thrift) of Nebula Graph 1.x, 2.x, and 3.x are incompatible. The `master` and `v3` branches of Nebula Importer can only connect to Nebula Graph 3.x. Do not mix mismatched versions.
After configuring the YAML file and preparing the CSV files to be imported, you can use this tool to batch write data to Nebula Graph.
Nebula Importer is compiled with Go 1.13 or later, so make sure that Go is installed on your system. See the Go installation document for the installation and configuration tutorial.
- Clone the repository. For Nebula Graph 3.x, clone the `master` branch.

  ```bash
  $ git clone https://github.com/vesoft-inc/nebula-importer.git
  ```

- Go to the `nebula-importer` directory.

  ```bash
  $ cd nebula-importer
  ```

- Build the source code.

  ```bash
  $ make build
  ```

- Start the service.

  ```bash
  $ ./nebula-importer --config /path/to/yaml/config/file
  ```

  The `--config` option in the preceding command passes the path of the YAML configuration file.
If you use Docker, you do not have to install Go locally. Pull the Docker image for Nebula Importer, mount the local configuration file and the CSV data files into the container, and you are done.
```bash
$ docker run --rm -ti \
    --network=host \
    -v {your-config-file}:{your-config-file} \
    -v {your-csv-data-dir}:{your-csv-data-dir} \
    vesoft/nebula-importer:{image_version} \
    --config {your-config-file}
```
- `{your-config-file}`: Replace with the absolute path of the local YAML configuration file.
- `{your-csv-data-dir}`: Replace with the absolute path of the local CSV data directory.
- `{image_version}`: Replace with the image version you need (e.g., `v1`, `v2`, `v3`).
NOTE: We recommend that you use relative paths in `files.path`. If you use local absolute paths, check carefully how the paths are mapped into Docker.
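For example, a minimal invocation might look like the following (all paths and the image tag are hypothetical placeholders):

```bash
$ docker run --rm -ti \
    --network=host \
    -v /home/user/config.yaml:/home/user/config.yaml \
    -v /home/user/data:/home/user/data \
    vesoft/nebula-importer:v3 \
    --config /home/user/config.yaml
```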
Nebula Importer uses a YAML configuration file to store information about the CSV files and the Nebula Graph server. Here is an example for v2 and an example for v1 of the configuration file and the CSV file. You can find the explanation of each option in the following:
```yaml
version: v2
description: example
removeTempFiles: false
```
- `version`: Required. Indicates the configuration file version. The default value is `v2`. Note that a `v2` configuration can be used with both Nebula Graph 2.x and 3.x services.
- `description`: Optional. Describes the configuration file.
- `removeTempFiles`: Optional. Whether to delete the temporarily generated logs and error data files. The default value is `false`.
- `clientSettings`: Stores all the configurations related to the Nebula Graph service.
```yaml
clientSettings:
  retry: 3
  concurrency: 10
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 192.168.8.1:9669,192.168.8.2:9669
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=86400;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
```
- `clientSettings.retry`: Optional. The number of times the Nebula Graph client retries failed nGQL queries.
- `clientSettings.concurrency`: Optional. The concurrency of the Nebula Graph client, i.e., the number of connections between the Nebula Graph client and the Nebula Graph server. The default value is 10.
- `clientSettings.channelBufferSize`: Optional. The buffer size of the cache queue for each Nebula Graph client. The default value is 128.
- `clientSettings.space`: Required. Specifies which `space` the data is imported into. Do not import data into multiple spaces at the same time, because doing so degrades performance.
- `clientSettings.connection`: Required. Configures the `user`, `password`, and `address` information of the Nebula Graph server.
- `clientSettings.postStart`: Optional. Stores the operations that are performed after the Nebula Graph server is connected and before any data is inserted.
  - `clientSettings.postStart.commands`: Defines commands that run once the Nebula Graph server is connected.
  - `clientSettings.postStart.afterPeriod`: Defines the interval (e.g., `8s`) between running the preceding commands and inserting data into the Nebula Graph server.
- `clientSettings.preStop`: Optional. Configures the operations performed before disconnecting from the Nebula Graph server.
  - `clientSettings.preStop.commands`: Defines commands that run before disconnecting from the Nebula Graph server.
The following three configurations are related to the log and data files:
- `workingDir`: Optional. If you have multiple directories containing data with the same file structure, you can use this parameter to switch between them. For example, with the configuration below, the `path` and `failDataPath` values automatically resolve to `./data/student.csv` and `./data/err/student`. If you change `workingDir` to `./data1`, the paths change accordingly. The parameter can be either absolute or relative.
- `logPath`: Optional. Specifies the log path when importing data. The default path is `/tmp/nebula-importer-{timestamp}.log`.
- `files`: Required. An array that configures the different data files. You can also import data from an HTTP link by putting the link in the file path, as sketched after the following example.
```yaml
workingDir: ./data/
logPath: ./err/test.log
files:
  - path: ./student.csv
    failDataPath: ./err/student
    batchSize: 128
    limit: 10
    inOrder: false
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: ","
      lazyQuotes: false
```
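As noted above, the `path` can also be an HTTP link; a minimal hedged sketch (the URL is hypothetical):

```yaml
files:
  - path: https://example.com/data/student.csv
    failDataPath: ./err/student
    type: csv
    csv:
      withHeader: false
```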
One CSV file can only store one type of vertex or edge. Vertices and edges with different schemas must be stored in different files.
- `path`: Required. Specifies the path where the data files are stored. If a relative path is used, it is joined with the directory of the current configuration file. Wildcard filenames are also supported, for example `./follower-*.csv`; make sure that all matching files have the same schema.
- `failDataPath`: Required. Specifies the directory for the data that failed to insert, so that the failed data can be reinserted.
- `batchSize`: Optional. Specifies the batch size of the inserted data. The default value is 128.
- `limit`: Optional. Limits the maximum number of rows to read.
- `inOrder`: Optional. Whether to insert the data rows in the file in order. If you do not specify it, you avoid the decrease in import rate caused by data skew.
- `type` & `csv`: Required. Specifies the file type. Currently, only CSV is supported. Specify whether the CSV file includes a header and insert/delete labels.
  - `withHeader`: The default value is `false`. The header format is described in a following section.
  - `withLabel`: The default value is `false`. The label format is described in a following section.
  - `delimiter`: Optional. Specifies the delimiter of the CSV files. The default value is `","`. Only one-character delimiters are supported.
  - `lazyQuotes`: Optional. If `lazyQuotes` is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
`schema`: Required. Describes the metadata of the current data file. The `schema.type` field has only two values: `vertex` and `edge`.

- When the type is set to `vertex`, the details must be described in the `vertex` field.
- When the type is set to `edge`, the details must be described in the `edge` field.
`schema.vertex`: Required. Describes the schema information for vertices, for example, tags.
```yaml
schema:
  type: vertex
  vertex:
    vid:
      index: 1
      function: hash
      prefix: abc
    tags:
      - name: student
        props:
          - name: age
            type: int
            index: 2
          - name: name
            type: string
            index: 1
          - name: gender
            type: string
            defaultValue: "male"
          - name: phone
            type: string
            nullable: true
          - name: email
            type: string
            nullable: true
            nullValue: "__NULL__"
          - name: address
            type: string
            nullable: true
            alternativeIndices:
              - 7
              - 8
```
```yaml
# concatItems example
schema:
  type: vertex
  vertex:
    vid:
      concatItems:
        - "abc"
        - 1
      function: hash
```
`schema.vertex.vid`: Optional. Describes the vertex ID column and the function used to generate vertex IDs.

- `index`: Optional. The column number in the CSV file, starting from 0. The default value is 0.
- `concatItems`: Optional. Each concat item can be a `string` or an `int`, and they can be mixed. A `string` represents a constant, and an `int` represents a column index. All items are concatenated to form the VID. If set, the above `index` has no effect.
- `function`: Optional. The function used to generate the VIDs. Currently, only the `hash` function is supported.
- `type`: Optional. The type of the VIDs. The default value is `string`.
- `prefix`: Optional. Adds a prefix to the original VID. When `function` is also specified, `prefix` is applied to the original VID before `function`.
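As a hedged illustration of `concatItems` (the data row is hypothetical): for a CSV row `101,Math,3,No5`, the example above concatenates the constant `"abc"` with column 1 (`Math`), and `function: hash` is then applied to the result:

```yaml
# Hypothetical row: 101,Math,3,No5
vid:
  concatItems:
    - "abc"        # constant string
    - 1            # column 1 of the row -> "Math"
  function: hash   # VID = hash("abcMath")
```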
`schema.vertex.tags`: Optional. Because a vertex can have several tags, different tags are described in this parameter.

Each tag contains the following two properties:

- `name`: The tag name.
- `props`: The properties of the tag. Each property contains the following fields:
  - `name`: Required. The property name, which must be the same as the tag property in Nebula Graph.
  - `type`: Optional. The property type. Currently, `bool`, `int`, `float`, `double`, `string`, `time`, `timestamp`, `date`, `datetime`, `geography`, `geography(point)`, `geography(linestring)`, and `geography(polygon)` are supported.
  - `index`: Optional. The column number in the CSV file.
  - `nullable`: Optional. Whether this property can be `NULL`. The optional values are `true` and `false`; the default is `false`.
  - `nullValue`: Optional. Ignored when `nullable` is `false`. The property is set to `NULL` when its value equals `nullValue`. The default is `""`.
  - `alternativeIndices`: Optional. Ignored when `nullable` is `false`. The property is fetched from the CSV according to the indices, in order, until a value not equal to `nullValue` is found.
  - `defaultValue`: Optional. Ignored when `nullable` is `false`. The default property value, used when all the values obtained by `index` and `alternativeIndices` equal `nullValue`.
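A hedged illustration of how `nullable`, `nullValue`, and `alternativeIndices` interact, using the `address` property from the example above (the data row is hypothetical): with `alternativeIndices: [7, 8]` and the default `nullValue` of `""`, the importer checks column 7 first and falls through to column 8.

```csv
101,Math,3,male,13800000000,a@b.com,x,,Room 201
```

Here column 7 is empty (equal to the default `nullValue`), so `address` is taken from column 8 (`Room 201`). If column 8 were also empty, `address` would be set to `NULL`, or to `defaultValue` if one were configured.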
NOTE: The properties in the preceding `props` parameter must be sorted in the same way as in the CSV data file.
`schema.edge`: Required. Describes the schema information for edges.
```yaml
schema:
  type: edge
  edge:
    name: choose
    srcVID:
      index: 0
      function: hash
    dstVID:
      index: 1
      function: hash
    rank:
      index: 2
    props:
      - name: grade
        type: int
        index: 3
```
The edge parameter contains the following fields:

- `name`: Required. The name of the edge type.
- `srcVID`: Optional. The source vertex information of the edge. The `index` and `function` here are the same as those in the `vertex.vid` parameter.
- `dstVID`: Optional. The destination vertex information of the edge. The `index` and `function` here are the same as those in the `vertex.vid` parameter.
- `rank`: Optional. Specifies the `rank` value of the edge. The `index` indicates the column number in the CSV file.
- `props`: Required. The same as the `props` in the vertex. The properties in the `props` parameter must be sorted in the same way as in the CSV data file.
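Given the configuration above (source VID in column 0, destination VID in column 1, rank in column 2, and `choose.grade` in column 3), a matching data row could look like the following (the values are hypothetical):

```csv
200,101,0,5
```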
See the Configuration Reference for details on the configurations.
Usually, you can add descriptions in the first row of the CSV file to specify the type of each column.

If `csv.withHeader` is set to `false`, the CSV file contains only data (no descriptions in the first row). Examples for vertices and edges are as follows:
Take the tag `course` for example:

```csv
101,Math,3,No5
102,English,6,No11
```
The first column is the vertex ID. The following three columns are the properties, corresponding to `course.name`, `course.credits`, and `building.name` in the configuration file (see `vertex.tags.props`).
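A hedged sketch of the vertex configuration this example implies (the column indices are assumptions based on the column order described above):

```yaml
schema:
  type: vertex
  vertex:
    vid:
      index: 0
    tags:
      - name: course
        props:
          - name: name
            type: string
            index: 1
          - name: credits
            type: int
            index: 2
      - name: building
        props:
          - name: name
            type: string
            index: 3
```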
Take the edge type `choose` for example:

```csv
200,101,5
200,102,3
```
The first two columns are the source VID and the destination VID. The third column corresponds to the `choose.grade` property. If an edge has a rank value, put it in the third column and then put the edge properties in order.
If `csv.withHeader` is set to `true`, the first row of the CSV file is the header.

The format of each column is `<tag_name/edge_name>.<prop_name>:<prop_type>`:
- `<tag_name/edge_name>` is the name for the vertex or edge.
- `<prop_name>` is the property name.
- `<prop_type>` is the property type. It can be `bool`, `int`, `float`, `double`, `string`, `time`, `timestamp`, `date`, `datetime`, `geography`, `geography(point)`, `geography(linestring)`, or `geography(polygon)`. The default type is `string`.
In the above `<prop_type>` field, the following keywords have special semantics:

- `:VID` is the vertex ID.
- `:SRC_VID` is the source vertex VID of the edge.
- `:DST_VID` is the destination vertex VID of the edge.
- `:RANK` is the rank of the edge.
- `:IGNORE` indicates that the column is ignored.
- `:LABEL` indicates that the column is marked as the insert/delete label (`+`/`-`).
NOTE: If the CSV file contains a header, the importer parses the schema of each row according to the header and ignores the `props` in the YAML file.
Take the vertex `course` as an example:

```csv
:LABEL,:VID,course.name,building.name:string,:IGNORE,course.credits:int
+,"hash(""Math"")",Math,No5,1,3
+,"hash(""English"")",English,"No11 B\",2,6
```
The `:LABEL` column takes the value `+` or `-` and indicates whether the row is an insertion (`+`) or a deletion (`-`) operation.
In the `:VID` column, in addition to common integer values (such as `123`), you can also use the built-in function `hash` to automatically generate VIDs for the vertices (for example, `hash("Math")` and `hash("English")`).
NOTE: Double quotation marks (") must be escaped in the CSV file. For example, `hash("Math")` must be written as `"hash(""Math"")"`.
```csv
course.name,:IGNORE,course.credits:int
Math,1,3
English,2,6
```
`:IGNORE` specifies a column to be ignored when importing data. All columns except the `:LABEL` column can be arranged in any order. Thus, for a large CSV file, you can flexibly select the columns you need by setting the header.

NOTE: Because a vertex can contain multiple tags, when specifying the header you must include the tag name. For example, it must be `course.credits`, not the abbreviated `credits`.
Take the edge `follow` for example:

```csv
:DST_VID,follow.likeness:double,:SRC_VID,:RANK
201,92.5,200,0
200,85.6,201,1
```
In the preceding example, the source vertex of the edge is in the `:SRC_VID` column (column 3), the destination vertex is in the `:DST_VID` column (column 1), the edge property is in the `follow.likeness:double` column (column 2), and the rank of the edge is in the `:RANK` column (column 4; the default value is 0 if it is not specified).
In the `:LABEL` column, `+` means insertion and `-` means deletion. Similar to vertices, you can specify the label in the header of the edge CSV file.
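A hedged sketch of an edge CSV file with a label column (the rows are hypothetical):

```csv
:LABEL,:SRC_VID,:DST_VID,:RANK,follow.likeness:double
+,200,201,0,92.5
-,201,200,1,85.6
```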