[Feature][Mongodb] Using MongoDB for data migration: how to ensure that the data written to the target MongoDB is consistent with the original data
Search before asking
- I have searched the existing feature requests and found no similar requirement.
Description
When using MongoDB for data migration, how can we ensure that the data written to the target MongoDB is consistent with the original data? For example, suppose a MongoDB collection contains three records with the following fields:
1. id, name, age
2. id, name
3. id, name, gender
When SeaTunnel is used for data synchronization, a field mapping is required, and it has to list all fields: id, name, age, and gender. The first record has no gender field, but after it is written to the target database it gains an extra gender field with a null value. Similarly, the second record gains two null fields, age and gender. Both are clearly inconsistent with the original data. How can fields with null values be filtered out automatically when writing to the target database?
Secondly, specifying the field mapping in advance requires the user to know exactly which fields exist. Some collections were created by other developers, so the fields are not known when the synchronization is set up. A single collection may hold hundreds of millions of records, and it is not feasible to inspect each one, collect every field, and only then configure the mapping.
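The requested behaviour can be pictured as a step that drops null-valued fields from each mapped row just before it is written. The following is a minimal Python sketch, not SeaTunnel code: `strip_nulls` is a hypothetical helper, and the sample values are made up. It assumes that a null in a mapped row always means the field was absent in the source, so a field stored explicitly as null (or an originally empty sub-document) would be dropped too.

```python
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Recursively remove keys whose value is None from nested dicts.

    Hypothetical helper for illustration; not part of SeaTunnel.
    Sub-documents that become empty after stripping are removed as well,
    on the assumption that they were padded in by the schema mapping.
    """
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = strip_nulls(item)
            # Skip nulls, and skip sub-documents that only contained nulls.
            if item is None or (isinstance(item, dict) and not item):
                continue
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        # Keep list positions intact; only clean nested documents.
        return [strip_nulls(item) for item in value]
    return value


# The three records from the description, after mapping to id/name/age/gender.
mapped_rows = [
    {"id": 1, "name": "a", "age": 20, "gender": None},
    {"id": 2, "name": "b", "age": None, "gender": None},
    {"id": 3, "name": "c", "age": None, "gender": "f"},
]
restored = [strip_nulls(row) for row in mapped_rows]
```

After this step each row carries only the fields it had before the mapping padded it, which is the shape the target collection is expected to store.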
Usage Scenario
This is for MongoDB cluster data migration, where we want the migrated data to stay consistent with the original data. In MongoDB, the documents in a collection do not all have to contain the same fields. One record may contain five fields, while in the next record some of those fields would be null, so we do not write them to the database and the record ends up with only three fields. This is one of the advantages of MongoDB storage, but it means the number of fields differs from record to record.
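To make "consistent with the original data" checkable, one option is to compare the field set of each migrated record with its source record. The sketch below is a hypothetical Python check, not part of SeaTunnel or MongoDB; `added_fields` and the sample records are assumptions for illustration.

```python
from typing import Any, Dict, List


def added_fields(
    original: Dict[str, Any], migrated: Dict[str, Any], prefix: str = ""
) -> List[str]:
    """Return dotted paths of fields present in `migrated` but absent from `original`."""
    extra: List[str] = []
    for key, value in migrated.items():
        path = f"{prefix}{key}"
        if key not in original:
            extra.append(path)
        elif isinstance(value, dict) and isinstance(original[key], dict):
            # Descend into sub-documents that exist on both sides.
            extra.extend(added_fields(original[key], value, prefix=f"{path}."))
    return extra


# Made-up example: a two-field source record padded to four fields by the mapping.
original = {"id": 2, "name": "b"}
migrated = {"id": 2, "name": "b", "age": None, "gender": None}
drift = added_fields(original, migrated)
```

An empty result for every sampled pair would indicate that the migration did not add fields; a non-empty result lists exactly which fields were padded in.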
The configuration file is as follows:
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  MongoDB {
    uri = "mongodb://XXX.XX.0.XXX:20003/device"
    database = "device"
    collection = "oaidmd5_${num}"
    match.projection = "{id:0}"
    partition.split-key = "oaidmd5"
    partition.split-size = 2048
    schema = {
      fields {
        oaidmd5 = String
        age = {
          qtt = {
            0 = Int
            1 = Int
            2 = Int
            3 = Int
          }
        }
        brand = String
        gender = String
        model = String
        oaid = String
        osv = String
        upts = Int
        clk1 = {
          vip = {
            51 = Int
            ttc = Int
          }
        }
        interest = {
          5 = Double
          7 = Double
          18 = Double
          14 = Double
        }
        interest_1 = {
          9 = Double
        }
        interest_3 = {
          9 = Double
        }
        interest_7 = {
          5 = Double
          7 = Double
          9 = Double
        }
        interest_14 = {
          9 = Double
        }
        pkg_list = "array"
      }
    }
  }
}
sink {
  MongoDB {
    uri = "mongodb://xxx.xxx.xx.xxx:20003/device"
    database = "device"
    collection = "oaidmd5${num}"
    buffer-flush.max-rows = 2000
    buffer-flush.interval = 1000
    schema = {
      fields {
        oaidmd5 = String
        age = {
          qtt = {
            0 = Int
            1 = Int
            2 = Int
            3 = Int
          }
        }
        brand = String
        gender = String
        model = String
        oaid = String
        osv = String
        upts = Int
        clk1 = {
          vip = {
            51 = Int
            ttc = Int
          }
        }
        interest = {
          5 = Double
          7 = Double
          18 = Double
          14 = Double
        }
        interest_1 = {
          9 = Double
        }
        interest_3 = {
          9 = Double
        }
        interest_7 = {
          5 = Double
          7 = Double
          9 = Double
        }
        interest_14 = {
          9 = Double
        }
        pkg_list = "array"
      }
    }
  }
}
The result is shown in the figure:
Related issues
no
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct
I hope the community can provide some solutions. Thank you to every contributor in the open source community.
For MongoDB -> MongoDB, wouldn't using mongodump be better?
The MongoDB instance is not self-hosted; we are moving from Huawei Cloud to Volcano Cloud. There may be issues with using mongodump, and migrating with MongoDB's mongoexport and mongoimport commands is currently a bit slow. So I want to use SeaTunnel to read from the source and write to the target directly, but synchronizing MongoDB with SeaTunnel does not work very well at the moment.
Moreover, MongoDB is not a relational database, and one of its advantages is that fields with null values do not need to be stored, which means the fields can differ from record to record. Synchronizing MongoDB through a field mapping therefore raises two problems. First, all fields have to be known in advance. Second, after synchronization there will be a large number of null-valued fields, which stores a lot of useless data. So I feel that synchronizing MongoDB data through a mapping may be problematic.
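For the first problem, not knowing the fields in advance, a possible workaround is to derive the field list from a sample of documents instead of inspecting every record. This is a rough Python sketch under that assumption: `collect_field_paths` is a hypothetical helper, and the input could be any iterable of documents, for example a cursor over a random sample taken with MongoDB's `$sample` aggregation stage. Sampling can miss rarely used fields, so the result is a starting point rather than a guaranteed complete schema.

```python
from typing import Any, Dict, Iterable, Set


def collect_field_paths(documents: Iterable[Dict[str, Any]]) -> Set[str]:
    """Return the union of dotted field paths seen across the given documents."""
    paths: Set[str] = set()

    def walk(doc: Dict[str, Any], prefix: str) -> None:
        for key, value in doc.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict) and value:
                # Non-empty sub-document: record its leaf fields instead.
                walk(value, f"{path}.")
            else:
                paths.add(path)

    for doc in documents:
        walk(doc, "")
    return paths


# Made-up sample shaped like the three records in the description.
sample = [
    {"id": 1, "name": "a", "age": 20},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "gender": "f"},
]
fields = collect_field_paths(sample)
```

The resulting set could then be turned into the `schema.fields` block by hand. It does not address the second problem, the null padding on write.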