Check for conflicts / duplications [Feature request]
Closed this issue · 11 comments
I have just found this plugin and it is really useful for inserting additional records in Geopackage layers, and probably other database types. THANKS!
I would like to suggest a feature to check if the record exists - I have a table that has an 'external key' / foreign key that corresponds with the data that I am trying to append and it would be great if the I could select a column in the two layers for the plugin to check 'if exists'. If the plugin could exclude the existing rows, and only append new rows that would be fantastic.
Hi @walking-the-talk !
thanks for the message and your suggestion. It would be certainly a nice addition.
Glad you find the plugin useful.
Regards,
Germán
@walking-the-talk Are you willing to help testing this new feature? I'm currently implementing this.
Hi @gacarrillor
Very happy to help - let me know how I can assist. I am very good at breaking things :-)
Regards
I attach a beta version of the plugin that supports the following actions when duplicate features are found:
- Skip duplicate feature
- Update duplicate feature
- Append feature, nonetheless
Please give it a try and let me know any suggestion you have or issue you find.
Got it, installed it, partially tested it...
This is excellent once I worked out the terminology. I would suggest using the term input / output 'matching key field' or 'field to compare' instead of just 'input field' / 'output field'. Or a hint that it can compare a field within the input and output layers.
However, the plugin does not read the input file to provide the list of fields if it is an external file (e.g. csv). First time I tried it imported duplicates because I only had an output field selected (as the dropdown didn't show the fields within the csv so I left it blank, hoping that it would find the 'key' in both layers if they had the same name).
I worked out two ways to get my data to be processed correctly:
1: add the csv to the project as a delimited text layer, then run the plugin.
2: select the destination (output) layer from within the project as the input layer first, then select the external file - the dropdown retains the list of fields from the layer within the project, and this works if I select the name of the corresponding column in the csv!
Number 2 is probably not what you intended. Can you read the fields of an external file when it is selected and populate the dropdown? In my case I have a csvt file from when I originally imported the csv, Maybe that would help with determining the field types, but maybe that becomes more complex?
Could the plugin read, for example the dbf of a shapefile set, or should you advise the user to add the layer to the project before running the plugin? Probably easiest to only use layers within the project, and you could then disable selecting an external layer?
I can confirm that it was able to update and skip the existing records according to the option selected - this is REALLY useful and more than I'd requested!! It can also ignore records and add them anyway. Which I guess is fine, but not sure whether anyone would want to do that?
My layers have a 36 character text key, and one has 500 records - the plugin coped with that no problem. It is now viable to append records from this csv that incrementally changes as more field data is collected, without fear of duplicate data being added.
Suggestions to make it even better:
It may be worth adding a check routine that if input or output field is has been selected, ask user to confirm before running process?
Currently no check that the matching field is the same type (so duplicates can be imported if the wrong matching fields are selected) - this would be a good way of catching silly mistakes by the user, but obviously not foolproof.
Happy to try harder to break it, if you can suggest ways of making life more difficult for the plugin.
Great work, thank you.
Thanks for the feedback!
This is excellent once I worked out the terminology. I would suggest using the term input / output 'matching key field' or 'field to compare' instead of just 'input field' / 'output field'. Or a hint that it can compare a field within the input and output layers.
I agree, just didn't figure out what name to use, but I'll take one of your suggested names.
Oh, what a problem with the CSV! Actually, it should be solved in QGIS. I imagined the algorithm would work only for QGIS layers/tables, I'll see if I can prevent users from loading external files from the algorithm GUI.
I can confirm that it was able to update and skip the existing records according to the option selected - this is REALLY useful and more than I'd requested!! It can also ignore records and add them anyway. Which I guess is fine, but not sure whether anyone would want to do that?
Yes, in that case one wouldn't even choose input and output unique-values fields and use the current behaviour of the plugin (i.e., always append). I'll give this option a second thought and probably remove it.
My layers have a 36 character text key, and one has 500 records - the plugin coped with that no problem. It is now viable to append records from this csv that incrementally changes as more field data is collected, without fear of duplicate data being added.
Happy to read that!
It may be worth adding a check routine that if input or output field is has been selected, ask user to confirm before running process?
Currently no check that the matching field is the same type (so duplicates can be imported if the wrong matching fields are selected) - this would be a good way of catching silly mistakes by the user, but obviously not foolproof.
I don't know if processing algorithms allow such interaction. I do check a couple of things before running the algorithm, though:
- If you choose an "action on duplicate", you should select input/output fields.
- If you choose input/output fields, you should select an "action on duplicate".
If such conditions are not met, an error/warning message is displayed in the algorithm log, telling the user what parameter combination is expected.
I'm not sure about the field type check. For instance, the algorithm is flexible with geometry types. You can import multi-part layers to single-part layers or even polygon layers to line layers with no effort. I think it might be useful to be flexible on field types as well, although I didn't really test if the algorithm runs well with different field types in input/output fields.
I like how this plugin is evolving, I think this new functionality will help QGIS users. Thanks again!
I'm getting back to these changes. I'd like to finally publish them in a new release of AppendFeaturesToLayer.
As for the issues you were facing when selecting a CSV as external file, QGIS now handles it in a correct way. So the fields of the CSV are listed (that's done by Processing, I had nothing to do with it). Also the case of the .dbf works as expected, listing the fields of the external file. Problem solved!
I'm also writing unit tests for this plugin as the combination of possibilities gives a large number of scenarios. I'm into that process.
Performance will also be improved dramatically by avoiding QgsVectorLayer.addFeatures()
and using QgsVectorDataProvider.addFeatures()
instead.
Are you still around @walking-the-talk? Are you available to give the plugin a new round of tests?
Hi @gacarrillor
I'm still around and would be happy to try testing the plugin. Pleased to hear that the problem with external files has been solved (and that you didn't have to spend time working on it!)
Please let me know any scenarios you'd like testing, and I can also use my imagination to see if I can use it in a way you haven't anticipated (again) :-)
great! Could you please test the attached version of the plugin? You can use your imagination and try to catch unforeseen errors or strange things!
Apologies for the delay in testing - it has been a busy month!
I have given the new version of the plugin a few tricky puzzles to play with and so far I can't make it break :-)
Here are a few Logs so you can see what I asked - I forgot to save the logs for some tests.
I have tried duplicates, files not already loaded to QGIS (I know that's a core feature rather than the plugin!). I've also used the basic model to transpose fields. I have removed some data within these logs relevant to the location of files on my hard drive - just because I'm paranoid.
The one 'problem' is where the two fields don't actually match (for example I selected KEY in one layer and FID in another) - because KEY is a string and FID is an integer, there are no duplicates found (expected result!). However, this is something you highlighted earlier so I just wanted to confirm that there is no 'fail safe' confirmation to the user if the field types are different.
Append-test#1.txt
Append-test#2.txt
Append-test#3.txt
Append-test#4.txt
Append-test-model#1.txt
Great work. Much appreciated and a useful tool for me as I commonly have incremental datasets and I need to make sure I don't duplicate things through being stupid.
Best Wishes.
Thanks a lot for the review!
As for the field type issue, I'll add a flexible approach there and try to compare values in a smarter way to find duplicate. I think it could be useful.
Regards.