GRAAL-Research/deepparse

Retraining With Our Own Data

zubairshahzadarain opened this issue · 5 comments

Dear Team
I have an address dataset for countries that are not on your list.
I want to train a model, but I am not sure how to prepare the data so I can use it for training.
I have an address dataset, but how can I assign tags to all the addresses? There are 4.5 million of them.
I have seen the sample file you use to tag every word in an address, but is there a way to tag the whole dataset?
Please guide me.

Thanks
Sorry about my English.

Thank you for your interest in improving Deepparse.

Can you explain how I can prepare a dataset from my addresses?
How can I tag 4.5 million addresses?
I have the addresses.

Hi @ZubairShahzad,

We offer code examples to

  1. retrain a parsing model, which shows how to use annotated (i.e., already parsed) addresses to improve performance, and
  2. retrain with new prediction tags, which shows how to change the tag set and retrain the last prediction layer so it predicts the tags that fit your data (see the sketch after this list).

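To make this concrete, here is a minimal retraining sketch. It assumes your annotated data is saved as a pickled list of `(address, tag_list)` tuples in a hypothetical file named `my_country_train.p`; the exact arguments may differ slightly between Deepparse versions, so check the retraining examples in the docs.

```python
# Minimal retraining sketch. The pickled file is assumed to contain tuples such as:
#   ("350 rue des Lilas Ouest Quebec Quebec G1L 1B6",
#    ["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
#     "Municipality", "Province", "PostalCode", "PostalCode"])
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# "my_country_train.p" is a hypothetical file name.
training_container = PickleDatasetContainer("my_country_train.p")

address_parser = AddressParser(model_type="bpemb", device="cpu")  # or a GPU index

address_parser.retrain(
    training_container,
    train_ratio=0.8,   # 80% train / 20% validation split
    epochs=5,
    batch_size=32,
    logging_path="./retrained_checkpoints",
    # To change the tag set (example 2 above), also pass a prediction_tags dict
    # mapping each tag to an index and including an "EOS" tag, e.g.
    # prediction_tags={"ATag": 0, "AnotherTag": 1, "EOS": 2}.
)
```
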
Regarding a strategy for developing a dataset for your country, I would recommend a bootstrapping approach. Namely,
start by parsing something like a thousand addresses, manually fix the errors, retrain the model with those corrected examples, then parse a new batch of addresses, validate their annotations, retrain again, and so on until performance is good enough. Each round should improve performance and reduce the time needed to validate the parsed addresses. Depending on the country of the addresses, other tricks (e.g. domain transfer) could also speed up the annotation. A rough sketch of one round of this loop follows below.
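
For illustration only, one round of that bootstrapping loop could look like the sketch below. The file names, the batch size of 1,000, and the sample addresses are assumptions, not part of the Deepparse API.

```python
# Illustrative bootstrapping round: pre-annotate a batch with the current model,
# export it for manual correction, then retrain on the corrected file.
import pickle

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device="cpu")

# Placeholder for your raw, unlabeled addresses.
unlabeled_addresses = [
    "350 rue des lilas ouest quebec quebec g1l 1b6",
    "777 brockton avenue abington ma 2351",
]

batch = unlabeled_addresses[:1000]
parsed = address_parser(batch)  # pre-annotate the batch with the current model

# Dump the model's guesses as (address, tag_list) tuples so they can be
# reviewed and fixed by hand before retraining.
pre_annotations = [
    (p.raw_address, [tag for _, tag in p.address_parsed_components])
    for p in parsed
]
with open("round_1_to_review.p", "wb") as f:
    pickle.dump(pre_annotations, f)

# ...after the manual corrections are saved to "round_1_corrected.p"...
corrected_container = PickleDatasetContainer("round_1_corrected.p")
address_parser.retrain(corrected_container, train_ratio=0.8, epochs=5, batch_size=32)
```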

After that, if you are willing to share the dataset, we would be more than happy to include it in our public one available here.

Hello @ZubairShahzad,

I think the strategy proposed by @davebulaval is your best bet. If I could add one thing, it would be to clean your data a little by removing unnecessary punctuation and lowercasing your addresses to better match the formatting of the original dataset (see the sketch below). This should limit model errors due to differences in address structure.
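
As an example of that kind of cleanup, a small helper along these lines could be applied before annotation or parsing; `clean_address` is an illustrative name, not part of Deepparse.

```python
import re

def clean_address(address: str) -> str:
    """Lowercase an address and strip common punctuation so it better matches
    the formatting of the original training data (illustrative helper)."""
    address = address.lower()
    address = re.sub(r"[,;.#]", " ", address)        # drop common punctuation
    address = re.sub(r"\s+", " ", address).strip()   # collapse extra whitespace
    return address

print(clean_address("350, Rue des Lilas Ouest; Québec"))
# -> "350 rue des lilas ouest québec"
```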

As pointed out by @davebulaval, if you are willing to share your data, we could also recommend a preprocessing strategy.

This issue is stale because it has been open 60 days with no activity.
Stale issues will automatically be closed 30 days after being marked Stale.