GRAAL-Research/deepparse

Question: Importing the model locally causes errors with AddressParser

steve-itb opened this issue · 6 comments

Hi,

Thank you for making this library.
I do have a question around the setup. I am not always online when i run the script and I notice the library goes to the web to download the latest model. I have stored the model in a local folder and want to import it into my working environment but i get the following

str_path = os.path.join(os.getcwd(), "mod", "retrained_modeltype_address_parser.ckpt")
address_parser = AddressParser(model_type="bpemb", device="cpu", path_to_retrained_model=str_path)

And get the error:

RuntimeError: You are not using the proper retrained checkpoint. When we retrain an AddressParser, by default, we create a checkpoint named 'retrained_modeltype_address_parser.ckpt'. Be sure to use that checkpoint since it includes some metadata for the reloading. See AddressParser.retrain for more details.

I am not a deep learning expert, but I used PyTorch to import the checkpoint and it seemed OK.

Any help you can recommend would be great

Thanks

Thank you for your interest in improving Deepparse.

Hello @steve-itb,

You are getting an error because the handling of our pretrained models and that of user-retrained ones isn't the same. As for running inside your environment, as long as you successfully run the parser once, you should be fine: the library will simply use the cached model if it cannot retrieve the latest one (i.e., if you're offline).

If it's impossible for you to be online even once for the first download, you can download the model yourself and put it in ~/.cache/deepparse; that way, the library will know where to find it.
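For reference, you can check whether the cache is populated with a few lines of standard-library Python before instantiating the parser. This is only a sketch: the cache location and file names are taken from this thread, and the AddressParser call is left commented out since it requires deepparse to be installed.

```python
import os

# Deepparse caches its pre-trained weights under ~/.cache/deepparse
# (per this thread). expanduser works on both Windows and Unix.
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "deepparse")

# Files one would expect for the bpemb model after a successful first
# download (names taken from the listing later in this thread).
expected = ["bpemb.ckpt", "bpemb.version"]
missing = [f for f in expected
           if not os.path.exists(os.path.join(cache_dir, f))]

if missing:
    print(f"Model files missing from cache: {missing}")
else:
    # With the cache populated, no path_to_retrained_model is needed;
    # the library falls back to the cached copy when offline.
    # from deepparse.parser import AddressParser
    # parser = AddressParser(model_type="bpemb", device="cpu")
    print("Cache looks complete; AddressParser can run offline.")
```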

Hello @steve-itb, as the error message states, you are not using the proper checkpoint for Deepparse. We export not only the model weights but also other components, such as a tag_converter. Thus, even if you can load this specific checkpoint in Torch, it does not contain enough information for Deepparse. In the same directory as the checkpoint you are trying to load, you should see one named something like 'retrained_modeltype_address_parser.ckpt' (where modeltype is the model type; in your case, based on the code provided, bpemb). You need to use that one.
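To illustrate why loading the raw weights is not enough, here is a toy sketch of the kind of completeness check behind the error above. The key names (address_tagger_model, model_type, tags_to_idx) are hypothetical, not deepparse's actual checkpoint schema:

```python
# Toy illustration: a retrained deepparse checkpoint is a dict that
# bundles weights AND metadata. All key names here are hypothetical.
retrained_checkpoint = {
    "address_tagger_model": {"layer.weight": [0.1, 0.2]},  # the weights
    "model_type": "bpemb",  # which architecture to rebuild
    "tags_to_idx": {"StreetNumber": 0, "StreetName": 1},  # tag converter
}

# A bare weights file (what loading a plain model checkpoint gives you)
# lacks the metadata, so the library cannot reconstruct the parser.
bare_weights = {"layer.weight": [0.1, 0.2]}

def is_valid_retrained_checkpoint(ckpt):
    """Return True only if the metadata needed for reloading is present."""
    required = {"address_tagger_model", "model_type", "tags_to_idx"}
    return isinstance(ckpt, dict) and required <= ckpt.keys()

print(is_valid_retrained_checkpoint(retrained_checkpoint))  # True
print(is_valid_retrained_checkpoint(bare_weights))          # False
```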

Why don't we remove the other checkpoint? Because we want to allow retraining.
Could we handle this kind of error by scanning the directory for a valid checkpoint? I am unsure whether that would be too much error handling and would create wrong expectations about our API.
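The directory-scan idea could look like this minimal stdlib sketch; the filename pattern follows the 'retrained_modeltype_address_parser.ckpt' convention quoted in the error message, and the demo runs in a throwaway directory:

```python
import glob
import os
import tempfile

def find_retrained_checkpoints(directory, model_type="*"):
    """Scan a directory for files matching deepparse's naming
    convention 'retrained_<modeltype>_address_parser.ckpt'."""
    pattern = f"retrained_{model_type}_address_parser.ckpt"
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(directory, pattern)))

# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("retrained_bpemb_address_parser.ckpt",
                 "retrained_fasttext_address_parser.ckpt",
                 "bpemb.ckpt"):  # a base-model file; must NOT match
        open(os.path.join(tmp, name), "w").close()

    found = find_retrained_checkpoints(tmp, model_type="bpemb")
    print(found)  # ['retrained_bpemb_address_parser.ckpt']
```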

Do you have suggestions on how we could make the error message clearer, @steve-itb?

Hi both,

Thanks for taking the time to look into this.
I looked in the .cache folder and found the following:

C:\Users\foo\.cache\deepparse
    multi
    bpemb.ckpt
    bpemb.version

I then tried to force the AddressParser to use the locally cached model:

AddressParser(model_type="bpemb", device="cpu", path_to_retrained_model="C:/Users/foo/.cache/deepparse/bpemb.ckpt")

I am still getting the same error. Could you tell me how to force the code to look only locally? I am not always offline, but sometimes I am behind a firewall, so the download fails.

To improve the error, I suppose it could be more explicit about how to solve the problem. I admit I am only a passable Python programmer, so it's possible this error message is perfectly fine. If I saw something like this in a language I was more proficient in, I would change it to read something like:

RuntimeError: It appears you are not using the proper retrained checkpoint. Has the model downloaded correctly into ~/.cache/deepparse? Perhaps you could try loading it from the cache directly using AddressParser(model_type="bpemb", device="cpu", path_to_retrained_model="C:/Users/foo/.cache/deepparse/bpemb.ckpt")
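A suggestion along these lines could be implemented as an exception whose message embeds the actionable hint. Everything below is a sketch, not deepparse code; the class name and constructor are hypothetical:

```python
class RetrainedCheckpointError(RuntimeError):
    """Hypothetical error variant carrying an actionable hint."""

    def __init__(self, model_type, cache_dir="~/.cache/deepparse"):
        message = (
            "It appears you are not using the proper retrained checkpoint. "
            f"Has the '{model_type}' model downloaded correctly into "
            f"{cache_dir}? If you are using a pre-trained model, omit the "
            "path_to_retrained_model argument entirely, e.g. "
            f"AddressParser(model_type='{model_type}', device='cpu')."
        )
        super().__init__(message)

# The hint names the model type and the fix in one message.
try:
    raise RetrainedCheckpointError("bpemb")
except RetrainedCheckpointError as err:
    print(err)
```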

Hello @steve-itb,

Thanks for your feedback!

The issue is that you're trying to use our base pre-trained models as if you had retrained them yourself, which is not the case.

So essentially, there are two ways to use Deepparse. The first is to use the models we have already trained on our data, out of the box; in that case, you should not pass any value for the path_to_retrained_model argument, and you just need the files in .cache\deepparse. The second is to retrain our models on your own data; in that case, you start by retraining a model (using the retrain() method of the AddressParser) and then pass the path to the retrained checkpoint.
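In code, the two modes differ only in whether path_to_retrained_model is passed. A minimal sketch of that decision (the keyword names mirror the AddressParser calls shown earlier in this thread; the helper itself is hypothetical):

```python
def address_parser_kwargs(model_type, device="cpu", retrained_path=None):
    """Build AddressParser keyword arguments for the two modes:
    pre-trained (no path) vs. retrained (path to your checkpoint)."""
    kwargs = {"model_type": model_type, "device": device}
    if retrained_path is not None:
        # Only a checkpoint produced by AddressParser.retrain() belongs
        # here, never the base bpemb.ckpt from the cache.
        kwargs["path_to_retrained_model"] = retrained_path
    return kwargs

# Pre-trained, out of the box (files already cached in ~/.cache/deepparse):
print(address_parser_kwargs("bpemb"))

# Retrained on your own data:
print(address_parser_kwargs(
    "bpemb", retrained_path="./retrained_bpemb_address_parser.ckpt"))
```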

In your case, since you already have the files in .cache\deepparse and are not looking to retrain on your own data, simply omit the path_to_retrained_model argument.

Does this help clarify things?