relatio-nlp/relatio

output_path is also an input_path?

Closed this issue · 3 comments

Thanks for your work on this. It's been fun to play with. I was trying to build a narrative model of tweets using the glove-twitter-200 embedding from Gensim, and I got the following error:

ValueError: X has 200 features, but KMeans is expecting 300 features as input.

Did I miss an option, or does the current version require 300-dim embeddings?

I figured it out.

It had to do with me using the output_path option in the build_narrative_model() function, which, it turns out, is BOTH an output path AND an input path. It was reading in the kmeans object from a .pkl file. Since that old file was created w/ the 300-dim embedding I used on earlier runs, we got a dimension mismatch.
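
For anyone who hits the same thing, here is a minimal sketch of the load-if-present pattern that seems to be at work. It is not relatio's actual code: `fit_centroid` and `load_or_fit` are hypothetical names, and a plain mean vector stands in for the pickled kmeans object so the example needs only the standard library.

```python
import os
import pickle
import tempfile


def fit_centroid(vectors):
    """Hypothetical stand-in for a clustering model: the mean of the training vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def load_or_fit(cache_path, vectors):
    """Reuse a pickled model if the file exists; otherwise fit one and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # the vectors passed in are ignored on this branch
    model = fit_centroid(vectors)
    with open(cache_path, "wb") as f:
        pickle.dump(model, f)
    return model


cache_path = os.path.join(tempfile.mkdtemp(), "kmeans.pkl")

# First run: 300-dimensional vectors, so the file on disk holds a 300-dimensional model.
first_model = load_or_fit(cache_path, [[0.0] * 300, [1.0] * 300])

# Second run: 200-dimensional vectors, but the stale 300-dimensional model is returned.
second_model = load_or_fit(cache_path, [[0.0] * 200, [1.0] * 200])

stale_dim = len(second_model)
```

The second call never looks at the new vectors, so any later step that feeds 200-dimensional data to the returned model would fail with a feature-count mismatch like the one above.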

I had wrongly assumed it would overwrite the old file with the new one, rather than read in the old file. Is that fact documented somewhere? I had to work my way through the code to figure it out. If you want to keep the functionality, I would document it more explicitly, and probably rename the option to something other than "output", because that seems misleading.
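
One possible way to make the behavior explicit, offered only as a sketch: check the cached model's dimensionality before reusing it, and let the caller opt out of reuse. The function name `load_cached_model`, the `overwrite` flag, and the `{"dim": ..., "centroids": ...}` layout of the pickled object are all assumptions for illustration, not part of relatio's API.

```python
import os
import pickle
import tempfile


def load_cached_model(cache_path, expected_dim, overwrite=False):
    """Return the cached model, or None if the caller should refit from scratch."""
    if overwrite or not os.path.exists(cache_path):
        return None
    with open(cache_path, "rb") as f:
        cached = pickle.load(f)
    if cached["dim"] != expected_dim:
        # Fail early with a message that points at the file, not at the clustering step.
        raise ValueError(
            f"Cached model at {cache_path} expects {cached['dim']} features, "
            f"but the current embedding has {expected_dim}. "
            "Delete the file or pass overwrite=True."
        )
    return cached


# Simulate a leftover file from an earlier run with a 300-dimensional embedding.
cache_path = os.path.join(tempfile.mkdtemp(), "kmeans.pkl")
with open(cache_path, "wb") as f:
    pickle.dump({"dim": 300, "centroids": []}, f)

# A 200-dimensional run now fails with an explanation instead of a bare mismatch.
try:
    load_cached_model(cache_path, expected_dim=200)
    message = ""
except ValueError as err:
    message = str(err)

# With overwrite=True the stale file is ignored and the caller refits.
refit_needed = load_cached_model(cache_path, expected_dim=200, overwrite=True) is None

# A matching dimension still reuses the cache, which keeps the time savings.
same_dim_model = load_cached_model(cache_path, expected_dim=300)
```

This keeps the ability to resume long runs while making the "also an input path" behavior visible at the point where it matters.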

Hi Patrick,

Thanks for your feedback. I agree this isn't straightforward and should be documented.

This feature mainly comes from our applications to very large datasets, where building the model may take quite some time. We wanted to save the progress made along the way.

Best,

PS: As a disclaimer, we are currently redesigning the user API, so this feature is likely to disappear in the near future.

I definitely like the ability to save the output so I don't have to rerun the model. I was just surprised by how it worked. I assumed it just pickled the final narrative_model output and that was it. The behavior with the kmeans model was surprising. I'm all sorted now, but I could easily imagine someone else making this mistake.