Evovest/EvoTrees.jl

Create MLJ-compliant doc strings

ablaom opened this issue · 5 comments

@jeremiedb Saw your recent doc PR and thought I'd flag this. I understand you may not have the bandwidth just now, but should be aware there is now an "official" format for these strings, so you don't work against it unwittingly.

Would you have reference / example regarding these docstring format?

Yes, we have both. See https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Document-strings .

If you also have a "native" interface for your MLJ objects, this can of course be worked into the doc string. Happy to review

@ablaom If you have time, let me know what you think of https://evovest.github.io/EvoTrees.jl/dev/. Models in particular are found under the Models section.

Working out the docs, I had a few observations:

  • StatsBase overloading for predict, predict!, fit, fit!. My understanding is that it's a best practice to import methods from more generic base methods when possible to avoid name conflicts. It seems like importing these functions from StatsBase would be an applicable scenario.
  • Models terminology: that's a trickier one to tackle as it's somewhat engrained in the MLJ syntax, yet I have discomfort about what is referred to as a model. For example, whether it's EvoTreeRegressor, GLM's LinearRegressor or MLJFlux's NeuralNetworkClassifier, I have a hard time associated these structs with "models". For example, none of these so called "models" could be used with the StatsBase's predict(model, X), which is also found both in R and Pythons worlds. I feel there's some ambiguity in MLJ world's between the above and the machine's fittedresults. It seems to come up also explicitly with the MLJFlux section title Accessing the Flux chain (model). I think it would be interesting to get other users perception on that matter as I'd tend to associate the notion of model with the GBTree object, or the fitted results of GLM or the Flux's chain. I pointed this out as I recall reading some time ago you were looking for some simplification refactoring and I thought this may be a little piece of that challenging puzzle.

@jeremiedb Immensely appreciate this effort, and the feedback. Currently swamped with reviews, but I will get to this eventually. I'm guessing what you have is already can be safely assumed a big improvement, so not super urgent, right?

@ablaom If you have time, let me know what you think of https://evovest.github.io/EvoTrees.jl/dev/. Models in particular are found under the Models section.

Working out the docs, I had a few observations:

  • StatsBase overloading for predict, predict!, fit, fit!.

A conscious decision. We did not want StatsBase.jl as a dependency of MLJModelInterface.jl, which is supposed to be super lightweight. StatBase has a number of dependencies itself which we did not want to impose on implementers of our API. Is this design decision causing you some grief or just the usual inconvenience of living with some name conflicts?

My understanding is that it's a best practice to import methods from more generic base methods when possible to avoid name conflicts. It seems like importing these functions from StatsBase would be an applicable scenario.

I think there is some growing feeling that it's okay for packages to own separate methods, or at least this is increasingly difficult to avoid universally. After all, the StatsBase API for predict and fit is not the same as the MLJ API, at least for now. The transform in StatsBase is not the same as the transform in DataFrames or MLJ, and so forth.

  • Models terminology: that's a trickier one to tackle as it's somewhat engrained in the MLJ syntax, yet I have discomfort about what is referred to as a model. For example, whether it's EvoTreeRegressor, GLM's LinearRegressor or MLJFlux's NeuralNetworkClassifier, I have a hard time associated these structs with "models". For example, none of these so called "models" could be used with the StatsBase's predict(model, X), which is also found both in R and Pythons worlds. I feel there's some ambiguity in MLJ world's between the above and the machine's fittedresults. It seems to come up also explicitly with the MLJFlux section title Accessing the Flux chain (model). I think it would be interesting to get other users perception on that matter as I'd tend to associate the notion of model with the GBTree object, or the fitted results of GLM or the Flux's chain. I pointed this out as I recall reading some time ago you were looking for some simplification refactoring and I thought this may be a little piece of that challenging puzzle.

I understand the objection, but I feel this is one of those terminological decisions it is difficult to get consensus about. In any case, I think that boat has sailed as far as MLJ is concerned. I think the docs are pretty consistent within MLJ, no? I mean "In MLJ a model is ..." is scattered pretty ubiquitously in the docs and tutorials, no?