JuliaStats/Roadmap.jl

Machine Learning Roadmap

Closed this issue · 138 comments

Currently, the development of machine learning tools is spread across several different packages with little coordination. Consequently, some efforts are duplicated, while some important aspects remain missing.

Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:

  • Generalized Linear Models

    • Linear Regression
    • Logistic Regression
    • Lasso, Elastic Net, and their variants
    • Stochastic Gradient Descent

    Current efforts: GLMNet, GLM, Regression

  • Support Vector Machines

    Current efforts: SVM, LIBSVM

  • DimensionalityReduction

    • PCA
    • ICA
    • CCA
    • Linear Discriminant Analysis
    • Kernel-based methods

    Current efforts: DimensionalityReduction

  • Non-negative Matrix Factorization

    This could be categorized under dimensionality reduction. However, NMF in itself has a plethora of methodologies, and thus deserves a separate package.

  • Classification

    There are many techniques for classification. It may be useful to have separate packages for the respective techniques (e.g. GLM, SVM, kNN), and a meta-package Classification.jl to incorporate them all.

  • Clustering

    Current efforts: Clustering.jl

  • Many machine learning applications also require some supporting functionality, such as performance evaluation, data preprocessing, etc. These can all go into MLBase.

  • Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)

    This is a huge field in itself, and may be discussed separately.

cc: @johnmyleswhite @dmbates @simonster @ViralBShah

I created an NMF.jl package, which is dedicated to non-negative matrix factorization.

Also, a detailed plan for DimensionalityReduction is outlined here.

I agree with all of this. I've got a lot of prototype SGD code already.

I like the idea of meta-packages. If we're going to have Classification.jl, maybe Regression.jl should be a similar meta-package?

I'm not an expert in this area, but I've been interested for a while and am willing to help.

@johnmyleswhite: Will you please move Clustering, SVM, and DimensionalityReduction over to JuliaStats? These are very basic for machine learning. I recently got some time to work on them.

For regression, when there are several quite different techniques implemented, it will make sense to make a meta package.

I transferred Clustering and SVM over. I'm going to announce that I'm moving DimensionalityReduction over, then we can go ahead and make the move tomorrow.

Also, I think it is important to separate packages that provide core algorithms from those integrated with DataFrames.

We may consider providing tools so that data frames work nicely with machine learning algorithms. However, I think core machine learning packages should not depend on DataFrames -- data frames are not used as frequently in machine learning.

I agree completely. I would very strongly prefer that we implement integration with DataFrames in the following way throughout all packages:

  • Packages should always define algorithms that operate on Vector{Float64} and Matrix{Float64}.
  • DataFrames.jl exposes a set of tools via formulas that translate between DataFrame and Matrix{Float64}.

This makes it easy to work with pure numerical data without any dependencies on DataFrames, while making it easy for people working with DataFrames to take advantage of the core ML algorithms by efficiently translating DataFrames into matrices.
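For concreteness, here is a rough sketch of that split (the toy data and the use of kmeans are made up for illustration; ModelFrame/ModelMatrix are the DataFrames tools referred to above, using the formula syntax of the time):

using DataFrames, Clustering

# Toy data frame standing in for a user's data
df = DataFrame(y = randn(100), x1 = randn(100), x2 = randn(100))

# DataFrames side: formula -> plain Matrix{Float64}
mf = ModelFrame(y ~ x1 + x2, df)
X  = ModelMatrix(mf).m           # numeric design matrix; nothing downstream needs DataFrames

# Core-algorithm side: works on matrices only (Clustering.jl expects observations as columns)
result = kmeans(X', 3)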

The only hiccup with what I just described is deciding where the interfaces that mix DataFrames + ML should live. Arguably there should be one big package that does all of this by wrapping the other ML packages with a DataFrames interface.

@johnmyleswhite are there issues with providing these in DataFrames.jl?

Providing what?

Sorry, I seem to have misread part of your comments. I agree with your suggestions.

I'm just not sure whether we really need another meta-package to couple DataFrames and ML, if the tools provided in DataFrames are convenient enough.

You're right: we could encourage users to explicitly call the DataFrame -> Matrix conversion routines. That would simplify things considerably.

The two main difficulties with this approach:

  • Getting the community to adopt this kind of strategy consistently.
  • Dealing with packages that legitimately need additional information to do their work. In GLM, for example, the entire model estimation step needs nothing more than access to the design matrix. But presenting the results in a convenient way requires access to information about the original coefficient labels.

For GLM, my idea is to have two packages:

  1. A package that provides the core algorithms that only work with numerical arrays.
  2. A higher-level package, built on top of the core package, that provides a friendlier interface. (This package may depend on DataFrames.)

So this is basically your idea of having a higher-level package that relies on core ML packages + DataFrames to provide useful tools for analyzing data frames.

On my phone right now, but weren't there some CART/Random Forest packages, if not in METADATA then at least mentioned on the mailing list?
One thing about those is that they can use factors quite well, so I imagine they would be directly dependent on DataFrames, as that is the package of choice for representing that kind of data. So when talking about best practices etc. it might be worth keeping in mind that some packages might really be most efficiently built on top of DataFrames instead of the Matrix{Float64} abstraction.

Decision trees, by their nature, can work on heterogeneous data (each observation may be composed of variables of different kinds). For such methods, implementation based on DataFrames makes sense.
I don't mind a decision tree package depending on DataFrames.jl

There do exist a large number of machine learning methods (e.g. PCA, SVM, LASSO, K-means, etc.) that are designed to work with real vectors/matrices. Heterogeneous data need to be converted to numerical arrays before such methods can be applied. Packages that provide such methodologies are encouraged to be independent of DataFrames.

You're right: there's a DecisionTree package.

To me, working with factors is actually a really strong argument for pushing a representation of categorical data into an earlier layer of our infrastructure like StatsBase. But we're actively debating ways to do this in JuliaStats/DataArrays.jl/issues/73.

If we could avoid some of the issues @simonster raised in his issue, I think it would be a big help to move the representation of categorical data closer to Julia's Base.

Also worth keeping in mind that nominal data is often handled using dummy variables, which do fit in the Matrix{Float64} abstraction. That's actually how GLM handles those kinds of variables.

If DecisionTree.jl needs DataFrames.jl, I fully agree with Dahua: that's not a problem. But if it only needs a simpler abstraction, pushing things towards that simpler abstraction seems desirable.
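To make the dummy-variable point concrete, here is a minimal hand-rolled sketch (this is not GLM.jl's actual code path, which goes through ModelMatrix; the function name is made up):

# Expand a categorical vector into 0/1 indicator columns that fit the
# Matrix{Float64} abstraction; also return the level ordering for labeling.
function dummy_encode(x::Vector)
    levels = unique(x)
    X = zeros(length(x), length(levels))
    for (i, v) in enumerate(x), (j, lev) in enumerate(levels)
        X[i, j] = (v == lev) ? 1.0 : 0.0
    end
    return X, levels
end

X, levels = dummy_encode(["a", "b", "a", "c"])   # 4x3 Matrix{Float64}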

There are some cases where Matrix{Float64} is too specific an abstraction. I have experimented with fitting point process GLMs in Julia, where the design matrix is theoretically expressible as a Matrix{Float64}, but it would require a huge amount of memory (for my models, probably >100 GB). On the other hand, it is easy to express the design matrix as an AbstractMatrix{Float64} that efficiently implements A_mul_B! and At_mul_B!. I wrote code that does this and directly minimizes the negative log likelihood via L-BFGS using NLopt, which fits my model in a reasonable amount of time with reasonable memory requirements, but I'm not sure what to do with this code, since the GLM package is still about 3x faster with a Matrix{Float64} (for the benchmark included with the GLM package with the same convergence criterion, excluding the non-negligible time to construct the ModelFrame).
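As an illustration of that pattern (the lagged structure below is made up, not the actual point-process design; A_mul_B!/At_mul_B! were the in-place product names at the time), a lazy design matrix only needs size, indexing, and the two products to be usable by a matrix-free solver:

# A design matrix whose columns are lagged copies of a signal x; it is never
# materialized as a dense Matrix{Float64}.
immutable LaggedDesign <: AbstractMatrix{Float64}
    x::Vector{Float64}   # raw signal
    nlags::Int           # number of lagged columns
end

Base.size(A::LaggedDesign) = (length(A.x), A.nlags)
Base.getindex(A::LaggedDesign, i::Int, j::Int) = i >= j ? A.x[i - j + 1] : 0.0

# y = A * v, computed without forming A
function Base.A_mul_B!(y::Vector{Float64}, A::LaggedDesign, v::Vector{Float64})
    fill!(y, 0.0)
    for j in 1:A.nlags, i in j:length(A.x)
        y[i] += A.x[i - j + 1] * v[j]
    end
    return y
end

# y = A' * v, computed without forming A
function Base.At_mul_B!(y::Vector{Float64}, A::LaggedDesign, v::Vector{Float64})
    fill!(y, 0.0)
    for j in 1:A.nlags, i in j:length(A.x)
        y[j] += A.x[i - j + 1] * v[i]
    end
    return y
end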

As far as the model fitting interface for DataFrames, it would be cool if we could get this to work on top of StatisticalModel. Packages could implement:

fit(::Type{MyModelType}, X::AbstractMatrix, y::AbstractVector, args...)

and DataFrames could implement:

function fit{T<:StatisticalModel}(::Type{T}, f::Formula, df::DataFrame, args...)
    mf = ModelFrame(f, df)
    DFStatisticalModel(mf, fit(T, ModelMatrix(mf).m, model_response(mf), args...))
end

or similar. DFStatisticalModel could provide a wrapper that maps between coefficients and their labels when calling coef, predict, etc. Of course, doing this right requires that we have a reasonable StatisticalModel interface (#4) so that we can make the relevant functionality accessible for DataFrames.

There are some cases where Matrix{Float64} is too specific an abstraction.

This sounds a lot like the discussion we had in JuliaLinearAlgebra/IterativeSolvers.jl#2 a little while ago.

@simonster GLM can use a sparse model matrix, but I think you'll have to define your own subtype of LinPred.

It would be great if, as part of the roadmap, we could also plan to put some large datasets in place, so that the community can work on optimizing performance and designing APIs accordingly. Having RDatasets is so useful, and something that makes large public datasets easily available for people to work with will greatly help this effort.

@ViralBShah Good point. Datasets are important. I think we already have an MNIST package; we can definitely have more.

Just that we need to be cautious about the licenses that come with the datasets.

There are surprisingly few large data sets that are publicly available. I'd guess that the easiest way to generate "large" data is to do n-grams on something like the 20 Newsgroup data set. Classifying one of the newsgroup against all the others is a simple enough binary classification problem that we can scale out to arbitrarily high size (in terms of features) by working with 2-grams, 3-grams, etc. Other useful examples might be processing the old Audioscrobbler data (http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) or something similar.

We also have CommonCrawl.jl. The point about the datasets is not as much to distribute them as julia packages, but to have easy APIs to access them, load them, and work with them. Often, I find that the pain of figuring out all the plumbing is enough to discourage people, and making the plumbing easy could get a lot more people to contribute.

Perhaps not too big, but there's also the Netflix and MovieLens datasets - which could be made easier to access.

The Netflix data set is illegal to distribute.

Question from an outsider - is there anything along the lines of Theano (from Python) in the works for Julia? Development of many deep learning models (RNNs) is sped up dramatically by AD-style software like Theano, and would allow the integration of deep learning into Julia much more easily...

There are several AD tools in the works. Check METADATA for a few. There are also some GPU code-gen tools as well, including OpenCL.jl. Eventually it should be possible to combine those two into something like Theano.

I would like to see something for association rule learning and neural networks

Other things needed are grid search for finding hyperparameters (if not already implemented), Naive Bayes (with +1 smoothing), and Restricted Boltzmann machines.

Just pinged @benhamner; saw that he listed another machine learning package, and I'm hopeful for collaboration on shared interfaces.

My feeling is that most of these packages would be better suited to a separate JuliaML group; that would support consistency. What is the disadvantage of having a dedicated group?

No disadvantage, although I think they will be tightly linked if common functionality is re-used as much as possible. But the number of incompatible ML packages is starting to worry me...

Agree, me too... would be a chance to unify things.

Lately, dimensionality reduction (MultivariateStats.jl), clustering (Clustering.jl), and nonnegative matrix factorization (NMF.jl) have reached a usable state.

However, there is one big area which is still in a messy state: Generalized Linear Models / Regression. There have been several packages along this line, implementing more or less similar functionality, but they do not work with each other.

I will open a new thread to discuss how we may proceed to unify the efforts in this domain.

@lindahua I'm not sure what the status is on this topic anymore, but I noticed this thread while comparing MLBase.jl and MachineLearning.jl as utility libraries for experimenting with Boltzmann.jl. I see @benhamner hasn't responded to the issue from 6 months ago, but would it make sense to integrate the two packages together? I'm particularly interested in some kind of pipeline api like this in one of the base libraries.

@Rory-Finnegan: Right now I feel like there's no one with the time to take control of this project and give it the direction it needs. If you feel like demoing something like that, it seems like a good idea to me.

@Rory-Finnegan just saw this - had missed the original ping (my Github notifications are spammed by internal Kaggle repos).

At this point, MachineLearning.jl has been a small playground I've touched here and there on the side as time permitted.

As @johnmyleswhite said, "right now I feel like there's no one with the time to take control of this project and give it the direction it needs." Definitely applies to me as well (at least for the visibility I have over the next 3 weeks, and likely longer). I've not looked closely at MLBase.jl yet, but need to. If you want to step up, go for it!

I'd love to have a look, but my julia skills are a bit below par ;) I'll see if I can find the time.

What can we do as part of a GSoC project this summer to make progress here? Can we pick a small set of things to target to go further from where we are? I suspect there are lots of potential contributors, but we need someone to build a bit more of a framework before others can jump in. Someone here mentoring a GSoC student could get quite a bit of work done.

Should we also think of wrapping existing R and python libraries, and get the APIs right to start with, and then piecemeal, replace the underlying implementations? This python ML document was trending on HN today:

https://docs.google.com/a/fourthlion.in/document/d/1YN6BVdReNAYc8B0fjQ84yzDflqmeEPj7S0Xc-9_26R0/preview?sle=true

So to keep the ball rolling on this topic I've created another repo with a README summarizing what I'm thinking I'd like in this base library. https://github.com/Rory-Finnegan/Learn.jl If you have time please take a look and post feedback. I'll admit that I'm coming from a sklearn background.

I'm new to Julia and starting to use some VLM and NN in my graduate research in mechanical engineering. I saw the reference to GSoC and am eager to potentially contribute to an ML or statistical learning project this summer. Anyone know if there will be a Julia project along these lines for GSoC?

I suspect there's not going to be an ML GSoC project since there's no one who's got time to mentor a student.

@Rory-Finnegan: I like your proposal a lot. I think the best thing you bring up is that we can use "interfaces" to solve the biggest blocker we came across earlier: the lack of a coherent hierarchy that we could place most models into.

Sorry for being inactive for months.

Since I started as an assistant professor last September, I feel that my life has completely changed. I am now leading a group of PhD students and find it difficult to spare time to write code myself. When I talked to a faculty member at Univ. of Toronto, I was told that "once you become a faculty member, the fun life of coding is over" -- this is true, but a bit sad.

Another problem is that with the rise of the entire deep learning business, many classical machine learning methods, like many we are discussing here, are quickly becoming irrelevant for the ML community. We should seriously reconsider the way forward. I believe that logistic regression, linear regression, SVM classifiers, etc. should no longer be considered standalone procedures; instead, they should be treated as building blocks to construct more sophisticated systems. Therefore, the interactions between them should be taken seriously.

I see that Julia is no longer listed in the accepted organizations on GSoC (or maybe it was never listed this year and I was looking at last year previously). I may still try to get involved with GSoC, however, I'm interested in contributing to ML in Julia either way. My background is not CS and I have primarily just coded in Matlab for engineering projects. That said, I am working on school projects right now for an algorithms class and may try to implement some of the project work in Julia to get a feel for it.

In my masters research I plan to use ML techniques to help perform condition monitoring on either wind turbines or hydroelectric turbines (depends on funding). I'm working with Matlab now as I learn the statistics behind SVM and NN, however, I would like to use open source software for my research. I will be spending at least part of the summer developing my coding skills and Julia seems like a good fit with opportunities for contributing at a fundamental level while deepening my understanding of ML.

I realize I am an outsider and haven't had enough time to wet my feet in this community. That said, I am on spring break now and plan to start using Julia for a dynamic programming project due in a few weeks. If anyone out there has input on how I may be able to help this summer (preferably in ML/statistics and taking into account my skill set) please let me know. I was a senior controls engineer with 12 years experience before I decided to come back to school... so I don't necessarily need mentoring, just a point (or kick or shove) in the right direction and honest feedback.

We will most likely have other ways to have the equivalent of GSoC funding this summer. We are working on this, and while nothing is firm, we will announce once ready.

Cc: @alanedelman

@swgregg IMO, a practical way to start is with a GitHub repo for the project you are working on, and to file issues against relevant packages as you run into missing functionality, roadblocks, or design issues.

Thank you for the advice. I will do just that.

@lindahua So I definitely agree that part of the goal should be to standardize the interactions between models, since more and more people are stacking/combining different techniques (like using a dimensionality reduction alg in front of a classifier). The simplest way of dealing with this seems to be a model container type like Pipelines, in which the container is responsible for validating the interactions between models. If you have a better idea please let me know.

I'm not sure I agree that classic machine learning techniques are becoming irrelevant or that they shouldn't be used as standalone procedures, especially since not everyone who uses these ML techniques is necessarily part of the ML community. Sometimes folks might just want to use a simple regression or SVM for their problems. Either way, I don't see any reason we can't support both approaches. :)

I'm still pretty new to deep learning so I could be missing an important step, but for deep belief networks, deep neural networks, etc., couldn't we also just express them as a Pipeline? Specifically, you'd build a deep learning architecture by subtyping Pipeline, adding in whatever model units you want, and then applying a fine-tuning algorithm like backprop. We could probably also organize it so that Pipeline is a LearningModel, allowing for recursive nesting of Pipelines (once again, this would be made easier and cleaner with some kind of traits or interfaces system). Unfortunately, I'm not sure how this approach would relate to existing deep learning packages like Mocha.jl. Thoughts? FYI, I'm currently looking at deep learning architectures for my thesis, so I'll probably want to integrate my code with this package anyway.

If you're okay with that approach, I could probably find some time this week to flesh out the Pipeline stuff (or w/e we want to name the container type).
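If it helps the discussion, here is a rough sketch of what such a container could look like (LearningModel, fit!, transform, and predict are assumed names here, not an agreed interface; written in the Julia syntax of the time):

abstract LearningModel

type Pipeline <: LearningModel
    stages::Vector{LearningModel}   # e.g. [SomeDimReduction(...), SomeClassifier(...)]
end

# Fit each stage in turn, feeding each stage's transformed output forward,
# then fit the final stage on the fully transformed data.
function fit!(p::Pipeline, X, y)
    for stage in p.stages[1:end-1]
        fit!(stage, X, y)
        X = transform(stage, X)
    end
    fit!(p.stages[end], X, y)
    return p
end

function predict(p::Pipeline, X)
    for stage in p.stages[1:end-1]
        X = transform(stage, X)
    end
    return predict(p.stages[end], X)
end

Since Pipeline is itself a LearningModel, pipelines can nest recursively; whether gradients can be backpropagated through such a container is a separate question.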

@Rory-Finnegan Classical machine learning techniques are still very important and widely used in many situations. What I was saying is that the focus of machine learning applications has shifted gradually towards systems that integrate multiple components, and deep models are a case in point.

Hence, a sustainable strategy going forward should be to develop components such that they can interact nicely with each other through carefully crafted interfaces. As far as I can see, graphical models and neural networks are the two most popular frameworks that allow one to put together a number of different components in a way that is mathematically valid.

@Rory-Finnegan The combined models in neural nets are usually trained in their entirety using backpropagation. The scikit-learn pipeline does not make any assumptions about differentiability, and is not able to backpropagate gradients. So I would see that as the major difference.
I agree with your point about machine learning outside of deep learning. I am pretty sure it will not and should not go away anytime soon.

@amueller as someone starting to apply ML in the biomedical field I must point out the growing importance of inference in supervised learning (say, ensemble methods with decision trees for deep sequencing)... so I would disagree with the statement that ML outside of deep learning will or should go away, unless deep learning preserves full visibility and interpretability of feature importances. My anecdotal evidence suggests there is much excitement and democratisation of ML approaches beyond the core data-science crowd. I see this as an important strategic opportunity for Julia.

And with that in mind, I fully agree with @lindahua that it is extremely desirable to build in the ability to integrate multiple ML components from the outset, if nothing else to 'future proof' Julia's ML ecosystem.

My statement was missing a "not" whoops. I totally agree with you.

It does make sense to think through the APIs and composability. There are plenty of implementations in R and python that we can wrap to start with, eventually replacing them with Julia implementations. Has anyone used @svs14 's Orchestra, which seems to have wrappers for scikits-learn and caret?

https://github.com/svs14/Orchestra.jl

I have used Orchestra ;)

I have not touched this for 3 months, along with everything else open source, due to some legal ambiguities in my previous internship, but will work on it again from next week - so it's not dead.

For my purposes, I'm not concerned whether a machine learner is developed in a specific language, as long as I can compare + compose it together in a larger system I'm happy. If I truly need performance then I can gracefully degrade (in terms of effort of creating something from scratch compared to using a library) to write the respective learner in Julia.

Speaking along these lines, if it's not already considered, I would argue that pre-processing transformers such as near zero variance filtering found in caret can be pretty handy if not equally important to the learner in use in terms of a composable API. It's also easier this way if you enjoy applying grid-search to the pipeline itself, including omission of pre-processors.

Hope this adds to the discussion.
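To make the pre-processing point concrete, a near-zero-variance filter is only a few lines if it takes the same matrix-in/matrix-out shape as the learners (a sketch with made-up names; var lives in Base in the Julia versions of the time):

# Keep only the columns of X whose variance exceeds a small threshold, and
# return the kept column indices so the same filter can be applied to test data.
function near_zero_variance_filter(X::Matrix{Float64}, threshold::Float64 = 1e-8)
    keep = Int[]
    for j in 1:size(X, 2)
        var(X[:, j]) > threshold && push!(keep, j)
    end
    return X[:, keep], keep
end

X = hcat(randn(100, 2), ones(100), randn(100, 2))   # column 3 is constant
Xf, keep = near_zero_variance_filter(X)             # keep == [1, 2, 4, 5]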

Would it make sense to separate out Orchestra's wrappers around Python and R libraries into separate packages? I suspect they may receive more attention and can become common components in other projects.

+1 I completely missed Orchestra when I was scanning through existing packages :( On the same line of thinking would it also make sense to see if existing julia packages (like DecisionTrees.jl, SVM.jl, etc) could be updated to support Orchestra's API? @svs14 I'd be happy to help you work on this if you're interested.

In fact it was @aviks who showed me this package a few days back.

As a heads up, extracting Orchestra's scikit-learn + caret wrappers to packages that cover the full spectrum of each may require a design overhaul - right now the wrappers only target classifiers. Also the solutions I used were very specific to constraints I had at the time, I suspect there are better ways to wrap both libraries now (especially caret, I went through PyCall.jl to rpy2 as there was no direct functioning route back then). Orchestra's wrappers may be beneficial as some inspiration, to a wrapper library built from the ground up.

Orchestra's API is very unstable and immediately suited for only a limited sub-domain of machine learning. For instance I'm currently investigating/developing on handling the spectrum of learning settings including semi-supervised, transfer and multi-task learning which will not be backwards-compatible. As such, it'd probably be best not to have other packages depend on it.

I think there are a number of well thought out machine learning APIs targeting different priorities as evidenced in the excellent discussions within the Julia community, and through Python's scikit-learn, R's caret, Java's weka, and Go's golearn (I'd love to have wrapped all of these as it gives me their learners for free lol!). As long as the API is unstable, it'd probably be best for the API designer to consider inversion of control and build wrappers for each package, instead of having each package developer responsible for adhering to it. IMO, this makes it a lot easier to change the API at will without buy-in from dependents, along with not placing any unstable standards' effort on package developers.

@Rory-Finnegan, it would be great to work with you! I really like your ideas in Learn.jl and this discussion, except I don't have your email/contact details; you can ping me at svs14.41svs@gmail.com if you want.

Looking at MathProgBase.jl's design might be of interest too - it's essentially a general interface for constrained optimization problems, with 11 registered packages using it and other experimental packages also using it as a way to plug in to the infrastructure. It's composable too: i.e. I can make a pseudo-solver that takes inputs via the interface, and then solves a series of subproblems through the interface.

+1 on consideration for inference aside from deep learning - important for the growing computational political/social science field.

I do not think that there is a big problem in designing a comprehensive ML framework. It is mainly a lack of commitment from the Julia ML community, including myself, in pushing forward with an initial design. I believe an ML framework would evolve along with the language, and whether it looks like Learn.jl or MathProgBase.jl or scikit-learn is really circumstantial. If we are going to wait for an interfaces or traits implementation in Julia, which would certainly enforce a particular standard on ML packages, it will only postpone development of a general ML framework.
I believe that providing a common thin type hierarchy, in the manner of Learn.jl or StatsBase.jl, is enough to start development of various libraries implementing particular ML algorithms (even multiple implementations). After all, a correct implementation of an ML algorithm is a pure scientific endeavor.
And there should be some packages with engineering thought behind them, like MathProgBase or Orchestra, which would provide wrapping and data pipelining (including utilities and supporting functionality) for implemented ML algorithms without any particular preference. I value such packages more than any state-of-the-art learning algorithm, because they provide more benefit to a larger community.
Let's push for an initial draft of a common ML interface that everybody will start to adopt. I like the Learn.jl interface as an umbrella interface. It is based on a well-known separation of ML algorithms, which could be gradually extended in particular implementations, and in turn integrated into the umbrella interface if necessary.

I don't know that this will necessarily help, but there is the possibility of a position at MIT to push forward on this, if someone were interested in taking it up full time. There are also some funds at NumFocus for Julia development, and this would qualify - but that would be for a much shorter duration. Perhaps someone who is focusing on this exclusively can be an anchor around which everyone can contribute.

I feel like JuliaOpt has taken this approach of nailing down the APIs and building a flexible composable infrastructure. Of course, we had @mlubin and @IainNZ who anchored that work and many others joined. We need the same here.

@ViralBShah I agree. A full time position on this at MIT seems ideal given the importance of the field... among many pluses, this should also ensure face-to-face interactions with Julia core developers when low level changes would benefit other computationally demanding fields (e.g. BioJulia @dcjones)

As a PhD student on statistics and machine learning, I'll keep a close eye on this issue, and am willing to contribute to it.

By the way, I think what we need is a Grammar of Machine Learning.

I look forward to someone being able to work full time on this, since I can only spare a few hours a week. In the mean time, I'm working with @svs14 on Orchestra.jl and maybe merging the common structure into Learn.jl. After we have the ensemble stuff refactored and working, I'll talk to the Mocha.jl folks about how deep learning should work as they seem to have a pretty popular approach.

Would this project include frequentist and bayesian inferential models as part of the hierarchy? Perhaps @dmbates , @Scidom and maybe @fonnesbeck can chime in.

@datnamer I'm inclined to suggest that frequentist and bayesian inferential models might make more sense as part of the StatsBase.jl package. I may include some wrapping functionality so that arbitrary models could be used as well, so long as they support the appropriate methods.

Hi @DatName, sorry for the slow reply, have been abroad over the last two weeks. Not sure where we should get with this in the long run - there is already a placeholder in PGM.jl. I think for now it is better to let the inferential modelling frameworks mature at their own independent pace given that they are at an infancy level. We can discuss merging efforts in the future, on the basis of a broader and richer codebase.

@Rory-Finnegan and @Scidom - Makes sense

If you want to design a common interface, I think it is important to define the scope. It probably makes sense to leave graphical models, structured prediction, and probabilistic programming out of scope.
But there are many other cases apart from classification and regression.
I'm not sure it makes sense to try to define a very strict interface, and starting with the actual algorithms as @wildart proposed might be more fruitful.

Just some API cases to consider:

  • dataframes vs matrices as inputs
  • multi-label, multi-output and multi-task prediction
  • algorithm evaluation in grid-search, interfaces for metrics (allow out-of-bag scoring, allow ranking metrics for both classification and regression, ...)
  • semisupervised learning (how do you identify unlabeled points, how does cross-validation handle these)
  • missing value handling
  • categorical variable handling (hard with matrices, easier with dataframes)
  • recommendation system interfaces (the interface is quite different from classification / regression)
  • online and active learning interfaces
  • reinforcement learning (in scope?)
  • building pipelines (what are the interfaces here? Can we subsample data?)
  • Regularization path algorithms and their interaction with cross-validation and grid-searches
  • Last but not least: pipelines for online learning

This is part of the laundry list of API choices as well as unsolved / punted issues in scikit-learn ;)

I think @ViralBShah's idea is important. Technically, there can be many approaches to make this successful. The real problem is that we lack a person who can dedicate themselves to this and drive the progress for long enough.

I would like to contribute too but the sheer number of packages scares me. It would be really nice if there was an experienced mentor willing to lead and guide the effort.

@amueller I agree that it would help to define the scope of the API, before starting and that it would help to work with particular implementations in mind. However, I don't think all of those points need to be addressed immediately.

Also, this thread is getting a little long now so I've opened a chat on Learn.jl for folks who are interested.

Yeah I agree that not all points need to be addressed immediately and maybe for the moment it is more important to actually get something going.

Shouldn't Reinforcement Learning be part of the ML roadmap? (Sorry if I'm missing something obvious, Julia newbie here.) Do game-playing algorithms (like those based on MCTS) fit into "machine learning"?

I know a research team applying RL to game playing who'd appreciate a solid Julia RL library that can deal with large datasets (with SMDPs, MAXQ, etc. and some other bits and pieces like a distributed MCTS implementation), and am thinking of building something for them (and learning Julia into the bargain), but if someone's already working on it, perhaps as part of this roadmap, then I can probably fork/contribute to that rather than start from scratch.

Hi all,

I'm interested in contributing to achieving this roadmap as part of JSoC 2015. I was wondering if anybody is willing to mentor me on this project. The deadline is June 1st. That's too soon, and a quick response would be great. If anybody has the time and is willing to do so, please contact me (rinuboney@gmail.com) asap. I know Julia and machine learning. I can do this.

@rinuboney It would help to put together a concrete proposal. There is a lot of discussion here, and it would be great to take a chunk of this as a JSOC project. If you can put something together based on the discussion here - what you will work on in the next 3 months, it will be easier to find a mentor.

Ok. Please mail it to juliasoc@googlegroups.com when it is ready.

@rinuboney Thanks for your interest.

Your proposal is currently quite vague. It would be better if you could identify specific examples of ML techniques that you would want to run and show that the code is duplicated or redundant or has a complicated dependency stack. Let's say I'm interested in random forests - are there multiple current implementations? are the implementations too hard to use? or maybe not general purpose?

@jiahao Thanks for the feedback.

There are multiple implementations for various ML models. Eg : RandomForests - https://github.com/bicycle1885/RandomForests.jl, https://github.com/bensadeghi/DecisionTree.jl, GLM - GLMNet, GLM, Regression etc. I believe that it's a good thing to have multiple implementations but the problem arises when one wants to try out different models. Machine learning is about experimentation. When faced with a classification or regression or clustering problem, the user should try out different models and use the model that gives the best performance. When the models are implemented in separate packages, they operate on different data types and have different API. So the code has to be rewritten to try out each model. This is where a base library can help by providing interoperability between different implementations through an API. With an API, the user can switch between models and algorithms instantaneously. The same problems are present when a user wishes to stack models from different packages.
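The kind of usage this is after is roughly the following (a purely hypothetical sketch; none of these model types exist under these names, and the fit/predict verbs are just the proposed convention):

# If every package implemented fit(::Type{T}, X, y) and predict(model, X),
# trying several models would be one loop instead of per-package rewrites.
for M in (RandomForestModel, SVMModel, LogisticRegressionModel)
    model = fit(M, Xtrain, ytrain)
    println(M, ": accuracy = ", mean(predict(model, Xtest) .== ytest))
end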

My proposal is more about coordination between the different machine learning packages scattered around, so that they can be used easily in a scikit-learn fashion. I thought that this is what the whole discussion was about.

I understand I should improve my proposal and I'm on it 👍

@IainNZ yeah my proposal is directly based on Learn.jl and this discussion.

My (somewhat delayed) 2 cents. Rather than "unifying interfaces" etc., I'd rather have implementations of missing functionality, which (imo) adds more value. A comprehensive test suite for package X would also provide much better value, imo.

The ML ecosystem is quite young in Julia, and providing an API to rapidly evolving libraries is a bit premature. In any case, the above list is hardly a real 'roadmap', it is just a (very comprehensive) list of ML topics - a 'roadmap' needs to have a sequencing of tasks and at least rough completion dates to be meaningful.

What we need at this stage (again, just my 2 cents, feel free to ignore) are solid, tested, scalable libraries with compelling use cases that will get more people actively using Julia on real-world ML projects. Interfaces can be extracted once mature libraries are plentiful, and are really not very valuable till then. The arguments for interfaces/coordination etc. aren't very convincing (to me, YMMV).

All that said, if one really really really wants to work on interfaces, the first thing to do would be to build a compelling concrete case. Write an interface to a specific (and limited) set of libraries X, Y, Z... so we can do specific tasks A, B, C with 2 lines of code vs 20 (or whatever).

@RaviMohan Thanks for your feedback. I'm thinking more along this line: consider an API designed for the Julia ML packages. Then the various existing implementations in different packages could be unified. Existing libraries like scikit-learn, weka, etc. could also be wrapped in the API. Then the whole set of packages supporting the API could conceptually be used like a machine learning library such as scikit-learn. I'll try to list some advantages of this approach:

  • The packages supporting the API can be maintained separately.
  • It takes less time than developing a single solid, tested, and scalable ML library from scratch.
  • It provides a higher level of flexibility. Somebody can independently implement a new algorithm and comply with the API. Automatically, the new implementation is part of the Julia ML ecosystem which means it can be tested easily and added to existing projects without any hassle.
  • Mix and match implementations from different packages.

Then once the API is designed, I think the community should focus on solid, tested, and scalable implementations in Julia. I believe this part can be done faster in a decentralized manner unified by the API, as opposed to a centralized single-library approach. If the community takes the API road, then in the end the community will have a plethora of packages accessible through a unified API. If it's the solid, tested, and scalable library road, then the community will have a good ML library just like all the other languages. I don't believe in a single perfect library for any purpose. Trade-offs have to be made in all cases, and a unified API with separate packages allows you to change the trade-offs immediately.

This whole idea of designing APIs first and then the implementations/wrappers rarely works in practice in ML. (I'm cynical from decades of experience navigating APIs designed before any practical experience with actual library implementation.)

(imo) People who don't do the implementation don't (generally) design good APIs, and the above argument is too theoretical and lacking a real-world perspective. Nobody, in any ecosystem, has ever come up with one API that could wrap totally different libraries like scikit and weka and have it be useful in the real world. If you pull it off, you'll be a pioneer.

That said, don't let me discourage you. If you think you can design an API for wildly different packages and/or packages not yet made, go for it. I'm skeptical about success, but don't let that affect your enthusiasm. Follow your vision.

Although the implementations are wildly different, they have the same functions and can be used with the same API. Not exactly the same, but: in Clojure (the programming language) there is a library called core.matrix. It is an API for working with matrices. Different implementations in native code, Java, and Clojure support the API. Switching implementations is trivial. I think it's possible to do something similar for ML packages. I know it's not possible to wrap completely different libraries, but a good number of them can be. E.g. scikit-learn and Go Learn have a similar API.

"Not exactly the same"? core.matrix has nothing to do with a generic ML API. Manylanguages have interfaces of some kind or the other.

As you well know, core.matrix is a completely different beast from an ML API. You can design wrappers for data structures, with different levels of abstraction and tradeoffs. We have known this since the 70s!! By this logic, is every Java interface in existence evidence that a generic ML API can be designed?

If you have a real-world example of common APIs for massively different ML libraries (scikit and weka, as per the OP), I'm all ears. Else this is a case of "should work in theory but never has in practice" (imo).

I had a very talented friend try very hard for years to write a generic wrapper just for Reinforcement Learning libraries and in the end it was an excessively generic mess no one wanted to use. However that doesn't mean someone else might not succeed tomorrow. As I said above, if you think you can do it, go for it, and more power to you for trying.

To repeat, if someone thinks wrappers can be built that unify real-world ML packages and "switch implementations trivially", that's great.

I am very skeptical about this actually working, having worked on real world ML projects for years, but that shouldn't affect anyone's enthusiasm for the idea. I just don't think it is a workable idea is all. I'll be glad to be proved wrong.

I hope you succeed. Cheers.

Well I'm a student and I'm still learning what's possible and not possible. I just happen to like the idea and I'm willing to work on it.

Good for you.

We do need people to attempt the "impossible". That is how the world moves forward. You don't need anyone's approval to do what you want to do.

Go for it. Good Luck.

Check out caret and mlr, "two attempts to create a unified framework across all types of algorithms for the various steps of machine learning in R (pre-processing data, training, testing, hyper-parameter optimization, etc.)." Sounds similar to @rinuboney 's stated goals.

https://github.com/topepo/caret
https://github.com/berndbischl/mlr

I don't believe in a single perfect library for any purpose. Trade-offs have to be made in all cases, and a unified API with separate packages allows you to change the trade-offs immediately.

For scikit-learn, many of the trade-offs are in terms of API.

On the other hand, a unified API is what brings people to scikit-learn, even before we had as many [and as fast] algorithms as we have now.
One of the reasons people convert from R to Python is that scikit-learn provides a unified and simple API.

They are similar; I hadn't noticed them. I'll look into them in detail. @datnamer Thank you for pointing them out.

@amueller I'm a scikit-learn user and I really like the API. It makes ML really simple for beginners. I hope to make Julia ML packages accessible through a similar API.

I'd be pretty damn happy with a Julia caret

It would be awesome if I get a chance to work on it as part of JSoC. Please do have a look at my proposal. Any feedback would help me refine the ideas.