Feature selection is one of the toughest parts of machine learning -- we're often working in extremely high-dimensional space (as with text data), where it's very difficult to identify and extract the features that will be most informative to a predictive model.
Grid search techniques can help us find the best combination of features, but in a software engineering context we also need a robust, repeatable means of integrating normalization, transformation, vectorization, feature union, and modeling into a single process -- thankfully, the Scikit-Learn Pipeline object gives us just that!
The accompanying code comes straight from the Scikit-Learn documentation, and gives a good procedural illustration of how pipelines can be used for feature extraction in a text classification problem.
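For reference, the heart of that script looks something like the condensed sketch below -- the step names mirror the documentation example, but the parameter grid here is illustrative rather than the full grid from the script:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(
    subset="train", categories=["alt.atheism", "talk.religion.misc"]
)

# Chain vectorization, tf-idf weighting, and classification into one estimator.
pipeline = Pipeline([
    ("vect", CountVectorizer()),    # tokenize and count terms
    ("tfidf", TfidfTransformer()),  # re-weight raw counts
    ("clf", SGDClassifier()),       # linear classifier
])

# Grid parameters are addressed as "<step>__<param>", so the search can
# reach inside every stage of the pipeline at once.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-4, 1e-5),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)
```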
But... what if we wanted to put something like this into production?
First, think about what would be necessary to support a real-world news classification application. For instance, how would real news data likely flow in and out of a live application -- what is the input likely to be? What should the output look like? What are the major objects that will exist within the application, and what is their behavior? Sketch out a rough design of an architecture based on your ideas (feel free to use a whiteboard, paper, Google Draw, or whatever you prefer).
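To make the question concrete, here is one *hypothetical* way the application's boundary objects might look -- purely a starting point, not a prescribed design, and every name here is an assumption:

```python
from dataclasses import dataclass

@dataclass
class NewsDocument:
    """Raw input: a news story as it might arrive from a feed or API."""
    title: str
    body: str
    source: str

@dataclass
class Classification:
    """Output: an interpretable result, not just a bare label."""
    label: str
    confidence: float
```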
Once you have some ideas about how to productionize the code, take a crack at refactoring the grid_search_text_feature_extraction.py script using an object-oriented approach (classes, methods, encapsulation, control routines) that leverages Pipelines to do normalization, vectorization, and classification, and outputs an interpretable result.
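As a minimal sketch of the shape such a refactor might take -- class and method names here are hypothetical, not from the original script, and the parameter grid is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class NewsClassifier:
    """Encapsulates normalization, vectorization, and classification."""

    def __init__(self, param_grid=None):
        # TfidfVectorizer handles lowercasing (normalization) and vectorization
        # in one step; SGDClassifier handles classification.
        self.pipeline = Pipeline([
            ("vect", TfidfVectorizer(lowercase=True, stop_words="english")),
            ("clf", SGDClassifier()),
        ])
        self.param_grid = param_grid or {"vect__ngram_range": [(1, 1), (1, 2)]}
        self.model = None

    def fit(self, documents, labels):
        """Grid-search the pipeline and keep the best estimator."""
        search = GridSearchCV(self.pipeline, self.param_grid, n_jobs=-1)
        search.fit(documents, labels)
        self.model = search.best_estimator_
        return self

    def classify(self, document):
        """Return a label for a single document."""
        if self.model is None:
            raise RuntimeError("Call fit() before classify().")
        return self.model.predict([document])[0]
```

The control routine stays thin: construct a NewsClassifier, fit it on labeled documents, and call classify() on new text -- the grid search and pipeline wiring are hidden behind the class boundary.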
Send us your updated code and a snapshot of your architectural design so we can see what you came up with!