This is a new project to practise and learn how to use the Targets package to automate pipelines in R. I will start by testing it with text classification models and later with univariate time series models.
It allows us to practise several skills:
- Function building
- Model creation
- Pipeline management
- Model fitting and upgrading
At a later stage in development I will combine the output of this pipeline example with a Shiny dashboard
- Text classification model using elastic net and XGBoost models
- Time series model using ARIMA to predict 7 days AE Type I Attendances
The aim of this project is to use the Targets package to set up a small text classification model using the Tidymodels and Themis packages.
As a way to learn text classification models, I will follow this video from Andrew Couch:
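As a rough illustration of what a Targets pipeline looks like, here is a minimal sketch of a `_targets.R` file. The file path `data/dialogue.csv`, the column name `character`, and the helpers `read_dialogue()` and `count_by_character()` are hypothetical and are not taken from this project.

```r
# _targets.R: a minimal sketch, not this project's actual pipeline
library(targets)

# Hypothetical helper: read the raw dialogue data from a .csv file
read_dialogue <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Hypothetical helper: count the number of rows per character
count_by_character <- function(data) {
  as.data.frame(table(character = data$character))
}

# The script must end with a list of targets.
# Each target is rebuilt only when its code or its upstream targets change.
list(
  tar_target(raw_file, "data/dialogue.csv", format = "file"), # assumed path
  tar_target(dialogue, read_dialogue(raw_file)),
  tar_target(character_counts, count_by_character(dialogue))
)
```

With a file like this in the project root, `tar_make()` runs the pipeline and `tar_read(character_counts)` retrieves a stored result.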
- Andrew Couch TidyTuesday: Improving Model Train Times with TidyModels
- https://www.youtube.com/watch?v=MVQExXGooaM&t=187s
1. Load .csv data
2. Summarize dialogue by character
3. Address class imbalance
4. Create initial train test split
5. Build matrix to count number of words by character
   - 5.1 Apply naive tokenization to our TRAIN and TEST sets
6. PRE-PROCESSING
   - 6.1 Tokenization
   - 6.2 Remove stop-words
   - 6.3 Create a token filter where max_tokens is 75
   - 6.4 Then compute TF-IDF
7. MODEL FITTING
   - 7.1 Simple model: elastic net model
   - 7.2 Complex model: XGBoost model, a specific type of tree model
8. MODEL PARAMETERS TUNING
   - 8.1 Elastic net model parameters tuning
     - 8.1.1 Elastic net grid_regular() set of parameters chosen using the dials package
     - 8.1.2 Elastic net grid_max_entropy() set of parameters chosen using the dials package
   - 8.2 XGBoost model parameters tuning
     - 8.2.1 XGBoost grid_regular() set of parameters chosen using the dials package
     - 8.2.2 XGBoost grid_max_entropy() set of parameters chosen using the dials package
9. Compare data structures from both models using a regular grid
In the process I will explore new models and test many of the functions he uses to conduct his analysis.
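The pre-processing steps (tokenization, stop-word removal, token filter, TF-IDF) can be sketched with the textrecipes package roughly as follows. The toy data, the column names `character` and `dialogue`, and the example sentences are invented for illustration; only `max_tokens = 75` comes from the steps listed above.

```r
library(recipes)
library(textrecipes)

# Toy data with assumed column names; the sentences are made up
toy <- data.frame(
  character = factor(c("Harry", "Ron", "Harry", "Hermione")),
  dialogue = c(
    "the owl delivered a letter this morning",
    "we should practise the spell again tomorrow",
    "the letter arrived before breakfast",
    "read the book before the lesson starts"
  ),
  stringsAsFactors = FALSE
)

text_rec <- recipe(character ~ dialogue, data = toy)
text_rec <- step_tokenize(text_rec, dialogue)                      # 6.1 tokenization
text_rec <- step_stopwords(text_rec, dialogue)                     # 6.2 remove stop-words
text_rec <- step_tokenfilter(text_rec, dialogue, max_tokens = 75)  # 6.3 keep top tokens
text_rec <- step_tfidf(text_rec, dialogue)                         # 6.4 TF-IDF weights

# Estimate the recipe on the toy data and return the processed training set
baked <- bake(prep(text_rec), new_data = NULL)
names(baked)
```

Each retained token becomes one numeric column, so the result can be passed directly to a model that expects a numeric matrix.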
I will use some text examples from the Harry Potter books to conduct the analysis, based on several variables from its characters and their text quotes in the books. The Tidymodels package is the framework chosen to build a small classification model based on this data. https://www.kaggle.com/datasets/balabaskar/harry-potter-books-corpora-part-1-7
Classification models predict a class label for a given observation.
graph TD;
Original_text_set-->Train;
Original_text_set-->Test;
Train-->Apply_classification_models;
Test-->Get_model_running_times;
Train --> Choose_model_based_shortest_time;
Choose_model_based_shortest_time --> Plot_models_running_times;
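The first split in the diagram (original text set into train and test) could be done with rsample, which is part of Tidymodels. This is only a sketch: the toy data, the column names, and the 75/25 proportion are assumptions, not values taken from this project.

```r
library(rsample)

# Toy data standing in for the real text set (column names are assumed)
toy <- data.frame(
  character = rep(c("Harry", "Ron", "Hermione", "Dumbledore"), each = 25),
  dialogue = paste("line", 1:100),
  stringsAsFactors = FALSE
)

set.seed(123)
# Stratify on the outcome so each character keeps a similar share in both sets
text_split <- initial_split(toy, prop = 0.75, strata = character)
train <- training(text_split)
test <- testing(text_split)

nrow(train)
nrow(test)
```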
- A classification data set with skewed class proportions is called imbalanced data.
- Within this data set we find two types of imbalanced data classes:
- Classes that make the largest proportion of the data are defined as majority classes
- Classes that make the smaller proportion are called minority classes
- Classification model using imbalanced data
- This is a small example on how to use Targets to create and run a classification predictive model where the distribution of examples across the classes is not equal.
We use the Themis package in R to deal with imbalanced data: https://github.com/tidymodels/themis
The Themis package builds on the Recipes package.
- With recipes, you can use dplyr-like pipeable sequences of feature engineering steps to get your data ready for modelling.
- We also use the textrecipes library, which contains extra steps for the recipes package for preprocessing text data. https://github.com/tidymodels/textrecipes
- The easiest way to get recipes is to install all of the tidymodels packages: https://www.tidymodels.org/packages/
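To make the idea of majority and minority classes concrete, here is a small sketch that rebalances a toy data set with Themis. Upsampling with `step_upsample()` is just one of the options the package offers and is assumed here for illustration; the data and column names are invented.

```r
library(recipes)
library(themis)

# Toy imbalanced data: 8 majority rows and 2 minority rows
toy <- data.frame(
  class = factor(c(rep("majority", 8), rep("minority", 2))),
  x = 1:10
)
table(toy$class)

# over_ratio = 1 resamples the minority class up to the majority class size
rec <- recipe(class ~ x, data = toy)
rec <- step_upsample(rec, class, over_ratio = 1)

set.seed(123)
balanced <- bake(prep(rec), new_data = NULL)
table(balanced$class)
```

Sampling steps like this one are applied to the training data only, so new data passed to `bake()` is left unchanged.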
The aim of this small project was to practise how to tune different models using the Tidymodels framework. At the end we obtain the training time for each model we have used.
This plot can be useful in future models to quickly assess model performance.
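The two kinds of tuning grid mentioned in the steps above can be sketched with dials as follows. The parameters `penalty()` and `mixture()` are the usual elastic net tuning parameters, while the grid sizes (`levels = 5`, `size = 10`) are assumptions. Recent versions of dials deprecate `grid_max_entropy()` in favour of `grid_space_filling()`, so a warning may appear.

```r
library(dials)

# Regular grid: every combination of 5 levels per parameter (5 x 5 rows)
elastic_regular <- grid_regular(penalty(), mixture(), levels = 5)

# Space-filling grid: a fixed budget of points spread over the parameter space
set.seed(123)
elastic_entropy <- grid_max_entropy(penalty(), mixture(), size = 10)

nrow(elastic_regular)
head(elastic_entropy)
```

A regular grid grows quickly with the number of parameters, which is one reason training times differ so much between grid types.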
This second example is a Time Series model using ARIMA to forecast 7 days AE Type I Attendances in England. I will create the model and use Targets to build a pipeline to run it.
- TS 01. Start by importing the data and turning it into a TS object
Using TSstudio library https://ramikrispin.github.io/TSstudio/
library(TSstudio)
ts_plot(AE_major_ts)
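Turning a numeric vector into a TS object only needs base R. The sketch below uses synthetic values, not the real attendance figures, and assumes a monthly series starting in January 2010; the name `demo_ts` is hypothetical.

```r
# Synthetic monthly values standing in for the imported data
set.seed(42)
t_idx <- 1:120
values <- 100 + 0.5 * t_idx + 10 * sin(2 * pi * t_idx / 12) + rnorm(120, sd = 2)

# frequency = 12 marks the series as monthly; start = c(year, month)
demo_ts <- ts(values, start = c(2010, 1), frequency = 12)

start(demo_ts)
end(demo_ts)
frequency(demo_ts)
```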
- TS 02. Check for trend, seasonality and other regular patterns
- Decompose the original TS into its observed, trend, seasonal and random components, using both additive and multiplicative decomposition
- TS 03. Seasonal plot to check for Seasonality by Frequency Unit and Cycle
- TS 04. Heatmap plot describing AE Type I Attendances variation across months for 2010-2019 period
- TS 05. Correlation analysis
- This section will also use the "AE_major_ts" data set to explore the ACF and PACF plots to assess seasonal lags in the TS data
- We can plot the lags that show a strong correlation to confirm what the ACF and PACF plots suggest about the strong seasonality at lag 12
By looking at which lag plots show a linear relationship with previous lags, we can assess how many different trend and seasonality patterns are present in our AE Type I Attendances data.
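The decomposition and correlation checks described above can also be reproduced with base R functions. This sketch runs on a synthetic monthly series with a built-in yearly cycle rather than on the real data, and the object names are hypothetical.

```r
# Synthetic monthly series with a trend and a 12-month seasonal cycle
set.seed(42)
t_idx <- 1:120
demo_ts <- ts(
  100 + 0.5 * t_idx + 10 * sin(2 * pi * t_idx / 12) + rnorm(120, sd = 2),
  start = c(2010, 1), frequency = 12
)

# Split the series into trend, seasonal and random components
add_parts <- decompose(demo_ts, type = "additive")
mult_parts <- decompose(demo_ts, type = "multiplicative")

# Autocorrelation and partial autocorrelation up to 24 months
acf_vals <- acf(demo_ts, lag.max = 24, plot = FALSE)
pacf_vals <- pacf(demo_ts, lag.max = 24, plot = FALSE)

# acf() reports lag 0 first, so the 12-month lag is element 13
acf_lag12 <- acf_vals$acf[13]
acf_lag12
```

Calling `plot(add_parts)` draws the four panels (observed, trend, seasonal, random) in one figure.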
ARIMA model using auto.arima() function after initial test train split of AE Attendances Type I data
We use the ts_split() function to obtain our train and test sets
AE_major_ts_split <- ts_split( ts.obj = AE_major_ts, sample.out = 12)
train <- AE_major_ts_split$train
test <- AE_major_ts_split$test
We then forecast 12 months into the future using ARIMA. The outcome is an interactive Plotly chart, so it is worth running the code below and interacting with the chart it produces.
library(forecast)
md <- auto.arima(train)
fc <- forecast(md, h = 12)
test_forecast(actual = AE_major_ts, forecast.obj = fc, test = test)
This is an interactive Plotly chart that can be used to explore RMSE and MAPE accuracy measures as tooltips on top of the standard actual and forecasted values produced by the ARIMA model.
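For reference, the two accuracy measures shown in the tooltips can be computed by hand. The helpers `rmse()` and `mape()` below are hypothetical illustrations written in base R, and the numbers are toy values; the forecast package also reports both measures through its `accuracy()` function.

```r
# Root mean squared error: penalises large errors more heavily
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute percentage error, expressed in percent
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Toy values, not real attendance figures
actual <- c(100, 200)
predicted <- c(110, 180)

rmse(actual, predicted)
mape(actual, predicted)
```

RMSE is in the same units as the series, while MAPE is scale-free, which makes it easier to compare across series of different sizes.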