R package to power up data science analysis, based on techniques learned in the FGV MBA course.

Primary language: R. License: Apache License 2.0 (Apache-2.0).

FGVR

"Don't panic!" -- Douglas Adams, "The Hitchhiker's Guide to the Galaxy"

This package gathers a set of R functions that help FGV MBA students perform repetitive tasks during the following steps: data cleaning, data enhancement, data preparation... and more!

All functions and resources available in this package were inspired by the Business Analytics and Big Data classes, where the following professors shed some light into our minds:

Name (Discipline): Assignment Repository
Gustavo Mirapalheta (Exploratory Data Analysis): https://github.com/ldaniel/Exploratory-Data-Analysis
João Rafael Dias (Predictive Analytics): https://github.com/ldaniel/Predictive-Analytics
Eduardo Francisco (Spatial Statistics): https://github.com/ldaniel/Spatial-Statistics
Rafael Scopel (Time Series Analysis): https://github.com/ldaniel/Time-Series-Analysis
Rodrigo Togneri (Matrix Methods and Cluster Analysis): https://github.com/ldaniel/Matrix-Methods-Cluster-Analysis

Thank you all for that! 😄

Contributors

Special thanks to these awesome contributors: @Daniel, @Rodrigo and @Ygor, who devoted a lot of time and dedication to achieve such great work! 👊

Contributor (E-mail)
Daniel Campos (daniel.ferraz.campos@gmail.com)
Leandro Daniel (contato@leandrodaniel.com)
Rodrigo Goncalves (rodrigo.goncalves@me.com)
Ygor Lima (ygor_redesocial@hotmail.com)

Installation

To get the current development version from GitHub:

# install.packages("devtools")
devtools::install_github("ldaniel/fgvr")
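The installation step above can be wrapped so that devtools is installed only when it is missing. This is a minimal sketch: `install_fgvr` is a hypothetical helper name, not something the package provides, and nothing is installed until you call it.

```r
# Hypothetical helper (not part of fgvr): install devtools only if it is
# missing, then install the development version of fgvr from GitHub.
install_fgvr <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("ldaniel/fgvr")
}

# Check whether fgvr is already available; this does not install anything.
fgvr_installed <- requireNamespace("fgvr", quietly = TRUE)
if (!fgvr_installed) {
  message("fgvr is not installed; call install_fgvr() to install it.")
}
```

Calling `install_fgvr()` requires network access to GitHub.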

Running

The fgvr package has a set of handy functions.

createProjectFromTemplate

This function creates an initial R project setup focused on data science.

fgvr::createProjectFromTemplate("Predictive-Analytics", "c:/temp")

The following structure will be created:

[Project root directory]
|   README.md
|   __myproject__.Rproj
|
+---data
|   +---processed
|   |       bigtable.feather
|   |       readme.txt
|   |
|   \---raw
|           game-of-thrones-deaths-data.txt
|           readme.txt
|
+---docs
|       readme.txt
|
+---images
|       readme.txt
|
+---markdown
|       01_about_the_data.Rmd
|       02_data_preparation.Rmd
|       03_exploration_report.Rmd
|       conclusion.Rmd
|       index.Rmd
|       references.Rmd
|       _pdf.Rmd
|       _site.yml
|
+---models
|       readme.txt
|       source_train_test_dataset.rds
|
\---src
    +---datapreparation
    |       execute_data_preparation.R
    |       step_01_config_environment.R
    |       step_02_data_ingestion.R
    |       step_03_data_cleaning.R
    |       step_04_label_translation.R
    |       step_05_data_enhancement.R
    |       step_06_dataset_preparation.R
    |
    +---playground
    |       playground.R
    |
    \---util
            auxiliary_functions.R
            generate_markdown_website.R
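To make the layout above concrete, the directory part of such a structure can be reproduced with base R alone. This is only an illustrative sketch: `create_skeleton` is a hypothetical helper, it is not how `createProjectFromTemplate` is implemented, and it creates empty folders plus a placeholder README rather than the template files listed above.

```r
# Hypothetical helper (not fgvr's implementation): create the folder layout
# shown above using base R only. Template files are NOT copied.
create_skeleton <- function(project_name, path) {
  root <- file.path(path, project_name)
  subdirs <- c(
    "data/processed", "data/raw", "docs", "images", "markdown", "models",
    "src/datapreparation", "src/playground", "src/util"
  )
  for (d in file.path(root, subdirs)) {
    # recursive = TRUE also creates the parent folders (root, data, src).
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Placeholder README at the project root.
  writeLines(paste("#", project_name), file.path(root, "README.md"))
  invisible(root)
}

# Usage: build the skeleton in a temporary directory and list the folders.
root <- create_skeleton("Predictive-Analytics", tempdir())
print(list.dirs(root, full.names = FALSE, recursive = TRUE))
```

Using a temporary directory keeps the example from touching a real project path such as "c:/temp".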

createTestAndTrainSamples

This function creates train and test datasets from a given dataset and the name of its Y variable. It also returns the sample proportion for each dataset.

# using, as an example, the sample dataset loansdefaulters, which is included in the package
base <- fgvr::loansdefaulters

# example calling the function by passing all parameters:
#   dataset    = the dataset you want to split into test and train samples.
#   yvar       = the Y variable in your dataset.
#   seed       = the seed number used to generate the train and test samples.
#                the default value is 12345.
#   percentage = the percentage of data that goes to training sample.
#                the default value is 0.7.
mydataset <- fgvr::createTestAndTrainSamples(dataset = base, yvar = "y_loan_defaulter", 
                                             seed = 12345, percentage = 0.7)

# or omit the 'seed' and 'percentage' parameters to use the default values.
mydataset <- fgvr::createTestAndTrainSamples(dataset = base, yvar = "y_loan_defaulter")

# getting the final samples and proportion.
mydataset$data.train
mydataset$data.test
mydataset$event.proportion
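For readers curious about what a split of this kind involves, here is a minimal base R sketch of a seeded random train/test split. It is an assumption-based illustration, not the package's actual implementation: `split_train_test` is a hypothetical name, the returned element names simply mirror the ones shown above, and the proportion is computed here as the share of each Y value in each sample, which may differ from what fgvr reports.

```r
# Hypothetical helper (not fgvr's implementation): seeded random split.
split_train_test <- function(dataset, yvar, seed = 12345, percentage = 0.7) {
  stopifnot(is.data.frame(dataset), yvar %in% names(dataset),
            percentage > 0, percentage < 1)
  n_train <- floor(percentage * nrow(dataset))
  stopifnot(n_train >= 1)

  set.seed(seed)  # the same seed always yields the same split
  train_idx <- sample(seq_len(nrow(dataset)), size = n_train)
  data.train <- dataset[train_idx, , drop = FALSE]
  data.test  <- dataset[-train_idx, , drop = FALSE]

  # Share of each Y value in both samples, to check they are comparable.
  event.proportion <- list(
    train = prop.table(table(data.train[[yvar]])),
    test  = prop.table(table(data.test[[yvar]]))
  )
  list(data.train = data.train, data.test = data.test,
       event.proportion = event.proportion)
}

# Usage with a built-in dataset; 'am' (0/1) stands in for the Y variable.
res <- split_train_test(mtcars, yvar = "am")
nrow(res$data.train)  # 22 rows: floor(0.7 * 32)
nrow(res$data.test)   # 10 rows: the remainder
res$event.proportion
```

Fixing the seed makes the split reproducible, which is why the function above exposes it as a parameter with a default value.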