Datafest-2018

Mission Statement

We want to use unsupervised learning methods such as K-means clustering to see whether anything naturally groups by normTitleCategory. We first aggregate the data by jobId, since the same job can appear in multiple listings, so that every row is one unique job.

We initially drew a random sample of 500,000 observations. We then modified some variables and pre-processed the data.

Variable Assumptions

jobId We will use the most recent date of a job posting, since there are multiple job listings with the same jobId.

Word Count & Character length Outliers in each variable are assigned that variable's median value.
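The median replacement can be sketched as follows. The text does not say how outliers were identified, so the 1.5 * IQR rule, the function name replace_outliers_with_median, and the sample values are all assumptions made for illustration.

```r
# Replace values outside the 1.5 * IQR fences with the variable's median.
# The 1.5 * IQR rule is an assumed outlier definition, not the project's.
replace_outliers_with_median <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  is_outlier <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  x[is_outlier] <- median(x, na.rm = TRUE)  # median of the original values
  x
}

# Made-up word counts with one extreme value.
word_count <- c(120, 150, 180, 200, 210, 5000)
cleaned    <- replace_outliers_with_median(word_count)
```

The same function could be applied to the character length column.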

numReviews If a value is below the mean in numReviews, it is assigned to dummy variable: few. If above the mean, it is assigned to dummy variable: many. If the value is 0, then both dummy variables are 0.
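A minimal sketch of this encoding is below. The name isMany appears in the clustering variable list; isFew is a hypothetical name for its counterpart, and the review counts are made up. The sketch also assumes the mean is taken over all rows, zeros included, which the text does not specify.

```r
# Made-up review counts.
num_reviews <- c(0, 2, 5, 40, 300)

# Mean over all rows, zeros included (an assumption).
avg <- mean(num_reviews)

# Zero reviews leaves both dummies at 0, as described above.
isFew  <- as.integer(num_reviews > 0 & num_reviews < avg)
isMany <- as.integer(num_reviews > avg)
```

A value exactly equal to the mean would also get 0 in both columns under this sketch, since the description does not cover that case.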

Date: We filtered the data from October 1st, 2016 to September 30th, 2017. We changed October 2016 dates to -1 and November 2016 dates to 0. We will use the latest date of the job posting (max). We plan to assign certain weights when we cluster. We divided the year into 4 seasons (binary variables: 0 or 1):

For fall:

October
November
December

For winter: (iswinter)

January
February
March

For spring: (isSpring)

April
May
June

For summer: (isSummer)

July
August
September
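The season indicators could be derived from the posting date roughly as follows. The indicator names follow the ones given above; the dates are made up. Since no indicator name is given for fall, this sketch treats fall as the case where all three indicators are 0, which is an assumption.

```r
# Made-up posting dates, one per season.
dates <- as.Date(c("2016-10-15", "2017-01-20", "2017-04-03", "2017-08-30"))

# Month number (1-12) of each posting date.
month_num <- as.integer(format(dates, "%m"))

iswinter <- as.integer(month_num %in% 1:3)   # January - March
isSpring <- as.integer(month_num %in% 4:6)   # April - June
isSummer <- as.integer(month_num %in% 7:9)   # July - September
# Fall (October - December) is the row where all three indicators are 0.
```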

normSalary We normalized estimateSalary by dividing each salary by the per capita personal income of its state. We also removed rows with NA values in estimateSalary.
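As an illustration, the salary normalization might look like the sketch below. The column names stateProvince and perCapitaIncome, the lookup table state_income, and every number in it are hypothetical placeholders, not real income figures.

```r
# Made-up jobs; one has a missing salary.
jobs <- data.frame(
  stateProvince  = c("CA", "TX", "CA", "NY"),
  estimateSalary = c(60000, 45000, NA, 80000)
)

# Hypothetical per capita personal income figures, for illustration only.
state_income <- data.frame(
  stateProvince   = c("CA", "TX", "NY"),
  perCapitaIncome = c(60000, 50000, 64000)
)

# Drop rows with a missing salary, attach each state's income, then divide.
jobs <- jobs[!is.na(jobs$estimateSalary), ]
jobs <- merge(jobs, state_income, by = "stateProvince")
jobs$normSalary <- jobs$estimateSalary / jobs$perCapitaIncome
```

A normSalary of 1 then means the estimated salary equals the state's per capita income.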

Country We filtered out Canada and Germany, and kept the US data.

normTitleCategory For blank and uncategorized groups, we used dummy variables. We then grouped these categories into 5 broad categories:

Medicine
Technology
Business
Service
Blue-collar

jobAgeDays We will use the median value of jobAgeDays for the same job listings.

clicks We will sum the clicks from all listings of the same job. It is a highly right-skewed variable.
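The per-job rules described so far (latest date, median jobAgeDays, summed clicks) can be combined into a single aggregation step. The sketch below uses base R and made-up listings; it is one possible way to do this, not necessarily how the project did it.

```r
# Toy listings: several rows can share one jobId (values are made up).
listings <- data.frame(
  jobId      = c("a", "a", "b", "b", "b", "c"),
  date       = as.Date(c("2017-01-05", "2017-01-09", "2017-03-01",
                         "2017-03-02", "2017-03-04", "2017-06-15")),
  jobAgeDays = c(2, 6, 0, 1, 3, 10),
  clicks     = c(10, 25, 3, 4, 8, 40)
)

# Split the rows by jobId and collapse each group to a single row.
jobs <- do.call(rbind, lapply(split(listings, listings$jobId), function(g) {
  data.frame(
    jobId      = g$jobId[1],
    date       = max(g$date),           # most recent posting date
    jobAgeDays = median(g$jobAgeDays),  # median age across listings
    clicks     = sum(g$clicks)          # total clicks across listings
  )
}))
rownames(jobs) <- NULL
```

After this step every row of jobs corresponds to one jobId.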

clicksPerDay A new variable we created by dividing clicks.y by its corresponding day. We then transformed clicksPerDay with the fourth root to reduce its skew, since it was the most right-skewed variable we had.
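A small sketch of the transform follows. It assumes the divisor is jobAgeDays, which the text does not state, and uses made-up values; the name clicksPerDay_root4 is hypothetical. A job with zero days would need separate handling, since the division would produce an infinite or undefined value.

```r
# Made-up totals; jobAgeDays as the divisor is an assumption.
clicks     <- c(0, 16, 81, 2500)
jobAgeDays <- c(1, 1, 1, 4)

clicksPerDay <- clicks / jobAgeDays

# The fourth root compresses large values far more than small ones,
# which pulls in the long right tail.
clicksPerDay_root4 <- clicksPerDay^(1 / 4)
```

Unlike a log transform, the fourth root is defined at 0, so jobs with no clicks need no special treatment.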

Supervising & License We changed the NA values to 0.

Aggregating the Data

We need to normalize every numeric variable with the scale function, except for our binary variables. We normalized descriptionCharacterLength, descriptionWordCount, clicksPerDay, and normSalary.
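The scaling step might look like the sketch below, which standardizes only the listed numeric columns and leaves binary columns untouched. The data values are made up and only two of the four numeric columns are shown for brevity.

```r
# Made-up data: two numeric columns and one binary column.
jobs <- data.frame(
  descriptionWordCount = c(100, 200, 300, 400),
  normSalary           = c(0.8, 1.0, 1.2, 1.4),
  isMany               = c(0, 1, 0, 1)
)

# Columns to standardize; binary columns are deliberately left out.
numeric_cols <- c("descriptionWordCount", "normSalary")

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
jobs[numeric_cols] <- lapply(jobs[numeric_cols], function(x) as.numeric(scale(x)))
```

Standardizing matters for K-means because the algorithm relies on Euclidean distance, so a column with a large numeric range would otherwise dominate the clustering.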

R libraries used

dplyr
tidyverse
ggplot2

Machine Learning Methods

K-means Clustering -- The Elbow Method

Clustering

We clustered on characterlength, word count, isMany, normsalary, iswinter, isspring, issummer, and clicksPerDay. We used the elbow method to choose the number of clusters k, and it suggested k = 5.
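The elbow method can be sketched as follows: run K-means for a range of k, record the total within-cluster sum of squares, and look for the point where adding clusters stops helping much. The toy data below (three synthetic blobs) stands in for the real scaled features, and the range of k and the nstart value are arbitrary choices for illustration.

```r
set.seed(42)  # k-means starts from random centers

# Toy data: three well-separated 2-D blobs standing in for the scaled features.
features <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve should flatten around k = 3, the number of blobs. On real data the elbow is often less sharp, so the choice of k involves some judgment.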

Initial analysis:

We also computed relative proportions for each industry.

Cluster 1:

Characterlength and wordcount seemed to be the most important, with more service industry jobs than tech jobs. However, nothing stands out strongly.
Most common industries:
1. sales
2. retail
3. service
Least common industries:
1. techsoftware
2. meddental
3. meddr

Cluster 2:

Salary seemed to be the most important - high-skilled jobs
Most common industries:
1. accounting
2. architecture
3. engineering (engchem, engelectric, engid, engmech)
4. finance
5. install
6. meddr
7. project
8. techsoftware
Least common industries:
1. childcare
2. food
3. personal
4. retail
5. warehouse

Cluster 3:

Low character lengths and word counts
Most common industries:
1. veterinary
2. sanitation
3. driver
4. personal
5. science
6. warehouse
Least common industries:
1. finance
2. marketing
3. military
4. socialscience
5. transport

Cluster 4:

Not useful: the variables are close to average in this cluster.
Most common industries:
1. admin
2. agriculture
3. care
4. customer
5. hospitality
6. service
Least common industries:
1. childcare
2. engchem
3. engid
4. techinfo

Cluster 5:

Clicks seemed to be the most important
Most common industries:
1. accounting
2. childcare
3. customer
4. driver
5. service
6. techsoftware
Least common industries:
1. aviation
2. agriculture
3. engchem
4. mining
5. military

We computed the proportion of each category in normTitleCategory within each cluster and stored the results in a new dataframe to compare the clusters.
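One way to build such a comparison table is sketched below with made-up cluster assignments. The column name cluster and the object names counts, props, and props_df are hypothetical.

```r
# Made-up jobs with an assigned cluster and a title category.
jobs <- data.frame(
  cluster           = c(1, 1, 1, 2, 2, 2, 2),
  normTitleCategory = c("sales", "retail", "sales",
                        "finance", "techsoftware", "finance", "finance")
)

# Rows are categories, columns are clusters.
counts <- table(jobs$normTitleCategory, jobs$cluster)

# margin = 2 divides by column totals, so each cluster's column sums to 1.
props <- prop.table(counts, margin = 2)

# Plain data frame with one column per cluster, for side-by-side comparison.
props_df <- as.data.frame.matrix(props)
```

Reading across a row of props_df shows how a category's share differs between clusters.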

Clustering by Location

Only Cluster 2 showed a notable pattern by location; none of the other clusters showed anything significant.

Cluster 2:

High: North Carolina, Georgia, Arizona
California is slightly above average
Low: Wyoming, North Dakota, South Dakota, Arkansas, Vermont, Connecticut

Plotting

Important variables used as the axes: Salarymeans, Wordsmean, ClicksPerDaymean

svetakvsundhar/Datafest-2018