Slides: https://docs.google.com/presentation/d/1Qzgl6P9cNUejpLUx-ivtGDouEj2i82tKl3jF-sBwdbU/edit?usp=sharing
Missing value imputation-R: https://github.com/IQSS/amelia & https://gking.harvard.edu/amelia
Missing value imputation-sklearn:https://scikit-learn.org/stable/modules/impute.html
2024/6/11
-
add justifications for the threshold of the missing value ratio
-
2024/4/18-5/30:
-
add variables: VCF0006a” or “Unique Respondent Number (Cross-year ID for panel cases), voter/non-voter, vote_D/vote_R to the state-wise prediction results:
- VCF0006a: done
- voter/non-voter,vote_D/vote_R: already in the data
-
add the one model trained on the whole data and the state-wise model prediction results:
- to do in - done
-
add the citation of the ML term in documents
- done
-
clean the code, make it:
- script-run for new data test (apply curent model to new data) - doing
- script-run for new data training (model update)
- visulization and store results script -done
-
-
2024/4/11:
- send documents on lgistic regression + elastic net to the professor
- make the documetns on data process ( one-hot ) and missing value imputation
- build a table to show the gap between the "vote-D" and "vote-R" for whole data and state-wise data
-
2024/3/26:
- model did not work well on "intend-vote" group due to the imbalance of the data
- try some imbalanced data handling methods, like SMOTE, ADASYN, and RandomOverSampler, but did not work well
- tried some advanced/non-linear models, like GBTtree, RBF-SVM, and ensemble models, like AdaBoost , but did not work well
- clean the code and add some comments
-
2024/3/12:
- save and finished almost all stat-based analysis
- start to build the feature-importance model (Log-Reg)
- focus on the "WA" state group
-
2024/2/23:
-
add table for "intend to vote" but final "non-voter" for the white/black group in different area
-
add year-based plotting for the changing ratio
-
get some hypotheses to verify(based on "urban-rural"-feature, not miss out!):
-
Blacks in urban America are more likely to vote and vote for the Democratic candidates than are Blacks in rural America.
-
Blacks in suburban America are more likely to vote for the Republican candidates than are Blacks in urban America.
-
Whites in rural America are more likely to vote Republicans than are Whites in urban or suburban America.
-
White non-voters are more likely to live in rural American than in urban America.
-
-
state-based analysis: focus on WA
-
foucus on final non-voters who intend to vote
-
after feature filter-out, check the performance with only top-5/10/20 features
2024 Jan 3rd:
-
finish the feature filtering: set the missing-ratio in recent 20 years as the threshold, and remove the features with missing ratio larger than 0.3-done
-
build the simple classifier: logistic regression, random forest, and gradient boosting-doing
- using the small dataset (no missing values) to build the simple classifier
- send email to the professor to ask about the missing values and categorical features
- using the sklearn/R based imputation method to deal with the missing values
how to deal with the missing data?
- just drop? -> only 10% of the data left( 6000~7000 samples)
- fill with mean? -> too much categorical data
- use the
amelia
package? -> can it deal with the categorical data? -> to check - use the
sklearn.impute
package? -> can it deal with the categorical data? -> to check - use
how to deal with the categorical features?
- should we use the label "UK/don't want answer" as a category, or just drop it/as a missing values?
- is so, can we use the
sklearn.preprocessing.OneHotEncoder
to encode the categorical features? -> to check
2023 Dec:
- finish the data collection and process, set the target - Done