In this repository I share replication materials for the article "Predicting Political Attitudes from Web Browsing Histories: Machine Learning Approach".
The paper introduces a machine learning approach to identifying political attitudes from people's website choices. Specifically, I use web tracking data from 1,000 German voters, collected over three months of tracking. I propose categorizing websites by their domains: website domains are matched with existing categories from Webshrinker, and each category serves as a predictor variable in a regression. I use the following models: Linear Regression, Elastic Net, and Random Forest.
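The matching step described above can be sketched as follows. This is a minimal illustration in Python, not the repository's R code: the domain-to-category lookup, the category labels, and the visit records are all hypothetical stand-ins for the Webshrinker table and the tracking data.

```python
from collections import Counter, defaultdict

# Hypothetical domain-to-category lookup (assumption: stands in for the
# Webshrinker category table shared in this repository).
DOMAIN_CATEGORIES = {
    "spiegel.de": "News and Media",
    "bild.de": "News and Media",
    "netflix.com": "Entertainment",
}

# Hypothetical tracking records: (respondent_id, visited_domain).
visits = [
    (1, "spiegel.de"),
    (1, "netflix.com"),
    (1, "bild.de"),
    (2, "netflix.com"),
    (2, "unknown-site.example"),  # no category match, so it is skipped
]


def category_counts(visits, domain_categories):
    """Count visits per respondent and category, skipping unmatched domains."""
    counts = defaultdict(Counter)
    for respondent, domain in visits:
        category = domain_categories.get(domain)
        if category is not None:
            counts[respondent][category] += 1
    return counts


def to_feature_rows(counts, categories):
    """Turn the counts into one row of predictors per respondent."""
    return {
        respondent: [per_category[cat] for cat in categories]
        for respondent, per_category in counts.items()
    }


counts = category_counts(visits, DOMAIN_CATEGORIES)
categories = sorted(set(DOMAIN_CATEGORIES.values()))
features = to_feature_rows(counts, categories)
print(categories)
print(features)
```

Each row in `features` holds one respondent's visit counts per category, which is the shape a regression model expects for its predictors.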
Below is the list of available replication materials and supplementary files for replication and further research.
Machine Learning Methods: Dimensionality reduction, Linear regression, Random Forest and Elastic Net
- Table with domain categories from Webshrinker that we managed to match with domains from our initial web tracking data.
- R code for exploring domain categories from Webshrinker: a table with descriptive statistics, such as the sum of visits by group of domain categories;
- Distribution plots: code.
- Table with top 5 domains per category. Note that Webshrinker offers subcategories within main categories such as Business. The table shows the top domains for each subcategory.
- Plots with descriptive OLS estimates, with controls: selected political attitudes and domain categories, the rest of the political attitudes;
- Plots with OLS estimates with controls for the rest of the political attitudes;
- OLS, Random Forest and ElasticNet summary plot: Pearson correlations and R2 for all political attitudes (R code to make this plot);
- Plots with Variable Importance Rank of domain categories for each political attitude (R code that can also produce an interactive plot with plotly): Variable importance from Random Forest, and Linear regression.
- Two models showed significant predictions: support for the democratic political system and interest in politics. The variable importance ranks for both models are plotted in Plot 1 and Plot 2, respectively.
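The summary plot listed above reports Pearson correlations and R2 for each political attitude. As a rough illustration of how these two measures relate observed and predicted values, here is a minimal Python sketch. It is not the repository's R code, and the `observed` and `predicted` numbers are made up for the example.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up attitude scores and model predictions (assumption, for illustration).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(round(pearson_r(observed, predicted), 3))
print(round(r_squared(observed, predicted), 3))
```

The two measures can diverge: the correlation ignores systematic shifts or scaling in the predictions, while R2 penalizes them, which is one reason to report both.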
We combined survey and web tracking data to build machine learning models in which website visits predict self-reported political attitudes. There are several findings about predictive models and their applications in social science, although the evidence is mixed and requires further research. We built a machine learning model for each political attitude of interest, 15 in total. Two models showed significant predictions: interest in politics and support for the democratic system. Issue-related attitudes and populist attitudes could not be predicted from web tracking data. Web tracking data was more successful in predicting demographics. From the variable importance rank we also learned that media-related website domains contribute substantially to predicting political attitudes. Entertainment domains did not contribute to model performance.
A summary of the analysis from this repository is available in the Online Appendix of the paper: LINK.
Additionally, plots for validating the web tracking data are provided: browsing behavior and privacy policy, web tracking vs. national German panel.