I conducted exploratory data analysis on Billboard hits in 2000 to see what interesting insights I could gain from the data.
I set out to predict whether a Congressional bill would pass based on its summary (provided by the CRS, the Congressional Research Service) alone. I'll be building on this extensively, including using LDA to cluster bills into separate categories and incorporating roll calls to try to predict which way an individual member of the House would vote.
I looked at tweets from Breitbart News and The Onion to see how similar they were. I also conducted sentiment analysis on the tweets to see if one was more subjective or polarizing than the other.
I love the START database (the Global Terrorism Database). I used it extensively during my master's degree and jumped at the opportunity to analyse it from a different angle. The codebook is an extensive EDA of the database and a Bayesian comparison of two groups' use of suicide bombs.
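To give a flavour of what a Bayesian comparison of two groups can look like, here is a minimal sketch using a Beta-Binomial model. It is not necessarily the approach taken in the project itself: the uniform Beta(1, 1) prior, the function names and the attack counts are all assumptions made up for illustration, and none of the numbers come from the Global Terrorism Database.

```python
import random


def posterior_samples(successes, trials, n_samples, rng):
    """Sample a rate from its Beta posterior, assuming a uniform Beta(1, 1) prior."""
    alpha = 1 + successes
    beta = 1 + (trials - successes)
    return [rng.betavariate(alpha, beta) for _ in range(n_samples)]


def prob_a_greater(a_succ, a_trials, b_succ, b_trials, n_samples=20000, seed=0):
    """Estimate P(rate_A > rate_B) by comparing paired posterior draws."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    a = posterior_samples(a_succ, a_trials, n_samples, rng)
    b = posterior_samples(b_succ, b_trials, n_samples, rng)
    return sum(x > y for x, y in zip(a, b)) / n_samples


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT real GTD figures):
    # group A: 30 suicide bombings out of 100 attacks, group B: 10 out of 100.
    p = prob_a_greater(30, 100, 10, 100)
    print(f"P(group A uses suicide bombs at a higher rate than group B) = {p:.3f}")
```

The result is a direct probability statement about which group has the higher rate, which is often easier to explain than a p-value.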
Everyone studying data science has done this project. I wanted to build the most concise and efficient web scraper I could (and keep updating it as I learn more about web scraping!) and use job titles and summaries, as well as keyword-based variables, to maximise the predictive accuracy of my classification algorithms.
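As a rough idea of what the keyword variables can look like, the sketch below turns a job title into binary indicator features that a classifier could consume. The keyword list and the function name are hypothetical, chosen only for illustration, and are not the ones used in the project.

```python
import re

# Hypothetical keyword list; the real project may use different terms.
KEYWORDS = ["senior", "manager", "python"]


def keyword_features(text, keywords=KEYWORDS):
    """Return a dict of 0/1 flags marking which keywords appear in the text."""
    # Lowercase and split into alphabetic tokens so matching is whole-word.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {f"has_{kw}": int(kw in tokens) for kw in keywords}


if __name__ == "__main__":
    print(keyword_features("Senior Data Scientist (Python)"))
```

Matching on whole tokens rather than substrings avoids false hits such as "manager" inside an unrelated longer word.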