
gdelt-demo

Note about this project

I am not actively working on this project, so it is a snapshot of what I was learning in 2019. Even though it remains an unfinished hodgepodge, it demonstrates some of my capabilities in a way that I can't do with code written in my present role.

Archival README

For the main showcase, start here. You can view it in Jupyter Notebook or at that link.

This is an early-stage demo project to show how data science can be applied to the Global Database of Events, Language, and Tone (GDELT), a multi-terabyte dataset of world events. From this analysis I expect to derive insights into how nations and societies interact, particularly in predicting precursors to violence that may not be obvious.

This project is exploratory, in the sense that I'm still getting a feel for the capabilities of the GDELT data.

It showcases my competency in Python and in data science, and it has been a platform for building skills in AWS (and soon, Google BigQuery, as needed for real-time GDELT v2), including Hadoop and, to an extent, Spark. So far I have examples of classification using decision trees, k-nearest neighbors, support vector machines, and random forests, as well as an example of linear regression. As this project progresses it will more extensively showcase a full data science life cycle, but the pieces are already there in some form: exploratory analysis, hypothesis testing, data wrangling/munging, modeling, and drawing conclusions.
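As a rough illustration of the classification techniques listed above, here is a minimal scikit-learn sketch. The features, labels, and hyperparameters are placeholders, not the project's actual inputs; the real feature extraction and modeling live in the analysis code and notebooks.

```python
# Minimal sketch: comparing the classifiers mentioned above on placeholder
# data. X stands in for dyad-level features extracted from GDELT and y for
# a binary label (e.g. whether an aggressive event followed).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))             # placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # placeholder labels

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=7),
    "support vector machine": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(n_estimators=200),
}

# Five-fold cross-validation gives a quick, comparable accuracy estimate
# for each model family.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```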

For other projects in my showcase, a task list for this project, and other info, see readme_more.md.

Setup

Most of this code is Python 3 (3.5 or 3.6). To run the analysis code in Python, you'll likely want to do the standard routine of pip3 install -r requirements.txt.

News and roadmap

For a while I had turned my attention away from this project to build some Python skills, but as of mid-January 2019 I'm actively building it out.
At present I'm working on using the dyad queries (from Hadoop/Hive) to predict aggressive events, still using version 1 of GDELT. I've also made some updates to my AWS automation via boto3 (see automation).
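To give a flavor of the boto3 automation mentioned above, here is a hedged sketch of launching an EMR cluster with a Hive step. The cluster name, release label, instance types, and S3 paths are placeholders rather than the project's actual settings; the real scripts are in the automation directory.

```python
# Illustrative boto3 sketch of EMR automation. All names, sizes, and paths
# below are placeholders for the purposes of this example.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="gdelt-demo-cluster",                       # placeholder name
    ReleaseLabel="emr-5.29.0",                       # placeholder release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    Steps=[
        {
            "Name": "Run dyad query",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Placeholder: run a HiveQL script stored on S3.
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://my-bucket/queries/dyads.hql"],
            },
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Started cluster:", response["JobFlowId"])
```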

The roadmap is highly iterative and subject to change, as project roadmaps often should be.

    1. Proof of concept (mostly done).
      1. Done: Extract some features related to countries and how they relate; demonstrate basic classification and clustering techniques.
      2. Still to do: Add basic visualizations using matplotlib.
    2. Expand the universe of research questions under consideration and deploy the above tools to derive more "real-world" value from the findings. Enhance visualizations.
    3. Consider alternative interfaces into the data, at a much higher level of sophistication and interactivity (e.g. a Django site? interactive visualizations?), to make the findings more accessible to the public.

Skills demonstrated and accomplishments

Skills

  • Ideation and asking "good questions": Since ideation is an important soft skill that I can contribute in abundance, I've documented my thought process in devising new questions. See research questions under Start_here.ipynb.

  • Regression analysis (Scikit-learn LinearRegression class): Findings are in Start_here.ipynb; see analysis for the code. A minimal sketch of the workflow appears after this list.

  • SQL and HiveQL: The most complex examples are my first cuts at feature extraction; see queries/feature_extrac. For exploratory queries, see queries/exploration. The SQL queries should run against a local MySQL instance, and many of the Hive queries will run against MySQL with minimal modification.

  • AWS setup (S3, EMR cluster creation): Data engineering skills are not my highest priority, but they are useful to complement the purely data science skills.

  • Spark (via PySpark): See queries/spark_sql. Hadoop has worked fine with HiveQL, but so far my Spark apps hang when run as steps; figuring out what's going on here will be helpful to my overall understanding. A short PySpark sketch follows this list.
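To make the regression bullet above concrete, here is a minimal sketch of the scikit-learn LinearRegression workflow on synthetic data; the real inputs and findings are in Start_here.ipynb and the analysis directory.

```python
# Minimal LinearRegression sketch on synthetic data. The coefficients and
# R^2 reported here are for the placeholder data, not project findings.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                                  # placeholder predictors
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2:", r2_score(y, model.predict(X)))
```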
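And to make the Spark bullet concrete, here is a minimal PySpark / Spark SQL sketch. The input path, file format, and the assumption of a header row are placeholders (the raw GDELT v1 exports are actually headerless and tab-delimited); the column names follow the GDELT v1 event table, and the real queries are in queries/spark_sql.

```python
# Sketch of PySpark / Spark SQL usage against a landed copy of GDELT events.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdelt-demo-sketch").getOrCreate()

# Assumes the events have been landed as CSV with a header row; the path
# is a placeholder.
events = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("s3://my-bucket/gdelt/events/"))

events.createOrReplaceTempView("events")

# Example dyad-style aggregation: average tone by actor-country pair.
dyad_tone = spark.sql("""
    SELECT Actor1CountryCode, Actor2CountryCode, AVG(AvgTone) AS avg_tone
    FROM events
    WHERE Actor1CountryCode IS NOT NULL AND Actor2CountryCode IS NOT NULL
    GROUP BY Actor1CountryCode, Actor2CountryCode
    ORDER BY avg_tone
    LIMIT 20
""")

dyad_tone.show()
```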

Accomplishments

The preceding list is oriented toward showcasing specific skills, but the findings themselves (as enumerated in Start_here.ipynb) are an accomplishment. So far I've:

  • Found that media coverage in general, holding constant the kinds of events covered, tends to get more negative over time. By "in general" I mean that this is not just within the lifespan of a given event (e.g. a war that lasts for years) but across all coverage.
  • Found that media coverage of events tends to be more positive for the kinds of events that promote stability, though not to what I would consider a great extent after controlling for the time effect above.
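To show the shape of the check behind the first finding (not the notebook's exact method), here is a hedged sketch of regressing average tone on time while holding event type constant. The column names (MonthYear, EventRootCode, AvgTone) follow the GDELT v1 event table, but the input file and modeling details are placeholders.

```python
# Sketch: does average tone trend downward over time once the mix of event
# types is held constant? The CSV path is a placeholder extract of GDELT events.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("events_sample.csv")  # placeholder extract

# Dummy-encode event type so the time coefficient is estimated while
# holding the composition of event types constant.
X = pd.get_dummies(df[["MonthYear", "EventRootCode"]],
                   columns=["EventRootCode"], drop_first=True)
y = df["AvgTone"]

model = LinearRegression().fit(X, y)
time_coef = model.coef_[list(X.columns).index("MonthYear")]
print("Tone change per unit of MonthYear:", time_coef)
```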

In addition, the automation itself is an accomplishment: it starts to give structure and reproducibility to setting up the various environments (local, remote SQL, AWS EMR) used for analysis.