Slack Channel: #refugees
Project Description: Classifying, tagging, analyzing and visualizing news events about internal displacement. Based on a challenge from the IDMC. Our aim is to build a tool that can populate a database with displacement events which can be both classified by a machine and verified by a human. The details of each event are to be fed into a tool for analysis and visualization.
Project Lead:
Maintainers: These are the additional people mainly responsible for reviewing pull requests, providing feedback and monitoring issues.
- Join the Slack channel
- Read the pinned posts in Slack to get a full idea of the project, and feel free to ask questions.
- Browse our issues for help-wanted, beginner-friendly, and discussion tags (full issue label guide here)
- See something you want to work on? Make a comment on the issue or ping us on Slack so we can assign you the task or discuss it.
- Write the code and submit a pull request to add it to the project. Reach out for help any time!
- Try to keep each contribution and pull request focussed mostly on solving the issue at hand. If you see more things that are needed, feel free to let us know and/or make another issue.
- Datasets can be accessed from Dropbox
- We have a working plan for the project.
- Not ready to submit code to the main project? Feel free to play around with notebooks and submit them to the repository.
- "First-timers" are welcome! Whether you're trying to learn data science, hone your coding skills, or get started collaborating over the web, we're happy to help. (For beginners with Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
- We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities, but to share insights and achievements. Code reviews help us continually refine the project's scope and direction, and encourage the discussion the project needs to thrive.
- This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.
The main components of the project:
- Scraping
  - Take lists of URLs and scrape the content of their web pages.
  - Extract the main body text and metadata.
  - Store the information.
- Machine Learning & NLP
  - Filter out broken URLs in the master input dataset, as well as those pointing to non-useful content (videos, etc.).
  - Classify URLs in the master input dataset as conflict/violence, disaster or other. There is a training dataset to help with tagging.
  - Extract information from the articles behind the URLs: location and number of reporting units (households or individuals) displaced, date published and reporting term (conflict/violence, disaster or other). The larger extended input dataset can be used to help here.
- Visualize!
  - A mapping tool is desired to visualize the displacement figures and locations, and to identify hotspots and trends.
  - A histogram or other visualization for a selected region, to show how frequently the area is reported on.
  - Taking into account only the documents that report actual displacement figures, visualize the excerpts of documents where the relevant information is reported (either looking at the map or browsing the list of URLs).
  - Some pre-tagged datasets (1, 2) can be used to start exploring visualization options.
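
To make the scraping and NLP steps above more concrete, here is a minimal sketch using only the Python 3 standard library. It is not the project's actual implementation. The function names (`extract_text`, `classify`, `extract_figures`), the keyword lists and the regular expression are all hypothetical, picked purely for illustration. A real solution would more likely train a classifier on the training dataset than rely on keyword matching.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword lists, for illustration only (not taken from the project).
CATEGORY_KEYWORDS = {
    "conflict/violence": {"conflict", "violence", "fighting", "clashes", "attack"},
    "disaster": {"flood", "floods", "earthquake", "cyclone", "storm", "drought"},
}

# Hypothetical pattern: a number followed by a reporting unit.
FIGURE_RE = re.compile(
    r"(\d[\d,]*)\s+(people|persons|individuals|residents|families|households)",
    re.IGNORECASE,
)
HOUSEHOLD_TERMS = {"families", "households"}


class _ParagraphText(HTMLParser):
    """Collect text inside <p> tags as a rough proxy for the main body text."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            self.chunks.append(" ")  # keep consecutive paragraphs separated

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_text(html):
    """Return the whitespace-normalized paragraph text of an HTML page."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def classify(text):
    """Label text by counting distinct keyword hits; 'other' if none match."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(words & kws) for label, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # ties go to the first category listed
    return best if scores[best] > 0 else "other"


def extract_figures(text):
    """Return (count, reporting_unit) pairs such as (1200, 'households')."""
    results = []
    for number, unit in FIGURE_RE.findall(text):
        kind = "households" if unit.lower() in HOUSEHOLD_TERMS else "individuals"
        results.append((int(number.replace(",", "")), kind))
    return results


if __name__ == "__main__":
    page = (
        "<html><body><h1>News</h1><p>Heavy floods hit the region.</p>"
        "<p>About 1,200 <b>families</b> fled their homes.</p></body></html>"
    )
    body = extract_text(page)
    print(classify(body), extract_figures(body))
```

Keeping the three steps as separate functions mirrors the component split above, so any one of them (for example the keyword classifier) can be swapped for a stronger approach without touching the others.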
- Languages - Python 3
- Skills - NLP, ML, web scraping, geospatial, visualization
Don't see your skill here? Don't worry, we are looking to make all kinds of enhancements to the project so there will likely be a place for you. We especially need developer/web dev experience.