/os_datasets

Small collection of open source datasets.

This repo contains a small collection of open-source datasets I helped maintain when I started teaching online for various clients, so there's a bit of a pedagogical slant to why I selected these. There are too many great data sources for those new to data science to explore them all, so this repo aims to be an excellent resource for practicing machine learning and fundamental analysis techniques rather than a comprehensive storage bin for CSV files.

This repo is re-based from another one, but the intention is to maintain this resource long-term. If you would like to contribute, just clone, commit, and open a pull request!

Datasets in The Wild

If you want a dataset and some basic forums related to data science challenges, Kaggle is legendary among data science people. Historically, Kaggle has been known for competitive modeling, but more recently it has gotten better at educational resources for storytelling and exploratory work.

One of the best resources for finding datasets of all types, labeled by feature type and problem domain (classification, regression, etc.).

A massive amount of data relating to the global community and governments around the world. Lots of geospatial and time-series data, in a variety of easy-to-load formats. The resource to use is their searchable data catalog portal.

Similar to the World Bank, Google has an excellent searchable directory of government-related data worldwide.

Public Google BigQuery Datasets (login may be required)

I've helped thousands of students find data with BigQuery for various final projects when I worked as an instructor. You can find historical weather, subreddit posts with comments, municipal US-based crime data, and historical cryptocurrency prices. If you're looking for a specific dataset to test the viability of an idea, GBQ should be on your go-to list of sources to check out. It's also extremely easy to pull into pandas for those Python-leaning people.
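As a rough illustration of the pandas route, here is a minimal sketch. It assumes the third-party `pandas-gbq` package is installed and that you have a Google Cloud project with credentials set up. The helper names `build_query` and `load_public_table`, the `my-gcp-project` ID, and the choice of the `bigquery-public-data.samples.shakespeare` table are all just for illustration, not something this repo ships.

```python
def build_query(table: str, limit: int = 1000) -> str:
    """Build a small SELECT against a fully qualified BigQuery table.

    Note: LIMIT caps the rows returned, not the bytes BigQuery scans,
    so prefer selecting specific columns on large tables.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{table}` LIMIT {limit}"


def load_public_table(table: str, project_id: str, limit: int = 1000):
    """Run the query and return the result as a pandas DataFrame.

    Hypothetical helper: requires pandas-gbq and Google Cloud credentials.
    """
    # Imported lazily so build_query stays usable without the dependency.
    import pandas_gbq

    return pandas_gbq.read_gbq(build_query(table, limit), project_id=project_id)


if __name__ == "__main__":
    # Only prints the SQL; uncomment the call below once credentials are set up.
    print(build_query("bigquery-public-data.samples.shakespeare", limit=5))
    # df = load_public_table(
    #     "bigquery-public-data.samples.shakespeare",
    #     project_id="my-gcp-project",  # assumption: replace with your own project ID
    #     limit=5,
    # )
    # print(df.head())
```

Keeping the query builder separate from the network call makes it easy to check the SQL before spending any quota.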

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See datasets from Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Many datasets are available to mount as storage volumes, but keep in mind that some of them are only available from certain regions around the world. Still, coverage may be better now than it was a few years ago when my students were using this resource.

Data journalism at its finest, but also a great source of the datasets referenced in their content.

Tons of government data that is as dirty as it is vast. Well, not always dirty: some branches of the US government follow good data practices, but a lot of the data isn't very well documented or referenced.

For more region-specific data related to crime, search for "department of justice [your state name]," and you can usually find great data on police encounters and many other high-level stats related to law enforcement.

Here's one for California

Useful Subreddits

Reddit also hosts great communities related to data. Here are a few that are useful.

/r/datasets - Search the subreddit for specific problems or data. It's also a great place to ask for specific datasets (e.g., bigfoot sightings), as the community is knowledgeable and fairly engaged.

/r/dataisbeautiful - Excellent subreddit for data visualization. Many of the popular posts cite their datasets. Great subreddit to follow for current trends and creative inspiration when it comes to using data.

/r/datascience - A general data science subreddit that is also an excellent resource for finding datasets.

/r/MachineLearning - With over 1M members, lots of posts to search through for datasets.

/r/LanguageTechnology - A "niche" NLP subreddit that frequently has references to publicly available datasets related to unstructured text.

/r/rstats - One of the bigger R-related subreddits, with many posts showing great examples of visualizing data as well as Q&A-style posts.