Scripts for ingesting data for Capital Nature

capital-nature-ingest

Web scraping of nature-related events from a variety of sources to populate the events calendar on Capital Nature.

What is Capital Nature?

Capital Nature is a 501(c)(3) nonprofit organization dedicated to bringing nature into the lives of Washington Metro area residents and visitors. They want an events calendar that highlights all the great nature events and experiences happening in the area.

How do we update the event calendar?

  • For each event source, we use Python (3.6.6) to scrape the event data and transform it to fit our schema.
  • The project is currently designed to run locally, outputting three separate CSV spreadsheets that the Capital Nature team can upload to their WordPress website.
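The scrape-then-transform step described above can be sketched roughly as follows. This is a minimal illustration only: the column names in `EVENT_FIELDS`, the keys of the raw event, and the helper names `to_schema` and `write_events` are all assumptions made for the example, not the project's actual schema or code.

```python
import csv
import io

# Hypothetical schema: the real column names used by the project may differ.
EVENT_FIELDS = [
    "Event Name",
    "Event Start Date",
    "Event Start Time",
    "Event Venue Name",
    "Event Organizers",
    "Event Website",
]


def to_schema(raw, organizer):
    """Map one scraped event (a dict with source-specific keys) onto EVENT_FIELDS."""
    # Assumes the source reports the start as an ISO-like "YYYY-MM-DDTHH:MM:SS" string.
    start_date, _, start_time = raw.get("start", "").partition("T")
    return {
        "Event Name": raw.get("title", "").strip(),
        "Event Start Date": start_date,
        "Event Start Time": start_time,
        "Event Venue Name": raw.get("location", "").strip(),
        "Event Organizers": organizer,
        "Event Website": raw.get("url", ""),
    }


def write_events(rows, handle):
    """Write schema-shaped rows to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=EVENT_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Each event source would need its own version of `to_schema`, since every site exposes different fields. Only the output columns are shared.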

In the future, we might deploy and schedule the script using AWS Lambda, with the output dumped into an S3 bucket.

To track bugs, request new features, or just submit interesting ideas, we use GitHub issues.

Getting Started

  1. Assuming you've got Python 3.6 and a GitHub account, clone the repo:
git clone https://github.com/DataKind-DC/capital-nature-ingest.git
  2. Navigate into the repository you just cloned:
cd capital-nature-ingest
  3. Create and activate a virtual environment, then install the dependencies:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

You can deactivate the virtual environment by running deactivate.

  4. Get the events:

Before getting the events, you'll need to have a National Park Service (NPS) API key and an Eventbrite API key.

  • Get one for NPS here
  • Get one for Eventbrite here. For the Eventbrite token, we've found it helpful to follow the instructions in this blog post. After signing up, open the dropdown in the top right and go to Account Settings > Developer Links (in the sidebar) > API Keys, then click Create API Key. Alternatively, go to this link.

Once you've got your tokens, add them as environment variables called NPS_KEY and EVENTBRITE_TOKEN, respectively. Alternatively, run the script and enter them when prompted.
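The "environment variable first, prompt as a fallback" behavior can be sketched like this. The helper name `get_key` and its parameters are hypothetical, introduced only for illustration. The project's own script may handle this differently.

```python
import os


def get_key(name, prompt=input, environ=os.environ):
    """Return the value of environment variable `name`, prompting if it is unset.

    `prompt` and `environ` are injectable so the function can be exercised
    without touching the real environment or waiting on the keyboard.
    """
    value = environ.get(name, "").strip()
    if not value:
        # Fall back to asking the user, as the script does when a key is missing.
        value = prompt(f"Enter your {name}: ").strip()
    return value
```

Calling `get_key("NPS_KEY")` and `get_key("EVENTBRITE_TOKEN")` would then return the configured values, or ask for them interactively when the variables are not set.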

To run the script:

python get_events.py

Running the above will scrape all of the events and write three CSV files to a new data/ directory in the project:

  • cap-nature-events-<date>.csv (all of the events)
  • cap-nature-organizers-<date>.csv (a list of the event sources, which builds on the previous list each time you run the script)
  • cap-nature-venues-<date>.csv (a list of the event venues, which builds on the previous list each time you run the script)
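One way the organizers and venues lists could build on the previous run is to load the earlier file and append only the rows that are not already present. The sketch below assumes each row is identified by a single column (the `key` argument). The function names and the deduplication rule are illustrative assumptions, not the project's actual logic.

```python
import csv
from pathlib import Path


def load_previous(path):
    """Read rows from an earlier CSV, or return an empty list if it does not exist."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def merge_rows(previous, new, key):
    """Return `previous` plus any rows from `new` whose `key` value is not yet present."""
    seen = {row[key] for row in previous}
    merged = list(previous)  # copy so the caller's list is left untouched
    for row in new:
        if row[key] not in seen:
            seen.add(row[key])
            merged.append(row)
    return merged
```

With this approach, rows from earlier runs keep their position in the file and newly scraped organizers or venues are appended at the end.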

Contributing

If you'd like to lend a hand, hop on over to our Issues to see what event sources still need scraping. If you see one that you'd like to tackle, assign yourself to that issue and/or leave a comment saying so. This will let others know that you're working on that event source and that they shouldn't duplicate your efforts.

Once you've found something you want to work on, please read our contributing guidelines for details on how to contribute using git and GitHub.

License

Here