jds485/ds-pipelines-targets-3

Meet the example problem


It's time to meet the data analysis challenge for this course! Over the next series of issues, you'll connect with the USGS National Water Information System (NWIS) web service to learn about some of the longest-running monitoring stations in USGS streamgaging history.

The repository for this course is already set up with a basic targets data pipeline that:

  • Queries NWIS to find the oldest discharge gage in each of three Upper Midwest states
  • Maps the state-winner gages
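Purely for orientation, here is a hedged sketch of the general shape such a pipeline could take. The target name oldest_active_sites and the file site_map.png come from later in this issue; the helper function names, file paths, and arguments shown are assumptions, so expect the real _targets.R to differ.

library(targets)

# Helper functions live in the repo; these paths are assumptions
source("1_fetch/src/find_oldest_sites.R")
source("3_visualize/src/map_sites.R")  # map_sites() is a hypothetical plotting helper

states <- c("WI", "MN", "MI")  # three Upper Midwest states
parameter <- "00060"           # NWIS parameter code for discharge

list(
  # Query NWIS for the oldest active discharge gage in each state
  tar_target(oldest_active_sites, find_oldest_sites(states, parameter)),
  # Map the state-winner gages and save the figure; the helper is assumed
  # to write the PNG and return its file path
  tar_target(
    site_map_png,
    map_sites("3_visualize/out/site_map.png", oldest_active_sites),
    format = "file"
  )
)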

⌨️ Activity: Switch to a new branch

Before you edit any code, create a local branch called "three-states" and push that branch up to the remote location "origin" (which is the GitHub-hosted copy of your repository).

git checkout main
git pull origin main
git checkout -b three-states
git push -u origin three-states

The first two lines aren't strictly necessary if your local main is already up to date, but it's a good habit to head back to main and sync with "origin" whenever you're transitioning between branches and/or PRs.


Comment on this issue once you've created and pushed the "three-states" branch.

Ready!

⌨️ Activity: Explore the starter pipeline

Without modifying any code, start by inspecting and running the existing data pipeline.

  • Open up _targets.R and read through it - can you guess what will happen when you build the pipeline?
  • Build all targets in the pipeline.
  • Check out the contents of oldest_active_sites.

💡 Refresher hints:

  • To build a pipeline, run library(targets) and then tar_make().
  • To assign an R-object pipeline target to your local environment, run tar_load(mytarget). This function will load the object in its current state.
  • If you want to make sure you have the most up-to-date version of a target, you can have targets check whether it is current and rebuild it if needed by running tar_make(mytarget) before calling tar_load(mytarget). (A short example of this workflow follows these hints.)
  • You'll pretty much always want to call library(targets) in your R session while developing pipeline code - otherwise, you need to call targets::tar_make() in place of tar_make() anytime you run that command, and all that extra typing can add up.
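Putting these hints together, a typical console session for this repo might look something like the following (using the target name from this course):

library(targets)

# Build (or rebuild) everything that is out of date
tar_make()

# Make sure one target is current, then pull it into your R environment
tar_make(oldest_active_sites)
tar_load(oldest_active_sites)

# Inspect the loaded object
oldest_active_sites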

When you're satisfied that you understand the current pipeline, include the value of oldest_active_sites$site_no and the image from site_map.png in a comment on this issue.


Add a comment to this issue to proceed.

oldest_active_sites$site_no
[1] "04073500" "05211000" "04063522"

[site_map.png: map of the state-winner gages]

⌨️ Activity: Spot the split-apply-combine

Hey, did you notice that there's a split-apply-combine action happening in this repo already?

Check out the find_oldest_sites() function:

find_oldest_sites <- function(states, parameter) {
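  # split states into individual states, apply find_oldest_site(state, parameter)
  # to each one, and combine (row-bind) the results into a single tibble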
  purrr::map_df(states, find_oldest_site, parameter)
}

This function:

  • splits states into each individual state
  • applies find_oldest_site to each state
  • combines the results back into a single tibble

and it all happened in just one line! The split-apply-combine operations we'll explore in this course require more code and are better suited to slow or fault-prone tasks, but they follow the same general pattern.

Check out the documentation for map_df by running ?purrr::map_df, or in the online purrr documentation, if this function is new to you.
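If it helps to see the pattern spelled out, here is a more verbose sketch of roughly what that one map_df() call does. The function name below is made up for illustration, and dplyr::bind_rows() stands in for the combine step that map_df() performs internally.

# Same split-apply-combine, written out step by step
find_oldest_sites_verbose <- function(states, parameter) {
  results <- list()
  for (state in states) {
    # apply: one NWIS query per state
    results[[state]] <- find_oldest_site(state, parameter)
  }
  # combine: row-bind the per-state tibbles into one tibble
  dplyr::bind_rows(results)
}

# Example call with the three states used in this course ("00060" is the
# NWIS parameter code for discharge):
# find_oldest_sites_verbose(c("WI", "MN", "MI"), parameter = "00060")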


When you're ready, comment again on this issue.

Ready!

⌨️ Activity: Apply a downloading function to each state

Awesome, time for your first code changes ✏️.

  • Write three targets in _targets.R that apply get_site_data() to each state in states (insert these new targets under the # TODO: PULL SITE DATA HERE placeholder in _targets.R). Name the targets wi_data, mn_data, and mi_data, and pass oldest_active_sites as the sites_info argument to get_site_data(). A rough sketch of the shape these targets could take appears after the hint below.

  • Add a call to source() near the top of _targets.R as needed to make your pipeline executable.

  • Test it: You should be able to run tar_make() with no arguments to get everything built.

💡 Hint: the get_site_data() function already exists and shouldn't need modification. You can find it by browsing the repo or by hitting Ctrl+Shift+F in RStudio and then searching for "get_site_data".
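As a rough guide, the new targets could look something like the sketch below. Only the sites_info argument is confirmed by this issue; the state and parameter arguments (and the objects passed to them) are assumptions, so check get_site_data()'s definition and the existing code in _targets.R before copying anything.

# Goes inside the list() of targets in _targets.R, under # TODO: PULL SITE DATA HERE
tar_target(
  wi_data,
  get_site_data(sites_info = oldest_active_sites, state = "WI", parameter = parameter)
),
tar_target(
  mn_data,
  get_site_data(sites_info = oldest_active_sites, state = "MN", parameter = parameter)
),
tar_target(
  mi_data,
  get_site_data(sites_info = oldest_active_sites, state = "MI", parameter = parameter)
)

Because these sit inside an existing list(), remember to separate them from neighboring targets with commas.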

When you're satisfied with your code, open a PR to merge the "three-states" branch into "main". Make sure to add _targets/*, 3_visualize/out/*, and any .DS_Store files to your .gitignore file before committing anything. In the description box for your PR, include a screenshot or transcript of your console session where the targets get built.
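For reference, the .gitignore additions would look something like this (one pattern per line):

_targets/*
3_visualize/out/*
.DS_Store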


I'll respond in your new PR. You may need to refresh the PR page to see my response.