Covid-19 Occupation Analysis

Maxwell Austensen (@austensen)
Hayley Raetz (@hayley-raetz)
Jiaqi Dong (@Mocha22ol)

The analysis presented here appears in these two blog posts:

What are the Housing Costs of Households Most Vulnerable to Job Layoffs? An Initial Analysis

# Install required packages 

# pkgs <- c(
#   "tidyverse",
#   "googlesheets4",
#   "hrbrthemes",
#   "srvyr",
#   "knitr",
#   "rmarkdown",
#   "sf",
#   "dotenv"
# )

# install.packages(pkgs)

library(tidyverse) # general data manipulation and graphing
library(googlesheets4) # google sheets
library(srvyr) # survey functions
library(knitr) # markdown table with kable()
library(sf) # spatial data
library(dotenv) # load environment variables from ".env" file

# Load custom functions to help with plotting
source("R/utils.R")

# No scientific notation in outputs
options(scipen = 999)

# Interactively authorize the {googlesheets4} package to access your Google
# account and then cache the authorization token for subsequent uses.

# You must set your google email in the ".env" file (see ".env_sample")
options(
  gargle_oauth_email = Sys.getenv("GOOGLE_EMAIL"),
  gargle_oauth_cache = ".cache"
)

sheets_auth()

Vulnerable Occupations

To conduct this analysis we used IPUMS occupation code (occ) to separate occupations into two categories: those with the highest risk of mass layoffs and workplace closures due to the pandemic; and those likely to be more protected from widespread disruption. For more information on our the approach, see our blog post based on this analysis What are the Housing Costs of Households Most Vulnerable to Job Layoffs? An Initial Analysis.

We invite constructive criticism and a dialogue over the methodology for classifying these occupation. You can view and comment on the classification in this Google Sheet, and we look forward to any advice or feedback you might have.

# Read in the occupation risk classification

occ_risk_sheet <- "https://docs.google.com/spreadsheets/d/18dmgZC_sQZOc9AxETwqs7Wc9P6QmCG62y40wlFxXMMk"

occ_risk_xwalk <- occ_risk_sheet %>% 
  read_sheet(col_types = "ld___", col_names = c("risk_group", "occ"), skip = 1)

Data Preparation

All data for this analysis comes from IPUMS USA, University of Minnesota. To build on this analysis and/or replicate it for a different geography, you can sign up for a free account and download your own extract of the data. From the IPUMS USA page, go to Select Data and choose the variables. In addition to to automatically pre-selected variables, you’ll to select the following other variables: statefip, countyfip, puma, occ, incwage, hhincome, ownershp, rentgrs, unitsstr, hispan, and race. Then click Change Samples to select the data sample you want to use (for this analysis we have used ACS 2018 1-year). Once you have all your variables and samples selected, click View Cart and then Create Extract. The default options here are fine (format: .csv, structure: Rectangular (person)), and by default you’ll download data for the whole country. You can click Select Cases, then statefip to export data for only the states you select. Once the request is complete, download the file to the /data folder and adjust the following section of code to reflect your file name and filter to your desired geography.

# Read in IPUMS USA ACS microdata, filter to desired geography

ipums_raw <- read_csv("data/ipums_acs-2018-1yr_ny.csv.gz") %>% 
  rename_all(str_to_lower) %>% 
  filter(
    # Keep only NYC
    statefip == 36,
    countyfip %in% c(5, 47, 61, 81, 85)
  )

# Create all person- and household-level variables for analysis

ipums_clean <- ipums_raw %>% 
  filter(
    # Remove group quarters population
    gq %in% 1:2 
  ) %>% 
  # Join in risk_group occupation flag
  left_join(occ_risk_xwalk, by = "occ") %>%
  mutate(
    # Set missing values
    inc_wages = incwage %>% na_if(999999) %>% na_if(999998) %>% na_if(0),
    
    # There are lots of people with occupation codes but no wages, for this
    # analysis we'll count them as not having that occupation, and like people
    # without an OCC code will have risk_group=NA
    risk_group = if_else(is.na(inc_wages), NA, risk_group),
    risk_group_lab = if_else(risk_group, "More vulnerable", "Less vulnerable"),
    
    risk_wages = if_else(risk_group, inc_wages, NA_real_),
    
    # Household income
    hh_inc_nom = case_when(
      hhincome <= 0 ~ 0,
      hhincome == 9999999 ~ NA_real_, 
      TRUE ~ hhincome
    ),
    hh_inc_grp = cut(
      hh_inc_nom, c(-Inf, 15e3, 30e3, 50e3, 75e3, 112.5e3, 150e3, Inf),
      c("< $15k", "$15k - 30k", "$30k - $50k", "$50k - $75k", "$75k - $112.5k", "$112.5k - $150k", ">= $150k"),
      include.lowest = TRUE, 
      ordered_result = TRUE
    ),
  
    # Various renter variables. These are household level variables, and will
    # only be used later after filtering to one row per household.
    is_renter = (ownershp == 2),
    gross_rent_nom = if_else(is_renter, rentgrs, NA_real_),
    gross_rent_grp = cut(
      gross_rent_nom, 
      c(-Inf, 600, 1000, 1400, 1800, 2200, Inf),
      c("< $600", "$600 - 1,000", "$1,000 - $1,400", "$1,400 - $1,800", "$1,800 - $2,200", ">= $2,200"),
      include.lowest = TRUE, 
      ordered_result = TRUE
    ),
    is_rent_burdened = (rentgrs > (hh_inc_nom / 12 * 0.30)),
    is_rent_burdened_sev = (rentgrs > (hh_inc_nom / 12 * 0.50)),
    is_rent_burdened_mod = (is_rent_burdened) & (!is_rent_burdened_sev),
    
    # Race/ethnicity labels for graph
    race_name = case_when(
      hispan %in% 1:4 ~ "Hispanic,\nof any race",
      race == 1 ~ "Non-Hispanic\nwhite",
      race == 2 ~ "Non-Hispanic\nBlack",
      race == 3 ~ "Non-Hispanic\nAmerican\nIndian or\nAlaska Native",
      race == 4 ~ "Non-Hispanic\nAsian or\nPacific Islander", # Chinese
      race == 5 ~ "Non-Hispanic\nAsian or\nPacific Islander", # Japanese
      race == 6 ~ "Non-Hispanic\nAsian or\nPacific Islander", # Other Asian or Pacific Island
      race == 7 ~ "Non-Hispanic\nother",
      race == 8 ~ "Non-Hispanic\nTwo or more\nmajor races", # Two major races
      race == 9 ~ "Non-Hispanic\nTwo or more\nmajor races" # Three or more major races
    ),
    bldg_size = case_when(
      unitsstr %in% 3:4 ~ "1",
      unitsstr %in% 5:6 ~ "2-4",
      unitsstr == 7 ~ "5-9",
      unitsstr == 8 ~ "10-19",
      unitsstr == 9 ~ "20-49",
      unitsstr == 10 ~ "50+",
      TRUE ~ "other"
    ) %>% ordered(levels = c("1","2-4","5-9","10-19","20-49","50+", "other"))
  ) %>% 
  # Group by household and categorize households based or occupations of members
  group_by(serial) %>% 
  mutate(
    # Household with at least one wage earner in a more vulnerable occupation
    
    # If there are no members with wages then NA, if there are any at-risk
    # people with wages then TRUE, if there are people with wages but none of
    # them are at risk then FALSE
    hh_any_risk = case_when(
      all(is.na(risk_group)) ~ NA, # no wage earners
      any(risk_group, na.rm = TRUE) ~ TRUE, # any wage earners are at risk
      all(!risk_group, na.rm = TRUE) ~ FALSE # all wage earners are at NOT at risk
    ),
    
    # Household with all wage earners in more vulnerable occupations
    
    # If all members have no wag income then NA, if all the wage earners are in
    # the risk group then TRUE, if there are members with wage income but
    # none/only some are at risk then FALSE
    hh_all_risk = case_when(
      all(is.na(risk_group)) ~ NA, # no wage earners
      all(risk_group, na.rm = TRUE) ~ TRUE, # all wage earners are at risk
      any(!risk_group, na.rm = TRUE) ~ FALSE # not all wage earners are at risk
    ),
    
    # The total wages for each household that come from vulnerable occupations
    hh_risk_wages = sum(risk_wages, na.rm = TRUE),
    
    # The percent of household income that comes from wages from vulnerable occupations
    hh_risk_wages_pct = sum(risk_wages, na.rm = TRUE) / na_if(hh_inc_nom, 0)
  ) %>% 
  ungroup()

Graphs

# Share of households with at least one member employed in a more vulnerable
# occupation by household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    !is.na(hh_any_risk) # remove households with no wage earners
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(risk_pct = survey_mean(hh_any_risk, vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = risk_pct,
    ymin = risk_pct_low,
    ymax = risk_pct_upp,
    y_limits = c(0, 1),
    y_format = "percent"
  ) +
  labs(
    title = str_glue(
      "Share of households with at least one member employed 
      in a more vulnerable occupation by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = str_glue("Share of households"),
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any_share-income.png")

# Households with at least one member employed in a more vulnerable occupation
# by household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk # keep only households with at least one wage earner in vulnerable occupation
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(households = survey_total(vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = households,
    y_limits = c(0, 300000),
    ymin = households_low,
    ymax = households_upp,
    y_format = "si"
  ) +
  labs(
    title = str_glue(
      "Households with at least one member employed in
      a more vulnerable occupation by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = "Households",
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any_households-income.png")

# Renter share of households with at least one member employed in a more
# vulnerable occupation by household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk # keep only households with at least one wage earner in vulnerable occupation
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(renter_pct = survey_mean(is_renter, vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = renter_pct,
    y_limits = c(0, 1),
    ymin = renter_pct_low,
    ymax = renter_pct_upp,
    y_format = "percent"
  ) +
  labs(
    title = str_glue(
      "Renter share of households with at least one member employed
      in a more vulnerable occupation by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = str_glue("Renter share of households"),
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any_renter-share-income.png")

# Households with all employed members in more vulnerable occupations by
# household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_all_risk # keep only households with all wage earners in vulnerable occupations
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(households = survey_total(vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = households,
    y_limits = c(0, 250000),
    ymin = households_low,
    ymax = households_upp,
    y_format = "si"
  ) +
  labs(
    title = str_glue(
      "Households with all employed members in
      more vulnerable occupations by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = "Households",
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-all_households-income.png")

# Share of households with all employed members in more vulnerable occupations
# by household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    !is.na(hh_all_risk) # keep only households with at least one wage earner
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(risk_pct = survey_mean(hh_all_risk, vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = risk_pct,
    y_limits = c(0, 0.75),
    ymin = risk_pct_low,
    ymax = risk_pct_upp,
    y_format = "percent"
  ) +
  labs(
    title = str_glue(
      "Share of households with all employed members in
      more vulnerable occupations by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = str_glue("Share of households"),
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )
  
plot_save_include("img/nyc_occ-risk-all_share-income.png")

# Median rent for households with at least one member employed in a more
# vulnerable occupation by household income

p <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk, # keep only households with at least one wage earner in vulnerable occupation
    is_renter # keep only renter households
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_inc_grp, .drop = FALSE) %>% 
  summarise(gross_rent_nom_med = survey_median(gross_rent_nom, vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  fc_col_plot(
    x = hh_inc_grp,
    y = gross_rent_nom_med,
    y_limits = c(0, 2500),
    ymin = gross_rent_nom_med_low,
    ymax = gross_rent_nom_med_upp,
    y_format = "dollar"
  ) +
  labs(
    title = str_glue(
      "Median rent for households with at least one member employed 
      in a more vulnerable occupation by household income"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = "Median gross rent",
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any_rent-income.png")

# Share of wage earners employed in more vulnerable occupations by
# race/ethnicity

p <- ipums_clean %>% 
  filter(
    !is.na(risk_group) # keep only wage earners
  ) %>% 
  as_survey_design(weights = perwt) %>% 
  group_by(race_name, .drop = FALSE) %>% 
  summarise(pop_pct = survey_mean(risk_group, vartype = "ci", level = 0.90)) %>% 
  ungroup() %>% 
  mutate(race_name = fct_reorder(race_name, -pop_pct)) %>% 
  fc_col_plot(
    x = race_name,
    y = pop_pct,
    y_limits = c(0, 0.6),
    ymin = pop_pct_low,
    ymax = pop_pct_upp,
    y_format = "percent"
  ) +
  labs(
    title = str_glue(
      "Share of wage earners employed in more
      vulnerable occupations by race/ethnicity"
    ),
    subtitle = "New York City, 2018",
    x = "Race/ethnicity",
    y = str_glue("Share of wage earners"),
    caption = str_glue(
      "Notes: Error bars represent 90% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk_pop-share-race.png")

# Create distributions of renter households by building size, for "any risk"
# (more vulnerable) and "no risk" (less vulnerable), including CIs

size_more_vul <- ipums_clean %>%
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk, # keep only households with at least one wage earner in vulnerable occupation
    is_renter #keep only renter households
  ) %>% 
  as_survey_design(weights = perwt) %>%
  group_by(bldg_size, .drop = FALSE) %>%
  summarise(households = survey_total(vartype = "ci", level = 0.90))

total_more_vul <- size_more_vul %>%
  mutate(
    households_moe = households_upp - households,
    households_moe_sqr = households_moe^2
  ) %>%
  summarize(
    total = sum(households),
    total_moe = sqrt(sum(households_moe_sqr))
  )

size_more_vul_rates <- size_more_vul %>%
  mutate(
    households_moe = households_upp - households,
    total = total_more_vul[["total"]],
    total_moe = total_more_vul[["total_moe"]],
    share = households / total,
    share_moe = (1 / total) * sqrt(households_moe^2 - (share * total_moe)^2),
    share_low = share - share_moe,
    share_upp = share + share_moe
  ) %>%
  select(bldg_size, share, share_low, share_upp) %>%
  mutate(category = "More vulnerable")


size_less_vul <- ipums_clean %>%
  filter(
    pernum == 1, # keep only one row per household
    !hh_any_risk, # keep only households without at least one wage earner in vulnerable occupation
    is_renter #keep only renter households
  ) %>% 
  as_survey_design(weights = perwt) %>%
  group_by(bldg_size, .drop = FALSE) %>%
  summarise(households = survey_total(vartype = "ci", level = 0.90))

total_less_vul <- size_less_vul %>%
  mutate(
    households_moe = households_upp - households,
    households_moe_sqr = households_moe^2
  ) %>%
  summarize(
    total = sum(households),
    total_moe = sqrt(sum(households_moe_sqr))
  )

size_less_vul_rates <- size_less_vul %>%
  mutate(
    households_moe = households_upp - households,
    total = total_less_vul[["total"]],
    total_moe = total_less_vul[["total_moe"]],
    share = households / total,
    share_moe = (1 / total) * sqrt(households_moe^2 - (share * total_moe)^2),
    share_low = share - share_moe,
    share_upp = share + share_moe
  ) %>%
  select(bldg_size, share, share_low, share_upp) %>%
  mutate(category = "Less vulnerable")

p <- bind_rows(size_less_vul_rates, size_more_vul_rates) %>% 
  filter(bldg_size != "other") %>% 
  fc_col_plot_cluster(
    x = bldg_size,
    y = share,
    fill = category,
    y_limits = c(0, 0.5),
    ymin = share_low,
    ymax = share_upp,
    y_format = "percent"
  ) +
  scale_fill_manual(values = c("#2c7fb8", "#98e2c9")) +
  labs(
    title = str_glue(
      "Distribution of renter households across building sizes, by economic vulnerability"
    ),
    subtitle = "New York City, 2018",
    x = "Units in building",
    y = "Share of households",
    fill = NULL,
    caption = str_glue(
      "Notes: Only renter households with at least one emloyed member are included. Households are considered  more vulnerable if at least one member works 
      in a vulnerable occupation and households without any members in such occupations are considered less vulnerable. 
      Error bars represent 90% confidence intervals, and value labels reflect point estimates.
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_renter-risk_bldg_size.png")

# Calculate various stats about renter households for the same more and less
# vulnerable groups as above.

more_less_vul_stats <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    !is.na(hh_any_risk), # keep only households with at least one wage earner
    is_renter # keep only renter households
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  group_by(hh_any_risk) %>% 
  summarise(
    hh_gross_rent_med = survey_median(gross_rent_nom, vartype = "ci", level = 0.90),
    hh_gross_rent = survey_quantile(gross_rent_nom, 0.75, vartype = "ci", level = 0.90),
    hh_rent_burdened_pct = survey_mean(is_rent_burdened, vartype = "ci", level = 0.90),
    hh_rent_burdened_mod_pct = survey_mean(is_rent_burdened_mod, vartype = "ci", level = 0.90),
    hh_rent_burdened_sev_pct = survey_mean(is_rent_burdened_sev, vartype = "ci", level = 0.90)
  ) %>% 
  ungroup() %>% 
  pivot_longer(-hh_any_risk) %>% 
  mutate(
    type = case_when(
      str_detect(name, "_low") ~ "low",
      str_detect(name, "_upp") ~ "upp",
      TRUE ~ "est"
    ),
    name = str_remove(name, "(_low|_upp)"),
    hh_any_risk = recode(as.character(hh_any_risk), "TRUE" = "More vulnerable", "FALSE" = "Less vulnerable")
  ) %>% 
  pivot_wider(names_from = type, values_from = value)

p <- more_less_vul_stats %>% 
  filter(str_detect(name, "gross_rent")) %>% 
  mutate(
    name = name %>% 
      recode("hh_gross_rent_med" = "Median", "hh_gross_rent_q75" = "75th Percentile") %>% 
      ordered(levels = c("Median", "75th Percentile"))
  ) %>% 
  fc_col_plot_cluster(
    x = name,
    y = est,
    fill = hh_any_risk,
    y_limits = c(0, 3000),
    ymin = low,
    ymax = upp,
    y_format = "dollar"
  ) +
  scale_fill_manual(values = c("#2c7fb8", "#98e2c9")) +
  labs(
    title = str_glue(
      "Monthly gross rent, by economic vulnerability"
    ),
    subtitle = "New York City, 2018",
    x = NULL,
    y = "Monthly gross rent",
    fill = NULL,
    caption = str_glue(
      "Notes: Only renter households with at least one emloyed member are included. Households are considered  more vulnerable if at least one member works 
      in a vulnerable occupation and households without any members in such occupations are considered less vulnerable. 
      Error bars represent 90% confidence intervals, and value labels reflect point estimates.
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_renter-risk_gross-rent.png")

p <- more_less_vul_stats %>% 
  filter(str_detect(name, "burden")) %>% 
  mutate(
    name = name %>% 
      recode("hh_rent_burdened_pct" = "Rent burdened (>30%)", 
             "hh_rent_burdened_mod_pct" = "Moderately rent burdened (30%-50%)", 
             "hh_rent_burdened_sev_pct" = "Severely rent burdened (>50%)") %>% 
      ordered(levels = c("Rent burdened (>30%)", "Moderately rent burdened (30%-50%)", "Severely rent burdened (>50%)"))
  ) %>% 
  fc_col_plot_cluster(
    x = name,
    y = est,
    fill = hh_any_risk,
    y_limits = c(0, 0.6),
    ymin = low,
    ymax = upp,
    y_format = "percent"
  ) +
  scale_fill_manual(values = c("#2c7fb8", "#98e2c9")) +
  labs(
    title = str_glue(
      "Share of renter households that are rent burdened, by economic vulnerability"
    ),
    subtitle = "New York City, 2018",
    x = NULL,
    y = "Share of renter households",
    fill = NULL,
    caption = str_glue(
      "Notes: Only renter households with at least one emloyed member are included. Households are considered  more vulnerable if at least one member works 
      in a vulnerable occupation and households without any members in such occupations are considered less vulnerable. 
      Error bars represent 90% confidence intervals, and value labels reflect point estimates.
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_renter-risk_rent-burden.png")

Maps

To map the results of this analysis at the neighborhood level, the first step is to prepare the geomoetries for Public Use Microdata Areas (PUMAs). For New York City the city has a dataset of PUMAs already nicely clipped to the shoreline, for here we’ll be using those and joining on the neighborhood names.

# Get NYC PUMAs from NYC Open Data, and attach local neighborhood names

nyc_puma_names <- read_csv("data/nyc_puma_names.csv")

nyc_pumas_url <- "https://data.cityofnewyork.us/api/geospatial/cwiz-gcty?method=export&format=GeoJSON"
nyc_pumas <- read_sf(nyc_pumas_url) %>% 
  transmute(puma = as.numeric(puma)) %>% 
  st_transform(2263) %>% # local state plane projection for NYC
  st_simplify(dTolerance = 100) # reduce the detail of the polygons

Elsewhere in the country the best option for getting the geomoetries is to use the tigris R package to download Tiger/Line shapefiles from the US Census Bureau.

# {NOT RUN}
# Example of getting PUMA geometries using tirgis package. 
library(tigris)

# Easiest option will be to get all PUMAs in the state, then innner join to the NYC IPUMS data
ny_pumas <- tigris::pumas("NY", class = "sf") %>% 
  transmute(
    statefip = STATEFP10,
    puma = as.numeric(PUMACE10),
    puma_name = NAMELSAD10
  )

# Set some of the options once here to recude repetition below
survey_mean_ci90 <- purrr::partial(survey_mean, vartype = "ci", level = 0.90)

ipums_puma <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    # is_renter, # keep only renter households
    !is.na(hh_any_risk) # remove households with no wage earners
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  # In NYC pumas nest within county, so helpful to keep county/borough IDs
  group_by(countyfip, puma) %>% 
  summarise(
    hh_any_risk_renter_pct = survey_mean_ci90(hh_any_risk*na_if(is_renter, FALSE), na.rm = TRUE),
    hh_all_risk_renter_pct = survey_mean_ci90(hh_any_risk*na_if(is_renter, FALSE), na.rm = TRUE),
    hh_any_risk_pct = survey_mean_ci90(hh_any_risk),
    hh_all_risk_pct = survey_mean_ci90(hh_any_risk),
  ) %>% 
  ungroup() %>% 
  mutate(
    hh_any_risk_pct_moe = hh_any_risk_pct_upp - hh_any_risk_pct,
    hh_anll_risk_pct_moe = hh_all_risk_pct_upp - hh_all_risk_pct,
    hh_any_risk_renter_pct_moe = hh_any_risk_renter_pct_upp - hh_any_risk_renter_pct,
    hh_anll_risk_renter_pct_moe = hh_all_risk_renter_pct_upp - hh_all_risk_renter_pct,
    # Suppress the most unreliable observations, then bin for map
    # See notes below about using reliability claculator to decide on these bins
    hh_any_risk_renter_pct_grp = ifelse(hh_any_risk_renter_pct_moe > 0.07, NA_real_, hh_any_risk_renter_pct) %>% 
      cut(
        breaks = c(-Inf, 0.4, 0.6, Inf),
        labels = c("< 40%", "40% - 60%", ">= 60%"),
        include.lowest = TRUE,
        ordered_result = TRUE
      ) %>% 
        fct_explicit_na("Insufficient Data")
  ) %>% 
  select(-matches(".*_(upp|low)$")) %>% 
  left_join(nyc_pumas, by = "puma") %>% 
  left_join(nyc_puma_names, by = "puma") %>% 
  st_as_sf() # dataframe takes class of left table, so change back to spatial type

p <- ggplot(ipums_puma) +
  aes(fill = hh_any_risk_renter_pct) +
  geom_sf(size = 0.1, color = "white") +
  scale_fill_viridis_c(labels = scales::percent_format(1), option = "inferno") +
  theme_fc_map() +
  theme( # some NYC specific tweaks
    legend.position = c(0, .85),
    legend.direction = "vertical",
    legend.margin = margin(0, 0, 0, 0)
  ) +
  labs(
    title = str_glue(
      "Share of renter households with at least one member 
      employed in a more vulnerable occupation"
    ),
    subtitle = "New York City, Sub-Borough Areas (PUMAs), 2018",
    fill = "Share of renter households",
    caption = str_glue(
      "Notes: The denominator includes only renter households with at least one wage earner
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any-renter_share-map.png", height = 7, width = 7)

The above map doesn’t take in account the reliability of the estimates. Below the Sub Borough Areas (PUMAs) are categorized into three bins and pumas with the most unreliable estimates (90% MOE > 7 percentage points) are suppressed. These categories were selected with the help of the R package mapreliability and the interactive tool reliability_calculator() to ensure that there is an acceptable level of potential missclassification as a result of sampling error. With this classification, and the most unreliable estimate excluded, there is less than a 10% that chance any given geography in this map is misclassified due to sampling error, and for each individual category, there is less than a 20% chance that any given geography is misclassified due to sampling error.

library(mapreliability) # remotes::install_github("austensen/mapreliability")

ipums_puma %>% 
  filter(hh_any_risk_renter_pct_moe <= 0.07) %>% 
  reliability_table_custom(hh_any_risk_renter_pct, hh_any_risk_renter_pct_moe, c(0, 0.4, 0.6))

## # A tibble: 3 x 5
##   class_breaks count tot_count reliability tot_reliability
##          <dbl> <int>     <int>       <dbl>           <dbl>
## 1          0      13        40       11.6             9.61
## 2          0.4    18        40        7.51            9.61
## 3          0.6     9        40       10.9             9.61

p <- ggplot(ipums_puma) +
  aes(fill = hh_any_risk_renter_pct_grp) +
  geom_sf(size = 0.1, color = "white") +
  scale_fill_manual(
    values = c(viridisLite::inferno(3, begin = 0.1, end = 0.9), "grey60"), 
    guide = guide_legend(reverse = TRUE)
  ) +
  # scale_fill_viridis_d(guide = guide_legend(reverse = TRUE)) +
  theme_fc_map() +
  theme( # some NYC specific tweaks
    legend.position = c(0, .85),
    legend.direction = "vertical",
    legend.margin = margin(0, 0, 0, 0)
  ) +
  labs(
    title = str_glue(
      "Share of renter households with at least one member 
      employed in a more vulnerable occupation"
    ),
    subtitle = "New York City, Sub-Borough Areas (PUMAs), 2018",
    fill = "Share of renter households",
    caption = str_glue(
      "Notes: The denominator includes only renter households with at least one wage earner. 
      Areas with margin of error (90%) greater than 7 percentage points are not shown. 
      There is less than a 10% that chance any given geography in this map is misclassified 
      due to sampling error, and for each individual category, there is less than a 20% chance 
      that any given geography is misclassified due to sampling error.
      Sources: American Community Survey (2018), IPUMS USA, NYU Furman Center"
    )
  )

plot_save_include("img/nyc_occ-risk-any-renter_share-grp-map.png", height = 7, width = 7)

Tables

# Total number of households
hh_total <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  summarise(
    households = survey_total(1, vartype = "ci", level = 0.90)
  )

hh_total %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
households	3184458
households_low	3161148
households_upp	3207768

# Total number of wage earners
wage_earners_total <- ipums_clean %>% 
  filter(inc_wages >=0) %>% 
  as_survey_design(weights = perwt) %>% 
  summarise(
    individuals = survey_total(1, vartype = "ci", level = 0.90)
  )

wage_earners_total %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
individuals	4085333
individuals_low	4058659
individuals_upp	4112007

# Population living in households with at least one wage earner in more vulnerable occupation
pop_hh_risk_any <- ipums_clean %>% 
  filter(hh_any_risk) %>% 
  as_survey_design(weights = perwt) %>% 
  summarise(
    individuals = survey_total(1, vartype = "ci", level = 0.90)
  )

pop_hh_risk_any %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
individuals	3496121
individuals_low	3471727
individuals_upp	3520515

# Various stats for households with at least one wage earner in vulnerable occupation
hh_any_risk_stats <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk # keep only households with at least one wage earner in vulnerable occupation
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  summarise(
    households = survey_total(1, vartype = "ci", level = 0.90),
    hh_inc_nom_med = survey_median(hh_inc_nom, na.rm = TRUE),
    hh_risk_wages_med = survey_median(hh_risk_wages, na.rm = TRUE),
    hh_risk_wages_pct_med = survey_median(hh_risk_wages_pct, na.rm = TRUE),
    gross_rent_nom_med = survey_median(gross_rent_nom, vartype = "ci", na.rm = TRUE)
  )

hh_any_risk_stats %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
households	1031873.0000000
households_low	1018905.6246131
households_upp	1044840.3753869
hh_inc_nom_med	66589.2401216
hh_inc_nom_med_se	1021.3159191
hh_risk_wages_med	33292.9347826
hh_risk_wages_med_se	765.2109373
hh_risk_wages_pct_med	0.6511628
hh_risk_wages_pct_med_se	0.0095706
gross_rent_nom_med	1480.0000000
gross_rent_nom_med_low	1450.0000000
gross_rent_nom_med_upp	1500.0000000

# Various stats for households with all wage earners in vulnerable occupations
hh_all_risk_stats <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_all_risk # keep only households with all wage earners in vulnerable occupations
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  summarise(
    households = survey_total(1, vartype = "ci", level = 0.90),
    hh_inc_nom_med = survey_median(hh_inc_nom, na.rm = TRUE),
    hh_risk_wages_med = survey_median(hh_risk_wages, na.rm = TRUE),
    hh_risk_wages_pct_med = survey_median(hh_risk_wages_pct, na.rm = TRUE),
    gross_rent_nom_med = survey_median(gross_rent_nom, vartype = "ci", na.rm = TRUE)
  )

hh_all_risk_stats %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
households	548652.0000
households_low	538877.9518
households_upp	558426.0482
hh_inc_nom_med	41200.0000
hh_inc_nom_med_se	969.1174
hh_risk_wages_med	33757.1429
hh_risk_wages_med_se	765.0927
hh_risk_wages_pct_med	1.0000
hh_risk_wages_pct_med_se	0.0000
gross_rent_nom_med	1350.0000
gross_rent_nom_med_low	1310.0000
gross_rent_nom_med_upp	1387.1656

# Various stats for households making less than $150k and with at least one wage
# earner in a more vulnerable occupation
hh_any_risk_lt150k <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_any_risk,
    hh_inc_nom <150000
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  summarise(
    households = survey_total(1, vartype = "ci", level = 0.90),
    hh_inc_nom_med = survey_median(hh_inc_nom, na.rm = TRUE),
    hh_risk_wages_med = survey_median(hh_risk_wages, na.rm = TRUE),
    hh_risk_wages_pct_med = survey_median(hh_risk_wages_pct, na.rm = TRUE),
    gross_rent_nom_med = survey_median(gross_rent_nom, na.rm = TRUE)
  )

hh_any_risk_lt150k %>% 
  pivot_longer(everything()) %>% 
  kable()

name	value
households	872156.0000000
households_low	860239.8717704
households_upp	884072.1282296
hh_inc_nom_med	55000.0000000
hh_inc_nom_med_se	850.8103283
hh_risk_wages_med	30000.0000000
hh_risk_wages_med_se	0.0000000
hh_risk_wages_pct_med	0.7319793
hh_risk_wages_pct_med_se	0.0140094
gross_rent_nom_med	1430.0000000
gross_rent_nom_med_se	15.3025870

# Various stats for households making less than $150k and with all wage earners
# in a more vulnerable occupation
hh_all_risk_lt150k <- ipums_clean %>% 
  filter(
    pernum == 1, # keep only one row per household
    hh_all_risk,
    hh_inc_nom <150000
  ) %>% 
  as_survey_design(weights = hhwt) %>% 
  summarise(
    households = survey_total(1, vartype = "ci", level = 0.90),
    hh_inc_nom_med = survey_median(hh_inc_nom, na.rm = TRUE),
    hh_risk_wages_med = survey_median(hh_risk_wages, na.rm = TRUE),
    hh_risk_wages_pct_med = survey_median(hh_risk_wages_pct, na.rm = TRUE),
    gross_rent_nom_med = survey_median(gross_rent_nom, na.rm = TRUE)
  )

hh_all_risk_lt150k %>% 
  pivot_longer(everything()) %>% 
  kable()