/smartest_places_in_the_us

A reanalysis of the smartest Census places in the US by state.

Primary LanguageJupyter Notebook

The Most Educated Places in the U.S.

Earlier this week I came across an article on the most educated places in the US. I thought it would be cool to try to recreate their findings and expand upon by taking into account the uncertainty of the survey results.

Let's get started.

Setup

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import us

Read data

We need two variables for each place in the US:

  • Percent of population over 25 with a bachelor's degree or higher (called educational attainment from now on)
  • Total population

We'll also grab the margin of error for these variables. We'll use them later.

def state_and_place(df):
    """Extracts state and place from the original identification variable."""
    columns = (
        df
        .id
        .astype(str)
        .str.extract('(?P<state_fips>\d{1,2})(?P<place_fips>\d{5})', expand=True)
        .astype(int)
    )
    return df.join(columns)
def read_acs(year):
    """Reads and formats ACS educational attainment data for a given year."""
    filename = '../data/ACS_{year}_5YR_S1501/ACS_{year}_5YR_S1501_with_ann.csv'.format(year=str(year)[-2:])
    
    na_values = ('**', '-', '+', '***', '*****', 'N', '(X)')

    variables = {
        'Id2': 'id',
        'Geography': 'place',
        "Total; Estimate; Percent bachelor's degree or higher": 'pct_bachelor_plus',
        "Total; Margin of Error; Percent bachelor's degree or higher": 'pct_bachelor_plus_moe'
    }
    
    return (
        pd
        .read_csv(
            filename,
            usecols=variables.keys(),
            na_values=na_values,
            # https://www.census.gov/geo/maps-data/data/tiger/char_encoding.html
            encoding='ISO-8859-1',
            skiprows=1
        )
        .rename(columns=variables)
        .pipe(state_and_place)
        .drop('id', axis=1)
        .set_index(['state_fips', 'place_fips'])
    )
def read_population(year):
    """Collects and formats population data from the Census API for a given year."""
    url = 'http://api.census.gov/data/{}/acs5?get=B01003_001E,B01003_001M&for=place:*'.format(year)
    r = requests.get(url)
    data = r.json()
    
    columns = {
        'B01003_001E': 'population',
        'B01003_001M': 'population_moe',
        'state': 'state_fips',
        'place': 'place_fips'
    }
    
    return (
        pd.DataFrame(data=data[1:], columns=data[0])
        .rename(columns=columns)
        .apply(lambda s: s.astype(int))
        .set_index(['state_fips', 'place_fips'])
    )
def state(df):
    """Converts state FIPS codes to state names."""
    data = (
        df
        .index
        .get_level_values('state_fips')
        .astype(str)
        .str.rjust(2, fillchar='0')
    )
    s = pd.Series(data=data, index=df.index)
    
    return s.apply(lambda x: us.states.lookup(x).name)
def read_data(year):
    """Reads and formats desired ACS data for a given year."""
    acs = read_acs(year)
    population = read_population(year)
    
    return (
        acs
        .join(population)
        .assign(state=state)
        .set_index(['state', 'place'])
        .sort_index(axis=1)
    )
df = read_data(2012)

Let's check out the dataset

Whenever I work with a dataset for the first time I like to do a few things:

  • Check out the shape/composition.
  • Look at the top few rows.
  • Look at the bottom two rows.
  • Get a sense of how things are distributed.
# Composition
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 29256 entries, (Alabama, Abanda CDP, Alabama) to (Wyoming, Y-O Ranch CDP, Wyoming)
Data columns (total 4 columns):
pct_bachelor_plus        29029 non-null float64
pct_bachelor_plus_moe    29029 non-null float64
population               29256 non-null int32
population_moe           29256 non-null int32
dtypes: float64(2), int32(2)
memory usage: 914.2+ KB
# Top
df.head()
pct_bachelor_plus pct_bachelor_plus_moe population population_moe
state place
Alabama Abanda CDP, Alabama 0.0 45.8 42 31
Abbeville city, Alabama 16.9 5.7 2699 18
Adamsville city, Alabama 12.3 3.9 4537 26
Addison town, Alabama 17.8 10.5 826 284
Akron town, Alabama 7.9 8.7 369 141
# Bottom
df.tail()
pct_bachelor_plus pct_bachelor_plus_moe population population_moe
state place
Wyoming Woods Landing-Jelm CDP, Wyoming 40.6 17.7 91 102
Worland city, Wyoming 20.4 3.9 5426 18
Wright town, Wyoming 14.3 4.3 2216 382
Yoder town, Wyoming 15.9 15.0 149 104
Y-O Ranch CDP, Wyoming 0.0 16.1 223 109
# Summary
df.dropna().describe()
pct_bachelor_plus pct_bachelor_plus_moe population population_moe
count 29029.000000 29029.000000 2.902900e+04 29029.000000
mean 20.028689 9.109429 7.903848e+03 190.236625
std 16.372341 11.247205 6.613706e+04 221.437532
min 0.000000 0.200000 1.000000e+00 0.000000
25% 9.100000 4.000000 3.460000e+02 48.000000
50% 15.700000 5.800000 1.106000e+03 123.000000
75% 26.000000 9.200000 4.100000e+03 245.000000
max 100.000000 100.000000 8.199221e+06 3536.000000

Population vs. Attainment

We can see right away that we're dealing with some pretty heavily skewed data. Most places are teeny-tiny, while some are massive (I see you New York). Most also tend to have educational attainment rates (i.e., percent of the population over 25 with bachelor's degrees) around 10-20%, though some go all the way to 100%! More on that later.

sns.jointplot('population', 'pct_bachelor_plus', data=df);

png

Recreating the original metric

To make sure I'm starting off right, I wanted to try to recreate the original ranking. Fortunately the author was explicit with his methodology, so recreating was a breeze.

colleges = (
    'Stanford CDP, California',
    'University at Buffalo CDP, New York',
    'University of Pittsburgh Johnstown CDP, Pennsylvania',
    'Mississippi State CDP, Mississippi',
    'St. Vincent College CDP, Pennsylvania',
    'University CDP, Mississippi'
)
def highest_attainment_orig(df, filter_=True):
    """Gets the place with the higehst educational attainment in each state."""
    if filter_:
        # Only include places with at least 1,000 people, excluding colleges
        subset = df.loc[(df.population > 1000) & (df.index.map(lambda x: x[1] not in colleges))]
    else:
        subset = df
    # Find the place with the higehst attainment by state
    idx = (
        subset
        .groupby(level='state')
        .pct_bachelor_plus
        .transform('max')
        .eq(subset.pct_bachelor_plus)
    )
    
    return subset.loc[idx].reset_index('state', drop=True)
(
    df
    .pipe(highest_attainment_orig)
    .pct_bachelor_plus
    .sort_values(ascending=False)
)
place
Chevy Chase town, Maryland                 92.7
Ladera CDP, California                     89.3
Winnetka village, Illinois                 88.4
Mission Hills city, Kansas                 88.1
Short Hills CDP, New Jersey                87.1
Belle Meade city, Tennessee                86.4
Lost Creek CDP, Texas                      86.3
Scarsdale village, New York                84.9
Castleton CDP, Vermont                     84.2
Shorewood Hills village, Wisconsin         83.6
Yarrow Point town, Washington              83.1
Old Greenwich CDP, Connecticut             83.0
Hanover CDP, New Hampshire                 83.0
Mountain Brook city, Alabama               82.9
Ladue city, Missouri                       82.6
Druid Hills CDP, Georgia                   81.8
The Village of Indian Hill city, Ohio      81.2
McLean CDP, Virginia                       81.2
Edgeworth borough, Pennsylvania            81.2
Wellesley CDP, Massachusetts               80.8
Pine Brook Hill CDP, Colorado              80.5
Kiawah Island town, South Carolina         80.5
Meridian Hills town, Indiana               79.5
Indian Hills city, Kentucky                79.5
University Heights city, Iowa              78.8
Moose Wilson Road CDP, Wyoming             78.3
East Grand Rapids city, Michigan           77.5
Nichols Hills city, Oklahoma               76.9
Fearrington Village CDP, North Carolina    76.0
Key Biscayne village, Florida              76.0
Sandia Heights CDP, New Mexico             73.6
North Oaks city, Minnesota                 73.3
Emigration Canyon CDP, Utah                71.8
Paradise Valley town, Arizona              71.6
Rehoboth Beach city, Delaware              68.9
Kingston CDP, Rhode Island                 67.7
Cumberland Center CDP, Maine               67.6
Lake Oswego city, Oregon                   66.5
Oak Hills Place CDP, Louisiana             66.4
Hidden Spring CDP, Idaho                   66.2
Madison city, Mississippi                  63.1
Maunawili CDP, Hawaii                      62.3
Goldstream CDP, Alaska                     57.7
Dakota Dunes CDP, South Dakota             56.3
Big Sky CDP, Montana                       54.3
Shepherdstown town, West Virginia          51.4
Goshen town, Arkansas                      51.3
Washington city, District of Columbia      51.2
Incline Village CDP, Nevada                47.6
Wayne city, Nebraska                       44.4
Fargo city, North Dakota                   39.0
Name: pct_bachelor_plus, dtype: float64

Here's the original list, for comparison: top_places

The Problem: Uncertainty

One thing I was curious about was why the author decided to include only places with more than 1,000 people. What does this list look like if we include all places?

(
    df
    .pipe(highest_attainment_orig, filter_=False)
    .sort_values('pct_bachelor_plus', ascending=False)
)
pct_bachelor_plus pct_bachelor_plus_moe population population_moe
place
Lakeview North CDP, Wyoming 100.0 85.6 12 19
Gloster CDP, Louisiana 100.0 68.8 20 31
Allen CDP, Maryland 100.0 24.6 168 170
Zion CDP, Oklahoma 100.0 100.0 2 4
Ocean CDP, Maryland 100.0 100.0 7 10
Seconsett Island CDP, Massachusetts 100.0 65.6 22 26
Prairie City CDP, South Dakota 100.0 100.0 2 3
Bridger CDP, Montana 100.0 52.0 53 75
Turtle Lake CDP, Montana 100.0 51.0 70 111
Mendeltna CDP, Alaska 100.0 71.2 19 29
Allamuchy CDP, New Jersey 100.0 37.3 105 137
Blawenburg CDP, New Jersey 100.0 9.3 515 209
Port Colden CDP, New Jersey 100.0 45.9 45 71
Zarephath CDP, New Jersey 100.0 62.8 72 106
Cobre CDP, New Mexico 100.0 64.7 21 33
Lumberton CDP, New Mexico 100.0 74.1 20 41
Manzano CDP, New Mexico 100.0 100.0 7 11
Pinos Altos CDP, New Mexico 100.0 63.2 22 35
Regina CDP, New Mexico 100.0 100.0 5 8
Weed CDP, New Mexico 100.0 48.8 37 61
Dering Harbor village, New York 100.0 100.0 6 9
Paul Smiths CDP, New York 100.0 82.2 600 138
Thousand Island Park CDP, New York 100.0 32.9 77 73
Bellfountain CDP, Oregon 100.0 100.0 15 21
Brush Creek CDP, Oklahoma 100.0 37.5 69 103
Campo Verde CDP, Texas 100.0 92.0 12 18
Little Orleans CDP, Maryland 100.0 82.2 14 20
Lotsee town, Oklahoma 100.0 100.0 3 5
Carmet CDP, California 100.0 85.1 14 19
Oakville CDP, California 100.0 36.4 76 103
... ... ... ... ...
Remington CDP, Ohio 89.7 15.2 199 119
Mission Woods city, Kansas 88.5 7.0 175 36
Winnetka village, Illinois 88.4 2.1 12210 54
Grandfather village, North Carolina 88.2 9.1 85 38
Pippa Passes city, Kentucky 88.2 11.5 605 215
Barton Hills village, Michigan 88.1 5.6 286 62
University of Pittsburgh Johnstown CDP, Pennsylvania 87.5 70.3 1453 254
Belle Meade city, Tennessee 86.4 3.3 3016 105
Castleton CDP, Vermont 84.2 12.6 1167 271
Henderson Point CDP, Mississippi 84.0 24.1 50 39
Shorewood Hills village, Wisconsin 83.6 5.7 1720 81
Hanover CDP, New Hampshire 83.0 4.8 8608 248
Old Greenwich CDP, Connecticut 83.0 3.8 6622 431
Mountain Brook city, Alabama 82.9 2.3 20398 30
Ladue city, Missouri 82.6 2.9 8524 49
Druid Hills CDP, Georgia 81.8 3.0 14321 743
East Amana CDP, Iowa 80.9 43.7 60 52
Kiawah Island town, South Carolina 80.5 5.8 1448 239
Golf village, Florida 80.3 7.4 291 73
Woodland city, Minnesota 79.8 7.2 453 94
Glenbrook CDP, Nevada 78.7 13.6 178 105
Henlopen Acres town, Delaware 78.3 10.9 116 32
Keystone CDP, Nebraska 76.1 29.8 109 124
Hansboro city, North Dakota 75.0 35.4 12 9
Bluff CDP, Utah 74.6 20.2 346 134
Watch Hill CDP, Rhode Island 71.1 13.9 207 91
Cousins Island CDP, Maine 68.4 14.7 615 156
Hidden Spring CDP, Idaho 66.2 10.4 2077 312
Cammack Village city, Arkansas 62.2 7.4 858 186
Washington city, District of Columbia 51.2 0.5 605759 0

76 rows × 4 columns

That's a lot of 100%s. One thing to notice is that the confidence interval for these places are massive. Zion CDP, Oklahoma, for example, has a 90% CI of [0, 100]. The interval includes every value this variable can take! While filtering these small, highly-uncertain places is a good heuristic, there are more robust ways to handle this isuse.

One way is to use the 90% least plausible value, defined as the value such that there is only a 10% chance the true parameter is lower. In other words, the smartest places, according to this procedure, are the places that are most likely to have a high percentage of people with educational attainment. This works well because even if we completely overestimate the educational attainment rates we can be sure the top places are still on top.

The great thing about the American Community Survey (ACS) is that they provide margins of error (MoE) for all of their estimate variables. Best of all, their MoE are for 90% confidence intervals. All we need to do is subtract the MoE from the estimate!

A more conservative approach

def lower_bound(df, column):
    """Gets the 90% confidence interval lower bound for a variable."""
    lower = df[column] - df['{}_moe'.format(column)]
    # Minimum value is 0
    lower.loc[lower < 0] = 0
    
    return lower
def add_lower_bounds(df):
    """Adds the lower bounds for population and educational attainment."""
    return (
        df
        .assign(
            population_lower=lambda df: df.pipe(lower_bound, 'population'),
            pct_bachelor_plus_lower=lambda df: df.pipe(lower_bound, 'pct_bachelor_plus')
        )
    )
df = df.pipe(add_lower_bounds)
sns.jointplot('population', 'pct_bachelor_plus_lower', data=df);

png

You'll notice that the educational attainment distribution is even more skewed than before! This is because so many places are small and therefore have a lot of uncertainty associated with their estimates. These places will be unlikely to show up on our list (as they shouldn't).

def highest_attainment(df):
    """Find the place with the higehst attainment by state, using the lower bound."""
    subset = df.loc[df.index.map(lambda x: x[1] not in colleges)]
    idx = (
        subset
        .groupby(level='state')
        .pct_bachelor_plus_lower
        .transform('max')
        .eq(subset.pct_bachelor_plus_lower)
    )
    
    return (
        subset
        .loc[idx]
        .reset_index('state', drop=True)
    )
def plot_highest(df):
    """Creates an errorbar plot for attainment data."""
    highest = df.sort_values('pct_bachelor_plus_lower')

    plt.figure(figsize=(16, 10))
    plt.errorbar(
        highest.pct_bachelor_plus,
        np.arange(len(highest)),
        xerr=highest.pct_bachelor_plus_moe,
        fmt='o'
    )

    plt.xlim(40, 100)
    plt.ylim(-1, len(highest))
    plt.yticks(np.arange(len(highest)), highest.index)
df.pipe(highest_attainment).pipe(plot_highest)

png

In the plot above we have the point estimate (the dot) and its confidence interval, ordered by the 90% least plausible value (i.e., the lower end of the CI). Compared to the original results, about half of the top places have changed.

How do things look now?

The original data is a few years old. Let's see how things look with 2010-2014 data.

df_14 = read_data(2014)
df_14.pipe(add_lower_bounds).pipe(highest_attainment).pipe(plot_highest)

png

Improvements

I don't think places are a good unit of analysis for this kind of work. Many are subsets of larger cities, which can be misleading.

An alternative would be to use metropolitian/micropolitan statistical areas and see how things change. The issue there is than smaller geographies are underrepresented (or not represented at all).

I'd like to try a hybrid approach: Use MSA where possible, otherwise use place. That allows us to acknowledge agglomeration effects, where applicable. I would need to find a mapping of places to MSAs. If that doesn't exist (quick Google search suggest it doesn't), I'd have to create one using Census TIGER shapefiles, which is easy to do with something like PostGIS or QGIS.

Acknowledgements

Huge thanks goes to Andy Kiersz, the author of the original article, for the inspiration.

Much of the discussion on uncertainty comes from Cameron Davidson-Pilon's book Bayesian Methods for Hackers. Can't recommend it enough.