Getting Started

We’re going to dig deeper into the correlates of mobility by using Opportunity Atlas to create some plots. We’re going to make a scatter plot!

Visit Opportunity Atlas and search for the place you chose for your pre-class activity. If you no longer like this place, you can pick a new one.
Zoom in or out, as necessary, so that you get a “reasonable” geographic area that contains 30-50 census tracts (not a hard and fast rule, but each tract will be one data point).
Click on “Download the data” at the bottom right-hand corner.

You want to download “Just what is on the screen.” If this option is not available, make sure you select “Single Subgroup.”
You can pick “all” or select a specific subgroup, but I wouldn’t recommend any group that leads to a lot of grey tracts.
I recommend you stick with “household income” as your outcome variable, but feel free to select a different outcome variable.

Get all your options set up and download the data set.
Pick a second variable you want to correlate with your first variable, and download that in the same way. I recommend “Job growth rate from 2004 to 2013” or “Density of Jobs in 2013.” You want the first to be an “outcome” and the second to be a “neighborhood characteristic.”

In any case, I’d recommend you pick the same variables as your groupmates, so when you compare graphs it makes sense!

Loading data

Make sure your data files are in the right working directory. You may need to adjust the names.

# Import outcome 1 and save as a df

income <- read_csv("shown_tract_kfr_rP_gP_pall.csv")

# Import outcome 2 and save as a df

jobdensity <- read_csv("shown_tract_job_density_2013.csv")

Take a look at your data (change names if you renamed your data frames!)

income

## # A tibble: 192 x 3
##          tract Name                                   Household_Income_rP_gP_pa…
##          <dbl> <chr>                                                       <dbl>
##  1 26125151000 Bloomfield Village, Bloomfield Hills,…                      83217
##  2 26125152700 Birmingham, MI                                              80100
##  3 26125152000 Bloomfield Hills, MI                                        79472
##  4 26125196600 Troy, MI                                                    77675
##  5 26125196800 Troy, MI                                                    76962
##  6 26125152600 Birmingham, MI                                              76088
##  7 26125137400 Novi, MI                                                    75867
##  8 26125188000 Huntington Woods, MI                                        75791
##  9 26125167900 Farmington Hills, MI                                        75691
## 10 26125196700 Troy, MI                                                    75441
## # … with 182 more rows

jobdensity

## # A tibble: 192 x 3
##          tract Name                                   Density_of_Jobs_in_2013
##          <dbl> <chr>                                                    <dbl>
##  1 26125183500 Royal Oak, MI                                            25093
##  2 26125161600 Southfield City Centre, Southfield, MI                   19750
##  3 26125162200 Southfield, MI                                           14799
##  4 26125197600 Troy, MI                                                 13947
##  5 26125162100 Southfield, MI                                           11899
##  6 26125160900 Southfield, MI                                           10492
##  7 26125160300 Southfield, MI                                            8745
##  8 26125153200 Birmingham, MI                                            8652
##  9 26163556500 Livonia, MI                                               7584
## 10 26125197701 Troy, MI                                                  7236
## # … with 182 more rows

Now, we’re going to use join_by() to merge our two data sets. Note that the by part of the function is optional. If you delete it, it will join based on all variables with the same names across the two data sets - in this case, just tract and Name.

merged_data <- full_join(by = c("tract", "Name"), income, jobdensity)
merged_data

## # A tibble: 192 x 4
##         tract Name                     Household_Income_rP_… Density_of_Jobs_in…
##         <dbl> <chr>                                    <dbl>               <dbl>
##  1    2.61e10 Bloomfield Village, Blo…                 83217               476. 
##  2    2.61e10 Birmingham, MI                           80100               386. 
##  3    2.61e10 Bloomfield Hills, MI                     79472              1160  
##  4    2.61e10 Troy, MI                                 77675               457. 
##  5    2.61e10 Troy, MI                                 76962              3838  
##  6    2.61e10 Birmingham, MI                           76088              1316  
##  7    2.61e10 Novi, MI                                 75867                59.4
##  8    2.61e10 Huntington Woods, MI                     75791               671. 
##  9    2.61e10 Farmington Hills, MI                     75691                82  
## 10    2.61e10 Troy, MI                                 75441              1296  
## # … with 182 more rows

Exploratory analysis

A simple way to look at the relationship between two variables is to calculate a correlation coefficent. Here, I do it using Base R. There are other ways to do this as well. But, this gives us one number, and not a ton of information.

(Note that you’ll need to change the variable names to match yours!)

cor.test(merged_data$Household_Income_rP_gP_pall, merged_data$Density_of_Jobs_in_2013,method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$Household_Income_rP_gP_pall and merged_data$Density_of_Jobs_in_2013
## t = -2.4201, df = 190, p-value = 0.01646
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.30701848 -0.03210642
## sample estimates:
##        cor 
## -0.1729285

Let’s instead make a scatter plot:

  ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013, color=Name )) + 
      geom_point(show.legend=FALSE)

You may want to play around with the scale if you get a lot of bunching. For density of jobs, for example, it looks a lot better if I take a log.

  ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013, color=Name )) + 
      geom_point(show.legend=FALSE) + 
  scale_x_log10()

What if you want to estimate a line of best fit? This is regression land, but we can give it a go. Notice here that I moved color=Name out of the main ggplot and into geom_point only. The idea is that I want to color my points based on city, but I don’t want separate regression lines per city. You can play w/ moving arguments around to see how it does or does not make a difference.

  ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013,  )) + 
      geom_point(aes(color=Name), show.legend=FALSE) + 
    geom_smooth(method=lm, se =FALSE, show.legend=FALSE) + 
    scale_x_log10()

## `geom_smooth()` using formula 'y ~ x'

All done!

Share w/ your group mates. You can paste in your chat or in your shared Google doc.
Knit and upload. If you’ve got some time, format that graph.

10-02 in-class activity

Emily Beam

04-08-2021

Getting Started

Loading data

Exploratory analysis

All done!