We’re going to dig deeper into the correlates of mobility by using Opportunity Atlas to create some plots. We’re going to make a scatter plot!
Visit Opportunity Atlas and search for the place you chose for your pre-class activity. If you no longer like this place, you can pick a new one.
Zoom in or out, as necessary, so that you get a “reasonable” geographic area that contains 30-50 census tracts (not a hard and fast rule, but each tract will be one data point).
Click on “Download the data” at the bottom right-hand corner.
Get all your options set up and download the data set.
Pick a second variable you want to correlate with your first variable, and download that in the same way. I recommend “Job growth rate from 2004 to 2013” or “Density of Jobs in 2013.” You want the first to be an “outcome” and the second to be a “neighborhood characteristic.”
In any case, I’d recommend you pick the same variables as your groupmates, so when you compare graphs it makes sense!
Make sure your data files are in the right working directory. You may need to adjust the names.
# Import outcome 1 and save as a df
income <- read_csv("shown_tract_kfr_rP_gP_pall.csv")
# Import outcome 2 and save as a df
jobdensity <- read_csv("shown_tract_job_density_2013.csv")
Take a look at your data (change names if you renamed your data frames!)
income
## # A tibble: 192 x 3
## tract Name Household_Income_rP_gP_pa…
## <dbl> <chr> <dbl>
## 1 26125151000 Bloomfield Village, Bloomfield Hills,… 83217
## 2 26125152700 Birmingham, MI 80100
## 3 26125152000 Bloomfield Hills, MI 79472
## 4 26125196600 Troy, MI 77675
## 5 26125196800 Troy, MI 76962
## 6 26125152600 Birmingham, MI 76088
## 7 26125137400 Novi, MI 75867
## 8 26125188000 Huntington Woods, MI 75791
## 9 26125167900 Farmington Hills, MI 75691
## 10 26125196700 Troy, MI 75441
## # … with 182 more rows
jobdensity
## # A tibble: 192 x 3
## tract Name Density_of_Jobs_in_2013
## <dbl> <chr> <dbl>
## 1 26125183500 Royal Oak, MI 25093
## 2 26125161600 Southfield City Centre, Southfield, MI 19750
## 3 26125162200 Southfield, MI 14799
## 4 26125197600 Troy, MI 13947
## 5 26125162100 Southfield, MI 11899
## 6 26125160900 Southfield, MI 10492
## 7 26125160300 Southfield, MI 8745
## 8 26125153200 Birmingham, MI 8652
## 9 26163556500 Livonia, MI 7584
## 10 26125197701 Troy, MI 7236
## # … with 182 more rows
Now, we’re going to use join_by()
to merge our two data sets. Note that the by
part of the function is optional. If you delete it, it will join based on all variables with the same names across the two data sets - in this case, just tract
and Name
.
merged_data <- full_join(by = c("tract", "Name"), income, jobdensity)
merged_data
## # A tibble: 192 x 4
## tract Name Household_Income_rP_… Density_of_Jobs_in…
## <dbl> <chr> <dbl> <dbl>
## 1 2.61e10 Bloomfield Village, Blo… 83217 476.
## 2 2.61e10 Birmingham, MI 80100 386.
## 3 2.61e10 Bloomfield Hills, MI 79472 1160
## 4 2.61e10 Troy, MI 77675 457.
## 5 2.61e10 Troy, MI 76962 3838
## 6 2.61e10 Birmingham, MI 76088 1316
## 7 2.61e10 Novi, MI 75867 59.4
## 8 2.61e10 Huntington Woods, MI 75791 671.
## 9 2.61e10 Farmington Hills, MI 75691 82
## 10 2.61e10 Troy, MI 75441 1296
## # … with 182 more rows
A simple way to look at the relationship between two variables is to calculate a correlation coefficent. Here, I do it using Base R. There are other ways to do this as well. But, this gives us one number, and not a ton of information.
(Note that you’ll need to change the variable names to match yours!)
cor.test(merged_data$Household_Income_rP_gP_pall, merged_data$Density_of_Jobs_in_2013,method="pearson")
##
## Pearson's product-moment correlation
##
## data: merged_data$Household_Income_rP_gP_pall and merged_data$Density_of_Jobs_in_2013
## t = -2.4201, df = 190, p-value = 0.01646
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.30701848 -0.03210642
## sample estimates:
## cor
## -0.1729285
Let’s instead make a scatter plot:
ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013, color=Name )) +
geom_point(show.legend=FALSE)
You may want to play around with the scale if you get a lot of bunching. For density of jobs, for example, it looks a lot better if I take a log.
ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013, color=Name )) +
geom_point(show.legend=FALSE) +
scale_x_log10()
What if you want to estimate a line of best fit? This is regression land, but we can give it a go. Notice here that I moved color=Name
out of the main ggplot and into geom_point
only. The idea is that I want to color my points based on city, but I don’t want separate regression lines per city. You can play w/ moving arguments around to see how it does or does not make a difference.
ggplot(data = merged_data, mapping = aes(y = Household_Income_rP_gP_pall, x = Density_of_Jobs_in_2013, )) +
geom_point(aes(color=Name), show.legend=FALSE) +
geom_smooth(method=lm, se =FALSE, show.legend=FALSE) +
scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'
Share w/ your group mates. You can paste in your chat or in your shared Google doc.
Knit and upload. If you’ve got some time, format that graph.