Density and histogram examples

I’ve put together a few (rough, unformatted) examples to give you an idea of how we can examine variable distributions. Enjoy!

Load the atlas data and related packages, then create a data frame, michigan, that is restricted to, well, Michigan (state FIPS code 26).

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(haven)


atlas <- read_dta("atlas.dta")


michigan <- filter(atlas,state == 26)

Let’s make a histogram with two variables. Notice that when everything is in the same data frame, you can just layer on the plots! You do, however, need to set the aesthetics within each layer.

ggplot(michigan) + 
    geom_histogram(aes(kfr_black_p25,after_stat(count),na.rm=TRUE)) + 
    geom_histogram(aes(kfr_black_p75,after_stat(count),na.rm=TRUE)) + xlim(0,100000)

## Warning: Ignoring unknown aesthetics: na.rm

## Warning: Ignoring unknown aesthetics: na.rm

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1670 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1670 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

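Yikes! That doesn’t look good: both histograms are drawn in the same gray with the default 30 bins, and na.rm is ignored because it sits inside aes(). Let’s try harder: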

# Strings inside aes() become legend labels, not colors, so label the layers by variable.
# alpha and na.rm are layer arguments, so they go outside aes().
ggplot(michigan) + 
    geom_histogram(aes(kfr_black_p25,after_stat(count),fill = "p25"),alpha = 0.2,binwidth=5000,na.rm=TRUE) + 
    geom_histogram(aes(kfr_black_p75,after_stat(count),fill = "p75"),alpha = 0.2,binwidth=5000,na.rm=TRUE) + xlim(0,100000)
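Another way to get the same overlay, sketched under assumptions: the toy data frame below is simulated and only stands in for michigan, and the names toy, toy_long, percentile, and income are made up for illustration. Stacking the two columns with pivot_longer() lets a single geom_histogram() layer handle both variables, and the fill legend is built from the column names.

```r
library(ggplot2)
library(tidyr)

# Simulated stand-in for michigan; these values are made up, not atlas data
set.seed(1)
toy <- data.frame(
  kfr_black_p25 = rnorm(200, mean = 25000, sd = 6000),
  kfr_black_p75 = rnorm(200, mean = 40000, sd = 8000)
)

# Stack the two columns: one row per (variable name, value) pair
toy_long <- pivot_longer(
  toy,
  cols = c(kfr_black_p25, kfr_black_p75),
  names_to = "percentile",
  values_to = "income"
)

# One layer for both groups; position = "identity" overlays them instead of stacking
p <- ggplot(toy_long, aes(income, fill = percentile)) +
  geom_histogram(binwidth = 5000, alpha = 0.4, position = "identity", na.rm = TRUE) +
  xlim(0, 100000)
p
```

The long format costs one extra step, but adding a third variable then means adding a column name to cols rather than another layer.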

Is there a better way?

Try geom_density()

ggplot(michigan) + 
    geom_density(aes(kfr_black_p25,after_stat(count)),na.rm=TRUE) + 
    geom_density(aes(kfr_black_p75,after_stat(count)),na.rm=TRUE) + xlim(0,100000)
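A note on after_stat(count) in the calls above: by default geom_density() draws a curve whose area is 1, while after_stat(count) rescales it so the area equals the number of observations. The sketch below checks this on simulated data (toy and x are invented for illustration) by pulling the computed values out with layer_data().

```r
library(ggplot2)

# Simulated values; not atlas data
set.seed(1)
toy <- data.frame(x = rnorm(500, mean = 30000, sd = 7000))

p_density <- ggplot(toy) + geom_density(aes(x))                     # area under the curve is 1
p_count   <- ggplot(toy) + geom_density(aes(x, after_stat(count)))  # area is the number of rows

# layer_data() returns the values ggplot2 computed for a layer
d <- layer_data(p_count)

# count is the density multiplied by the number of observations
ratio <- d$count / d$density
range(ratio)
```

Counts are handy when the two variables have different numbers of non-missing rows; densities are easier to compare when only the shape matters.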

Follow the help (?geom_density) and you can do all sorts of things, like filling the curves and making them transparent!

# fill labels go inside aes() to build the legend; alpha and na.rm are layer arguments
ggplot(michigan) + 
    geom_density(aes(kfr_black_p25,after_stat(count),fill = "p25"),alpha = 0.1,na.rm=TRUE) + 
    geom_density(aes(kfr_black_p75,after_stat(count),fill = "p75"),alpha = 0.1,na.rm=TRUE) + xlim(0,100000)

Or, you can do something fancy like a violin plot! Here, you need some sort of categorical “x” variable. If you have a small enough geographic area, you could use something like county (good for VT, bad for MI, which has far more counties).

# Generate a categorical variable for your x axis

# Take a look to get a sense
michigan %>% summarize(mean = mean(frac_coll_plus2010, na.rm = TRUE), p25 = quantile(frac_coll_plus2010, 0.25, na.rm = TRUE), p75 = quantile(frac_coll_plus2010, 0.75, na.rm = TRUE))

# Cut into 3 (terciles need 4 break points)
xs <- quantile(michigan$frac_coll_plus2010, c(0,1/3,2/3,1), na.rm = TRUE)

# Generate the categorical variable
# include.lowest = TRUE keeps the minimum in the first tercile; cut() has no na.rm argument

michigan <- michigan %>% mutate(category = cut(frac_coll_plus2010, breaks = xs, labels = c("1st tercile","2nd tercile","3rd tercile"), include.lowest = TRUE))
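To see how quantile() and cut() fit together, here is a small sketch on made-up numbers (the vector x and the names breaks, labs, default_cut, and closed_cut are illustrative only). By default cut() builds intervals that are open on the left, so the minimum value falls outside the first bin unless include.lowest = TRUE is set.

```r
# Made-up values standing in for frac_coll_plus2010
x <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, NA)

# Four break points define three bins
breaks <- quantile(x, c(0, 1/3, 2/3, 1), na.rm = TRUE)

labs <- c("1st tercile", "2nd tercile", "3rd tercile")
default_cut <- cut(x, breaks = breaks, labels = labs)
closed_cut  <- cut(x, breaks = breaks, labels = labs, include.lowest = TRUE)

sum(is.na(default_cut))  # counts the minimum and the original NA
sum(is.na(closed_cut))   # counts only the original NA
table(closed_cut)
```

Comparing the two NA counts is a quick way to check whether the lowest observation was silently dropped from a plot.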

# Here I trimmed the y for interpretation, but it's not always best practice
ggplot(michigan,aes(factor(category),kfr_black_p25)) + 
    geom_violin(na.rm=TRUE) + 
    geom_jitter(height = 0.3,width = 0.3,size =.1,na.rm=TRUE) + 
  scale_x_discrete(na.translate = FALSE)  + 
  ylim(0,100000)
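On the trimming caveat: ylim() drops rows outside the limits before the violin is computed, while coord_cartesian() only zooms the view and keeps every row in the calculation. A minimal sketch on simulated data (toy, category, and income are invented for illustration, and one out-of-range value is added on purpose) shows the difference:

```r
library(ggplot2)

# Simulated data with one deliberate value above the plotting limit
set.seed(1)
toy <- data.frame(
  category = rep(c("low", "high"), each = 100),
  income   = c(rnorm(100, 30000, 10000), rnorm(99, 60000, 20000), 150000)
)

# ylim(): out-of-range rows are set to NA and removed before the density is estimated
p_trim <- ggplot(toy, aes(category, income)) +
  geom_violin(na.rm = TRUE) +
  ylim(0, 100000)

# coord_cartesian(): same view, but the violins are estimated from all rows
p_zoom <- ggplot(toy, aes(category, income)) +
  geom_violin() +
  coord_cartesian(ylim = c(0, 100000))

# Compare the range of y values each violin layer was computed over
range(layer_data(p_trim)$y)
range(layer_data(p_zoom)$y)
```

If the shape of the distribution matters more than a tidy axis, the coord_cartesian() version is usually the safer of the two.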
