Lesson 07-02: Introduction to dplyr


These lessons draw heavily from the excellent book, R for Data Science

 

Activities

  1. Download this markdown template: Lesson_0702_inclass.Rmd. You can knit it to html for easier reading, or you can load the html version here: Lesson_0702_inclass.html.

  2. Work through the various prompts and questions in the markdown file. Stay in your channel w/ your group-mates so you can discuss challenges, share resources, and make things easier.

  3. If something isn't clear or you want more background, recall these three resources:

    1. The RStudio tutorials you accessed for the pre-class activities
    2. References linked to below 👇
    3. The Google
  4. Knit your edited markdown file to a document and upload it to BB. Remember to compress it if you use an html file.

Objectives

Become comfortable with the following commands from dplyr and ggplot2

Essentially, we're practicing with the commands we learned for our pre-class activities.

 

R Basics

We can use R as a calculator. You can enter all sorts of things in the console.
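Here is a minimal sketch of the kind of arithmetic you might type at the console. The particular numbers are arbitrary and chosen only for illustration.

```r
# Basic arithmetic typed straight into the console
1 / 200 * 30        # 0.15
(59 + 73 + 2) / 3   # 44.66667
sin(pi / 2)         # 1
```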

You can create new objects with <-. This is an assignment statement because it assigns something to an object.

Object names must start with a letter, and can only contain letters, numbers, underscores (_), and periods (.). It is a good idea to use words that are descriptive. Personally, I like snake_case because when you double click on it, you select the entire object (rather than this-way).

You can type the name of an object to view it. You can update the object at any time!
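A short sketch of assignment, viewing, and updating. The object names and values here (x, life_exp_years, 72.5) are made up for illustration.

```r
# Create an object with the assignment operator
x <- 3 * 4
x                        # typing the name prints the value: 12

# A descriptive snake_case name (hypothetical value)
life_exp_years <- 72.5
life_exp_years

# Objects can be overwritten at any time
x <- x + 1
x                        # 13
```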

 

Functions

R has lots of functions of this form: function_name(arg1 = val1, arg2 = val2, ...)

Take a look at a basic function, seq(). If you type it into RStudio, note that it will help by adding the closing parenthesis, suggesting arguments, etc. When you make an assignment, you won't see the value. You'll have to ask R to print it.
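As a rough example of calling seq() and then printing the result, assuming we want the odd numbers from 1 to 10 (the object name y is arbitrary):

```r
# seq() builds a regular sequence; arguments can be named explicitly
y <- seq(from = 1, to = 10, by = 2)   # the assignment prints nothing

# Ask R to print the object by typing its name
y                                      # 1 3 5 7 9
```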

Or, you can speed things up by surrounding the entire thing with parentheses:
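A one-line sketch of the parentheses shortcut (again, z and the arguments are just illustrative choices):

```r
# Wrapping the assignment in parentheses assigns AND prints in one step
(z <- seq(from = 1, to = 10, length.out = 4))   # 1 4 7 10
```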

Loading data

We're going to work with gapminder data! This is just an excerpt of the data for simplicity, but we can play more later on. [1]

You can bulk download some or all gapminder data through various repositories.

This block will install the gapminder package and load it. If this is your first time, you will need to "uncomment" the install.packages line in order to install it. You will want to comment it back out or delete it once it's installed, or it can create problems when knitting.
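A sketch of what such a block typically looks like. The install line is left commented out, so this assumes the package is already installed on your machine.

```r
# Uncomment the next line the first time only, then comment it out again
# install.packages("gapminder")

# Load the package so its data frame is available
library(gapminder)
```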

Now that we've loaded the library, we have access to a data frame

Why do we only see a bit of the data frame? It is very big. We are looking at a "tibble," part of the tidyverse that dplyr belongs to, which prints only the first 10 rows and the columns that fit on your screen. It's a bit easier to work with. If you want to see all observations and variables, you can type View(gapminder).
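A minimal sketch of looking at the data frame, assuming the gapminder and dplyr packages are installed. View() is left commented out because it only works in an interactive session.

```r
library(dplyr)
library(gapminder)

# Typing the name prints a compact preview of the tibble
gapminder

# Number of rows and columns in the full data frame
dim(gapminder)

# View(gapminder)   # opens the whole table in RStudio's data viewer
```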

One particularly handy thing you can do is select one variable using $.
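For instance, here is a sketch that pulls out the lifeExp column as a plain vector (the name life_exp is just an illustrative choice):

```r
library(gapminder)

# $ extracts a single column as a vector
life_exp <- gapminder$lifeExp

head(life_exp)   # first six values
mean(life_exp)   # vectors can go straight into functions like mean()
```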

There are several types of data listed here (there are a few more types as well!):
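One way to check the type of each column yourself is with class(). This is a small sketch using three of the gapminder columns:

```r
library(gapminder)

class(gapminder$country)   # "factor"  (categories)
class(gapminder$year)      # "integer" (whole numbers)
class(gapminder$lifeExp)   # "numeric" (decimal numbers)
```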

Recap of dplyr

Every dplyr verb follows the same pattern. The first argument is always a data frame, and the function always returns a data frame:
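A minimal sketch of that pattern using filter(), assuming dplyr and gapminder are installed. The object name canada is arbitrary.

```r
library(dplyr)
library(gapminder)

# verb(data_frame, ...): a data frame goes in, a data frame comes out
canada <- filter(gapminder, country == "Canada")

is.data.frame(canada)   # TRUE
canada
```

Because the result is itself a data frame, it can be handed straight to another verb.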

In addition to interactive RStudio tutorials, my favorite source is R for Data Science - super clear explanations and walkthroughs!

| Command | What it does | Excel equivalent | Help! |
|---|---|---|---|
| filter() | Restrict observations within your data frame based on 1+ criteria | IF() and IFS() | Filter rows with filter() |
| arrange() | Sort the order of observations | Data/Sort... | Arrange rows with arrange() |
| select() | Show only a certain set of variables (columns) | Hiding various columns | Select columns with select() |
| mutate() | Make new variables (columns) | Adding a new column that contains a formula | Add new variables with mutate() |
| summarize() | Generate summary statistics; essentially "collapses" your data to those statistics | It's AVERAGE(), COUNT(), MEDIAN() and more! | Grouped summaries with summarize() |
| group_by() | Groups your data by a set of categories (lets you calculate means by state, for example) | Embedded in Pivot tables | Grouped summaries with summarize() |
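Here is a rough sketch that runs each verb once on the gapminder data. The object names and the new gdp column are hypothetical choices for illustration, not part of the original lesson.

```r
library(dplyr)
library(gapminder)

# filter(): keep only the rows for 2007
gm_2007 <- filter(gapminder, year == 2007)

# arrange(): sort by life expectancy, highest first
by_life <- arrange(gm_2007, desc(lifeExp))

# select(): keep only three columns
slim <- select(gm_2007, country, continent, lifeExp)

# mutate(): add a new column (total GDP = GDP per capita x population)
with_gdp <- mutate(gm_2007, gdp = gdpPercap * pop)

# summarize(): collapse everything to one row of statistics
overall <- summarize(gm_2007, mean_life = mean(lifeExp))

# group_by() + summarize(): one row of statistics per continent
by_continent <- summarize(group_by(gm_2007, continent),
                          mean_life = mean(lifeExp))
by_continent
```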
    

 

Logical operators

R uses the standard operators for comparisons: >, >=, <, <=, != (not equal), and == (equal).

| Test | Meaning |
|---|---|
| x < y | Less than |
| x > y | Greater than |
| x == y | Equal to |
| x <= y | Less than or equal to |
| x >= y | Greater than or equal to |
| x != y | Not equal to |
| x %in% y | In (group membership) |
| is.na(x) | Is missing |
| !is.na(x) | Is not missing |
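A small sketch of these tests on a made-up vector x. Each test returns one TRUE/FALSE (or NA) per element.

```r
# A made-up vector with one missing value
x <- c(1, 5, NA, 10)

x > 4             # FALSE  TRUE    NA  TRUE
x == 5            # FALSE  TRUE    NA FALSE
x != 5            #  TRUE FALSE    NA  TRUE
x %in% c(1, 10)   #  TRUE FALSE FALSE  TRUE
is.na(x)          # FALSE FALSE  TRUE FALSE
!is.na(x)         #  TRUE  TRUE FALSE  TRUE
```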

You can also combine tests using “and” with “&”, “or” with “|”, and “not” with “!”:

| Operator | Meaning |
|---|---|
| a & b | and |
| a \| b | or |
| !a | not |
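To see these operators inside filter(), here is a sketch on the gapminder data. The conditions and object names are arbitrary examples, not the lesson's original ones.

```r
library(dplyr)
library(gapminder)

# and: both conditions must be TRUE
asia_2007 <- filter(gapminder, continent == "Asia" & year == 2007)

# or: at least one condition must be TRUE
americas_or_oceania <- filter(gapminder,
                              continent == "Americas" | continent == "Oceania")

# %in% is a compact way to write the same "or" test
same_rows <- filter(gapminder, continent %in% c("Americas", "Oceania"))

# not: flip a condition
not_africa <- filter(gapminder, !(continent == "Africa"))
```

Note that each side of | is a complete comparison on its own.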

 

Don't do this:

Do this:

Missing values

Often we will encounter missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
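A few quick console examples of that contagion, plus one of the exceptions behind the word “almost”. The na.rm argument shown at the end is a standard option of mean() and is not discussed elsewhere in this lesson.

```r
NA > 5      # NA
10 == NA    # NA
NA + 10     # NA
NA == NA    # NA: two unknown values might or might not be equal

# An exception: the answer is TRUE no matter what the unknown value is
NA | TRUE   # TRUE

# Summaries are affected too, unless you ask R to drop the NAs
mean(c(1, 2, NA))                 # NA
mean(c(1, 2, NA), na.rm = TRUE)   # 1.5
```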

If you want to determine if a value is missing, use is.na():
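A short sketch with a made-up vector, vals, showing why is.na() is needed instead of == NA:

```r
vals <- c(3, NA, 7)

vals == NA           # NA NA NA: comparing with == never finds the missing value
is.na(vals)          # FALSE  TRUE FALSE

sum(is.na(vals))     # 1 missing value
vals[!is.na(vals)]   # keep only the non-missing values: 3 7
```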

 


 

 


[1] The full documentation is available here