dplyr
Lesson 07-02: Introduction to dplyr
ActivitiesObjectivesR BasicsFunctionsLoading dataRecap of dplyr
Logical operatorsMissing values
These lessons draw heavily from the excellent book, R for Data Science
Download this markdown template: Lesson_0702_inclass.Rmd. You can knit it to html for easier reading, or you can load the html version here: Lesson_0702_inclass.html.
Work through the various prompts and questions in the markdown file. Stay in your channel w/ your group-mates so you can discuss challenges, share resources, and make things easier.
If something isn't clear or you want more background, recall these three resources:
Knit your edited markdown file to document and upload to BB. Remember to compress if you use an html file.
Become comfortable with the following commands from dplyr
and ggplot2
filter()
select()
arrange()
Essentially, we're practicing with the commands we learned for our pre-class activities.
We can use R as a calculator. You can enter all sort of things in the console.
x1 / 200 * 2
(59 + 2 + 2) / 3
sin(pi / 2)
You can create new objects with <-
. This is an assignment statement because it assigns something to an object.
xxxxxxxxxx
# object_name <- value
Object names must start with a letter, and can only contain letters, numbers, _
and .
. It is a good idea to use words that are descriptive. Personally, i like snake_case because when you double click on it, you select the entire object (rather than this-way)
xxxxxxxxxx
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
}
You can type the name of an object to view it. You can update the object at any time!
xxxxxxxxxx
this_is_a_really_long_name <- 2.5
this_is_a_really_long_name
R has lots of functions of this form function_name(arg1 = val1, arg2 = val2, ...)
Take a look at a basic function, seq
. If you type it into R, note that it will help by adding parentheses, etc. When you make an assignment, you won't see the value. You'll have to ask R to print it.
xxxxxxxxxx
seq(1,10)
x <- seq(1,10)
x
Or, you can speed things up by surrounding the entire thing with parentheses:
xxxxxxxxxx
(x <- seq(1,10))
We're going to work with gapminder data! This is just an excerpt of the data for simplicity, but we can play more later on.1
You can bulk download some or all gapfinder data through various repositories.
This block will install the gapminder package and load it. If this is your first time, you will need to "uncomment" the install.packages
line in order to install it. You will want to comment it back out or delete it once it's installed, or it can create problems when knitting
xxxxxxxxxx
# install.packages("gapminder")
library(gapminder)
Now that we've loaded the library, we have access to a data frame
xxxxxxxxxx
gapminder
Why do we only see a bit of the data frame? It is very big. We are looking at a "tibble," part of the dplyr
universe, which shows the first 10 rows of a variable. It's a bit easier to work with. If you want to see all observations and variables, you can type View()
.
One particular handy thing you can do is select one variable using $
.
xxxxxxxxxx
gapminder$dep_delay
There are several types of data listed here (there are a few more types as well!):
int
stands for integers.dbl
stands for doubles, or real numbers.chr
stands for character vectors, or strings.dttm
stands for date-times (a date + a time).dplyr
Every dplyr verb follows the same pattern. The first argument is always a data frame, and the function always returns a data frame:
xxxxxxxxxx
#VERB(DATA_TO_TRANSFORM, STUFF_IT_DOES)
In addition to interactive RStudio tutorials, my favorite source is R for Data Science - super clear explanations and walkthroughs!
Command | What it does | Excel equivalent | Help! |
---|---|---|---|
filter() | Restrict observations within your dataframe based on 1+ criteria | IF() and IFS() | Filter rows with filter() |
arrange() | Sort the order of observations | Data/Sort... | Arrange rows with arrange() |
select() | Show only a certain set of variables (columns) | Hiding various columns | Select columns with select() |
mutate() | Make new variables (columns) | Adding a new column that contains a formula | Add new variables with mutate() |
summarize() | Generate summary statistics - essentially "collapses" your data to those statistics | It's AVERAGE() , COUNT() , MEDIAN() and more! | Grouped summaries with summarize() |
group_by() | Groups your data by a set of categories (let's you calculate means by state, for example) | Embedded in Pivot tables | Grouped summaries with summarize() |
R using standard operators for comparisons: >
, >=
, <
, <=
, !=
(not equal), and ==
(equal).
Test | Meaning |
---|---|
x < y | Less than |
x > y | Greater than |
x == y | Equal to |
x <= y | Less than or equal to |
x >= y | Greater than or equal to |
x != y | Not equal to |
x %in% y | In (group membership) |
is.na(x) | Is missing |
!is.na(x) | Is not missing |
You can also use “or” with “|
” and “not” with “!
”:
Operator | Meaning |
---|---|
a & b | and |
a | b | or |
!a | not |
Don't do this:
xxxxxxxxxx
filter(flights, month = 1)
filter(flights,month == 11 | 12)
Do this:
xxxxxxxxxx
filter(flights, month == 11)
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
Often we will enounter missing values, or NA
s (“not availables”). NA
represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
xxxxxxxxxx
NA > 5
10 == NA
NA + 10
NA / 2
If you want to determine if a value is missing, use is.na()
:
xxxxxxxxxx
x <-5
is.na(x)