The Ultimate Guide for Data Manipulation in R
Manipulating and handling data in R used to be very challenging, but with dplyr and the other packages in the tidyverse, things have become much easier. Some estimate that about 90% of a data scientist's time is spent on cleaning and manipulating data, so any way to cut that time short is welcome. The tidyverse in R is useful for exactly this purpose. This guide explains how to complete some of the most common data manipulation tasks.
In this guide, you will learn how to:
- read a CSV into a data frame
- select, rename, and reorder columns
- pipe or chain operations
- sort, filter, and sample a data frame
- discretize data and create new columns with mutate
- use if-else and case_when logic in calculations
- join, union, and append data frames
- write a data frame to a CSV or Excel file
Let's get started.
How to read a CSV into a data frame
Before we manipulate any data we need to fetch some data. Let's see a few examples of reading CSVs using the readr library.
Read a local CSV file
I saved the Iris data set from the UCI ML repository as a CSV file. Here's how you can load it into a new data frame:
library(readr)
iris_df <- read_csv("iris.csv")
We can peek into this data frame using the glimpse function from dplyr.
library(dplyr)
glimpse(iris_df)
## Rows: 150
## Columns: 5
## $ sepal_length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal_width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal_length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal_width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…
You'll see that the read_csv function correctly guessed the data types and used the first row as the column names. You can change these options by providing the various parameter values. The default values for these parameters are:
- file:
- col_names: TRUE
- col_types: NULL
- locale: default_locale()
- na: c("", "NA")
- quoted_na: TRUE
- quote: "
- comment:
- trim_ws: TRUE
- skip: 0
- n_max: Inf
- guess_max: min(1000, n_max)
- progress: show_progress()
- skip_empty_rows: TRUE
For example:
- col_names: if your file doesn't have a header row, you can set this option to FALSE.
- col_types: you can provide the data types this function should use. For this data set you can use the string representation "ddddc". Try read_csv("iris.csv", col_types = "ddddc").
- na: if your data has missing values represented by a different character, such as a question mark, you can specify that using this parameter.
- skip: if the first few lines of your data have comments or other things that you don't want in your data frame, you can use this parameter to skip them.
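Here's a minimal sketch that combines a few of these options. The file name iris_no_header.csv, the question-mark missing-value marker, and the two skipped comment lines are all assumptions for illustration; adjust them to match your own file.
read_csv("iris_no_header.csv",
         col_names = c("sepal_length", "sepal_width",
                       "petal_length", "petal_width", "class"),
         col_types = "ddddc", # four doubles and one character column
         na = "?",            # question marks represent missing values
         skip = 2)            # skip two comment lines at the top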
By default, this function will not convert a string variable to a factor column. That saves a lot of time and frustration!
Read a CSV file from the internet
Reading a CSV file from the internet follows the same process as reading a local file. (Just make sure that you have the latest libraries. At the end of this guide, you will see my R version as well as the versions of the libraries I used.)
Here we will read a file from the Cato Institute. This file has the human freedom index values for the year 2019.
hfi_index <- read_csv("https://object.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2019.csv",
na = "-",
col_types = cols(
.default = col_guess(),
year = col_double(),
hf_score = col_double(),
hf_rank = col_integer()
)
)
Note the values provided to the col_types parameter. I want the read_csv function to guess all the columns except for the year, hf_score, and hf_rank columns. Also, the missing values are represented by a hyphen, hence I provided it to the na argument.
How to select column(s)
Say you want to select only a few columns from a data frame, perhaps because the original data frame has too many columns or for other reasons. Base R has the subset function you could use, but dplyr has a handy function called (you guessed it) select.
Let’s see a few examples.
How to select columns by names
From the Iris data frame, we want to select sepal_length and petal_width.
This is one way:
select(iris_df, sepal_length, petal_width)
## # A tibble: 150 x 2
## sepal_length petal_width
##
## 1 5.1 0.2
## 2 4.9 0.2
## 3 4.7 0.2
## 4 4.6 0.2
## 5 5 0.2
## 6 5.4 0.4
## 7 4.6 0.3
## 8 5 0.2
## 9 4.4 0.2
## 10 4.9 0.1
## # … with 140 more rows
How to select columns by column indexes
Say we want to select the first five columns from the human freedom index data frame.
This is one way:
select(hfi_index, 1:5)
## # A tibble: 1,620 x 5
## year ISO_code countries region hf_score
##
## 1 2017 ALB Albania Eastern Europe 7.84
## 2 2017 DZA Algeria Middle East & North Africa 4.99
## 3 2017 AGO Angola Sub-Saharan Africa 5.4
## 4 2017 ARG Argentina Latin America & the Caribbean 6.86
## 5 2017 ARM Armenia Caucasus & Central Asia 7.42
## 6 2017 AUS Australia Oceania 8.62
## 7 2017 AUT Austria Western Europe 8.48
## 8 2017 AZE Azerbaijan Caucasus & Central Asia 6.22
## 9 2017 BHS Bahamas Latin America & the Caribbean 7.56
## 10 2017 BHR Bahrain Middle East & North Africa 6.63
## # … with 1,610 more rows
How to change column names
Say we want to change the underscores in the column names to periods or dots. We can use the select function again, this way:
iris_df_names_changed <- select(iris_df,
sepal.length = sepal_length,
sepal.width = sepal_width,
petal.length = petal_length,
petal.width = petal_width,
class)
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…
We can also use the names function from base R to search and replace a character.
iris_df_names_changed <- iris_df
names(iris_df_names_changed) <- gsub(names(iris_df_names_changed), pattern = "_", replacement = ".")
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…
You can see a downside of using the names function: it overwrites the names of your existing data frame, whereas the select function can be used to create a new data frame. By the way, select has a cousin named rename. The biggest difference between the two: select will return only the listed columns, whereas rename will change the names of the provided columns but still return all the other columns.
For example, compare the two following commands and their outputs:
select(iris_df,
sepal.length = sepal_length,
sepal.width = sepal_width)
## # A tibble: 150 x 2
## sepal.length sepal.width
##
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
rename(iris_df,
sepal.length = sepal_length,
sepal.width = sepal_width)
## # A tibble: 150 x 5
## sepal.length sepal.width petal_length petal_width class
##
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
## 7 4.6 3.4 1.4 0.3 Iris-setosa
## 8 5 3.4 1.5 0.2 Iris-setosa
## 9 4.4 2.9 1.4 0.2 Iris-setosa
## 10 4.9 3.1 1.5 0.1 Iris-setosa
## # … with 140 more rows
How to exclude certain columns
We already saw one use case in which certain columns get excluded (sometimes by accident). Another way is to use a negative sign.
For example, say we want to remove sepal_length and sepal_width from the data frame. Here’s one way:
select(iris_df, -sepal_length, -sepal_width)
## # A tibble: 150 x 3
## petal_length petal_width class
##
## 1 1.4 0.2 Iris-setosa
## 2 1.4 0.2 Iris-setosa
## 3 1.3 0.2 Iris-setosa
## 4 1.5 0.2 Iris-setosa
## 5 1.4 0.2 Iris-setosa
## 6 1.7 0.4 Iris-setosa
## 7 1.4 0.3 Iris-setosa
## 8 1.5 0.2 Iris-setosa
## 9 1.4 0.2 Iris-setosa
## 10 1.5 0.1 Iris-setosa
## # … with 140 more rows
Exclude columns by name searches
Exclude if the column name begins with some characters
Include all columns beginning with sepal:
select(iris_df, starts_with('sepal'))
## # A tibble: 150 x 2
## sepal_length sepal_width
##
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
Exclude all columns beginning with sepal:
select(iris_df, -starts_with('sepal'))
## # A tibble: 150 x 3
## petal_length petal_width class
##
## 1 1.4 0.2 Iris-setosa
## 2 1.4 0.2 Iris-setosa
## 3 1.3 0.2 Iris-setosa
## 4 1.5 0.2 Iris-setosa
## 5 1.4 0.2 Iris-setosa
## 6 1.7 0.4 Iris-setosa
## 7 1.4 0.3 Iris-setosa
## 8 1.5 0.2 Iris-setosa
## 9 1.4 0.2 Iris-setosa
## 10 1.5 0.1 Iris-setosa
## # … with 140 more rows
Exclude if the column name ends with some characters
Include all columns ending with width:
select(iris_df, ends_with('width'))
## # A tibble: 150 x 2
## sepal_width petal_width
##
## 1 3.5 0.2
## 2 3 0.2
## 3 3.2 0.2
## 4 3.1 0.2
## 5 3.6 0.2
## 6 3.9 0.4
## 7 3.4 0.3
## 8 3.4 0.2
## 9 2.9 0.2
## 10 3.1 0.1
## # … with 140 more rows
Exclude all columns ending with width:
select(iris_df, -ends_with('width'))
## # A tibble: 150 x 3
## sepal_length petal_length class
##
## 1 5.1 1.4 Iris-setosa
## 2 4.9 1.4 Iris-setosa
## 3 4.7 1.3 Iris-setosa
## 4 4.6 1.5 Iris-setosa
## 5 5 1.4 Iris-setosa
## 6 5.4 1.7 Iris-setosa
## 7 4.6 1.4 Iris-setosa
## 8 5 1.5 Iris-setosa
## 9 4.4 1.4 Iris-setosa
## 10 4.9 1.5 Iris-setosa
## # … with 140 more rows
Exclude if the column name contains some characters
Include all column names containing an underscore:
select(iris_df, contains('_'))
## # A tibble: 150 x 4
## sepal_length sepal_width petal_length petal_width
##
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # … with 140 more rows
Exclude all column names containing an underscore:
select(iris_df, -contains('_'))
## # A tibble: 150 x 1
## class
##
## 1 Iris-setosa
## 2 Iris-setosa
## 3 Iris-setosa
## 4 Iris-setosa
## 5 Iris-setosa
## 6 Iris-setosa
## 7 Iris-setosa
## 8 Iris-setosa
## 9 Iris-setosa
## 10 Iris-setosa
## # … with 140 more rows
Select numeric columns
The select family of functions has other useful members, such as select_if and select_at (a quick sketch of select_at follows the examples below).
Include all numeric columns:
select_if(iris_df, is.numeric)
## # A tibble: 150 x 4
## sepal_length sepal_width petal_length petal_width
##
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # … with 140 more rows
Exclude all numeric columns or, equivalently, include only the character variables:
select_if(iris_df, is.character)
## # A tibble: 150 x 1
## class
##
## 1 Iris-setosa
## 2 Iris-setosa
## 3 Iris-setosa
## 4 Iris-setosa
## 5 Iris-setosa
## 6 Iris-setosa
## 7 Iris-setosa
## 8 Iris-setosa
## 9 Iris-setosa
## 10 Iris-setosa
## # … with 140 more rows
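The select_at function mentioned above works with the vars() helper. Here's a quick sketch that selects the petal columns; it gives the same result as select(iris_df, starts_with("petal")).
# same result as select(iris_df, starts_with("petal"))
select_at(iris_df, vars(starts_with("petal")))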
How to change the order of the columns in a data frame
Often we need to reorder certain columns either for modeling or for printing tables. There are a few ways of doing this.
Move a column to the end
We can do so by using the select function and a special function in dplyr called everything.
For example, we will move the sepal_length column to the end:
select(iris_df, -sepal_length, everything(), sepal_length)
## # A tibble: 150 x 5
## sepal_width petal_length petal_width class sepal_length
##
## 1 3.5 1.4 0.2 Iris-setosa 5.1
## 2 3 1.4 0.2 Iris-setosa 4.9
## 3 3.2 1.3 0.2 Iris-setosa 4.7
## 4 3.1 1.5 0.2 Iris-setosa 4.6
## 5 3.6 1.4 0.2 Iris-setosa 5
## 6 3.9 1.7 0.4 Iris-setosa 5.4
## 7 3.4 1.4 0.3 Iris-setosa 4.6
## 8 3.4 1.5 0.2 Iris-setosa 5
## 9 2.9 1.4 0.2 Iris-setosa 4.4
## 10 3.1 1.5 0.1 Iris-setosa 4.9
## # … with 140 more rows
We first exclude sepal_length, then add all the other columns, and then add sepal_length back at the end.
We can manually rearrange the columns too:
select(iris_df, sepal_width, petal_length, petal_width, class, sepal_length)
## # A tibble: 150 x 5
## sepal_width petal_length petal_width class sepal_length
##
## 1 3.5 1.4 0.2 Iris-setosa 5.1
## 2 3 1.4 0.2 Iris-setosa 4.9
## 3 3.2 1.3 0.2 Iris-setosa 4.7
## 4 3.1 1.5 0.2 Iris-setosa 4.6
## 5 3.6 1.4 0.2 Iris-setosa 5
## 6 3.9 1.7 0.4 Iris-setosa 5.4
## 7 3.4 1.4 0.3 Iris-setosa 4.6
## 8 3.4 1.5 0.2 Iris-setosa 5
## 9 2.9 1.4 0.2 Iris-setosa 4.4
## 10 3.1 1.5 0.1 Iris-setosa 4.9
## # … with 140 more rows
or using the last_col function, which selects the last column:
select(iris_df, 2:last_col(), sepal_length)
## # A tibble: 150 x 5
## sepal_width petal_length petal_width class sepal_length
##
## 1 3.5 1.4 0.2 Iris-setosa 5.1
## 2 3 1.4 0.2 Iris-setosa 4.9
## 3 3.2 1.3 0.2 Iris-setosa 4.7
## 4 3.1 1.5 0.2 Iris-setosa 4.6
## 5 3.6 1.4 0.2 Iris-setosa 5
## 6 3.9 1.7 0.4 Iris-setosa 5.4
## 7 3.4 1.4 0.3 Iris-setosa 4.6
## 8 3.4 1.5 0.2 Iris-setosa 5
## 9 2.9 1.4 0.2 Iris-setosa 4.4
## 10 3.1 1.5 0.1 Iris-setosa 4.9
## # … with 140 more rows
Move the columns around or switch the positions
Here you can use the column names or indexes to change the column positions:
select(iris_df, sepal_width, class, sepal_length, petal_width, petal_length)
## # A tibble: 150 x 5
## sepal_width class sepal_length petal_width petal_length
##
## 1 3.5 Iris-setosa 5.1 0.2 1.4
## 2 3 Iris-setosa 4.9 0.2 1.4
## 3 3.2 Iris-setosa 4.7 0.2 1.3
## 4 3.1 Iris-setosa 4.6 0.2 1.5
## 5 3.6 Iris-setosa 5 0.2 1.4
## 6 3.9 Iris-setosa 5.4 0.4 1.7
## 7 3.4 Iris-setosa 4.6 0.3 1.4
## 8 3.4 Iris-setosa 5 0.2 1.5
## 9 2.9 Iris-setosa 4.4 0.2 1.4
## 10 3.1 Iris-setosa 4.9 0.1 1.5
## # … with 140 more rows
with indexes:
select(iris_df, 2, 5, 1, 4, 3)
## # A tibble: 150 x 5
## sepal_width class sepal_length petal_width petal_length
##
## 1 3.5 Iris-setosa 5.1 0.2 1.4
## 2 3 Iris-setosa 4.9 0.2 1.4
## 3 3.2 Iris-setosa 4.7 0.2 1.3
## 4 3.1 Iris-setosa 4.6 0.2 1.5
## 5 3.6 Iris-setosa 5 0.2 1.4
## 6 3.9 Iris-setosa 5.4 0.4 1.7
## 7 3.4 Iris-setosa 4.6 0.3 1.4
## 8 3.4 Iris-setosa 5 0.2 1.5
## 9 2.9 Iris-setosa 4.4 0.2 1.4
## 10 3.1 Iris-setosa 4.9 0.1 1.5
## # … with 140 more rows
Piping or chaining
The following examples will start getting more involved. Rather than performing an operation, saving the manipulated data frame into a new data frame, and then operating on that new data frame, we can use the %>% pipe operator from the magrittr package. This operator carries the last manipulated data frame (or other object) forward into the next operation. An example will help explain this.
Say we want to include only the columns starting with sepal and then change their names to upper case using the toupper function.
Here’s how to do so by saving or creating a new data frame and then operating on the new data frame:
selected_iris_df <- select(iris_df, starts_with("sepal"))
selected_iris_df
## # A tibble: 150 x 2
## sepal_length sepal_width
##
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
rename_all(selected_iris_df, toupper)
## # A tibble: 150 x 2
## SEPAL_LENGTH SEPAL_WIDTH
##
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
Here's how to do the same thing with the pipe operator (%>%):
select(iris_df, starts_with("sepal")) %>%
rename_all(toupper)
## # A tibble: 150 x 2
## SEPAL_LENGTH SEPAL_WIDTH
##
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
Note that the rename_all function here doesn't need to know the data frame's name; it automatically operates on the data fed to it by the pipe operator.
How to sort or order a data frame
Sorting or ordering a data frame is useful when you want to run some operations based on the order of the data. Caution: most R practitioners would advise you against row-by-row operations because of their inefficiency; often there are better ways of meeting those needs. But when you are printing a table or aggregating some data, you may still need to sort the data. Luckily, dplyr has the arrange function.
Say we want the top five rows from the Iris data frame, defined by the smallest sepal length. We need two steps: 1) sort the data by ascending sepal length, and 2) select the first five rows.
arrange(iris_df, sepal_length) %>%
head(n = 5)
## # A tibble: 5 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 4.3 3 1.1 0.1 Iris-setosa
## 2 4.4 2.9 1.4 0.2 Iris-setosa
## 3 4.4 3 1.3 0.2 Iris-setosa
## 4 4.4 3.2 1.3 0.2 Iris-setosa
## 5 4.5 2.3 1.3 0.3 Iris-setosa
If we want to sort sepal length in descending order, we can use the desc function, like this:
arrange(iris_df, desc(sepal_length)) %>%
head(n = 5)
## # A tibble: 5 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 7.9 3.8 6.4 2 Iris-virginica
## 2 7.7 3.8 6.7 2.2 Iris-virginica
## 3 7.7 2.6 6.9 2.3 Iris-virginica
## 4 7.7 2.8 6.7 2 Iris-virginica
## 5 7.7 3 6.1 2.3 Iris-virginica
Select the top or bottom ten (or n) rows
While we can use the arrange function to select the top rows, dplyr provides the top_n function for this specific purpose.
Here's how to get the bottom five rows by smallest sepal length using the desc function:
top_n(iris_df, n = 5, wt = desc(sepal_length))
## # A tibble: 5 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 4.4 2.9 1.4 0.2 Iris-setosa
## 2 4.3 3 1.1 0.1 Iris-setosa
## 3 4.4 3 1.3 0.2 Iris-setosa
## 4 4.5 2.3 1.3 0.3 Iris-setosa
## 5 4.4 3.2 1.3 0.2 Iris-setosa
You can also use a negative sign to denote the bottom rows:
top_n(iris_df, n = -5, wt = sepal_length)
## # A tibble: 5 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 4.4 2.9 1.4 0.2 Iris-setosa
## 2 4.3 3 1.1 0.1 Iris-setosa
## 3 4.4 3 1.3 0.2 Iris-setosa
## 4 4.5 2.3 1.3 0.3 Iris-setosa
## 5 4.4 3.2 1.3 0.2 Iris-setosa
Here’s how to get the top five rows by largest sepal length:
top_n(iris_df, n = 5, wt = sepal_length)
## # A tibble: 5 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 7.7 3.8 6.7 2.2 Iris-virginica
## 2 7.7 2.6 6.9 2.3 Iris-virginica
## 3 7.7 2.8 6.7 2 Iris-virginica
## 4 7.9 3.8 6.4 2 Iris-virginica
## 5 7.7 3 6.1 2.3 Iris-virginica
Note that the arrange function assumes you want to sort the data in ascending order, whereas the top_n function assumes you want the rows with the largest values of the weighting column.
How to create groups or discretize data
Say we want to convert a numeric range into groups such as 0-5, 6-10, and so on. We can use the cut function to create such groups and labels.
For example:
simple_df <- data.frame(x = 1:20)
simple_df$x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
cut(simple_df$x, breaks = c(0, 5, 10, 15, Inf),
labels = c("0-5", "6-10", "11-15", "15+"),
include.lowest = TRUE)
## [1] 0-5 0-5 0-5 0-5 0-5 6-10 6-10 6-10 6-10 6-10 11-15 11-15
## [13] 11-15 11-15 11-15 15+ 15+ 15+ 15+ 15+
## Levels: 0-5 6-10 11-15 15+
But to add this new variable, called y, to the existing data frame, we can use the following syntax:
simple_df$y <- cut(simple_df$x,
breaks = c(0, 5, 10, 15, Inf),
labels = c("0-5", "6-10", "11-15", "15+"),
include.lowest = TRUE)
simple_df
## x y
## 1 1 0-5
## 2 2 0-5
## 3 3 0-5
## 4 4 0-5
## 5 5 0-5
## 6 6 6-10
## 7 7 6-10
## 8 8 6-10
## 9 9 6-10
## 10 10 6-10
## 11 11 11-15
## 12 12 11-15
## 13 13 11-15
## 14 14 11-15
## 15 15 11-15
## 16 16 15+
## 17 17 15+
## 18 18 15+
## 19 19 15+
## 20 20 15+
Or we can use the mutate function from the dplyr package to create the new variable, like this:
simple_df <- data.frame(x = 1:20)
simple_df <- mutate(simple_df,
y = cut(x,
breaks = c(0, 5, 10, 15, Inf),
labels = c("0-5", "6-10", "11-15", "15+"),
include.lowest = TRUE))
simple_df
## x y
## 1 1 0-5
## 2 2 0-5
## 3 3 0-5
## 4 4 0-5
## 5 5 0-5
## 6 6 6-10
## 7 7 6-10
## 8 8 6-10
## 9 9 6-10
## 10 10 6-10
## 11 11 11-15
## 12 12 11-15
## 13 13 11-15
## 14 14 11-15
## 15 15 11-15
## 16 16 15+
## 17 17 15+
## 18 18 15+
## 19 19 15+
## 20 20 15+
We can create custom functions to generate these breaks and labels for us, so we don't have to type in all the labels. Caution: these are very fragile functions written only for this use case. You may get errors or, worse, wrong results if you use them on other data.
custom_breaks <- function(x, space_by = 5){
c(seq(from = 0, to = max(x) - space_by, by = space_by), Inf)
}
custom_labels <- function(x, space_by = 5){
brks <- custom_breaks(x, space_by = space_by)
lbls <- paste0(brks[1:(length(brks)-1)], "-", brks[2:length(brks)])
lbls[length(lbls)] <- paste0(brks[length(brks)-1], "+")
lbls
}
mutate(simple_df,
y = cut(x,
breaks = custom_breaks(x),
labels = custom_labels(x),
include.lowest = TRUE
)
)
## x y
## 1 1 0-5
## 2 2 0-5
## 3 3 0-5
## 4 4 0-5
## 5 5 0-5
## 6 6 5-10
## 7 7 5-10
## 8 8 5-10
## 9 9 5-10
## 10 10 5-10
## 11 11 10-15
## 12 12 10-15
## 13 13 10-15
## 14 14 10-15
## 15 15 10-15
## 16 16 15+
## 17 17 15+
## 18 18 15+
## 19 19 15+
## 20 20 15+
How to create quantile groups
dplyr provides some windowed rank functions to help us quickly create new columns based on percentiles or percentile groups.
For example, let's say we want to create four quantile groups (quartiles) for sepal width.
mutate(iris_df, s_w_rank = ntile(sepal_width, 4))
## # A tibble: 150 x 6
## sepal_length sepal_width petal_length petal_width class s_w_rank
##
## 1 5.1 3.5 1.4 0.2 Iris-setosa 4
## 2 4.9 3 1.4 0.2 Iris-setosa 2
## 3 4.7 3.2 1.3 0.2 Iris-setosa 3
## 4 4.6 3.1 1.5 0.2 Iris-setosa 3
## 5 5 3.6 1.4 0.2 Iris-setosa 4
## 6 5.4 3.9 1.7 0.4 Iris-setosa 4
## 7 4.6 3.4 1.4 0.3 Iris-setosa 4
## 8 5 3.4 1.5 0.2 Iris-setosa 4
## 9 4.4 2.9 1.4 0.2 Iris-setosa 2
## 10 4.9 3.1 1.5 0.1 Iris-setosa 3
## # … with 140 more rows
How to extract or split string columns
Sometimes data from different columns gets clubbed together in one column. The separate function from the tidyr package is helpful in such cases.
In this example below, we will split or separate a character column into two columns using the hyphen as a separator or delimiter.
library(tidyr)
string_df <- data.frame(imp_info = paste0(LETTERS[1:5], "-", 1:5))
head(string_df)
## imp_info
## 1 A-1
## 2 B-2
## 3 C-3
## 4 D-4
## 5 E-5
separate(data = string_df,
col = imp_info,
sep = "-",
into = c("letter", "number"),
remove = FALSE)
## imp_info letter number
## 1 A-1 A 1
## 2 B-2 B 2
## 3 C-3 C 3
## 4 D-4 D 4
## 5 E-5 E 5
This function is very powerful: it can take regular expressions as delimiters, which opens up many possibilities (a small sketch of this follows the next example). There are also many options for handling missing values and for keeping or dropping columns.
For example, let's extract the area code of a US-based phone number by setting the other columns to NA.
# generate random phone number digits
us_phone_numbers <- data.frame(phone_number = paste(round(runif(5, min = 111, max = 999)),
round(runif(5, min = 1111, max = 9999)),
round(runif(5, min = 1111, max = 9999)),
sep = "-"))
head(us_phone_numbers)
## phone_number
## 1 736-1736-8194
## 2 147-9024-4773
## 3 414-4239-9121
## 4 417-3928-7203
## 5 294-4495-8930
separate(us_phone_numbers, col = phone_number, into = c("area_code", NA, NA), sep = "-")
## area_code
## 1 736
## 2 147
## 3 414
## 4 417
## 5 294
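As promised, here's a small sketch of separate with a regular-expression delimiter. The codes_df data frame is made up for illustration; the pattern splits on any run of non-alphanumeric characters, so the hyphen, underscore, and space are all treated as separators.
codes_df <- data.frame(code = c("A-1", "B_2", "C 3"))
separate(codes_df,
         col = code,
         into = c("letter", "number"),
         sep = "[^[:alnum:]]+") # any run of non-alphanumeric characters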
How to identify the presence of certain text
Often we care about the presence of certain text and creating a logical indicator or flag is useful in those cases. We have many options to achieve this. Let’s look at a few methods.
We want to check whether CA appears in the city/state name column in the following data frame. We will use str_detect from the stringr package.
library(stringr)
some_cities <- head(select(maps::us.cities, name), n = 10)
some_cities
## name
## 1 Abilene TX
## 2 Akron OH
## 3 Alameda CA
## 4 Albany GA
## 5 Albany NY
## 6 Albany OR
## 7 Albuquerque NM
## 8 Alexandria LA
## 9 Alexandria VA
## 10 Alhambra CA
mutate(some_cities, is_CA = str_detect(name, "CA"))
## name is_CA
## 1 Abilene TX FALSE
## 2 Akron OH FALSE
## 3 Alameda CA TRUE
## 4 Albany GA FALSE
## 5 Albany NY FALSE
## 6 Albany OR FALSE
## 7 Albuquerque NM FALSE
## 8 Alexandria LA FALSE
## 9 Alexandria VA FALSE
## 10 Alhambra CA TRUE
If you want to check for multiple states, you can use regular expression patterns. For example, to check for TX and CA:
mutate(some_cities, is_CA_TX = str_detect(name, "CA|TX"))
## name is_CA_TX
## 1 Abilene TX TRUE
## 2 Akron OH FALSE
## 3 Alameda CA TRUE
## 4 Albany GA FALSE
## 5 Albany NY FALSE
## 6 Albany OR FALSE
## 7 Albuquerque NM FALSE
## 8 Alexandria LA FALSE
## 9 Alexandria VA FALSE
## 10 Alhambra CA TRUE
How to create or make a new data frame
You rarely need to create a blank data frame. New programmers in R who are used to other languages often create empty objects and then fill them in loops; most likely, that will not be necessary in R. Creating a new data frame based on an existing data frame is easy: you simply assign the output of a statement to a new data frame. You saw some examples already, but let's say we want to store the cities data frame in a new data frame. Here's one way to do so:
my_super_awesome_new_df <- some_cities
That’s it! It’s that simple.
There's another operator from the magrittr package that's useful when dealing with pipes or chains: %<>%. It pushes the data forward and also assigns the result back. We will select a few columns from the iris_df_names_changed data frame and overwrite it.
library(magrittr)
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…
iris_df_names_changed %<>% select(1:3)
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 3
## $ sepal.length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
How to convert a vector into a data frame
Perhaps you noticed above, but we saw many examples of vectors stored in a data frame. Let's say we want to create a new data frame from the LETTERS vector.
head(LETTERS)
## [1] "A" "B" "C" "D" "E" "F"
new_letters_df <- data.frame(upper_case_leter = head(LETTERS))
str(new_letters_df)
## 'data.frame': 6 obs. of 1 variable:
## $ upper_case_leter: chr "A" "B" "C" "D" ...
If we want to add two different vectors of the same length, we can simply add another column to the new data frame.
head(letters)
## [1] "a" "b" "c" "d" "e" "f"
new_letters_df <- data.frame(upper_case_leter = head(LETTERS),
lower_case_letter = head(letters))
str(new_letters_df)
## 'data.frame': 6 obs. of 2 variables:
## $ upper_case_leter : chr "A" "B" "C" "D" ...
## $ lower_case_letter: chr "a" "b" "c" "d" ...
If you need to stack different vectors in one column, you can simply use c to combine the vectors.
new_letters_df <- data.frame(mixed_case_leter = c(head(LETTERS), head(letters)))
glimpse(new_letters_df)
## Rows: 12
## Columns: 1
## $ mixed_case_leter "A", "B", "C", "D", "E", "F", "a", "b", "c", "d", "e…
How to create subsets of a data frame
We can create subsets or filter data frames using the filter function from the dplyr package. For example, say we want to create a new data frame for California cities from the us.cities data frame.
Here’s one way to do so:
library(maps)
glimpse(us.cities)
## Rows: 1,005
## Columns: 6
## $ name "Abilene TX", "Akron OH", "Alameda CA", "Albany GA", "Alb…
## $ country.etc "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "VA", "CA…
## $ pop 113888, 206634, 70069, 75510, 93576, 45535, 494962, 44933…
## $ lat 32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 31.29, 3…
## $ long -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, -106.62…
## $ capital 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
filter(us.cities, country.etc == 'CA') %>%
glimpse()
## Rows: 208
## Columns: 6
## $ name "Alameda CA", "Alhambra CA", "Aliso Viejo CA", "Altadena …
## $ country.etc "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ pop 70069, 88857, 41975, 43280, 334909, 109485, 66361, 57429,…
## $ lat 37.77, 34.08, 33.57, 34.19, 33.84, 37.99, 34.53, 34.13, 3…
## $ long -122.26, -118.13, -117.73, -118.13, -117.87, -121.80, -11…
## $ capital 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Note that we need two equal signs (==) to compare values. If you use a single equals sign, which is reserved for assignment, you will likely get an error.
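For example (the first line is commented out on purpose; it only shows the mistake):
# filter(us.cities, country.etc = 'CA')  # single = : this raises an error
filter(us.cities, country.etc == 'CA')   # double == : a comparison, this works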
Let's say we want to subset or filter the data frame for CA and TX. You can use the %in% operator to do so:
glimpse(us.cities)
## Rows: 1,005
## Columns: 6
## $ name "Abilene TX", "Akron OH", "Alameda CA", "Albany GA", "Alb…
## $ country.etc "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "VA", "CA…
## $ pop 113888, 206634, 70069, 75510, 93576, 45535, 494962, 44933…
## $ lat 32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 31.29, 3…
## $ long -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, -106.62…
## $ capital 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
filter(us.cities, country.etc %in% c('CA', 'TX')) %>%
glimpse()
## Rows: 279
## Columns: 6
## $ name "Abilene TX", "Alameda CA", "Alhambra CA", "Aliso Viejo C…
## $ country.etc "TX", "CA", "CA", "CA", "TX", "CA", "TX", "CA", "CA", "CA…
## $ pop 113888, 70069, 88857, 41975, 84467, 43280, 183295, 334909…
## $ lat 32.45, 37.77, 34.08, 33.57, 33.11, 34.19, 35.20, 33.84, 3…
## $ long -99.74, -122.26, -118.13, -117.73, -96.67, -118.13, -101.…
## $ capital 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, …
You can use filters on numeric columns also. For example, we want to select rows from the iris data frame where the sepal length is greater than the median.
summary(iris_df)
## sepal_length sepal_width petal_length petal_width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## class
## Length:150
## Class :character
## Mode :character
##
##
##
nrow(iris_df) # check the number of rows
## [1] 150
filtered_iris_df <- filter(iris_df, sepal_length > median(sepal_length))
summary(filtered_iris_df)
## sepal_length sepal_width petal_length petal_width
## Min. :5.90 Min. :2.200 Min. :4.000 Min. :1.000
## 1st Qu.:6.20 1st Qu.:2.800 1st Qu.:4.700 1st Qu.:1.500
## Median :6.45 Median :3.000 Median :5.100 Median :1.800
## Mean :6.58 Mean :2.970 Mean :5.239 Mean :1.811
## 3rd Qu.:6.80 3rd Qu.:3.175 3rd Qu.:5.700 3rd Qu.:2.100
## Max. :7.90 Max. :3.800 Max. :6.900 Max. :2.500
## class
## Length:70
## Class :character
## Mode :character
##
##
##
nrow(filtered_iris_df)
## [1] 70
How to add, subtract or multiply columns of a data frame
The mutate family of functions is very useful when performing simple or complex operations on the columns of a data frame. Let's say we want to multiply all the numeric columns of the iris data frame by 10. Here's how we can do so:
iris_df %>% mutate_if(is.numeric, ~.*10)
## # A tibble: 150 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 51 35 14 2 Iris-setosa
## 2 49 30 14 2 Iris-setosa
## 3 47 32 13 2 Iris-setosa
## 4 46 31 15 2 Iris-setosa
## 5 50 36 14 2 Iris-setosa
## 6 54 39 17 4 Iris-setosa
## 7 46 34 14 3 Iris-setosa
## 8 50 34 15 2 Iris-setosa
## 9 44 29 14 2 Iris-setosa
## 10 49 31 15 1 Iris-setosa
## # … with 140 more rows
The tilde (~) here creates a one-sided formula, a shorthand for a function that mutate_if applies to each selected column; the dot stands for the column being transformed.
If we want to create new columns instead of overwriting, here’s one way:
iris_df %>% mutate_if(is.numeric, list(multiplied = function(x) x*5)) %>% glimpse()
## Rows: 150
## Columns: 9
## $ sepal_length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied 25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ sepal_width_multiplied 17.5, 15.0, 16.0, 15.5, 18.0, 19.5, 17.0, 17.…
## $ petal_length_multiplied 7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …
## $ petal_width_multiplied 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.5, 1.0, 1.0, …
We can do multiple operations at once. For example, we will also add and subtract 20.
iris_df %>%
mutate_if(
is.numeric,
list(
multiplied = function(x) x * 5,
added = function(x) x + 20,
subtracted = function(x) x - 20
)
) %>%
glimpse()
## Rows: 150
## Columns: 17
## $ sepal_length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied 25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ sepal_width_multiplied 17.5, 15.0, 16.0, 15.5, 18.0, 19.5, 17.0, 17.…
## $ petal_length_multiplied 7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …
## $ petal_width_multiplied 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.5, 1.0, 1.0, …
## $ sepal_length_added 25.1, 24.9, 24.7, 24.6, 25.0, 25.4, 24.6, 25.…
## $ sepal_width_added 23.5, 23.0, 23.2, 23.1, 23.6, 23.9, 23.4, 23.…
## $ petal_length_added 21.4, 21.4, 21.3, 21.5, 21.4, 21.7, 21.4, 21.…
## $ petal_width_added 20.2, 20.2, 20.2, 20.2, 20.2, 20.4, 20.3, 20.…
## $ sepal_length_subtracted -14.9, -15.1, -15.3, -15.4, -15.0, -14.6, -15…
## $ sepal_width_subtracted -16.5, -17.0, -16.8, -16.9, -16.4, -16.1, -16…
## $ petal_length_subtracted -18.6, -18.6, -18.7, -18.5, -18.6, -18.3, -18…
## $ petal_width_subtracted -19.8, -19.8, -19.8, -19.8, -19.8, -19.6, -19…
Instead of applying an operation to all numeric columns, we can select columns such as the ones ending with length.
iris_df %>%
mutate_at(vars(ends_with("length")),
list(multiplied = function(x) x * 5)) %>%
glimpse()
## Rows: 150
## Columns: 7
## $ sepal_length 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied 25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ petal_length_multiplied 7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …
One benefit of mutate is that you can use a newly created variable immediately.
For example:
squares_df <- data.frame(x = 1:10)
mutate(squares_df,
x_squared = x^2,
sqrt_of_x_squared = x_squared^(1/2), # note we're using x_squared
is_x_squared_even = x_squared %% 2 == 0)
## x x_squared sqrt_of_x_squared is_x_squared_even
## 1 1 1 1 FALSE
## 2 2 4 2 TRUE
## 3 3 9 3 FALSE
## 4 4 16 4 TRUE
## 5 5 25 5 FALSE
## 6 6 36 6 TRUE
## 7 7 49 7 FALSE
## 8 8 64 8 TRUE
## 9 9 81 9 FALSE
## 10 10 100 10 TRUE
How to use if, then, and else for data frame calculations
So far we have seen simple use cases of data manipulation, but real-world problems are more challenging. Often we need to create a new variable based on a series of conditions.
Let's look at an if-else statement. We will create a new column whose value is yes when the sepal width exceeds 3 and the petal width is less than 0.3, and no otherwise.
mutate(iris_df,
does_it_meet_cond = ifelse(sepal_width > 3 & petal_width < 0.3,
yes = "yes",
no = "no")) %>%
filter(does_it_meet_cond == 'yes')
## # A tibble: 28 x 6
## sepal_length sepal_width petal_length petal_width class does_it_meet_co…
##
## 1 5.1 3.5 1.4 0.2 Iris-seto… yes
## 2 4.7 3.2 1.3 0.2 Iris-seto… yes
## 3 4.6 3.1 1.5 0.2 Iris-seto… yes
## 4 5 3.6 1.4 0.2 Iris-seto… yes
## 5 5 3.4 1.5 0.2 Iris-seto… yes
## 6 4.9 3.1 1.5 0.1 Iris-seto… yes
## 7 5.4 3.7 1.5 0.2 Iris-seto… yes
## 8 4.8 3.4 1.6 0.2 Iris-seto… yes
## 9 5.8 4 1.2 0.2 Iris-seto… yes
## 10 5.4 3.4 1.7 0.2 Iris-seto… yes
## # … with 18 more rows
We can also use case_when from the dplyr package to test multiple conditions. In this simple example, we will recreate the breaks we created earlier using the cut function. We will also use the between function to save us some typing.
glimpse(simple_df)
## Rows: 20
## Columns: 2
## $ x 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
## $ y 0-5, 0-5, 0-5, 0-5, 0-5, 6-10, 6-10, 6-10, 6-10, 6-10, 11-15, 11-15…
mutate(simple_df, brks = case_when(
between(x, 0, 5) ~ '0-5',
between(x, 6, 10) ~ '6-10',
between(x, 11, 15) ~ '11-15',
x > 15 ~ '15+'
))
## x y brks
## 1 1 0-5 0-5
## 2 2 0-5 0-5
## 3 3 0-5 0-5
## 4 4 0-5 0-5
## 5 5 0-5 0-5
## 6 6 6-10 6-10
## 7 7 6-10 6-10
## 8 8 6-10 6-10
## 9 9 6-10 6-10
## 10 10 6-10 6-10
## 11 11 11-15 11-15
## 12 12 11-15 11-15
## 13 13 11-15 11-15
## 14 14 11-15 11-15
## 15 15 11-15 11-15
## 16 16 15+ 15+
## 17 17 15+ 15+
## 18 18 15+ 15+
## 19 19 15+ 15+
## 20 20 15+ 15+
We can use complicated conditions to create our new variable. We will use the mtcars data set to create some made-up conditions for car segments.
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
## $ cyl 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
## $ hp 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
## $ wt 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
## $ vs 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
## $ am 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…
mutate(mtcars, car_segment = case_when(
mpg > 20 & cyl > 5 & hp > 100 ~ "fancy car",
mpg > 15 & cyl > 7 & hp > 150 ~ "power car",
mpg > 14 & disp > 200 & hp > 200 ~ "strange car",
TRUE ~ "others"
)) %>% head(10)
## mpg cyl disp hp drat wt qsec vs am gear carb car_segment
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 fancy car
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 fancy car
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 others
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 fancy car
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 power car
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 others
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 strange car
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 others
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 others
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 others
How to join, union, merge or append data frames using bind functions
Often we need to combine different data frames either by joining them or stacking them. Let’s look at some common use cases.
Join two data frames
Let's say we have a product information data frame and a customer information data frame, and we need to join these two data frames. dplyr provides convenient join functions to do so.
First, let’s create the data sets.
products_df <- data.frame(product = LETTERS[1:5],
price = round(runif(5)*10, 2))
glimpse(products_df)
## Rows: 5
## Columns: 2
## $ product "A", "B", "C", "D", "E"
## $ price 5.96, 6.25, 1.65, 0.28, 0.02
customers_df <- data.frame(cust_name = randomNames::randomNames(10),
state = sample(state.abb, 10, replace = TRUE),
products_purchased = sample(LETTERS[1:3], 10, replace = TRUE))
glimpse(customers_df)
## Rows: 10
## Columns: 3
## $ cust_name "Ward, Sarah", "el-Din, Munisa", "Tran, Zachary", …
## $ state "DE", "ME", "ND", "OH", "IA", "AK", "MO", "IN", "U…
## $ products_purchased "A", "C", "C", "C", "A", "A", "B", "B", "B", "B"
If we want to find the product information for the products purchased by customers, we need to use an inner join, like this:
# Don't run.
# This will give you an error since we haven't specified columns to join with
inner_join(products_df, customers_df)
By default, the join functions will join data frames using the common variables found in both data frames. But in our example, we have a product column in one and a products_purchased column in the other. We will need to use the by argument of the join function to specify the column names. This argument takes a named character vector as input.
Like this:
inner_join(products_df,
customers_df,
by = c("product" = "products_purchased"))
## product price cust_name state
## 1 A 5.96 Ward, Sarah DE
## 2 A 5.96 Dedios, Deja IA
## 3 A 5.96 Lawrence, Lauren AK
## 4 B 6.25 Brock, Tre'Sean MO
## 5 B 6.25 Sullivan, Dominiqua IN
## 6 B 6.25 Lee, Deanna UT
## 7 B 6.25 Barber, Margaret VT
## 8 C 1.65 el-Din, Munisa ME
## 9 C 1.65 Tran, Zachary ND
## 10 C 1.65 Wilfahrt, Katie-Lynne OH
If you need to join data frames on multiple columns, you can specify those columns in the by argument like this: by = c("df.1.col1" = "df.2.col1", "df.1.col2" = "df.2.col2"). A small made-up sketch follows.
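Here's a minimal sketch of a two-column join. The orders_df and shipments_df data frames are made up for illustration; the join keeps only the rows where both the customer and the year agree.
orders_df <- data.frame(cust = c("A", "A", "B"),
                        yr = c(2019, 2020, 2020),
                        amount = c(10, 15, 20))
shipments_df <- data.frame(customer = c("A", "B"),
                           year = c(2020, 2020),
                           boxes = c(2, 3))
inner_join(orders_df, shipments_df,
           by = c("cust" = "customer", "yr" = "year"))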
If we need to see all the products even if there are no customers for them, we will need to use a left join with the product data frame on the left side of the join.
left_join(products_df,
customers_df,
by = c("product" = "products_purchased"))
## product price cust_name state
## 1 A 5.96 Ward, Sarah DE
## 2 A 5.96 Dedios, Deja IA
## 3 A 5.96 Lawrence, Lauren AK
## 4 B 6.25 Brock, Tre'Sean MO
## 5 B 6.25 Sullivan, Dominiqua IN
## 6 B 6.25 Lee, Deanna UT
## 7 B 6.25 Barber, Margaret VT
## 8 C 1.65 el-Din, Munisa ME
## 9 C 1.65 Tran, Zachary ND
## 10 C 1.65 Wilfahrt, Katie-Lynne OH
## 11 D 0.28 <NA> <NA>
## 12 E 0.02 <NA> <NA>
Similarly, we can use right_join to show all the rows from the data frame on the right and only the matching rows from the data frame on the left.
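For example, the following keeps every customer row even if a purchased product were missing from products_df (here every purchased product is present, so all ten customer rows come back with their prices attached):
right_join(products_df,
           customers_df,
           by = c("product" = "products_purchased"))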
Create a union of data frames
If you need to simply stack the rows of two or more data frames, you can use the bind_rows function from the dplyr package. This function offers a lot of flexibility for handling mismatches in column names and missing values.
Let’s create some data frames first.
first_df <- data.frame(segment = LETTERS[1:5],
price = runif(5)*10)
second_df <- data.frame(segment = LETTERS[6:8],
price = runif(3)*10,
rating = 3:1)
We know that both of these data frames have segment and price in common, but the second data frame has an additional column called rating.
Let's stack these two, or create a union:
bind_rows(first_df, second_df)
## segment price rating
## 1 A 8.5454397 NA
## 2 B 5.9386813 NA
## 3 C 8.7418077 NA
## 4 D 2.2458716 NA
## 5 E 2.0619303 NA
## 6 F 5.7534880 3
## 7 G 0.3834018 2
## 8 H 4.3073954 1
You'll see that the rating column has missing values for the rows from the first data frame. You can use the .id argument to create another column that shows which data frame each row came from.
bind_rows(first_df, second_df, .id = "df_id")
## df_id segment price rating
## 1 1 A 8.5454397 NA
## 2 1 B 5.9386813 NA
## 3 1 C 8.7418077 NA
## 4 1 D 2.2458716 NA
## 5 1 E 2.0619303 NA
## 6 2 F 5.7534880 3
## 7 2 G 0.3834018 2
## 8 2 H 4.3073954 1
Append columns
Combining columns of different data frames is usually best handled with joins. But if you know that the rows of the different data frames belong to the same records and that no sorting has taken place, or if the consequences of a mismatch are minor, you can use the bind_cols function. The data frames must contain the same number of rows.
For example, you will get an error if you run this:
# Don't run
bind_cols(first_df, second_df)
But this will run successfully, because the row counts match:
bind_cols(first_df[1:3, ], second_df)
## segment price segment1 price1 rating
## 1 A 8.545440 F 5.7534880 3
## 2 B 5.938681 G 0.3834018 2
## 3 C 8.741808 H 4.3073954 1
How to create a random sample
Sometimes you may want to work with a smaller sample of a data frame. If you just want to select a few rows, not at random, you can use the head or tail functions. By default, they select six rows; you can change the number with the n argument.
# top six rows
head(iris_df)
## # A tibble: 6 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
# bottom six rows
tail(iris_df)
## # A tibble: 6 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 6.7 3.3 5.7 2.5 Iris-virginica
## 2 6.7 3 5.2 2.3 Iris-virginica
## 3 6.3 2.5 5 1.9 Iris-virginica
## 4 6.5 3 5.2 2 Iris-virginica
## 5 6.2 3.4 5.4 2.3 Iris-virginica
## 6 5.9 3 5.1 1.8 Iris-virginica
# top ten rows
head(iris_df, n = 10)
## # A tibble: 10 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
## 7 4.6 3.4 1.4 0.3 Iris-setosa
## 8 5 3.4 1.5 0.2 Iris-setosa
## 9 4.4 2.9 1.4 0.2 Iris-setosa
## 10 4.9 3.1 1.5 0.1 Iris-setosa
To get a random sample, however, you can use the sample_n or sample_frac functions from dplyr. With sample_n you select a fixed number of randomly chosen rows, and with sample_frac you select a percentage of records, say 10% or 15%.
# ten random rows
sample_n(iris_df, size = 10)
## # A tibble: 10 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 5.1 3.8 1.9 0.4 Iris-setosa
## 2 6.3 2.5 4.9 1.5 Iris-versicolor
## 3 4.9 3.1 1.5 0.1 Iris-setosa
## 4 5 2 3.5 1 Iris-versicolor
## 5 4.5 2.3 1.3 0.3 Iris-setosa
## 6 5.9 3.2 4.8 1.8 Iris-versicolor
## 7 6.6 3 4.4 1.4 Iris-versicolor
## 8 4.8 3.1 1.6 0.2 Iris-setosa
## 9 6.4 2.7 5.3 1.9 Iris-virginica
## 10 5.8 2.8 5.1 2.4 Iris-virginica
# 10% randomly selected rows
# there are 150 rows in the Iris data frame
sample_frac(iris_df, size = 0.1)
## # A tibble: 15 x 5
## sepal_length sepal_width petal_length petal_width class
##
## 1 5.8 2.7 3.9 1.2 Iris-versicolor
## 2 4.4 3.2 1.3 0.2 Iris-setosa
## 3 5.6 2.5 3.9 1.1 Iris-versicolor
## 4 6.4 2.7 5.3 1.9 Iris-virginica
## 5 6.5 3 5.8 2.2 Iris-virginica
## 6 5.4 3.9 1.3 0.4 Iris-setosa
## 7 5.4 3.4 1.5 0.4 Iris-setosa
## 8 6.2 2.2 4.5 1.5 Iris-versicolor
## 9 6.3 2.9 5.6 1.8 Iris-virginica
## 10 5.6 3 4.5 1.5 Iris-versicolor
## 11 5.8 2.6 4 1.2 Iris-versicolor
## 12 5.1 3.3 1.7 0.5 Iris-setosa
## 13 4.6 3.1 1.5 0.2 Iris-setosa
## 14 6.4 2.8 5.6 2.2 Iris-virginica
## 15 6.4 3.2 4.5 1.5 Iris-versicolor
These functions have additional useful arguments, such as replace to sample with replacement (which can produce duplicate rows) and weight to prefer some rows over others.
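As a small sketch of those two arguments, the call below draws five rows with replacement, favoring rows with a larger sepal length; your rows will differ because the sampling is random.
# five rows sampled with replacement, weighted by sepal length
sample_n(iris_df, size = 5, replace = TRUE, weight = sepal_length)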
How to efficiently loop through each row
As I mentioned earlier, if you are thinking of looping through each row of a data frame, you should reconsider your problem; often there is a vectorized and more efficient approach. But when you must, there are a few ways of doing it.
You can use a loop like this:
small_iris_df <- head(iris_df, 10)
for (i in seq_len(nrow(small_iris_df))){
cat(paste(i, "row of sepal_length has the value of", small_iris_df[i, 1][[1]]))
cat("\n")
}
## 1 row of sepal_length has the value of 5.1
## 2 row of sepal_length has the value of 4.9
## 3 row of sepal_length has the value of 4.7
## 4 row of sepal_length has the value of 4.6
## 5 row of sepal_length has the value of 5
## 6 row of sepal_length has the value of 5.4
## 7 row of sepal_length has the value of 4.6
## 8 row of sepal_length has the value of 5
## 9 row of sepal_length has the value of 4.4
## 10 row of sepal_length has the value of 4.9
You can also use the rowwise function from dplyr.
small_iris_df <- head(iris_df, 10)
rowwise(small_iris_df) %>%
do(something = paste("sepal_length has the value of", .$sepal_length)) %>%
.$something
## [[1]]
## [1] "sepal_length has the value of 5.1"
##
## [[2]]
## [1] "sepal_length has the value of 4.9"
##
## [[3]]
## [1] "sepal_length has the value of 4.7"
##
## [[4]]
## [1] "sepal_length has the value of 4.6"
##
## [[5]]
## [1] "sepal_length has the value of 5"
##
## [[6]]
## [1] "sepal_length has the value of 5.4"
##
## [[7]]
## [1] "sepal_length has the value of 4.6"
##
## [[8]]
## [1] "sepal_length has the value of 5"
##
## [[9]]
## [1] "sepal_length has the value of 4.4"
##
## [[10]]
## [1] "sepal_length has the value of 4.9"
Again, there are better ways of handling work that goes through each row of a data frame, especially the apply family of functions from base R or the group_by and summarize functions from dplyr.
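For instance, instead of looping over rows to compute a per-class average sepal length, a grouped summary does it in one vectorized step:
iris_df %>%
  group_by(class) %>%
  summarize(mean_sepal_length = mean(sepal_length))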
How to write or save a data frame to Excel or CSV
After the data manipulation and aggregation, you may want to save the data frame as a CSV or an Excel file. Here’s how we can do so:
Save as a CSV: we can use the write_csv function from the readr package like this:
write_csv(sample_n(iris_df, size = 10),
path = "my_sampled_iris_data.csv")
Save as an Excel file: we can use the write.xlsx function from the xlsx package like this:
library(xlsx)
write.xlsx(sample_n(iris_df, size = 10),
file = "my_sampled_iris_data.xlsx")
The xlsx package requires Java and will not work if you don't have a JVM on your computer. You can use the openxlsx package instead if you run into issues with Java.
library(openxlsx)
openxlsx::write.xlsx(sample_n(iris_df, size = 10),
file = "my_sampled_iris_data.xlsx")
My R version and package information
R version 4.0.0 (2020-04-24)
OS: macOS Catalina 10.15.4
Attached packages
maps version 3.3.0
magrittr version 1.5
stringr version 1.4.0
tidyr version 1.1.0
dplyr version 0.8.5
readr version 1.3.1