The Ultimate Guide for Data Manipulation in R

Manipulating and handling data in R used to be very challenging, but with dplyr and other packages in tidyverse things have become easier. Some estimate about 90% of the time is spent on data cleaning and manipulating. If there are ways to cut that time short, any data scientist will welcome those. tidyverse in R is useful for this purpose. This guide will explain how to complete some of the most common data manipulation tasks.

In this guide, you will learn:

  • How to read a CSV into a data frame
  • How to select a column
  • How to change column names
  • How to change the order of the columns in a data frame
  • How to remove certain columns
  • How to sort or order a data frame using
  • How to manipulate data to create groups
  • How to extract string columns
  • How to split and join string columns
  • How to identify certain characters in a string column
  • How to create or make a new data frame
  • How to split string columns using a delimiter
  • How to split string columns using a delimiter
  • How to add, subtract or multiply columns of a data frame
  • How to use if, then, else for data frame calculations
  • How to join, union, merge or append data frames using bind functions
  • How to create a random sample
  • How to create subsets of a data frame
  • How to efficiently loop through each row
  • How to join data frames using multiple columns
  • How to write or save a data frame to Excel or CSV

Let's get started then.

How to read a CSV into a data frame

Before we manipulate any data we need to fetch some data. Let’s see a few examples of reading CSVs using the readr library.

Read a local CSV file

I saved the Iris data set from UCI ML repository as a CSV file. Here’s how you can load it into a new data frame

library(readr)

iris_df <- read_csv("iris.csv")

We can peek into this data frame using the glimpse function from dplyr.

library(dplyr)

glimpse(iris_df)
## Rows: 150
## Columns: 5
## $ sepal_length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal_width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal_length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal_width   0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class         "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…

You’ll see that the read_csv function correctly guessed the data types and used the first row as the column names. You can change these options by providing the various parameter values. The default values for these parameters are:

  • file:
  • col_names: TRUE
  • col_types: NULL
  • locale: default_locale()
  • na: c("“,”NA")
  • quoted_na: TRUE
  • quote: "
  • comment:
  • trim_ws: TRUE
  • skip: 0
  • n_max: Inf
  • guess_max: min(1000, n_max)
  • progress: show_progress()
  • skip_empty_rows: TRUE

For example:

  • col_names: if your file doesn’t have a header row, you can set this option to FALSE.
  • col_types: you can provide data types that this function should use. You can use a string representation for this data set “ddddc”. Try read_csv("iris.csv", col_types = "ddddc").
  • na: if your data has missing values represented by a different character such as a question mark, you can specify that using this parameter
  • skip: say the first few lines of your data have comments or other things that you don’t want in your data frame, you can this parameter.

By default, this function will not convert a string variable to a factor column. That saves a lot of time and frustration!

Read a CSV file from the internet

Reading a CSV file from the internet has the same process as reading a local file. (Just make sure that you have the latest libraries. Below this article, you will see my R version as well as the version of libraries.)

Here we will read a file from the Cato Institute. This file has the human freedom index values for the year 2019.

hfi_index <- read_csv("https://object.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2019.csv",
  na = "-",
  col_types = cols(
    .default = col_guess(),
    year = col_double(),
    hf_score = col_double(),
    hf_rank = col_integer()
  )
)

Note the values provided to the col_types parameter. I want the read_csv function to guess all the columns except for the year, hf_score, and hf_rank columns. Also, the missing values are represented by a hyphen, hence I provided it to the na argument.

How to select column(s)

Say you want to select only a few columns from a data frame because the original data frame has too many columns or for other reasons. Base R has the subset function you could use, but dplyr has a handy function called (guess??) select.

Let’s see a few examples.

How to select columns by names

From the Iris data frame, we want to select sepal_length and petal_width.

This is one way:

select(iris_df, sepal_length, petal_width)
## # A tibble: 150 x 2
##    sepal_length petal_width
##                  
##  1          5.1         0.2
##  2          4.9         0.2
##  3          4.7         0.2
##  4          4.6         0.2
##  5          5           0.2
##  6          5.4         0.4
##  7          4.6         0.3
##  8          5           0.2
##  9          4.4         0.2
## 10          4.9         0.1
## # … with 140 more rows

How to select columns by column indexes

Say we want to select the first five columns from the human freedom index data frame.

This is one way:

select(hfi_index, 1:5)
## # A tibble: 1,620 x 5
##     year ISO_code countries  region                        hf_score
##                                           
##  1  2017 ALB      Albania    Eastern Europe                    7.84
##  2  2017 DZA      Algeria    Middle East & North Africa        4.99
##  3  2017 AGO      Angola     Sub-Saharan Africa                5.4 
##  4  2017 ARG      Argentina  Latin America & the Caribbean     6.86
##  5  2017 ARM      Armenia    Caucasus & Central Asia           7.42
##  6  2017 AUS      Australia  Oceania                           8.62
##  7  2017 AUT      Austria    Western Europe                    8.48
##  8  2017 AZE      Azerbaijan Caucasus & Central Asia           6.22
##  9  2017 BHS      Bahamas    Latin America & the Caribbean     7.56
## 10  2017 BHR      Bahrain    Middle East & North Africa        6.63
## # … with 1,610 more rows

How to change column names

Say we want to change the underscores in the column names to periods or dots, we can use the select function again this way:

iris_df_names_changed <- select(iris_df, 
       sepal.length = sepal_length, 
       sepal.width = sepal_width,
       petal.length = petal_length,
       petal.width = petal_width,
       class)

glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width   0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class         "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…

We can also use the names function from base R to search and replace a character.

iris_df_names_changed <- iris_df
names(iris_df_names_changed) <- gsub(names(iris_df_names_changed), pattern = "_", replacement = ".")

glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width   0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class         "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…

You can see a downside of using the names function: it overwrites the names of your existing data frame, whereas the select function can be used to create a new data frame.

BTW, select has a cousin named rename. The biggest difference between these two: select will select only the listed columns, whereas rename will change the name of the provided columns but still return all the other columns.

For example, compare the two following commands and their outputs:

select(iris_df, 
       sepal.length = sepal_length, 
       sepal.width = sepal_width)
## # A tibble: 150 x 2
##    sepal.length sepal.width
##                  
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows
rename(iris_df, 
       sepal.length = sepal_length, 
       sepal.width = sepal_width)
## # A tibble: 150 x 5
##    sepal.length sepal.width petal_length petal_width class      
##                                        
##  1          5.1         3.5          1.4         0.2 Iris-setosa
##  2          4.9         3            1.4         0.2 Iris-setosa
##  3          4.7         3.2          1.3         0.2 Iris-setosa
##  4          4.6         3.1          1.5         0.2 Iris-setosa
##  5          5           3.6          1.4         0.2 Iris-setosa
##  6          5.4         3.9          1.7         0.4 Iris-setosa
##  7          4.6         3.4          1.4         0.3 Iris-setosa
##  8          5           3.4          1.5         0.2 Iris-setosa
##  9          4.4         2.9          1.4         0.2 Iris-setosa
## 10          4.9         3.1          1.5         0.1 Iris-setosa
## # … with 140 more rows

How to exclude certain columns

We saw one use case which shows how certain columns can be excluded (sometimes by accident). Another way is to use a negative sign.

For example, say we want to remove sepal_length and sepal_width from the data frame. Here’s one way:

select(iris_df, -sepal_length, -sepal_width)
## # A tibble: 150 x 3
##    petal_length petal_width class      
##                         
##  1          1.4         0.2 Iris-setosa
##  2          1.4         0.2 Iris-setosa
##  3          1.3         0.2 Iris-setosa
##  4          1.5         0.2 Iris-setosa
##  5          1.4         0.2 Iris-setosa
##  6          1.7         0.4 Iris-setosa
##  7          1.4         0.3 Iris-setosa
##  8          1.5         0.2 Iris-setosa
##  9          1.4         0.2 Iris-setosa
## 10          1.5         0.1 Iris-setosa
## # … with 140 more rows

Exclude columns by name searches

Exclude if the column name begins with some characters

Include all columns beginning with sepal:

select(iris_df, starts_with('sepal'))
## # A tibble: 150 x 2
##    sepal_length sepal_width
##                  
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

Exclude all columns beginning with sepal:

select(iris_df, -starts_with('sepal'))
## # A tibble: 150 x 3
##    petal_length petal_width class      
##                         
##  1          1.4         0.2 Iris-setosa
##  2          1.4         0.2 Iris-setosa
##  3          1.3         0.2 Iris-setosa
##  4          1.5         0.2 Iris-setosa
##  5          1.4         0.2 Iris-setosa
##  6          1.7         0.4 Iris-setosa
##  7          1.4         0.3 Iris-setosa
##  8          1.5         0.2 Iris-setosa
##  9          1.4         0.2 Iris-setosa
## 10          1.5         0.1 Iris-setosa
## # … with 140 more rows

Exclude if the column name ends with some characters

Include all columns ending with width:

select(iris_df, ends_with('width'))
## # A tibble: 150 x 2
##    sepal_width petal_width
##                 
##  1         3.5         0.2
##  2         3           0.2
##  3         3.2         0.2
##  4         3.1         0.2
##  5         3.6         0.2
##  6         3.9         0.4
##  7         3.4         0.3
##  8         3.4         0.2
##  9         2.9         0.2
## 10         3.1         0.1
## # … with 140 more rows

Exclude all columns ending with width:

select(iris_df, -ends_with('width'))
## # A tibble: 150 x 3
##    sepal_length petal_length class      
##                          
##  1          5.1          1.4 Iris-setosa
##  2          4.9          1.4 Iris-setosa
##  3          4.7          1.3 Iris-setosa
##  4          4.6          1.5 Iris-setosa
##  5          5            1.4 Iris-setosa
##  6          5.4          1.7 Iris-setosa
##  7          4.6          1.4 Iris-setosa
##  8          5            1.5 Iris-setosa
##  9          4.4          1.4 Iris-setosa
## 10          4.9          1.5 Iris-setosa
## # … with 140 more rows

Exclude if the column name contains some characters

Include all columns names containing an underscore:

select(iris_df, contains('_'))
## # A tibble: 150 x 4
##    sepal_length sepal_width petal_length petal_width
##                                 
##  1          5.1         3.5          1.4         0.2
##  2          4.9         3            1.4         0.2
##  3          4.7         3.2          1.3         0.2
##  4          4.6         3.1          1.5         0.2
##  5          5           3.6          1.4         0.2
##  6          5.4         3.9          1.7         0.4
##  7          4.6         3.4          1.4         0.3
##  8          5           3.4          1.5         0.2
##  9          4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## # … with 140 more rows

Exclude all columns names containing an underscore:

select(iris_df, -contains('_'))
## # A tibble: 150 x 1
##    class      
##          
##  1 Iris-setosa
##  2 Iris-setosa
##  3 Iris-setosa
##  4 Iris-setosa
##  5 Iris-setosa
##  6 Iris-setosa
##  7 Iris-setosa
##  8 Iris-setosa
##  9 Iris-setosa
## 10 Iris-setosa
## # … with 140 more rows

Select numeric columns

The select family of functions have other useful functions such as select_if or select_at.

Include all numeric columns::

select_if(iris_df, is.numeric)
## # A tibble: 150 x 4
##    sepal_length sepal_width petal_length petal_width
##                                 
##  1          5.1         3.5          1.4         0.2
##  2          4.9         3            1.4         0.2
##  3          4.7         3.2          1.3         0.2
##  4          4.6         3.1          1.5         0.2
##  5          5           3.6          1.4         0.2
##  6          5.4         3.9          1.7         0.4
##  7          4.6         3.4          1.4         0.3
##  8          5           3.4          1.5         0.2
##  9          4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## # … with 140 more rows

Exclude all numeric columns or include character variables:

select_if(iris_df, is.character)
## # A tibble: 150 x 1
##    class      
##          
##  1 Iris-setosa
##  2 Iris-setosa
##  3 Iris-setosa
##  4 Iris-setosa
##  5 Iris-setosa
##  6 Iris-setosa
##  7 Iris-setosa
##  8 Iris-setosa
##  9 Iris-setosa
## 10 Iris-setosa
## # … with 140 more rows

How to change the order of the columns in a data frame

Often we need to reorder certain columns either for modeling or for printing tables. There are a few ways of doing this.

Move a column to the end

We can do so by using the select function and a special function in dplyr called everything.

For example, we will move the sepal_length column to the end:

select(iris_df, -sepal_length, everything(), sepal_length)
## # A tibble: 150 x 5
##    sepal_width petal_length petal_width class       sepal_length
##                                        
##  1         3.5          1.4         0.2 Iris-setosa          5.1
##  2         3            1.4         0.2 Iris-setosa          4.9
##  3         3.2          1.3         0.2 Iris-setosa          4.7
##  4         3.1          1.5         0.2 Iris-setosa          4.6
##  5         3.6          1.4         0.2 Iris-setosa          5  
##  6         3.9          1.7         0.4 Iris-setosa          5.4
##  7         3.4          1.4         0.3 Iris-setosa          4.6
##  8         3.4          1.5         0.2 Iris-setosa          5  
##  9         2.9          1.4         0.2 Iris-setosa          4.4
## 10         3.1          1.5         0.1 Iris-setosa          4.9
## # … with 140 more rows

We are excluding sepal_length at first, then adding other columns, and then added sepal_length at the end.

We can manually rearrange the columns too:

select(iris_df, sepal_width, petal_length, petal_width, class, sepal_length)
## # A tibble: 150 x 5
##    sepal_width petal_length petal_width class       sepal_length
##                                        
##  1         3.5          1.4         0.2 Iris-setosa          5.1
##  2         3            1.4         0.2 Iris-setosa          4.9
##  3         3.2          1.3         0.2 Iris-setosa          4.7
##  4         3.1          1.5         0.2 Iris-setosa          4.6
##  5         3.6          1.4         0.2 Iris-setosa          5  
##  6         3.9          1.7         0.4 Iris-setosa          5.4
##  7         3.4          1.4         0.3 Iris-setosa          4.6
##  8         3.4          1.5         0.2 Iris-setosa          5  
##  9         2.9          1.4         0.2 Iris-setosa          4.4
## 10         3.1          1.5         0.1 Iris-setosa          4.9
## # … with 140 more rows

or using the last_col function which selects the last column:

select(iris_df, 2:last_col(), sepal_length)
## # A tibble: 150 x 5
##    sepal_width petal_length petal_width class       sepal_length
##                                        
##  1         3.5          1.4         0.2 Iris-setosa          5.1
##  2         3            1.4         0.2 Iris-setosa          4.9
##  3         3.2          1.3         0.2 Iris-setosa          4.7
##  4         3.1          1.5         0.2 Iris-setosa          4.6
##  5         3.6          1.4         0.2 Iris-setosa          5  
##  6         3.9          1.7         0.4 Iris-setosa          5.4
##  7         3.4          1.4         0.3 Iris-setosa          4.6
##  8         3.4          1.5         0.2 Iris-setosa          5  
##  9         2.9          1.4         0.2 Iris-setosa          4.4
## 10         3.1          1.5         0.1 Iris-setosa          4.9
## # … with 140 more rows

Move the columns around or switch the positions

Here you can use the column names or indexes to change the column positions:

select(iris_df, sepal_width, class, sepal_length, petal_width, petal_length)
## # A tibble: 150 x 5
##    sepal_width class       sepal_length petal_width petal_length
##                                        
##  1         3.5 Iris-setosa          5.1         0.2          1.4
##  2         3   Iris-setosa          4.9         0.2          1.4
##  3         3.2 Iris-setosa          4.7         0.2          1.3
##  4         3.1 Iris-setosa          4.6         0.2          1.5
##  5         3.6 Iris-setosa          5           0.2          1.4
##  6         3.9 Iris-setosa          5.4         0.4          1.7
##  7         3.4 Iris-setosa          4.6         0.3          1.4
##  8         3.4 Iris-setosa          5           0.2          1.5
##  9         2.9 Iris-setosa          4.4         0.2          1.4
## 10         3.1 Iris-setosa          4.9         0.1          1.5
## # … with 140 more rows

with indexes:

select(iris_df, 2, 5, 1, 4, 3)
## # A tibble: 150 x 5
##    sepal_width class       sepal_length petal_width petal_length
##                                        
##  1         3.5 Iris-setosa          5.1         0.2          1.4
##  2         3   Iris-setosa          4.9         0.2          1.4
##  3         3.2 Iris-setosa          4.7         0.2          1.3
##  4         3.1 Iris-setosa          4.6         0.2          1.5
##  5         3.6 Iris-setosa          5           0.2          1.4
##  6         3.9 Iris-setosa          5.4         0.4          1.7
##  7         3.4 Iris-setosa          4.6         0.3          1.4
##  8         3.4 Iris-setosa          5           0.2          1.5
##  9         2.9 Iris-setosa          4.4         0.2          1.4
## 10         3.1 Iris-setosa          4.9         0.1          1.5
## # … with 140 more rows

Piping or chaining

The following examples will start getting complicated. Rather than operating some operations and saving the manipulated data frame into a new data frame, we can use the %>% pipe operator from the magrittr package. This operator carries forward the last manipulated data frame (or other objects) for the next operations. An example would help explain this.

Say we want to include only the columns starting with sepal and then we want to change their names to upper case using the toupper function.

Here’s how to do so by saving or creating a new data frame and then operating on the new data frame:

selected_iris_df <- select(iris_df, starts_with("sepal"))
selected_iris_df
## # A tibble: 150 x 2
##    sepal_length sepal_width
##                  
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows
rename_all(selected_iris_df, toupper)
## # A tibble: 150 x 2
##    SEPAL_LENGTH SEPAL_WIDTH
##                  
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

Here’s how to do the same thing with the pipe operator (%>%):

select(iris_df, starts_with("sepal")) %>% 
  rename_all(toupper)
## # A tibble: 150 x 2
##    SEPAL_LENGTH SEPAL_WIDTH
##                  
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

Note that the rename_all function here doesn’t need to know the data frame name, it automatically operates on the data fed by the pipe operator.

How to sort or order a data frame

Sorting or ordering a data frame is useful when you hope to run some operations based on the order of the data. Caution: most R practitioners would advise you against creating row-by-row operations because of the observed inefficiencies. Often there are better ways of meeting those needs. But again, when you are printing a table or aggregating some data, you may need to sort the data. Luckily, dplyr has arrange function.

Say we want to the top five rows from the Iris data frame defined by the smallest sepal length. We need two steps: 1) sort the data ascending sepal length, and 2) select the first five rows.

arrange(iris_df, sepal_length) %>% 
  head(n = 5)
## # A tibble: 5 x 5
##   sepal_length sepal_width petal_length petal_width class      
##                                       
## 1          4.3         3            1.1         0.1 Iris-setosa
## 2          4.4         2.9          1.4         0.2 Iris-setosa
## 3          4.4         3            1.3         0.2 Iris-setosa
## 4          4.4         3.2          1.3         0.2 Iris-setosa
## 5          4.5         2.3          1.3         0.3 Iris-setosa

If we want to sort sepal length in a descending order, we can use the desc function, like this:

arrange(iris_df, desc(sepal_length)) %>% 
  head(n = 5)
## # A tibble: 5 x 5
##   sepal_length sepal_width petal_length petal_width class         
##                                          
## 1          7.9         3.8          6.4         2   Iris-virginica
## 2          7.7         3.8          6.7         2.2 Iris-virginica
## 3          7.7         2.6          6.9         2.3 Iris-virginica
## 4          7.7         2.8          6.7         2   Iris-virginica
## 5          7.7         3            6.1         2.3 Iris-virginica

Select the top or bottom ten (or n) rows

While we can use the arrange function to select the top rows, dplyr provides top_n function for this specific purpose.

Here’s how to get the bottom five rows by smallest sepal length using the desc function:

top_n(iris_df, n = 5, wt = desc(sepal_length))
## # A tibble: 5 x 5
##   sepal_length sepal_width petal_length petal_width class      
##                                       
## 1          4.4         2.9          1.4         0.2 Iris-setosa
## 2          4.3         3            1.1         0.1 Iris-setosa
## 3          4.4         3            1.3         0.2 Iris-setosa
## 4          4.5         2.3          1.3         0.3 Iris-setosa
## 5          4.4         3.2          1.3         0.2 Iris-setosa

You can also use a negative sign to denote the bottom rows:

top_n(iris_df, n = -5, wt = sepal_length)
## # A tibble: 5 x 5
##   sepal_length sepal_width petal_length petal_width class      
##                                       
## 1          4.4         2.9          1.4         0.2 Iris-setosa
## 2          4.3         3            1.1         0.1 Iris-setosa
## 3          4.4         3            1.3         0.2 Iris-setosa
## 4          4.5         2.3          1.3         0.3 Iris-setosa
## 5          4.4         3.2          1.3         0.2 Iris-setosa

Here’s how to get the top five rows by largest sepal length:

top_n(iris_df, n = 5, wt = sepal_length)
## # A tibble: 5 x 5
##   sepal_length sepal_width petal_length petal_width class         
##                                          
## 1          7.7         3.8          6.7         2.2 Iris-virginica
## 2          7.7         2.6          6.9         2.3 Iris-virginica
## 3          7.7         2.8          6.7         2   Iris-virginica
## 4          7.9         3.8          6.4         2   Iris-virginica
## 5          7.7         3            6.1         2.3 Iris-virginica

Note that the arrange function assumes you want to sort the data in ascending order, whereas the top_n function assumes you want to get the data in descending order.

How to create groups or discretize data

Say we want to convert a numeric range to groups such as 0-5, 6-10, and so on. We can use the cut function to create such groups and labels.

For example:

simple_df <- data.frame(x = 1:20)
simple_df$x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
cut(simple_df$x, breaks = c(0, 5, 10, 15, Inf), 
    labels = c("0-5", "6-10", "11-15", "15+"), 
    include.lowest = TRUE)
##  [1] 0-5   0-5   0-5   0-5   0-5   6-10  6-10  6-10  6-10  6-10  11-15 11-15
## [13] 11-15 11-15 11-15 15+   15+   15+   15+   15+  
## Levels: 0-5 6-10 11-15 15+

But to add this new variable called y to the existing data frame we can use the following syntax:

simple_df$y <- cut(simple_df$x, 
                   breaks = c(0, 5, 10, 15, Inf), 
                   labels = c("0-5", "6-10", "11-15", "15+"), 
                   include.lowest = TRUE)

simple_df
##     x     y
## 1   1   0-5
## 2   2   0-5
## 3   3   0-5
## 4   4   0-5
## 5   5   0-5
## 6   6  6-10
## 7   7  6-10
## 8   8  6-10
## 9   9  6-10
## 10 10  6-10
## 11 11 11-15
## 12 12 11-15
## 13 13 11-15
## 14 14 11-15
## 15 15 11-15
## 16 16   15+
## 17 17   15+
## 18 18   15+
## 19 19   15+
## 20 20   15+

Or we can use the mutate function from the dplyr package to create a new variable, like this:

simple_df <- data.frame(x = 1:20)

simple_df <- mutate(simple_df,
                    y = cut(x, 
                            breaks = c(0, 5, 10, 15, Inf), 
                            labels = c("0-5", "6-10", "11-15", "15+"), 
                            include.lowest = TRUE))
                    
simple_df                   
##     x     y
## 1   1   0-5
## 2   2   0-5
## 3   3   0-5
## 4   4   0-5
## 5   5   0-5
## 6   6  6-10
## 7   7  6-10
## 8   8  6-10
## 9   9  6-10
## 10 10  6-10
## 11 11 11-15
## 12 12 11-15
## 13 13 11-15
## 14 14 11-15
## 15 15 11-15
## 16 16   15+
## 17 17   15+
## 18 18   15+
## 19 19   15+
## 20 20   15+

We can create custom functions to create these breaks and labels for us, so we don’t have to type in all the labels. Caution: these are very fragile functions only based on this use case. You will get errors or worse wrong results if you test it for some other data.

custom_breaks <- function(x, space_by = 5){
  c(seq(from = 0, to = max(x) - space_by, by = space_by), Inf)
}

custom_labels <- function(x, space_by = 5){
 brks <- custom_breaks(x, space_by = space_by)
 lbls <- paste0(brks[1:(length(brks)-1)], "-", brks[2:length(brks)])
 lbls[length(lbls)] <- paste0(brks[length(brks)-1], "+")
 lbls
}
mutate(simple_df,
  y = cut(x,
    breaks = custom_breaks(x),
    labels = custom_labels(x),
    include.lowest = TRUE
  )
)
##     x     y
## 1   1   0-5
## 2   2   0-5
## 3   3   0-5
## 4   4   0-5
## 5   5   0-5
## 6   6  5-10
## 7   7  5-10
## 8   8  5-10
## 9   9  5-10
## 10 10  5-10
## 11 11 10-15
## 12 12 10-15
## 13 13 10-15
## 14 14 10-15
## 15 15 10-15
## 16 16   15+
## 17 17   15+
## 18 18   15+
## 19 19   15+
## 20 20   15+

How to create quintile groups

dplyr provides some windowed rank functions to help us quickly create new columns based on percentile or percentile groups.

For example, let’s say we want to create four percentile groups for sepal width.

mutate(iris_df, s_w_rank = ntile(sepal_width, 4))
## # A tibble: 150 x 6
##    sepal_length sepal_width petal_length petal_width class       s_w_rank
##                                            
##  1          5.1         3.5          1.4         0.2 Iris-setosa        4
##  2          4.9         3            1.4         0.2 Iris-setosa        2
##  3          4.7         3.2          1.3         0.2 Iris-setosa        3
##  4          4.6         3.1          1.5         0.2 Iris-setosa        3
##  5          5           3.6          1.4         0.2 Iris-setosa        4
##  6          5.4         3.9          1.7         0.4 Iris-setosa        4
##  7          4.6         3.4          1.4         0.3 Iris-setosa        4
##  8          5           3.4          1.5         0.2 Iris-setosa        4
##  9          4.4         2.9          1.4         0.2 Iris-setosa        2
## 10          4.9         3.1          1.5         0.1 Iris-setosa        3
## # … with 140 more rows

How to extract or split string columns

Sometimes data from different columns gets clubbed in one column. The separate function from the tidyr package comes is helpful in such cases.

In this example below, we will split or separate a character column into two columns using the hyphen as a separator or delimiter.

library(tidyr)

string_df <- data.frame(imp_info = paste0(LETTERS[1:5], "-", 1:5))
head(string_df)
##   imp_info
## 1      A-1
## 2      B-2
## 3      C-3
## 4      D-4
## 5      E-5
separate(data = string_df,
         col = imp_info,
         sep = "-",
         into = c("letter", "number"),
         remove = FALSE)
##   imp_info letter number
## 1      A-1      A      1
## 2      B-2      B      2
## 3      C-3      C      3
## 4      D-4      D      4
## 5      E-5      E      5

This function is very powerful as it can take regular expressions as delimiters, which opens up many possibilities. Also, there are many options to handle missing values and filling or selecting columns.

For example, let’s extract the area code of a US-based phone number by setting the other columns as NA.

# generate random phone number digits
us_phone_numbers <- data.frame(phone_number = paste(round(runif(5, min = 111, max = 999)),
                       round(runif(5, min = 1111, max = 9999)),
                       round(runif(5, min = 1111, max = 9999)),
                       sep = "-"))

head(us_phone_numbers)
##    phone_number
## 1 736-1736-8194
## 2 147-9024-4773
## 3 414-4239-9121
## 4 417-3928-7203
## 5 294-4495-8930
separate(us_phone_numbers, col = phone_number, into = c("area_code", NA, NA), sep = "-")
##   area_code
## 1       736
## 2       147
## 3       414
## 4       417
## 5       294

How to identify the presence of certain text

Often we care about the presence of certain text and creating a logical indicator or flag is useful in those cases. We have many options to achieve this. Let’s look at a few methods.

We want to check whether CA appears in the city, state name column in the following data frame. We will use str_detect from the stringr package.

library(stringr)
some_cities <- head(select(maps::us.cities, name), n =  10)
some_cities
##              name
## 1      Abilene TX
## 2        Akron OH
## 3      Alameda CA
## 4       Albany GA
## 5       Albany NY
## 6       Albany OR
## 7  Albuquerque NM
## 8   Alexandria LA
## 9   Alexandria VA
## 10    Alhambra CA
mutate(some_cities, is_CA = str_detect(name, "CA"))
##              name is_CA
## 1      Abilene TX FALSE
## 2        Akron OH FALSE
## 3      Alameda CA  TRUE
## 4       Albany GA FALSE
## 5       Albany NY FALSE
## 6       Albany OR FALSE
## 7  Albuquerque NM FALSE
## 8   Alexandria LA FALSE
## 9   Alexandria VA FALSE
## 10    Alhambra CA  TRUE

If you wanted to check for multiple states, you can use regular expression patterns. For example, we want to check for TX and CA.

mutate(some_cities, is_CA_TX = str_detect(name, "CA|TX"))
##              name is_CA_TX
## 1      Abilene TX     TRUE
## 2        Akron OH    FALSE
## 3      Alameda CA     TRUE
## 4       Albany GA    FALSE
## 5       Albany NY    FALSE
## 6       Albany OR    FALSE
## 7  Albuquerque NM    FALSE
## 8   Alexandria LA    FALSE
## 9   Alexandria VA    FALSE
## 10    Alhambra CA     TRUE

How to create or make a new data frame

You rarely need to create a blank data frame. New programmers in R who are used to other languages are used to creating empty objects which are filled in loops. Most likely, that will not be the case in R. But creating a new data frame based on the existing data frame is easy: you simply assign the output of a statement to a new data frame. You saw some examples already, but let’s say we want to store the cities data frame in a new data frame, here’s one way to do so:

my_super_awesome_new_df <- some_cities

That’s it! It’s that simple.

There’s another operator from the magrittr package that’s useful when dealing with pipes or chains. That operator is: %<>% i.e. it pushes the data forward but also assigns it back. We will select a few columns from the iris_df_names_changed data frame and overwrite it.

library(magrittr)
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 5
## $ sepal.length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ petal.width   0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ class         "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setos…
iris_df_names_changed %<>% select(1:3)
glimpse(iris_df_names_changed)
## Rows: 150
## Columns: 3
## $ sepal.length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ sepal.width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ petal.length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…

How to convert a vector into a data frame

Perhaps you noticed above, but we saw many examples of vectors stored in a data frame. Let’s say we want to create a new data frame from the LETTERS vector.

head(LETTERS)
## [1] "A" "B" "C" "D" "E" "F"
new_letters_df <- data.frame(upper_case_leter = head(LETTERS))
str(new_letters_df)
## 'data.frame':    6 obs. of  1 variable:
##  $ upper_case_leter: chr  "A" "B" "C" "D" ...

If we want to add two different vectors of the same length, we can simply add another column to the new data frame.

head(letters)
## [1] "a" "b" "c" "d" "e" "f"
new_letters_df <- data.frame(upper_case_leter = head(LETTERS),
                             lower_case_letter = head(letters))
str(new_letters_df)
## 'data.frame':    6 obs. of  2 variables:
##  $ upper_case_leter : chr  "A" "B" "C" "D" ...
##  $ lower_case_letter: chr  "a" "b" "c" "d" ...

If you need to stack different vectors in one column, you can simply use c to combine the vectors.

new_letters_df <- data.frame(mixed_case_leter = c(head(LETTERS), head(letters)))
glimpse(new_letters_df)
## Rows: 12
## Columns: 1
## $ mixed_case_leter  "A", "B", "C", "D", "E", "F", "a", "b", "c", "d", "e…

How to create subsets of a data frame

We can create subsets or filter data frames using the filter function from the dplyr package. For example, say we want to create a new data frame for California cities from the us.cities data frame.

Here’s one way to do so:

library(maps)
glimpse(us.cities)
## Rows: 1,005
## Columns: 6
## $ name         "Abilene TX", "Akron OH", "Alameda CA", "Albany GA", "Alb…
## $ country.etc  "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "VA", "CA…
## $ pop          113888, 206634, 70069, 75510, 93576, 45535, 494962, 44933…
## $ lat          32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 31.29, 3…
## $ long         -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, -106.62…
## $ capital      0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
filter(us.cities, country.etc == 'CA') %>% 
  glimpse()
## Rows: 208
## Columns: 6
## $ name         "Alameda CA", "Alhambra CA", "Aliso Viejo CA", "Altadena …
## $ country.etc  "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ pop          70069, 88857, 41975, 43280, 334909, 109485, 66361, 57429,…
## $ lat          37.77, 34.08, 33.57, 34.19, 33.84, 37.99, 34.53, 34.13, 3…
## $ long         -122.26, -118.13, -117.73, -118.13, -117.87, -121.80, -11…
## $ capital      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Note we need to equal signs to compare value. If you use a single equal to sign, which is reserved for assignment, you will likely get an error.

Let’s say we want to subset or filter the data frame for CA and TX. You can use the %in% operator to do so:

glimpse(us.cities)
## Rows: 1,005
## Columns: 6
## $ name         "Abilene TX", "Akron OH", "Alameda CA", "Albany GA", "Alb…
## $ country.etc  "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "VA", "CA…
## $ pop          113888, 206634, 70069, 75510, 93576, 45535, 494962, 44933…
## $ lat          32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 31.29, 3…
## $ long         -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, -106.62…
## $ capital      0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
filter(us.cities, country.etc %in% c('CA', 'TX')) %>% 
  glimpse()
## Rows: 279
## Columns: 6
## $ name         "Abilene TX", "Alameda CA", "Alhambra CA", "Aliso Viejo C…
## $ country.etc  "TX", "CA", "CA", "CA", "TX", "CA", "TX", "CA", "CA", "CA…
## $ pop          113888, 70069, 88857, 41975, 84467, 43280, 183295, 334909…
## $ lat          32.45, 37.77, 34.08, 33.57, 33.11, 34.19, 35.20, 33.84, 3…
## $ long         -99.74, -122.26, -118.13, -117.73, -96.67, -118.13, -101.…
## $ capital      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, …

You can use filters on numeric columns also. For example, we want to select rows from the iris data frame where the sepal length is greater than the median.

summary(iris_df)
##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##     class          
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
nrow(iris_df) # check the number of rows
## [1] 150
filtered_iris_df <- filter(iris_df, sepal_length > median(sepal_length))
summary(filtered_iris_df)
##   sepal_length   sepal_width     petal_length    petal_width   
##  Min.   :5.90   Min.   :2.200   Min.   :4.000   Min.   :1.000  
##  1st Qu.:6.20   1st Qu.:2.800   1st Qu.:4.700   1st Qu.:1.500  
##  Median :6.45   Median :3.000   Median :5.100   Median :1.800  
##  Mean   :6.58   Mean   :2.970   Mean   :5.239   Mean   :1.811  
##  3rd Qu.:6.80   3rd Qu.:3.175   3rd Qu.:5.700   3rd Qu.:2.100  
##  Max.   :7.90   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##     class          
##  Length:70         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
nrow(filtered_iris_df)
## [1] 70

How to add, subtract or multiply columns of a data frame

The mutate family functions are very useful when performing simple or complex operations on the columns of a data frame. Let’s say we want to multiply all the numeric columns of the iris data frame by 10. Here’s how we can do so:

iris_df %>% mutate_if(is.numeric, ~.*10)
## # A tibble: 150 x 5
##    sepal_length sepal_width petal_length petal_width class      
##                                        
##  1           51          35           14           2 Iris-setosa
##  2           49          30           14           2 Iris-setosa
##  3           47          32           13           2 Iris-setosa
##  4           46          31           15           2 Iris-setosa
##  5           50          36           14           2 Iris-setosa
##  6           54          39           17           4 Iris-setosa
##  7           46          34           14           3 Iris-setosa
##  8           50          34           15           2 Iris-setosa
##  9           44          29           14           2 Iris-setosa
## 10           49          31           15           1 Iris-setosa
## # … with 140 more rows

The tilde (~) sign here signifies all columns selected by the mutate function.

If we want to create new columns instead of overwriting, here’s one way:

iris_df %>% mutate_if(is.numeric, list(multiplied = function(x) x*5)) %>% glimpse()
## Rows: 150
## Columns: 9
## $ sepal_length             5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width              3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length             1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width              0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class                    "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied  25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ sepal_width_multiplied   17.5, 15.0, 16.0, 15.5, 18.0, 19.5, 17.0, 17.…
## $ petal_length_multiplied  7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …
## $ petal_width_multiplied   1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.5, 1.0, 1.0, …

We can do multiple operations at once. For example, we will also add and subtract 20.

iris_df %>%
  mutate_if(
    is.numeric,
    list(
      multiplied = function(x) x * 5,
      added = function(x) x + 20,
      subtracted = function(x) x - 20
    )
  ) %>%
  glimpse()
## Rows: 150
## Columns: 17
## $ sepal_length             5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width              3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length             1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width              0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class                    "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied  25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ sepal_width_multiplied   17.5, 15.0, 16.0, 15.5, 18.0, 19.5, 17.0, 17.…
## $ petal_length_multiplied  7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …
## $ petal_width_multiplied   1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.5, 1.0, 1.0, …
## $ sepal_length_added       25.1, 24.9, 24.7, 24.6, 25.0, 25.4, 24.6, 25.…
## $ sepal_width_added        23.5, 23.0, 23.2, 23.1, 23.6, 23.9, 23.4, 23.…
## $ petal_length_added       21.4, 21.4, 21.3, 21.5, 21.4, 21.7, 21.4, 21.…
## $ petal_width_added        20.2, 20.2, 20.2, 20.2, 20.2, 20.4, 20.3, 20.…
## $ sepal_length_subtracted  -14.9, -15.1, -15.3, -15.4, -15.0, -14.6, -15…
## $ sepal_width_subtracted   -16.5, -17.0, -16.8, -16.9, -16.4, -16.1, -16…
## $ petal_length_subtracted  -18.6, -18.6, -18.7, -18.5, -18.6, -18.3, -18…
## $ petal_width_subtracted   -19.8, -19.8, -19.8, -19.8, -19.8, -19.6, -19…

Instead of applying an operation to all numeric columns, we can select columns such as the ones ending with length.

iris_df %>%
  mutate_at(vars(ends_with("length")),
            list(multiplied = function(x) x * 5)) %>%
  glimpse()
## Rows: 150
## Columns: 7
## $ sepal_length             5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ sepal_width              3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ petal_length             1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ petal_width              0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ class                    "Iris-setosa", "Iris-setosa", "Iris-setosa", …
## $ sepal_length_multiplied  25.5, 24.5, 23.5, 23.0, 25.0, 27.0, 23.0, 25.…
## $ petal_length_multiplied  7.0, 7.0, 6.5, 7.5, 7.0, 8.5, 7.0, 7.5, 7.0, …

One benefit of mutate is that you can use the newly created variable immediately.

For example:

squares_df <- data.frame(x = 1:10)

mutate(squares_df, 
       x_squared = x^2,
       sqrt_of_x_squre = x_squared^(1/2), # note we're using x_squared
       is_x_squared_even = x_squared %% 2 == 0) 
##     x x_squared sqrt_of_x_squre is_x_squared_even
## 1   1         1               1             FALSE
## 2   2         4               2              TRUE
## 3   3         9               3             FALSE
## 4   4        16               4              TRUE
## 5   5        25               5             FALSE
## 6   6        36               6              TRUE
## 7   7        49               7             FALSE
## 8   8        64               8              TRUE
## 9   9        81               9             FALSE
## 10 10       100              10              TRUE

How to use if, then, and else for data frame calculations

So far we have seen simple use cases of data manipulation, but in real-world problems, we get challenging problems. Often we need to create a new variable based on a series of conditions.

Let’s look at an if and else statement. We will create a new column with a value of yes or no when the sepal width exceeds 3 and the petal width is less than 0.3.

mutate(iris_df,
       does_it_meet_cond = ifelse(sepal_width > 3 & petal_width < 0.3,
                                  yes = "yes",
                                  no = "no")) %>% 
  filter(does_it_meet_cond == 'yes')
## # A tibble: 28 x 6
##    sepal_length sepal_width petal_length petal_width class      does_it_meet_co…
##                                                   
##  1          5.1         3.5          1.4         0.2 Iris-seto… yes             
##  2          4.7         3.2          1.3         0.2 Iris-seto… yes             
##  3          4.6         3.1          1.5         0.2 Iris-seto… yes             
##  4          5           3.6          1.4         0.2 Iris-seto… yes             
##  5          5           3.4          1.5         0.2 Iris-seto… yes             
##  6          4.9         3.1          1.5         0.1 Iris-seto… yes             
##  7          5.4         3.7          1.5         0.2 Iris-seto… yes             
##  8          4.8         3.4          1.6         0.2 Iris-seto… yes             
##  9          5.8         4            1.2         0.2 Iris-seto… yes             
## 10          5.4         3.4          1.7         0.2 Iris-seto… yes             
## # … with 18 more rows

We can also use case_when from the dplyr package to test multiple conditions. In this simple example, we will recreate the breaks we created earlier using the cut function. We will also use the between function to save us some typing.

glimpse(simple_df) 
## Rows: 20
## Columns: 2
## $ x  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
## $ y  0-5, 0-5, 0-5, 0-5, 0-5, 6-10, 6-10, 6-10, 6-10, 6-10, 11-15, 11-15…
mutate(simple_df, brks = case_when(
  between(x, 0, 5) ~ '0-5',
  between(x, 6, 10) ~ '6-10',
  between(x, 11, 15) ~ '11-15',
  x > 15 ~ '15+'
))
##     x     y  brks
## 1   1   0-5   0-5
## 2   2   0-5   0-5
## 3   3   0-5   0-5
## 4   4   0-5   0-5
## 5   5   0-5   0-5
## 6   6  6-10  6-10
## 7   7  6-10  6-10
## 8   8  6-10  6-10
## 9   9  6-10  6-10
## 10 10  6-10  6-10
## 11 11 11-15 11-15
## 12 12 11-15 11-15
## 13 13 11-15 11-15
## 14 14 11-15 11-15
## 15 15 11-15 11-15
## 16 16   15+   15+
## 17 17   15+   15+
## 18 18   15+   15+
## 19 19   15+   15+
## 20 20   15+   15+

We can use complicated conditions to create our new variable. We will use the mtcars data set to create some made-up conditions for car segments.

glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg   21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
## $ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
## $ disp  160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
## $ hp    110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
## $ drat  3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
## $ wt    2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
## $ qsec  16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
## $ vs    0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
## $ am    1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
## $ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
## $ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…
mutate(mtcars, car_segment = case_when(
  mpg > 20 & cyl > 5 & hp > 100 ~ "fancy car",
  mpg > 15 & cyl > 7 & hp > 150 ~ "power car",
  mpg > 14 & disp > 200 & hp > 200 ~ "strange car",
  TRUE ~ "others"
)) %>% head(10)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb car_segment
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   fancy car
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   fancy car
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      others
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   fancy car
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   power car
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      others
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 strange car
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      others
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2      others
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4      others

How to join, union, merge or append data frames using bind functions

Often we need to combine different data frames either by joining them or stacking them. Let’s look at some common use cases.

Join two data frames

Let’s say we have a product information data frame and a customer information data frame. We need to join these two data frames. dplyr provides convenient join functions to do so.

First, let’s create the data sets.

products_df <- data.frame(product = LETTERS[1:5],
                          price = round(runif(5)*10, 2))

glimpse(products_df)
## Rows: 5
## Columns: 2
## $ product  "A", "B", "C", "D", "E"
## $ price    5.96, 6.25, 1.65, 0.28, 0.02
customers_df <- data.frame(cust_name = randomNames::randomNames(10),
                           state = sample(state.abb, 10, replace = TRUE),
                           products_purchased = sample(LETTERS[1:3], 10, replace = TRUE))
glimpse(customers_df)
## Rows: 10
## Columns: 3
## $ cust_name           "Ward, Sarah", "el-Din, Munisa", "Tran, Zachary", …
## $ state               "DE", "ME", "ND", "OH", "IA", "AK", "MO", "IN", "U…
## $ products_purchased  "A", "C", "C", "C", "A", "A", "B", "B", "B", "B"

If we want to find out product information of the products purchased by customers, we will need to use an inner join.

Like this:

# Don't run. 
# This will give you an error since we haven't specified columns to join with
inner_join(products_df, customers_df) 

By default, the join functions will join data frames using the common variables found in both the data frames. But in our example, we have product column in one and products_purchased in another. We will need to use the by argument of the join function to specify the column names. This argument takes a character vector as an input.

Like this:

inner_join(products_df, 
           customers_df, 
           by = c("product" = "products_purchased")) 
##    product price             cust_name state
## 1        A  5.96           Ward, Sarah    DE
## 2        A  5.96          Dedios, Deja    IA
## 3        A  5.96      Lawrence, Lauren    AK
## 4        B  6.25       Brock, Tre'Sean    MO
## 5        B  6.25   Sullivan, Dominiqua    IN
## 6        B  6.25           Lee, Deanna    UT
## 7        B  6.25      Barber, Margaret    VT
## 8        C  1.65        el-Din, Munisa    ME
## 9        C  1.65         Tran, Zachary    ND
## 10       C  1.65 Wilfahrt, Katie-Lynne    OH

If you need to join data frames with multiple columns, you can specify those columns in the by argument like this: by = c("df.1.col1" = "df.2.col1", "df.1.col2" = "df.2.col2").

If we need to see all the products even if there are no customers for them, we will need to use a left join with the product data frame on the left side of the join.

left_join(products_df, 
           customers_df, 
           by = c("product" = "products_purchased")) 
##    product price             cust_name state
## 1        A  5.96           Ward, Sarah    DE
## 2        A  5.96          Dedios, Deja    IA
## 3        A  5.96      Lawrence, Lauren    AK
## 4        B  6.25       Brock, Tre'Sean    MO
## 5        B  6.25   Sullivan, Dominiqua    IN
## 6        B  6.25           Lee, Deanna    UT
## 7        B  6.25      Barber, Margaret    VT
## 8        C  1.65        el-Din, Munisa    ME
## 9        C  1.65         Tran, Zachary    ND
## 10       C  1.65 Wilfahrt, Katie-Lynne    OH
## 11       D  0.28                    
## 12       E  0.02                    

Similarly, we can right_join to show all the rows from the data frame on the right and only matching rows from the data frame on the left.

Create a union of data frames

If you need to simple stack rows from different or multiple data frames, you can use the bind_rows function from the dplyr package. This function offers a lot of flexibility to handle the mismatches in column names and missing values.

Let’s create some data frames first.

first_df <- data.frame(segment = LETTERS[1:5],
                       price = runif(5)*10)

second_df <- data.frame(segment = LETTERS[6:8],
                       price = runif(3)*10,
                       rating = 3:1)

We know that both of these data frames have segment and price in common, but the second data frame has an additional column called rating.

Let’s stack these two or create an union:

bind_rows(first_df, second_df)
##   segment     price rating
## 1       A 8.5454397     NA
## 2       B 5.9386813     NA
## 3       C 8.7418077     NA
## 4       D 2.2458716     NA
## 5       E 2.0619303     NA
## 6       F 5.7534880      3
## 7       G 0.3834018      2
## 8       H 4.3073954      1

You’ll see that the rating column has missing values for the rows from the first data frame. You can use the .id argument to create another column to show which data frame that row came from.

bind_rows(first_df, second_df, .id = "df_id")
##   df_id segment     price rating
## 1     1       A 8.5454397     NA
## 2     1       B 5.9386813     NA
## 3     1       C 8.7418077     NA
## 4     1       D 2.2458716     NA
## 5     1       E 2.0619303     NA
## 6     2       F 5.7534880      3
## 7     2       G 0.3834018      2
## 8     2       H 4.3073954      1

Append columns

Often the best way to handle this use case i.e. combining columns of different data frames is best handled with joins. But if you know that either the rows from different data frames belong to the same record and no sorting has taken place, or the results are inconsequential, you can use the bind_cols function. These data frames must contain the same number of rows.

For example, you will get an error if you run this:

# Don't run
bind_cols(first_df, second_df)

But you will be successful when you run this.

bind_cols(first_df[1:3, ], second_df)
##   segment    price segment1    price1 rating
## 1       A 8.545440        F 5.7534880      3
## 2       B 5.938681        G 0.3834018      2
## 3       C 8.741808        H 4.3073954      1

How to create a random sample

Sometimes you may want to work with a smaller sample of a data frame. If you want to select just a few rows, not random, you can use head or tail functions. By default, they select six rows; you can increase the number with the n argument.

# top six rows
head(iris_df)
## # A tibble: 6 x 5
##   sepal_length sepal_width petal_length petal_width class      
##                                       
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3            1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5           3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa
# bottom six rows
tail(iris_df)
## # A tibble: 6 x 5
##   sepal_length sepal_width petal_length petal_width class         
##                                          
## 1          6.7         3.3          5.7         2.5 Iris-virginica
## 2          6.7         3            5.2         2.3 Iris-virginica
## 3          6.3         2.5          5           1.9 Iris-virginica
## 4          6.5         3            5.2         2   Iris-virginica
## 5          6.2         3.4          5.4         2.3 Iris-virginica
## 6          5.9         3            5.1         1.8 Iris-virginica
# top ten rows
head(iris_df, n = 10)
## # A tibble: 10 x 5
##    sepal_length sepal_width petal_length petal_width class      
##                                        
##  1          5.1         3.5          1.4         0.2 Iris-setosa
##  2          4.9         3            1.4         0.2 Iris-setosa
##  3          4.7         3.2          1.3         0.2 Iris-setosa
##  4          4.6         3.1          1.5         0.2 Iris-setosa
##  5          5           3.6          1.4         0.2 Iris-setosa
##  6          5.4         3.9          1.7         0.4 Iris-setosa
##  7          4.6         3.4          1.4         0.3 Iris-setosa
##  8          5           3.4          1.5         0.2 Iris-setosa
##  9          4.4         2.9          1.4         0.2 Iris-setosa
## 10          4.9         3.1          1.5         0.1 Iris-setosa

To get a sample, however, you can use sample_n or sample_frac functions from dplyr. With sample_n you can select a fixed number of randomly selected rows and with sample_frac you can select a percent of records, say 10% or 15%.

# ten random rows
sample_n(iris_df,  size = 10)
## # A tibble: 10 x 5
##    sepal_length sepal_width petal_length petal_width class          
##                                            
##  1          5.1         3.8          1.9         0.4 Iris-setosa    
##  2          6.3         2.5          4.9         1.5 Iris-versicolor
##  3          4.9         3.1          1.5         0.1 Iris-setosa    
##  4          5           2            3.5         1   Iris-versicolor
##  5          4.5         2.3          1.3         0.3 Iris-setosa    
##  6          5.9         3.2          4.8         1.8 Iris-versicolor
##  7          6.6         3            4.4         1.4 Iris-versicolor
##  8          4.8         3.1          1.6         0.2 Iris-setosa    
##  9          6.4         2.7          5.3         1.9 Iris-virginica 
## 10          5.8         2.8          5.1         2.4 Iris-virginica
# 10% randomly selected rows 
# there are 150 rows in the Iris data frame
sample_frac(iris_df,  size = 0.1)
## # A tibble: 15 x 5
##    sepal_length sepal_width petal_length petal_width class          
##                                            
##  1          5.8         2.7          3.9         1.2 Iris-versicolor
##  2          4.4         3.2          1.3         0.2 Iris-setosa    
##  3          5.6         2.5          3.9         1.1 Iris-versicolor
##  4          6.4         2.7          5.3         1.9 Iris-virginica 
##  5          6.5         3            5.8         2.2 Iris-virginica 
##  6          5.4         3.9          1.3         0.4 Iris-setosa    
##  7          5.4         3.4          1.5         0.4 Iris-setosa    
##  8          6.2         2.2          4.5         1.5 Iris-versicolor
##  9          6.3         2.9          5.6         1.8 Iris-virginica 
## 10          5.6         3            4.5         1.5 Iris-versicolor
## 11          5.8         2.6          4           1.2 Iris-versicolor
## 12          5.1         3.3          1.7         0.5 Iris-setosa    
## 13          4.6         3.1          1.5         0.2 Iris-setosa    
## 14          6.4         2.8          5.6         2.2 Iris-virginica 
## 15          6.4         3.2          4.5         1.5 Iris-versicolor

These functions have additional useful arguments such as replace to re-sample or show duplicate rows and weight to prefer some rows over others.

How to efficiently loop through each row

As I mentioned earlier, if you are thinking of looping through each row of a data frame, you should reconsider your problem. Often there is a vectorized and efficient approach. But when you must there are a few ways of doing this.

Use a loop like this:

small_iris_df <- head(iris_df, 10)

for (i in seq_len(nrow(small_iris_df))){
  cat(paste(i, "row of sepal_length has the value of", small_iris_df[i, 1][[1]]))
  cat("\n")
}
## 1 row of sepal_length has the value of 5.1
## 2 row of sepal_length has the value of 4.9
## 3 row of sepal_length has the value of 4.7
## 4 row of sepal_length has the value of 4.6
## 5 row of sepal_length has the value of 5
## 6 row of sepal_length has the value of 5.4
## 7 row of sepal_length has the value of 4.6
## 8 row of sepal_length has the value of 5
## 9 row of sepal_length has the value of 4.4
## 10 row of sepal_length has the value of 4.9

You can also use the rowwise function from dplyr.

small_iris_df <- head(iris_df, 10)

rowwise(small_iris_df) %>% 
  do(something = paste("sepal_length has the value of", .$sepal_length)) %>% 
  .$something
## [[1]]
## [1] "sepal_length has the value of 5.1"
## 
## [[2]]
## [1] "sepal_length has the value of 4.9"
## 
## [[3]]
## [1] "sepal_length has the value of 4.7"
## 
## [[4]]
## [1] "sepal_length has the value of 4.6"
## 
## [[5]]
## [1] "sepal_length has the value of 5"
## 
## [[6]]
## [1] "sepal_length has the value of 5.4"
## 
## [[7]]
## [1] "sepal_length has the value of 4.6"
## 
## [[8]]
## [1] "sepal_length has the value of 5"
## 
## [[9]]
## [1] "sepal_length has the value of 4.4"
## 
## [[10]]
## [1] "sepal_length has the value of 4.9"

Again, there are better ways of handling going through each row of a data frame, especially using the apply family functions from base R or group_by and summarize functions from dplyr.

How to write or save a data frame to Excel or CSV

After the data manipulation and aggregation, you may want to save the data frame as a CSV or an Excel file. Here’s how we can do so:

Save as a CSV: we can use write_csv function from the readr package like this:

write_csv(sample_n(iris_df,  size = 10), 
          path = "my_sampled_iris_data.csv")

Save as an Excel: we can use write.xlsx function from the xlsx package like this:

library(xlsx)
write.xlsx(sample_n(iris_df,  size = 10), 
          path = "my_sampled_iris_data.xlsx")

This library requires Java and will not work if you don’t have JVM on your computer. You can use the openxlsx package if you run into issues with Java.

library(openxlsx)

openxlsx::write.xlsx(sample_n(iris_df,  size = 10), 
                     file = "my_sampled_iris_data.xlsx")

My R version and package information

R version 4.0.0 (2020-04-24)

OS: macOS Catalina 10.15.4

Attached packages

maps version 3.3.0

magrittr version 1.5

stringr version 1.4.0

tidyr version 1.1.0

dplyr version 0.8.5

readr version 1.3.1

Some Books You May Find Useful

Last update on 2020-09-25 / Affiliate links / Images from Amazon Product Advertising API

>