Step by Step Tutorial: Deep Learning with TensorFlow in R

Deep Learning with TensorFlow

Deep learning, also known as deep structured learning or hierarchical learning, is a type of machine learning that focuses on learning representations of the data rather than on task-specific algorithms. This representation learning, also known as feature learning, can be supervised, semi-supervised, or unsupervised.

Deep learning architectures include deep neural networks, deep belief networks and recurrent neural networks. Real-world applications using deep learning include computer vision, speech recognition, machine translation, natural language processing, and image recognition.

The following recipe introduces how to implement a deep neural network using TensorFlow, which is an open source software library, originally developed at Google, for complex computation by constructing network graphs of mathematical operations and data (Abadi et al. 2016; Cheng et al. 2017). Tang et al. (2017) developed an R interface to the TensorFlow API for our use.

A deep neural network is a neural network with multiple hidden layers. The extra layers add complexity to the model, but they also allow the network to learn the underlying patterns in the data.
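To make the idea of stacked hidden layers concrete, here is a minimal base R sketch of a forward pass through a tiny network. It is not how TensorFlow implements things internally; the layer sizes, the random weights, and the helper names (relu, sigmoid, forward) are all assumptions chosen for illustration.

```r
# Minimal forward pass through a small fully connected network (base R only).
# Layer sizes and random weights are illustrative assumptions.
set.seed(42)

relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

# 3 inputs -> hidden layers of 4 and 2 nodes -> 1 output
layer_sizes <- c(3, 4, 2, 1)

# One weight matrix and one bias vector per layer transition
weights <- lapply(seq_len(length(layer_sizes) - 1), function(i) {
  matrix(rnorm(layer_sizes[i] * layer_sizes[i + 1]),
         nrow = layer_sizes[i], ncol = layer_sizes[i + 1])
})
biases <- lapply(layer_sizes[-1], function(n) rep(0, n))

forward <- function(x, weights, biases) {
  a <- x
  n_layers <- length(weights)
  for (i in seq_len(n_layers)) {
    # Linear step: multiply by the weights, then add the bias to each column
    z <- sweep(a %*% weights[[i]], 2, biases[[i]], "+")
    # ReLU in the hidden layers, sigmoid on the output layer
    a <- if (i < n_layers) relu(z) else sigmoid(z)
  }
  a
}

# Two made-up observations with three features each
x <- matrix(c(0.5, -1.2, 3.0,
              1.0,  0.0, 2.0), nrow = 2, byrow = TRUE)
probs <- forward(x, weights, biases)
dim(probs)  # one output value per observation
```

Each hidden layer transforms the output of the previous one, which is what lets a deeper network represent more complicated patterns than a single layer could.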


Before we use this library, we need to install it. Since this is a very recent library, we will install it directly from GitHub.

devtools::install_github("rstudio/tfestimators")
library(tfestimators)

Although we installed the library, we don’t yet have the actual compiled code for TensorFlow. We need to install it using the install_tensorflow() command that comes with the tfestimators package.

install_tensorflow()

When you try to run this, you may run into an error like this one:

#> Error: Prerequisites for installing
#> TensorFlow not available.  Execute the
#> following at a terminal to install the
#> prerequisites: $ sudo
#> /usr/local/bin/pip install --upgrade
#> virtualenv

I was able to fix the error by running the above command on a Mac. On Windows, you may need further troubleshooting. After installing the prerequisites, you can try installing TensorFlow again.

install_tensorflow()
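Before moving on, it can be worth confirming that the R package itself is installed. The snippet below is a minimal sketch that only checks for the tfestimators package using base R's requireNamespace(). It does not verify the TensorFlow backend, and the variable name has_tfestimators is just illustrative.

```r
# Check whether the tfestimators R package can be loaded.
# This does not test the Python/TensorFlow backend itself.
has_tfestimators <- requireNamespace("tfestimators", quietly = TRUE)

if (has_tfestimators) {
  message("tfestimators is installed.")
} else {
  message("tfestimators is missing; install it from GitHub as shown earlier.")
}
```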

We will use the sample donor data set from the book Data Science for Fundraising. We’ll load it using the read_csv function from the readr library.

library(readr)
library(dplyr)
 
donor_data <- read_csv("https://www.dropbox.com/s/ntd5tbhr7fxmrr4/DonorSampleDataCleaned.csv?raw=1")

Let’s see what this data looks like:

glimpse(donor_data)
 
#> Observations: 34,508
#> Variables: 23
#> $ ID                  <int> 1, 2, 3, 4, 5, 6,...
#> $ ZIPCODE             <chr> "23187", "77643",...
#> $ AGE                 <int> NA, 33, NA, 31, 6...
#> $ MARITAL_STATUS      <chr> "Married", NA, "M...
#> $ GENDER              <chr> "Female", "Female...
#> $ MEMBERSHIP_IND      <chr> "N", "N", "N", "N...
#> $ ALUMNUS_IND         <chr> "N", "Y", "N", "Y...
#> $ PARENT_IND          <chr> "N", "N", "N", "N...
#> $ HAS_INVOLVEMENT_IND <chr> "N", "Y", "N", "Y...
#> $ WEALTH_RATING       <chr> NA, NA, NA, NA, N...
#> $ DEGREE_LEVEL        <chr> NA, "UB", NA, NA,...
#> $ PREF_ADDRESS_TYPE   <chr> "HOME", NA, "HOME...
#> $ EMAIL_PRESENT_IND   <chr> "N", "Y", "N", "Y...
#> $ CON_YEARS           <int> 1, 0, 1, 0, 0, 0,...
#> $ PrevFYGiving        <chr> "$0", "$0", "$0",...
#> $ PrevFY1Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY2Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY3Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY4Giving       <chr> "$0", "$0", "$0",...
#> $ CurrFYGiving        <chr> "$0", "$0", "$200...
#> $ TotalGiving         <dbl> 10, 2100, 200, 0,...
#> $ DONOR_IND           <chr> "Y", "Y", "Y", "N...
#> $ BIRTH_DATE          <date> NA, 1984-06-16, ...

The TensorFlow library doesn’t tolerate missing values. Therefore, we will replace missing values in the character columns with the mode and missing values in the numeric columns with the median.

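# function adapted from
# https://stackoverflow.com/a/8189441/934898
# returns the most frequent non-missing value,
# so a mostly-NA column is not "imputed" with NA
my_mode <- function(x) {
    ux <- unique(x[!is.na(x)])
    ux[which.max(tabulate(match(x, ux)))]
}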
 
# funs() is deprecated in recent dplyr releases;
# a formula (~) does the same job
donor_data <- donor_data %>%
  mutate_if(is.numeric,
            ~ ifelse(is.na(.),
                     median(., na.rm = TRUE),
                     .)) %>%
  mutate_if(is.character,
            ~ ifelse(is.na(.),
                     my_mode(.),
                     .))
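To see what this imputation does, here is a small base R sketch on a made-up data frame. The toy values and the helper name mode_non_missing are assumptions for illustration. The helper skips NA when looking for the most frequent value, so a column that is mostly missing does not end up being filled with NA again.

```r
# Toy data frame (made-up values) with missing entries in both column types
toy <- data.frame(
  AGE = c(25, NA, 40, 31, NA),
  GENDER = c("Female", NA, "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector
mode_non_missing <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Numeric column: fill with the median; character column: fill with the mode
toy$AGE[is.na(toy$AGE)] <- median(toy$AGE, na.rm = TRUE)
toy$GENDER[is.na(toy$GENDER)] <- mode_non_missing(toy$GENDER)

toy
anyNA(toy)  # FALSE once both columns are filled
```

Here the missing ages become 31 (the median of 25, 31 and 40) and the missing gender becomes "Female" (the most frequent value).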

Next, we need to convert the character variables to factors.

predictor_cols <- c("MARITAL_STATUS", "GENDER", 
                    "ALUMNUS_IND", "PARENT_IND", 
                    "WEALTH_RATING", "PREF_ADDRESS_TYPE")
 
# Convert feature to factor
donor_data <- mutate_at(donor_data, 
                        .vars = predictor_cols, 
                        .funs = as.factor)

Now, we need to let TensorFlow know about the column types. For factor columns, we need to specify all the values contained in those columns using the column_categorical_with_vocabulary_list function. Then, using the column_indicator function, we convert each of the factor values in a column to its own column of 0s and 1s. This process is known as one-hot encoding. For example, say the GENDER column has two possible values, male and female. One-hot encoding will create two columns: one for male and the other for female. Each of these columns will contain either 0 or 1, depending on the value the GENDER column contained for that row.

feature_cols <- feature_columns(
  column_indicator(
    column_categorical_with_vocabulary_list(
      "MARITAL_STATUS",
      vocabulary_list = unique(donor_data$MARITAL_STATUS))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "GENDER",
      vocabulary_list = unique(donor_data$GENDER))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "ALUMNUS_IND",
      vocabulary_list = unique(donor_data$ALUMNUS_IND))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "PARENT_IND",
      vocabulary_list = unique(donor_data$PARENT_IND))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "WEALTH_RATING",
      vocabulary_list = unique(donor_data$WEALTH_RATING))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "PREF_ADDRESS_TYPE",
      vocabulary_list = unique(donor_data$PREF_ADDRESS_TYPE))),
  column_numeric("AGE"))
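To make the one-hot encoding described above concrete, here is a small base R sketch. It uses model.matrix() as a stand-in, not the TensorFlow column functions, and the gender vector is made up for illustration.

```r
# Made-up factor with two levels
gender <- factor(c("Male", "Female", "Female", "Male"))

# model.matrix() with the intercept removed (~ 0 + ...) yields one
# indicator column per factor level
one_hot <- model.matrix(~ 0 + gender)
colnames(one_hot) <- levels(gender)

one_hot
rowSums(one_hot)  # every row has exactly one 1
```

Each row has a 1 in the column that matches its original value and a 0 everywhere else, which is the same layout the indicator columns hand to the network.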

Now that we have created the column types, let’s split the data set into train and test data sets.

row_indices <- sample(1:nrow(donor_data), 
                      size = 0.8 * nrow(donor_data))
donor_data_train <- donor_data[row_indices, ]
donor_data_test <- donor_data[-row_indices, ]
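Because sample() draws different rows on every run, the split above will change each time. If you want a repeatable split, one option is to set a seed first. The sketch below shows this on a made-up data frame; the seed value and the toy data are arbitrary assumptions.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Made-up data standing in for the donor data
toy <- data.frame(id = 1:10, x = rnorm(10))

# 80% of the rows go to training, the rest to testing
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)                                # 8
nrow(toy_test)                                 # 2
length(intersect(toy_train$id, toy_test$id))   # 0, no row is in both sets
```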

The tfestimators package then requires that we create an input function that lists the input and output variables. We will predict the likelihood of a person’s donation.

donor_pred_fn <- function(data) {
    input_fn(data, 
             features = c("AGE", "MARITAL_STATUS", 
                          "GENDER", "ALUMNUS_IND", 
                          "PARENT_IND", "WEALTH_RATING", 
                          "PREF_ADDRESS_TYPE"), 
             response = "DONOR_IND")
}

This is a modified excerpt from the book Data Science for Fundraising (Build Data Driven Solutions Using R).

Build a Deep Learning Classifier

Finally, we can use the prepared data set as well as the input function to build a deep learning classifier. We will create three hidden layers with 80, 40 and 30 nodes respectively.

classifier <- dnn_classifier(
  feature_columns = feature_cols, 
  hidden_units = c(80, 40, 30), 
  n_classes = 2, 
  label_vocabulary = c("N", "Y"))
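To get a feel for how large such a network is, we can count its trainable parameters by hand. The sketch below is a back-of-the-envelope calculation in base R, not something tfestimators reports. The input width of 20 is a hypothetical number (the real width depends on how many one-hot columns the data produces), and a single output unit for the two-class case is also an assumption.

```r
# Each dense layer has (inputs * outputs) weights plus one bias per output
count_dense_params <- function(layer_sizes) {
  n_in  <- layer_sizes[-length(layer_sizes)]
  n_out <- layer_sizes[-1]
  sum(n_in * n_out + n_out)
}

n_inputs <- 20  # hypothetical input width, for illustration only

# input -> 80 -> 40 -> 30 -> 1 output unit (assumed)
count_dense_params(c(n_inputs, 80, 40, 30, 1))
# 20*80+80 + 80*40+40 + 40*30+30 + 30*1+1 = 6181
```

Even this modest architecture has several thousand parameters to fit, which is one reason deep networks usually need a fair amount of data.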

Using the train function, we will train the classifier on the training data.

train(classifier, 
      input_fn = donor_pred_fn(donor_data_train))

We will next predict the values using the model for the test data set as well as the full data set.

predictions_test <- predict(
  classifier, 
  input_fn = donor_pred_fn(donor_data_test))
predictions_all <- predict(
  classifier, 
  input_fn = donor_pred_fn(donor_data))

Similarly, we will evaluate the model on both the test data and the full data set. The two tables that follow the code show the evaluation on the test data and on the full data set, respectively.

evaluation_test <- evaluate(
  classifier, 
  input_fn = donor_pred_fn(donor_data_test))
evaluation_all <- evaluate(
  classifier, 
  input_fn = donor_pred_fn(donor_data))

TensorFlow evaluation on test data

Measure                Value
accuracy               84.34
accuracy_baseline       0.63
auc                   216.00
auc_precision_recall    0.51
average_loss            0.62
global_step             0.63
label/mean              0.66
loss                    0.63
prediction/mean         0.63

TensorFlow evaluation on full data

Measure                Value
accuracy               84.87
accuracy_baseline       0.62
auc                   216.00
auc_precision_recall    0.51
average_loss            0.62
global_step             0.62
label/mean              0.66
loss                    0.62
prediction/mean         0.62
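To clarify what some of these measures mean, here is a small base R sketch that computes three of them by hand. The labels and predicted probabilities are made up and are not the model's output. It also rests on my understanding that accuracy_baseline is the accuracy you would get by always predicting the majority class, and that average_loss is the mean cross-entropy per observation.

```r
# Made-up labels and predicted probabilities, for illustration only
labels <- c(1, 0, 1, 1, 0, 1, 0, 1)
probs  <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.8, 0.1, 0.55)

# Classify as 1 when the predicted probability is at least 0.5
predicted <- as.numeric(probs >= 0.5)

# Share of correct predictions
accuracy <- mean(predicted == labels)

# Accuracy of always guessing the more common class
accuracy_baseline <- max(mean(labels), 1 - mean(labels))

# Mean binary cross-entropy (log loss)
average_loss <- -mean(labels * log(probs) + (1 - labels) * log(1 - probs))

accuracy           # 0.75
accuracy_baseline  # 0.625
average_loss
```

A model is only useful if its accuracy clearly beats the baseline, so the two numbers are best read together.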

The overall accuracy doesn’t seem too impressive, even though we used a large number of nodes in the hidden layers. This is partly due to the data itself: it is a synthetic data set, after all. But you should try the above recipe with your own data set and see if you can get better results. All the best.

References

Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, et al. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” arXiv Preprint arXiv:1603.04467.

Cheng, Heng-Tze, Lichan Hong, Mustafa Ispir, Clemens Mewald, Zakaria Haque, Illia Polosukhin, Georgios Roumpos, et al. 2017. “TensorFlow Estimators: Managing Simplicity Vs. Flexibility in High-Level Machine Learning Frameworks.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1763–71. New York, NY, USA: ACM. http://doi.acm.org/10.1145/3097983.3098171.

Tang, Yuan, JJ Allaire, RStudio, Kevin Ushey, Daniel Falbel, and Google Inc. 2017. tfestimators: High-Level Estimator Interface to TensorFlow in R. https://github.com/rstudio/tfestimators.

About the Author

The author of Tableau Data Visualization Cookbook and an award winning keynote speaker, Ashutosh R. Nandeshwar is one of the few analytics professionals in the higher education industry who has developed analytical solutions for all stages of the student life cycle (from recruitment to giving). He enjoys speaking about the power of data, as well as ranting about data professionals who chase after “interesting” things. He earned his PhD/MS from West Virginia University and his BEng from Nagpur University, all in industrial engineering. Currently, he is leading the data science, reporting, and prospect development efforts at the University of Southern California.
