Data Analysis

Intro to Data Analysis

3 Modules 12 Lessons

About this course

Thank you for joining the course. As you can see in the title, this course is a short intro to data analysis. There’s a lot to learn in the analysis field. There are many scattered resources on the web, and you may feel overwhelmed. I know I do.


In this course I want to help you understand the basics of analysis. These basics will then form the foundation for your later learning.


After completing this course you will:

  • know the history of data analysis
  • know the different types of analysis
  • understand the analysis process, and
  • be able to do some of this analysis yourself


Let’s get started then.

Course Structure

History of data analysis

This field got its start with capturing and counting census and other demographic data. Later, people figured out that there are better ways to present information and started tabulating the data.

Then using probabilities, early statisticians estimated the population of cities to plan for the future.

As mathematics advanced, various theories of distribution were formed. One of the best-known, and still popular, results is Bayes’ theorem.

Later, statistics as a field really took off. Various sampling methods, the least squares method, and hypothesis testing soon followed.

Types of analyses

These are the main types of analysis: descriptive, inferential, exploratory, and visual.

Descriptive: gives a summary of the data and a quick look into it. Some common measures are: range, mean, mode, and standard deviation.

Inferential: estimates properties of the underlying distribution of the data. Some common tools in this type of analysis are: random sampling, hypothesis testing, and confidence intervals.

Exploratory: helps us uncover insights and visualize statistical properties. Some common techniques are: stem-and-leaf plots, box plots, and histograms.

Visual: builds upon exploratory analysis. Some common visualizations are: bar charts, part-to-whole charts, correlation plots, and geographic maps.


Methods of analysis

Sometimes you start with a question in mind. Other times you explore the data to ask questions nobody has asked before.

Here’s a typical process. Keep in mind, however, that this is not a linear process. In some projects, you may be jumping between, skipping, or combining steps. Even if you don’t explicitly follow these steps, you will end up very close to this workflow.

Here are the steps:

You get the desired data, sometimes yourself and sometimes by begging the data overlords, affectionately known as database administrators.

You have to clean the data once you get your hands on it.

You then manipulate certain fields to get the desired format. This is where we spend most of our time in analysis. For example, you may create a “distance” field to hold the distance between a distribution center and your customer’s home.

You then do some exploratory analysis to detect outliers or to get an idea of the distribution of the data.

In the analysis step, you may create plots, find patterns, and build predictive models.

The last steps are the most crucial ones. First, you draw insights from what you have learned about the data.

Any analyst worth her salt will first ask “why” and not “how”. After seeing the results, she will say “so what?” rather than “interesting”. We shouldn't be satisfied with our first question or, even worse, with our first answer.

The last step is reporting your insights. Here you spend your time providing recommendations and answering key questions, not creating useless infographics.

You have to make sure that your reader is able to comprehend the information easily. Provide recommendations based on your insights, and do not spend too much time on your theory and tools. Nobody really cares about the tools; at least your reader doesn’t.

In the end, your reader cares about “how is any of this going to improve my bottom line?” and that is what we have to focus on. You have to make your reports visually appealing, use simple language, and meet the reader where they are.
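
The “distance” field mentioned in the manipulation step could be built along these lines. This is only a minimal sketch: the record layout, the coordinates, and the helper name haversine_km are made up for illustration, and it assumes you have latitude and longitude for both locations and that a straight-line (great-circle) distance is good enough for your purpose.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical records; the field names and coordinates are illustrative only.
center = {"lat": 41.8781, "lon": -87.6298}
customers = [
    {"id": 1, "lat": 41.8781, "lon": -87.6298},
    {"id": 2, "lat": 42.8781, "lon": -87.6298},
]

# Create the new "distance" field for every customer.
for customer in customers:
    customer["distance_km"] = haversine_km(
        center["lat"], center["lon"], customer["lat"], customer["lon"]
    )
    print(customer["id"], round(customer["distance_km"], 1))
```

If you need driving distance rather than straight-line distance, you would have to swap in a routing service, which is outside the scope of this sketch.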


Types of tools available

There are many tools available for data analysis. The most commonly used is Microsoft Excel. Excel offers great flexibility, ease of use, and the shortest learning curve. I got my start using Excel. Excel has a huge community and many, many books written about it.

Some free options are R and Python. They have long learning curves, but fortunately, since these tools are used by a lot of practitioners, you can find answers to most of your questions easily.

By far, R is my favorite. It is very powerful: you can connect to many data sources, and it helps you clean, analyze, visualize, and create automated reports.

Python is favored by many in the tech industry. It is a general-purpose programming language, unlike R, which is designed primarily for statistical computing. Python works really well with web components. If you want to do text mining, you should go with Python.

Some other specialized tools are SAS, SPSS, and Tableau.

SAS and SPSS are commonly used in regulated industries such as health care and insurance, as well as in academia.

SAS is very powerful and has tons of packages to help you with specialized tasks.

Tableau is a relative newcomer, but it has captured a big portion of the market, and for many good reasons. It is extremely easy to use, offers great flexibility in creating interactive visualizations, and is amazingly fast.


Which tool should I use?

I wish there were an easy answer.

This really falls in the “it depends” category.

It simply depends on time, money and resources. Do you have all of these to acquire, learn, and implement a tool of your choice?

Perhaps the most important factor is business need. I've seen many analysts fall prey to this one. Do you really need a power saw where a hacksaw will do? Using a tool just because it is fancy is never a good idea. You may end up spending a lot of money, time, and resources on a problem that could have been easily solved by a simpler tool.


Data Organization

Data comes in various shapes and sizes.

  • text files
  • csv files
  • Excel files
  • Access databases
  • Database management systems
  • Unstructured data

A typical data set has rows and columns.

Columns define the type of the data we collected or are studying.

Columns are also called variables, fields, or attributes, and really cool people call them features.

Rows capture various observations as categorized by columns.

Data cleanup

In the real world, data will rarely be neatly organized for your analysis. Some problems that we often see in messy data are:

  • missing data
  • numeric data coded as text
  • quotes in data
  • commas in data
  • duplicates
  • miscoded data: data that does not belong in the field it was entered in
  • data entry errors, which often show up as outliers
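
A few of these problems can be handled with very little code. The sketch below uses only Python's standard csv module and a tiny made-up data set (the column names and values are hypothetical). It shows commas and quotes inside fields, numeric data coded as text, a missing value, and a duplicate row. Treat it as one possible approach under those assumptions, not a general-purpose cleaner.

```python
import csv
import io

# Made-up raw data: note the quoted commas, the doubled quotes,
# the number stored as "1,200", the blank sales value, and the duplicate row.
raw = '''name,city,sales
"Smith, John",Austin,"1,200"
"Smith, John",Austin,"1,200"
"Ana ""AJ"" Lee",Boston,
Bob Ray,Denver,950
'''


def to_number(text):
    """Convert text such as '1,200' to a float; return None for blank values."""
    text = text.strip().replace(",", "")
    return float(text) if text else None


rows = []
seen = set()
for record in csv.DictReader(io.StringIO(raw)):
    record["sales"] = to_number(record["sales"])  # numeric data coded as text
    key = tuple(record.values())
    if key in seen:                               # drop exact duplicates
        continue
    seen.add(key)
    rows.append(record)

# Flag rows with missing data instead of silently dropping them.
missing = [r["name"] for r in rows if r["sales"] is None]

for r in rows:
    print(r)
print("missing sales for:", missing)
```

The csv module takes care of the quotes and embedded commas for you, which is one reason to avoid splitting lines on commas by hand.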



Data preprocessing

Once you have organized and cleaned your data, sometimes preprocessing is helpful.

Some ways that you can preprocess the data are:

  • missing data: you can put the mean or mode in place of missing values
  • binning: you can create discrete ranges out of numeric values
  • normalization: you convert values to a common scale
  • transformation: you apply a function such as log or square root to the values
  • feature extraction: you create new columns based on existing data, for example, gender columns
  • feature or variable selection: through a combination of domain knowledge and algorithms, you select the most important features. This is an advanced topic and I will cover it in later courses.
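
Here is a rough sketch of the first four ideas on a small made-up list of ages. The values, the bin boundaries, and the choice of min-max scaling are all assumptions made for illustration; in practice you would pick them based on your data and your question.

```python
import math
import statistics

ages = [23, 35, None, 41, 58, None, 35]  # hypothetical data with missing values

# Missing data: replace missing values with the mean of the known values.
known = [a for a in ages if a is not None]
mean_age = statistics.mean(known)
filled = [a if a is not None else mean_age for a in ages]


# Binning: turn numeric values into discrete ranges (boundaries are arbitrary).
def age_bin(age):
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"


bins = [age_bin(a) for a in filled]

# Normalization: min-max scaling to the range 0..1.
lo, hi = min(filled), max(filled)
scaled = [(a - lo) / (hi - lo) for a in filled]

# Transformation: natural log of each value.
logged = [math.log(a) for a in filled]

print(filled)
print(bins)
print([round(s, 2) for s in scaled])
print([round(v, 2) for v in logged])
```

Keep in mind that filling in the mean hides the fact that a value was missing, so it can be worth keeping a separate flag column as well.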

Descriptive Stats

Descriptive statistics provide a summary and a quick look at the data. Before you start your analysis, it is very helpful to take a look at these summaries of your data points. They can alert you if something is off and, better yet, can give you a quick indication of the importance of certain fields.

Here are some common measures:

  • Range: the difference between the largest value and the smallest value (max - min)
  • Mean: the mean, aka the average, shows the center of a data set. You total up all the values and divide by the number of observations
  • Mode: the most frequently occurring value
  • Median: the middle value when the data is sorted. The mean can be thrown off by extreme values, whereas the median is not
  • Standard deviation: shows how far the individual data points typically are from the mean
  • Quartiles: quartiles show the distribution of the data points at various percentages. Usually, we compute the 25%, 50%, and 75% points. The 50% point is the median of the values.
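
All of these measures are available in Python's standard statistics module. The sketch below runs them on a small made-up list that includes one extreme value, so you can see how the mean gets pulled away from the median. The data is hypothetical, and note that different tools use slightly different rules for computing quartiles.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 40]  # made-up data; 40 is an extreme value

value_range = max(values) - min(values)          # max - min
mean = statistics.mean(values)                   # total divided by count
median = statistics.median(values)               # middle value when sorted
mode = statistics.mode(values)                   # most frequent value
stdev = statistics.stdev(values)                 # sample standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4)   # 25%, 50%, 75% cut points

print("range:", value_range)
print("mean:", round(mean, 2))
print("median:", median)
print("mode:", mode)
print("standard deviation:", round(stdev, 2))
print("quartiles:", q1, q2, q3)
```

Here the mean lands well above the median because of the single large value, which is exactly the kind of thing these summaries are meant to flag.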

Exploratory Data Analysis

John Tukey promoted exploratory data analysis as a way to understand data better, especially via data visualization. It is very easy to jump straight into analyzing the data and building models, but visualizing the data can reveal glaring outliers or problems with the data, as well as help us overcome our biases.

Anscombe created four data sets with almost equal means and standard deviations. If you were to look at these statistical properties just as numbers, you would think that all the data sets are equal or similar. Only when you plot them do you see the stark differences.
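
You can check the "almost equal" part yourself. The sketch below uses the values of Anscombe's quartet as they are commonly reproduced (please verify them against a trusted copy before relying on them) and computes the mean and standard deviation of each set. The variable names are my own.

```python
import statistics

# Anscombe's quartet as commonly reproduced; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "stdev_x": statistics.stdev(xs),
        "stdev_y": statistics.stdev(ys),
    }
    print(name, {key: round(value, 2) for key, value in summary[name].items()})
```

The four summary lines come out nearly identical, yet a scatter plot of each set looks completely different. That is the argument for plotting before modeling.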

Some common techniques are:

  • stem-and-leaf
  • histograms
  • Pareto chart
  • scatter plots
  • Box plots
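
Of the techniques above, the stem-and-leaf plot is simple enough to build by hand. Here is a minimal sketch that assumes non-negative whole numbers and uses the tens digit as the stem and the units digit as the leaf. The function name and the sample values are made up.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf lines for non-negative integers (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    for value in sorted(values):
        groups[value // 10].append(value % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


scores = [12, 15, 21, 23, 23, 38, 47]  # hypothetical data
for line in stem_and_leaf(scores):
    print(line)
```

Turned on its side, the output reads like a histogram, but you can still see every individual value.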

Simple linear regression

At its core, linear regression is about minimizing the distance between the predicted value of an observation and its actual value. The technique is called least squares, and “simple” refers to using one column, or variable, of data as the predictor.

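With the least squares method, you get a best fit by minimizing the sum of the squared distances between the predicted and actual values of the observations. The best fit is given in the form:

$y = \alpha + \beta x$

where $x$ is the predictor variable, $\alpha$ is the intercept, and $\beta$ is the slope.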
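
For a single predictor, the least squares estimates have a simple closed form: the slope is the co-variation of x and y divided by the variation of x, and the intercept makes the line pass through the point of means. The sketch below implements that on made-up data. The names fit_line, alpha, and beta are my own labels for the intercept and slope.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for the line y = alpha + beta * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: forces the line through (x_mean, y_mean).
    alpha = y_mean - beta * x_mean
    return alpha, beta


# Made-up data that lies exactly on y = 1 + 2x, so the fit is easy to check.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

alpha, beta = fit_line(xs, ys)
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
sum_of_squares = sum(r ** 2 for r in residuals)

print("intercept:", round(alpha, 4))
print("slope:", round(beta, 4))
print("sum of squared residuals:", round(sum_of_squares, 4))
```

With real data the points will not fall exactly on a line, and the sum of squared residuals is the quantity the method makes as small as possible.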


Problems with simple linear regression

You may have heard many times that correlation does not equal causation.

I gave this fictitious example of yellow cars and accidents for a reason.

There may be a correlation between the number of yellow cars and the number of accidents, but the yellow cars most likely don’t cause all the accidents. This is a very important point, because you see in the news quite often that eating spinach cures cancer, and the next day that spinach causes cancer. The scientists most likely reported only a correlation, but the media often sensationalizes it. A good paper on this topic “predicts” the stock market based on planetary motions, presidents in the White House, etc.

Multiple linear regression

This is an extension of simple linear regression, but we consider more variables to make the prediction.

For example, we might use average rainfall, average snowfall, and the number of yellow cars as the variables $x_1$, $x_2$, $x_3$.

These produce multiple estimates: $\beta_1$, $\beta_2$, $\beta_3$.

In the end, the best fit is of the form:

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
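
To show that the same least squares idea carries over to several variables, here is a small pure-Python sketch. It solves the so-called normal equations with Gaussian elimination. The helper names and the data are made up, and in real work you would normally use R, or a Python library built for this, rather than writing it by hand.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares [alpha, beta_1, ..., beta_k] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Made-up observations of (x1, x2, x3); y is built from known coefficients
# alpha=1, beta1=2, beta2=-1, beta3=0.5 so that the result can be checked.
rows = [(1, 2, 1), (2, 1, 3), (3, 4, 2), (4, 3, 5), (5, 6, 4), (6, 5, 7)]
ys = [1 + 2 * x1 - 1 * x2 + 0.5 * x3 for x1, x2, x3 in rows]

alpha, beta1, beta2, beta3 = fit_multiple(rows, ys)
print([round(v, 4) for v in (alpha, beta1, beta2, beta3)])
```

This sketch does no checking for variables that duplicate each other, which is a situation where the normal equations break down.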