Data Scientist Training
Have you spent hours searching for how to train as a data scientist, only to end up confused and tired? Relax. Your search ends here.
There are thousands of resources on the internet that show you how to train as a data scientist, yet a clear path is still hard to find.
That is frustrating, and I can relate.
I have spent hundreds of hours learning and practicing these skills without pulling all my hair out, and I want to help you get started quickly.
In this free self-study training guide, you will find a 30+ day path toward becoming a data scientist. It covers the key skills of data science: data manipulation, data analysis, and data visualization/storytelling. I've excluded the "big data" tools and technologies for a simple reason: you can't run before you learn to walk. This self-study curriculum will teach you to walk and maybe to jog.
If you complete this entire curriculum, you will have shown a key skill required of data scientists: perseverance. Keep going no matter how hard you find the journey. Leave your thoughts here and let me know how I can help.
Let's get started then.
First, you need tools and books
Tools you need
- R and RStudio
- Weka
- SQLite (optional: SQLite Manager for Firefox)
- Notepad++ or Sublime Text
Books you need
SQL
R
Machine Learning and Statistics
Data Visualization
Storytelling
Self-study Training Guide
Understand Relational Databases
- Read Appendix L from Krieger (2008) http://bit.ly/1urSPml
- Read Chapter 3 Krieger (2011)
Practice
- Repeat these exercises using your own data examples
Understand Joins
- Read Introduction to joins http://bit.ly/g6LH8
- Visualize joins http://bit.ly/ceW6QZ
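To make the difference between join types concrete, here is a minimal sketch. It uses Python only because its standard library ships SQLite; the SQL statements themselves can be pasted into the sqlite shell unchanged. The `customers` and `orders` tables and their rows are invented for illustration and do not come from any of the readings.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN keeps every customer; customers without orders get NULL (None in Python).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

print(inner_rows)
print(left_rows)
```

Cy has no orders, so Cy is missing from the inner join and appears with an empty amount in the left join. Spotting that one row is a quick way to check your understanding of joins.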
Write SQL
- Complete Exercise 6: Select Across Many Tables from Shaw
- Complete Exercise 9: Updating Data from Shaw
- Complete Exercise 10: Updating Complex Data from Shaw
- Complete Exercise 15: Data Modeling from Shaw
Practice
- Write a small paragraph about your understanding of joins
- Repeat these exercises using your own data examples
Learn Advanced SQL
- Read Chapter 3: Calculations and Aliases from Rockoff
- Read Chapter 4: Using Functions from Rockoff
- Read Chapter 6: Column-Based Logic from Rockoff
- Read Chapter 7: Row-Based Logic from Rockoff
- Read Chapter 8: Boolean Logic from Rockoff
- Read Chapter 10: Summarizing Data from Rockoff
Practice
- Repeat these exercises using your own data examples
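As a rough illustration of two topics named above, summarizing data and column-based logic, the sketch below groups rows and labels each group with a `CASE` expression. It runs SQLite through Python's standard library; the `sales` table, its rows, and the 100 threshold are made up for this example and are not taken from Rockoff.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 120.0), ('East', 80.0), ('West', 30.0);
""")

# GROUP BY collapses rows per region; the aggregates summarize each group,
# and CASE turns the group total into a label (column-based logic).
summary = conn.execute("""
    SELECT region,
           COUNT(*)    AS n_sales,
           SUM(amount) AS total,
           CASE WHEN SUM(amount) >= 100 THEN 'high' ELSE 'low' END AS tier
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

for row in summary:
    print(row)
```

Try changing the threshold or adding a `HAVING` clause to see how the grouped output changes.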
Continue Advanced SQL
- Read Chapter 11: Combining Tables with an Inner Join from Rockoff
- Read Chapter 12: Combining Tables with an Outer Join from Rockoff
- Read Chapter 14: Subqueries from Rockoff
- Read Chapter 15: Set Logic from Rockoff
Practice
- Repeat these exercises using your own data examples
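Subqueries and set logic are easier to remember once you have run them. The following is a small sketch, again using SQLite from Python's standard library. The `students` and `alumni` tables and their contents are hypothetical, chosen only so the results are easy to check by hand.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT, score INTEGER);
    CREATE TABLE alumni (name TEXT);
    INSERT INTO students VALUES ('Ann', 90), ('Bob', 60), ('Cy', 75);
    INSERT INTO alumni VALUES ('Cy'), ('Dee');
""")

# Subquery: the inner SELECT computes the average score,
# and the outer query keeps students strictly above it.
above_avg = conn.execute("""
    SELECT name FROM students
    WHERE score > (SELECT AVG(score) FROM students)
    ORDER BY name
""").fetchall()

# Set logic: UNION merges both name lists and removes duplicates.
all_names = conn.execute("""
    SELECT name FROM students
    UNION
    SELECT name FROM alumni
    ORDER BY name
""").fetchall()

# Set logic: EXCEPT keeps names in students that are not in alumni.
only_students = conn.execute("""
    SELECT name FROM students
    EXCEPT
    SELECT name FROM alumni
    ORDER BY name
""").fetchall()

print(above_avg, all_names, only_students)
```

The average score here is 75, so only Ann is above it. Cy appears in both tables, which is why Cy shows up once in the union and not at all in the difference.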
Learn Data Handling in R
- Read Getting data into R from Zuur
- Read Recipe 3.8 Accessing built-in datasets from Teetor
- Read Accessing variables and managing subsets of data from Zuur
- Read Chapter 1: Getting started from Matloff
- Read Chapter 5: Data Frames from Matloff
Practice
- Write your understanding of these concepts
- Repeat these exercises using your own data examples
Learn Data Handling in R
- Read Chapter 6: Factors and Tables from Matloff
- Read Appendix B: Installing and using packages from Matloff
- Read Recipe 4.10 Writing to CSV files from Teetor
- Read Chapter 6 Data Transformations from Teetor
- Follow these slides on Accessing Databases from R
Practice
- Write your understanding of these concepts
Learn Data Manipulation in R
Practice
- Write your understanding of these concepts
- Repeat these tutorials using your own data examples
Learn ggplot2 Graphics
Practice
- Repeat these tutorials using your own data examples
R Miscellaneous
- Plot Polygons
- Learn best practices of R programming from Google's R Style Guide
Practice
- Plot your data using various ggplot2 functions
Understand Statistical Concepts
- Read the theory of linear regression in chapter 3 of James
- Run the examples of linear regression in R from section 3.6 of James
- Read theory of logistic regression
- Read the theory of logistic regression in section 4.3 of James
- Run the examples of logistic regression in R from section 4.6 of James
Practice
- Write your understanding of these concepts
Understand Statistical Concepts
- Read theory of Naive Bayes section 4.2 from Witten
- Understand Statistical Distributions: part I
- Understand Statistical Distributions: part II
- Understand Statistical Distributions: part III
Practice
- Run Bayes classifier on Titanic data in Weka
- Plot various distributions in R
Learn Data Preparation
- Learn normalization of data (rescaling to a fixed range, and standardizing by the standard deviation)
- Learn discretization (unsupervised/supervised) from section 7.2 of Witten
- Learn sampling (cross-validation, bootstrapping) from sections 5.2 and 5.3 of Witten
Practice
- Implement normalization in R
- Test various discretization techniques in Weka
- Test various sampling techniques in Weka
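Before implementing these in R or trying them in Weka, it can help to see the arithmetic behind normalization and sampling written out. The sketch below is one possible reading of these ideas in plain Python, not the method from Witten or from Weka. All function names (`min_max_scale`, `z_score`, `k_fold_indices`, `bootstrap_sample`) and the sample data are my own assumptions for illustration.

```python
import random
import statistics


def min_max_scale(values):
    """Rescale values to the range [0, 1]. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation over n rows."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def bootstrap_sample(values, rng):
    """Draw len(values) items with replacement from values."""
    return [rng.choice(values) for _ in values]


# Made-up data, for illustration only.
data = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(data)
standardized = z_score(data)
splits = list(k_fold_indices(len(data), 2))
resample = bootstrap_sample(data, random.Random(0))

print(scaled)
print(standardized)
print(splits)
print(resample)
```

Each row lands in exactly one test fold during cross-validation, while a bootstrap sample may repeat some rows and omit others. Porting these four functions to R is a reasonable way to approach the normalization exercise.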
Learn Data Preparation
- Learn feature subset selection (FSS) from section 7.1 of Witten
Practice
- Test various FSS techniques in Weka
Understand Machine Learning
- Understand algorithms from chapter 4 of Witten
- Understand advanced methods from chapter 6 of Witten
Practice
- Write your understanding of these algorithms
- Test various algorithms in Weka
Learn Effective Data Visualization
- Learn Best Practices
- Read Cleveland's Book
- Read Wong's Book
- Read Tufte's Book
Become a Better Writer
- Learn Effective Writing from Strunk
Become a Better Programmer
- Learn Good Programming Practices from McConnell
Become a Better Storyteller
- Read Confessions of a Public Speaker by Berkun
- Read How to Give a TED Talk by Donovan