This document covers the curriculum that I use to teach data science in Python. It is still a work in progress and a few sections are incomplete.
Before you follow this curriculum in detail, I recommend reviewing my article on which language to use.
- R vs Python for data science
- R curriculum
When you learn skills that you perform (as distinct from learning facts that you just remember), you need to form in your mind two types of mappings. The first and easier is the mapping from a tool or technique to what it does. This is the mapping that you will learn just by reading. The second and more important is the mapping from the problem you need to solve back to the tools and techniques that are applicable to it. This mapping is developed only by doing problems, and without this mapping, your learning is of no practical use. Therefore it is imperative that you do exercises; do the ones from the book, or even better, make up your own related to a domain which is interesting to you.
Research indicates that you will form better memories by typing commands than by cutting and pasting from examples. If you are using an online book, I suggest that you resist the temptation to cut and paste from the book into the Python session; pretend that it's a paper book and type out the commands.
If you're here, it's because you prefer to learn in Python instead of R (or your boss strongly prefers that you do so). Be aware that the path is steeper and rockier going up the Python side of the mountain: you're going to have to memorize a bunch of boring data manipulation first, before you get to make any pretty graphs, and several topics are not adequately covered by any available book. I've done the best I can to smooth over these gaps, but the overall experience is still going to be rougher than it would be with R.
The book is graciously available online from the author, or you can buy it as a paperback from any of the usual sources.
As of October 2018, I am having success with Anaconda 5.2.0, which is based on Python 3.6, but I had problems with Anaconda 5.3.0, which is based on Python 3.7: some packages won't build under Python 3.7, including dependencies of plotnine. I recommend Python 3.6.
I use pyenv to manage versions and have had good results with it so far. You can use whatever you prefer.
Also install nbextensions and enable collapsible_headings and toc2.
This book does not contain exercises, so I've written some.
There is quite a lot in chapter 3; I wouldn't be surprised if it takes you more than one week to get through it.
Python notebooks with additional materials:
Stop here; do not read chapter 4. We will not use matplotlib in this class.
Prior to selecting Altair, I reviewed the available libraries. As of late 2018, the two best are Altair and Plotnine. I favor Altair: although it is not as good as Plotnine today, the fact that it renders using JavaScript appeals to the kind of people who want to do data science in Python, and it has backing from bigger names in the Python community.
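To give you a feel for Altair before you dive in, here is a minimal sketch of a plot; the dataset and column names are made up for illustration.

```python
import altair as alt
import pandas as pd

# A tiny made-up dataset: fuel efficiency vs. weight for a few vehicle classes
cars = pd.DataFrame({
    "weight": [2200, 2600, 3100, 3500, 4100, 4500],
    "mpg":    [33,   30,   25,   22,   18,   15],
    "class":  ["compact", "compact", "midsize", "midsize", "truck", "truck"],
})

# Build the chart declaratively: data, then marks, then encodings
chart = (
    alt.Chart(cars)
    .mark_point()
    .encode(x="weight", y="mpg", color="class")
)

chart  # in a Jupyter notebook, this renders as an interactive JavaScript chart
```

That declarative pattern, binding columns to visual channels with encode, is the core idea you will see throughout the resources below.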
There is currently no book which covers Altair, so here are some online resources to use instead:
This is a verbose book with repetitive examples and exercises, so once you understand a concept, move on to the next one. Do a representative sample of the exercises, but stop doing exercises of a certain type once you understand how to do them.
This curriculum refers to the 3rd edition of this book. There is a 4th edition, but I have not had time to read it and update the curriculum, so you'll have to stick with the 3rd edition for now.
We'll be using chapters 1 through 6. Read all sections, including those marked “special topic.” You are all special.
This book is available printed, or as a PDF at openintro.org. It offers labs, but only in R or SAS. Here's a Python notebook which demonstrates the key functions you will need:
We do not use chapters 7 and 8 of OpenIntro Statistics.
My recommendation is that you try to do at least some of the exercises in OpenIntro both by hand and also in Python, and confirm you got the same answer. If you have trouble understanding, I recommend supplementing this book with the Cartoon Guide to Statistics by Gonick and Smith.
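As an example of doing an exercise both ways, here is a hedged sketch of checking a 95% confidence interval from the textbook formula against scipy; the sample values are made up, and I'm assuming scipy is available (it ships with Anaconda).

```python
import numpy as np
from scipy import stats

# Made-up sample for illustration
x = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.3])

# "By hand": 95% t confidence interval from the textbook formula
n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)         # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)      # two-sided 95% critical value
by_hand = (mean - t_crit * se, mean + t_crit * se)

# The same interval computed by scipy, to confirm the hand calculation
from_scipy = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

print(by_hand)
print(from_scipy)  # should agree to floating-point precision
```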
This book is so good that, although its examples are all given in R, the best option for Python is currently to use it anyway and substitute in alternative labs. There is a full set of labs from J. Warmenhoven on GitHub. I have also developed some of my own demonstration notebooks, below.
Chapter 1 is a trivial introduction. Chapter 2 contains important conceptual information about bias and variance; there is nothing in the labs (which are all about basic R programming) that you should try to do.
These chapters are about basic regression and classification. I strongly recommend using the statsmodels library for this, and not the sklearn library, because the latter is missing all the useful diagnostics for your model.
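To show what I mean by diagnostics, here is a minimal sketch of a regression fit with statsmodels' formula interface; the DataFrame and column names are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: does advertising spend predict sales?
df = pd.DataFrame({
    "sales": [10.2, 11.1, 13.5, 14.8, 16.0, 18.3],
    "tv":    [30,   45,   70,   85,   100,  120],
})

# Ordinary least squares via an R-style formula
model = smf.ols("sales ~ tv", data=df).fit()

# The summary reports coefficients, standard errors, t-statistics, p-values,
# R-squared, and other diagnostics that sklearn does not surface for you
print(model.summary())
```

Classification works the same way with smf.logit, and the summary output lines up closely with what the book shows from R.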
TODO: decide which libraries I'm using for these
The h2o.ai library has reasonably good support for random forests and boosted trees.
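As a rough sketch of what using h2o looks like, assuming the h2o Python package is installed with a local Java-backed cluster, and with a made-up file name and column names:

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

# Start (or connect to) a local H2O cluster
h2o.init()

# Load a made-up CSV into an H2OFrame and split it into train/test
frame = h2o.import_file("my_data.csv")
train, test = frame.split_frame(ratios=[0.8], seed=42)

# Fit a random forest; H2OGradientBoostingEstimator is the analogous
# estimator for boosted trees
rf = H2ORandomForestEstimator(ntrees=100)
rf.train(x=["feature_1", "feature_2"], y="target", training_frame=train)

# Evaluate on the held-out frame
print(rf.model_performance(test_data=test))
```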
TODO: decide which library I'm using for this
- Multi-Library Clustering Demos
- Multi-Library PCA Demos
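Those notebooks compare several libraries; for a flavor of the underlying techniques, here is a minimal sketch using scikit-learn on made-up data (scikit-learn is only one of the libraries the demos cover, and it is not the notebooks themselves).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Made-up data: 100 points in 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# K-means clustering into 3 groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# PCA down to 2 components, e.g. for plotting the clusters
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10])
print(X_2d[:3])
```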
All students do a final project to complete the program. At a minimum, the final project must include: