When you learn skills that you perform (as distinct from learning facts that you just remember), you need to form in your mind two types of mappings. The first and easier is the mapping from a tool or technique to what it does. This is the mapping that you will learn just by reading. The second and more important is the mapping from the problem you need to solve back to the tools and techniques that are applicable to it. This mapping is developed only by doing problems, and without this mapping, your learning is of no practical use. Therefore it is imperative that you do exercises; do the ones from the book, or even better, make up your own related to a domain which is interesting to you.
Research indicates that you will form better memories by typing commands instead of cutting and pasting from examples. If you are using an online book, I suggest that you resist the temptation to cut and paste from the book into the R session — pretend that it’s a paper book and type out the commands.
For some books and sections, recommended timing information appears. This takes the form of a median and maximum time. The median is from those students who studied the same material and ultimately graduated from the training program. The maximum is a way of evaluating your progress: if you take longer than the listed maximum amount of time without successfully completing the material, it indicates that you may be inadequately prepared for this class and have a low probability of ultimately graduating. These timings are derived from data about software engineers, so if you have a business or analytics background, they may not apply to you.
Your corporate program sponsor can tell you if there are specific policies regarding the maximum time at your company.
This book is available in print or online from the author at http://r4ds.had.co.nz . Note that the chapters are numbered differently in the online version. Since most people seem to be using the online version, the numbers in this document are the online numbers. If you’re using the print version, do what I did and write the online numbers in your table of contents.
Part I
Part II
Part III
Part IV
We will not be using Part IV from this book. It presents linear regression modeling in a very non-rigorous way, and I prefer to go through the required statistics first and then do modeling properly.
Part V
Roughly the first half of your time in this book should go into chapters 3, 5, and 7, which contain substantial material that you need to practice repeatedly, and the second half of your time should go into everything else.
Supplemental material on programming with dplyr and dealing with the nonstandard argument passing: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
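To give a flavor of the nonstandard argument passing that the vignette deals with, here is a minimal sketch. The helper name `mean_by` and the choice of the built-in `mtcars` data are my own assumptions for illustration, and the example assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper (not from the book or the vignette):
# group by one column and average another.
# The {{ }} "embrace" operator forwards the unquoted column names
# the caller typed into the dplyr verbs inside the function.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}), .groups = "drop")
}

# Column names are passed bare, just as in an ordinary dplyr call
result <- mean_by(mtcars, cyl, mpg)
print(result)
```

Without the embrace operator, dplyr would look for columns literally named `group_col` and `value_col`, which is the problem the vignette explains in detail.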
This is a verbose book with repetitive examples and exercises, so once you understand a concept, move on to the next one. Do a representative sample of the exercises, but stop doing exercises of a certain type once you understand how to do them.
This curriculum refers to the 3rd edition of this book. There is a 4th edition but I have not had time to read it and update, so you'll have to stick with the 3rd edition for now.
We’ll be using chapters 1 through 6. Read all sections, including those marked “special topic.” You are all special.
This book is conceptual and does not tell you any R commands, although whenever the authors provide example output, it corresponds perfectly to what R would produce. There is an online software supplement to the book here: https://www.openintro.org/stat/labs.php?stat_lab_software=R . I recommend only the “Probability” through “Confidence Level” sections. I do not recommend the Inference sections, because they teach everything using their own custom functions which are not standard R functions.
This book is available printed, or as a PDF at openintro.org. Note that there is now a 4th edition, but my chapter numbers refer to the 3rd edition.
Earlier versions of this document also recommended Introductory Statistics with R by Peter Dalgaard. I have since decided that this book is too long relative to the value of its material, so I have cut it and started directly teaching the required material instead. If you are following this curriculum on your own, you will still need it. See the alternative version with Dalgaard, suitable for self-study.
We do not use chapters 7 and 8 in OpenIntro Statistics.
Additional Material in the OI Labs:
Additional Material I Will Teach You:
My recommendation is that you try to do at least some of the exercises in OpenIntro both by hand and also in R, and confirm you got the same answer. If you have trouble understanding, I recommend supplementing this book with the Cartoon Guide to Statistics by Gonick and Smith.
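As one way of checking a by-hand answer against R, the sketch below computes a 95% confidence interval for a mean from the textbook formula and then with the built-in `t.test()`. The sample values are made up for illustration and do not come from any OpenIntro exercise.

```r
# Made-up sample, purely for illustration
x <- c(4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.6, 5.9)

# "By hand": point estimate +/- critical value * standard error
n <- length(x)
se <- sd(x) / sqrt(n)
t_star <- qt(0.975, df = n - 1)
by_hand <- mean(x) + c(-1, 1) * t_star * se

# Built-in: t.test() reports the interval for the same confidence level
built_in <- t.test(x, conf.level = 0.95)$conf.int

print(by_hand)
print(built_in)

# The two intervals should agree up to floating-point rounding
all.equal(by_hand, as.numeric(built_in))
```

If the two answers disagree, the usual culprits are the wrong degrees of freedom or a critical value taken from the normal distribution instead of the t distribution.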
(For a time I referred to this book as “JWHT” because some students had been getting it confused with another similarly named book, but that other book has been removed from the curriculum, so I am now back to the standard “ISLR” abbreviation for this book.)
Chapter 1, the introduction, is trivial.
Every other chapter, 2 through 10, is both substantial and important. For chapter 2, the content is important but you don’t need to do the labs or exercises, which just cover basics of R that we already know. For all the subsequent chapters, the labs are important. Chapter 10 on unsupervised learning can be done out of order if you need it sooner for your project.
Supplemental material:
Plotmo: Plotmo README (see also the "Plotting regression surfaces with plotmo" vignette on that page)
Prior to reading this book, take a couple hours to go back and review the material in the first part of r4ds; some people find that they got rusty while doing statistics.
This book uses base-R graphics and manipulation. I suggest that you do not try to deeply understand these; when you work the examples and exercises, do them using the ggplot and dplyr functions that you already learned from r4ds.
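As a rough illustration of that translation, the sketch below writes one subsetting step and one scatterplot first in base-R style and then with dplyr and ggplot2. It uses the built-in `mtcars` data as a stand-in; the dataset and the threshold are my own assumptions, not taken from the book's labs.

```r
library(dplyr)
library(ggplot2)

# Base-R style, roughly as a lab might write it
base_subset <- mtcars[mtcars$mpg > 25, c("mpg", "wt")]
# plot(mtcars$wt, mtcars$mpg)

# The same subsetting with dplyr verbs
tidy_subset <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, wt)

# The same scatterplot with ggplot2
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
print(p)
```

Both versions select the same rows and columns, so you can compare your tidyverse result against the book's base-R output as you go.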
Community-created solutions reference: https://github.com/asadoughi/stat-learning/
I have not evaluated the quality of these solutions. They may contain serious errors.
All students do a final project to complete the program. At a minimum, the final project must include:
Everyone does a different project, so timing will vary.
This book is recommended for everyone who intends to continue learning, because it provides an adequate mathematical foundation in probability for reading advanced works. The level of mathematics provided by OpenIntro Statistics is adequate for ISLR but is not adequate for many of the more advanced books.
Unfortunately this otherwise-excellent book has numerous serious errata which impede understanding. You will want to go through the list and mark them in your copy prior to reading. If you skip this, you will regret it.
The minimum set of chapters I recommend is 1-6 and 9.
Review linear algebra - remember at least LU, QR, and SVD decompositions. If you need a book, I liked Trefethen and Bau’s, which offers some nice geometric interpretations.
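If you want a quick refresher on two of these decompositions, the sketch below uses only base R to factor a small made-up matrix with `qr()` and `svd()` and then checks that the factors multiply back to the original. LU is left out because, as far as I know, it needs an add-on such as the Matrix package rather than a base-R function.

```r
# Small made-up full-rank matrix, for illustration only
A <- matrix(c(2, 1, 1,
              1, 3, 2,
              1, 0, 0), nrow = 3, byrow = TRUE)

# QR: A = Q R, with Q orthogonal and R upper triangular
qr_A <- qr(A)
Q <- qr.Q(qr_A)
R <- qr.R(qr_A)
all.equal(Q %*% R, A)

# SVD: A = U D V^T, with the singular values on the diagonal of D
svd_A <- svd(A)
A_rebuilt <- svd_A$u %*% diag(svd_A$d) %*% t(svd_A$v)
all.equal(A_rebuilt, A)
```

Both checks compare the product of the factors with the original matrix up to floating-point tolerance.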
Remember change of variables in calculus.
Additional R libraries which can be very useful:
See the Advanced Reading List.