Machine Learning · DS-GA 1003 · Spring 2019 · NYU Center for Data Science

Lead Instructor Julia Kempe
Co-Instructor David Rosenberg
Lecture Tuesday 5:20pm–7pm, MEYER 121 (4 Washington Pl)
Lab Wednesday 6:45pm–7:35pm, MEYER 121 (4 Washington Pl)
Office Hours Instructor: Tuesdays 4:00–5:00pm, CDS (60 5th Ave.), 6th floor, Room 620
Section Leaders: Wednesdays 8:00–9:00pm, CDS (60 5th Ave.), Room C-15
Graders: Wednesdays 1:30–2:30pm and Thursdays 12:30–1:30pm, CDS (60 5th Ave.), Room 667
Note In Weeks 5, 7, and 11, David Rosenberg will teach and hold office hours, possibly at a different time or place; any changes will be announced.

About This Course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build. A tentative syllabus can be found here.

This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA 1001 Intro to Data Science, which covers some important, fundamental data science topics (e.g. data cleaning, cross-validation, and sampling bias) that may not be explicitly covered in this class.

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers.

Other information:

Prerequisites

Grading

Homework (40%) + Midterm Exam (30%) + Final Exam (30%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.
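As a quick illustration, the weighting above is a plain weighted average; the component scores in this sketch are hypothetical, not real data:

```python
# Course grade = Homework (40%) + Midterm Exam (30%) + Final Exam (30%)
weights = {"homework": 0.40, "midterm": 0.30, "final": 0.30}

# Hypothetical component scores out of 100 (illustrative only).
scores = {"homework": 90.0, "midterm": 80.0, "final": 85.0}

course_score = sum(weights[k] * scores[k] for k in weights)
print(course_score)  # 0.4*90 + 0.3*80 + 0.3*85 = 85.5
```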

Important Dates

Resources

Textbooks

The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
This will be our main textbook for L1 and L2 regularization, trees, bagging, random forests, and boosting. It's written by three statisticians who invented many of the techniques it covers. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz and Ben-David)
Last year this was our primary reference for kernel methods and multiclass classification, and we may use it even more this year. Covers a lot of theory that we don't go into, but it would be a good supplemental resource for a more theoretical course, such as Mohri's Foundations of Machine Learning course. (Available for free as a PDF.)
Pattern Recognition and Machine Learning (Christopher Bishop)
Our primary reference for probabilistic methods, including Bayesian regression, latent variable models, and the EM algorithm. It's highly recommended, but unfortunately not free online.
Bayesian Reasoning and Machine Learning (David Barber)
A very nice resource for our topics in probabilistic modeling, and a possible substitute for the Bishop book. It would also serve as a good supplemental reference for a more advanced course in probabilistic modeling, such as DS-GA 1005: Inference and Representation. (Available for free as a PDF.)
Hands-On Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron)
This is a practical guide to machine learning that corresponds fairly well with the content and level of our course. While most of our homework is about coding ML from scratch with numpy, this book makes heavy use of scikit-learn and TensorFlow. Comfort with the first two chapters of this book would be part of the ideal preparation for this course, and it will also be a handy reference for practical projects and work beyond this course, when you'll want to make use of existing ML packages, rather than rolling your own.
Data Science for Business (Provost and Fawcett)
Ideally, this would be everybody's first book on machine learning. The intended audience is both the ML practitioner and the ML product manager. It's full of important core concepts and practical wisdom. The math is so minimal that it's perfect for reading on your phone, and I encourage you to read it in parallel to doing this class, especially if you haven't taken DS-GA 1001.
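Since the homework has you implement methods from scratch in numpy while books like Géron's lean on scikit-learn, here is a minimal sketch of that contrast for ridge regression, using synthetic data (the scikit-learn counterpart is noted in a comment rather than run):

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

# From-scratch style (as in the homework): ridge regression minimizes
# ||y - Xw||^2 + lam * ||w||^2, so solve (X^T X + lam * I) w = X^T y.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Library style (as in Géron's book) would be roughly:
#   sklearn.linear_model.Ridge(alpha=lam, fit_intercept=False).fit(X, y)

# Sanity check: at the minimizer, the gradient of the objective vanishes.
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
print(np.allclose(grad, 0, atol=1e-8))  # True
```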

Other tutorials and references

Software

Lectures

Week 0

ML Prereqs Jan 1: Slides · Notes · References

Week 1

Lecture Jan 29: Slides · Notes
Lab Jan 30: Slides · Notes · References

Week 2

Lecture Feb 5: Slides · Notes · References
  • HTF Ch. 3
Lab Feb 6: Slides

Week 3

Lecture Feb 12: Slides · Notes · References
Lab Feb 13: Slides · Notes · References

Week 4

Lecture Feb 19: Slides · Notes · References
Lab Feb 20: Slides · Notes

Week 5

Lecture Feb 26: Slides · Notes · References
Lab Feb 27: Slides

Week 6

Lecture Mar 5: Slides · Notes
Lab Mar 6: Slides

Week 7

Midterm Exam Mar 12
Lab Mar 13: Slides

Week 8

Lecture Mar 26: Slides · Notes · References
  • Barber 9.1, 18.1
  • Bishop 3.3
Midterm Solution Discussion Mar 27

Week 9

Lecture Apr 2: Slides · Notes · References
Lab Apr 3: Slides

Week 10

Lecture Apr 9: Slides · References
  • JWHT 8.1 (Trees)
  • HTF 9.2 (Trees)
Lab Apr 10: Slides

Week 11

Lecture Apr 16: Slides · Notes · References
Lab Apr 17: Slides

Week 12

Lecture Apr 23: Slides · Notes · References
Lab Apr 24: Slides · References

Week 13

Lecture Apr 30: Slides · References
Lab May 1: Slides

Week 14

Lecture May 7: Slides · References
Course Review May 8: Slides

Assignments

Late Policy: Homework is due at 11:59 PM on the date specified. Submissions will still be accepted up to 48 hours after the deadline, with a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up your solutions and code from scratch, without referring to notes from your joint session. In your solution to each problem, list the names of everyone with whom you discussed the problem; this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

  1. Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  2. Where an assignment is divided into sections, begin each section on a new page.
  3. Select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope for your score on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYU Classes.

Homework 0

Typesetting your homework

Due: January 1st, 11:59 PM

Homework 1

Due: February 9th, 11:59 PM

Homework 2

Due: February 18th, 11:59 PM

Homework 3

Due: February 25th, 11:59 PM

Homework 4

Due: March 8th, 11:59 PM

Homework 5

Due: April 5th, 11:59 PM

Homework 6

Due: April 29th, 11:59 PM

Homework 7

Due: May 10th, 11:59 PM

People

Instructors

Section Leaders

Graders