Machine Learning DS-GA 1003 / CSCI-GA 2567 · Spring 2018 · NYU Center for Data Science

Instructor David Rosenberg
Lecture Tuesday 5:20pm–7pm, GCASL C95 (238 Thompson St.)
Lab Wednesday 6:45pm–7:35pm, MEYER 121 (4 Washington Pl)
Office Hours Instructor: Wednesdays 5:00–6:00pm, CDS (60 5th Ave.), 6th floor, Room 650
Section Leader: Wednesdays 7:45–8:45pm, CDS (60 5th Ave.), Room C15
Graders: Mondays 3:30–4:30pm, CDS (60 5th Ave.), 6th floor, Room 660

About This Course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA-1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered in this DS-GA class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers.

Other information:

Prerequisites

Grading

Homework (40%) + Midterm Exam (20%) + Final Exam (20%) + Project (20%)
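As a quick illustration of the weighting, here is a tiny Python sketch with made-up component scores on a 0-100 scale:

    # Illustrative only: the course score is a weighted average of the
    # four components, using the weights above (all scores are made up).
    weights = {"homework": 0.40, "midterm": 0.20, "final": 0.20, "project": 0.20}
    scores = {"homework": 88.0, "midterm": 75.0, "final": 81.0, "project": 90.0}
    course_score = sum(weights[k] * scores[k] for k in weights)
    print(course_score)  # 84.4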

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.

Important Dates

Resources

Textbooks

The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
This will be our main textbook for L1 and L2 regularization, trees, bagging, random forests, and boosting. It's written by three statisticians who invented many of the techniques discussed. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz and Ben-David)
Last year this was our primary reference for kernel methods and multiclass classification, and we may use it even more this year. Covers a lot of theory that we don't go into, but it would be a good supplemental resource for a more theoretical course, such as Mohri's Foundations of Machine Learning course. (Available for free as a PDF.)
Pattern Recognition and Machine Learning (Christopher Bishop)
Our primary reference for probabilistic methods, including Bayesian regression, latent variable models, and the EM algorithm. It's highly recommended, but unfortunately not free online.
Bayesian Reasoning and Machine Learning (David Barber)
A very nice resource for our topics in probabilistic modeling, and a possible substitute for the Bishop book. Would serve as a good supplemental reference for a more advanced course in probabilistic modeling, such as DS-GA 1005: Inference and Representation. (Available for free as a PDF.)
Hands-On Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron)
This is a practical guide to machine learning that corresponds fairly well with the content and level of our course. While most of our homework is about coding ML from scratch with numpy, this book makes heavy use of scikit-learn and TensorFlow. Comfort with the first two chapters of this book would be part of the ideal preparation for this course, and it will also be a handy reference for your projects and work beyond this course, when you'll want to make use of existing ML packages rather than rolling your own. (A short sketch contrasting the two styles appears after this list.)
Data Science for Business (Provost and Fawcett)
Ideally, this would be everybody's first book on machine learning. The intended audience is both the ML practitioner and the ML product manager. It's full of important core concepts and practical wisdom. The math is so minimal that it's perfect for reading on your phone, and I encourage you to read it in parallel to doing this class, especially if you haven't taken DS-GA 1001.
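To give a flavor of the "coding ML from scratch with numpy" versus "using existing packages" distinction mentioned above, here is a minimal sketch (not taken from any of these books or from the homework) that fits ridge regression both ways on made-up data:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Made-up data: 100 examples, 5 features, linear signal plus noise
    X = rng.normal(size=(100, 5))
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + 0.1 * rng.normal(size=100)

    lam = 1.0  # regularization strength

    # From scratch with numpy: closed-form solution w = (X'X + lam*I)^{-1} X'y
    w_scratch = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

    # With scikit-learn: fit_intercept=False matches the formula above
    w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

    print(np.allclose(w_scratch, w_sklearn))  # True, up to solver tolerance

Both compute the same estimator; the point of the from-scratch homework is to understand what a call like Ridge is doing under the hood.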

Other tutorials and references

Software

Lectures

Week 0

ML Prereqs Jan 1 · Slides · Notes · References

Week 1

Lecture Jan 23 · Slides · Notes · References

Lab Jan 24 · Slides · Notes · References

Week 2

Lecture Jan 30 · Slides · Notes · References: HTF Ch. 3

Lab Jan 31 · References: HTF 3.4

Week 3

Lecture Feb 6 · Slides · Notes · References

Lab Feb 7 · Slides

Week 4

Lecture Feb 13 · Slides · Notes · References

Lab Feb 14 · Slides · Notes · References

Week 5

Lecture Feb 20 · Slides · Notes · References

Lab Feb 21 · Slides

Week 6

Lecture Feb 27 · Slides · Notes

Lab Feb 28 · Slides

Week 7

Midterm Exam Mar 6

Project Adviser Meetings Mar 7

Week 8

Lecture Mar 20 · Slides · Notes · References: Barber 9.1, 18.1; Bishop 3.3

Canceled for snow Mar 21

Week 9

Lecture Mar 27 · Slides · Notes · References: Barber 9.1, 18.1; Bishop 3.3

Lab Mar 28 · Slides · Notes · References

Week 10

Lecture Apr 3 · Slides · References: JWHT 8.1 (Trees); HTF 9.2 (Trees)

Lab Apr 4 · Slides · References: JWHT 5.2 (Bootstrap); HTF 7.11 (Bootstrap)

Week 11

Lecture Apr 10 · Slides · Notes · References

Lab Apr 11 · Slides · Notes · References

Week 12

Lecture Apr 17 · Slides · References

Project Adviser Meetings Apr 18

Week 13

Lecture Apr 24 · Video · Slides · References

Course Review Apr 25

Week 14

Lecture May 1 · Slides · References

Project Adviser Meetings May 2

Assignments

Late Policy: Homework is due at 10pm on the date specified. Submissions will still be accepted for 48 hours after this time, but with a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the name of every person with whom you discussed the problem; this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page and the assignment submission form. To submit assignments, you will need to:

  1. Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  2. Where homework assignments are divided into sections, please begin each section on a new page.
  3. You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope to get your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYU Classes.

Homework 0

Typesetting your homework

Due: January 1st, 10pm
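If you haven't typeset math before, a skeleton along the following lines is enough to get started. This is an illustrative sketch assuming you use LaTeX; it is not an official template:

    \documentclass{article}
    \usepackage{amsmath,amssymb}
    \begin{document}

    \section*{Problem 1}
    The ridge regression objective for $w \in \mathbb{R}^{d}$ is
    \[
    J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(w^{\top}x_{i} - y_{i}\right)^{2}
           + \lambda \lVert w \rVert_{2}^{2}.
    \]

    \end{document}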

Homework 1

GD, SGD, and Ridge Regression

Due: February 1st, 10pm

Homework 2

Lasso Regression

Due: February 13th, 10pm

Homework 3

SVM and Sentiment Analysis

Due: February 22nd, 10pm

Homework 4

Kernel Methods

Due: March 2nd, 10pm

Homework 5

Probabilistic Modeling

Due: April 9th, 10pm

Homework 6

Multiclass, Trees, and Gradient Boosting

Due: April 23rd, 10pm

Homework 7

Computation Graphs, Backprop, and Neural Networks

Due: May 11th, 10pm

Project

Overview

The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so it has some importance even beyond this class. That said, it's better to pick a project that you will be able to go deep with (in terms of trying different methods, feature engineering, error analysis, etc.), than choosing a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

Key Dates

Guidelines for Project Topics

A good project for this class is one that's a real "problem", in the sense that you have something you want to accomplish, and the best approach is not necessarily clear from the beginning. The techniques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering those topics until later in the semester.

To be clear, the following approaches would be less than ideal:

  1. Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
  2. Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
  3. Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked in this field are the ones who advise against this most strongly.

Project proposal guidelines

The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

  1. Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court case, ...).
  2. Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
  3. How will you evaluate performance? In certain settings, you may want to try a few different performance measures.
  4. Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set (see the sketch after this list for one example). Ideally, you will be able to report the performance of a couple of baseline algorithms in your proposal. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
  5. Describe the methods you plan to try for solving your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.
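As a concrete example of the baselines in item 4, a majority-class baseline takes only a few lines; this sketch uses scikit-learn's DummyClassifier on made-up labels:

    import numpy as np
    from sklearn.dummy import DummyClassifier

    # Made-up labels for a binary classification problem
    y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])
    y_test = np.array([0, 1, 0, 0])

    # Majority-class baseline: always predict the most common training label.
    # DummyClassifier ignores the feature values, so placeholder zeros suffice.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(np.zeros((len(y_train), 1)), y_train)
    preds = baseline.predict(np.zeros((len(y_test), 1)))
    print((preds == y_test).mean())  # 0.75 here: the number your model must beat

Reporting a number like this in your proposal makes any later improvement easy to interpret.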

Project writeup guidelines

The main objective of the project writeup is to explain what you did in a self-contained report. There are no strict guidelines on the format of the report, but the goal is to make it something you'd be proud to share with a potential employer. Some of the content will resemble your project proposals. Make sure to:

  1. Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court case, ...).
  2. Identify the data set or data sets that you used. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
  3. How did you evaluate performance and measure success?
  4. Explain what you used for features and describe any feature engineering that you did.
  5. What did you do to attempt to improve performance over your baseline algorithms (e.g. error analysis, new features, parameter tuning, ...)?
  6. What challenges did you encounter? What insights into your problem did you get?
  7. What would be good next steps to take if you were to continue this work?
  8. If you got ideas from other sources, please cite them.

Some Previous Projects

Some Public Data Sets (just to get you thinking)

People

Instructor


David Rosenberg

David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

Section Leader


Ben Jakubowski

Ben is a 2017 NYU Data Science MS graduate. He currently works as a data scientist for the University of Chicago's Crime Lab New York (CLNY), where his portfolio includes several prediction problems that arise in criminal justice and social policy.

Graders

Project Advisers