Machine Learning and Computational Statistics DS-GA 1003 · Spring 2017 · NYU Center for Data Science

Instructor David Rosenberg
Lecture Tuesday 5:20pm–7pm, GSACL C95 (238 Thompson St.)
Lab Wednesday 8:35pm–9:25pm, GSACL C95 (238 Thompson St.)
Office Hours Instructor: Wednesdays 5:30pm–6:30pm (ARRIVE BEFORE 6pm), CDS (60 5th Ave.), 6th floor, Room 650
TA: Fridays 5pm–6pm, CDS (60 5th Ave.), 6th floor, Room 650
Graders: Tuesdays 1pm–3pm, CDS (60 5th Ave.), 6th floor, Room 606

This week

Ensemble methods.

Homework 4

Kernel Methods and Lagrangian Duality

Due: March 27th, 10pm

About This Course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA 1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered in this class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers.

Other information:

Prerequisites

Grading

Homework (35%) + One-Hour Test (15%) + Two-Hour Test (30%) + Project (20%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.
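To make the weighting concrete, here is a small sketch (in Python, purely illustrative; the function and component names are not part of the course materials) of how the four graded components combine, assuming each component score is on a 0–100 scale:

```python
# Illustrative only: combines component scores using the weights
# stated in the Grading section of the syllabus.
WEIGHTS = {
    "homework": 0.35,
    "one_hour_test": 0.15,
    "two_hour_test": 0.30,
    "project": 0.20,
}

def course_grade(scores):
    """Weighted average of the four graded components (0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: strong homework and project, weaker tests.
print(course_grade({"homework": 90, "one_hour_test": 80,
                    "two_hour_test": 85, "project": 95}))  # 88.0
```

Note that this sketch does not model the optional-problem boost described above, which is applied at the instructor's discretion for borderline grades.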

Important Dates

Resources

Textbooks

The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
This will be our main textbook for L1 and L2 regularization, trees, bagging, random forests, and boosting. It's written by three statisticians who invented many of the techniques discussed. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz and Ben-David)
Last year this was our primary reference for kernel methods and multiclass classification, and we may use it even more this year. Covers a lot of theory that we don't go into, but it would be a good supplemental resource for a more theoretical course, such as Mohri's Foundations of Machine Learning course. (Available for free as a PDF.)
Pattern Recognition and Machine Learning (Christopher Bishop)
Our primary reference for probabilistic methods, including Bayesian regression, latent variable models, and the EM algorithm. It's highly recommended, but unfortunately not free online.
Bayesian Reasoning and Machine Learning (David Barber)
A very nice resource for our topics in probabilistic modeling, and a possible substitute for the Bishop book. Would serve as a good supplemental reference for a more advanced course in probabilistic modeling, such as DS-GA 1005: Inference and Representation. (Available for free as a PDF.)

Other tutorials and references

Software

Lectures

Week 1


Lecture Jan 24

Slides

Notes

References

Lab Jan 25

Slides

Notes

References

Week 2


Lecture Jan 31

Slides

Notes

References

Lab Feb 1

Slides

Notes

(None)

References

Week 3


Lecture Feb 7

Slides

Notes

References

  • HTF 3.4

Lab Feb 8

Slides

Notes

References

(None)

Week 4


Lecture Feb 14

Slides

Notes

References

Lab Feb 15

Slides

Notes

References

Week 5


Lecture Feb 21

Slides

Notes

References

Lab Feb 22

Slides

Notes

References

(None)

Week 6


Lecture Feb 28

Slides

Notes

(None)

References

  • SSBD Chapter 16

One-Hour Test Mar 1

Slides

(None)

Notes

(None)

References

(None)

Week 7


Lecture Mar 7

Slides

Notes

References

Project Adviser Meetings Mar 8

Slides

(None)

Notes

(None)

References

(None)

Week 8


Lecture Mar 21

Slides

Notes

(None)

References

  • JWHT 8.1
  • HTF 9.2

Lab Mar 22

Slides

Notes

(None)

References

  • JWHT 5.2
  • HTF 7.11

Week 9


Lecture Mar 28

Slides

Notes

(None)

References

Assignments

Late Policy: Homeworks are due at 10pm on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must list the names of everyone with whom you discussed the problem; this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

  1. Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  2. Where homework assignments are divided into sections, please begin each section on a new page.
  3. You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope for your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYU Classes.

Homework 1

GD, SGD, and Ridge Regression

Due: February 5th, 10pm

Homework 2

Lasso Regression

Due: February 13th, 10pm

Homework 3

SVM and Sentiment Analysis

Due: February 23rd, 10pm

Homework 4

Kernel Methods and Lagrangian Duality

Due: March 27th, 10pm

Project

Overview

The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so the project has some importance even beyond this class. That said, it's better to pick a project that you will be able to go deep with (in terms of trying different methods, feature engineering, error analysis, etc.) than to choose a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

Key Dates

Guidelines for Project Topics

A good project for this class is one that poses a real "problem", in the sense that you have something you want to accomplish, and it's not necessarily clear from the beginning what the best approach is. The techniques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering those topics until later in the semester.

To be clear, the following approaches would be less than ideal:

  1. Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
  2. Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
  3. Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked in this field are the ones who advise against this most strongly.

Project proposal guidelines

The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

  1. Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court case, ...).
  2. Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
  3. How will you evaluate performance? In certain settings, you may want to try a few different performance measures.
  4. Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set. Ideally, you will be able to report the performance of a couple baseline algorithms in your proposal. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
  5. Describe the methods you plan to try to solve your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.
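As a concrete illustration of the baselines in item 4 above, here is a minimal sketch of a majority-class baseline for a classification problem (the class and method names are illustrative, not from any course-provided code):

```python
# A minimal majority-class baseline: always predict the most
# frequent label observed in the training data. Your real model
# should beat this baseline's performance.
from collections import Counter


class MajorityClassBaseline:
    """Always predicts the most common training label."""

    def fit(self, y_train):
        # Find the single most frequent label.
        self.majority_ = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Ignore the inputs entirely; emit the majority label n times.
        return [self.majority_] * n


# Example: on labels that are 75% "neg", the baseline predicts "neg"
# for everything, so its accuracy on similarly distributed data is
# roughly 0.75 -- the number to beat.
y = ["neg", "neg", "neg", "pos"]
baseline = MajorityClassBaseline().fit(y)
print(baseline.predict(2))  # ['neg', 'neg']
```

Reporting a number like this in your proposal immediately tells your adviser how hard the problem is: if the majority class already achieves 95% accuracy, raw accuracy is probably the wrong success metric.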

Some Previous Projects

Some Public Data Sets (just to get you thinking)

People

Instructor


David Rosenberg

David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

Teaching Assistants


Brett Bernstein

Brett is a third-year PhD student in the Math department at Courant, working with Prof. Carlos Fernandez-Granda.


Vladimir Kobzar

Vlad is a math graduate student at Courant Institute, where he works on algorithms at the intersection of mathematics and machine learning. He is also a lawyer and was previously an Executive Director at Goldman Sachs.

Graders

Project Advisers