Machine Learning and Computational Statistics DS-GA 1003 · Spring 2016 · NYU Center for Data Science

Instructor David Rosenberg
Lecture Wednesdays 7:10pm–9pm, Warren Weaver Hall 109
Lab Thursdays 7:10pm–8pm, Warren Weaver Hall 109
Office Hours Instructor: Thursdays 8pm–9pm, Warren Weaver Hall 109
TA: Wednesdays 2pm–3pm, Warren Weaver Hall 605
Graders: Tuesdays 2pm–4pm in the CDS common area

      About This Course

      This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

      This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. Note that this class is intended as a continuation of DS-GA-1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered here (e.g. data cleaning, cross-validation, and sampling bias).

      This term we will be using Piazza for class discussion. It is designed to get you help quickly and efficiently from classmates, the TA, graders, and the instructor. Rather than emailing questions to the teaching staff, you are encouraged to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

      Other information:

      Prerequisites

      Grading

      Homework (40%) + One-Hour Test (15%) + Two-Hour Test (25%) + Project (20%)

      Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions when deciding whether to boost a borderline grade.
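
      To make the weighting concrete, here is a small Python sketch of how the four components combine. The component scores below are made up for illustration; any optional-problem or participation boost described above is applied separately, at the instructor's discretion.

      # Hypothetical component scores, each on a 0-100 scale.
      weights = {"homework": 0.40, "one_hour_test": 0.15,
                 "two_hour_test": 0.25, "project": 0.20}
      scores = {"homework": 90.0, "one_hour_test": 80.0,
                "two_hour_test": 80.0, "project": 90.0}

      # Final score is the weighted average of the four components.
      final_score = sum(weights[k] * scores[k] for k in weights)
      print(f"Final course score: {final_score:.1f}/100")  # 86.0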

      Important Dates

      Resources

      Textbooks

      The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
      This will be our principal textbook for the first part of the course. It's written by three statisticians who invented many of the techniques discussed. Despite its popularity and the pretty pictures, this is not an easy book. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
      An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
      This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
      Bayesian Reasoning and Machine Learning (David Barber)
      We'll use this as a reference for probabilistic modeling, including Bayesian methods, and Bayesian networks. (Available for free as a PDF.)
      Pattern Recognition and Machine Learning (Christopher Bishop)
      This book is another very nice reference for probabilistic models and beyond. It's highly recommended.
      Machine Learning: A Probabilistic Perspective (Kevin P. Murphy)
      This book covers an unusually broad set of topics, including recent advances in the field. As such, it's a great reference to have, particularly if you continue your study of data science beyond this course. That said, it was the required textbook for this course in 2015, and many students found it a bit overwhelming. It's really intended as a comprehensive, PhD-level textbook.
      Convex Optimization (Boyd and Vandenberghe)
      This book was an instant hit in the machine learning community when it was published in 2004. We will be making light use of this book, mostly for its coverage of Lagrangians and duality. However, it's a good book to get familiar with, as it's very well written, and it covers a lot of techniques used in more advanced machine learning literature. (Available for free as a PDF.)

      Other tutorials and references

      (If you find additional references that you recommend, please share them on Piazza and we can add them here.)

      Software

      Lectures

      Week 1

      Lecture Jan 27 Video

      • Course mechanics
      • Statistical learning theory framework
      • Gradient and stochastic gradient descent

      Lab Jan 28 Video

      • Matrix differentiation (Levent Sagun)

      Week 2

      Lecture Feb 3

      • Excess Risk Decomposition
      • L1/L2 regularization
      • Optimization methods for Lasso

      Lab Feb 4

      • Elastic Net
      • Directional Derivatives and Optima

      Week 3

      Lecture Feb 10

      • Loss Functions
      • Convex Optimization
      • SVM

      Lab Feb 11

      • Projections
      • SVM (Geometric Motivation)

      References

      • HTF 3.2.0, 4.5
      • HTF pp. 417-419

      Week 4

      Lecture Feb 17

      • SGD and GD Revisited
      • Subgradient descent
      • Features

      Lab Feb 18

      • Kernel Methods

      References

      • HTF 12.3.1

      Week 5

      Lecture Feb 24

      • Hilbert Spaces and Projections
      • Kernel Methods

      Lab Feb 25

      • Regression Trees

      References

      • HTF 9.2

      Week 6

      Lecture Mar 2

      • Classification Trees
      • Review for One-Hour Test

      References

      • HTF 9.2

      One-Hour Test Mar 3

      Week 7

      Lecture Mar 9

      • Bootstrap and Bagging
      • Random Forests
      • Boosting

      References

      • HTF 8.7
      • HTF Ch. 15
      • HTF Ch. 10

      Project Adviser Meetings Mar 10

      Week 8

      Lecture Mar 23

      • Boosting
      • Gradient Boosting

      Lab Mar 24

      • Variations on Gradient Boosting

      Week 9

      Lecture Mar 30

      • Multiclass Classification

      Lab Mar 31

      • Midterm Exam Recap

      Week 10

      Lecture Apr 6 Video

      • Conditional Probability Models

      Lab Apr 7 Video

      • Test Review

      Week 11

      Two-Hour Test Apr 13

      Project Adviser Meetings Apr 14

      Week 12

      Lecture Apr 20 Video

      • Conditional Independence
      • Bayesian Networks
      • Naive Bayes
      • Bayesian Methods

      Lab Apr 21 Video

      • Second Test Recap

      Week 13

      Lecture Apr 27

      • Bayesian Regression
      • k-means Clustering
      • Gaussian Mixture Models

      Lab Apr 28

      • EM Algorithm

      Week 14

      Lecture May 4

      • EM Algorithm (continued)
      • Neural Networks

      Project Adviser Meetings May 5

      Week 15

      Poster Session (in CDS, 6-8pm) May 11

      Assignments

      Homework Submission: Homework should be submitted through NYU Classes.

      Late Policy: Homeworks are due at 6pm on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.
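
      As a concrete reading of the late policy, here is a minimal Python sketch. The function and the example scores are hypothetical, not part of any course infrastructure, and the assumption that submissions more than 48 hours late receive no credit is one reading of "will still be accepted for 48 hours".

      from datetime import datetime, timedelta

      def adjusted_score(raw_score, due, submitted):
          # Full credit through the deadline.
          if submitted <= due:
              return raw_score
          # Accepted up to 48 hours late, with a 20% penalty.
          if submitted <= due + timedelta(hours=48):
              return raw_score * 0.80
          # Assumption: not accepted after the 48-hour window.
          return 0.0

      # Example: homework due February 5th at 6pm, submitted the next day at noon.
      due = datetime(2016, 2, 5, 18, 0)
      print(adjusted_score(90.0, due, datetime(2016, 2, 6, 12, 0)))  # 72.0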

      Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the names of any person with whom you discussed the problem—this will not affect your grade.

      Homework 1
      Ridge regression and SGD
      Due: February 5th, 6pm

      Homework 2
      Lasso regression
      Due: February 16th, 6pm

      Homework 3
      SVM and Sentiment Analysis
      Due: February 29th, 6pm

      Homework 4
      Linear Algebra, Kernels, Duality, and Trees
      Due: March 22nd, 6pm

      Homework 5
      Trees and Ensemble Methods
      Due: April 4th, 6pm

      Homework 6
      Multiclass Hinge Loss and Multiclass SVM
      Due: April 11th, 6pm

      Homework 7
      Bayesian Methods and the Beta/Binomial Model
      Due: May 10th, 6pm

      Project

      Overview

      The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so the project has some importance even beyond this class. That said, it's better to pick a project that you can go deep with (trying different methods, feature engineering, error analysis, etc.) than to choose a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

      Key Dates

      Guidelines for Project Topics

      A good project for this class is a real "problem", in the sense that you have something you want to accomplish, and it's not clear from the outset what the best approach will be. The techniques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering those topics until later in the semester.

      To be clear, the following approaches would be less than ideal:

      1. Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
      2. Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
      3. Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked at hedge funds are the ones who advise against this most strongly.

      Project proposal guidelines

      The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

      1. Clearly explain the high-level problem you are trying to solve. (e.g. Predict movie ratings, predict the outcome of a court case, find a low-dimensional characterization of input examples.)
      2. Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
      3. Explain how you will evaluate performance. In certain settings, you may want to try a few different performance measures.
      4. Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set (see the sketch after this list). Ideally, you will be able to report the performance of a couple of baseline algorithms in your proposal, though this is not necessary. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
      5. Describe the methods you plan to try to solve your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.
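
      To illustrate the simplest baseline mentioned in item 4, here is a minimal majority-class classifier in Python. The class and the toy data are hypothetical; scikit-learn's DummyClassifier(strategy="most_frequent") provides the same behavior off the shelf.

      from collections import Counter

      class MajorityClassBaseline:
          # Predict the most common label seen during training,
          # ignoring the features entirely.

          def fit(self, X, y):
              self.majority_ = Counter(y).most_common(1)[0][0]
              return self

          def predict(self, X):
              return [self.majority_ for _ in X]

      # Toy usage with made-up labels; accuracy equals the majority-class rate.
      y_train = ["spam", "ham", "ham", "ham", "spam"]
      baseline = MajorityClassBaseline().fit([None] * 5, y_train)
      print(baseline.predict([None, None]))  # ['ham', 'ham']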

      Some Public Data Sets (just to get you thinking)

      People

      Instructor

      David Rosenberg

      David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

      Teaching Assistant

      Levent Sagun

      Levent is a PhD student at the Courant Institute of Mathematical Sciences.

      Graders

      Project Advisers