Machine Learning and Computational Statistics DS-GA 1003 · Spring 2017 · NYU Center for Data Science

Instructor David Rosenberg
Lecture Tuesday 5:20pm–7pm, GSACL C95 (238 Thompson St.)
Lab Wednesday 8:35pm–9:25pm, GSACL C95 (238 Thompson St.)
Office Hours Instructor: Wednesdays 5:30pm–6:30pm (ARRIVE BEFORE 6pm), CDS (60 5th Ave.), 6th floor, Room 650
TA: Fridays 5pm–6pm, CDS (60 5th Ave.), 6th floor, Room 650
Graders: Tuesdays 1pm–3pm, CDS (60 5th Ave.), 6th floor, Room 606

This week

Ensemble methods.

Homework 4

Kernel Methods and Lagrangian Duality

Due: March 27th, 10pm

About This Course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA 1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered in this class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers.

Other information:

Prerequisites

Grading

Homework (35%) + One-Hour Test (15%) + Two-Hour Test (30%) + Project (20%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.
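To make the weighting concrete, here is a small sketch (in Python, purely illustrative; the function and component names are not part of the course materials) of how the four graded components combine, assuming each component score is on a 0–100 scale:

```python
# Illustrative only: combines component scores using the weights
# stated in the Grading section of the syllabus.
WEIGHTS = {
    "homework": 0.35,
    "one_hour_test": 0.15,
    "two_hour_test": 0.30,
    "project": 0.20,
}

def course_grade(scores):
    """Weighted average of the four graded components (0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: strong homework and project, weaker tests.
print(course_grade({"homework": 90, "one_hour_test": 80,
                    "two_hour_test": 85, "project": 95}))  # 88.0
```

Note that this sketch does not model the optional-problem boost described above, which is applied at the instructor's discretion for borderline grades.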

Important Dates

Resources

Textbooks

The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
This will be our main textbook for L1 and L2 regularization, trees, bagging, random forests, and boosting. It's written by three statisticians who invented many of the techniques discussed. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz and Ben-David)
Last year this was our primary reference for kernel methods and multiclass classification, and we may use it even more this year. Covers a lot of theory that we don't go into, but it would be a good supplemental resource for a more theoretical course, such as Mohri's Foundations of Machine Learning course. (Available for free as a PDF.)
Pattern Recognition and Machine Learning (Christopher Bishop)
Our primary reference for probabilistic methods, including Bayesian regression, latent variable models, and the EM algorithm. It's highly recommended, but unfortunately not free online.
Bayesian Reasoning and Machine Learning (David Barber)
A very nice resource for our topics in probabilistic modeling, and a possible substitute for the Bishop book. Would serve as a good supplemental reference for a more advanced course in probabilistic modeling, such as DS-GA 1005: Inference and Representation. (Available for free as a PDF.)

Other tutorials and references

Software

Lectures

Week 1


Lecture Jan 24

Slides

Notes

References

Lab Jan 25

Slides

Notes

References

Week 2


Lecture Jan 31

Slides

Notes

References

Lab Feb 1

Slides

Notes

(None)

References

Week 3


Lecture Feb 7

Slides

Notes

References

  • HTF 3.4

Lab Feb 8

Slides

Notes

References

(None)

Week 4


Lecture Feb 14

Slides

Notes

References

Lab Feb 15

Slides

Notes

References

Week 5


Lecture Feb 21

Slides

Notes

References

Lab Feb 22

Slides

Notes

References

(None)

Week 6


Lecture Feb 28

Slides

Notes

(None)

References

  • SSBD Chapter 16

One-Hour Test Mar 1

Slides

(None)

Notes

(None)

References

(None)

Week 7


Lecture Mar 7

Slides

Notes

References

Project Adviser Meetings Mar 8

Slides

(None)

Notes

(None)

References

(None)

Week 8


Lecture Mar 21

Slides

Notes

(None)

References

  • JWHT 8.1
  • HTF 9.2

Lab Mar 22

Slides

Notes

(None)

References

  • JWHT 5.2
  • HTF 7.11

Week 9


Lecture Mar 28

Slides

Notes

(None)

References

Assignments

Late Policy: Homeworks are due at 10pm on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must list the names of everyone with whom you discussed the problem; this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

  1. Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  2. Where homework assignments are divided into sections, please begin each section on a new page.
  3. You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope for your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYU Classes.

Homework 1

GD, SGD, and Ridge Regression

Due: February 5th, 10pm

Homework 2

Lasso Regression

Due: February 13th, 10pm

Homework 3

SVM and Sentiment Analysis

Due: February 23rd, 10pm

Homework 4

Kernel Methods and Lagrangian Duality

Due: March 27th, 10pm

Project

Overview

The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so the project has some importance even beyond this class. That said, it's better to pick a project that you will be able to go deep with (in terms of trying different methods, feature engineering, error analysis, etc.) than to choose a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

Key Dates

Guidelines for Project Topics

A good project for this class is one that poses a real "problem", in the sense that you have something you want to accomplish, and it's not necessarily clear from the beginning what the best approach is. The techniques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering those topics until later in the semester.

To be clear, the following approaches would be less than ideal:

  1. Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
  2. Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
  3. Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked in this field are the ones who advise against this most strongly.

Project proposal guidelines

The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

  1. Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court case, ...).
  2. Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
  3. How will you evaluate performance? In certain settings, you may want to try a few different performance measures.
  4. Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set. Ideally, you will be able to report the performance of a couple baseline algorithms in your proposal. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
  5. Describe the methods you plan to try to solve your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.
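As a concrete illustration of the baselines in item 4 above, here is a minimal sketch of a majority-class baseline for a classification problem (the class and method names are illustrative, not from any course-provided code):

```python
# A minimal majority-class baseline: always predict the most
# frequent label observed in the training data. Your real model
# should beat this baseline's performance.
from collections import Counter


class MajorityClassBaseline:
    """Always predicts the most common training label."""

    def fit(self, y_train):
        # Find the single most frequent label.
        self.majority_ = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Ignore the inputs entirely; emit the majority label n times.
        return [self.majority_] * n


# Example: on labels that are 75% "neg", the baseline predicts "neg"
# for everything, so its accuracy on similarly distributed data is
# roughly 0.75 -- the number to beat.
y = ["neg", "neg", "neg", "pos"]
baseline = MajorityClassBaseline().fit(y)
print(baseline.predict(2))  # ['neg', 'neg']
```

Reporting a number like this in your proposal immediately tells your adviser how hard the problem is: if the majority class already achieves 95% accuracy, raw accuracy is probably the wrong success metric.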

Some Previous Projects

Some Public Data Sets (just to get you thinking)

People

Instructor


David Rosenberg

David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

Teaching Assistants


Brett Bernstein

Brett is a third-year PhD student in the Math department at Courant, working with Prof. Carlos Fernandez-Granda.


Vladimir Kobzar

Vlad is a math graduate student at Courant Institute, where he works on algorithms at the intersection of mathematics and machine learning. He is also a lawyer and was previously an Executive Director at Goldman Sachs.

Graders

Project Advisers