Instructor	David Rosenberg
Lecture	Tuesday 5:20pm–7pm, GSACL C95 (238 Thompson St.)
Lab	Wednesday 6:45pm–7:35pm, MEYER 121 (4 Washington Pl)
Office Hours	Instructor: Wednesdays 5:00-6:00pm CDS (60 5th Ave.), 6th floor, Room 650
Section Leader: Wednesdays 7:45-8:45pm, CDS (60 5th Ave.) Room C15
Graders: Mondays 3:30-4:30pm CDS (60 5th Ave.), 6th floor, Room 660

About This Course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

This course was designed as part of the core curriculum for the Center for Data Science's Masters degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA-1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered in this DS-GA class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers.

Other information:

All course materials are stored in a GitHub repository. Check the repository to see when something was last updated.
For registration information, please contact Kathryn Angeles.
The course conforms to NYU’s policy on academic integrity for students.

Prerequisites

DS-GA-1001: Intro to Data Science or its equivalent
DS-GA-1002: Statistical and Mathematical Methods or its equivalent
Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate calculus (primarily differential calculus), probability theory, and statistics. (The coverage in the 2015 version of DS-GA 1002, linked above, is sufficient.)
Python programming required for most homework assignments.
Recommended: Computer science background up to a "data structures and algorithms" course
Recommended: At least one advanced, proof-based mathematics course
Some prerequisites may be waived with permission of the instructor
You can also self-assess your preparation by filling out the Prerequisite Questionnaire

Grading

Homework (40%) + Midterm Exam (20%) + Final Exam (20%) + Project (20%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.

Important Dates

Midterm Exam (100 min) Tuesday, March 6th, 5:20–7pm.
Final Exam (100 min) Tuesday, May 15th, 6-7:50pm (confirmed).
See Assignments section for homework-related deadlines.
See Project section for project-related deadlines.

Resources

Textbooks

The cover of Elements of Statistical Learning

The cover of An Introduction to Statistical Learning

The cover of Understanding Machine Learning: From Theory to Algorithms

The cover of Pattern Recognition and Machine Learning

The cover of Bayesian Reasoning and Machine Learning

The Elements of Statistical Learning (Hastie, Friedman, and Tibshirani): This will be our main textbook for L1 and L2 regularization, trees, bagging, random forests, and boosting. It's written by three statisticians who invented many of the techniques discussed. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani): This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz and Ben-David): Last year this was our primary reference for kernel methods and multiclass classification, and we may use it even more this year. Covers a lot of theory that we don't go into, but it would be a good supplemental resource for a more theoretical course, such as Mohri's Foundations of Machine Learning course. (Available for free as a PDF.)
Pattern Recognition and Machine Learning (Christopher Bishop): Our primary reference for probabilistic methods, including bayesian regression, latent variable models, and the EM algorithm. It's highly recommended, but unfortunately not free online.
Bayesian Reasoning and Machine Learning (David Barber): A very nice resource for our topics in probabilistic modeling, and a possible substitute for the Bishop book. Would serve as a good supplemental reference for a more advanced course in probabilistic modeling, such as DS-GA 1005: Inference and Representation (Available for free as a PDF.)

Hands-On Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron)

This is a practical guide to machine learning that corresponds fairly well with the content and level of our course. While most of our homework is about coding ML from scratch with numpy, this book makes heavy use of scikit-learn and TensorFlow. Comfort with the first two chapters of this book would be part of the ideal preparation for this course, and it will also be a handy reference for your projects and work beyond this course, when you'll want to make use of existing ML packages, rather than rolling your own.

Data Science for Business (Provost and Fawcett)

Ideally, this would be everybody's first book on machine learning. The intended audience is both the ML practitioner and the ML product manager. It's full of important core concepts and practical wisdom. The math is so minimal that it's perfect for reading on your phone, and I encourage you to read it in parallel to doing this class, especially if you haven't taken DS-GA 1001.

Software

NumPy is "the fundamental package for scientific computing with Python." Our homework assignments will use NumPy arrays extensively.
scikit-learn is a comprehensive machine learning toolkit for Python. We won't use this for most of the homework assignments, since we'll be coding things from scratch. However, you may want to run the scikit-learn version of the algorithms to check that your own outputs are correct. Most people will use it for their final projects. Also, studying the source code can be a good learning experience.

	Slides	Notes	References
ML Prereqs Jan 1	Slides Black Box Machine Learning Evaluating Classifiers	Notes Géron's Machine Learning Landscape Géron's End-to-End Machine Learning	References Provost and Fawcett book Géron Ch 1,2

	Slides	Notes	References
Lecture Jan 23	Slides Course Logistics and Overview Statistical Learning Theory	Notes Conditional Expectations SLT and SGD Concept Check Questions SLT and SGD Concept Check Solutions	References Bottou's SGD Tricks
Lab Jan 24	Slides Gradients and Directional Derivatives	Notes Gradients and Directional Derivatives Gradient Concept Check Questions Gradient Concept Check Solutions Directional Derivatives and Approximation (Short)	References Barnes "Matrix Differentiation" notes

	Slides	Notes	References
Lecture Jan 30	Slides Stochastic Gradient Descent Excess Risk Decomposition L1 and L2 regularization	Notes PreLecture Concept Check Questions PreLecture Concept Check Solutions Completing the Square Excess Risk and L1/L2 Questions Excess Risk and L1/L2 Solutions	References HTF Ch. 3
Lab Jan 31	Slides (None)	Notes (None)	References HTF 3.4

	Slides	Notes	References
Lecture Feb 6	Slides Elastic Net and Correlation Loss Functions	Notes Elastic Net correlation theorem Lasso and Elastic Net (ipynb)	References HTF 3.4 Mairal, Bach, and Ponce on Sparse Modeling
Lab Feb 7	Slides Subgradient Descent	Notes (None)	References (None)

	Slides	Notes	References
Lecture Feb 13	Slides Subgradient Descent (continued) SVM Insights from Duality The Representer Theorem Lagrangian Duality in 10 Minutes (Optional))	Notes Subgradients Pre-lecture warmup for SVM and Lagrangians SVM Insights from Duality Subgradients and Lagrangian Duality Questions Subgradients and Lagrangian Duality Solutions	References Boyd subgradient notes Support Vector Machines (Optional) Extreme Abridgement of BV (Optional)
Lab Feb 14	Slides Features	Notes Simplest Example Ingesting text with BOW Polynomial features Vector quantization with k-means	References Feature Engineering for Machine Learning by Casari and Zheng

	Slides	Notes	References
Lecture Feb 20	Slides Kernel Methods	Notes Kernel Concept Check Questions Kernel Concept Check Solutions	References SSBD Chapter 16 A Survey of Kernels for Structured Data
Lab Feb 21	Slides CitySense Maximum Likelihood	Notes (None)	References (None)

	Slides	Notes	References
Lecture Feb 27	Slides Conditional Probability Models	Notes Exponential Distribution Example (First part) Conditional Model Questions Conditional Model Solutions	References (None)
Lab Feb 28	Slides Review for Midterm	Notes (None)	References (None)

	Slides	Notes	References
Midterm Exam Mar 6	Slides (None)	Notes (None)	References (None)
Project Adviser Meetings Mar 7	Slides (None)	Notes (None)	References (None)

	Slides	Notes	References
Lecture Mar 20	Slides Bayesian Methods Bayesian Conditional Models	Notes Proportionality Review Thompson Sampling for Bernoulli Bandits [Optional]	References Barber 9.1, 18.1 Bishop 3.3
Canceled for snow Mar 21	Slides (None)	Notes (None)	References (None)

	Slides	Notes	References
Lecture Mar 27	Slides Bayesian Conditional Models	Notes Proportionality Review Multivariate Gaussians (draft) Bayesian Linear Regression (draft) Bayesian Methods and Regression Questions Bayesian Methods and Regression Solutions	References Barber 9.1, 18.1 Bishop 3.3
Lab Mar 28	Slides Multiclass	Notes Multiclass Questions Multiclass Solutions	References SSBD Chapter 17 In Defense of One-Vs-All Classification Reducing Multiclass to Binary

	Slides	Notes	References
Lecture Apr 3	Slides Classification and Regression Trees	Notes (None)	References JWHT 8.1 (Trees) HTF 9.2 (Trees)
Lab Apr 4	Slides Intro to the Bootstrap	Notes (None)	References JWHT 5.2 (Bootstrap) HTF 7.11 (Bootstrap)

	Slides	Notes	References
Lecture Apr 10	Slides Bagging and Random Forests Adaboost [Optional]	Notes Trees, Bootstrap, Bagging, and RF Questions Trees, Bootstrap, Bagging, and RF Solutions	References JWHT 8.2 (Bagging/RF) HTF 8.7, 15, 10 (Bagging/RF) A Conversation with Jerry Friedman
Lab Apr 11	Slides Gradient Boosting	Notes Gradient Boosting Boosting Questions Boosting Solutions gbm.py Exponential Distribution Gradient Boosting Poisson Gradient Boosting	References Friedman's GBM Paper Ridgeway's GBM Guide XGBoost Paper

	Slides	Notes	References
Lecture Apr 17	Slides Neural Networks Back Propagation	Notes (None)	References Yes you should understand backprop (Karpathy) Challenges with backprop (Karpathy Lecture)
Project Adviser Meetings Apr 18	Slides (None)	Notes (None)	References (None)

	Slides	Notes	References
Lecture Apr 24 Video	Slides k-Means (See Video Link) Gaussian Mixture Models General EM Algorithm	Notes (None)	References HTF, 13.2.1 (k-means) Bishop 9.2,9.3 (GMM/EM) An Alternative to EM for GMM [Optional]
Course Review Apr 25	Slides (None)	Notes (None)	References (None)

	Slides	Notes	References
Lecture May 1	Slides Feature Importance (single webpage) Feature Importance (as slides)	Notes (None)	References See references in ipynb Jupyter notebook for slides
Project advisor meetings. May 2	Slides (None)	Notes (None)	References (None)

Assignments

Late Policy: Homeworks are due at 10pm on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the names of any person with whom you discussed the problem—this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
Where homework assignments are divided into sections, please begin each section on a new page.
You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope to get your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYUClasses.

Homework 0

Typesetting your homework

Due: January 1st, 10pm

hw0.pdf hw0.zip

Homework 1

GD, SGD, and Ridge Regression

Due: February 1st, 10pm

hw1.pdf hw1.zip

Homework 2

Lasso Regression

Due: February 13th, 10pm

hw2.pdf hw2.zip

Homework 3

SVM and Sentiment Analysis

Due: February 22nd, 10pm

hw3.pdf hw3.zip

Homework 4

Kernel Methods

Due: March 2nd, 10pm

hw4.pdf hw4.zip

Homework 5

Probabilistic Modeling

Due: April 9th, 10pm

hw5.pdf hw5.zip

Homework 6

Multiclass, Trees, and Gradient Boosting

Due: April 23rd, 10pm

hw6.pdf hw6.zip

Homework 7

Computation Graphs, Backprop, and Neural Networks

Due: May 11th, 10pm

hw7.pdf hw7.zip

Project

Overview

The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so it has some importance even beyond this class. That said, it's better to pick a project that you will be able to go deep with (in terms of trying different methods, feature engineering, error analysis, etc.), than choosing a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

Key Dates

Feb 26 (Mon 10pm): Deadline for choosing project groups
March 2 (Fri 10pm): Email short description (few sentences) of project idea(s) to adviser, along with personal intros
March 7 (Wed, Lab time): First meeting with advisers. Each group will give a 5-minute presentation of their project idea to their assigned project adviser, followed by brief discussion.
March 22 (Thurs 10pm): Project Proposals Due
Apr 18th (Wed, Lab time): Second meeting with advisers
May 2nd (Wed, Lab time): Third meeting with advisers
May 17th or May 18th, depending on adviser: Final Project Reports Due to Advisers

Guidelines for Project Topics

A good project for this class is one that's a real "problem", in the sense that you have something you want to accomplish, and it's not necessarily clear from the beginning the best approach. The techiques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering these topics until later in the semester.

To be clear, the following approaches would be less than ideal:

Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked in this field are the ones who advise against this most strongly.

Project proposal guidelines

The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court case, ...).
Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
How will you evaluate performance? In certain settings, you may want to try a few different performance measures.
Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set. Ideally, you will be able to report the performance of a couple baseline algorithms in your proposal. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
What methods do you plan to try to solve your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.

Project writeup guidelines

The main objective of the project writeup is to explain what you did in a self-contained report. No strict guidelines on the format of the report, but the goal is to make it something you'd be proud to share with a potential employer. Some of the content will resemble your project proposals. Make sure to:

Clearly explain the high-level problem you are trying to solve (e.g. predict movie ratings, predict the outcome of a court ase, ...).
Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the ata (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
How did you evaluate performance and measure success?
What did you use for features, and explain any feature engineering that you did.
What did you do to attempt to improve performance over your baseline algorithms (e.g. error analysis, new features, new parameter tuning,...)
What challenges did you encounter? What insights into your problem did you get?
What would be good next steps to take if you were to continue this work?
If you got ideas from other sources, please cite them.

Some Previous Projects

Some Public Data Sets (just to get you thinking)

People

Instructor

David Rosenberg

dr129@nyu.edu

David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

Section Leader

Ben Jakubowski

Ben is a 2017 NYU Data Science MS graduate. He currently works as a data scientist for the University of Chicago's Crime Lab New York (CLNY), where his portfolio includes several prediction problems that arise in criminal justice and social policy.

Graders

Lisa Ren (Head Grader)

tr1312@nyu.edu

Lisa is a second-year student in the Data Science program at NYU.
Utku Evci

Utku is a second year Courant Computer Science M.Sc. student from Turkey interested in Neural Networks and their energy landscape.
Mi Fang

Mi is a second year student in CS department at Courant.
Sanyam Kapoor

Sanyam is a Masters student in Computer Science at NYU Courant and currently works as a Researcher in Machine Learning at the NYU Center for Data Science.
Nan Wu

Nan is a second year student in the Data Science program at NYU.
Zemin Yu

Zemin is a second year student in the Data Science program at NYU.

Project Advisers

Kurt Miller

Kurt is a researcher at the quantitative hedge fund PDT Partners.
Brian d'Alessandro

Brian is Director of Data Science at Zocdoc, and he was formerly the VP of Data Science at Dstillery. He is also an Adjunct Professor of Data Science at NYU Stern School of Business.
Bonnie Ray

Bonnie is VP Data Science at Pegged Software. Prior to Pegged, she was Director, Cognitive Algorithms, at IBM Research and has also served on the faculty at the New Jersey Institute of Technology.
Daniel L. Chen

Daniel is at the Institute for Advanced Studies in Toulouse and Toulouse School of Economics. He is a former Chair of Law and Economics at ETH Zurich (2012-2015), Duke Assistant Professor of Law, Economics, and Public Policy (2010-2012), and Kauffman Fellow at the University of Chicago Law School (2009-2010).
Vitaly Kuznetsov

Vitaly is a Research Scientist at Google Research, New York.
Elliott Ash

Elliott is a Visiting Scholar at Princeton University's Woodrow Wilson School of Public Affairs and Assistant Professor of Economics at University of Warwick. His research combines methods from applied microeconometrics, natural language processing, and machine learning to provide empirical evidence on the socioeconomic impacts of legal and political institutions.
David Frohardt-Lane

David Frohardt-Lane is a portfolio manager at 3Red Trading, overseeing a quantitative trading team. Previously he worked as a trader at GETCO for 8 years. He is a former professional gambler who has been involved sports analytics for over 15 years.

Machine Learning DS-GA 1003 / CSCI-GA 2567 · Spring 2018 · NYU Center for Data Science

About This Course

Prerequisites

Grading

Important Dates

Resources

Textbooks

Other tutorials and references

Software

Lectures

Week 0

ML Prereqs Jan 1

Slides

Notes

References

Week 1

Lecture Jan 23

Slides

Notes

References

Lab Jan 24

Slides

Notes

References

Week 2

Lecture Jan 30

Slides

Notes

References

Lab Jan 31

Slides

Notes

References

Week 3

Lecture Feb 6

Slides

Notes

References

Lab Feb 7

Slides

Notes

References

Week 4

Lecture Feb 13

Slides

Notes

References

Lab Feb 14

Slides

Notes

References

Week 5

Lecture Feb 20

Slides

Notes

References

Lab Feb 21

Slides

Notes

References

Week 6

Lecture Feb 27

Slides

Notes

References

Lab Feb 28

Slides

Notes

References

Week 7

Midterm Exam Mar 6

Slides

Notes

References

Project Adviser Meetings Mar 7

Slides

Notes

References

Week 8

Lecture Mar 20