Machine Learning and Computational Statistics DS-GA 1003 · Spring 2016 · NYU Center for Data Science

Instructor David Rosenberg
Lecture Wednesdays 7:10pm–9pm, Warren Weaver Hall 109
Lab Thursdays 7:10pm–8pm, Warren Weaver Hall 109
Office Hours Instructor: Thursdays 8pm–9pm, Warren Weaver Hall 109
TA: Wednesdays 2pm–3pm, Warren Weaver Hall 605
Graders: Tuesdays 2pm–4pm in the CDS common area

      About This Course

      This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build.

      This course was designed as part of the core curriculum for the Center for Data Science's Master's degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. Note that this class is intended as a continuation of DS-GA-1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered here (e.g. data cleaning, cross-validation, and sampling bias).

      This term we will be using Piazza for class discussion. It is designed to get you help quickly and efficiently from classmates, the TA, graders, and the instructor. Rather than emailing questions to the teaching staff, you are encouraged to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

      Other information:

      Prerequisites

      Grading

      Homework (40%) + One-Hour Test (15%) + Two-Hour Test (25%) + Project (20%)

      Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions when deciding whether to boost a borderline grade.
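
      To make the weighting concrete, here is a small Python sketch of how the four components combine. The component scores below are made up for illustration; any optional-problem or participation boost described above is applied separately, at the instructor's discretion.

      # Hypothetical component scores, each on a 0-100 scale.
      weights = {"homework": 0.40, "one_hour_test": 0.15,
                 "two_hour_test": 0.25, "project": 0.20}
      scores = {"homework": 90.0, "one_hour_test": 80.0,
                "two_hour_test": 80.0, "project": 90.0}

      # Final score is the weighted average of the four components.
      final_score = sum(weights[k] * scores[k] for k in weights)
      print(f"Final course score: {final_score:.1f}/100")  # 86.0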

      Important Dates

      Resources

      Textbooks

      The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
      This will be our principal textbook for the first part of the course. It's written by three statisticians who invented many of the techniques discussed. Despite its popularity and the pretty pictures, this is not an easy book. There's an easier version of this book that covers many of the same topics, described below. (Available for free as a PDF.)
      An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
      This book is written by two of the same authors as The Elements of Statistical Learning. It's much less intense mathematically, and it's good for a lighter introduction to the topics. (Available for free as a PDF.)
      Bayesian Reasoning and Machine Learning (David Barber)
      We'll use this as a reference for probabilistic modeling, including Bayesian methods, and Bayesian networks. (Available for free as a PDF.)
      Pattern Recognition and Machine Learning (Christopher Bishop)
      This book is another very nice reference for probabilistic models and beyond. It's highly recommended.
      Machine Learning: A Probabilistic Perspective (Kevin P. Murphy)
      This book covers an unusually broad set of topics, including recent advances in the field. As such, it's a great reference to have, particularly if you continue your study of data science beyond this course. That said, it was the required textbook for this course in 2015, and many students found it a bit overwhelming. It's really intended as a comprehensive, PhD-level textbook.
      Convex Optimization (Boyd and Vandenberghe)
      This book was an instant hit in the machine learning community when it was published in 2004. We will be making light use of this book, mostly for its coverage of Lagrangians and duality. However, it's a good book to get familiar with, as it's very well written, and it covers a lot of techniques used in more advanced machine learning literature. (Available for free as a PDF.)

      Other tutorials and references

      (If you find additional references that you recommend, please share them on Piazza and we can add them here.)

      Software

      Lectures

      Week 1

      Lecture Jan 27 Video

      • Course mechanics
      • Statistical learning theory framework
      • Gradient and stochastic gradient descent

      Lab Jan 28 Video

      • Matrix differentiation (Levent Sagun)

      Week 2

      Lecture Feb 3

      • Excess Risk Decomposition
      • L1/L2 regularization
      • Optimization methods for Lasso

      Lab Feb 4

      • Elastic Net
      • Directional Derivatives and Optima

      Week 3

      Lecture Feb 10

      • Loss Functions
      • Convex Optimization
      • SVM

      Lab Feb 11

      • Projections
      • SVM (Geometric Motivation)

      References

      • HTF 3.2.0, 4.5
      • HTF pp. 417-419

      Week 4

      Lecture Feb 17

      • SGD and GD Revisited
      • Subgradient descent
      • Features

      Lab Feb 18

      • Kernel Methods

      References

      • HTF 12.3.1

      Week 5

      Lecture Feb 24

      • Hilbert Spaces and Projections
      • Kernel Methods

      Lab Feb 25

      • Regression Trees

      References

      • HTF 9.2

      Week 6

      Lecture Mar 2

      • Classification Trees
      • Review for One-Hour Test

      References

      • HTF 9.2

      One-Hour Test Mar 3

      Week 7

      Lecture Mar 9

      • Bootstrap and Bagging
      • Random Forests
      • Boosting

      References

      • HTF 8.7
      • HTF Ch. 15
      • HTF Ch. 10

      Project Adviser Meetings Mar 10

      Week 8

      Lecture Mar 23

      • Boosting
      • Gradient Boosting

      Lab Mar 24

      • Variations on Gradient Boosting

      Week 9

      Lecture Mar 30

      • Multiclass Classification

      Lab Mar 31

      • Midterm Exam Recap

      Week 10

      Lecture Apr 6 Video

      • Conditional Probability Models

      Lab Apr 7 Video

      • Test Review

      Week 11

      Two-Hour Test Apr 13

      Project Adviser Meetings Apr 14

      Week 12

      Lecture Apr 20 Video

      • Conditional Independence
      • Bayesian Networks
      • Naive Bayes
      • Bayesian Methods

      Lab Apr 21 Video

      • Second Test Recap

      Week 13

      Lecture Apr 27

      • Bayesian Regression
      • k-means Clustering
      • Gaussian Mixture Models

      Lab Apr 28

      • EM Algorithm

      Week 14

      Lecture May 4

      • EM Algorithm (continued)
      • Neural Networks

      Project Adviser Meetings May 5

      Week 15

      Poster Session (in CDS, 6-8pm) May 11

      Assignments

      Homework Submission: Homework should be submitted through NYU Classes.

      Late Policy: Homeworks are due at 6pm on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.
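
      As a concrete reading of the late policy, here is a minimal Python sketch. The function and the example scores are hypothetical, not part of any course infrastructure, and the assumption that submissions more than 48 hours late receive no credit is one reading of "will still be accepted for 48 hours".

      from datetime import datetime, timedelta

      def adjusted_score(raw_score, due, submitted):
          # Full credit through the deadline.
          if submitted <= due:
              return raw_score
          # Accepted up to 48 hours late, with a 20% penalty.
          if submitted <= due + timedelta(hours=48):
              return raw_score * 0.80
          # Assumption: not accepted after the 48-hour window.
          return 0.0

      # Example: homework due February 5th at 6pm, submitted the next day at noon.
      due = datetime(2016, 2, 5, 18, 0)
      print(adjusted_score(90.0, due, datetime(2016, 2, 6, 12, 0)))  # 72.0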

      Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the names of any person with whom you discussed the problem—this will not affect your grade.

      Homework 1
      Ridge regression and SGD
      Due: February 5th, 6pm

      Homework 2
      Lasso regression
      Due: February 16th, 6pm

      Homework 3
      SVM and Sentiment Analysis
      Due: February 29th, 6pm

      Homework 4
      Linear Algebra, Kernels, Duality, and Trees
      Due: March 22nd, 6pm

      Homework 5
      Trees and Ensemble Methods
      Due: April 4th, 6pm

      Homework 6
      Multiclass Hinge Loss and Multiclass SVM
      Due: April 11th, 6pm

      Homework 7
      Bayesian Methods and the Beta/Binomial Model
      Due: May 10th, 6pm

      Project

      Overview

      The project is your opportunity for in-depth engagement with a data science problem. In job interviews, it's often your course projects that you end up discussing, so the project has some importance even beyond this class. That said, it's better to pick a project that you can go deep with (trying different methods, feature engineering, error analysis, etc.) than to choose a very ambitious project that requires so much setup that you will only have time to try one or two approaches.

      Key Dates

      Guidelines for Project Topics

      A good project for this class is a real "problem", in the sense that you have something you want to accomplish, and it's not clear from the outset what the best approach will be. The techniques used should be relevant to our class, so most likely you will be building a prediction system. A probabilistic model would also be acceptable, though we will not be covering those topics until later in the semester.

      To be clear, the following approaches would be less than ideal:

      1. Finding an interesting ML algorithm, implementing it, and seeing how it works on some data. This is not appropriate because I want your choice of methods to be driven by the problem you are trying to solve, and not the other way around.
      2. Choosing a well-known problem (e.g. MNIST digit classification or the Netflix problem) and trying out some of our ML methods on it. This is better than the previous example, but with a very well-established dataset, a lot of the most important and challenging parts of real-world data science are left out, including defining the problem, defining the success metric, and finding the right way to encode the data.
      3. Choosing a problem related to predicting stock prices. Historically, these projects are the most troubled. Interestingly, our project advisers who have worked at hedge funds are the ones who advise against this most strongly.

      Project proposal guidelines

      The project proposal should be roughly 2 pages, though it can be longer if you want to include figures or sample data that will be helpful to your presentation. Your proposal should do the following:

      1. Clearly explain the high-level problem you are trying to solve. (e.g. Predict movie ratings, predict the outcome of a court case, find a low-dimensional characterization of input examples.)
      2. Identify the data set or data sets that you will be using. You should give a clear description of the characteristics of the data (how many examples, what kinds of features do we have for each example, are there issues with missing data or bad data, etc.).
      3. Explain how you will evaluate performance. In certain settings, you may want to try a few different performance measures.
      4. Identify a few "baseline algorithms". These are simple algorithms for solving the problem, such as always predicting the majority class for a classification problem, using a small set of decision rules designed by hand, or using a ridge regression model on a basic feature set (see the sketch after this list). Ideally, you will be able to report the performance of a couple of baseline algorithms in your proposal, though this is not necessary. The goal will be to beat the baseline, so if the baseline is already quite high, you will have a challenge.
      5. Describe the methods you plan to try to solve your problem, along with a rough timeline. Methods include data preprocessing, feature generation, and the ML models you'll be trying. Once you start your investigation, it's best to use an iterative approach, where the method you choose next is based on an understanding of the results of the previous step.
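
      To illustrate the simplest baseline mentioned in item 4, here is a minimal majority-class classifier in Python. The class and the toy data are hypothetical; scikit-learn's DummyClassifier(strategy="most_frequent") provides the same behavior off the shelf.

      from collections import Counter

      class MajorityClassBaseline:
          # Predict the most common label seen during training,
          # ignoring the features entirely.

          def fit(self, X, y):
              self.majority_ = Counter(y).most_common(1)[0][0]
              return self

          def predict(self, X):
              return [self.majority_ for _ in X]

      # Toy usage with made-up labels; accuracy equals the majority-class rate.
      y_train = ["spam", "ham", "ham", "ham", "spam"]
      baseline = MajorityClassBaseline().fit([None] * 5, y_train)
      print(baseline.predict([None, None]))  # ['ham', 'ham']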

      Some Public Data Sets (just to get you thinking)

      People

      Instructor

      David Rosenberg

      David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

      Teaching Assistant

      Levent Sagun

      Levent is a PhD student at the Courant Institute of Mathematical Sciences.

      Graders

      Project Advisers