2. We have mainly focused on squared loss, but there are other interesting losses in machine learning. Consider the following loss function which we denote by φ(z) = max(0, −z).

1. Consider running the Perceptron algorithm on a training set S arranged in a certain order. Now suppose we run it with the same initial weights and on the same training set but in a different order, S 0 . Does Perceptron make the same number of mistakes? Does it end up with the same final weights? If so, prove it. If not, give a counterexample, i.e. an S and S 0 where order matters.

2. We have mainly focused on squared loss, but there are other interesting losses in machine learning. Consider the following loss function which we denote by φ(z) = max(0, −z). Let S be a training set (x 1 , y1 ), . . . ,(x m, ym) where each x i ∈ R n and y i ∈ {−1, 1}. Consider running stochastic gradient descent (SGD) to find a weight vector w that minimizes 1 m Pm i=1 φ(y i · w T x i ). Explain the explicit relationship between this algorithm and the Perceptron algorithm. Recall that for SGD, the update rule when the i th example is picked at random is wnew = wold − η∇φ  y iw T x i .

3. Here we will give an illustrative example of a weak learner for a simple concept class. Let the domain be the real line, R, and let C refer to the concept class of “3-piece classifiers”, which are functions of the following form: for θ1 < θ2 and b ∈ {−1, 1}, hθ1,θ2,b(x) is b if x ∈ [θ1, θ2] and −b otherwise. In other words, they take a certain Boolean value inside a certain interval and the opposite value everywhere else. For example, h10,20,1(x) would be +1 on [10, 20], and −1 everywhere else. Let H refer to the simpler class of “decision stumps”, i.e. functions hθ,b such that h(x) is b for all x ≤ θ and −b otherwise.

(a) Show formally that for any distribution on R (assume finite support, for simplicity; i.e., assume the distribution is bounded within [−B, B] for some large B) and any unknown labeling function c ∈ C that is a 3-piece classifier, there exists a decision stump h ∈ H that has error at most 1/3, i.e. P[h(x) 6= c(x)] ≤ 1/3. (b) Describe a simple, efficient procedure for finding a decision stump that minimizes error with respect to a finite training set of size m. Such a procedure is called an empirical risk minimizer (ERM).

(c) Give a short intuitive explanation for why we should expect that we can easily pick m sufficiently large that the training error is a good approximation of the true error, i.e. why we can ensure generalization. (Your answer should relate to what we have gained in going from requiring a learner for C to requiring a learner for H.) This lets us conclude that we can weakly learn C using H.