CS229 Lecture notes
Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet²)    Price (1000$s)
    2104                   400
    1600                   330
    2400                   369
    1416                   232
    3000                   540
    ...                    ...

We can plot this data:

[Figure: "housing prices", a scatter plot of price (in $1000) against square feet.]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
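To keep the running example concrete, here is a minimal sketch (my own illustration in Python with NumPy and matplotlib, which the notes do not prescribe) that stores the five listed examples and reproduces the scatter plot:

    import numpy as np
    import matplotlib.pyplot as plt

    # Living areas (feet^2) and prices ($1000s) from the table above.
    living_area = np.array([2104, 1600, 2400, 1416, 3000])
    price = np.array([400, 330, 369, 232, 540])

    plt.scatter(living_area, price, marker="x")
    plt.title("housing prices")
    plt.xlabel("square feet")
    plt.ylabel("price (in $1000)")
    plt.show()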
To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ the space of output values. In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : \mathcal{X} \mapsto \mathcal{Y}$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis.
Seen pictorially, the process is therefore like this:

[Figure: the training set is fed to a learning algorithm, which outputs a hypothesis $h$; $h$ takes $x$ (the living area of a house) as input and outputs the predicted $y$ (the predicted price of the house).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Part I
Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset
in which we also know the number of bedrooms in each house:

    Living area (feet²)    #bedrooms    Price (1000$s)
    2104                   3            400
    1600                   3            330
    2400                   3            369
    1416                   2            232
    3000                   4            540
    ...                    ...          ...

Here, the x's are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses $h$ in a computer. As an initial choice, let's say we decide to approximate $y$ as a linear function of $x$:

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$, and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

    h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here $n$ is the number of input variables (not counting $x_0$).
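As a concrete rendering of this representation (my own NumPy sketch, not code from the notes; the parameter values are made up for illustration), the hypothesis is just an inner product once the intercept feature $x_0 = 1$ is prepended:

    import numpy as np

    def h(theta, x):
        """Linear hypothesis h_theta(x) = theta^T x.

        theta has length n+1; x has length n, and x_0 = 1 is prepended here.
        """
        x = np.concatenate(([1.0], x))  # intercept convention x_0 = 1
        return theta @ x

    # Hypothetical parameters: $50k base, $0.1k per square foot, $20k per bedroom.
    theta = np.array([50.0, 0.1, 20.0])
    print(h(theta, np.array([2104.0, 3.0])))  # predicted price in $1000s: 320.4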
Now, given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method seems to be to make $h(x)$ close to $y$, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. We define the cost function:

    J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 .

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.
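In code, the cost function is a one-liner over the whole training set (again a sketch of my own; X is assumed to be the m x (n+1) design matrix whose first column is all ones):

    import numpy as np

    def cost(theta, X, y):
        """J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
        residuals = X @ theta - y
        return 0.5 * residuals @ residuals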
1 LMS algorithm

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some "initial guess" for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let's consider the gradient descent algorithm, which starts with some initial $\theta$, and repeatedly performs the update:

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) .

(This update is simultaneously performed for all values of $j = 0, \ldots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.
In order to implement this algorithm, we have to work out what is the partial derivative term on the right hand side. Let's first work it out for the case where we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:

    \frac{\partial}{\partial \theta_j} J(\theta)
      = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2
      = 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right)
      = \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)
      = \left( h_\theta(x) - y \right) x_j

For a single training example, this gives the update rule:¹

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} .

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule.
This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of $y^{(i)}$, then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).
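One such single-example step is easy to write down in code (my own sketch, continuing the NumPy examples above; the update is vectorized, so every $\theta_j$ is adjusted at once):

    def lms_update(theta, x_i, y_i, alpha):
        """One LMS (Widrow-Hoff) step on a single training example.

        theta and x_i are NumPy vectors of length n+1 (x_i includes x_0 = 1).
        """
        error = y_i - theta @ x_i           # y^(i) - h_theta(x^(i))
        return theta + alpha * error * x_i  # update every theta_j at once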
We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

    Repeat until convergence {
        \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}    (for every j)
    }

The reader can easily verify that the quantity in the summation in the update rule above is just $\partial J(\theta) / \partial \theta_j$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

¹We use the notation "a := b" to denote an operation (in a computer program) in which we set the value of a variable a to be equal to the value of b. In other words, this operation overwrites a with the value of b. In contrast, we will write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.
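Putting the pieces together, a batch gradient descent loop for the housing data might look like the sketch below (my own illustration; the step size, tolerance-based stopping rule, and iteration cap are assumptions, since the notes only say "repeat until convergence"):

    import numpy as np

    def batch_gradient_descent(X, y, alpha=5e-8, tol=1e-9, max_iters=100_000):
        """Minimize J(theta) = (1/2) * sum_i (theta^T x^(i) - y^(i))^2.

        X: (m, n+1) design matrix with a leading column of ones (x_0 = 1).
        y: (m,) target vector.
        """
        theta = np.zeros(X.shape[1])        # "initial guess" for theta
        for _ in range(max_iters):
            error = y - X @ theta           # y^(i) - h_theta(x^(i)), all i at once
            step = alpha * (X.T @ error)    # alpha * sum_i error_i * x^(i)
            theta += step
            if np.linalg.norm(step) < tol:  # stop once the updates become tiny
                break
        return theta

    # Usage on the bedrooms table above; alpha is tiny because the raw,
    # unscaled features (living areas in the thousands) make the gradients large.
    X = np.array([[1, 2104, 3], [1, 1600, 3], [1, 2400, 3],
                  [1, 1416, 2], [1, 3000, 4]], dtype=float)
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
    print(batch_gradient_descent(X, y))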