### Week 1 Introduction

- Learning algorithms mainly fall into two categories: **supervised learning and unsupervised learning.**
- In supervised learning, every item in the training set comes with the correct answer, and we teach the machine to learn from those answers. In unsupervised learning, we tell the machine only that there is a dataset and let it **find the structure by itself.**
- Unsupervised learning mainly comes in two types: **clustering** and non-clustering, namely finding structure **in a chaotic environment.**

### Week 2 Linear Regression

- We should update theta0 and theta1 **simultaneously** in the gradient descent algorithm.
- The learning rate should be **neither too large nor too small**, so that gradient descent converges in a reasonable time. If it is too large, the cost may fail to decrease or even diverge; if it is too small, convergence is slow.
- Notice that our matrices and vectors are **1-indexed**, while the arrays in some programming languages are **0-indexed**.
- Tricks for performing gradient descent (they make gradient descent run much faster):

1. **Scale the features:** get every feature into approximately the (-1, 1) range.

2. **Mean normalization:** make sure each feature is centered around zero.

   Combining points 1 and 2, the formula is x := (x - avg) / range, where range is max - min.

3. **Debug and choose a proper learning rate:** try a range of values for alpha, plot J(theta) against the number of iterations, and choose the alpha for which J(theta) keeps decreasing at a reasonable speed.

- Learn how to choose features, and sometimes design our own features instead of using the ones already given.

- Besides, we can build new features from powers of an existing one (for example x, x^2, x^3) and so accomplish polynomial regression through linear regression. In that case, don't forget feature scaling, because the ranges of the powers differ widely.
- Comparison between the normal equation and gradient descent:
  - The normal equation needs no iteration, while gradient descent needs many iterations.
  - The normal equation does not need a learning rate alpha, and we don't need to scale the features when using it.
  - However, the normal equation has to compute theta = pinv(X'*X)*X'*y, so it does not perform as well as gradient descent when the number of features is large. In practice, when the number of features is over about 10,000, we usually turn to gradient descent.
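The points in this section can be sketched together in a few lines. This is a minimal illustration, assuming a single feature and made-up house data; the function names are hypothetical and not from the course. It scales the feature, runs gradient descent with a simultaneous update, and compares the result with the one-feature closed-form solution, which plays the role of the normal equation here.

```python
def scale(values):
    """Mean-normalize and range-scale: (x - avg) / (max - min)."""
    avg = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - avg) / spread for v in values]


def gradient_descent(xs, ys, alpha=0.5, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


def closed_form(xs, ys):
    """One-feature least squares, the scalar analogue of the normal equation."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    return mean_y - theta1 * mean_x, theta1


sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]  # made-up house sizes
prices = [200.0, 290.0, 410.0, 500.0, 590.0]      # made-up prices
scaled = scale(sizes)
gd_theta = gradient_descent(scaled, prices)
exact_theta = closed_form(scaled, prices)
```

With the scaled feature, a fairly large alpha converges quickly, and `gd_theta` should agree closely with `exact_theta`. Running the same loop on the raw `sizes` would need a far smaller alpha and many more iterations, which is the practical reason for scaling.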

### Week 3 Logistic Regression

- Actually, we can let the algorithm choose settings such as the learning rate, and we can let it select the features by itself, dropping the features it finds useless. (Model selection can also help choose the value of lambda.)
- **Overfitting**: the problem mainly arises when there are many more features than training examples, in contrast to underfitting, which happens in the opposite situation. Overfitting is also called high variance, in contrast to high bias.
- Avoiding overfitting: **regularization**
  - Small values for the parameters theta0, theta1, ..., thetan mean a simpler hypothesis, because features whose thetas are close to zero contribute almost nothing and are effectively removed. A simpler hypothesis is less prone to overfitting.
  - In regularization, we introduce lambda as the regularization parameter, which controls how strongly large thetas are penalized.
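To make the role of lambda concrete, here is a small sketch of a regularized logistic regression cost. It assumes the usual convention that theta0 is left out of the penalty; the data and parameter values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regularized_cost(theta, examples, labels, lam):
    """Logistic regression cost plus lam / (2m) * sum(theta_j^2) for j >= 1."""
    m = len(examples)
    total = 0.0
    for x, y in zip(examples, labels):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    # Assumption: theta0 (the bias term) is not penalized.
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty


# Each example starts with the constant 1 that multiplies theta0.
examples = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
labels = [0, 0, 1, 1]
theta = [-4.0, 2.0]

plain = regularized_cost(theta, examples, labels, lam=0.0)
penalized = regularized_cost(theta, examples, labels, lam=1.0)
```

The two costs differ only by the penalty term, so a larger lambda makes large thetas more expensive and pushes the minimizer toward smaller parameters.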

### Week 4 Model Representation And Applications

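1. The dimensions of the weight matrices are determined as follows: if layer j has s_j units and layer j+1 has s_(j+1) units, then Theta(j) has dimension s_(j+1) × (s_j + 1), where the extra column belongs to the bias unit.

2. Application: expressing the logical functions "and", "or" and "xnor".

3. When we perform multi-class classification, the output layer has one unit per class, so the hypothesis returns a vector rather than a single number.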
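As an illustration of item 2, the sketch below builds XNOR from three sigmoid units. The weight values are one common choice and should be read as an assumption, not as the only solution; the hidden layer computes "and" and "nor", and the output unit computes "or" of those two.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """One sigmoid unit; weights[0] multiplies the bias input 1."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


# Illustrative weights (assumed values).
AND_WEIGHTS = [-30.0, 20.0, 20.0]
OR_WEIGHTS = [-10.0, 20.0, 20.0]
NOR_WEIGHTS = [10.0, -20.0, -20.0]  # (NOT x1) AND (NOT x2)


def xnor(x1, x2):
    # Hidden layer: Theta(1) is 2 x 3 (two units, two inputs plus bias).
    a1 = neuron(AND_WEIGHTS, [x1, x2])
    a2 = neuron(NOR_WEIGHTS, [x1, x2])
    # Output layer: Theta(2) is 1 x 3.
    return neuron(OR_WEIGHTS, [a1, a2])


truth_table = {(x1, x2): round(xnor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```

The weight shapes match the dimension rule in item 1: two hidden units fed by two inputs plus a bias give a 2 × 3 matrix, and one output unit fed by two hidden units plus a bias gives a 1 × 3 matrix.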

### Week 5 Propagation And Several Skills

1. The main steps of performing back propagation.

2. Unrolling parameters and the reverse process: **The advantage of the matrix representation** is that forward propagation and back propagation are more convenient when the parameters are stored as matrices, and it is easier to take advantage of vectorized implementations. **The advantage of the vector representation**, such as thetaVec or DVec, is that the advanced optimization algorithms tend to assume that all parameters are unrolled into one long vector. We should be able to convert quickly between the two as needed.

       thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
       deltaVector = [ D1(:); D2(:); D3(:) ];
       % get back our matrices
       Theta1 = reshape(thetaVector(1:110), 10, 11);
       Theta2 = reshape(thetaVector(111:220), 10, 11);
       Theta3 = reshape(thetaVector(221:231), 1, 11);

3. Gradient checking: we can approximate the gradient numerically from the slope of J and compare it with the gradient computed by back propagation to make sure that our code is right. Once they agree, we turn off gradient checking, because it is slow, and run the training process.

       epsilon = 1e-4;
       for i = 1:n,
         thetaPlus = theta;
         thetaPlus(i) += epsilon;
         thetaMinus = theta;
         thetaMinus(i) -= epsilon;
         gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
       end;

4. Random initialization: we can't initialize all the theta parameters to zero, because then all hidden units in a layer compute the same function, and after every update the weights feeding into them remain identical to each other, though possibly nonzero. Instead, we initialize each theta to a small random value.

5. Put them together.
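Items 2 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions: matrices are plain lists of rows, `unroll` copies Octave's column-by-column `A(:)` ordering, and `toy_cost` is a hypothetical cost whose hand-derived gradient stands in for back propagation.

```python
def unroll(matrix):
    """Flatten a list-of-rows matrix column by column, like Octave's A(:)."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def reshape(vector, rows, cols):
    """Rebuild a rows x cols matrix from a column-major vector."""
    return [[vector[c * rows + r] for c in range(cols)] for r in range(rows)]


def numerical_gradient(cost, theta, epsilon=1e-4):
    """Two-sided difference approximation of each partial derivative."""
    approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_minus = list(theta)
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        approx.append((cost(theta_plus) - cost(theta_minus)) / (2 * epsilon))
    return approx


def toy_cost(theta_vector):
    """Hypothetical cost: the sum of squares of all parameters."""
    return sum(t * t for t in theta_vector)


theta1 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 x 3, made-up values
theta2 = [[7.0, 8.0, 9.0]]                   # 1 x 3
theta_vector = unroll(theta1) + unroll(theta2)

# Python slices are 0-indexed and exclude the end, unlike Octave's 1:6.
restored1 = reshape(theta_vector[0:6], 2, 3)
restored2 = reshape(theta_vector[6:9], 1, 3)

grad_approx = numerical_gradient(toy_cost, theta_vector)
grad_exact = [2 * t for t in theta_vector]  # analytic gradient of toy_cost
max_difference = max(abs(a - b) for a, b in zip(grad_approx, grad_exact))
```

If `max_difference` is tiny, the analytic gradient is very likely correct. The check calls the cost function twice per parameter, which is why it is switched off before real training.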

Other advice:

Reasonable default: 1 hidden layer. With more than 1 hidden layer, the reasonable default is to have the same number of hidden units in every layer (usually the more units the better).

### Week 6 Evaluation And Improvements

We often need to evaluate our hypothesis to improve our learning algorithm. Usually we can first complete a quick and dirty implementation and plot the learning curves to see where we can improve. Don't rush into options such as collecting a lot more examples.

To get an honest estimate of how well the hypothesis generalizes, we introduce the cross validation set. When we need to choose d, namely the degree of our polynomial features, we fit a theta for each candidate d by minimizing J_train, evaluate each of these thetas with J_cv, and pick the d whose theta gives the lowest J_cv. We then report the generalization error by computing J_test with that theta.

**When we need to determine lambda, the regularization parameter, we can use the same method.**
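The selection procedure above can be sketched for lambda as follows. This is a deliberately tiny illustration: it assumes a one-parameter model h(x) = theta * x with a closed-form regularized fit, and all data values are made up.

```python
def fit_ridge(xs, ys, lam):
    """Minimize sum((theta*x - y)^2) + lam * theta^2 for a single theta."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def squared_error(theta, xs, ys):
    """J without the regularization term: (1 / 2m) * sum((theta*x - y)^2)."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Made-up data, split into training, cross validation and test sets.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
cv_x, cv_y = [1.5, 2.5, 3.5], [3.1, 4.9, 7.1]
test_x, test_y = [0.5, 4.5], [1.0, 9.1]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0]

# Step 1: fit one theta per candidate lambda on the training set.
thetas = {lam: fit_ridge(train_x, train_y, lam) for lam in candidates}
# Step 2: score every theta on the cross validation set.
cv_errors = {lam: squared_error(thetas[lam], cv_x, cv_y) for lam in candidates}
# Step 3: keep the lambda with the lowest J_cv.
best_lam = min(cv_errors, key=cv_errors.get)
# Step 4: report the error on the untouched test set.
test_error = squared_error(thetas[best_lam], test_x, test_y)
```

The test set is used only once, at the end, so `test_error` is not biased by the choice of lambda. Choosing d works the same way, with one fitted theta per candidate degree.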

High bias and high variance: it has to be pointed out that these are two different concepts, so high bias does not imply low variance.

The high variance problem corresponds to the overfitting problem, which arises when lambda is too small. Similarly, the high bias problem corresponds to the underfitting problem, which arises when lambda is too large.

**As 'd' increases or lambda decreases**, J_train keeps decreasing, while J_cv will **first decrease and then increase.**

For m, namely the number of training examples: when the underfitting problem occurs, J_train and J_cv end up very close to each other at a high error, in which case increasing the number of training examples will not help significantly. When the overfitting problem occurs, there is a gap between J_cv and J_train, and in that case increasing 'm' is likely to help.

J_train is usually lower than J_cv, because theta was fit to the training set.

**Error Analysis:** we first implement the quick version, then manually inspect where the misclassifications mainly occur and make changes accordingly. When we try a strategy (for example, stemming software), we also need a numerical metric so that we can compare the results with and without it.

Handling skewed data: in some cases, like diagnosing cancer, accuracy can no longer help us evaluate our hypothesis, because a classifier that always predicts the majority class already scores high. Instead we use precision and recall:

- precision = True positives / (True positives + False positives)
- recall = True positives / (True positives + False negatives)
- F score = 2 * P * R / (P + R), which evaluates the two as a whole.
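A short sketch of these three formulas, using made-up labels where 1 marks the rare positive class. The zero-division guards are an assumption added for the illustration.

```python
def precision_recall_f(actual, predicted):
    """Return (precision, recall, F score) for binary labels 0 / 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Assumption: treat an undefined ratio as 0 instead of raising an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(actual, predicted):
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
always_negative = [0] * len(actual)

model_scores = precision_recall_f(actual, predicted)
lazy_scores = precision_recall_f(actual, always_negative)
lazy_accuracy = accuracy(actual, always_negative)
```

The classifier that always predicts 0 still reaches 70% accuracy on this data, but its recall and F score are 0, which is exactly the failure that accuracy hides on skewed classes.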

Data for machine learning: increasing the number of training examples usually helps when the features contain sufficient information to make the prediction (for example, increasing 'm' may not help when we want to predict the price of a house from its size alone).

### Week 7 SVM

The support vector machines:

We omit the 1/m factor when computing the cost function and replace lambda with C, which weights the data-fit term instead of the regularization term. When C equals 1/lambda, we get the same theta vector.

We introduce a new way of computing the cost function, namely replacing the curves of the logistic cost with straight line segments.

With this optimization objective, we often get a large margin, which is why the SVM is also called a large margin classifier. Because we are minimizing the norm of theta, the projections p_i of the examples onto theta must be large in magnitude to make sure that theta' * x >= 1 or <= -1 (that is, p_i * ||theta|| >= 1 or <= -1), which leads to the large margin.

Kernel function: the kernel function is also called the similarity function, which we use to build new features. We often use the Gaussian kernel and the linear kernel (namely, no kernel). Because we use every training example as a landmark in the similarity function, the number of new features n is equal to m.

Gaussian kernel: as sigma increases, the features vary more smoothly (higher bias, lower variance), which corresponds to decreasing C in the earlier cases.
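A minimal sketch of the Gaussian similarity function and the feature mapping it produces, assuming the usual form exp(-||x - l||^2 / (2 * sigma^2)) and made-up training points that double as landmarks.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """Map x to one similarity value per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]


# Made-up training examples; each one is also used as a landmark.
training_examples = [[1.0, 2.0], [3.0, 1.0], [5.0, 4.0]]
features = kernel_features(training_examples[0], training_examples, sigma=1.0)

# The same pair of points looks less similar under a narrow kernel.
narrow = gaussian_kernel([1.0, 2.0], [3.0, 1.0], sigma=0.5)
wide = gaussian_kernel([1.0, 2.0], [3.0, 1.0], sigma=3.0)
```

`features` has one entry per training example, which is why n equals m after the mapping. A larger sigma makes the similarity fall off more slowly with distance, giving the smoother features mentioned above.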

When performing multi-class classification, we use the one-vs-all method: we train K SVMs to get theta1, theta2, ..., thetaK, and pick the class i with the largest theta_i' * x.

Logistic Regression Vs SVMs:

### Week 8 Unsupervised Learning (K-means and PCA algorithm)

- We are given data that has no labels associated with it, and we ask the algorithm to find some structure in the data for us. This week we mainly talk about clustering algorithms. Possible applications are market segmentation, social network analysis and understanding galaxy formation.

#### K-means algorithm

- K-means is the most widely used clustering algorithm.
- First we initialize K points (for example, two) called cluster centroids, and then we repeatedly assign the points to the centroids and move the centroids.
- If there is a centroid that no example is assigned to, we **usually eliminate that centroid**, which leaves N < K (here N is the number of clusters we finally get).
- There are mainly two steps in the K-means algorithm. One is the update of c_i, which is set to the index of the cluster centroid closest to x_i. The other is the computation of mu_k, which is the average of the x assigned to cluster k. Notice that c_i takes the same values as k (the index of the cluster).
- To avoid bad local optima when the number of clusters is not large, we can repeat the random initialization **(a way to initialize is to select K examples and set mu_k = x_i for the K examples selected)** and keep the clustering that minimizes J(c_1, ..., c_m, mu_1, ..., mu_K).
- To decide the number of clusters, we may use the elbow method, but usually we just consider the downstream effects. On the curve of J against K, **if J increases as K increases, then K-means was probably stuck in a local optimum** and we need to run it again with a new initialization.
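The two steps and the repeated random initialization can be sketched as follows. This is a minimal version on made-up 2D points, with hypothetical function names and a fixed number of iterations in place of a convergence test.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Step 1: c_i is the index of the centroid closest to point i."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def move(points, assignments, count):
    """Step 2: each centroid becomes the mean of the points assigned to it."""
    centroids = []
    for j in range(count):
        members = [p for p, c in zip(points, assignments) if c == j]
        if members:  # a centroid with no assigned points is eliminated
            centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids


def distortion(points, assignments, centroids):
    """J: average squared distance between each point and its centroid."""
    total = sum(squared_distance(p, centroids[c]) for p, c in zip(points, assignments))
    return total / len(points)


def k_means(points, k, seed, iterations=20):
    rng = random.Random(seed)
    # Initialize the centroids to k distinct training examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignments = assign(points, centroids)
        centroids = move(points, assignments, len(centroids))
    assignments = assign(points, centroids)
    return assignments, centroids, distortion(points, assignments, centroids)


points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
# Repeat the random initialization and keep the run with the lowest J.
runs = [k_means(points, k=2, seed=s) for s in range(5)]
assignments, centroids, cost = min(runs, key=lambda run: run[2])
```

On data this well separated every run is expected to find the same two groups, so the repetition matters more on harder data, where different initializations can end in different local optima.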

#### About PCA

- PCA is often used to reduce the dimension of input data. PCA is different from linear regression, though the two may seem alike when we reduce data from 2D to 1D: PCA minimizes the projection distances to the line and has no y to predict.
- We use the U matrix to compute z from x. Notice that the dimension of x is **n*1, of z is k*1, and of U is n*k** (the first k columns of U), so z = U' * x.
- Before using PCA, we need to consider performing feature scaling and mean normalization.
- The principle for choosing k is to keep 95% or 99% of the variance. We can check this with the S matrix from [U S V] = svd(sigma): the retained variance is the sum of the first k diagonal entries of S divided by the sum of all of them. That is the same as keeping the average squared error between x and x_approx small relative to the total variation in the data.
- PCA is a very powerful unsupervised learning algorithm. It is often used to compress data to **reduce memory and disk usage, to visualize data (with k = 2 or k = 3), and to reduce the number of features.**
- We can also use it in supervised learning: compute the mapping from x_train only, generate z_train and fit the hypothesis, then map x_cv or x_test to z_cv or z_test with the same mapping, apply the hypothesis and predict. But remember that PCA does not consider y and may drop some valuable information, so it is not a good choice to use PCA to prevent overfitting.
- It's better to first train the algorithm on the original data and see how it performs **rather than rush to** the PCA algorithm.
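The PCA steps above can be sketched for the 2D to 1D case. Note the swapped-in technique: instead of `svd`, this sketch finds the leading direction with power iteration, which is enough for k = 1 and keeps the example free of external libraries. The data is made up, and the start vector is assumed not to be orthogonal to the leading direction.

```python
import math


def mean_normalize(points):
    n = len(points[0])
    means = [sum(p[j] for p in points) / len(points) for j in range(n)]
    return [[p[j] - means[j] for j in range(n)] for p in points]


def covariance(points):
    """Sigma = (1 / m) * X' * X for mean-normalized rows."""
    m, n = len(points), len(points[0])
    return [[sum(p[i] * p[j] for p in points) / m for j in range(n)] for i in range(n)]


def multiply(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_component(sigma, steps=200):
    """Power iteration: leading eigenvector of sigma and its eigenvalue."""
    u = [1.0] * len(sigma)  # assumed not orthogonal to the leading direction
    for _ in range(steps):
        v = multiply(sigma, u)
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    eigenvalue = sum(a * b for a, b in zip(u, multiply(sigma, u)))
    return u, eigenvalue


points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]
normalized = mean_normalize(points)
sigma = covariance(normalized)
u, eigenvalue = top_component(sigma)

# z = u' * x for each example (k = 1), and the reconstruction x_approx = z * u.
z = [sum(a * b for a, b in zip(u, x)) for x in normalized]
x_approx = [[zi * c for c in u] for zi in z]

# Fraction of variance retained: leading eigenvalue over the trace of sigma.
retained = eigenvalue / sum(sigma[i][i] for i in range(len(sigma)))
```

Because the made-up points lie almost on a line, a single component retains nearly all of the variance here. For k > 1 a real implementation would use an SVD or eigendecomposition routine rather than this one-vector shortcut.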