Week 1 Introduction
- Learning algorithms mainly fall into two categories, namely supervised
learning and unsupervised learning.
- In supervised learning, every item in the training set comes with a correct
answer, and we teach the machine to learn from those answers. In unsupervised
learning, we only tell the machine that there is a dataset and let it find the
structure by itself.
- Unsupervised learning mainly comes in two types: one is clustering and the
other is non-clustering, namely finding structure in a
Week 2 Linear Regression
- We should update theta0 and theta1 simultaneously in gradient descent.
The learning rate should be neither too large nor too small, so that gradient
descent converges in a reasonable time.
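The simultaneous update can be sketched in Python as follows. This is a minimal sketch for one feature; the function name, the default alpha and the iteration count are assumptions made for illustration, not values from the course.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors computed with the CURRENT theta0 and theta1.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before
        # either parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```

Updating theta0 first and then using the new theta0 to compute the gradient for theta1 would be a different (and unintended) algorithm.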
- Notice that our matrices and vectors are 1-indexed while the arrays in some
programming languages are 0-indexed.
- Tricks for performing gradient descent (they make gradient descent run much
faster):
1. Feature scaling: get every feature into approximately the (-1, 1) range.
2. Mean normalization: make sure every feature is centered around zero.
Combining points 1 and 2, the formula is x = (x - avg) / range.
3. Debug and choose a proper learning rate: try a range of alpha values, plot
J(theta) against the number of iterations, and choose the alpha for which
J(theta) keeps decreasing at a reasonable speed.
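Points 1 and 2 together can be sketched as one small Python function. The name mean_normalize and the handling of a constant feature are assumptions for illustration.

```python
def mean_normalize(values):
    """Return (v - avg) / range for every value of one feature."""
    avg = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        # A constant feature carries no information; map it to zeros
        # (an assumption of this sketch) instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - avg) / value_range for v in values]
```

For example, house sizes of 1000, 2000 and 3000 become -0.5, 0.0 and 0.5. The same avg and range must be reused later when scaling new inputs for prediction.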
- Learn how to choose features, and sometimes design our own features instead
of using only the features already given.
Besides, we can use powers of a feature as new features and so accomplish
polynomial regression through linear regression. In that case, don't forget
feature scaling, because the powers have very different ranges.
- Comparison between the normal equation and gradient descent:
The normal equation needs no iteration, while gradient descent needs many
iterations. The normal equation does not need a learning rate alpha, and we
needn't scale the features when using it.
However, since the normal equation has to compute theta = pinv(X'*X)*X'*y, it
does not perform as well as gradient descent when the number of features is
large. In practice, when the number of features is over 10000, we usually turn
to gradient descent.
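For a single feature, the normal equation reduces to a 2x2 system that can be solved by hand, which makes a small Python illustration possible without a matrix library. This is only a sketch under that one-feature assumption; the function name is hypothetical and it uses the explicit 2x2 inverse rather than pinv.

```python
def normal_equation_1d(xs, ys):
    """Solve (X'X) theta = X'y for h(x) = theta0 + theta1 * x.

    With a column of ones added to X:
      X'X = [[m, sum(x)], [sum(x), sum(x^2)]],  X'y = [sum(y), sum(x*y)]
    """
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular (all x values are identical)")
    # Explicit inverse of the symmetric 2x2 matrix applied to X'y.
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1
```

No learning rate and no iteration are involved, which matches the comparison above. The cost of inverting X'X grows quickly with the number of features, which is why gradient descent wins for very wide data.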
Week 3 Logistic Regression
- We can let the algorithm learn settings such as the learning rate, and we
can let it select the features by itself, dropping the features which it
finds useless. (Model selection can also help to select the value of lambda.)
- Overfitting: the problem mainly arises when there are many more features than training examples, whereas underfitting tends to happen in the opposite situation. Overfitting is also called high variance, and underfitting is also called high bias.
- Avoiding overfitting: regularization
Small values for the parameters theta0, theta1, ..., thetan mean a simpler
hypothesis, because features with very small thetas contribute almost nothing
and are effectively removed, which makes the hypothesis less prone to
overfitting.
In regularization, we introduce lambda as the regularization parameter.
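As an illustration of where lambda enters, here is a minimal Python sketch of the regularized logistic regression cost. It assumes the usual convention that the bias parameter theta0 is left out of the penalty; the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regularized_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum of theta_j^2 for j >= 1.

    Each row of X starts with the constant 1 that multiplies theta0.
    """
    m = len(y)
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, row)))
        total += -label * math.log(h) - (1 - label) * math.log(1 - h)
    # Assumed convention: theta[0] (the bias) is not penalized.
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty
```

A larger lambda makes large thetas more expensive, so minimizing this cost pushes the parameters towards small values.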
Week 4 Model Representation And Applications
1. The dimensions of these matrices of weights are determined as follows:
2. Application: the expression of "and", "or", "xnor"
3. When we perform multi-class classification, the output goes as follows:
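For the logic-gate application in item 2, a small Python sketch may help. The weights below are the commonly used illustrative values (the first weight in each list is the bias weight); they are one possible choice, not the only one, and the function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """One sigmoid unit; weights[0] multiplies the constant bias input 1."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


AND_W = [-30, 20, 20]    # close to 1 only when both inputs are 1
OR_W = [-10, 20, 20]     # close to 1 when at least one input is 1
NOR_W = [10, -20, -20]   # (NOT x1) AND (NOT x2)


def xnor(x1, x2):
    # Hidden layer: one unit computes AND, the other computes NOR.
    a1 = neuron(AND_W, [x1, x2])
    a2 = neuron(NOR_W, [x1, x2])
    # Output layer: OR of the two hidden units gives XNOR.
    return neuron(OR_W, [a1, a2])
```

XNOR cannot be computed by a single unit, so this is a small example of a hidden layer building a more complex function out of simpler ones.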
Week 5 Propagation And Several Skills
1. The main steps of performing back propagation
2. Unrolling parameters and the reverse process: storing the parameters as matrices is more convenient for forward propagation and back propagation, and makes it easier to take advantage of vectorized implementations. Storing them as vectors such as thetaVec or DVec is needed for the advanced optimization algorithms, which tend to assume that all parameters are unrolled into one long vector. So we convert between the two representations as needed:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];
% get back our matrices
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
3. Gradient checking: we can approximate each partial derivative numerically from the slope of J, compare that approximation with the gradient computed by back propagation, and so make sure that our code is right. Then we turn off gradient checking and perform the training process, because the numerical approximation is very slow.
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;
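The same two-sided check can be written in Python. This is a minimal sketch; the function name numerical_gradient is an assumption, and J stands for any cost function that takes a list of parameters.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate each partial derivative of J at theta by a central difference."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad
```

In use, the result is compared entry by entry with the gradient from back propagation; the two should agree to several decimal places. J is evaluated twice per parameter, which is why the check is switched off before real training.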
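4. Random initialization: we can't initialize all the theta parameters to zero, because then all hidden units in a layer compute the same value and receive the same update, so the weights going from each input to the different hidden units stay identical to each other after every step, even though they may become nonzero. Initializing each parameter to a small random value breaks this symmetry.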
5. Putting them together
Reasonable default: 1 hidden layer; if there is more than 1 hidden layer, a reasonable default is to have the same number of hidden units in each layer (usually the more the better).
Week 6 Evaluation And Improvements
We often need to evaluate our hypothesis to improve our learning algorithm. Usually we first complete a quick and dirty implementation and plot the learning curves to see where we can improve. Don't rush into options like collecting a lot more examples.
To get an honest estimate of our cost, we introduce the cross validation set. When we need to choose d, namely the degree of the polynomial features, we get a different theta for each d by minimizing J_train, evaluate each of those thetas with J_cv, and pick the d whose theta gives the smallest J_cv. Then we estimate the generalization error by computing J_test with the theta we picked.
When we need to determine lambda, the regularization parameter, we can use the same method.
High bias and high variance: it has to be pointed out that these are two different concepts; in particular, high bias doesn't imply low variance.
The high variance problem corresponds to the overfitting problem, which arises when lambda is small. Similarly, the high bias problem corresponds to the underfitting problem, which arises when lambda is large.
As d increases or lambda decreases, J_train keeps decreasing, while J_cv first decreases and then increases.
For m, namely the number of training examples: when the underfitting problem occurs, J_train and J_cv finally come very close to each other, in which case increasing the number of training examples will not help significantly. When the overfitting problem occurs, there is a gap between J_cv and J_train, and in that case increasing m will help.
J_train is typically smaller than J_cv in either case.
Error analysis: we first implement the quick version, then manually look for the kinds of examples where misclassifications mainly occur and make changes accordingly. When we try out some strategy (for example, stemming software), we also need a single numerical metric to compare the results with and without it.
Handling skewed data: in some cases, like diagnosing cancer, accuracy can no longer help us to evaluate our hypothesis. Instead we use precision and recall. Precision is defined by precision = True positives / (True positives + False positives), recall is defined by recall = True positives / (True positives + False negatives), and the F score is defined by F score = 2 * P * R / (P + R), which evaluates the two as a whole.
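The three formulas can be sketched in Python as below. The function name and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and the F score from confusion counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

A classifier that always predicts the majority class on skewed data has no true positives, so its F score is 0 even though its accuracy looks high.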
Data for machine learning: increasing the number of training examples usually helps when the features carry sufficient information to make the prediction (for example, increasing m may not help when we want to predict the price of a house from the size of the house alone).
Week 7 SVM
The support vector machines:
We omit the 1/m factor when computing the cost function and replace lambda with C; when C equals 1/lambda, we get the same theta vector.
We introduce a new way of computing the cost function, namely replacing the curves of the logistic cost with straight line segments.
With this optimization objective we usually get a large margin, which is why the SVM is also called a large margin classifier. Because we want to minimize the norm of theta, we need a larger p_i (the projection of x_i onto theta) to make sure that theta' * x_i = p_i * ||theta|| is >= 1 or <= -1, which leads to the large margin.
Kernel function: the kernel function is also called the similarity function, which we use to compute new features. We often use the Gaussian kernel and the linear kernel, namely no kernel. Because we use each training example as a landmark in the similarity function to generate new features, here n is equal to m.
Gaussian kernel: as sigma increases, the features vary more smoothly, which gives higher bias and lower variance and corresponds to a decrease of C.
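A minimal Python sketch of the Gaussian kernel features follows. The function names are hypothetical, and the sketch assumes every training example is used as a landmark, as described above.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)), a value in (0, 1]."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def kernel_features(x, training_examples, sigma):
    """Map x to one similarity feature per training example (so n = m)."""
    return [gaussian_kernel(x, l, sigma) for l in training_examples]
```

With a larger sigma the similarity falls off more slowly with distance, which is the smoother behaviour mentioned above. Feature scaling matters here too, otherwise the feature with the largest range dominates the distance.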
When performing multi-class classification, we use the one-vs-all method: we train K SVMs to get theta1, theta2, ..., thetaK, and pick the class i with the largest theta_i' * x.
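The prediction step of one-vs-all can be sketched in Python as below. The sketch assumes the K parameter vectors have already been trained; the function name is hypothetical, and the returned class index is 0-based, unlike the 1-indexed notation of the notes.

```python
def predict_one_vs_all(thetas, x):
    """Return the 0-based index i of the class with the largest theta_i' * x."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    return max(range(len(scores)), key=lambda i: scores[i])
```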
Logistic Regression Vs SVMs:
Week 8 Unsupervised Learning (K-means and PCA algorithm)
- We are given data that has no labels associated with it. We ask the algorithm to find some structure in the data for us. This week we mainly talk about the clustering algorithm. Possible applications are market segmentation, social network analysis and understanding of galaxy data.
- K-means is the most widely used clustering algorithm.
- First we initialize K points called cluster centroids (two in the simplest example), then we repeatedly assign the points to the centroids and move the centroids.
- If there is a centroid which no example is assigned to, we usually eliminate that centroid, which leaves N < K clusters (here N is the number of clusters we finally get).
- There are two main steps in the K-means algorithm. One is the update of c_i (it is set to the index of the cluster centroid which is closest to x_i), and the other is the computation of each miu_k, which is the average of the x assigned to cluster k. Notice that c_i takes values in the same range as k (the index of the cluster).
- To avoid a poor local optimum, if the number of clusters is not large, we can repeat the random initialization (one way to initialize is to select K examples and set miu_i = x_i for the K examples selected) and keep the clustering which minimizes J(c_1, ..., c_m, miu_1, ..., miu_K). To decide the number of clusters we may use the elbow method, but usually we just consider the downstream purpose. In the curve of J against K, if J increases as K increases, then K-means was probably stuck in a local optimum and we need to perform the initialization again.
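The two K-means steps and the repeated random initialization can be sketched in Python as follows. This is a minimal sketch: the function names, the seeds and the iteration cap are assumptions, and an empty centroid is simply kept in place here rather than eliminated.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Run K-means once; return (centroids, assignments, cost J)."""
    rng = random.Random(seed)
    # Initialization: set each miu_i to one of k randomly chosen examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = []
    for _ in range(iterations):
        # Step 1: c_i = index of the centroid closest to x_i.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 2: miu_j = average of the points assigned to cluster j.
        new_centroids = []
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    cost = sum(squared_distance(p, centroids[c])
               for p, c in zip(points, assignments)) / len(points)
    return centroids, assignments, cost


def best_of_restarts(points, k, restarts=10):
    """Repeat the random initialization and keep the run with the lowest J."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: run[2])
```

Keeping the run with the lowest J is what protects against a single unlucky initialization.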
- PCA is often used to reduce the dimension of the input data. PCA is different from linear regression, though they may seem alike when we reduce data from 2D to 1D: PCA minimizes the projection distances, while linear regression minimizes the vertical prediction errors.
- We use the U matrix to compute z from x: z = U' * x. Notice that the dimension of x is n*1, of z is k*1, and of U (keeping its first k columns) is n*k.
- Before using PCA, we need to consider performing feature scaling and mean normalization.
- The principle for choosing k is to keep 95% or 99% of the variance. We can express that with the S matrix from [U S V] = svd(sigma): choose the smallest k for which the sum of the first k diagonal entries of S divided by the sum of all diagonal entries is at least 0.95 or 0.99. That is the same as requiring the average squared projection error ||x - x_approx||^2 to be at most 5% or 1% of the total variation in the data.
- PCA is a very powerful unsupervised learning algorithm. It is often used to compress data to reduce memory and disk usage, to visualize data (with k = 2 or k = 3), and to reduce the number of features.
- We can also use it in supervised learning: compute the mapping from x_train to z_train, learn the hypothesis on z_train, then map x_cv or x_test to z_cv or z_test with the same mapping, apply the hypothesis and predict. But remember that PCA does not consider y and may drop some valuable information, so it's not a good choice to use PCA to prevent overfitting.
- It's better to first train the algorithm on the original data and see how it performs rather than rush to PCA.
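Returning to the choice of k for PCA, the variance-retained rule can be sketched in Python. The sketch assumes the diagonal entries of S are already available as a list sorted from largest to smallest (as svd returns them); the function name choose_k is hypothetical.

```python
def choose_k(singular_values, retained=0.99):
    """Smallest k whose leading entries of S keep the requested share of variance."""
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= retained:
            return k
    return len(singular_values)
```

Because S comes from a single svd call, every candidate k can be checked without running PCA again.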