- # of layers
- # of hidden units
- learning rates
- what activation functions used for different layers
Area: NLP, Vision, Speech, Structured data(ads, search, security, logistic...)
Idea -> code -> Experiment -> Idea
- 60/20/20 in old times when data is quite small (100)
- Now in big data era,
- The
dev sets
goal is to test which algorithm works better - The main goal of your
test set
is, given your final classifier to give you a pretty confident estimate of how well it's doing - So if we have a 1,000,000 data, we need just 10,000 for dev and 10,000 for testing, the ratio should be
98/1/1
- For data more than 1 million, we can use 99.5/0.25/0.25
Examples:
Set type | Desc | Features |
---|---|---|
Training | Cat pictures from webpages | Hight resolution, nicely framed pictures |
Dev/test | Cat pictures from users using your app | blurrier, lower res images |
⚠ Make sure the dev and test come from same distribution
- But deep learning training requires huge amount of data, you might use all sorts of creative tactics such as crawling webpages, so the dev set and test set may not come from same distribution, but if they do, the progress in machine learning algorithm will be faster.
- Not having a test set might be Ok
- Just train different models on you training sets
- evaluate these models with dev sets, dev sets also called test sets in this situation
- Having set up a train dev and test set will allow you to integrate more quickly
- Allow you to more efficiently measure the bias and variance of the algorithm
- Be more efficient select ways to improve the algorithm
- High bias: under fitting
- high variance: over fitting
![](/assets/Screen Shot 2018-03-03 at 11.44.28 PM.png) ![](/assets/Screen Shot 2018-03-03 at 11.44.40 PM.png) ** The purple line represent the high bias and high variance output**
Example: Cat classification
Train set error | Dev set error | Type |
---|---|---|
1% | 11% | High variance, unfitting the dev set data |
15% | 16% | High bias, unfitting the training set data |
0.5% | 1% | low bias and low variance |
Under the assumption that human level performance gets nearly 0% error or more generally, that the optimal error, called Bayes error, which is nearly 0%
By looking at the algorithm error on training set and dev set, you can diagnose whether it has high bias or high variance, or both, or maybe neither.
high bias(Training set performance) | V Bigger network, train longer | V high variance (Dev set performance) | V More data, regularition, (NN Architecture) | V Done
Training a NN can reduce both variance and bias, without worrying about increasing the other, the main cost of training a neural network that's too big is that computational time
- Adding regularization will help to prevent overfitting, or to reduce the errors in network
- Get more training data
L2 Regularization
- Logistic regression:
- Try to minimize the cost function J, some of the training examples of the losses of the individual predictions in the different examples, w and b are parameters
- w is an x-dimensional parameter vector, and b is a real number
- lambda: regularization parameter, hyper parameter need to tune
(
- Neural network
- Cost function: sum of losses
- Forbenius norm: Means the sum of square of elements of a a matrix
- L2 regularization is also called weight decay
![](/assets/Screen Shot 2018-03-04 at 12.44.16 AM.png)
- Reduce impact of hidden units
- Make network simpler and smaller network
![](/assets/Screen Shot 2018-03-04 at 12.44.04 AM.png)
- tanh activation function, g(z) =tanh(z)
- if lambda is large
- => the weight function become small, because they are penalized being large into a cos function, because z = wa +b
- => z become small, according to tanh function graph
- => the line become linear when z is small
- => thus every layer will be roughly linear
- => the whole network is more just a linear network
If the regularization parameter becomes very large, w become small, z become small or take small range of values, tanh will become linear, and the whole nn will more like a linear function