
CS231n Lecture 03. Loss Functions and Optimization

by 자근기적밍기적 2024. 1. 16.

Contents

  • Loss Functions (SVM, softmax)
  • Regularization
  • Optimization (Random Search, Gradient descent)

 

 

After setting up the classifier f(x, W), we need to ...

  1. Define a loss function that quantifies our unhappiness with the scores across the training data.
  2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

 

 

Loss Function

Given a dataset of examples $\{(x_i, y_i)\}^N_{i=1}$, where $x_i$ is the image and $y_i$ is the label,

the loss over the dataset is the average of the per-example losses:

$L = \frac{1}{N}\sum_i{L_i(f(x_i,W), y_i)}$

 

 

(multiclass) Hinge loss:

Given an example ($x_i$, $y_i$) where $x_i$ is the image and where $y_i$ is the label,

and using the shorthand for the scores vector: s = f($x_i$, W)

 

Hinge loss is 

$L_i = \sum_{j\neq y_i}max(0, s_j - s_{y_i}+1)$

 

+) The 1 is the margin: we want the correct-class score to be higher than every other class score by at least this margin.

    The exact value 1 is somewhat arbitrary, since the overall scale of the scores depends on the scale of W anyway.

 

Source: cs231n_2017_lecture3.pdf

 

Example

Source: cs231n_2017_lecture3.pdf

The SVM loss of cat 
= the loss from the car + the loss from the frog
= max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) 
= max(0, 2.9) + max(0, -3.9)
= 2.9 + 0
= 2.9
The SVM loss of car
= the loss from the cat + the loss from the frog
= max(0, 1.3-4.9+1) + max(0, 2.0-4.9+1) 
= max(0, -2.6) + max(0, -1.9)
= 0 + 0 
= 0
The SVM loss of frog
= the loss from the cat + the loss from the car
= max(0, 2.2-(-3.1)+1) + max(0, 2.5-(-3.1)+1)
= max(0, 6.3) + max(0, 6.6)
= 6.3 + 6.6
= 12.9
The average SVM loss over full dataset 
= $\frac{1}{3}(2.9+0+12.9)$
= 5.27
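
As a quick check (my own NumPy sketch, not the lecture's code), the per-example losses and their average can be recomputed directly from the score columns above:

import numpy as np

# score matrix from the slide: one column per image, one row per class (cat, car, frog)
scores = np.array([[3.2, 1.3,  2.2],
                   [5.1, 4.9,  2.5],
                   [-1.7, 2.0, -3.1]])
labels = np.array([0, 1, 2])   # correct class of each image: cat, car, frog

losses = []
for i, y in enumerate(labels):
    s = scores[:, i]
    margins = np.maximum(0, s - s[y] + 1)   # hinge terms for every class
    margins[y] = 0                          # the correct class contributes nothing
    losses.append(margins.sum())

print(losses)           # -> 2.9, 0.0, 12.9 (up to floating point)
print(np.mean(losses))  # -> 5.27 (approximately)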

 

 

Questions of SVM

Q1: What happens to the loss if the car scores change a bit?

       No change in the loss for the car image, because the correct (car) score is already higher than the other scores by more than the margin of 1.

Q2: What is the min/max possible loss?

       0 ~ infinity

Q3: At initialization W is small, so all scores s are approximately 0. What is the loss?

       # of classes - 1, since each incorrect class contributes max(0, 0 - 0 + 1) = 1. A useful sanity check at the start of training.

Q4: What if the sum was over all classes? (including j = y_i)

       The loss increases by 1.

Q5: What if we used mean instead of sum?

        No change in which W is best; the loss is only rescaled by a constant.

Q6: What if we used the squared term? ($L_i = \sum_{j\neq y_i}\max(0, s_j - s_{y_i}+1)^2$)

       It changes: the squared hinge loss penalizes large violations much more heavily, so it is a different classification algorithm.

 

 

Code for SVM loss

import numpy as np

def L_i_vectorized(x, y, W):
    # class scores: s = W x
    scores = W.dot(x)
    # hinge terms max(0, s_j - s_{y_i} + 1) for every class j
    margins = np.maximum(0, scores - scores[y] + 1)
    # the correct class should not contribute to the loss
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i
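
As a toy sanity check (my own, not from the slides), choose W and x so that W.dot(x) reproduces the cat column of the example above; the function then returns the 2.9 computed by hand:

W = np.array([[3.2], [5.1], [-1.7]])   # picked so that W.dot(x) = [3.2, 5.1, -1.7]
x = np.array([1.0])                    # dummy 1-dimensional "image"
print(L_i_vectorized(x, y=0, W=W))     # 2.9, the cat loss from the worked example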

 

 

Suppose that we found a W such that L = 0. Is this W unique?  No! 2W also has L = 0!
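
A small numerical illustration of this (my own): take the car column of the example, where the loss is already 0, and double the scores as if W had been doubled; the margins only grow, so the loss stays 0.

import numpy as np

s = np.array([1.3, 4.9, 2.0])            # car column, correct class y = 1, loss is 0
for scale in (1.0, 2.0):                 # scaling W by 2 scales all scores by 2
    m = np.maximum(0, scale * s - scale * s[1] + 1)
    m[1] = 0
    print(scale, m.sum())                # prints a loss of 0.0 for both scales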

 


 

Regularization

Among competing hypotheses,
the simplest is the best.
- William of Ockham
  • Data loss: Model predictions should match training data. $\frac{1}{N}\sum_{i=1}^NL_i(f(x_i, W), y_i)$
  • Regularization: Model should be simple, so it works on test data. $\lambda R(W)$

 

Therefore, the loss function is 

$L(W)=\frac{1}{N}\sum_{i=1}^NL_i(f(x_i, W), y_i)+\lambda R(W)$
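
A minimal sketch of this objective in code, assuming L2 regularization $R(W)=\sum_k\sum_l W_{k,l}^2$ and reusing the L_i_vectorized hinge loss from above (my own illustration, not the lecture's code):

import numpy as np

def full_loss(X, y, W, lam):
    # X: D x N data matrix (one column per example), y: N labels, W: C x D weights
    N = X.shape[1]
    data_loss = sum(L_i_vectorized(X[:, i], y[i], W) for i in range(N)) / N
    reg_loss = lam * np.sum(W * W)   # L2 regularization: sum of squared weights
    return data_loss + reg_loss

Here lam stands for the $\lambda$ above; it trades off fitting the training data against keeping W simple.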

 

Types of Regularization

  • L2 regularization: $R(W)=\sum_k\sum_l W_{k,l}^2$
  • L1 regularization: $R(W)=\sum_k\sum_l |W_{k,l}|$
  • Elastic net (L1 + L2)
  • Dropout, Batch normalization, ...

 

 

+++ Listen to the part about L2 regularization one more time

 

 


 

Softmax Classifier 

 

Scores = unnormalized log probabilities of the classes 

$P(Y=k|X=x_i)= \frac{e^{s_k}}{\sum_je^{s_j}}\ \textup{(softmax function)} \ where \ s=f(x_i;W)$

 

Want to maximize the log likelihood or to minimize the negative log likelihood of the correct class:

$L_i = -logP(Y=y_i|X=x_i)$

 

In summary,

$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$
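
A small NumPy sketch of this loss (my own, not the lecture's code). Subtracting the maximum score before exponentiating does not change the probabilities but avoids numerical overflow:

import numpy as np

def softmax_cross_entropy(s, y):
    # s: score vector for one example, y: index of the correct class
    s = s - np.max(s)                   # shift for numerical stability (probabilities unchanged)
    p = np.exp(s) / np.sum(np.exp(s))   # softmax probabilities
    return -np.log(p[y])                # negative log likelihood of the correct class

# sanity check: with all-zero scores every class gets probability 1/C,
# so the loss is log(# of classes)
print(softmax_cross_entropy(np.zeros(10), y=0))   # log(10) ≈ 2.30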

 

 

Example

1. unnormalized log probabilities

Source: cs231n_2017_lecture3.pdf

 

 

2. Convert the log probabilities to (unnormalized) probabilities with the exponential function

       unnormalized log prob.    unnormalized prob.    prob.
cat    3.2                       e^{3.2} = 24.5
car    5.1                       e^{5.1} = 164.0
frog   -1.7                      e^{-1.7} = 0.18

 

 

3. Normalize them to get probabilities

       unnormalized log prob.    unnormalized prob.    prob.
cat    3.2                       e^{3.2} = 24.5        24.5 / (24.5 + 164.0 + 0.18) = 0.13
car    5.1                       e^{5.1} = 164.0       164.0 / (24.5 + 164.0 + 0.18) = 0.87
frog   -1.7                      e^{-1.7} = 0.18       0.18 / (24.5 + 164.0 + 0.18) = 0.00

 

 

4. Compute loss with "-log(prob)"

    L_cat = -log(0.13) = 0.89
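
Checking these numbers with NumPy (my own check; note that np.log is the natural log, which gives about 2.04 here, while the base-10 log reproduces the 0.89 shown above):

import numpy as np

s = np.array([3.2, 5.1, -1.7])       # cat, car, frog scores
p = np.exp(s) / np.sum(np.exp(s))
print(p)                              # ≈ [0.13, 0.87, 0.00]
print(-np.log(p[0]))                  # ≈ 2.04 with the natural log
print(-np.log10(p[0]))                # ≈ 0.89 with the base-10 log, matching the value above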

 

 

Questions of cross-entropy loss

Q1: What is the min/max possible loss L_i?

      0 ~ infinity (0 is only approached when the correct class gets probability 1)

Q2: Usually at initialization W is small, so all scores s are approximately 0. What is the loss?

      log(# of classes), since every class then gets probability 1 / (# of classes). A useful sanity check at the start of training.

 

 

Difference between SVM and Softmax

Q. Suppose I take a datapoint and jiggle it a bit. What happens to the loss in both cases?

A. With the hinge loss, once the correct-class score beats every other score by more than the margin of 1, small changes make no difference: the loss stays 0.

    With the cross-entropy loss, the loss always changes: it keeps pushing the probability of the correct class towards 1, so it always wants the correct score to go higher and the others lower.
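
A tiny illustration of this difference (my own example): start from scores where the correct class already wins by more than the margin, then push the correct score even higher. The hinge loss is 0 before and after, while the cross-entropy loss keeps shrinking.

import numpy as np

def hinge(s, y):
    m = np.maximum(0, s - s[y] + 1)
    m[y] = 0
    return m.sum()

def cross_entropy(s, y):
    p = np.exp(s - np.max(s)) / np.sum(np.exp(s - np.max(s)))
    return -np.log(p[y])

s1 = np.array([5.0, 2.0, 1.0])   # correct class 0 already beats the rest by more than 1
s2 = np.array([7.0, 2.0, 1.0])   # jiggle: correct score pushed even higher
print(hinge(s1, 0), hinge(s2, 0))                   # 0.0 0.0  -> hinge loss unchanged
print(cross_entropy(s1, 0), cross_entropy(s2, 0))   # ≈ 0.066 -> ≈ 0.009, still decreasing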

 


 

Optimization

Q. Then, how do we find the W that minimizes the loss?  Optimization!

 

1. Random search: try many random W's and keep the best one. A bad idea, since the resulting accuracy is poor.

2. Follow the slope: gradient descent. The gradient of the loss tells us the direction of steepest increase, so we repeatedly step in the opposite direction (see the sketches below).

 

 

How to compute the gradient? Numerically, with finite differences (approximate and slow, but easy to write), or analytically, with calculus (exact and fast, but more error-prone to derive).
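
A minimal sketch of the numerical (finite-difference) gradient, assuming some loss function f that takes a float weight array W (my own illustration):

import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # centered differences: dL/dW_i ≈ (f(W + h e_i) - f(W - h e_i)) / (2h)
    grad = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h
        f_plus = f(W)
        W.flat[i] = old - h
        f_minus = f(W)
        W.flat[i] = old                  # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

Each coordinate needs two loss evaluations, which is exactly the problem raised below.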

 

 

But... the numerical gradient needs a loss evaluation for every single parameter, which is approximate and far too slow. In practice we use the analytic gradient, and keep the numerical gradient only as a correctness check (gradient check).
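
With a gradient in hand, vanilla gradient descent is just the repeated update W ← W − step_size · ∇L(W). A minimal runnable sketch, where a toy quadratic loss stands in for the real SVM/softmax loss (my own illustration):

import numpy as np

def toy_loss(W):
    return np.sum(W * W)        # stand-in for the real training loss

def toy_grad(W):
    return 2 * W                # its analytic gradient

W = 0.0001 * np.random.randn(10, 3073)   # small random init (CIFAR-10-sized W)
step_size = 1e-2                          # learning rate: a key hyperparameter

for t in range(100):
    W = W - step_size * toy_grad(W)       # step against the gradient

print(toy_loss(W))                        # the loss has shrunk towards 0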


Reference

https://www.youtube.com/watch?v=h7iBpEHGVNc&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv