总结

2020-04-30

Summary for 6000G Assignment

文章目录

1. Prepare
2. Assignment One

6000G course is Deep Learning Meets Computer Vision that I learnt in HKUST Fall semester in the first year. The course focuses on image process in Machine Learning. Its teaching materials are from Standford cs231n course Convolutional Neural Networks for Visual Recognition including its PPT, lecture notes, example codes and assignments.

After doing each assignment, I’ll mark down my thoughts, questions, solutions and other interesting things here.

The all assignments are divided into three parts:

Assignment #1: Image Classification, kNN, SVM, Softmax, Neural Network
Assignment #2: Fully-Connected Nets, Batch Normalization, Dropout, Convolutional Nets
Assignment #3: Image Captioning with Vanilla RNNs, Image Captioning with LSTMs, Network Visualization, Style Transfer, Generative Adversarial Networks

Prepare

The course has pre-prepared the virtual barn ( Adv. GPU Pool ) for our each student via the VMWare Horizon Client and VPN. However using this win10 system to do the experiment, I found that it works even worse than my own laptop, slower speed and delayed reponse. Thus I went back to my local environment.

The full package of assignment materials run on win10 and contain the python 3.6, jupyter notebook, WinPy, IDLE and etc. You just easily click your mouse to open jupyter, and then begin to work without scratching your head to configure the environment.

The assignment can be described as code complement : With the algorithm skeleton codes given, you need to complete them. You always require numpy to help you express matrix, to calculate some parameters like loss, gradient, accuracy, relative error, to tune hyper-parameter or to compare different model. Some inline questions are asked to make you think the reason behind a phenomenon. No matter what the problem is, you have to understand the backgound knowledge of each ML model. Otherwise you don’t know how to start.

Assignment One

At the first step I downloaded the experiment data ( CIFAR-10 image data set with labels ) by running getdata.py. Then I noticed the main structure with tutorial is in .ipynb file which calls functions from some .py files. The first difficulty I met was numpy broadcast features in knn no-loop algorithm. It’s also comfused me that the direction of axis = 0 or axis = 1 and the effect of keepdims = True.

The other obstacles I encountered are as follows :

How to calculate the loss function?
How to calculate the gradient of loss function?
How to tune the hyper-parameter in neural network?
The principle of features extraction?
The different way to process the image between raw pixels and features ( HOG + color histogram ).

1.1 numpy boardcast

These three pictures show a clear view how numpy broadcast works.

2D-axis=0
2D-axis=1
3D-axis=0

1.2 axis and keepdims

x = np.array([[0, 2, 1], 
              [1, 3, 5]])

print (x.sum())
print (x.sum(axis = 0))
print (x.sum(axis = 1))
print (x.sum(axis = 1, keepdims = True))

The output

12

[1 5 6]

[3 9]

[[3]
 [9]]

1.3 How to calculate loss and the gradient of it?

SVM

We have a training set X ( n samples with d dimensions ) to train a model i.e. learning a W ( d dimensions and c classes ) . The y contains the real label of each sample in X. S is the scores matrix to evaluate the quality of our model i.e. finding the ideal W.

$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1d}\\ x_{21} & x_{22} & \dots & x_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{N2} & \dots & x_{nd} \end{bmatrix}$ $W = \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1c}\\ w_{21} & w_{22} & \dots & w_{2c}\\ \vdots & \vdots & \ddots & \vdots\\ w_{d1} & w_{d2} & \dots & w_{dc} \end{bmatrix}$ $y = \begin{pmatrix} y_{1} & y_{2} & \dots & y_{n}\\ \end{pmatrix}$ $S = X \cdot W = \begin{bmatrix} s_{11} & s_{12} & \dots & s_{1c}\\ s_{21} & s_{22} & \dots & s_{2c}\\ \vdots & \vdots & \ddots & \vdots\\ s_{n1} & s_{n2} & \dots & s_{nc} \end{bmatrix}$

We see that $s_{ij}$ shows that the $i^{th}$ sample score performing in the $j^{th}$ class. And we calculate the loss using following expression:

$loss = \sum_{i=1}^{n}\sum_{j=1}^{c}max(f) \\ where \quad f= \begin{cases} max[0, s_{ij}-s_{iy[i]} + 1] & j \neq y_{[i]} \\ 0 & j = y_{[i]} \end{cases}$ $s_{ij}=X[i,:]\cdot W[:,j]$

Notice that we may get zero sometimes, we consider that the $i^{th}$ row of $(S - S[:,y].broadcast + 1)$ maxtrx has $k_i$ none-zero values. And this $k_i$ values columns numbers are $t_{1}^i,t_{2}^i,\dots,t_{k_i}^i$. So the loss expression can rewrite to

$\begin{aligned} loss & = \frac{1}{n}\sum_{i=1}^{n}\left(\left(\sum_{j=t_{1}^{i}}^{t_{k_i}^i}s_{ij}\right) - k_i\cdot s_{iy[i]}\right) \\ & = \frac{1}{n}\sum_{i=1}^{n}\left(\left(\sum_{j=t_{1}^i}^{t_{k_i}^i}X[i,:]\cdot W[:,j]\right) - k_i\cdot X[i,:]\cdot W[:,y[i]] \right) \end{aligned}$

We can calculate the derivative of loss with respect to W and get a $d\times c$ matrix ( gradient matrix ). Initialize

$\begin{aligned} \partial w = \frac{\partial loss}{\partial W} = Zeros(d,c) \end{aligned}$

Then let

$\begin{aligned} \partial w[:,j] = \begin{cases} \sum_{i=1}^{n}X[i,:].T & j = t_{1}^i,t_{2}^i,\dots,t_{k_i}^i\\ \sum_{i=1}^{n}-k_i\cdot X[i,:].T & j = y[i] \end{cases} \end{aligned}$

If we want to add regularization, we can let

$loss = loss + reg\cdot||W||$

and the gradient need to add

$\partial w = \partial w+ 2 \cdot reg \cdot W$

Softmax

The way to calculate Matrix S is same to SVM. However Softmax uses softmax value to calculate the loss and gradient. We use P maxtrix to express the softmax formula. This is

$S = X \cdot W = \begin{bmatrix} s_{11} & s_{12} & \dots & s_{1c}\\ s_{21} & s_{22} & \dots & s_{2c}\\ \vdots & \vdots & \ddots & \vdots\\ s_{n1} & s_{n2} & \dots & s_{nc} \end{bmatrix}$

Then we make

$P = \begin{bmatrix} e^{s_{11}}/\sum_{j=1}^ce^{s_{1j}} & e^{s_{12}}/\sum_{j=1}^ce^{s_{1j}} & \dots & e^{s_{1c}}/\sum_{j=1}^ce^{s_{1j}}\\ e^{s_{21}}/\sum_{j=1}^ce^{s_{2j}} & e^{s_{22}}/\sum_{j=1}^ce^{s_{2j}} & \dots & e^{s_{2c}}/\sum_{j=1}^ce^{s_{2j}}\\ \vdots & \vdots & \ddots & \vdots\\ e^{s_{n1}}/\sum_{j=1}^ce^{s_{nj}} & e^{s_{n2}}/\sum_{j=1}^ce^{s_{nj}} & \dots & e^{s_{nc}}/\sum_{j=1}^ce^{s_{nj}} \end{bmatrix}$

In order to avoid numeric instability when we do exp and log, we would minus the max score in each row like this

$S' = \begin{bmatrix} s_{11}-max(row_1) & s_{12}-max(row_1) & \dots & s_{1c}-max(row_1)\\ s_{21}-max(row_2) & s_{22}-max(row_2) & \dots & s_{2c}-max(row_2)\\ \vdots & \vdots & \ddots & \vdots\\ s_{n1}-max(row_n) & s_{n2}-max(row_n) & \dots & s_{nc}-max(row_n) \end{bmatrix} \\$ $where \quad row_i = \{s_{i1},s_{i2},\dots,s_{ic}\}$

Then we get $P’$ like $S$ to $P$. The loss

$loss = -\frac{1}{n} \sum_{i=1}^nlog\{P'(i,y[i])\}$

To simple the illustration, we use $S$ and $P$ to express $S’$ and $P’$. Notice that we make $S = S’,P = P’$.

It’s easy to think that

$\begin{aligned} (-n)\cdot loss & = log\frac {e^{s_{1y[1]}}}{\sum_{j=1}^ce^{s_{1j}}} + log\frac {e^{s_{2y[2]}}}{\sum_{j=1}^ce^{s_{2j}}} + \dots + log\frac {e^{s_{ny[n]}}}{\sum_{j=1}^ce^{s_{nj}}} \\ & = \left(s_{1y[1]}-log{\sum_{j=1}^ce^{s_{1j}}}\right) + \left(s_{2y[2]}-log{\sum_{j=1}^ce^{s_{2j}}}\right) + \dots + \left(s_{ny[n]}-log{\sum_{j=1}^ce^{s_{nj}}}\right) \end{aligned}$

Do partial derivative

$\begin{aligned} (-n)\cdot \frac{\partial loss}{\partial w_{ij}} &= \left(\frac{\partial s_{1y[1]}}{\partial w_{ij}}-\frac{1}{\sum_{j=1}^ce^{s_{1j}}}\frac{\partial \sum_{j=1}^ce^{s_{1j}}}{\partial w_{ij}}\right) + \left(\frac{\partial s_{2y[2]}}{\partial w_{ij}}-\frac{1}{\sum_{j=1}^ce^{s_{2j}}}\frac{\partial \sum_{j=1}^ce^{s_{2j}}}{\partial w_{ij}}\right) + \dots + \left(\frac{\partial s_{ny[n]}}{\partial w_{ij}}-\frac{1}{\sum_{j=1}^ce^{s_{nj}}}\frac{\partial \sum_{j=1}^ce^{s_{nj}}}{\partial w_{ij}}\right) \end{aligned}$

Let’s take one part to see more details,

if $j=y[1]$ then

$\begin{aligned} &\frac{\partial s_{1y[1]}}{\partial w_{ij}} -\frac{1}{\sum_{j=1}^ce^{s_{1j}}}\frac{\partial \sum_{j=1}^ce^{s_{1j}}}{\partial w_{ij}}\\ = & \frac{\partial s_{1y[1]}}{\partial w_{ij}} - \frac{1}{\sum_{j=1}^ce^{s_{1j}}} \cdot \left(\sum_{j=1}^{c}\frac{\partial e^{s_{1j}}}{\partial w_{ij}}\right)\\ = & \frac{\partial s_{1y[1]}}{\partial w_{ij}} - \frac{1}{\sum_{j=1}^ce^{s_{1j}}} \cdot \left(\sum_{j=1}^{c}\frac{\partial s_{1j}}{\partial w_{ij}}\cdot e^{s_{1j}} \right)\\ = & \frac{\partial X[i,:]\cdot W[:,y[1]]}{\partial w_{ij}} - \frac{1}{\sum_{j=1}^ce^{s_{1j}}} \cdot \left(\sum_{j=1}^{c}\frac{\partial X[1,:]\cdot W[:,j]}{\partial w_{ij}}\cdot e^{s_{1j}} \right)\\ = & x_{ij} - \frac{s_{1j}}{\sum_{j=1}^ce^{s_{1j}}} \cdot \left(x_{1j}\right) \\ = & x_{ij} - p_{1j} \cdot x_{1j} \end{aligned}$

if $j\neq y[1]$ then

$\begin{aligned} &\frac{\partial s_{1y[1]}}{\partial w_{ij}} -\frac{1}{\sum_{j=1}^ce^{s_{1j}}}\frac{\partial \sum_{j=1}^ce^{s_{1j}}}{\partial w_{ij}}\\ = & - p_{1j} \cdot x_{1j} \end{aligned}$

Consider the two different situation $j=y[1]$ and $j\neq y[1]$. Make

$P[:,y] = P[:,y] - 1$

Then we’ll find that ( here we do partial derivative to $W$; we do partial derivative to $w_{ij}$ ahead )

$\begin{aligned} (-n)\cdot \frac{\partial loss}{\partial W} & = (-1)\cdot X.T\times P\\ \end{aligned}$

So the gradient

$\begin{aligned} \frac{\partial loss}{\partial W} & = \frac{1}{n}\cdot X.T\times P\\ \end{aligned}$