## Python Basics

#### The ‘import’ Command

- from functions import *

**a_method** defined in **functions.py** can be used as **a_method** in the current namespace

- import functions as fc

**a_method** defined in **functions.py** can be used as **fc.a_method** in the current namespace

- import functions

**a_method** defined in **functions.py** can be used as **functions.a_method** in the current namespace
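The three import styles above can be demonstrated with the standard-library `math` module standing in for the hypothetical functions.py:

```python
# Illustrating the three import styles with the standard-library `math`
# module in place of the hypothetical functions.py:

from math import *      # names are pulled directly into the current namespace
print(sqrt(9.0))        # 3.0

import math as m        # names are accessed through the alias
print(m.sqrt(9.0))      # 3.0

import math             # names are accessed through the full module name
print(math.sqrt(9.0))   # 3.0
```

`from … import *` is convenient but can silently shadow existing names, so the aliased or fully qualified forms are usually preferred.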

## On Pylearn2

(Block diagram of Pylearn2)

- Download dataset: download the dataset to ${PYLEARN2_DATA_PATH}/, **naming the dataset according to** the conventions in pylearn2/datasets/
- Modify make_dataset.py: point it to the above naming convention; set the output folder for dataset_name.pkl
- Run make_dataset.py
- Construct the YAML file: with the above dataset_name.pkl; modify the related parameters…
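The make_dataset step essentially serializes a dataset object to a .pkl file that the YAML file can refer to. A minimal sketch with the standard pickle module, where a plain nested list stands in for a real Pylearn2 dataset object:

```python
import os
import pickle
import tempfile

# Stand-in for a real pylearn2 dataset object built from the downloaded data.
dataset = [[0.0, 1.0], [1.0, 0.0]]

# Serialize to dataset_name.pkl in the configured output folder
# (a temporary directory here), so the YAML file can point at it.
path = os.path.join(tempfile.gettempdir(), "dataset_name.pkl")
with open(path, "wb") as f:
    pickle.dump(dataset, f)

# The training script later loads the pickled dataset back.
with open(path, "rb") as f:
    restored = pickle.load(f)
```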

## On Image Processing

#### Contrast Normalization

```matlab
patches = bsxfun(@rdivide, bsxfun(@minus, patches, mean(patches,2)), sqrt(var(patches,[],2)+10));
```

Here the +10 inside the square root prevents division by a near-zero variance, which would otherwise produce extremely large values after normalization.
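A NumPy equivalent of the MATLAB line above (ddof=1 matches MATLAB's default sample-variance normalization; the data here is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((5, 64))  # toy data: 5 patches of 64 pixels each

# Subtract each patch's mean and divide by its standard deviation;
# the +10 inside the square root keeps near-constant patches from
# producing huge values after division.
mean = patches.mean(axis=1, keepdims=True)
var = patches.var(axis=1, ddof=1, keepdims=True)
patches = (patches - mean) / np.sqrt(var + 10)
```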

#### Whitening

We have used PCA to reduce the dimension of the data. There is a closely related preprocessing step called **whitening** (or, in some of the literature, **sphering**) which is needed for some algorithms. If we are training on images, the raw input is redundant, since adjacent pixel values are highly correlated. The goal of whitening is to make the input less redundant; more formally, our desiderata are that our learning algorithm sees a training input where (i) the features are less correlated with each other, and (ii) the features all have the same variance.
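A minimal ZCA-whitening sketch in NumPy (rows are examples; `eps` is a small regularizer, assumed here, that keeps small eigenvalues from exploding):

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    X = X - X.mean(axis=0)                # center each feature
    cov = X.T @ X / X.shape[0]            # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)          # eigendecomposition of cov
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening matrix
    return X @ W

rng = np.random.default_rng(0)
Xw = zca_whiten(rng.normal(size=(500, 4)))
# After whitening, the feature covariance is approximately the identity:
# decorrelated features, all with (roughly) unit variance.
```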

### Dictionary Learning and Feature Extraction for One Image

#### DL is done only on the (randomly) **selected patches** from the dataset.

1. Randomly select patches from the dataset.

2. Normalize and whiten the patches.

3. Train the dictionary with the processed patches.
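A toy sketch of the three steps, with random data in place of real image patches and a few iterations of plain k-means as the dictionary learner:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: randomly "select" 200 patches of 16 pixels (random data here).
patches = rng.normal(size=(200, 16))

# Step 2: per-patch contrast normalization (whitening omitted for brevity).
patches -= patches.mean(axis=1, keepdims=True)
patches /= np.sqrt(patches.var(axis=1, keepdims=True) + 10)

# Step 3: learn a k-atom dictionary with a few iterations of k-means.
k = 8
D = patches[rng.choice(len(patches), k, replace=False)]  # init with patches
for _ in range(10):
    dists = ((patches[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)            # assign patches to atoms
    for j in range(k):
        members = patches[labels == j]
        if len(members):                     # keep empty atoms unchanged
            D[j] = members.mean(axis=0)      # recenter each atom
```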

#### Encoding is done for all the training images, e.g., all images in the CIFAR-10 dataset.

Generally speaking, the feature extraction for **one image** is composed of the following steps:

1. Select **one image** from the dataset;

2. Generate **all possible patches** of the image, in all channels (R, G, B); normalize and whiten them;

3. **Encode every possible patch** within the image with the **trained dictionary**, using specialized coding methods (dim n -> k);

4. Polarity splitting of the **patch** codes (optional) (dim k -> 2k);

5. m×m pooling within the **image**, to construct a general representation of the image while extending the per-patch dimension only by the number of pooling regions (methods include max pooling, average pooling, etc.) (dim k -> k·m²);

6. Output the **feature vector z** representing the **image**;

[Go Deep]

7. Regard the feature vectors **z** of the random samples as the input of the next layer ((k·m²) × num_samples); however, the dimension of each vector is too large, requiring an exponential number of examples to train a dictionary with k-means, so we have to **shrink the output dimension**.

8. **Local receptive field selection**: within a certain area of the image, select the **T** most strongly correlated dimensions of **z** to form a **local receptive field** (T × num_samples). The above process can be rerun **N** times to form **N** such fields. Select the corresponding features from the first-layer output of each sample to form the inputs of the second layer.

9. Train each receptive field separately, to get its respective **dictionary** (T × k2).

10. Repeat step 3 for each dictionary.

11. Finally, concatenate the output codes from step 10.
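Steps 2–5 above, sketched for a single toy image with a stand-in dictionary, hard nearest-atom encoding, and 2×2 sum pooling (polarity splitting omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))        # toy single-channel image
p, k = 4, 6                            # patch size and dictionary size
D = rng.normal(size=(k, p * p))        # stands in for a trained dictionary

# Step 2: generate all possible p x p patches of the image.
patches, coords = [], []
for i in range(image.shape[0] - p + 1):
    for j in range(image.shape[1] - p + 1):
        patches.append(image[i:i + p, j:j + p].ravel())
        coords.append((i, j))
patches = np.array(patches)

# Step 3: encode every patch; here a one-hot code from the nearest atom.
dists = ((patches[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
codes = np.eye(k)[dists.argmin(axis=1)]

# Step 5: 2 x 2 pooling -- sum the codes that fall in each image quadrant,
# then concatenate into the feature vector z of dimension k * 4.
half = (image.shape[0] - p + 1) / 2
pools = np.zeros((2, 2, k))
for (i, j), c in zip(coords, codes):
    pools[int(i >= half), int(j >= half)] += c
z = pools.ravel()
print(z.shape)  # (24,)
```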

Comments:

- Pooling zones are **overlapped**, so they carry redundant information about the image;
- Pooling is used because we would like to get an entire description of the image;
- m×m pooling is a trade-off between image diversity and data compression;
- Local receptive field selection reduces the dimension of the output data, and also improves k-means's performance;
- Train the dictionary always with samples; encode always with the full dataset.

## Optimization in Dictionary Learning

#### Methods of computing descent directions

– ‘sgd’: Stochastic Gradient Descent

– ‘sd’: Steepest Descent

– ‘csd’: Cyclic Steepest Descent

– ‘bb’: Barzilai and Borwein Gradient

– ‘cg’: Non-Linear Conjugate Gradient

– ‘scg’: Scaled Non-Linear Conjugate Gradient

– ‘pcg’: Preconditioned Non-Linear Conjugate Gradient

– ‘lbfgs’: Quasi-Newton with Limited-Memory BFGS Updating

– ‘newton0’: Hessian-Free Newton

– ‘pnewton0’: Preconditioned Hessian-Free Newton

– ‘qnewton’: Quasi-Newton Hessian approximation

– ‘mnewton’: Newton’s method with Hessian calculation after every user-specified number of iterations

– ‘newton’: Newton’s method with Hessian calculation every iteration

– ‘tensor’: Tensor

#### Line search strategies

The termination criterion is selected with the ‘LS’ option:

– 0: Backtrack w/ Step Size Halving

– 1: Backtrack w/ Quadratic/Cubic Interpolation from new function values

– 2: Backtrack w/ Cubic Interpolation from new function + gradient

– 3: Bracketing w/ Step Size Doubling and Bisection

– 4: Bracketing w/ Cubic Interpolation/Extrapolation with function + gradient values (default for all except ‘bb’ and ‘sd’)

– 5: Bracketing w/ Mixed Quadratic/Cubic Interpolation/Extrapolation

– 6: Use Matlab Optimization Toolbox’s line search

Above, the first three find a point satisfying the Armijo conditions, while the last four search for a point satisfying the Wolfe conditions.
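A minimal sketch of line-search strategy 0 above (backtracking with step-size halving until the Armijo sufficient-decrease condition holds), driving steepest descent (‘sd’) on a toy quadratic:

```python
import numpy as np

def armijo_backtrack(f, g, x, d, t=1.0, c1=1e-4):
    """Halve the step size until the Armijo condition is satisfied."""
    fx, gx = f(x), g(x)
    while f(x + t * d) > fx + c1 * t * (gx @ d):
        t *= 0.5
    return t

f = lambda x: 0.5 * x @ x      # simple convex quadratic
g = lambda x: x                # its gradient

x = np.array([3.0, -4.0])
for _ in range(50):
    d = -g(x)                              # steepest-descent direction
    x = x + armijo_backtrack(f, g, x, d) * d
# x converges to the minimizer [0, 0].
```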

#### Several strategies for choosing the initial step size

– 0: Always try an initial step length of 1 (default for all except ‘cg’ and ‘sd’), (t = 1)

– 1: Use a step similar to the previous step (default for ‘cg’ and ‘sd’), (t = t_old*min(2,g’d/g_old’d_old))

– 2: Quadratic Initialization using previous function value and new function value/gradient (use this if steps tend to be very long) ,(t = min(1,2*(f-f_old)/g))

– 3: The minimum between 1 and twice the previous step length, (t = min(1, 2*t))

– 4: The scaled conjugate gradient step length (may accelerate conjugate gradient methods, but requires a Hessian-vector product), (t = g’d/d’Hd)

## On Multi-class SVM Classification

The multi-class SVM classifier below is composed of several binary SVM classifiers. There are two ways to do this.

#### 1. One vs. All (One vs. Rest):

For a k-class classifier, k binary SVMs are trained, one on the training data of each class against the rest. To classify an example x_{i}, each SVM_{j} computes its decision value for x_{i}.

The SVM with the largest output gives the classification of input x_{i}.
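A sketch of the one-vs-all decision rule, with stand-in trained parameters W and b (one linear SVM per class):

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 3, 4
W = rng.normal(size=(k, dim))   # stand-in weight vector for each binary SVM
b = rng.normal(size=k)          # stand-in bias for each binary SVM

def classify(x):
    scores = W @ x + b              # output of each class-vs-rest SVM
    return int(np.argmax(scores))   # the largest output wins
```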

#### 2. One vs. One:

For a k-class classifier, k(k−1)/2 binary SVMs are trained, one on the training data of each pair of classes (m, n). To classify an example x_{i}, each SVM_{mn} computes its decision value for x_{i}.

Then x_{i}'s weight on class m is increased if SVM_{mn}'s output favors m, and vice versa. After all pairwise SVM computations, the class with the largest weight is the classification of input x_{i}.
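A sketch of the one-vs-one voting scheme, again with stand-in pairwise SVM parameters; for k = 4 classes it builds k(k−1)/2 = 6 pairwise classifiers:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
k, dim = 4, 3
pairs = list(combinations(range(k), 2))   # one binary SVM per class pair
params = {p: (rng.normal(size=dim), rng.normal()) for p in pairs}

def classify(x):
    votes = np.zeros(k, dtype=int)
    for (m, n), (w, b) in params.items():
        votes[m if w @ x + b > 0 else n] += 1   # SVM_mn favors m or n
    return int(votes.argmax())                  # most votes wins

print(len(pairs))  # 6
```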

#### 3. Comment

OVO is better than OVR in performance, but has higher computational complexity since it trains more SVMs. Moreover, the scarcity of data in a particular class is also an issue for the OVO method.
