Python Basic

The ‘import’ Command

  • from functions import *

a_method defined in can be used as a_method in current space

  • import functions as fc

a_method defined in can be used as fc.a_method in current space

  • import function

a_method defined in can be used as function.a_method in current space

On Pylearn2

Block figure of pylearn 2

  • Download dataset: download dataset to ${PYLEARN2_DATA_PATH}/, name the dataset according to conventions in pylearn2/datasets/
  • Modify point to above name convention; set output dataset_name_pkl folder
  • make_dataset:
  • Construct yaml file: with above dataset_name.pkl; modify related parameters…

On Image Processing

Contrast Normalization

patches = bsxfun(@rdivide, bsxfun(@minus, patches, mean(patches,2)), sqrt(var(patches,[],2)+10));

Here +10 is to avoid some extremely large number after the normalization.


We have used PCA to reduce the dimension of the data. There is a closely related preprocessing step called whitening (or, in some other literatures, sphering) which is needed for some algorithms. If we are training on images, the raw input is redundant, since adjacent pixel values are highly correlated. The goal of whitening is to make the input less redundant; more formally, our desiderata are that our learning algorithms sees a training input where (i) the features are less correlated with each other, and (ii) the features all have the same variance.

Dictionary Learning and Feature Extraction for One Image

DL is only done on the (randomly) selected patches in the dataset.

1. Randomly select patches from dataset.

2. Normalization and whitening

3. Train dictionary with processed patches.

Encoding is done for all the training images, e.g., all images in cifar-10 dataset.

Generally speaking, the feature extraction for one image is composed of following steps:

1. Select one image from the dataset;

2. Generate all possible patches of the image, in all channels (R,G,B); Normalize and whitening;

3. Encode every possible patch within the image with the trained dictionary with specialized coding methods (dim n-> k);

4. Polarity splitting the patch code (optional) (dim * 2);

5. m*m Pooling within the image, to construct an general representation of the image with only extending the dimension of a patch by m times  (methods include max pooling, average pooling, etc.)  (dim k-> k*m)

6. Output the feature vector z representing the image;

[Go Deep]

7. Regard the feature vectors,  z, of the random samples as the input of next layer ( (k * m) * num_sample ); however, the dimension of each vector is too large, requiring exponential numbers of examples to train a dictionary in K-means. So we have to shrink the output dimension.

8.  Local receptive field selection: (Within a certain area of the image), select T most strongly correlated dimensions to z_{i} for the dimensions of z, to form one local receptive fields (T * num_sample). Above process can be reran for N times to form N such fields. Select corresponding features from the first layer output from each sample to form the inputs of the second layer.

9. Train each receptive field respectively, to get respective dictionary (* k2).

10. Repeat step 3 for each dictionary.

11. Finally concatenate output codes from step 10.


  • Redundant information of the image, since pooling zones are overlapped;
  • Pooling since we would like to get an entire description of the image
  • M*M pooling is a trade-off between image diversity vs. data compression
  • local receptive field selection to reduce the dimension of output data, and also to improve k-means’s performance
  • Train dictionary always with samples; encode always with the dataset

Optimization in Dictionary Learning

Methods of computing descent directions

– ‘sgd’: Stochastic Gradient Descent
– ‘sd’: Steepest Descent
– ‘csd’: Cyclic Steepest Descent
– ‘bb’: Barzilai and Borwein Gradient
– ‘cg’: Non-Linear Conjugate Gradient
– ‘scg’: Scaled Non-Linear Conjugate Gradient
– ‘pcg’: Preconditionined Non-Linear Conjugate Gradient
– ‘lbfgs’: Quasi-Newton with Limited-Memory BFGS Updating
– ‘newton0’: Hessian-Free Newton
– ‘pnewton0’: Preconditioned Hessian-Free Newton
– ‘qnewton’: Quasi-Newton Hessian approximation
– ‘mnewton’: Newton’s method with Hessian calculation after every
– ‘newton’: Newton’s method with Hessian calculation every iteration
– ‘tensor’: Tensor

Line search strategies

the termination criteria (‘LS’):
– 0: Backtrack w/ Step Size Halving
– 1: Backtrack w/ Quadratic/Cubic Interpolation from new function values
– 2: Backtrack w/ Cubic Interpolation from new function + gradient
– 3: Bracketing w/ Step Size Doubling and Bisection
– 4: Bracketing w/ Cubic Interpolation/Extrapolation with function + gradient values (default for all except ‘bb’ and ‘sd’)
– 5: Bracketing w/ Mixed Quadratic/Cubic Interpolation/Extrapolation
– 6: Use Matlab Optimization Toolbox’s line search

Above, the first three find a point satisfying the Armijo conditions, while the last four search for find a point satisfying the Wolfe conditions.

Several strategies for choosing the initial step size

– 0: Always try an initial step length of 1 (default for all except ‘cg’ and ‘sd’), (t = 1)
– 1: Use a step similar to the previous step (default for ‘cg’ and ‘sd’), (t = t_old*min(2,g’d/g_old’d_old))
– 2: Quadratic Initialization using previous function value and new function value/gradient (use this if steps tend to be very long) ,(t = min(1,2*(f-f_old)/g))
– 3: The minimum between 1 and twice the previous step length, (t = min(1,2*t)
– 4: The scaled conjugate gradient step length (may accelerate conjugate gradient methods, but requires a Hessian-vector product), (t = g’d/d’Hd)

On Muti-class SVM Classification

Below mulit-class SVM classifier is composed of several binary-class SVM classifier. There are two ways to do.

1. One vs. All (One vs. Rest):

For a k-class classifier, k b-SVMs are trained with respect to the training data of class ‘k’ against rest. To distinguish each example x_{i}, SVM_{j} computes

y_{i}(w_{j}^{T}x_{i} + b_{j})

The SVM with the largest output is the classification of input x_{i}.

2. One vs. One:

For a k-class classifier, \frac{k*(k-1)}{2} b-SVMs are trained with respect to training data between any two classes, m,n \in K . To distinguish example x_{i}, SVM_{mn} computes

y_{i}(w_{mn}^{T}x_{i} + b_{mn})

Then x_{i}’s weigh on class m is increased if SVM_{m&n}’s output favors m, and vise versa. After all-round SVM computations, the class with the largest weight is the classification of input x_{i}.


OVO is better than OVR in performance, but more computation complexity since it trains more SVMs. Moreover, the scarcity of data in a particular class is also an issue of OVO method.



[1] A. Coates and A. Ng, “Selecting receptive fields in deep networks,” in Advances in Neural Information Processing Systems, 2011, pp. 2528–2536.
[2] A. Coates and A. Y. Ng, “Learning feature representations with k-means,” in Neural Networks: Tricks of the Trade, 2012, pp. 561–580.
[3] A. Coates and A. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 921–928.