Dictionary learning is done on patches randomly selected from the dataset:

1. Randomly select patches from the dataset.

2. Normalize and whiten each patch.

3. Train a dictionary on the processed patches.
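The three steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the exact procedure behind these notes: it assumes ZCA whitening and plain k-means as the dictionary learner, and the helper names (`sample_patches`, `normalize`, `fit_zca`, `kmeans_dictionary`), the patch size, the regularization constants, and the random stand-in images are all hypothetical choices.

```python
import numpy as np

def sample_patches(images, patch_size, num_patches, rng):
    """Cut random square patches from images of shape (N, H, W, C) and flatten them."""
    n, h, w, _ = images.shape
    idx = rng.integers(0, n, num_patches)
    rows = rng.integers(0, h - patch_size + 1, num_patches)
    cols = rng.integers(0, w - patch_size + 1, num_patches)
    patches = [images[i, r:r + patch_size, c:c + patch_size].ravel()
               for i, r, c in zip(idx, rows, cols)]
    return np.asarray(patches, dtype=np.float64)

def normalize(patches, eps=10.0):
    """Per-patch brightness and contrast normalization (eps is an assumed constant)."""
    centered = patches - patches.mean(axis=1, keepdims=True)
    return centered / np.sqrt(patches.var(axis=1, keepdims=True) + eps)

def fit_zca(patches, eps=0.1):
    """Estimate the mean and the ZCA whitening matrix from the patches."""
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitener

def kmeans_dictionary(data, k, num_iters, rng):
    """Plain Lloyd k-means; the k centroids serve as dictionary atoms."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2
        d2 = ((data ** 2).sum(axis=1)[:, None]
              - 2.0 * data @ centroids.T
              + (centroids ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:          # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
images = rng.random((20, 32, 32, 3)) * 255.0      # stand-in for a real dataset
patches = sample_patches(images, patch_size=6, num_patches=2000, rng=rng)  # step 1
patches = normalize(patches)                                               # step 2
mean, whitener = fit_zca(patches)
white = (patches - mean) @ whitener
dictionary = kmeans_dictionary(white, k=16, num_iters=10, rng=rng)         # step 3
```

The `mean` and `whitener` fitted here would need to be kept, since the same transform has to be applied to patches at encoding time.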

Encoding is done for all the training images:

1. Select one image from the dataset;

2. Extract all possible patches of the image across all channels (R, G, B); normalize and whiten them;

3. Encode every patch of the image with the trained dictionary, using the chosen coding method (dim n -> k);

4. Optionally split the patch code by polarity (dim * 2);

5. Apply m*m pooling within the image to build a general representation of the image while growing the per-patch code dimension by only a factor of m (methods include max pooling, average pooling, etc.) (dim k -> k*m);

6. Output the feature vector z representing the image;
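The encoding steps above might look like the sketch below. The notes do not name the coding method, so a soft-threshold coder is assumed here purely for illustration; the pooling grid is assumed to be g by g, which gives an output of 2*k*g*g entries with polarity splitting (under this reading the factor m in the notes would correspond to the number of pooling regions). All function names, the identity whitening stand-in, and the random dictionary are hypothetical.

```python
import numpy as np

def extract_all_patches(image, patch_size):
    """Return every patch (stride 1) of an (H, W, C) image, flattened, plus the grid shape."""
    h, w, _ = image.shape
    rows, cols = h - patch_size + 1, w - patch_size + 1
    patches = np.array([image[r:r + patch_size, c:c + patch_size].ravel()
                        for r in range(rows) for c in range(cols)], dtype=np.float64)
    return patches, (rows, cols)

def preprocess(patches, mean, whitener, eps=1e-8):
    """Per-patch normalization followed by the whitening fitted during dictionary learning."""
    centered = patches - patches.mean(axis=1, keepdims=True)
    normed = centered / np.sqrt(patches.var(axis=1, keepdims=True) + eps)
    return (normed - mean) @ whitener

def soft_threshold_encode(patches, dictionary, alpha=0.25):
    """Code each patch as max(0, D x - alpha); an assumed coder (dim n -> k)."""
    return np.maximum(0.0, patches @ dictionary.T - alpha)

def polarity_split(patches, dictionary, alpha=0.25):
    """Keep positive and negative responses as separate features (dim k -> 2k)."""
    return np.hstack([soft_threshold_encode(patches, dictionary, alpha),
                      soft_threshold_encode(-patches, dictionary, alpha)])

def grid_pool(codes, grid_shape, g, pool_fn=np.max):
    """Pool patch codes over a g x g grid of image regions and concatenate the results."""
    rows, cols = grid_shape
    codes = codes.reshape(rows, cols, -1)
    row_edges = np.linspace(0, rows, g + 1).astype(int)
    col_edges = np.linspace(0, cols, g + 1).astype(int)
    pooled = [pool_fn(codes[row_edges[i]:row_edges[i + 1],
                            col_edges[j]:col_edges[j + 1]], axis=(0, 1))
              for i in range(g) for j in range(g)]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                      # step 1: stand-in image
n = 6 * 6 * 3
dictionary = rng.standard_normal((16, n))            # stand-in for a trained dictionary
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
mean, whitener = np.zeros(n), np.eye(n)              # stand-ins for the fitted whitening

patches, grid_shape = extract_all_patches(image, patch_size=6)   # step 2
patches = preprocess(patches, mean, whitener)
codes = polarity_split(patches, dictionary)                      # steps 3 and 4
z = grid_pool(codes, grid_shape, g=2)                            # steps 5 and 6
```

Passing `pool_fn=np.mean` switches the same sketch from max pooling to average pooling.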

[Go Deep]

7. Treat the feature vectors z of the random samples as the input of the next layer ((k * m) * num_sample). However, the dimension of each vector is too large: K-means would need a number of examples that grows exponentially with the dimension to train a dictionary, so we have to shrink the output dimension.

8. Local receptive field selection: (within a certain area of the image) pick a dimension z_{i} of z and select the T dimensions most strongly correlated with it to form one local receptive field (T * num_sample). This process can be rerun several times to form N such fields. For each sample, select the corresponding features from the first-layer output to form the inputs of the second layer.

9. Train on each receptive field separately to get its own dictionary (* k2).

10. Repeat dictionary learning from step 3 until all dictionaries are learned;

11. Concatenate the output codes for an image.
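A rough sketch of steps 8 to 11 follows. It assumes absolute Pearson correlation as the similarity measure, randomly chosen seed dimensions, plain k-means per field, and a triangle-style k-means code for the second layer; none of these choices are stated in the notes, and the spatial restriction "within a certain area of the image" is left out. The array `z` is stored as (num_sample, dimension), the transpose of the layout written above, and all names and sizes are hypothetical.

```python
import numpy as np

def select_receptive_field(corr, seed, t):
    """Indices of the t dimensions most correlated (in absolute value) with `seed`."""
    strength = np.nan_to_num(np.abs(corr[seed]))   # constant dimensions give NaN
    strength[seed] = np.inf                        # always keep the seed itself
    return np.argsort(strength)[::-1][:t]

def kmeans(data, k, num_iters, rng):
    """Plain Lloyd k-means returning k centroids."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(num_iters):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:                   # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def encode(x, centroids):
    """Triangle-style code: max(0, mean distance - distance to each centroid)."""
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return np.maximum(0.0, dist.mean(axis=1, keepdims=True) - dist)

rng = np.random.default_rng(0)
num_sample, d = 500, 64
z = rng.standard_normal((num_sample, d))           # stand-in for first-layer outputs
z[:, 1] = z[:, 0] + 0.1 * rng.standard_normal(num_sample)   # make dim 1 track dim 0

T, N, k2 = 8, 4, 5
corr = np.corrcoef(z, rowvar=False)                # (d, d) correlations between dimensions
seeds = rng.choice(d, size=N, replace=False)
fields = [select_receptive_field(corr, s, T) for s in seeds]        # step 8
dictionaries = [kmeans(z[:, f], k2, 10, rng) for f in fields]       # steps 9 and 10
second_layer = np.hstack([encode(z[:, f], c)                        # step 11
                          for f, c in zip(fields, dictionaries)])
```

Each dictionary here only ever sees T dimensions, which is what keeps k-means tractable compared with training on the full vector z.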


  • The image representation carries redundant information, since pooling zones overlap;
  • Pooling is used because we want a description of the entire image;
  • m*m pooling is a trade-off between image diversity and data compression;
  • Local receptive field selection reduces the dimension of the output data and also improves k-means performance;
  • Always train the dictionary on samples and always encode the whole dataset; every code refers to an image.