Recently I have been into the investigation of current dictionary learning and encoding. Both dictionary learning and encoding are core components in the feature learning, since they construct the basis of the features and the coefficients of the original input respectively.

Dictionary is a set of vectors that span a new/sub space of the original example, with which we can see the examples from an alternative angle, and get a more comapct/sparse form of the signal. In computer vison, people extract the features of an image by building dictionaries.

However, early work on dictionary learning is like orthogonal transformation or orthogonal dictionary construction, which is limited by the dimension of the signals. But later when people find   sparse coding on overcomplete dictionaries can better represent the natural signal and easy to process, the sparse coding gradually become the focus of research.

In early days, dictionaries are built in a hand-designed way, which would cost time and effort and is not adaptable general cases. To solve the problem, it is proposed to learn the dictionaries. That is the beginning of dictionary learning and variations of DL methods emerged as follows.

I. State-of-art DL Methods:

1. Sparse Coding (with L1 norm)

\min_{D,s(i)}~ \sum_{i} ||D s(i) - x(i)||_{2}^{2} + \lambda ||s(i)||_{1}

s.t.~ ||D(j)||_{2}^{2} = 1, \forall j

Sparsity: L1 norm of s(i)

Solver: Coordinate Descent Algorithm.

2. OMP-k (or MOD)

\min_{D,s(i)}~ \sum_{i} ||D s(i) - x(i)||_{2}^{2}

s.t.~ ||D(j)||_{2}^{2} = 1, \forall j

s.t.~ ||s(i)||_{0} \leq k, \forall i

Sparsity: constrain k

Solver: Greedy selects s(i) first and then find out D with s.

P.S: OMP-1 is the same form of K-means, with

k = \arg\max_{j}|D(j)^{T}x(i)|

and then set s_{k}(i) = D (k)^{T}x(i)

3 PCA : transfer from linear dependent to linear independent 

\max_{w_{i}} (w_{i}X)^{T} (Xw_{i})

w_{i}^{T}w_{i} = 1

Sparsity:  This is a dimension-reduction method and thus the sparsity is decreased; but the G-PCA, an extension of PCA can achieve the sparsity.

Solver: SVD

P.S: Above optimization is to find out the vector w_{i} that max the projection from sample set X to vector w_{i}. The constrain means the vector must be normalized.  A similar algo is ICA. Also, one my consider Generalized PCA another extension of PCA that constructs an overcomplete dictionary with sparsity.

4. (Sparse) RBM (Restricted Boltzman Machine)

RBM is an example of generative model in machine learning, which see examples as a series of i.i.d variables sampled from an unknown distribution. The goal of a generative model in unsupervised learning is to learn the most “accurate” united distribution of X, i.e, given the parameters of the distribution \theta,

\max L(X|\theta) = \sum_{x} P(x_{i}|\theta)

So in fact the RBM works in the ML way.

The method that RBM utilizes to solve the problem is the Markov Random Field (MRF) theory. In MRF, variables are seen as independent vertexes in an un-directed graph and their relationship between are denoted by the edges.


Here the V are the examples and H are hidden nodes in the neural network. Then RBM constructs a two-layer neuron network with the visible and hidden layer and the parameters of the distribution play as the weights between the nodes.

By utilizing the factorization property of MRF, above optimization problem can be expressed as a series of products of conditional probability, and the parameters can be further solved. Then we can get the new expression of the examples  from  the hidden layer.

Sparsity: Sparse RBM

Solver: Original problem is hard to solve; use approximation via gradient descent methods.

P.S: Values of classic RBM is binary; however, there are extension of RBM with real values

5. Sparse AutoEncoder (SAE)

In the unsupervised learning, the SAE tries to recover the original examples as accurate as possible, by learning a neural network, i.e.,

h(x_{i}) \approx x_{i}

The network can be composed of multiple layers. Since the output is as accurate as the input, we can assert that the weights and notes of each layer of the neural network is a good representation of the original signal. Moreover, the sparsity constrain can be introduced by imposing an extra item in the optimization objective. For example, in a layer the optimization problem can be formulated as follows,

\min J_{sparse}(w,b) = J(w,b) + \beta \sum KL(\rho_{j} || \hat \rho_{j})

where KL is a function with sparse constrain.

Sparsity:  KL function

Solver: Backward Propagation

6.Parametric Training Methods Family: 

A large family of methods …

II. State-of-art Encoding Methods:

The issue of encoding comes from accommodating  different demands for the ‘codes’, say the sparsity. However, to get such rewards something needs to be traded-off. In most cases, ‘something’ is to what extent the code reflects the original signal accurately. Accepting this, we can mix up the dictionary learning and coding methods as much as we can. But the trade-off between approximation and sparsity must be tolerated.

State-of-art encoding methods include:

1. Sparse coding

2. OMP-K

3. Hard/soft threshold  …

III Summary:

Sparse Coding : Approximated Expression,  Depict signals from the natural angle,  Overcomplete Dictionary

Traditional Coding: Hi-fi Expression, Depict signals from the Math angle, Orthogonal/compact Dictionary


[1] L. Bo, X. Ren, and D. Fox, “Multipath sparse coding using hierarchical matching pursuit,” in NIPS workshop on deep learning, 2012.
[2] A. Coates and A. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 921–928.
[3] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[4] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: transfer learning from unlabeled data,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 759–766.
[5] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for Sparse Representation Modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, 2010.
[6] I. Tosic and P. Frossard, “Dictionary Learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, 2011.