Recently the concept of deep learning has been spreading throughout academia and industry. Indeed, deep learning has been shown to dramatically advance the state of the art in almost all areas of AI, including computer vision, natural language processing, and audio processing. As a result, industrial prototypes built on the idea of deep learning are coming into play, such as the one reported in [1]. Meanwhile, the questions of how and why it works dominate the research area, and scientists have proposed various hypotheses. In this post I will give a brief review of these questions, with a special focus on Y. Bengio’s research.

The Limitation of Shallow Structure

Shallow structure refers to most state-of-the-art machine learning methods, where the output of a learning scheme is fed directly into a classifier, so the system has only one layer (a flat model). Generally, this scheme works well and is easy to implement. But challenges arise when such a structure is used to learn highly varying functions.

In such cases, the shallow structure hits ceilings inherent in its design. [3] calls them the depth and locality issues. However, I see them as the same problem viewed from two perspectives.

Depth in Computing Elements

Experiments have shown that when dealing with highly varying functions, a shallow structure cannot represent the function well even when the dataset is large. During the learning process, a shallow model may fall into sub-optimal local minima due to overfitting, a state where the model is not fully optimized. The sad news is that in most such cases better solutions actually exist. They are not reached because the number of parameters to be learnt is much larger than the size of the dataset (overfitting). Therefore, even more samples are required to tune such a model, and in experiments the already huge dataset has to keep growing as the target training accuracy rises.

However, research on the depth of the network offers a promising way to tackle this problem. It is inspired by comparative research on the number of computing elements required in a logic circuit (or in the neural network structure of the human brain). In circuit theory, it has been proven that, for a given set of operations, the number of operations (logic gates) needed is closely related to the depth of the circuit. For example, increasing the number of layers can decrease the number of logic gates (operations). This is easy to understand, because later layers reuse the results from earlier layers. Fortunately, similar results have also been observed in machine learning algorithms. When layers are stacked in the learning structure, the number of parameters needed is much reduced. As a result, increasing the depth of the algorithm points the way to learning highly varying functions.
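A standard illustration of this trade-off is the n-bit parity function. The sketch below is my own illustration, not code from the cited papers, and the function names are made up for the example. It counts the AND terms in the canonical two-level (OR of ANDs) form of parity, which needs one term per input pattern with an odd number of ones, and compares that with a chain of two-input XOR gates whose depth grows with n.

```python
from itertools import product

def parity(bits):
    """Return 1 if an odd number of bits are set, else 0."""
    return sum(bits) % 2

def shallow_term_count(n):
    """Count AND terms in the canonical OR-of-ANDs form of n-bit parity.

    One AND term is needed for every input pattern on which parity is 1.
    """
    return sum(parity(bits) for bits in product((0, 1), repeat=n))

def deep_gate_count(n):
    """Count two-input XOR gates in a chain that computes n-bit parity."""
    return n - 1

def deep_parity(bits):
    """Evaluate parity with the XOR chain; each gate reuses the previous result."""
    acc = bits[0]
    for b in bits[1:]:
        acc ^= b
    return acc

for n in (2, 4, 8, 12):
    print(n, shallow_term_count(n), deep_gate_count(n))
```

The first count doubles with every extra input bit (2^(n-1) terms), while the second grows by one gate per bit. That is the reuse of earlier results that the circuit argument relies on.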

Local Representation

Another limitation of the shallow structure is locality [3]. Traditionally, a signal is represented by a set of orthogonal basis vectors in the space. If we call each of these vectors a local feature, then a general input signal within that space is some linear combination of the basis vectors (local features). I think the reason we call it “local” is that each feature is orthogonal to the others and responds only to the projection of the input signal onto its own dimension.

While this type of representation is clear and concise for mathematical computation, it may not work well in machine learning tasks, especially when the number of input dimensions is unknown.

Shallow structures are generally based on the assumption that the learned function is smooth (as in SVM, kNN, k-means…). Generally, the system makes a prediction by comparing the unknown input with nearby counterparts in the training set. This locality, as pointed out in [2], makes the prediction worse when the function varies quickly and unexpectedly in a region where few training samples lie in the neighborhood of the point being predicted.

It is like the under-sampling problem in signal processing, if we regard the function as an unknown signal. In order to reconstruct a signal from its samples, the sample rate must be at least twice the highest frequency in the signal; this minimum sample rate is called the Nyquist rate. Otherwise, key sample points may be missed and the reconstruction may be wrong, as in the sine function example from [2]. This problem can also be overcome by a deep learning structure, through distributed representation, which I will talk about in a later section.
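The under-sampling analogy can be checked numerically. The following sketch uses frequencies and a sample rate that I picked purely for illustration (a 9 Hz sine sampled at 10 Hz, well below its Nyquist rate of 18 Hz); it is not the example from [2]. At this rate the samples coincide with those of a 1 Hz sine of opposite sign, so the two signals cannot be told apart from the samples alone.

```python
import math

def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = k / rate for k = 0 .. n_samples - 1."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n_samples)]

RATE_HZ = 10.0  # assumed sample rate, below the Nyquist rate for a 9 Hz sine
fast = sample_sine(9.0, RATE_HZ, 20)
slow = sample_sine(1.0, RATE_HZ, 20)

# sin(2*pi*9*k/10) = sin(2*pi*k - 2*pi*k/10) = -sin(2*pi*k/10),
# so fast + slow should vanish at every sample point (up to rounding).
max_gap = max(abs(f + s) for f, s in zip(fast, slow))
print(f"largest |fast + slow| over the samples: {max_gap:.2e}")
```

A learner that only sees these samples is in the same position as a local predictor with too few training points in a fast-varying region: the slow explanation fits the data just as well as the true one.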

The Deep Structure

The deep structure also grew out of research in the 1970s into how the cortex in our brain works. Specifically, researchers found that the cortex works much like a deep learning network does: it combines lower-level information into higher-level abstract information. Inspired by this concept, researchers proposed early versions of deep neural networks. In the following years, these were shown theoretically to be capable of overcoming the depth and locality challenges.

Distributed Representation

Scientists are also working on ways to overcome local representation in shallow models. A great advance that comes with the deep structure is the ‘distributed representation’ [6]. Instead of using a set of orthogonal basis vectors to represent signals, a distributed representation uses a group of non-orthogonal feature vectors. Normally, each of these feature vectors is learnt as a linear combination of orthogonal basis vectors [6]. So one advantage of using feature vectors is the reduction in the “fleet size” of vectors. That is, we use a smaller set of vectors to represent the same information, where each vector in the small set is a linear combination of several from the big set. Learnt features are reported to strike a better balance between ‘sparsity’ and ‘distributedness’. How? Take the signal example again. First of all, we hope the sample points are distributed evenly (sparsely) enough to reflect the underlying regularities of the signal, like the ‘one-hot’ representation of signals in vector space, where N dimensions need O(N) samples/parameters. With a distributed representation, instead, only O(log2 N) samples/parameters are needed to cover those regions. As a result, free from the locality issue, the representation is able to learn the underlying regularities of a signal.
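To make the O(N) versus O(log2 N) contrast concrete, here is a small sketch under simplifying assumptions of my own: regions are just integer labels, the local code is one-hot, and the distributed code is a plain binary code in which several units can be active at once. Real learnt features are not binary codes; the point is only the number of units needed to tell the regions apart.

```python
import math

def one_hot(region, n_regions):
    """Local code: one unit per region, exactly one unit active."""
    return [1 if i == region else 0 for i in range(n_regions)]

def distributed(region, n_regions):
    """Distributed code: binary digits of the region index, several units may fire."""
    width = max(1, math.ceil(math.log2(n_regions)))
    return [(region >> i) & 1 for i in range(width)]

N_REGIONS = 1024  # assumed number of input regions to tell apart
local_codes = {tuple(one_hot(r, N_REGIONS)) for r in range(N_REGIONS)}
dist_codes = {tuple(distributed(r, N_REGIONS)) for r in range(N_REGIONS)}

# Both codes give every region a distinct pattern,
# but they need very different numbers of units to do so.
print(len(one_hot(0, N_REGIONS)), len(distributed(0, N_REGIONS)))
```

Both sets contain one distinct pattern per region, yet the one-hot code spends one unit per region while the binary code gets by with log2 N units.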

Take the sine example. A distributed representation can learn the cyclic pattern of the sine function by activating several parameters simultaneously, even though they are not adjacent. A sparse representation, by contrast, only catches the troughs and crests of the sine wave.

On the other hand, to fully recover the diversity of natural signals, the representation should be sufficiently sparse [7]. So sparse codes are the output at each layer of deep learning. But they are later recombined into one feature vector, which is another mechanism for generating distributed representations in deep learning.

The Breakthrough of Deep Structure Training

Although …, training such a complex network was a nightmare for computer scientists at that time. They had difficulties with algorithm design, computational complexity and data processing. But in recent years, as a result of the boost in computing power and the revolutionary algorithm proposed in [4], such barriers are vanishing and the full picture of deep learning is being revealed. The training of deep networks had been a challenging task until Hinton proposed a solution: adding an unsupervised learning stage on the raw data at each layer, using a restricted Boltzmann machine (RBM) [4]. Generally speaking, the proposed training is composed of two steps: a greedy unsupervised pre-training of each layer on its input data, followed by supervised training to tune the classifier parameters. Hinton’s scheme has been such a great success that it opened a door for us to recognize the power of deep networks. The experimental results were astonishing, beating most state-of-the-art algorithms in a range of machine learning applications such as computer vision, NLP, audio processing and so on.
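The greedy, layer-by-layer part of this recipe can be sketched in a few lines. To keep the example short I swap in tiny autoencoders for the RBMs used in [4], leave out the supervised fine-tuning step entirely, and use made-up toy data, layer sizes and hyperparameters. So this is only a rough sketch of the idea under those assumptions, not Hinton’s algorithm.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(W, x):
    """Hidden code of one layer: sigmoid(W x)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def train_autoencoder(data, n_hidden, epochs=50, lr=0.1, rng=None):
    """Unsupervised training of one layer: reconstruct the input from its code."""
    rng = rng or random.Random(0)
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]  # encoder
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]  # decoder
    for _ in range(epochs):
        for x in data:
            h = encode(W, x)
            recon = [sum(v * hj for v, hj in zip(row, h)) for row in V]
            err = [r - xi for r, xi in zip(recon, x)]
            # Backpropagate the squared reconstruction error (SGD, one sample at a time).
            dh = [sum(V[i][j] * err[i] for i in range(n_in)) for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    V[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                dz = dh[j] * h[j] * (1.0 - h[j])
                for i in range(n_in):
                    W[j][i] -= lr * dz * x[i]
    return W  # only the encoder is kept for stacking

def greedy_pretrain(data, layer_sizes):
    """Train layers one at a time; each layer sees the codes of the layer below."""
    weights, inputs = [], data
    for n_hidden in layer_sizes:
        W = train_autoencoder(inputs, n_hidden)
        weights.append(W)
        inputs = [encode(W, x) for x in inputs]
    return weights, inputs

rng = random.Random(1)
toy_data = [[rng.random() for _ in range(6)] for _ in range(20)]  # made-up inputs
weights, top_codes = greedy_pretrain(toy_data, layer_sizes=[4, 2])
print(len(weights), len(top_codes), len(top_codes[0]))
```

In a full pipeline the stacked encoder weights would serve as the starting point for the supervised stage, which then tunes the classifier on labelled data.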

Explanations of Hinton’s Method

Hinton’s research was quickly followed up by many researchers, with variations based on autoencoders (AE) and other deep machines [6]. So people are curious: why does pre-training work? Several explanations were later proposed, discussing the effect of unsupervised pre-training as a form of regularization [5]. But more in-depth explanations are needed.

Limitations of Deep Network

First, I would like to quote a comment by Bengio, one of the pioneers of deep learning research:

If you have enough prior knowledge on a problem you don’t need any learning, or it is enough to design the appropriate fully observed graphical model (you know what all the relevant random variables are and you either know their relationship or observe them in enough data), or just learn the “top layer”  (e.g. with a good feature space or a good similarity/kernel you can just plug your favorite SVM brand).

Another basic reason why I believe that artificial neural nets are not used for “everything AI” is simply that we are far from having captured from the cortex and from our ingenuity a learning algorithm as powerful as biological brains (in particular primates’, and to some extent mammals and birds) enjoy, i.e., more real fundamental research is needed!

I agree with Bengio’s points, especially the first one. In the field of computer vision, nobody can ignore how well deep learning performs against its counterparts, given the experimental results in generic object recognition and face recognition [6]. But the question is whether we need that, and whether we can afford the cost of DL. As far as I know, a DL experiment on the cluster in our lab normally takes 2-3 days to produce results, dramatically longer than any comparable method. But the reward is not as striking as the running time. Sometimes the test accuracy is only slightly better, or even worse, than that of shallow structures. That is especially true when the training data is not very large. For example,

CIFAR-10 Repro (30000)

                 Two Layers (3*3 pooling)   Single Layer
OMP1 with ST     k=1600                     k=1600
Train Acc.       97.796667%                 95.5%
Test Acc.        77.66%                     77.35%

The table above shows one of the reproduction experiment results for [6]. It indicates that when the input dataset shrinks, the gap in test accuracy between deep and shallow learning gets smaller, only about 0.3 percentage points. However, considering that the training time of the former is much longer than that of the latter, one may doubt the cost-effectiveness of a deep learning task. Therefore, the expected gain on the test samples has not been observed, due to practical limitations, and much more fundamental research should be done to address them.

Intuitively speaking, the fact that DL works in a way similar to the human brain makes it reasonable to expect it to excel as a human brain does. However, practical limitations such as computing power get in the way of DL’s evolution. Although the age of big data makes data more accessible to researchers, the limited processing power of current computers makes the task far more difficult than we thought. As mentioned by Bengio, small-data applications, where shallow structures currently work, should avoid stumbling into the land of DL. For big-data applications, there may be promising applications based on DL, but not yet.

Learning Good Representations


  1. Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” arXiv preprint arXiv:1112.6209, 2011.
  2. Y. Bengio, “Learning Deep Architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  3. Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  4. G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  5. D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why Does Unsupervised Pre-training Help Deep Learning?,” J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010.
  6. A. Coates and A. Ng, “Selecting receptive fields in deep networks,” in Advances in Neural Information Processing Systems, 2011, pp. 2528–2536.
  7. I. Tosic and P. Frossard, “Dictionary Learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, 2011.