Deep Learning
Mechanism
I'm perplexed by one theoretical problem in machine learning: why deep neural networks? Namely, why does adding multiple hidden layers work better than just flattening them into a single layer? I'm not asking whether it is better to add layers; I'm asking what the mechanism is, the reason that adding layers helps.
Theoretically speaking, what machine learning does is simply find a function that "fits the data optimally". In that sense it is no different from, say, regression (though we also have classification problems etc.), so on the face of it adding hidden layers shouldn't matter: the computational process itself has nothing to do with the outcome. Better still, a result known as the universal approximation theorem (https://en.wikipedia.org/wiki/Universal_approximation_theorem) plainly says that one hidden layer is enough.
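To make that concrete, here is a minimal sketch (my own illustration, not taken from any of the sources quoted below): a single hidden layer with enough tanh units fits a smooth 1-D target just fine, in the spirit of the universal approximation theorem. Only the output weights are trained here (random-feature least squares), which keeps the example tiny.

```python
# Sketch: one hidden layer is enough to fit a smooth 1-D function.
# Hidden weights are random; only the output weights are fit by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]        # inputs
y = np.sin(2 * x) + 0.5 * x                 # target function to fit

H = 200                                     # number of hidden units
W, b = rng.normal(size=(1, H)), rng.normal(size=H)
phi = np.tanh(x @ W + b)                    # hidden activations: one layer

w_out, *_ = np.linalg.lstsq(phi, y, rcond=None)   # fit output weights
y_hat = phi @ w_out

print("max abs error:", np.abs(y_hat - y).max())  # should be small
```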
So if deep networks work better, it must be because they allow a more efficient way of solving learning problems.
A first guess is that more layers mean more abstraction, and hence irrelevant information is "washed away". This is also the most common answer, for example: "The advantage of multiple layers is that they can learn features at various levels of abstraction."
But why? What is "abstraction" here, and how does this multi-layer abstraction actually work? There's literally no explanation, just heuristics. This is what makes me mad about machine learning: all heuristics, and people aren't interested in the interesting problems at all.
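For what it's worth, here is one concrete, provable sense in which stacking helps — a toy sketch of my own, about expressivity rather than about "abstraction" itself: composing a tiny two-ReLU "hat" block with itself k times produces a sawtooth with 2^(k-1) peaks, i.e. exponentially many linear pieces, from only 2k units in total, whereas a single hidden layer needs on the order of 2^k units to produce the same function (a Telgarsky-style depth-separation argument).

```python
# Toy sketch: composition buys exponentially many linear pieces.
import numpy as np

def hat(x):
    """One 2-unit ReLU block: the tent map 2*relu(x) - 4*relu(x - 0.5) on [0, 1]."""
    relu = lambda z: np.maximum(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

x = np.linspace(0.0, 1.0, 161)       # grid containing the breakpoints j/16
for depth in range(1, 5):
    y = x.copy()
    for _ in range(depth):           # a depth-`depth` network: the same block, stacked
        y = hat(y)
    peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))
    print(f"depth {depth}: {peaks} peaks, {2 * peaks} linear pieces")
# depth 4 gives 8 peaks / 16 pieces from only 8 ReLU units in total;
# a single hidden layer would need roughly 16 units for the same function.
```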
Maybe this: the layering naturally introduces a kind of weighting over the, say, parameters/data (whatever, I'm not familiar with the jargon yet), or simply over variables, so that relevant variables are amplified while irrelevant variables rapidly decay. Then we may say that a new layer of abstraction emerges whenever some of the variables become negligible (a toy numerical sketch of this guess follows the list below). This somehow reminds one of:
- the renormalization group: irrelevant operators are washed away as one approaches the critical point, which would correspond to the locus of meaning/categorization scheme proper to a learning process.
- bifurcations of the logistic map: in short, iterated dynamical systems and their bifurcations; this becomes especially manifest in classification problems.
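As a toy numerical reading of the guess above (mine, and very much a caricature, not an established mechanism): repeatedly applying even one fixed linear map to a cloud of inputs concentrates the variance into a few dominant directions, so after a few "layers" most directions carry almost nothing — at least one loose sense in which irrelevant variables get washed away.

```python
# Toy sketch of "relevant variables grow, irrelevant variables decay":
# one fixed linear map (no nonlinearity, no training), applied repeatedly,
# concentrates the batch variance into a few dominant directions.
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n)) / np.sqrt(n)      # one fixed "layer"
x = rng.normal(size=(n, 2000))                # a cloud of random inputs

for layer in range(1, 11):
    x = A @ x
    xc = x - x.mean(axis=1, keepdims=True)
    s = np.linalg.svd(xc, compute_uv=False)   # singular values of the batch
    top2 = (s[:2] ** 2).sum() / (s ** 2).sum()
    print(f"after {layer:2d} layers: top-2 directions carry {top2:.2f} of the variance")
# the share tends toward 1 as depth grows: most directions become negligible
```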
Freaking frustrating to search the web about this problem. Why the heck aren't they interested in this?
All of that is largely driven by trial and error and there is little understanding of what makes some things work so well and some other things not. Training deep networks is like a big bag of tricks. Successful tricks are usually rationalized post factum.
The actual status of the problem:
I think that the somewhat astonishing answer is that nobody really knows. There are some common explanations that I will briefly review below, but none of them has been convincingly demonstrated to be true, and one cannot even be sure that having many layers is really beneficial.
https://stats.stackexchange.com/questions/182734/what-is-the-difference-between-a-neural-network-and-a-deep-neural-network-and-w?noredirect=1