\label{sec:related}
\vspace{-2mm}
\subsection{Deep probabilistic generative models}
\vspace{-2mm}
Our approach is inspired by the recent introduction of generative deep learning models that can be
trained end-to-end using backpropagation. These include
generative adversarial networks \citep{denton2015deep, goodfellow2014generative}
as well as variational auto-encoder (VAE) models \citep{Kingma2014,kingma2014semi,rezende2014stochastic,burda2015importance},
which are most relevant to our setting.
Within the variational auto-encoder literature, our work is most comparable to
the DRAW network of~\cite{gregor2015draw}. Like our proposed model,
the DRAW network is a generative model of images in the variational auto-encoder framework
that decomposes image formation into multiple stages of additions to a canvas matrix. The DRAW paper assumes an LSTM-based generative model of these sequential
drawing actions, which is more general than our model.
In practice, these drawing actions appear to progressively sharpen an initially blurry region of the image.
In our work we also construct the image sequentially,
but each step is encouraged to correspond to a layer of the image,
similar to the layers one might have in typical photo-editing software.
This encourages our hidden stochastic variables to be interpretable,
which could potentially be useful for semi-supervised learning.
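For concreteness, the canvas recurrence in DRAW is purely additive: writing $c_t$ for the canvas after step $t$ and $h^{\mathrm{dec}}_t$ for the decoder state, \cite{gregor2015draw} use an update of the form
\begin{equation*}
c_t = c_{t-1} + \mathrm{write}(h^{\mathrm{dec}}_t),
\end{equation*}
so nothing constrains an individual write to correspond to a coherent object or layer; this is exactly the structure our per-step layering is meant to encourage. (The notation here is a paraphrase of the DRAW update, not an equation from our model.)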
\vspace{-2mm}
\subsection{Modeling transformations in neural networks}\vspace{-2mm}
\looseness -1 One of our major contributions is a model capable of separating the pose of an object from its appearance,
a classic problem in computer vision.
Here we highlight the most closely related works from
the deep learning community.
Many of these have been influenced by the transforming auto-encoders of \cite{hinton2011transforming},
in which pose is explicitly separated from content in an auto-encoder trained to predict (known)
small transformations of an image.
More recently, \cite{dosovitskiy2014learning} introduced a convolutional network that generates images of chairs with pose explicitly separated out, and \cite{cheung2014discovering} introduced an auto-encoder in which a subset of variables such as pose can be explicitly observed, while the remaining
variables are encouraged to explain orthogonal factors of variation.
Most relevant in this line of work is that of~\cite{kulkarni2015deep}, which, like us,
separates the content of an image from pose parameters using a variational auto-encoder.
In all of these works, however, there is an element of supervision: variables such as pose
and lighting are known at training time.
Our method,
which builds on the recently introduced
spatial transformer networks~\citep{jaderberg2015spatial},
is able to separate pose from content in a fully unsupervised setting
using standard off-the-shelf gradient methods.
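To make the warping primitive concrete, below is a minimal sketch of an affine spatial transformer in PyTorch; the class name, the single-linear-layer localization network, and the identity initialization are our own illustrative assumptions, not the architecture of \cite{jaderberg2015spatial} or of our model.
\begin{verbatim}
# Minimal affine spatial transformer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSpatialTransformer(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        # Hypothetical localization net: predicts 6 affine parameters.
        self.loc = nn.Linear(in_features, 6)
        # Start at the identity transform for stable training.
        self.loc.weight.data.zero_()
        self.loc.bias.data.copy_(
            torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        n, c, h, w = x.shape
        # (N, 2, 3) affine matrices, one per image in the batch.
        theta = self.loc(x.view(n, -1)).view(n, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Bilinear sampling; differentiable w.r.t. theta and x.
        return F.grid_sample(x, grid, align_corners=False)
\end{verbatim}
Because both the grid generation and the bilinear sampling are differentiable, gradients flow through the warp itself, which is what allows pose to be inferred with standard gradient methods rather than supervision.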
\eat{
Finally our method relies crucially on the recently introduced Spatial
Transformer Networks paper
~\citep{jaderberg2015spatial}, which introduced
a fully differentiable module that allows one to warp/rotate images based on some input spatial transformation.
To our knowledge, we are the first to apply this module in an unsupervised learning setting.
}
% Discovering Hidden Factors of Variation in Deep Networks \cite{cheung2014discovering},
% Chair paper \cite{dosovitskiy2014learning}: The above two papers both separate style from pose (and other things like lighting) but use ordinary autoencoders which do not have the same probabilistic semantics. They also train their models in a supervised setting and thus required labeled data for pose/lighting/etc
\vspace{-2mm}
\subsection{Layered models of images}\vspace{-2mm}
Layered models of images are an old idea.
Most works take advantage of motion cues to decompose video data into layers~\citep{darrell1991robust,wang1994representing,ayer1995layered,kannan2005generative,kannan2008fast}.
However, some papers work from single images.
\cite{yang2012layered}, for example, propose a layered model for
segmentation but rely heavily on bounding-box and categorical
annotations.
\cite{Isola2013} deconstruct a single image into layers, but require a
training set of manually segmented regions.
Our generative model is similar to that proposed by \cite{williams2004greedy}; however, by using deep neural networks we can capture more complex
appearance models than their per-pixel mixture-of-Gaussians models. Moreover,
our training procedure is simpler, since it is just end-to-end minibatch SGD.
Our approach also has similarities to the work of \cite{le2011learning}; however, they use restricted Boltzmann machines,
which require expensive MCMC sampling to estimate gradients and have difficulty reliably estimating the
log-partition function.
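To fix notation, a standard back-to-front (painter's algorithm) composite with per-layer masks $m_\ell$ and appearances $a_\ell$ takes the form
\begin{equation*}
x = \sum_{\ell=1}^{L} \Big( \prod_{k=\ell+1}^{L} (1 - m_k) \Big) \odot m_\ell \odot a_\ell,
\end{equation*}
where $\odot$ denotes elementwise multiplication, so each layer is visible only where no closer layer occludes it. This is an illustrative form in our own notation, not the exact model of any of the works above.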
\vspace{-2mm}
% Our model falls under the category of variational autoencoders, which do not require expensive MCMC sampling.
%Amodal completion, where one must ``complete'' a partially visible object, is a challenging vision task
%that several papers have recently attempted.
% has recently been a
%Though we do not attempt the amodal completion task on a large scale vision dataset,
%we believe that generative models such as the one we propose could potentially be very useful in this task.
%\cite{amodalKarTCM15}, \cite{categoryShapesKar15}, \cite{zhu2015semantic}
% all use motion cues to decompose a video into layers
% The kannan model (at least the graphical model part) looks a lot like ours except they
% treat videos and rely on motion cues. They don't assume that appearance is encoded from some latent vector ---
% rather they treat appearance as a pixel matrix which can be warped by motion throughout the video
% They use EM with variational inference