\section{System architecture}
The system consists of three main parts: one is used for generating training and test data, and the other two are used for the actual reinforcement learning.
Initially, we created a Python-based program that can be used to generate synthetic training images with text on them\footnote{The program's source code can be found on GitHub.\cite{GitHubDatasetGenerator}}.
This program is called the \textit{dataset-generator}.
Subsequently, we wrote two further programs, also in Python, which provide the environment and the agent used for training and testing\footnote{The source code of those programs can be found on GitHub as well.\cite{GitHubTextLocalizationEnvironment}\cite{GitHubTextLocalizationAgent}}.
Those programs are called \textit{text-localization-environment} and \textit{text-localization-agent} respectively.
Both the \textit{dataset-generator} and the \textit{text-localization-agent} program use the Python package \textit{click}\cite{PythonPackageClick} to offer command-line interfaces that allow users to change the programs' parameters without modifying their source code.
This is especially useful when running the programs on computers accessed via a remote connection (e.\,g. via SSH), as is often the case when working on machine learning projects.
A typical use case of the system is to (a) generate synthetic training data using the \textit{dataset-generator}, (b) install the \textit{text-localization-environment} for later use by the agent and (c) train and/or evaluate one or multiple agents using the \textit{text-localization-agent}.
In the following paragraphs, the three programs and their usage options are described in more depth.
\subsection{The dataset-generator program}\label{sec:dataset-generator}
As briefly mentioned before, the \textit{dataset-generator} is a program that can be used for generating synthetic training data for usage in machine learning projects related to the bounding box detection (localization) of text on images.
With the generator it is possible to start the training process at a low difficulty level and increase the difficulty step by step, allowing us to gradually improve our implementation to the point where it might be able to detect text in real-world images. In addition, the \textit{dataset-generator} enables us to generate as many images for training or testing purposes as we want, with specific parameters (e.g. word count, word length, font, font size), without requiring anyone to label data for us.
The \textit{dataset-generator} uses the Python package Pillow\cite{PythonPackagePillow} to load one or multiple fonts, create images of a specific size and render one or multiple words of text onto those images.
The words themselves come from a public domain word list originally contained in the BSD project.\cite{BSDWordlist}
For the generation of image datasets, users can control the images' width and height, the number of images to generate, the location of a word list (in a format similar to the default word list described above), a seed for the program's random calculations, as well as whether the program should run in debug mode (in which it renders AABB outlines around the corresponding words).
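To illustrate how such parameters are typically exposed, the following is a minimal sketch of a \textit{click}-based command-line interface; the option names and default values are hypothetical and do not necessarily match the actual \textit{dataset-generator} CLI.
\begin{verbatim}
import click

@click.command()
@click.option('--width', default=300, help='Width of the generated images')
@click.option('--height', default=300, help='Height of the generated images')
@click.option('--count', default=1000, help='Number of images to generate')
@click.option('--wordlist', type=click.Path(exists=True), default=None,
              help='Path to an alternative word list')
@click.option('--seed', default=0, help='Seed for the random calculations')
@click.option('--debug', is_flag=True, help='Render AABB outlines')
def generate(width, height, count, wordlist, seed, debug):
    """Generate synthetic text images (sketch only, names hypothetical)."""
    ...


if __name__ == '__main__':
    generate()
\end{verbatim}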
Figure \ref{fig:dataset-generator-sample-output} shows two sample images created by the \textit{dataset-generator}, one using the default settings and one explicitly having the \textit{debug} flag set to true.
In addition to the images, the program also creates a \texttt{.npy} file containing the NumPy-encoded AABB locations of the words in the images, as well as a \texttt{.txt} file containing the relative file paths of the generated images.
\begin{figure}[h!]
\begin{subfigure}[t]{0.5\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{figures/dataset-generator-sample-1.png}
\subcaption{Debug mode disabled (default setting)}
\end{subfigure}
\begin{subfigure}[t]{0.5\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{figures/dataset-generator-sample-debug.png}
\subcaption{Debug mode enabled. The blue and gray boxes indicate the AABBs of the individual words and of the whole ``text'' block}
\end{subfigure}
\caption{Two sample images created by the \textit{dataset-generator}}
\label{fig:dataset-generator-sample-output}
\end{figure}
\noindent Together with the \texttt{.npy} and the \texttt{.txt} files, the images can be used by the \textit{text-localization-environment} program as described in the following section.
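As a rough sketch, these files could be loaded as follows; the file names are placeholders, since the actual names depend on the \textit{dataset-generator} invocation.
\begin{verbatim}
import numpy as np

# Load the NumPy-encoded ground truth AABBs and the list of relative
# image paths (file names are placeholders).
gt_bboxes = np.load('bounding_boxes.npy', allow_pickle=True)
with open('image_locations.txt') as f:
    image_paths = [line.strip() for line in f if line.strip()]

# One entry of ground truth AABBs per generated image.
assert len(gt_bboxes) == len(image_paths)
\end{verbatim}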
\subsection{The text-localization-environment program}
The \textit{text-localization-environment} program is designed to conform to OpenAI Gym's environment interface \textit{Env}\cite{OpenAIGym} by implementing the following three methods:
\begin{quote}
\begin{itemize}
\item reset(self): Reset the environment's state. Returns observation.
\item step(self, action): Step the environment by one timestep. Returns observation, reward, done, info.
\item render(self, mode='human', close=False): Render one frame of the environment. The default mode will do something human friendly, such as pop up a window. Passing the close flag signals the renderer to close any such windows.
\end{itemize}\cite{OpenAIGymReadme}
\end{quote}
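Schematically, an environment conforming to this interface looks as follows; the class below is an illustrative skeleton rather than the actual implementation of the \textit{text-localization-environment}.
\begin{verbatim}
import gym
from gym import spaces


class TextLocalizationEnvSketch(gym.Env):
    """Illustrative skeleton only, not the actual implementation."""

    def __init__(self, image_paths, gt_bboxes):
        # 9 discrete actions, as described below.
        self.action_space = spaces.Discrete(9)
        self.image_paths = image_paths
        self.gt_bboxes = gt_bboxes

    def reset(self):
        # Select an image, reset the current bounding box and
        # return the initial observation.
        ...

    def step(self, action):
        # Apply the action to the current bounding box, then
        # return observation, reward, done, info.
        ...

    def render(self, mode='human', close=False):
        # Visualize the current bounding box on the image.
        ...
\end{verbatim}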
In general, the environment is based on the algorithm described in the paper\cite{caicedo2015active} on which we based our research:
the environment offers a total of 9 actions to the agent (and thus uses a \textit{discrete} action space).
Those actions are exactly the same as the ones used in the paper mentioned above, and are visualized in Figure \ref{fig:environment-actions-from-base-paper}.
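Such a discrete action space can be represented as a simple enumeration; the identifiers below follow the transformation names used in the cited paper and are purely illustrative.
\begin{verbatim}
from enum import IntEnum


class Action(IntEnum):
    # Eight bounding box transformations plus the trigger action,
    # following the action set of the base paper.
    MOVE_RIGHT = 0
    MOVE_LEFT = 1
    MOVE_UP = 2
    MOVE_DOWN = 3
    SCALE_UP = 4       # "bigger"
    SCALE_DOWN = 5     # "smaller"
    MAKE_FATTER = 6    # change aspect ratio horizontally
    MAKE_TALLER = 7    # change aspect ratio vertically
    TRIGGER = 8        # terminate and report the current box
\end{verbatim}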
After being initialized with a set of images (indicated by a corresponding \texttt{.txt} file, as described above) and the ground truth bounding boxes (contained in a \texttt{.npy} file), the environment acts as a kind of state machine which can be controlled by the three main methods described above.
For calculating the observation in the \textit{reset} and \textit{step} methods, the environment uses a feature extractor – in our case, we used VGG16\cite{VGG16} in the beginning and replaced it with ResNet\cite{ResNet} later.
In order to control whether this feature extractor runs on the GPU, the environment can be initialized with a corresponding parameter indicating the ID of the GPU to be used.
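As a rough sketch, using the VGG16 feature extractor mentioned above via Chainer's pre-trained model wrapper, the observation computation and the GPU handling could look roughly like this (the actual implementation may differ):
\begin{verbatim}
import chainer
import chainer.links as L
import numpy as np

# Rough sketch: a pre-trained VGG16 network used as a fixed feature extractor.
extractor = L.VGG16Layers()

gpu_id = 0  # ID of the GPU to use; a negative value would mean CPU only
if gpu_id >= 0:
    chainer.cuda.get_device_from_id(gpu_id).use()
    extractor.to_gpu(gpu_id)

image = np.zeros((300, 300, 3), dtype=np.uint8)  # dummy input image

with chainer.using_config('train', False), chainer.no_backprop_mode():
    # Extract a fixed-size feature vector used as (part of) the observation.
    features = extractor.extract([image], layers=['fc7'])['fc7']
\end{verbatim}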
For calculating the reward returned by the \textit{step} method, the environment calculates the best intersection over union (IoU) of the current bounding box and any of the ground truth bounding boxes.
During the course of the project, we also introduced two more means of increasing the likelihood of the agent using the \textit{trigger} action: we (a) made the environment count the number of consecutive steps that did not change the IoU at all (and give a negative reward of $-1$ whenever this count exceeds 3 steps) and (b) introduced a duration penalty (with a default value of $0.03$) which is multiplied by the number of steps taken since the last call to the \textit{reset} method and subtracted from the reward. These two elements were introduced because agents in the early development stages ended up almost never using the \textit{trigger} action; our aim was thus to penalize the agent for waiting too long before coming up with a bounding box proposal\footnote{This matter is discussed further in Section \ref{sec:Results}.}.
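A sketch of this reward computation might look as follows; the function and variable names are illustrative, and only the duration penalty default of $0.03$ is taken from the description above.
\begin{verbatim}
def iou(box_a, box_b):
    """Intersection over union of two AABBs given as (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    intersection = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def best_iou(current_box, gt_bboxes):
    """Best IoU of the current box with any ground truth box."""
    return max(iou(current_box, gt) for gt in gt_bboxes)


def apply_duration_penalty(reward, steps_since_reset, duration_penalty=0.03):
    # Subtract a small amount that grows with the number of steps taken
    # since the last reset (sketch of the penalty described above).
    return reward - duration_penalty * steps_since_reset
\end{verbatim}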
Although the \textit{text-localization-environment} program is theoretically independent of the agent using it (as proposed in the OpenAI specification), it is usually used together with such an agent.
The program allowing the training and testing of such agents is described in the following section.
\subsection{The text-localization-agent program}
The \textit{text-localization-agent} program uses the deep reinforcement learning framework ChainerRL\cite{ChainerRL} internally.
In addition to offering a variety of state-of-the-art deep reinforcement learning algorithms and different Q-functions, ChainerRL also offers convenience functions for training an agent while evaluating it at a given interval.
Thus, the \textit{text-localization-agent} program operates at a rather high level of abstraction:
It offers a \textit{click}-based CLI which allows users to specify the number of steps for which a training should be run as well as the locations of the \texttt{.txt} and \texttt{.npy} files described in Section \ref{sec:dataset-generator}.
Furthermore, it allows users to specify the ID of the GPU to use (if any) and whether the intermediate training results should be logged to TensorBoard, as described in the following paragraph.
Our implementation of the agent remained largely unchanged during development, but there are a couple of details worth keeping in mind. As hinted at in Section \ref{sec:theoretical background}, our agent at its core is a Deep Q-Network; this means that it is modeled by a deep neural network which takes a representation of the current state as input and outputs the Q-values for the different actions. As optimizer we chose the widely used Adam optimizer, which is well suited to our problem given the sparse structure of the data. Finally, the exploration-exploitation trade-off was balanced using a linear-decay $\epsilon$-greedy strategy; we chose this approach following the base paper\cite{caicedo2015active} and also used the same interval for the linearly decaying $\epsilon$ values.
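Schematically, setting up such an agent with ChainerRL could look as follows. The Q-network architecture, the concrete hyperparameter values and the observation size are placeholders; only the use of a DQN agent, the Adam optimizer and the linear-decay $\epsilon$-greedy explorer correspond to the description above (keyword names of the training function differ slightly between ChainerRL versions).
\begin{verbatim}
import chainer
import chainerrl
import numpy as np

# Placeholder dimensions: a 2048-dimensional observation vector and the
# 9 discrete actions offered by the environment.
q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    ndim_obs=2048, n_actions=9, n_hidden_channels=1024, n_hidden_layers=2)

optimizer = chainer.optimizers.Adam()
optimizer.setup(q_func)

# Linear-decay epsilon-greedy exploration; the epsilon interval and the
# decay length here are placeholders.
explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.1, decay_steps=100000,
    random_action_func=lambda: np.random.randint(9))

replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 5)

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma=0.95, explorer=explorer,
    replay_start_size=500, update_interval=1, target_update_interval=1000)

# `env` is assumed to be an already initialized instance of the
# text-localization-environment.
chainerrl.experiments.train_agent_with_evaluation(
    agent, env, steps=200000, eval_n_steps=None, eval_n_episodes=10,
    eval_interval=10000, outdir='training-results')
\end{verbatim}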
\paragraph{TensorBoard logging integration}
Although mainly intended to be used together with TensorFlow, the TensorBoard visualization toolkit\cite{TensorBoard} can also be used to monitor the training progress of other machine learning applications.
Since ChainerRL does not offer a built-in way of logging in a format suitable for visualization in TensorBoard, we had to implement such logging ourselves.
Fortunately, there exists a library called \textit{tensorboard-chainer}\cite{TensorBoardChainer} which offers a convenient way of logging variables like scalars and images to TensorBoard.
The resulting TensorBoard logging is implemented using a so-called ChainerRL step hook\cite{ChainerRLStepHooks}.
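The hook mechanism can be sketched roughly as follows; the statistics that are logged and the details of the \textit{tensorboard-chainer} API usage shown here are illustrative rather than a faithful copy of our implementation. Such a hook can then be passed to ChainerRL's training functions (e.g. via their \texttt{step\_hooks} parameter) so that it is called after every training step.
\begin{verbatim}
from chainerrl.experiments.hooks import StepHook
from tb_chainer import SummaryWriter


class TensorBoardStepHook(StepHook):
    """Rough sketch of a step hook logging agent statistics to TensorBoard."""

    def __init__(self, log_dir='tensorboard-logs'):
        self.writer = SummaryWriter(log_dir)

    def __call__(self, env, agent, step):
        # ChainerRL agents expose running statistics (e.g. the average Q
        # value); their names depend on the agent implementation.
        for name, value in agent.get_statistics():
            if value is not None:
                self.writer.add_scalar(name, value, step)
\end{verbatim}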
Figure \ref{fig:tensorboard-sample-output} shows a sample visualization created in TensorBoard.
\begin{figure}[h!]
\centering
\includegraphics[width=0.55\textwidth]{figures/tensorboard-sample-output.png}
\caption{Sample visualization created using TensorBoard}
\label{fig:tensorboard-sample-output}
\end{figure}
\noindent We tried using the ChainerRL Visualizer\cite{ChainerRLVisualizer} as well, but it turned out to be less useful than our integration of TensorBoard.
This was mainly because the ChainerRL Visualizer cannot yet create saliency maps for models containing an RNN, and because it requires users to control the training via its UI, which was less practical in our case than being able to run the training programmatically from our \textit{text-localization-agent} program.
While the visualizations in TensorBoard help to get a quick impression of how well a training run performs in comparison to others, they do not provide a formal evaluation of the agents' performance.
Thus, we implemented a Python script that allows evaluating a trained agent on a specified set of images, as described in the following paragraph.
\paragraph{Evaluation script}
In order to be able to precisely calculate the mean average precision (mAP) as well as precision and recall for specific IoU thresholds, we implemented a small wrapper script around ChainerCV's VOC evaluation methods\cite{ChainerCVEvalDetectionVOC}.
This allows users to provide a dataset in the format created by the \textit{dataset-generator} program and calculate the mAP as well as precision and recall values based on the metrics used in the PASCAL VOC challenge.
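A minimal sketch of such a calculation with ChainerCV, using made-up example boxes and treating all text as a single class, could look like this:
\begin{verbatim}
import numpy as np
from chainercv.evaluations import (calc_detection_voc_prec_rec,
                                   eval_detection_voc)

# Made-up example data for a single image with one predicted and one ground
# truth box; ChainerCV expects boxes as (y_min, x_min, y_max, x_max) and one
# array per image in each list.
pred_bboxes = [np.array([[10, 10, 50, 120]], dtype=np.float32)]
pred_labels = [np.array([0], dtype=np.int32)]   # a single "text" class
pred_scores = [np.array([0.9], dtype=np.float32)]
gt_bboxes = [np.array([[12, 8, 52, 118]], dtype=np.float32)]
gt_labels = [np.array([0], dtype=np.int32)]

result = eval_detection_voc(pred_bboxes, pred_labels, pred_scores,
                            gt_bboxes, gt_labels, iou_thresh=0.5)
print('mAP:', result['map'])

prec, rec = calc_detection_voc_prec_rec(pred_bboxes, pred_labels, pred_scores,
                                        gt_bboxes, gt_labels, iou_thresh=0.5)
print('precision / recall for the text class:', prec[0], rec[0])
\end{verbatim}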