\section{Related work}
Before we explain the reinforcement learning approach to text
detection, this section gives a short overview of the state of the art
in text localization. We first briefly discuss methods that use
handcrafted features and then move on to more recent approaches based
on convolutional neural networks.

Approaches that rely on handcrafted features often exploit visual properties specific to text to distinguish it from
the background or from other objects. These properties are used to generate candidates for further examination \cite{zhu2016scene}. One such property is that
the stroke width of a character is usually almost constant. The so-called Stroke Width Transform (SWT) operator makes use of this observation by assigning
to each pixel the width of the stroke it belongs to \cite{epshtein2010detecting}. By grouping pixels based on these values, letter candidates
are generated, which are then filtered and aggregated into text lines.
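The intuition behind per-pixel stroke widths can be sketched in a few lines. The following toy example is a simplification, not the operator from \cite{epshtein2010detecting} itself (the actual SWT casts rays along gradient directions from edge pixels rather than scanning columns): it records, for each foreground pixel of a binary mask, the length of the vertical run it belongs to. For a genuine text stroke this value is nearly constant, which is exactly the property SWT exploits when grouping pixels.

```python
import numpy as np

# Binary mask containing a horizontal "stroke" of constant width 4.
mask = np.zeros((20, 20), dtype=bool)
mask[5:9, 2:18] = True

# Assign to each foreground pixel the length of its vertical run of
# foreground pixels - a crude stand-in for the stroke width.
widths = np.zeros_like(mask, dtype=int)
for col in range(mask.shape[1]):
    run = 0
    for row in range(mask.shape[0]):
        if mask[row, col]:
            run += 1
        else:
            if run:
                widths[row - run:row, col] = run
            run = 0
    if run:  # run reaching the bottom edge
        widths[-run:, col] = run

print(np.unique(widths[mask]))  # -> [4]: the stroke width is constant
```

Grouping pixels with similar values in `widths` would yield the letter candidates described above.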

Another way of generating candidates is the Maximally Stable Extremal Regions (MSER) approach \cite{neumann2010method}. This method exploits the fact that
all pixels belonging to a letter usually have a similar colour and intensity. Similar to SWT, candidates are generated by identifying
regions of similar intensity. A classifier then filters out candidates that do not contain a letter, and the remaining candidates are
aggregated into lines.
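A minimal sketch of the stability idea behind MSER (again a simplification, not the full algorithm from \cite{neumann2010method}; the function name and image are ours): threshold a grayscale image at increasing levels and track the area of the connected region around a seed pixel. A region whose area stays constant over a wide threshold range is "maximally stable", and letters, having near-uniform intensity, typically produce such regions.

```python
import numpy as np

def region_area(img, thresh, seed):
    """Area of the 4-connected component of pixels <= thresh containing seed."""
    h, w = img.shape
    binary = img <= thresh
    if not binary[seed]:
        return 0
    stack, seen = [seed], {seed}
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and binary[nr, nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                stack.append((nr, nc))
    return len(seen)

# Dark 4x4 "letter" (intensity 50) on a bright background (intensity 200).
img = np.full((10, 10), 200)
img[3:7, 3:7] = 50

areas = [region_area(img, t, (4, 4)) for t in range(40, 200, 20)]
print(areas)  # -> [0, 16, 16, 16, 16, 16, 16, 16]: stable over many thresholds
```

The area jumps once the letter's intensity is reached and then stays fixed; it is this plateau that marks the region as a letter candidate.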

The emergence of convolutional neural networks provides an alternative to handcrafting features, as it is now feasible to learn features directly from the
image itself. This has been used for text detection in various ways, often inspired by developments in the more general field of object
detection. One example is the Rotation Region Proposal Network (RRPN) \cite{ma2018arbitrary}, which builds on the Faster R-CNN object detection
framework \cite{ren2015faster}. Faster R-CNN first feeds the image through a convolutional neural network that outputs a feature map, so the
features are not constructed by human reasoning but learned during training. This feature map then serves as input to a second network that
generates region proposals. These proposals play a role similar to that of the candidates in the handcrafted methods above. In the last step,
each proposal is assigned both a class and an offset that slightly adjusts the bounding box to better fit the object. RRPN
adapts this method to text localization by enabling the network to propose bounding boxes with arbitrary orientations instead of only
axis-aligned ones.
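The offset step can be made concrete with the standard Faster R-CNN box parameterisation \cite{ren2015faster} (the function and variable names here are ours): the network predicts four values $(d_x, d_y, d_w, d_h)$ that shift the box centre relative to the anchor's size and rescale its width and height log-linearly.

```python
import math

def apply_offsets(anchor, offsets):
    """Adjust an anchor box (cx, cy, w, h) by predicted regression offsets."""
    x, y, w, h = anchor          # anchor centre and size
    dx, dy, dw, dh = offsets     # predicted regression targets
    return (x + dx * w,          # shift centre, scaled by anchor size
            y + dy * h,
            w * math.exp(dw),    # rescale width and height log-linearly
            h * math.exp(dh))

# A dx of 0.5 moves the centre by half the anchor width; zero dw/dh
# leaves the size unchanged (exp(0) = 1).
print(apply_offsets((10.0, 10.0, 4.0, 4.0), (0.5, 0.0, 0.0, 0.0)))
# -> (12.0, 10.0, 4.0, 4.0)
```

RRPN extends this parameterisation with a fifth regression target for the box angle, which is what allows it to propose oriented boxes.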

Faster R-CNN yields good detection results, but the two-stage process of first creating and then validating region proposals is time consuming. The
Single Shot MultiBox Detector (SSD) \cite{liu2016ssd} addresses this issue by skipping the separate region proposal stage and instead
performing the entire detection process in a single network. This results in higher execution speed while achieving better detection results. Just like
Faster R-CNN, SSD has been adapted for text localization. One approach inspired by SSD is TextBoxes++ \cite{liao2018textboxes++}. Compared to
SSD, TextBoxes++ improves the detection of objects with extreme aspect ratios, which are common in text localization. Unlike SSD, it is also able
to detect arbitrarily oriented text.