The `MapReducer` is the central object of every OSHDB query. It is returned by the initial OSHDB view and lets one filter out defined subsets of the OSM history dataset. One can then transform (map) and aggregate (reduce) the respective OSM data into a final result.
For example, a map function can calculate the length of every OSM highway, and a reduce function can sum up all of these length values.
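Put together, such a query could look like the following sketch. It assumes a local H2 OSHDB extract and the class and package names of recent OSHDB versions (these may differ slightly between releases); the file path and timestamp are placeholders.

```java
import org.heigit.ohsome.oshdb.api.db.OSHDBH2;
import org.heigit.ohsome.oshdb.api.mapreducer.OSMEntitySnapshotView;
import org.heigit.ohsome.oshdb.util.geometry.Geo;

public class HighwayLengthExample {
  public static void main(String[] args) throws Exception {
    // connect to a local OSHDB extract (the path is a placeholder)
    OSHDBH2 oshdb = new OSHDBH2("path/to/extract.oshdb");

    Number totalLength = OSMEntitySnapshotView.on(oshdb)
        // select the data subset: highway ways at a single point in time
        .timestamps("2020-01-01")
        .filter("type:way and highway=*")
        // map: compute the length (in m) of each snapshot's geometry
        .map(snapshot -> Geo.lengthOf(snapshot.getGeometry()))
        // reduce: sum up all length values
        .sum();

    System.out.println("total highway length: " + totalLength + " m");
  }
}
```

The snippets further below assume the same setup (an open `oshdb` connection and these imports) and show only the query itself.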
For many of the most frequently used reduce operations, such as summing up values or counting elements, specialized reducers exist.
A transformation function can be set by calling the `map` method of any `MapReducer`. An OSHDB query may have no map step at all, or several map steps that are executed one after another. A map function can also change the data type of the `MapReducer` it operates on. For example, when calculating the length (a floating-point number) of an entity snapshot, the underlying `MapReducer` changes from type `MapReducer<OSMEntitySnapshot>` to `MapReducer<Double>`.
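Written out with explicit intermediate variables (normally the calls are simply chained), this type change looks as follows; the sketch assumes the setup from the first example plus imports for `MapReducer` and `OSMEntitySnapshot`:

```java
// a query without a map step operates on OSMEntitySnapshot values
MapReducer<OSMEntitySnapshot> snapshots = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*");

// the map step changes the generic type of the MapReducer
MapReducer<Double> lengths =
    snapshots.map(snapshot -> Geo.lengthOf(snapshot.getGeometry()));
```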
A `flatMap` operation allows one to map each input value to an arbitrary number of output values. Each of the output values can then be transformed individually in further map steps.
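A sketch of such a flatMap step, assuming the setup from the first example (plus `java.util.Arrays`): each snapshot is mapped to the list of vertices of its geometry, and the resulting values are then counted.

```java
// flatMap emits one output value per vertex of each matching geometry
Integer vertexCount = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .flatMap(snapshot -> Arrays.asList(snapshot.getGeometry().getCoordinates()))
    .count();
```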
Filters can even be applied in the map phase. Read more about this feature in the filters section of this manual.
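For example, a predicate can be used to filter already mapped values; a sketch, assuming the setup from the first example and the predicate-based `filter` overload of the `MapReducer`:

```java
// keep only length values above 100 m before summing them up
Number longHighwayLength = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .map(snapshot -> Geo.lengthOf(snapshot.getGeometry()))
    .filter(length -> length > 100.0)
    .sum();
```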
The `reduce` operation produces the final result of an OSHDB query. It takes the results of the previous map steps and combines (reduces) these values into a final result. This can be something as simple as summing up all the values, or something more complicated, for example estimating statistical properties such as the median of the calculated values. Many queries use common reduce operations, for which the OSHDB provides shorthand methods (see below).
Every OSHDB query must have exactly one terminal reduce operation (or use the `stream` method explained below).
Remark: If you are already familiar with Hadoop, note that for defining a reduce operation we use the terminology of the Java stream API, which differs slightly from the terminology used in Hadoop. In particular, the Java stream API and Hadoop use the same term 'combiner' for different things.
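As a sketch (setup as in the first example), here is a "manual" reduce that sums up the mapped length values using the generic `reduce` method; its three arguments follow the Java stream API pattern:

```java
Double totalLength = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .map(snapshot -> Geo.lengthOf(snapshot.getGeometry()))
    .reduce(
        () -> 0.0,                           // identity: the "zero" of the reduction
        (partial, value) -> partial + value, // accumulator: add one value to a partial result
        (a, b) -> a + b                      // combiner: merge two partial results
    );
```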
The OSHDB provides a number of default reduce operations that are often needed when querying OSM history data. Their names and usage are mostly self-explanatory.
Some of these specialized reducers also have overloaded versions that accept a mapping function directly. This allows some queries to be written more concisely, and it also improves type inference: for example, when summing integer values, the overloaded `sum` reducer knows that the result must also be of type `Integer` and does not have to fall back to returning the more generic `Number` type.
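A sketch of the difference, assuming the setup from the first example and counting each snapshot as 1:

```java
// the plain sum() reducer only knows it returns some kind of Number …
Number count1 = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .map(snapshot -> 1)
    .sum();

// … while the overloaded sum(mapper) infers the concrete result type
Integer count2 = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .sum(snapshot -> 1);
```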
Instead of ending an OSHDB query with a regular reduce operation, one can also call `stream`, which does not aggregate the values into a final result but returns a (potentially long) stream of values. Where possible, using a reduce operation instead of streaming all values and post-processing them results in better query performance, because less data has to be transferred. The `stream` operation is, however, preferable to `collect` if the result set is expected to be large, because it does not require all the data to be buffered into a result collection.
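A sketch of a streaming query, assuming the setup from the first example:

```java
// print each individual length value instead of aggregating them
OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("type:way and highway=*")
    .map(snapshot -> Geo.lengthOf(snapshot.getGeometry()))
    .stream()
    .forEach(length -> System.out.println(length + " m"));
```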
Often, one is interested in properties of the geometries of the analyzed OSM features. For some commonly used metrics, the OSHDB comes with built-in helper functions in its `Geo` class:
* `areaOf` returns the area (in m²) of polygonal geometries.
* `lengthOf` returns the length (in m) of linear geometries.
Note that both of these methods use approximation formulas to calculate the length or area of OSM geometries. For typical features present in OpenStreetMap data, however, the relative error introduced by these approximations is quite small (below 0.1% for lengths and below 0.001% for areas).
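For example, `areaOf` can be used analogously to the length calculation shown earlier; a sketch, assuming the setup from the first example:

```java
// total area (in m²) of building polygons at the given point in time
Number buildingArea = OSMEntitySnapshotView.on(oshdb)
    .timestamps("2020-01-01")
    .filter("building=* and geometry:polygon")
    .map(snapshot -> Geo.areaOf(snapshot.getGeometry()))
    .sum();
```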