ohm-stacks

One of the key principles of cloud-first infrastructure design is to rely on pre-existing services already available within the cloud infrastructure being used. This is in principle optimal for high burn-rate startups with large funding, but it might not be an ideal solution for low-budget digital humanities projects. For this reason we designed an infrastructure that takes advantage of an abstraction of this basic principle, and of other general principles of software architecture, in order to create an efficient architecture that complies with the needs and requirements that define the Open History Map platform in all of its aspects [Montanari 2021], while also offering some of the high-level tools for Digital Humanities that cloud infrastructures typically lack.

One of the main factors taken into account during the architectural planning process was the importance of delaying and deferring decisions as much as possible, both for us as users of the architecture and for the expandability of the architecture itself [Martin 2017]. This principle holds whether designing an application, an API, or a complex architecture. For this reason most of the API orchestration is delegated to the various interfaces that use the data and compose the elements in whatever way fits best. For example, the map front-end uses the Data Index API to display information about a given source, using its id as a symbolic reference (which is in turn a symbolic reference to the identifier defined in Zotero). This means partially delegating knowledge of the inner workings of one specific part of the infrastructure to other elements that, in principle, should not need to know anything about it; on the other hand, being part of the same ecosystem means that many elements can be cross-referenced between the various interfaces.
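
As an illustration of this delegation, the sketch below shows how a client might resolve a source id through the Data Index API (written in Python for brevity, even though the actual front-end is an Angular application); the endpoint path and field names are assumptions, not the real interface.

```python
import requests

# Hypothetical Data Index endpoint and field names: the real API may differ.
DATA_INDEX_URL = "https://example.org/data-index/api"

def resolve_source(source_id: str) -> dict:
    """Ask the Data Index for the metadata of a source, given its id.

    The id is only a symbolic reference: the Data Index resolves it to the
    underlying Zotero identifier and returns the citation data, so the caller
    never needs to know how Zotero is queried.
    """
    response = requests.get(f"{DATA_INDEX_URL}/sources/{source_id}", timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    source = resolve_source("ABC123")
    print(source.get("title"), source.get("zotero_key"))
```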

The infrastructure is divided into several macro-areas, each of which covers one specific aspect of the platform. Wherever possible, the vertical architecture of each macro-area follows the same pattern:

  • Database (PostgreSQL/MongoDB/Redis/filesystem/InfluxDB)
  • Writing API, which also controls the database initialization infrastructure (a Python/Flask microservice)
  • Reading API (a Python/Flask microservice), possibly with a Tile Server
  • Front end (an Angular application)

This basic template does not always include all of its components: a service might not need to write to a backend, for example, while another service might only perform interaction-less writing operations and therefore not need a direct reading API. Naturally, if the reading operations are simple enough, they are integrated into the writing component, and vice versa.
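
As a rough illustration of the pattern, the following is a minimal sketch of the reading side of such a microservice using Flask; the route, the storage layer and the port are placeholders rather than the actual OHM code.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder storage layer: in the real pattern this would be the
# macro-area database (PostgreSQL/MongoDB/Redis/InfluxDB/filesystem).
FAKE_DB = {"1": {"id": "1", "name": "example item"}}

@app.route("/items/<item_id>", methods=["GET"])
def read_item(item_id):
    """Reading endpoint: a thin layer over the database, no orchestration."""
    item = FAKE_DB.get(item_id)
    if item is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(item)

if __name__ == "__main__":
    # The writing API is a separate microservice with the same shape,
    # plus the database-initialization logic.
    app.run(host="0.0.0.0", port=8080)
```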

The least cross-dependent macro-area is the Data Index, which is simply a visualization of the data gathered in the Zotero collection representing the sources used to populate all other areas. The infrastructure is completely stateless: as of now it has no persistent database and downloads the current collection of sources when the Docker image starts, eliminating the need for a back-end of any kind. The API exposes only the endpoints used by its interface and those used by the other macro-areas to retrieve source-specific data to display on their own interfaces. All other areas rely on its presence and on its being up to date.
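
A minimal sketch of this stateless behaviour is shown below, assuming the Zotero web API is queried at start-up and kept in memory; the group id, pagination handling and endpoint names are illustrative, not the actual configuration.

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative values: the actual Zotero group and collection differ,
# and pagination beyond the first 100 items is omitted for brevity.
ZOTERO_ITEMS_URL = "https://api.zotero.org/groups/123456/items?format=json&limit=100"

# Downloaded once at container start-up; no persistent database is involved.
SOURCES: dict[str, dict] = {}

def load_sources() -> None:
    """Fetch the current Zotero collection and keep it in memory."""
    items = requests.get(ZOTERO_ITEMS_URL, timeout=30).json()
    for item in items:
        SOURCES[item["key"]] = item["data"]

@app.route("/sources/<key>")
def get_source(key):
    """Source-specific data consumed by the map and the other macro-areas."""
    source = SOURCES.get(key)
    if source is None:
        return jsonify({"error": "unknown source"}), 404
    return jsonify(source)

if __name__ == "__main__":
    load_sources()  # state lives only for the lifetime of the container
    app.run(host="0.0.0.0", port=8080)
```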

The main user of the Data Index API is the Data Importer, an interface-less system that does the heavy lifting of transforming and importing data into the various databases, coordinating the use of the various APIs. This service is stateless as well, having no persistent database, but it does connect to the local Docker socket: it uses the Docker-in-Docker approach to spawn the containers that perform the actual ETL operations, as well as the test database and test APIs when an import is being tested. The ETL code is downloaded from a repository; for each source identifier or source dataset identifier taken from the Data Index API, the importer builds the specific Docker image and launches it, writing directly to the various production APIs if the import has already been tested.
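
A rough sketch of that loop, using the Docker SDK for Python, is shown below; the repository layout, environment variables and API URLs are assumptions made for illustration.

```python
import docker
import requests

# Assumed endpoints and layout, for illustration only.
DATA_INDEX_URL = "https://example.org/data-index/api"
ETL_CODE_DIR = "/opt/ohm/etl"          # ETL code cloned from its repository
PRODUCTION_API_URL = "https://example.org/map/api"

def run_imports() -> None:
    # Talks to the local Docker socket (Docker-in-Docker).
    client = docker.from_env()

    # One ETL container per source dataset listed in the Data Index.
    datasets = requests.get(f"{DATA_INDEX_URL}/datasets", timeout=30).json()
    for dataset in datasets:
        dataset_id = dataset["id"]

        # Build the dataset-specific image from the downloaded ETL code.
        image, _ = client.images.build(
            path=f"{ETL_CODE_DIR}/{dataset_id}",
            tag=f"ohm-etl-{dataset_id}",
        )

        # Launch it; a tested import writes straight to the production API.
        client.containers.run(
            image.id,
            environment={"TARGET_API": PRODUCTION_API_URL, "DATASET": dataset_id},
            remove=True,
        )

if __name__ == "__main__":
    run_imports()
```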

The other areas are all full-stack infrastructures, as all rely on one or more databases, APIs and specific front-ends. Starting with the map, the data is stored in a PostGIS database configured automatically by the API module and already set up to be distributed across a cloud infrastructure via partitioning. Partitioning is done both on layers (which are bound to topics) and on the year from which the dataset starts being valid. Giving the time-dependent partitions a variable grain allows a major optimization of infrastructure resources, making it possible to move the storage of data-heavy periods (wars, moments of major change) into separate databases. On the writing side, the API can use a Redis-based buffer so that a workload manager decides when to import specific items. This matters because many polygons are very detailed, and an indirect upload of the data gives the client better feedback on operations even in very complex cases. The tile server is completely separate from the rest of the API and uses a stored procedure defined in the database-initialization part of the API. This is a major optimization, as it relies on the database-native generation of MVT tiles for each requested tile.
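
The sketch below shows the idea behind such a tile endpoint, assuming the geometries are stored in Web Mercator and a features table carrying layer and validity-year columns; the connection string, table and column names are illustrative, and in the real infrastructure the query is wrapped in a stored procedure created by the db-initialization code.

```python
import psycopg2
from flask import Flask, Response

app = Flask(__name__)

# Illustrative connection string and table/column names.
conn = psycopg2.connect("dbname=ohm user=ohm password=secret host=db")
conn.autocommit = True

# In the real infrastructure this query lives in a stored procedure created
# by the db-initialization code; it is inlined here to show the idea.
# Geometries are assumed to be stored in EPSG:3857 (Web Mercator).
TILE_SQL = """
WITH bounds AS (SELECT ST_TileEnvelope(%(z)s, %(x)s, %(y)s) AS geom)
SELECT ST_AsMVT(tile, 'features') FROM (
    SELECT f.id, f.layer, f.year_start, f.year_end,
           ST_AsMVTGeom(f.geom, bounds.geom) AS geom
    FROM features f, bounds
    WHERE f.geom && bounds.geom
) AS tile;
"""

@app.route("/tiles/<int:z>/<int:x>/<int:y>.mvt")
def tile(z, x, y):
    """Return a database-generated MVT tile for the requested z/x/y."""
    with conn.cursor() as cur:
        cur.execute(TILE_SQL, {"z": z, "x": x, "y": y})
        mvt = cur.fetchone()[0] or b""
    return Response(bytes(mvt), mimetype="application/vnd.mapbox-vector-tile")
```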

Beyond the main map, the other tools, the Event Index and the Historical Street View, are cartographic as well. Both mainly contain points, in contrast with the multiple geometry types stored by the main map. Their data is bound, respectively, to mapping events in time and (not always) in space, and to mapping documentation of the form of the world throughout history.

The Event Index collects data from various sources, making it possible to locate specific events in time and space, but also to track particular subjects in their activities over time. For example, it is possible to visualize the course of a specific ship over time, such as the course of the Endeavour during its voyages to New Zealand.

In addition to movements, the layer also contains data about events with anthropic causes (wars, battles, murders, births and deaths) as well as natural causes (earthquakes, volcanic eruptions, various forms of disasters). This layer is a pure collection of time-space coordinates with general information about the event and a reference to the external source. This block contains a modified tile server, slightly different from the main map's, optimized for point management.
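
The dataclass below is a hedged sketch of what such a record might look like, and of how tracking a subject (such as the Endeavour) reduces to filtering the collection; the field names are assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """Illustrative shape of an Event Index record; field names are assumptions."""
    event_id: str
    when: str                 # ISO date string (historical dates may need richer handling)
    lat: Optional[float]      # space is not always known
    lon: Optional[float]
    kind: str                 # e.g. "battle", "earthquake", "ship position"
    subject: Optional[str]    # e.g. "HMS Endeavour", enabling tracking over time
    source: str               # reference to the external source in the Data Index

def course_of(events: list[Event], subject: str) -> list[Event]:
    """Positions of a given subject, ordered by time: a simple filter over the collection."""
    return sorted((e for e in events if e.subject == subject), key=lambda e: e.when)
```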

Finally, the Historical Street View macro-area is, like the Event Index, simply a point store holding not just space and time coordinates but also references to external sources for documents, be they photos, paintings of views, or videos. These multimedia sources are not stored or cached in the system and are displayed directly from the other providers. This is very important in order to guarantee maximum independence from local storage.
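
For illustration, a record in this store might look roughly like the following; the field names are assumptions, and the media itself is only ever referenced by URL, never cached locally.

```python
from dataclasses import dataclass

@dataclass
class ViewDocument:
    """Illustrative Historical Street View record; field names are assumptions."""
    lat: float
    lon: float
    when: str           # date or period the view documents
    media_url: str      # external provider URL: the media is never stored or cached locally
    media_type: str     # "photo", "painting", "video"
    source: str         # reference to the external source in the Data Index
```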

Overall, the structure of the architecture guarantees great flexibility when adding to the system, as well as considerable simplification thanks to the templatization of the individual structures. Splitting the various parts into microservices gives the whole system additional reliability and resilience, allowing changes to the implementation without stopping the infrastructure.