Architectural Decision Records

Johanna edited this page Jun 20, 2023 · 17 revisions

Table of Contents

Database

Multiplex

Workflows

Database

Flexible Database Migration

Date: 02/2022

Status: Accepted

Context

During the Omicron surge from late 2021 through early 2022, the volume of daily tests SimpleReport processed increased exponentially. The system frequently alerted on database connection exhaustion errors because there were not enough available connections to serve the number of simultaneous users. Initial remediation attempts included changing the number of connections requested by the application's connection pool and increasing the number of application replicas running at a given time. These solutions proved temporary, however, as SimpleReport's rate of adoption continued to outstrip its available capacity.

A more permanent solution was to change the SKU of our existing database, which increased both the fixed number of available connections and the available DB-specific compute resources. This still did not solve our issues with connection pool sizing, and it required careful management of application replicas to prevent inadvertent connection starvation. The issue resurfaced during attempts to migrate the audit log functionality out of the database and into Splunk, a third-party log analytics provider.
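The constraint described here comes down to simple arithmetic: each application replica holds its own connection pool, so worst-case demand is the number of replicas times the pool size, and that product has to fit under the database's connection limit. The following is a minimal Python sketch of that check. The class name and every number in it are illustrative assumptions, not SimpleReport's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionBudget:
    """Worst-case connection demand for a replicated application.

    All values used with this class below are illustrative assumptions.
    """

    max_db_connections: int      # fixed limit determined by the database SKU
    reserved_connections: int    # kept free for admin, migration, and monitoring use
    pool_size_per_replica: int   # maximum pool size configured in each replica

    def usable_connections(self) -> int:
        return self.max_db_connections - self.reserved_connections

    def max_safe_replicas(self) -> int:
        # Each replica may open up to pool_size_per_replica connections.
        return self.usable_connections() // self.pool_size_per_replica

    def is_exhausted_by(self, replicas: int) -> bool:
        return replicas * self.pool_size_per_replica > self.usable_connections()


budget = ConnectionBudget(
    max_db_connections=200, reserved_connections=20, pool_size_per_replica=30
)
print(budget.max_safe_replicas())  # 6
print(budget.is_exhausted_by(6))   # False
print(budget.is_exhausted_by(7))   # True
```

With these assumed numbers, adding a seventh replica without shrinking the per-replica pool would exceed the limit, which is the kind of trade-off that had to be managed by hand.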

Decision

Ultimately, the team decided to move to the Azure Database for PostgreSQL - Flexible Server product.

Movement to the Flexible Server SKU would provide a number of advantages, many of which are covered in this documentation. Specifically, the following provide the greatest impact to SimpleReport:

  • High availability of database instances, ensuring minimal downtime in the event of a datacenter or machine outage.
  • Automated patching with a managed maintenance window, ensuring that Azure does not attempt to perform maintenance on our DB during peak SR usage hours.
  • Rapid scaling and performance management, enabling rapid response to customer demand.
  • The integration of PgBouncer connection pooling, which increases resource efficiency and minimizes the need for manual connection management.

Consequences

The Flexible DB rollout has helped reduce response time according to the Azure metrics.

Multiplex

Multiplex changes for Flu A and B

Date: 06/2022

Status: Accepted

Additional discussion available here

Context

As of February 2022, SimpleReport was a COVID-19-specific tool. To become more flexible, we looked to expand SimpleReport’s capabilities to start recording Flu results. We wanted to support the growing number of SimpleReport users testing on devices that test for COVID-19, Flu A, and Flu B with a single patient sample (multiplex).

Over 65,000 tests were recorded in SimpleReport using these devices as of February 2022. We wanted users to be able to report flu results from these devices in tandem with COVID-19 results for three reasons:

  • To better support users as they track disease outbreaks in their facilities and organizations.
  • To provide a proof-of-concept to the CDC and other public health partners that flu data can be collected and is useful.
  • To begin the engineering work required to add additional diseases.

Decision

For this iteration of multiplex implementation, we did not focus on how to report these results to public health departments through ReportStream, as many departments do not want this data yet. Instead, we focused on how to collect and store these results within the SimpleReport system. We believe this will provide value to our end users by allowing them to review the test results and provide a more complete test result to the patient via SimpleReport’s SMS and SMTP test result delivery.

  • The “results” data schema has the following pieces:

    • A mutable TestOrder object with correction/removal status representing “a test for a person at a time has occurred”.
    • An immutable TestEvent object that we send to ReportStream representing “something related to testing has happened in SR”. If a TestOrder is corrected or removed, another TestEvent with the correction/removal flag is created to capture the change.
  • Prior to multiplex, we tracked test outcomes as columns in the TestEvent/TestOrder tables. To implement multiplex, we replaced these columns with a Results table joined to the TestOrder and TestEvent tables to maintain the mapping of Result <> TestOrder <> TestEvent. The Results table has a "Test Result" column taking one of "positive"/"inconclusive"/"negative", similar to the columns that existed previously in the TestOrder/TestEvent tables.

  • For previous results that weren't corrected or removed, we backfilled the Results table with entries that copied their respective results’ outcomes with joins to the existing TestOrder/Events rows. These Result entries also got new columns like disease ID needed for multiplex.

  • For each corrected and/or removed test, we made new TestEvent entities. Each new TestEvent was mapped to a new Result entity. Both new entities were mapped back to the original TestOrder that was updated with correction/removal status.

    • In this process, the old TestEvents made prior to the TestOrder update are no longer associated with any Result entities. Therefore, we made new Result entities to map the old TestEvents back to the relevant TestOrder and to ensure each TestEvent has a Result.
    • In other words, TestOrders are one-to-many on Results. Results are one-to-one on TestEvents. TestEvents are many-to-one on TestOrders 🙃.
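The relationships above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the `Store` class, the field names, and the status strings `"ORIGINAL"` and `"CORRECTED"` are hypothetical and are not taken from the SimpleReport schema. It shows a correction creating a new TestEvent and a new Result that both point back to the same TestOrder.

```python
from dataclasses import dataclass
from itertools import count
from typing import List

_ids = count(1)  # simple id generator for the sketch


@dataclass
class TestOrder:
    """Mutable: 'a test for a person at a time has occurred'."""
    id: int
    correction_status: str = "ORIGINAL"  # hypothetical status value


@dataclass(frozen=True)
class TestEvent:
    """Immutable record of something that happened to a TestOrder."""
    id: int
    test_order_id: int
    correction_status: str


@dataclass(frozen=True)
class Result:
    """One outcome row, joined to both a TestOrder and a TestEvent."""
    id: int
    test_order_id: int
    test_event_id: int
    disease: str
    test_result: str  # "positive" / "negative" / "inconclusive"


class Store:
    """Hypothetical in-memory stand-in for the TestEvent and Results tables."""

    def __init__(self) -> None:
        self.events: List[TestEvent] = []
        self.results: List[Result] = []

    def record(self, order: TestOrder, disease: str, test_result: str) -> TestEvent:
        # Every new TestEvent gets its own Result, both linked to the order.
        event = TestEvent(next(_ids), order.id, order.correction_status)
        self.events.append(event)
        self.results.append(
            Result(next(_ids), order.id, event.id, disease, test_result)
        )
        return event

    def correct(self, order: TestOrder, disease: str, test_result: str) -> TestEvent:
        # The order is updated in place; the old event is left untouched.
        order.correction_status = "CORRECTED"
        return self.record(order, disease, test_result)

    def results_for_order(self, order_id: int) -> List[Result]:
        return [r for r in self.results if r.test_order_id == order_id]

    def results_for_event(self, event_id: int) -> List[Result]:
        return [r for r in self.results if r.test_event_id == event_id]


store = Store()
order = TestOrder(id=next(_ids))
original = store.record(order, "COVID-19", "positive")
corrected = store.correct(order, "COVID-19", "negative")

print(len(store.results_for_order(order.id)))  # 2
print(original.correction_status, corrected.correction_status)  # ORIGINAL CORRECTED
```

After the correction, the order has two Results, and each TestEvent still has its own Result, which matches the mapping described in the list above.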

Consequences

The multiplex implementation has enabled SimpleReport to begin accepting Flu A and B results and has laid the groundwork for easier addition of future diseases. These changes have also surfaced a potential need to refactor the Results flow, given the complicated workarounds to TestEvents/TestOrders that were needed to make multiplex work.

Device Addition Automation and Standardization

Date: 09/2022

Status: Approved

Additional discussion available here.

Context

Test devices form a key part of SimpleReport’s user experience and add value to test reporting. The evolution of the COVID-19 pandemic combined with user engagement with SimpleReport have surfaced the need to rethink how we offer devices to end users. As the pandemic has evolved, SimpleReport has discovered user need for multiplex test results, expanding the product needs of SimpleReport beyond its original conception as a workflow tool for COVID-19. As SimpleReport expands to handle more diseases, maintaining accurate device data will be essential.

In the initial problem statement, we had two large goals: First, reducing the manual labor for support and super admins to add new devices, thus making reporting COVID-19 results – and later other disease results – a smoother, more accurate, and scalable process. Second, standardizing devices across SimpleReport environments, thereby increasing the ease and accuracy of testing the SimpleReport application across environments.

Ideally, this work would reduce the manual effort needed to add devices to the SimpleReport application and make a variety of devices automatically available to users. To potentially support this work, we considered the data provided by SimpleReport’s partner team, ReportStream. ReportStream has a LIVD API, which provides consumable COVID-19 test device data from the LIVD table that could be used to populate device data within SimpleReport, improving SimpleReport’s device accuracy and consistency.

Decision

Research revealed that the product gain from implementing these goals would not be worth the effort of implementation. This is in large part because LIVD table data is of limited use both to SimpleReport users and under SimpleReport’s current definition of devices. To be immediately useful, the LIVD data would require significant data processing. Furthermore, we do not know enough about multiplex and additional diseases at this time to truly leverage the LIVD data, and LIVD data is not useful for additional diseases. While still potentially valuable, architecting a data solution for multiplex and additional diseases at this time would be premature, given our lack of knowledge around how this data will need to be reported and in what format (FHIR, for instance).

Therefore, we split our goals into near term and long term. The near-term goal is to increase the consistency of device data across all SimpleReport environments (notably prod, demo, and training) by enabling device sync across environments. Doing so will achieve two sub-goals: First, increase the ease and accuracy of automated and smoke testing across environments for engineers. Second, improve the experience for new SimpleReport users onboarding/training in demo and training environments. Mirroring device data from production will give new users a better understanding of the capability of SimpleReport and how to take advantage of it. This will also enable the support team to better troubleshoot user issues.

Longer-term goals include automating device addition (as initially imagined) and revising the data model in tandem with device data ingestion (either LIVD data or other multiplex/additional device data yet to be discovered) to be better structured to send the necessary report data to ReportStream.

A valid solution will ensure that 1) a single source of truth exists for all environments, 2) lower environments remain usable and testable after sync with little manual effort, and 3) sync is automated and repeatable. To ensure this, we decided to:

  1. Create a static list for device data that closely resembles prod and can be used in any environment. Although this option only partially achieves the goal of having consistency between environments, as prod would not be touched, the engineering team determined that the advantages of having consistent data in lower environments are worth the tradeoff of not having full data consistency. Additionally, this is the simplest solution and will provide significant benefit for low effort.
  2. Implement scheduled automated wiping and re-seeding of environments. Work to implement a daily wipe of Training was already planned, so adding re-seeding device data is a logical add-on to this work. This automation should be available in the local environment as well, though on an as-needed rather than scheduled basis.
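The two steps above can be sketched as a single repeatable routine: wipe the device table, then reload it from the static list. The Python example below uses an in-memory SQLite database purely for illustration. The table layout, the `reseed_devices` function, and the device entries are hypothetical and do not reflect SimpleReport's schema or its real device list.

```python
import sqlite3

# Hypothetical static device list; the real list would closely resemble prod.
STATIC_DEVICES = [
    ("Example Analyzer A", "Example Manufacturer"),
    ("Example Analyzer B", "Example Manufacturer"),
]


def reseed_devices(conn: sqlite3.Connection) -> int:
    """Wipe the device table and reload it from the static list.

    Returns the number of devices present afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS device ("
        "name TEXT PRIMARY KEY, manufacturer TEXT NOT NULL)"
    )
    # The wipe and the re-seed commit together, or roll back together on error.
    with conn:
        conn.execute("DELETE FROM device")
        conn.executemany(
            "INSERT INTO device (name, manufacturer) VALUES (?, ?)",
            STATIC_DEVICES,
        )
    return conn.execute("SELECT COUNT(*) FROM device").fetchone()[0]


conn = sqlite3.connect(":memory:")
print(reseed_devices(conn))  # 2

# Simulate drift between scheduled runs, then re-seed again.
conn.execute("INSERT INTO device VALUES ('Manually added device', 'Someone')")
print(reseed_devices(conn))  # 2
```

Because the routine always ends in the same state regardless of what was in the table beforehand, it can run on a schedule in shared environments or on demand in a local one.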

Consequences

Implementing scheduled automation for Training and Demo will create a prod-like experience for new SimpleReport users, and adding similar automation to Test will create consistency and reliability for e2e tests run against Test.

Workflows

Notification to support team when maintenance banner is updated

Date: 06/2023

Status: Accepted

Context

As part of the support research effort, it was mentioned that when the banner goes up indicating a prod outage or other active issue with the application’s functionality, the support team is not notified and is sometimes taken by surprise. We have an existing GitHub workflow that posts and removes the maintenance banner in the application, so it makes sense to extend that workflow with logic that notifies the support team of those changes.

Decision

The team decided to implement a custom GitHub action that sends a notification email to the support team whenever the maintenance workflow runs successfully and changes the banner. The custom action uses the SimpleReport SendGrid account to send the email. A new API token with restricted email sending access was created for this.

The email notification was chosen over a Slack notification because it requires fewer external dependencies. To set up the slack-send action, we would need to create an application in the CDC Slack workspace. Only a workspace admin can do this, and because the workspace spans the entire CDC organization, getting approvals and coordinating the setup would require more involvement than a notification by email.

Consequences

An email will be sent to the support team as part of the maintenance workflow. The email will indicate whether the banner is being displayed or taken down, as well as the content of the banner itself.
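As a rough illustration of what such a notification could contain, the Python sketch below composes the message with the standard library only. It does not show delivery: this record does not describe how the custom action calls SendGrid, so that step is left out. The addresses, the subject wording, and the `build_banner_notification` function are assumptions made for the example.

```python
from email.message import EmailMessage

# Hypothetical addresses; the real sender and recipients are not given here.
SENDER = "noreply@example.com"
SUPPORT_TEAM = "support@example.com"


def build_banner_notification(banner_active: bool, banner_text: str) -> EmailMessage:
    """Compose (but do not send) the support notification email."""
    action = "displayed" if banner_active else "taken down"
    message = EmailMessage()
    message["From"] = SENDER
    message["To"] = SUPPORT_TEAM
    message["Subject"] = f"Maintenance banner {action}"
    message.set_content(
        f"The maintenance banner is being {action}.\n\n"
        f"Banner content:\n{banner_text}\n"
    )
    return message


notification = build_banner_notification(
    True, "SimpleReport is currently experiencing an outage."
)
print(notification["Subject"])  # Maintenance banner displayed
print(notification.get_content())
```

Keeping message composition separate from delivery means the wording can be checked without an API token or network access.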
