Performance
Major performance issues should be addressed as they come up, but profiling will reveal where the most time is being spent.
Some techniques to keep in mind, especially for models with large amounts of data in their columns or with joins upon joins, are:
- lazy loading; and
- selective querying (i.e., fetching only the subset of columns you need, not all of them); see the sketch after the links below.
https://docs.sqlalchemy.org/en/20/orm/queryguide/relationships.html#lazy-loading
https://docs.sqlalchemy.org/en/20/orm/queryguide/columns.html
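A minimal sketch of both techniques with SQLAlchemy 2.0 (the connection URL, the DataSource.children relationship, and the name column are assumptions for illustration):
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, lazyload, load_only
from myapp.models import DataSource

engine = create_engine('postgresql://localhost/myapp')  # assumed connection URL

with Session(engine) as session:
    rows = (
        session.query(DataSource)
        .options(
            # Defer loading the related collection until it is accessed
            lazyload(DataSource.children),
            # Fetch only the columns that are actually needed
            load_only(DataSource.id, DataSource.name),
        )
        .all()
    )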
Note: the following section is no longer relevant, as MongoDB has been implemented for document storage.
Optimizing the large import / post-import query
The following index is applied:
CREATE INDEX idx_data_source_jsonb ON data_source USING gin (document);
You can also create a more targeted index if you know the specific JSONB paths that are frequently accessed.
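For example, if queries frequently filter on a single top-level key (here a hypothetical "type" key), a targeted expression index could be created through SQLAlchemy like this:
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/myapp')  # assumed connection URL

with engine.begin() as conn:
    # Expression index on one frequently accessed JSONB path; much smaller
    # and cheaper to maintain than a GIN index over the whole document
    conn.execute(text(
        "CREATE INDEX idx_data_source_document_type "
        "ON data_source ((document ->> 'type'))"
    ))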
Use a specialized JSONB storage format: zson is a third-party PostgreSQL extension (not a core feature) that trains a shared compression dictionary on your JSONB documents and stores them in a compressed zson type, which can reduce I/O for documents with many repeated keys. If the extension is available for your PostgreSQL version, you can try switching the document column to the zson type to see if it improves performance.
Use a specialized JSONB search engine: There are also specialized search engines designed specifically for JSONB data, such as Elasticsearch or MongoDB. If your application requires frequent searches or queries on JSONB data, you can consider using one of these search engines instead of relying on PostgreSQL's built-in JSONB support.
Elasticsearch
Elasticsearch can index the entire JSON document and allow you to perform full-text searches across all properties, regardless of their structure. This is especially useful when dealing with unstructured data like JSON.
To work with Elasticsearch from the same Python application, you can use the elasticsearch-dsl-py library, which provides an easy-to-use mapping API to define your index and query your documents.
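A minimal sketch with elasticsearch-dsl (the index name and fields are hypothetical):
from elasticsearch_dsl import Document, Keyword, Text, connections

connections.create_connection(hosts=['http://localhost:9200'])

class DataSourceDoc(Document):
    name = Keyword()
    body = Text()  # full-text-searchable field

    class Index:
        name = 'data_sources'  # hypothetical index name

# Create the index mapping, store a document, then run a full-text query
DataSourceDoc.init()
DataSourceDoc(meta={'id': 1}, name='example', body='some imported text').save()
results = DataSourceDoc.search().query('match', body='imported').execute()
for hit in results:
    print(hit.meta.id, hit.name)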
MongoDB
MongoDB has some performance advantages over Postgres for path-based queries on nested data structures, as it is designed to handle JSON-like documents natively. MongoDB's BSON format (Binary JSON) supports a rich set of data types, including nested objects and arrays, which can be queried using MongoDB's query language.
MongoDB has a powerful query language that supports a range of operators for querying and manipulating data. This includes operators for querying nested objects and arrays, such as the $elemMatch operator for querying arrays, and the $exists operator for querying the existence of fields.
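For example, with PyMongo (the collection and field names are hypothetical):
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['myapp']

# Find documents whose "items" array contains at least one element with
# qty > 10, and which have a nested "metadata.author" field at all
cursor = db.documents.find({
    'items': {'$elemMatch': {'qty': {'$gt': 10}}},
    'metadata.author': {'$exists': True},
})
for doc in cursor:
    print(doc['_id'])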
Additionally, MongoDB has built-in support for sharding and replication, which allows for scaling horizontally across multiple servers. This can provide significant performance advantages for handling large volumes of data.
Overall, while Postgres' JSONB format provides a powerful and flexible way to store and query JSON-like data, MongoDB may be a better fit for certain use cases where path-based queries on nested data structures are a primary concern.
It's possible to keep the metadata in the current data_source table and move the documents to MongoDB. One way to achieve this is to store the documents in a separate MongoDB collection and keep a reference to each document in the data_source table.
For example, you could create a new collection in MongoDB called documents and store the documents there. You would also add a new column to the data_source table called document_id, containing the ID of the corresponding document in MongoDB. Note that PostgreSQL cannot enforce a real foreign key constraint against MongoDB, so referential integrity between the two stores has to be maintained by the application.
When you need to retrieve a document, you would look up the row in the data_source table and then fetch the referenced document from the documents collection using its document_id. This requires two lookups rather than a single join, but returns both the metadata and the document.
To make this process seamless, you could write a wrapper function that would abstract away the details of retrieving the data from the two tables. The function would take the data_source ID as input, retrieve the metadata and the document from the two tables, and return them as a single object.
Here's an example implementation using the PyMongo library:
from bson import ObjectId
from pymongo import MongoClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from myapp.models import DataSource

engine = create_engine('postgresql://localhost/myapp')  # assumed connection URL
Session = sessionmaker(bind=engine)

mongo_client = MongoClient('mongodb://localhost:27017/')
mongo_db = mongo_client['myapp']

def get_data_source_with_document(data_source_id):
    with Session() as session:
        # Retrieve the metadata row from PostgreSQL
        data_source = (
            session.query(DataSource)
            .filter(DataSource.id == data_source_id)
            .one()
        )
        # Fetch the referenced document from MongoDB, assuming the
        # document_id column stores the ObjectId as a string
        document = mongo_db.documents.find_one(
            {'_id': ObjectId(data_source.document_id)}
        )
        # Combine the metadata and document into a single object
        return {'metadata': data_source.to_dict(), 'document': document}
In this example, the get_data_source_with_document function retrieves the metadata row from PostgreSQL, fetches the referenced document from MongoDB using the stored document_id, and combines the two into a single object before returning it.
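For completeness, here is a matching sketch of the write path, reusing the Session factory, mongo_db handle, and DataSource model from the example above (the DataSource constructor arguments are hypothetical):
def create_data_source_with_document(metadata, document):
    # Insert the document into MongoDB first to obtain its ObjectId
    inserted_id = mongo_db.documents.insert_one(document).inserted_id
    with Session() as session:
        # Store the metadata row in PostgreSQL with a string reference
        # to the MongoDB document
        data_source = DataSource(**metadata, document_id=str(inserted_id))
        session.add(data_source)
        session.commit()
        return data_source.id
Because referential integrity cannot be enforced across the two stores, the application should also handle the failure case, e.g. deleting the MongoDB document if the SQL insert fails.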