Deep dive into MongoDB Aggregation Framework
In this article, we will take a deep dive into advanced usage of the MongoDB aggregation framework to retrieve
documents from multiple collections, using aggregation pipeline queries that perform much faster than
the older functional map-reduce model.
Run a database query in one go
Retrieving a large number of records in a single database query is a daunting task.
The millions of documents stored in the database have complex relationships between data nodes, which must
flow seamlessly through the software application service.
A data node defines the business-data processing workload needed to perform intensive tasks in real time.
Business software applications (a stock market tracker, disaster management, a cryptocurrency marketplace,
a census dashboard) require in-flight analytical data from a single database query that returns grouped data
on a fast business transaction timeline.
Techniques to handle large datasets
MongoDB provides data processing functions to analyze large datasets and produce final aggregated results.
Map-reduce and the aggregation pipeline are two ways of aggregating, grouping, and filtering data nodes in
MongoDB through several functional stages.
- What is Map-Reduce?
Map-reduce is a JavaScript functional model in which a map function emits key-value pairs from each
document and passes the values to a reduce function, which aggregates the data node and returns the output
to the client.
-- Structure of Map-Reduce in MongoDB
MongoDB uses the "mapReduce" database command on a collection to apply the aggregation query.
The map-reduce API lets you write key-value pair objects in the map function and the computational
logic in the reduce function to finalize the result.
For example, a "mapReduce" call to count the number of IMDb votes for the movies stored in a collection.
<<-- screenshot movie list -->>
The MongoDB collection has 23,500 movie records.
db.movies.mapReduce(
    // map
    function() {
        emit(1, this.imdb.votes);
    },
    // reduce
    function(key, vals) {
        var count = 0;
        vals.forEach(function(v) {
            count += v;
        });
        return count;
    },
    {
        // query
        query: {
            "imdb.rating": {
                "$lt": 7.5
            }
        },
        // output
        out: "vote_count"
    }
).find(); // in the legacy mongo shell, find() reads back the "vote_count" output collection
Output:
{
"_id" : 1.0,
"value" : 271026594.0
}
<< screenshot: map-reduce (imdb_vote_count).jpg >>
The map-reduce query took 280 ms to count the total votes for the movies with an IMDb rating of less than 7.5.
Why is Map-Reduce much slower in query performance?
1. Sequence of tasks during the process
The map-reduce function uses JavaScript, which runs in the single-threaded SpiderMonkey engine. A map-reduce
operation involves many tasks, such as reading the input collection, executing the map function, executing
the reduce function, writing to a temporary collection during processing, and writing the output
collection.
2. Map-reduce concurrency & global write lock
A map-reduce operation applies many global write locks while processing the queries,
which makes the execution much slower to generate the final document.
-- global write lock
- A read lock is applied to the input collection, yielding after every 100 documents fetched.
- A write lock is applied while creating the output collection and while inserting, merging, replacing, or reducing documents into it.
- The write lock is global and blocks all the remaining operations on the mongod instance.
The MongoDB 5.0 release deprecated map-reduce in the database engine. The aggregation pipeline is
much faster at querying multiple collections with complex data-node relationships.
However, MongoDB provides a few aggregation pipeline operators [$group, $merge, $accumulator, $function]
in the framework layer to facilitate rewriting map-reduce operations as aggregation pipelines.
- What is the Aggregation pipeline?
The aggregation pipeline works as a data processing framework in MongoDB to aggregate hierarchical
data nodes. A pipeline runs several stages that process the documents and filter out the final
analyzed document.
The pipeline operators apply logical queries and accumulate the filtered records to transform them into
an output structure. The performance of the aggregation pipeline is significantly higher than map-reduce
because the framework runs as compiled C++ code.
Data flows through a sequence of stages in which the output of the previous pipeline stage
becomes the input to the next pipeline stage. The pipeline operator decides the type of aggregation action
to perform in the database engine.
- How to write an Aggregation pipeline?
A pipeline written in BSON format is passed directly into the "aggregate()" function in the Mongo shell to
execute the aggregation query. MongoDB also provides language-specific drivers with a built-in
aggregation framework and structured API functions to implement the aggregation pipeline in an application.
The following "aggregate" call counts the total votes using the aggregation pipeline.
Stage - 1
The "$match" operator applies a conditional query to filter the documents with an "imdb.rating" of less than 7.5.
Stage - 2
The second stage takes the filtered documents as input, accumulates the $sum of "imdb.votes", and groups
the vote_count using the "$group" operator.
Query:
db.movies.aggregate(
    // Pipeline
    [
        // Stage 1
        {
            $match: {
                // conditional query
                "imdb.rating": {
                    "$lt": 7.5
                }
            }
        },
        // Stage 2
        {
            $group: {
                _id: "total_vote",
                vote_count: {
                    "$sum": "$imdb.votes"
                }
            }
        }
    ]
);
Output:
{
"_id" : "total_vote",
"vote_count" : NumberInt(271026594)
}
<< screenshot: mongo-aggregation-pipelin(imdb_vote_count).png >>
Complex Aggregation Pipelines
The MongoDB aggregation framework has many operators that support aggregation in the pipeline stages.
The operators are treated as MongoDB query commands that perform a query action and then either pass
the documents to the next available pipeline stage or collect the filtered documents at the final
pipeline stage. The pipeline execution is a finite sequence of stages that perform several aggregation
functions:
- Sorting and merging documents from more than one collection.
- Arithmetic computation over numeric data nodes.
- Accumulating a data node by applying Mongo aggregation expressions ($avg, $sum, $addFields, $concat).
- Applying geospatial commands to find nearby addresses or calculate min/max distances.
- and many more
Types of Aggregation Pipeline Operators
- $project
The $project operator lets you include or exclude specific fields of a document in a single
stage.
Example:
In the following movie record, only the fields 'title', 'plot', 'year', 'rating', 'rated', 'language',
'poster', 'director', 'imdb', 'awards', and 'genres' are required to be displayed from the collection.
<< gist: https://gist.github.com/Deeptiman/7878b1551b907c1fa7b58b7d911340c6 >>
$project aggregation query
<< gist: https://gist.github.com/Deeptiman/367e8ab17df387a2038521e6e9c01dbc >>
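For illustration, a minimal sketch of such an inclusion projection, assuming the field names listed above
(the exact gist content may differ):
db.movies.aggregate([
    {
        $project: {
            _id: 0,        // exclude the default _id field
            title: 1,      // 1 marks a field for inclusion
            plot: 1,
            year: 1,
            rating: 1,
            rated: 1,
            language: 1,
            poster: 1,
            director: 1,
            imdb: 1,
            awards: 1,
            genres: 1
        }
    }
]);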
result
<< gist: https://gist.github.com/Deeptiman/9e2771ff97595f7f1c67d15b4fcfcbf4 >>
Problems with the $project stage
The $project operator is quite verbose and lengthy, as every field must be defined explicitly in the
projection. As the example shows, the $project stage is repetitive when including field items.
The projection becomes even more tedious when including fields across a large set of documents, and
refactoring the $project stage to define a new field in the result is not optimal.
Optimized solution with $set and $unset
$set and $unset provide an efficient way to write field inclusion and exclusion in the pipeline.
Only the specific fields to add, modify, or remove need to be named in the stage.
query:
<< gist: https://gist.github.com/Deeptiman/fd14f26419118356d6434e9996f10216 >>
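A minimal sketch of the same idea; the field names dropped here are hypothetical stand-ins for whatever
the document does not need:
db.movies.aggregate([
    // add or modify just the fields we care about ...
    {
        $set: {
            rating_category: {
                $cond: [{ $gte: ["$imdb.rating", 7.5] }, "high", "regular"]
            }
        }
    },
    // ... and drop the unwanted ones; every other field passes through untouched
    { $unset: ["fullplot", "cast", "countries"] }
]);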
result:
<< gist: https://gist.github.com/Deeptiman/cda4dca212a0023262355daabaf91bbd >>
As we can see, the same projection query is much clearer and more flexible to write using the $set and $unset
stages.
Best scenario for $project over $set/$unset
The "$project" stage is the best choice when the output document needs a completely different shape from the
stored document. Here, the same movie document must be displayed in a new shape with a few lines
of projection query.
output document:
<< gist: https://gist.github.com/Deeptiman/b23a0a2bb2235087fab4cf66039bdce4 >>
Bad approach with $set/$unset
<< gist: https://gist.github.com/Deeptiman/a8698444a7a79920721e18f9fca4bb8d >>
Good approach with $project
<< gist: https://gist.github.com/Deeptiman/36b2dcd3363cad48948e18d4202c78d9 >>
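As a sketch of such a reshaping projection (the new output field names here are hypothetical):
db.movies.aggregate([
    {
        $project: {
            _id: 0,
            name: "$title",            // rename title -> name
            released_in: "$year",      // rename year -> released_in
            score: "$imdb.rating",     // promote a nested field to the top level
            summary: {                 // compute a brand-new field
                $concat: ["$title", " (", { $toString: "$year" }, ")"]
            }
        }
    }
]);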
- $group
The $group operator groups documents by a distinct _id expression. The output document
holds the accumulated value for each group.
Example-1
- Find the total number of votes for all the movies using the $sum accumulator expression.
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
query:
<< gist: https://gist.github.com/Deeptiman/ac615c6e67715bf673f14587ca9f5a18 >>
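One possible shape of this query, as a minimal sketch (the gist may name things differently):
db.movies.aggregate([
    {
        $group: {
            _id: null,                              // one group covering every movie
            total_votes: { $sum: "$imdb.votes" }    // accumulate the vote count
        }
    }
]);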
result:
<< gist: https://gist.github.com/Deeptiman/1b0fd6225e6a33e34e00785648014bb5 >>
Example-2
- Find the most recent movie document with the genre "War" using the "$last" accumulator expression.
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
query:
<< gist: https://gist.github.com/Deeptiman/ea9ebf524e3df3e338365be22759876a >>
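A minimal sketch of the idea; $last picks the final document in group order, so the documents must be
sorted before grouping:
db.movies.aggregate([
    { $match: { genres: "War" } },    // keep only "War" movies
    { $sort: { year: 1 } },           // oldest first, so $last yields the most recent
    {
        $group: {
            _id: "recent_war_movie",
            title: { $last: "$title" },
            year: { $last: "$year" }
        }
    }
]);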
result:
<< output of the latest movie doc >>
<< gist: https://gist.github.com/Deeptiman/55c01af050d3f482a950aafad26033f9 >>
- $lookup
As a best practice, MongoDB distributes documents across multiple collections to reduce the
storage volume of each collection. The $lookup operator searches for data nodes across
collections that maintain relationships, aggregating them into the output record.
$lookup query structure
-- Structure 1
{
$lookup:
{
from: <collection to join>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
}
}
-- Structure 2
{
$lookup:
{
from: <collection to join>,
let: { <var_1>: <expression>, …, <var_n>: <expression> },
pipeline: [ <pipeline to run on joined collection> ],
as: <output array field>
}
}
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
list of comments for the movies:
<< gist: https://gist.github.com/Deeptiman/d3e9afff4a6901d17a6bed5e82b7be6c >>
Example - 1:
The comments for the movies are stored in a separate collection, and the requirement is to
retrieve the comments for movies with the genre "Comedy" and an "imdb.rating" greater than 5.
query:
<< gist: https://gist.github.com/Deeptiman/4c7fb54ac9fa2717b335e4e118617fd0 >>
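As a minimal sketch using Structure 1 above, assuming the comments collection references movies through a
movie_id field (a hypothetical name):
db.movies.aggregate([
    // filter the movies first
    { $match: { genres: "Comedy", "imdb.rating": { $gt: 5 } } },
    // then join the matching comments
    {
        $lookup: {
            from: "comments",          // collection to join
            localField: "_id",         // movie key on this side
            foreignField: "movie_id",  // matching key in comments (assumed field name)
            as: "movie_comments"       // output array field
        }
    }
]);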
result:
<< gist: https://gist.github.com/Deeptiman/00dec26c13ccd412584b128619788f93 >>
Example - 2:
Retrieve the comments sorted in ascending order. Here we apply the $lookup pipeline attribute to run
a sorting pipeline within the joined collection.
query:
<< gist: https://gist.github.com/Deeptiman/e125537ef6c328fb3b989cc39b59da57 >>
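A sketch using Structure 2 above; the movie_id and date field names are assumptions:
db.movies.aggregate([
    {
        $lookup: {
            from: "comments",
            let: { movieId: "$_id" },   // expose the movie key to the inner pipeline
            pipeline: [
                { $match: { $expr: { $eq: ["$movie_id", "$$movieId"] } } },
                { $sort: { date: 1 } }  // sort the joined comments in ascending order
            ],
            as: "movie_comments"
        }
    }
]);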
results:
<< list of sorted comments for the movie >>
<< gist: https://gist.github.com/Deeptiman/51cee243d482a6864ff1cf609c138f21 >>
- $graphLookup
The $graphLookup operator performs a recursive search on collections using graph techniques that
connect document fields as nodes and edges. The recursive search continues up to the maximum
recursion depth defined by the "maxDepth" parameter.
$graphLookup query structure
---
{
    $graphLookup: {
        from: <collection to join>,
        startWith: <expression that begins the search>,
        connectFromField: <field name to map to the `from` collection>,
        connectToField: <field name that maps to the aggregated collection>,
        as: <output array field>,
        maxDepth: <maximum depth of the recursive search>
    }
}
Example:
Find the comments given by a user, matched on the email address.
list of users
<< gist: https://gist.github.com/Deeptiman/0b01e66497933cd4b33a3df9e7524263 >>
query:
<< gist: https://gist.github.com/Deeptiman/413e735f274999841a75e0e666110c2a >>
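A minimal sketch under the assumption that both collections carry an email field (the address shown is a
placeholder); a single hop is enough here, so maxDepth stays at 0:
db.users.aggregate([
    { $match: { email: "user@example.com" } },   // start from one user document
    {
        $graphLookup: {
            from: "comments",
            startWith: "$email",        // begin the search with the user's email
            connectFromField: "email",
            connectToField: "email",    // match comments carrying the same email
            as: "user_comments",
            maxDepth: 0                 // one hop; no deeper recursion needed
        }
    }
]);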
result:
<< gist: https://gist.github.com/Deeptiman/ec21b1f8c2d96f1d0a42953f6ea1641a >>
- $geoNear
MongoDB supports geolocation queries with the $geoNear aggregation stage. There are plenty of use cases for
geographic coordinates, such as calculating the distance to nearby points of interest. A GeoJSON coordinate
pair works well in the aggregation to point to the closest documents in the collection.
Example:
Find the nearby theaters within a certain radius of the given lat/long coordinates.
<< list of theaters >>
<< gist: https://gist.github.com/Deeptiman/93bc3627b593498285d4246c099c665d >>
query:
<< gist: https://gist.github.com/Deeptiman/edd2dddabbbefb6cd1c89c96d661e1ed >>
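A minimal sketch, assuming the theaters collection has a 2dsphere index on its GeoJSON location field
(the coordinates and field names are placeholders):
db.theaters.aggregate([
    {
        $geoNear: {
            // GeoJSON points are [longitude, latitude]
            near: { type: "Point", coordinates: [-73.9667, 40.78] },
            distanceField: "distance_meters",  // computed distance is written here
            maxDistance: 5000,                 // radius in meters
            spherical: true
        }
    }
]);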
result:
<< gist: https://gist.github.com/Deeptiman/e0592da8a6242d1f6def22aa4a5a4e51 >>
Information Security with the Mongo Aggregation framework
So far, we have seen many different Mongo aggregation techniques to retrieve documents from a
collection for analysis by the client application. But not every category of data needs to be
visible or readable to a third-party client. Access to sensitive fields must remain confidential
to the owner to preserve information security. Any database engine authorized to store
sensitive information is regulated by data privacy law. Organizations go through a series of
regulatory audits, presenting their third-party risk management and cybersecurity infrastructure, to meet
the compliance requirements for handling business transaction data in a software application.
- Read-only collection using a View
A MongoDB view works quite similarly to a SQL view: it queries other collections through a predefined
aggregation pipeline to retrieve the filtered documents that are allowed to be visible to the client
application.
Example:
A few user records contain sensitive personal information, and when the client application tries to
read the user data, the sensitive personal information must be restricted.
<< user personal info >>
<< gist: https://gist.github.com/Deeptiman/8aa96883621aff5b2fcb104999afa304 >>
query:
1. Create the `user_data_list` view by querying the `personal_info` collection, removing `mobile` and
`passport_number`.
<< gist: https://gist.github.com/Deeptiman/7d7632aa742e08975ee7f74beb94aca8 >>
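As a minimal sketch of such a view definition, using the collection and field names described above:
db.createView(
    "user_data_list",      // view name
    "personal_info",       // source collection
    [
        // strip the sensitive fields before anything reaches the client
        { $unset: ["mobile", "passport_number"] }
    ]
);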
result:
<< gist: https://gist.github.com/Deeptiman/0b5a1c060519da3e0831c33b5f553be2 >>
2. Read the `user_data_list` view
<< gist: https://gist.github.com/Deeptiman/f993e51a6c38bc385c043779e5a522d9 >>
result:
<< gist: https://gist.github.com/Deeptiman/befe3ad4eb8c97b21aa99f21ea888c9c >>
So, from the `user_data_list` view result we can see that the `mobile` and `passport_number` details are
removed and restricted for the client.
The client application will only have access to the created view.
- Aggregation techniques to mask sensitive fields
MongoDB aggregation operators provide techniques to mask sensitive field information so that it remains
obfuscated in the client application. As a scenario, we will try to mask credit card information without
exposing sensitive fields:
- Partially obfuscate the card holder's name.
- Obfuscate the first 12 digits of the card number and retain only the last 4 digits.
- Randomize the expiry date-time by adding or subtracting up to 30 days.
- Replace the 3-digit CVC code with a random set of 3 digits.
<< credit card info table >>
<< gist: https://gist.github.com/Deeptiman/f28ac8192ac3ad528bd2406d785afa13 >>
query:
<< gist: https://gist.github.com/Deeptiman/0c2b389f4661d0011d6b86810039df49 >>
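A minimal sketch of one way to do this; the field names are hypothetical, the card number is assumed to be
stored as a 16-character string and the expiry as a date, and $rand requires MongoDB 4.4.2 or later:
db.credit_cards.aggregate([
    {
        $set: {
            // keep the first 2 characters of the name, mask the rest
            card_holder: {
                $concat: [{ $substrCP: ["$card_holder", 0, 2] }, "*****"]
            },
            // mask the first 12 digits, retain the last 4
            card_number: {
                $concat: ["XXXXXXXXXXXX", { $substrCP: ["$card_number", 12, 4] }]
            },
            // shift the expiry by a random offset within +/- 30 days
            // (2592000000 ms = 30 days; $rand yields a value in [0, 1))
            card_expiry: {
                $add: [
                    "$card_expiry",
                    {
                        $floor: {
                            $multiply: [
                                { $subtract: [{ $multiply: [{ $rand: {} }, 2] }, 1] },
                                2592000000
                            ]
                        }
                    }
                ]
            },
            // replace the CVC with a random 3-digit number (100-999)
            card_cvc: {
                $floor: { $add: [{ $multiply: [{ $rand: {} }, 900] }, 100] }
            }
        }
    }
]);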
result:
<< gist: https://gist.github.com/Deeptiman/645bd929da28e1fdb4363378be77dfdd >>
Performance of MongoDB Aggregation Queries
The performance of an aggregation query is analyzed using the explain plan. MongoDB supports
three verbosity modes, queryPlanner, executionStats, and allPlansExecution, to explain a query
execution plan.
While optimizing the aggregation query, the database engine reorders the pipeline stages.
The first pipeline stage, the $cursor operator, picks the best possible winning plan and forms the metadata
used to decide the type of document scan operation [IXSCAN, COLLSCAN, FETCH], speeding up the
document retrieval process in the pipeline stage.
Explain with queryPlanner
db.getCollection("movies").explain("queryPlanner").aggregate(pipeline)
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"namespace" : "movie_details.movies",
"indexFilterSet" : false,
"parsedQuery" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"queryHash" : "148454AD",
"planCacheKey" : "42DB0972",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "PROJECTION_DEFAULT",
"transformBy" : {
"imdb.votes" : 1.0,
"_id" : 0.0
},
"inputStage" : {
"stage" : "COLLSCAN", // scan entire collection to match with the record
"filter" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"direction" : "forward"
}
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"$const" : "total_vote"
},
"vote_count" : {
"$sum" : "$imdb.votes"
}
}
}
]
Here the "winningPlan" stage metadata is scanned with the "COLLSCAN" operation, so the entire collection
requires iterating all rows to map/match the particular records. The COLLSCAN is pretty slow and very
much ineffective mapping with an enormously large data set. The queryPlanner also supports indexing
with IXSCAN that filters instantly with the particular record.
Explain with executionStats
The statistics of the winning plan's query execution are contained in "executionStats" mode.
db.getCollection("movies").explain("executionStats").aggregate(pipeline)
"stages" : [
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 18511.0,
"executionTimeMillis" : 42.0,
"totalKeysExamined" : 0.0,
"totalDocsExamined" : 23530.0,
"executionStages" : {
"stage" : "PROJECTION_DEFAULT",
"nReturned" : 18511.0,
"executionTimeMillisEstimate" : 2.0,
"works" : 23532.0,
"advanced" : 18511.0,
"needTime" : 5020.0,
"needYield" : 0.0,
"saveState" : 24.0,
"restoreState" : 24.0,
"isEOF" : 1.0,
"transformBy" : {
"imdb.votes" : 1.0,
"_id" : 0.0
},
"inputStage" : {
"stage" : "COLLSCAN",
"filter" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"nReturned" : 18511.0,
"executionTimeMillisEstimate" : 1.0,
"works" : 23532.0,
"advanced" : 18511.0,
"needTime" : 5020.0,
"needYield" : 0.0,
"saveState" : 24.0,
"restoreState" : 24.0,
"isEOF" : 1.0,
"direction" : "forward",
"docsExamined" : 23530.0
                    }
                }
            }
        },
        "nReturned" : NumberLong(18511),
        "executionTimeMillisEstimate" : NumberLong(29)
    }
]
nReturned: The total number of documents returned by the query; in this case, 18511 movies were returned
to the client.
executionTimeMillis: The query execution time in milliseconds, which helps in understanding the performance
of the aggregation query.
filter: The conditional query statement included in the aggregation pipeline:
"imdb.rating" : { "$lt" : 7.5 }
docsExamined: The total number of documents examined in the collection to match the particular search record.
Visualizing Aggregation flow with Studio 3T
In Studio 3T, the transactional flow shows the total time taken to produce the result and the size of the
documents generated at each stage of the pipeline. The explain() query in Studio 3T is quite helpful
for visualizing the process of an aggregation pipeline.
<< -- screenshot of aggregation flow -- >>
Execute Aggregation pipeline in MongoDB Compass GUI
The MongoDB Compass GUI is pretty handy for executing aggregation queries. Debugging the
aggregation pipeline there helps to practically test the pipeline before applying it in any
programming language.
<< -- screenshot of mongodb compass aggregation flow -- >>
Conclusion
The MongoDB aggregation framework handles complex data processing workloads. The pipeline stages
work much like functional programming: the output of one pipeline stage becomes the input
to the next pipeline stage. In this article, we executed aggregation pipeline queries across
many different scenarios to understand practical techniques for mass data manipulation.
The MongoDB aggregation framework looks different from regular procedural programming at the beginner
level, but after understanding the building structure of the aggregation expression operators, developers
will find the aggregation framework similar to a Domain-Specific Language (DSL).