Deep dive into MongoDB Aggregation Framework
In this article, we will take a deep dive into advanced usage of the MongoDB aggregation framework to retrieve
documents from multiple collections, using aggregation pipeline queries that perform much faster than
the older functional map-reduce model.
Run a database query in one go
Retrieving a large number of records in a single database query is a daunting task.
The millions of documents stored in the database have complex relationships between data nodes, which must
flow seamlessly through the software application service.
A data node defines the business-data processing workload needed to perform intensive tasks in real time.
Business software applications (a stock market tracker, disaster management, a cryptocurrency marketplace,
a census dashboard) require in-flight analytical data from a single database query that returns grouped data
on a fast business transaction timeline.
Techniques to handle large datasets
MongoDB provides data processing functions to analyze large datasets and produce final aggregated results.
Map-reduce and the aggregation pipeline are two ways of aggregating, grouping, and filtering data nodes in
MongoDB through several functional stages.
- What is Map-Reduce?
Map-reduce is a JavaScript functional model in which a map function emits key-value pairs from each
document and passes the values to a reduce function, which aggregates the data node and returns the output
to the client.
-- Structure of Map-Reduce in MongoDB
MongoDB uses the "mapReduce" database command on a collection to apply the aggregation query.
The map-reduce API lets you write key-value pair objects in the map function and the computational
logic in the reduce function to finalize the result.
For example, a "mapReduce" call to count the number of IMDb votes for the movies stored in a collection.
<<-- screenshot movie list -->>
The MongoDB collection has 23,500 movie records.
db.movies.mapReduce(
    // map
    function() {
        emit(1, this.imdb.votes);
    },
    // reduce
    function(key, vals) {
        var count = 0;
        vals.forEach(function(v) {
            count += v;
        });
        return count;
    },
    {
        // query
        query: {
            "imdb.rating": {
                "$lt": 7.5
            }
        },
        // output
        out: "vote_count"
    }
).find(); // in the legacy mongo shell, find() reads back the "vote_count" output collection
Output:
{
"_id" : 1.0,
"value" : 271026594.0
}
<< screenshot: map-reduce (imdb_vote_count).jpg >>
The map-reduce query took 280 ms to count the total votes for the movies with an IMDb rating of less than 7.5.
Why is Map-Reduce much slower in query performance?
1. Sequence of tasks during the process
The map-reduce function uses JavaScript, which runs in the single-threaded SpiderMonkey engine. A map-reduce
operation involves many tasks, such as reading the input collection, executing the map function, executing
the reduce function, writing to a temporary collection during processing, and writing the output
collection.
2. Map-reduce concurrency & global write lock
A map-reduce operation applies many global write locks while processing the queries,
which makes the execution much slower to generate the final document.
-- global write lock
- A read lock is applied to the input collection, yielding after every 100 documents fetched.
- A write lock is applied while creating the output collection and while inserting, merging, replacing, or reducing documents into it.
- The write lock is global and blocks all the remaining operations on the mongod instance.
The MongoDB 5.0 release deprecated map-reduce in the database engine. The aggregation pipeline is
much faster at querying multiple collections with complex data-node relationships.
However, MongoDB provides a few aggregation pipeline operators [$group, $merge, $accumulator, $function]
in the framework layer to facilitate rewriting map-reduce operations as aggregation pipelines.
- What is the Aggregation pipeline?
The aggregation pipeline works as a data processing framework in MongoDB to aggregate hierarchical
data nodes. A pipeline runs several stages that process the documents and filter out the final
analyzed document.
The pipeline operators apply logical queries and accumulate the filtered records to transform them into
an output structure. The performance of the aggregation pipeline is significantly higher than map-reduce
because the framework runs as compiled C++ code.
Data flows through a sequence of stages in which the output of the previous pipeline stage
becomes the input to the next pipeline stage. The pipeline operator decides the type of aggregation action
to perform in the database engine.
- How to write an Aggregation pipeline?
A pipeline written in BSON format is passed directly into the "aggregate()" function in the Mongo shell to
execute the aggregation query. MongoDB also provides language-specific drivers with a built-in
aggregation framework and structured API functions to implement the aggregation pipeline in an application.
The following "aggregate" call counts the total votes using the aggregation pipeline.
Stage - 1
The "$match" operator applies a conditional query to filter the documents with an "imdb.rating" of less than 7.5.
Stage - 2
The second stage takes the filtered documents as input, accumulates the $sum of "imdb.votes", and groups
the vote_count using the "$group" operator.
Query:
db.movies.aggregate(
    // Pipeline
    [
        // Stage 1
        {
            $match: {
                // conditional query
                "imdb.rating": {
                    "$lt": 7.5
                }
            }
        },
        // Stage 2
        {
            $group: {
                _id: "total_vote",
                vote_count: {
                    "$sum": "$imdb.votes"
                }
            }
        }
    ]
);
Output:
{
"_id" : "total_vote",
"vote_count" : NumberInt(271026594)
}
<< screenshot: mongo-aggregation-pipelin(imdb_vote_count).png >>
Complex Aggregation Pipelines
The MongoDB aggregation framework has many operators that support aggregation in the pipeline stages.
The operators are treated as MongoDB query commands that perform a query action and then either pass
the documents to the next available pipeline stage or collect the filtered documents at the final
pipeline stage. The pipeline execution is a finite sequence of stages that perform several aggregation
functions:
- Sorting and merging documents from more than one collection.
- Arithmetic computation over numeric data nodes.
- Accumulating a data node by applying Mongo aggregation expressions ($avg, $sum, $addFields, $concat).
- Applying geospatial commands to find nearby addresses or calculate min/max distances.
- and many more
Types of Aggregation Pipeline Operators
- $project
The $project operator lets you include or exclude specific fields of a document in a single
stage.
Example:
In the following movie record, only the fields 'title', 'plot', 'year', 'rating', 'rated', 'language',
'poster', 'director', 'imdb', 'awards', and 'genres' are required to be displayed from the collection.
<< gist: https://gist.github.com/Deeptiman/7878b1551b907c1fa7b58b7d911340c6 >>
$project aggregation query
<< gist: https://gist.github.com/Deeptiman/367e8ab17df387a2038521e6e9c01dbc >>
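For illustration, a minimal sketch of such an inclusion projection, assuming the field names listed above
(the exact gist content may differ):
db.movies.aggregate([
    {
        $project: {
            _id: 0,        // exclude the default _id field
            title: 1,      // 1 marks a field for inclusion
            plot: 1,
            year: 1,
            rating: 1,
            rated: 1,
            language: 1,
            poster: 1,
            director: 1,
            imdb: 1,
            awards: 1,
            genres: 1
        }
    }
]);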
result
<< gist: https://gist.github.com/Deeptiman/9e2771ff97595f7f1c67d15b4fcfcbf4 >>
Problems with the $project stage
The $project operator is quite verbose and lengthy, as every field must be defined explicitly in the
projection. As the example shows, the $project stage is repetitive when including field items.
The projection becomes even more tedious when including fields across a large set of documents, and
refactoring the $project stage to define a new field in the result is not optimal.
Optimized solution with $set and $unset
$set and $unset provide an efficient way to write field inclusion and exclusion in the pipeline.
Only the specific fields to add, modify, or remove need to be named in the stage.
query:
<< gist: https://gist.github.com/Deeptiman/fd14f26419118356d6434e9996f10216 >>
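A minimal sketch of the same idea; the field names dropped here are hypothetical stand-ins for whatever
the document does not need:
db.movies.aggregate([
    // add or modify just the fields we care about ...
    {
        $set: {
            rating_category: {
                $cond: [{ $gte: ["$imdb.rating", 7.5] }, "high", "regular"]
            }
        }
    },
    // ... and drop the unwanted ones; every other field passes through untouched
    { $unset: ["fullplot", "cast", "countries"] }
]);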
result:
<< gist: https://gist.github.com/Deeptiman/cda4dca212a0023262355daabaf91bbd >>
As we can see, the same projection query is much clearer and more flexible to write using the $set and $unset
stages.
Best scenario for $project over $set/$unset
The "$project" stage is the best choice when the output document needs a completely different shape from the
stored document. Here, the same movie document must be displayed in a new shape with a few lines
of projection query.
output document:
<< gist: https://gist.github.com/Deeptiman/b23a0a2bb2235087fab4cf66039bdce4 >>
Bad approach with $set/$unset
<< gist: https://gist.github.com/Deeptiman/a8698444a7a79920721e18f9fca4bb8d >>
Good approach with $project
<< gist: https://gist.github.com/Deeptiman/36b2dcd3363cad48948e18d4202c78d9 >>
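As a sketch of such a reshaping projection (the new output field names here are hypothetical):
db.movies.aggregate([
    {
        $project: {
            _id: 0,
            name: "$title",            // rename title -> name
            released_in: "$year",      // rename year -> released_in
            score: "$imdb.rating",     // promote a nested field to the top level
            summary: {                 // compute a brand-new field
                $concat: ["$title", " (", { $toString: "$year" }, ")"]
            }
        }
    }
]);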
- $group
The $group operator groups documents by a distinct _id expression. The output document
holds the accumulated value for each group.
Example-1
- Find the total number of votes for all the movies using the $sum accumulator expression.
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
query:
<< gist: https://gist.github.com/Deeptiman/ac615c6e67715bf673f14587ca9f5a18 >>
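One possible shape of this query, as a minimal sketch (the gist may name things differently):
db.movies.aggregate([
    {
        $group: {
            _id: null,                              // one group covering every movie
            total_votes: { $sum: "$imdb.votes" }    // accumulate the vote count
        }
    }
]);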
result:
<< gist: https://gist.github.com/Deeptiman/1b0fd6225e6a33e34e00785648014bb5 >>
Example-2
- Find the most recent movie document with the genre "War" using the "$last" accumulator expression.
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
query:
<< gist: https://gist.github.com/Deeptiman/ea9ebf524e3df3e338365be22759876a >>
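A minimal sketch of the idea; $last picks the final document in group order, so the documents must be
sorted before grouping:
db.movies.aggregate([
    { $match: { genres: "War" } },    // keep only "War" movies
    { $sort: { year: 1 } },           // oldest first, so $last yields the most recent
    {
        $group: {
            _id: "recent_war_movie",
            title: { $last: "$title" },
            year: { $last: "$year" }
        }
    }
]);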
result:
<< output of the latest movie doc >>
<< gist: https://gist.github.com/Deeptiman/55c01af050d3f482a950aafad26033f9 >>
- $lookup
As a best practice, MongoDB distributes documents across multiple collections to reduce the
storage volume of each collection. The $lookup operator searches for data nodes across
collections that maintain relationships, aggregating them into the output record.
$lookup query structure
-- Structure 1
{
$lookup:
{
from: <collection to join>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
}
}
-- Structure 2
{
$lookup:
{
from: <collection to join>,
let: { <var_1>: <expression>, …, <var_n>: <expression> },
pipeline: [ <pipeline to run on joined collection> ],
as: <output array field>
}
}
list of movies:
<< gist: https://gist.github.com/Deeptiman/3ef23d40e48165b3f9e196d13c5cd46e >>
list of comments for the movies:
<< gist: https://gist.github.com/Deeptiman/d3e9afff4a6901d17a6bed5e82b7be6c >>
Example - 1:
The comments for the movies are stored in a separate collection, and the requirement is to
retrieve the comments for movies with the genre "Comedy" and an "imdb.rating" greater than 5.
query:
<< gist: https://gist.github.com/Deeptiman/4c7fb54ac9fa2717b335e4e118617fd0 >>
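As a minimal sketch using Structure 1 above, assuming the comments collection references movies through a
movie_id field (a hypothetical name):
db.movies.aggregate([
    // filter the movies first
    { $match: { genres: "Comedy", "imdb.rating": { $gt: 5 } } },
    // then join the matching comments
    {
        $lookup: {
            from: "comments",          // collection to join
            localField: "_id",         // movie key on this side
            foreignField: "movie_id",  // matching key in comments (assumed field name)
            as: "movie_comments"       // output array field
        }
    }
]);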
result:
<< gist: https://gist.github.com/Deeptiman/00dec26c13ccd412584b128619788f93 >>
Example - 2:
Retrieve the comments sorted in ascending order. Here we apply the $lookup pipeline attribute to run
a sorting pipeline within the joined collection.
query:
<< gist: https://gist.github.com/Deeptiman/e125537ef6c328fb3b989cc39b59da57 >>
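A sketch using Structure 2 above; the movie_id and date field names are assumptions:
db.movies.aggregate([
    {
        $lookup: {
            from: "comments",
            let: { movieId: "$_id" },   // expose the movie key to the inner pipeline
            pipeline: [
                { $match: { $expr: { $eq: ["$movie_id", "$$movieId"] } } },
                { $sort: { date: 1 } }  // sort the joined comments in ascending order
            ],
            as: "movie_comments"
        }
    }
]);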
results:
<< list of sorted comments for the movie >>
<< gist: https://gist.github.com/Deeptiman/51cee243d482a6864ff1cf609c138f21 >>
- $graphLookup
The $graphLookup operator performs a recursive search on collections using graph techniques that
connect document fields as nodes and edges. The recursive search continues up to the maximum
recursion depth defined by the "maxDepth" parameter.
$graphLookup query structure
---
{
    $graphLookup: {
        from: <collection to join>,
        startWith: <expression that begins the search>,
        connectFromField: <field name to map to the `from` collection>,
        connectToField: <field name that maps to the aggregated collection>,
        as: <output array field>,
        maxDepth: <maximum depth of the recursive search>
    }
}
Example:
Find the comments given by a user, matched on the email address.
list of users
<< gist: https://gist.github.com/Deeptiman/0b01e66497933cd4b33a3df9e7524263 >>
query:
<< gist: https://gist.github.com/Deeptiman/413e735f274999841a75e0e666110c2a >>
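A minimal sketch under the assumption that both collections carry an email field (the address shown is a
placeholder); a single hop is enough here, so maxDepth stays at 0:
db.users.aggregate([
    { $match: { email: "user@example.com" } },   // start from one user document
    {
        $graphLookup: {
            from: "comments",
            startWith: "$email",        // begin the search with the user's email
            connectFromField: "email",
            connectToField: "email",    // match comments carrying the same email
            as: "user_comments",
            maxDepth: 0                 // one hop; no deeper recursion needed
        }
    }
]);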
result:
<< gist: https://gist.github.com/Deeptiman/ec21b1f8c2d96f1d0a42953f6ea1641a >>
- $geoNear
MongoDB supports geolocation queries with the $geoNear aggregation stage. There are plenty of use cases for
geographic coordinates, such as calculating the distance to nearby points of interest. A GeoJSON coordinate
pair works well in the aggregation to point to the closest documents in the collection.
Example:
Find the nearby theaters within a certain radius of the given lat/long coordinates.
<< list of theaters >>
<< gist: https://gist.github.com/Deeptiman/93bc3627b593498285d4246c099c665d >>
query:
<< gist: https://gist.github.com/Deeptiman/edd2dddabbbefb6cd1c89c96d661e1ed >>
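A minimal sketch, assuming the theaters collection has a 2dsphere index on its GeoJSON location field
(the coordinates and field names are placeholders):
db.theaters.aggregate([
    {
        $geoNear: {
            // GeoJSON points are [longitude, latitude]
            near: { type: "Point", coordinates: [-73.9667, 40.78] },
            distanceField: "distance_meters",  // computed distance is written here
            maxDistance: 5000,                 // radius in meters
            spherical: true
        }
    }
]);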
result:
<< gist: https://gist.github.com/Deeptiman/e0592da8a6242d1f6def22aa4a5a4e51 >>
Information Security with the Mongo Aggregation framework
So far, we have seen many different Mongo aggregation techniques to retrieve documents from a
collection for analysis by the client application. But not every category of data needs to be
visible or readable to a third-party client. Access to sensitive fields must remain confidential
to the owner to preserve information security. Any database engine authorized to store
sensitive information is regulated by data privacy law. Organizations go through a series of
regulatory audits, presenting their third-party risk management and cybersecurity infrastructure, to meet
the compliance requirements for handling business transaction data in a software application.
- Read-only collection using a View
A MongoDB view works quite similarly to a SQL view: it queries other collections through a predefined
aggregation pipeline to retrieve the filtered documents that are allowed to be visible to the client
application.
Example:
A few user records contain sensitive personal information, and when the client application tries to
read the user data, the sensitive personal information must be restricted.
<< user personal info >>
<< gist: https://gist.github.com/Deeptiman/8aa96883621aff5b2fcb104999afa304 >>
query:
1. Create the `user_data_list` view by querying the `personal_info` collection, removing `mobile` and
`passport_number`.
<< gist: https://gist.github.com/Deeptiman/7d7632aa742e08975ee7f74beb94aca8 >>
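As a minimal sketch of such a view definition, using the collection and field names described above:
db.createView(
    "user_data_list",      // view name
    "personal_info",       // source collection
    [
        // strip the sensitive fields before anything reaches the client
        { $unset: ["mobile", "passport_number"] }
    ]
);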
result:
<< gist: https://gist.github.com/Deeptiman/0b5a1c060519da3e0831c33b5f553be2 >>
2. Read the `user_data_list` view
<< gist: https://gist.github.com/Deeptiman/f993e51a6c38bc385c043779e5a522d9 >>
result:
<< gist: https://gist.github.com/Deeptiman/befe3ad4eb8c97b21aa99f21ea888c9c >>
So, from the `user_data_list` view result we can see that the `mobile` and `passport_number` details are
removed and restricted for the client.
The client application will only have access to the created view.
- Aggregation techniques to mask sensitive fields
MongoDB aggregation operators provide techniques to mask sensitive field information so that it remains
obfuscated in the client application. As a scenario, we will try to mask credit card information without
exposing sensitive fields:
- Partially obfuscate the card holder's name.
- Obfuscate the first 12 digits of the card number and retain only the last 4 digits.
- Randomize the expiry date-time by adding or subtracting up to 30 days.
- Replace the 3-digit CVC code with a random set of 3 digits.
<< credit card info table >>
<< gist: https://gist.github.com/Deeptiman/f28ac8192ac3ad528bd2406d785afa13 >>
query:
<< gist: https://gist.github.com/Deeptiman/0c2b389f4661d0011d6b86810039df49 >>
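A minimal sketch of one way to do this; the field names are hypothetical, the card number is assumed to be
stored as a 16-character string and the expiry as a date, and $rand requires MongoDB 4.4.2 or later:
db.credit_cards.aggregate([
    {
        $set: {
            // keep the first 2 characters of the name, mask the rest
            card_holder: {
                $concat: [{ $substrCP: ["$card_holder", 0, 2] }, "*****"]
            },
            // mask the first 12 digits, retain the last 4
            card_number: {
                $concat: ["XXXXXXXXXXXX", { $substrCP: ["$card_number", 12, 4] }]
            },
            // shift the expiry by a random offset within +/- 30 days
            // (2592000000 ms = 30 days; $rand yields a value in [0, 1))
            card_expiry: {
                $add: [
                    "$card_expiry",
                    {
                        $floor: {
                            $multiply: [
                                { $subtract: [{ $multiply: [{ $rand: {} }, 2] }, 1] },
                                2592000000
                            ]
                        }
                    }
                ]
            },
            // replace the CVC with a random 3-digit number (100-999)
            card_cvc: {
                $floor: { $add: [{ $multiply: [{ $rand: {} }, 900] }, 100] }
            }
        }
    }
]);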
result:
<< gist: https://gist.github.com/Deeptiman/645bd929da28e1fdb4363378be77dfdd >>
Performance of MongoDB Aggregation Queries
The performance of an aggregation query is analyzed using the explain plan. MongoDB supports
three verbosity modes, queryPlanner, executionStats, and allPlansExecution, to explain a query
execution plan.
While optimizing the aggregation query, the database engine reorders the pipeline stages.
The first pipeline stage, the $cursor operator, picks the best possible winning plan and forms the metadata
used to decide the type of document scan operation [IXSCAN, COLLSCAN, FETCH], speeding up the
document retrieval process in the pipeline stage.
Explain with queryPlanner
db.getCollection("movies").explain("queryPlanner").aggregate(pipeline)
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"namespace" : "movie_details.movies",
"indexFilterSet" : false,
"parsedQuery" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"queryHash" : "148454AD",
"planCacheKey" : "42DB0972",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "PROJECTION_DEFAULT",
"transformBy" : {
"imdb.votes" : 1.0,
"_id" : 0.0
},
"inputStage" : {
"stage" : "COLLSCAN", // scan entire collection to match with the record
"filter" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"direction" : "forward"
}
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"$const" : "total_vote"
},
"vote_count" : {
"$sum" : "$imdb.votes"
}
}
}
]
Here the "winningPlan" stage metadata is scanned with the "COLLSCAN" operation, so the entire collection
requires iterating all rows to map/match the particular records. The COLLSCAN is pretty slow and very
much ineffective mapping with an enormously large data set. The queryPlanner also supports indexing
with IXSCAN that filters instantly with the particular record.
Explain with executionStats
The statistics of the winning plan's query execution are contained in "executionStats" mode.
db.getCollection("movies").explain("executionStats").aggregate(pipeline)
"stages" : [
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 18511.0,
"executionTimeMillis" : 42.0,
"totalKeysExamined" : 0.0,
"totalDocsExamined" : 23530.0,
"executionStages" : {
"stage" : "PROJECTION_DEFAULT",
"nReturned" : 18511.0,
"executionTimeMillisEstimate" : 2.0,
"works" : 23532.0,
"advanced" : 18511.0,
"needTime" : 5020.0,
"needYield" : 0.0,
"saveState" : 24.0,
"restoreState" : 24.0,
"isEOF" : 1.0,
"transformBy" : {
"imdb.votes" : 1.0,
"_id" : 0.0
},
"inputStage" : {
"stage" : "COLLSCAN",
"filter" : {
"imdb.rating" : {
"$lt" : 7.5
}
},
"nReturned" : 18511.0,
"executionTimeMillisEstimate" : 1.0,
"works" : 23532.0,
"advanced" : 18511.0,
"needTime" : 5020.0,
"needYield" : 0.0,
"saveState" : 24.0,
"restoreState" : 24.0,
"isEOF" : 1.0,
"direction" : "forward",
"docsExamined" : 23530.0
                    }
                }
            }
        },
        "nReturned" : NumberLong(18511),
        "executionTimeMillisEstimate" : NumberLong(29)
    }
]
nReturned: The total number of documents returned by the query; in this case, 18511 movies were returned
to the client.
executionTimeMillis: The query execution time in milliseconds, which helps in understanding the performance
of the aggregation query.
filter: The conditional query statement included in the aggregation pipeline:
"imdb.rating" : { "$lt" : 7.5 }
docsExamined: The total number of documents examined in the collection to match the particular search record.
Visualizing Aggregation flow with Studio 3T
In Studio 3T, the transactional flow shows the total time taken to produce the result and the size of the
documents generated at each stage of the pipeline. The explain() query in Studio 3T is quite helpful
for visualizing the process of an aggregation pipeline.
<< -- screenshot of aggregation flow -- >>
Execute Aggregation pipeline in MongoDB Compass GUI
The MongoDB Compass GUI is pretty handy for executing aggregation queries. Debugging the
aggregation pipeline there helps to practically test the pipeline before applying it in any
programming language.
<< -- screenshot of mongodb compass aggregation flow -- >>
Conclusion
The MongoDB aggregation framework handles complex data processing workloads. The pipeline stages
work much like functional programming: the output of one pipeline stage becomes the input
to the next pipeline stage. In this article, we executed aggregation pipeline queries across
many different scenarios to understand practical techniques for mass data manipulation.
The MongoDB aggregation framework looks different from regular procedural programming at the beginner
level, but after understanding the building structure of the aggregation expression operators, developers
will find the aggregation framework similar to a Domain-Specific Language (DSL).