documents in mongodb data collection follows format,
[ { "_id": xxxxxxxxx, "crime_type": "illegal_trade", "crime_year": "2013", "location": "kurunegala" }, { "_id": xxxxxxxxx, "crime_type": "illegal_trade", "crime_year": "2013", "location": "colombo" }, { "_id": xxxxxxxxx, "crime_type": "illegal_trade", "crime_year": "2014", "location": "kandy" }, { "_id": xxxxxxxxx, "crime_type": "murder", "crime_year": "2013", "location": "kadawatha" } ] when run aggregate operation,
db.collection.aggregate( [ { $group : { _id : {type: "$crime_type", year: "$crime_year"}, count: { $sum: 1 } } } ] ) the result contains items have count > 0
for example results _id : {type: "murder", year: "2014"} in count = 0 won't include in results.
my question is, how should alter query count = 0 items in results?
in other words how this mongodb...?
basically asking results not present in data, key combination not exist return count of 0. in truth, no database system "truly" this, there ways make happening. means understanding happening make so.
it true sql approach make sub-query of expected keys distinct values , "join" existing dataset in order create "false positives" grouping accumulation. general method there, of course there basic "joining" concept not supported mongodb reasons of scalability. whole different argument there, accept mongodb not joins on it's own server architecture, , never will.
as such, task of creating "false set" when working mongodb relegated client side ( , consider in terms of "client" separate process database server ) operation. both "result set" , "blank set" , "merge" results.
different language approaches vary, here efficient listing node.js:
var async = require('async'), mongo = require('mongodb'), mongoclient = mongo.mongoclient, datastore = require('nedb'), combined = new datastore(); var data = [ { "crime_type": "illegal_trade", "crime_year": "2013", "location": "kurunegala" }, { "crime_type": "illegal_trade", "crime_year": "2013", "location": "colombo" }, { "crime_type": "illegal_trade", "crime_year": "2014", "location": "kandy" }, { "crime_type": "murder", "crime_year": "2013", "location": "kadawatha" } ]; mongoclient.connect('mongodb://localhost/test',function(err,db) { if (err) throw err; db.collection('mytest',function(err,collection) { if (err) throw err; async.series( [ // clear collection function(callback) { console.log("dropping..\n"); collection.remove({},callback); }, // insert data function(callback) { console.log("inserting..\n"); collection.insert(data,callback); }, // run parallel merge function(callback) { console.log("merging..\n"); async.parallel( [ // blank distincts function(callback) { collection.distinct("crime_year",function(err,years) { if (err) callback(err); async.each( years, function(year,callback) { collection.distinct("crime_type",function(err,types) { if (err) callback(err); async.each( types, function(type,callback) { combined.update( { "type": type, "year": year }, { "$inc": { "count": 0 } }, { "upsert": true }, callback ); },callback); }); },callback); }); }, // result distincts function(callback) { collection.aggregate( [ { "$group": { "_id": { "type": "$crime_type", "year": "$crime_year" }, "count": { "$sum": 1 } }} ], function(err,results) { async.each( results, function(result, callback) { combined.update( { "type": result._id.type, "year": result._id.year }, { "$inc": { "count": result.count } }, { "upsert": true }, callback ); },callback); } ); } ], function(err) { callback(err); } ) }, // retrieve result function(callback) { console.log("fetching:\n"); combined.find({},{ "_id": 0 }).sort( { "year": 1, "type": 1 }).exec(function(err,results) { if (err) callback(err); console.log( json.stringify( results, undefined, 4 ) ); callback(); }); } ], function(err) { if (err) throw err; db.close(); } ) }); }); and return result not "combines" results grouped keys, contains 0 entry "murder" in year "2014":
[ { "type": "illegal_trade", "year": "2013", "count": 2 }, { "type": "murder", "year": "2013", "count": 1 }, { "type": "illegal_trade", "year": "2014", "count": 1 }, { "type": "murder", "year": "2014", "count": 0 } ] so consider meat of operations here, within "parallel" section of code under "merging", efficient way node issue of queries (and potentially quite few) @ same time.
the first part in order "blank" results no count double loop operation, point distinct values each of "year" , "type". whether use .distinct() method shown here or or using .aggregate() method "cursor" output , iteration matter of how data have or like. small set generation .distinct() fine results in memory. want create "blank" or 0 count entries each possible pairing, or more importantly including "non existent" pairing in dataset.
secondly, , in parallel possible, aggregation result run standard results. of course these results not return count "murder" in "2014" because there none. comes down merging results.
the "merge" working "hash/map/dict" ( whatever term ), of combined keys "year" , "type". use structure, adding key not exist or incrementing "count" value on key does. that's age old operation, , basis of aggregation techniques.
the neat little thing being done here ( not need use ), use of nedb, nice little module allows use of mongodb "like" operations on in-memory or other self contained data files. think of sqlite sql rdbms operations. little lighter on complete functionality.
part of point here "hash merge" functions regular mongodb "upsert" operations code. in fact, same code applies if have large result needs end in "result collection" on server instead.
the overall point "join" operation or otherwise "fill in blanks" operation depending on overall size , expectancy of "keys" in operation. mongodb server not going this, there nothing stopping writing own "data layer" middle tier between end application , database. such distributed server model can scaled out service level performs these sorts of "joining" operations.
all of queries used data merge can run in parallel under right coding environment, while may not seem straightforward sql approach doing this, can still effective , efficient @ processing results.
the approach different, again part of philosophy here. mongodb relagates "joining" activities different parts of application architecture in order keep it's own server specific operations more efficient, , regards sharded clusters. "joining" or "hash merge" "code" function can handled other infrastructure database server.
Comments
Post a Comment