An introduction to mongo aggregations
This post describes a simple introduction to mongodb and how to get started with building complex queries. I suggest to read my previous post on NoSQL Databases in case you haven’t worked with them before so that you know what they are and what they are not.
Before we start I recommend using Robomongo as a graphical mongo client but if you prefer CLI let’s explain some basic commands:
> show databases # Or "show dbs" to list avaiable databases
> use <db name> # Connect to a particular database
> show collections # Print a list of collections in the databaseIn mongo, a database doesn’t contain tables but collections that contain a
bunch of documents that could have something in common, each of this documents
has its auto generated _id by which mongo indexes them in the database.
In this post I’m using a very simple database that consist of a sales
collection and no more, if you want to follow along you can download this data
set in CSV format here. To load the file in a mongo collection we use
mongoimport:
$ mongoimport -d <db name> -c <collection name> --type csv --file <path to csv file> --headerlineWith this we say to mongoimport to load a CSV file (--type csv) and use
the first row as attribute names rather than treat it as another record
(--headerline).
We’re ready to build queries!, to retrieve all documents in a collection named
sales we do it like this:
db.sales.find({})Note that we pass an empty JSON object to the find() function indicating
that we intend to match everything in the result set, however this is
optional, if you don’t pass anything you’ll get the same result but better be
explicit about what we want.
Inspecting a single document in the result we see a structure like this:
{
"_id" : ObjectId("59b0bc033989da18984ddecc"),
"InvoiceNo" : 536365,
"StockCode" : "85123A",
"Description" : "WHITE T-LIGHT HOLDER",
"Quantity" : 6,
"InvoiceDate" : ISODate("2010-12-01T08:26:00.000Z"),
"UnitPrice" : 2.55,
"CustomerID" : 17850.0,
"Country" : "United Kingdom"
}All the documents in the sample sales collection have the same attributes
but in a document oriented database this is not a requirement, a document
could have more or less attributes set.
Now, what ever object we pass to the find() function is going to be
interpreted as a WHERE clause in a SQL DBMS. Lets suppose we what to find
all sales made to the customer 17850:
db.sales.find({'CustomerID': 17850})I expect this query to be clear, we pass a valid JSON object with the
structure { attribute: value }, note that the semicolons aren’t necessary
unless the attribute name has spaces in between but I always use them and I’ll
explain why latter on.
To query for all the sales for the same customer with a Quantity greater
than 10 units we do:
db.sales.find({
'CustomerID': 17850,
'Quantity': { '$gt': 10 }
})Now this looks a little bit more ugly, in this case we use the $gt selector
to filter the result set where the Quantity is greater than (therefore gt)
10, note here that the attribute doesn’t receive a value but another object
{ $selector: value }
Feel free to consult the documentation on selectors, some of them come handy to make comparisons, logical operations, check existence or type of attributes, match by regular expressions and many more.
find() is just wrapper to build simple queries but as soon as you need to
group or match records by a more complex set of rules we need to use
aggregate().
Aggregating works as a pipe line, the concept is simple, first you match
by some attributes (the same way you did it with find) and then you group
them by some set of rules. For us all what this means is that aggregate()
receives as parameter a list where each element is an object performing an
operation.
db.sales.aggregate(
[
{ ... }, // 1st operation (usually $match)
{ ... }, // 2nd operation (usually $group)
{ ... }, // 3rd operation (usually $sort)
{ ... }, // more operations to follow
.
.
.
]
)Let’s see an example, we want to find out who are the top three countries with
more sales, in this case we don’t want to match by any { attribute: value }
in particular, so we skip that, but for example if you wanted to take into
account only those countries where a certain product was sold you can do that
with match { '$match': { 'StockCode': 22379 } } and this object would be the
first in the list passed to aggregate().
Instead we group directly all of the records by country { '$group': { '_id': '$Country' } }:
db.sales.aggregate([
{ '$group': { '_id': '$Country' } }
])$group takes an object as a value, we use this object to define the
representation of the grouped elements, so we’re saying “Group by $Country,
and each of the resulted groups will consist of just that, the _id attribute
set to the $Country”.
Now, having only the countries in groups doesn’t help much, we need to have
the sales amount in each group as well, so for that we say “It will also have
a sales attribute whose value is the sum of multiplying the quantity by the
product’s price across all elements of the group”.
I hope to haven’t lost you there, let’s see, to find who are the top seller countries we first need to know the sales amount they’ve generated, and for that within each country (group) we need to first compute within each invoice (element in the group) the total amount in that sale.
$multiply is an aggregation operation so we can use it while grouping, in
this case we need to multiply in each of the invoices UnitPrice by the
Quantity:
{ '$multiply': ['$UnitPrice', '$Quantity'] }Now we combine all of the elements by summing the results (the aggregation) so
we use $sum.
{
'$sum': {
'$multiply': ['$UnitPrice', '$Quantity']
}
}To whatever results of this, we want to store it in a sales attribute within
each group i.e. within each country. So we now have this:
db.sales.aggregate([
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
}
])And that folks is just our first operation in the list passed to
aggregate(), our next operation needs to be a sort, we would like to see the
countries with more `sales` on top:
{ '$sort': { 'sales': -1 } }The -1 indicates that we want to sort by decreasing order:
db.sales.aggregate([
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
},
{ '$sort': { 'sales': -1 } }
])Finally we only want to gather the top three, so we need to $limit the
result set to 3:
{ '$limit': 3 }Our pipeline consists of three operations:
db.sales.aggregate([
// 1st operation ($group)
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
},
// 2nd operation ($sort)
{ '$sort': { 'sales': -1 } },
// 3rd operation ($limit)
{ '$limit': 3 }
])
And this is the result:
[
{
"_id" : "United Kingdom",
"sales" : 8187806.364
},
{
"_id" : "Netherlands",
"sales" : 284661.54
},
{
"_id" : "EIRE",
"sales" : 263276.82
}
]Using Pymongo
Now, the advantage to always wrap attributes and values in semicolons is that
if you also interact with a Mongo database from a python script you can use
pymongo and use the exact same queries there!
pip install pymongoTo connect to a a mongo instance you create a MongoClient instance passing
the host and port to get a connection, if you don’t pass these parameters
it will use the defaults localhost:27017:
from pymongo import MongoClient
conn = MongoClient('localhost', 27017)
db = conn.my_databaseIt’s very easy to get a reference to the database, you just access it as an
attribute in the connection object. Once you have the db object you can
pretty much use it the same way as you do in the mongo client.
This query, finds out the top 5 most sold products in USA and it works the same way in mongo as it does in pymongo:
db.sales.aggregate([
{ '$match': { 'Country': 'USA' } },
{
'$group': {
'_id': '$StockCode',
'price': { '$first': '$UnitPrice' },
'quantity': { '$first': '$Quantity' },
'description': { '$first': '$Description' }
},
},
{'$sort': {'quantity': -1} },
{'$limit': 5}
])