An introduction to mongo aggregations
This post describes a simple introduction to mongodb and how to get started with building complex queries. I suggest to read my previous post on NoSQL Databases in case you haven’t worked with them before so that you know what they are and what they are not.
Before we start I recommend using Robomongo as a graphical mongo client but if you prefer CLI let’s explain some basic commands:
> show databases # Or "show dbs" to list avaiable databases
> use <db name> # Connect to a particular database
> show collections # Print a list of collections in the database
In mongo, a database doesn’t contain tables but collections that contain a
bunch of documents that could have something in common, each of this documents
has its auto generated _id
by which mongo indexes them in the database.
In this post I’m using a very simple database that consist of a sales
collection and no more, if you want to follow along you can download this data
set in CSV format here. To load the file in a mongo collection we use
mongoimport
:
$ mongoimport -d <db name> -c <collection name> --type csv --file <path to csv file> --headerline
With this we say to mongoimport
to load a CSV file (--type csv
) and use
the first row as attribute names rather than treat it as another record
(--headerline
).
We’re ready to build queries!, to retrieve all documents in a collection named
sales
we do it like this:
db.sales.find({})
Note that we pass an empty JSON object to the find()
function indicating
that we intend to match everything in the result set, however this is
optional, if you don’t pass anything you’ll get the same result but better be
explicit about what we want.
Inspecting a single document in the result we see a structure like this:
{
"_id" : ObjectId("59b0bc033989da18984ddecc"),
"InvoiceNo" : 536365,
"StockCode" : "85123A",
"Description" : "WHITE T-LIGHT HOLDER",
"Quantity" : 6,
"InvoiceDate" : ISODate("2010-12-01T08:26:00.000Z"),
"UnitPrice" : 2.55,
"CustomerID" : 17850.0,
"Country" : "United Kingdom"
}
All the documents in the sample sales
collection have the same attributes
but in a document oriented database this is not a requirement, a document
could have more or less attributes set.
Now, what ever object we pass to the find()
function is going to be
interpreted as a WHERE
clause in a SQL DBMS. Lets suppose we what to find
all sales made to the customer 17850
:
db.sales.find({'CustomerID': 17850})
I expect this query to be clear, we pass a valid JSON object with the
structure { attribute: value }
, note that the semicolons aren’t necessary
unless the attribute name has spaces in between but I always use them and I’ll
explain why latter on.
To query for all the sales for the same customer with a Quantity
greater
than 10
units we do:
db.sales.find({
'CustomerID': 17850,
'Quantity': { '$gt': 10 }
})
Now this looks a little bit more ugly, in this case we use the $gt
selector
to filter the result set where the Quantity
is greater than (therefore gt
)
10
, note here that the attribute doesn’t receive a value but another object
{ $selector: value }
Feel free to consult the documentation on selectors, some of them come handy to make comparisons, logical operations, check existence or type of attributes, match by regular expressions and many more.
find()
is just wrapper to build simple queries but as soon as you need to
group or match records by a more complex set of rules we need to use
aggregate()
.
Aggregating works as a pipe line, the concept is simple, first you match
by some attributes (the same way you did it with find) and then you group
them by some set of rules. For us all what this means is that aggregate()
receives as parameter a list
where each element is an object
performing an
operation.
db.sales.aggregate(
[
{ ... }, // 1st operation (usually $match)
{ ... }, // 2nd operation (usually $group)
{ ... }, // 3rd operation (usually $sort)
{ ... }, // more operations to follow
.
.
.
]
)
Let’s see an example, we want to find out who are the top three countries with
more sales, in this case we don’t want to match by any { attribute: value }
in particular, so we skip that, but for example if you wanted to take into
account only those countries where a certain product was sold you can do that
with match { '$match': { 'StockCode': 22379 } }
and this object would be the
first in the list passed to aggregate()
.
Instead we group directly all of the records by country { '$group': { '_id': '$Country' } }
:
db.sales.aggregate([
{ '$group': { '_id': '$Country' } }
])
$group
takes an object as a value, we use this object to define the
representation of the grouped elements, so we’re saying “Group by $Country,
and each of the resulted groups will consist of just that, the _id
attribute
set to the $Country”.
Now, having only the countries in groups doesn’t help much, we need to have
the sales amount in each group as well, so for that we say “It will also have
a sales
attribute whose value is the sum of multiplying the quantity by the
product’s price across all elements of the group”.
I hope to haven’t lost you there, let’s see, to find who are the top seller countries we first need to know the sales amount they’ve generated, and for that within each country (group) we need to first compute within each invoice (element in the group) the total amount in that sale.
$multiply
is an aggregation operation so we can use it while grouping, in
this case we need to multiply in each of the invoices UnitPrice
by the
Quantity
:
{ '$multiply': ['$UnitPrice', '$Quantity'] }
Now we combine all of the elements by summing the results (the aggregation) so
we use $sum
.
{
'$sum': {
'$multiply': ['$UnitPrice', '$Quantity']
}
}
To whatever results of this, we want to store it in a sales
attribute within
each group i.e. within each country. So we now have this:
db.sales.aggregate([
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
}
])
And that folks is just our first operation in the list passed to
aggregate()
, our next operation needs to be a sort, we would like to see the
countries with more `sales` on top:
{ '$sort': { 'sales': -1 } }
The -1
indicates that we want to sort by decreasing order:
db.sales.aggregate([
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
},
{ '$sort': { 'sales': -1 } }
])
Finally we only want to gather the top three, so we need to $limit
the
result set to 3
:
{ '$limit': 3 }
Our pipeline consists of three operations:
db.sales.aggregate([
// 1st operation ($group)
{
'$group': {
'_id': '$Country',
'sales': {
'$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
}
}
},
// 2nd operation ($sort)
{ '$sort': { 'sales': -1 } },
// 3rd operation ($limit)
{ '$limit': 3 }
])
And this is the result:
[
{
"_id" : "United Kingdom",
"sales" : 8187806.364
},
{
"_id" : "Netherlands",
"sales" : 284661.54
},
{
"_id" : "EIRE",
"sales" : 263276.82
}
]
Using Pymongo
Now, the advantage to always wrap attributes and values in semicolons is that
if you also interact with a Mongo database from a python script you can use
pymongo
and use the exact same queries there!
pip install pymongo
To connect to a a mongo instance you create a MongoClient
instance passing
the host
and port
to get a connection, if you don’t pass these parameters
it will use the defaults localhost:27017
:
from pymongo import MongoClient
conn = MongoClient('localhost', 27017)
db = conn.my_database
It’s very easy to get a reference to the database, you just access it as an
attribute in the connection object. Once you have the db
object you can
pretty much use it the same way as you do in the mongo
client.
This query, finds out the top 5 most sold products in USA and it works the same way in mongo as it does in pymongo:
db.sales.aggregate([
{ '$match': { 'Country': 'USA' } },
{
'$group': {
'_id': '$StockCode',
'price': { '$first': '$UnitPrice' },
'quantity': { '$first': '$Quantity' },
'description': { '$first': '$Description' }
},
},
{'$sort': {'quantity': -1} },
{'$limit': 5}
])