September 16, 2017

1415 words 7 mins read

An introduction to mongo aggregations

This post describes a simple introduction to mongodb and how to get started with building complex queries. I suggest to read my previous post on NoSQL Databases in case you haven’t worked with them before so that you know what they are and what they are not.

Before we start I recommend using Robomongo as a graphical mongo client but if you prefer CLI let’s explain some basic commands:

> show databases    # Or "show dbs" to list avaiable databases
> use <db name>     # Connect to a particular database
> show collections  # Print a list of collections in the database

In mongo, a database doesn’t contain tables but collections that contain a bunch of documents that could have something in common, each of this documents has its auto generated _id by which mongo indexes them in the database.

In this post I’m using a very simple database that consist of a sales collection and no more, if you want to follow along you can download this data set in CSV format here. To load the file in a mongo collection we use mongoimport:

$ mongoimport -d <db name> -c <collection name> --type csv --file <path to csv file> --headerline

With this we say to mongoimport to load a CSV file (--type csv) and use the first row as attribute names rather than treat it as another record (--headerline).

We’re ready to build queries!, to retrieve all documents in a collection named sales we do it like this:

db.sales.find({})

Note that we pass an empty JSON object to the find() function indicating that we intend to match everything in the result set, however this is optional, if you don’t pass anything you’ll get the same result but better be explicit about what we want.

Inspecting a single document in the result we see a structure like this:

{
  "_id" : ObjectId("59b0bc033989da18984ddecc"),
  "InvoiceNo" : 536365,
  "StockCode" : "85123A",
  "Description" : "WHITE T-LIGHT HOLDER",
  "Quantity" : 6,
  "InvoiceDate" : ISODate("2010-12-01T08:26:00.000Z"),
  "UnitPrice" : 2.55,
  "CustomerID" : 17850.0,
  "Country" : "United Kingdom"
}

All the documents in the sample sales collection have the same attributes but in a document oriented database this is not a requirement, a document could have more or less attributes set.

Now, what ever object we pass to the find() function is going to be interpreted as a WHERE clause in a SQL DBMS. Lets suppose we what to find all sales made to the customer 17850:

db.sales.find({'CustomerID': 17850})

I expect this query to be clear, we pass a valid JSON object with the structure { attribute: value }, note that the semicolons aren’t necessary unless the attribute name has spaces in between but I always use them and I’ll explain why latter on.

To query for all the sales for the same customer with a Quantity greater than 10 units we do:

db.sales.find({
  'CustomerID': 17850,
  'Quantity': { '$gt': 10 }
})

Now this looks a little bit more ugly, in this case we use the $gt selector to filter the result set where the Quantity is greater than (therefore gt) 10, note here that the attribute doesn’t receive a value but another object { $selector: value }

Feel free to consult the documentation on selectors, some of them come handy to make comparisons, logical operations, check existence or type of attributes, match by regular expressions and many more.

find() is just wrapper to build simple queries but as soon as you need to group or match records by a more complex set of rules we need to use aggregate().

Aggregating works as a pipe line, the concept is simple, first you match by some attributes (the same way you did it with find) and then you group them by some set of rules. For us all what this means is that aggregate() receives as parameter a list where each element is an object performing an operation.

db.sales.aggregate(
  [
    { ... },  // 1st operation (usually $match)
    { ... },  // 2nd operation (usually $group)
    { ... },  // 3rd operation (usually $sort)
    { ... },  // more operations to follow
    .
    .
    .
  ]
)

Let’s see an example, we want to find out who are the top three countries with more sales, in this case we don’t want to match by any { attribute: value } in particular, so we skip that, but for example if you wanted to take into account only those countries where a certain product was sold you can do that with match { '$match': { 'StockCode': 22379 } } and this object would be the first in the list passed to aggregate().

Instead we group directly all of the records by country { '$group': { '_id': '$Country' } }:

db.sales.aggregate([
  { '$group': { '_id': '$Country' } }
])

$group takes an object as a value, we use this object to define the representation of the grouped elements, so we’re saying “Group by $Country, and each of the resulted groups will consist of just that, the _id attribute set to the $Country”.

Now, having only the countries in groups doesn’t help much, we need to have the sales amount in each group as well, so for that we say “It will also have a sales attribute whose value is the sum of multiplying the quantity by the product’s price across all elements of the group”.

I hope to haven’t lost you there, let’s see, to find who are the top seller countries we first need to know the sales amount they’ve generated, and for that within each country (group) we need to first compute within each invoice (element in the group) the total amount in that sale.

$multiply is an aggregation operation so we can use it while grouping, in this case we need to multiply in each of the invoices UnitPrice by the Quantity:

{ '$multiply': ['$UnitPrice', '$Quantity'] }

Now we combine all of the elements by summing the results (the aggregation) so we use $sum.

{
  '$sum': {
    '$multiply': ['$UnitPrice', '$Quantity']
  }
}

To whatever results of this, we want to store it in a sales attribute within each group i.e. within each country. So we now have this:

db.sales.aggregate([
  {
    '$group': {
      '_id': '$Country',
      'sales': {
        '$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
      }
    }
  }
])

And that folks is just our first operation in the list passed to aggregate(), our next operation needs to be a sort, we would like to see the countries with more `sales` on top:

{ '$sort': { 'sales': -1 } }

The -1 indicates that we want to sort by decreasing order:

db.sales.aggregate([
  {
    '$group': {
      '_id': '$Country',
      'sales': {
        '$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
      }
    }
  },
  { '$sort': { 'sales': -1 } }
])

Finally we only want to gather the top three, so we need to $limit the result set to 3:

{ '$limit': 3 }

Our pipeline consists of three operations:

db.sales.aggregate([
  // 1st operation ($group)
  {
    '$group': {
      '_id': '$Country',
      'sales': {
        '$sum': { '$multiply': ['$UnitPrice', '$Quantity'] }
      }
    }
  },

  // 2nd operation ($sort)
  { '$sort': { 'sales': -1 } },

  // 3rd operation ($limit)
  { '$limit': 3 }
])

And this is the result:

[
  {
    "_id" : "United Kingdom",
    "sales" : 8187806.364
  },
  {
    "_id" : "Netherlands",
    "sales" : 284661.54
  },
  {
    "_id" : "EIRE",
    "sales" : 263276.82
  }
]

Using Pymongo

Now, the advantage to always wrap attributes and values in semicolons is that if you also interact with a Mongo database from a python script you can use pymongo and use the exact same queries there!

pip install pymongo

To connect to a a mongo instance you create a MongoClient instance passing the host and port to get a connection, if you don’t pass these parameters it will use the defaults localhost:27017:

from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
db = conn.my_database

It’s very easy to get a reference to the database, you just access it as an attribute in the connection object. Once you have the db object you can pretty much use it the same way as you do in the mongo client.

This query, finds out the top 5 most sold products in USA and it works the same way in mongo as it does in pymongo:

db.sales.aggregate([
  { '$match': { 'Country': 'USA' } },
  {
    '$group': {
      '_id': '$StockCode',
      'price': { '$first': '$UnitPrice' },
      'quantity': { '$first': '$Quantity' },
      'description': { '$first': '$Description' }
    },
  },
  {'$sort': {'quantity': -1} },
  {'$limit': 5}
])