Why You Should Never Use MongoDB

Disclaimer: I do not build database engines. I build web applications. I run 4-6 different projects every year, so I build a lot of web applications. I see apps with different requirements and different data storage needs. I’ve deployed most of the data stores you’ve heard about, and a few that you probably haven’t.

I’ve picked the wrong one a few times. This is a story about one of those times — why we picked it originally, how we discovered it was wrong, and how we recovered. It all happened on an open source project called Diaspora.

The project

Diaspora is a distributed social network with a long history. Waaaaay back in early 2010, four undergraduates from New York University made a Kickstarter video asking for $10,000 to spend the summer building a distributed alternative to Facebook. They sent it out to friends and family, and hoped for the best.

But they hit a nerve. There had just been another Facebook privacy scandal, and when the dust settled on their Kickstarter, they had raised over $200,000 from 6400 different people for a software project that didn’t yet have a single line of code written.

Diaspora was the first Kickstarter project to vastly overrun its goal. As a result, they got written up in the New York Times – which turned into a bit of a scandal, because the chalkboard in the backdrop of the team photo had a dirty joke written on it, and no one noticed until it was actually printed. In the NEW YORK TIMES. The fallout from that was actually how I first heard about the project.

As a result of their Kickstarter success, the guys left school and came out to San Francisco to start writing code. They ended up in my office. I was working at Pivotal Labs at the time, and one of the guys’ older brothers also worked there, so Pivotal offered them free desk space, internet, and, of course, access to the beer fridge. I worked with official clients during the day, then hung out with them after work and contributed code on weekends.

They ended up staying at Pivotal for more than two years. By the end of that first summer, though, they already had a minimal but working (for some definition) implementation of a distributed social network built in Ruby on Rails and backed by MongoDB.

That’s a lot of buzzwords. Let’s break it down.

“Distributed social network”

If you’ve seen the Social Network, you know everything you need to know about Facebook. It’s a web app, it runs on a single logical server, and it lets you stay in touch with people. Once you log in, Diaspora’s interface looks structurally similar to Facebook’s:

A screenshot of the Diaspora user interface

There’s a feed in the middle showing all your friends’ posts, and some other random stuff along the sides that no one has ever looked at. The main technical difference between Diaspora and Facebook is invisible to end users: it’s the “distributed” part.

The Diaspora infrastructure is not located behind a single web address. There are hundreds of independent Diaspora servers. The code is open source, so if you want to, you can stand up your own server. Each server, called a pod, has its own database and its own set of users, and will interoperate with all the other Diaspora pods that each have their own database and set of users.

Pods of different sizes communicate with each other, without a central hub.

Each pod communicates with the others through an HTTP-based API. Once you set up an account on a pod, it’ll be pretty boring until you follow some other people. You can follow other users on your pod, and you can also follow people who are users on other pods. When someone you follow on another pod posts an update, here’s what happens:

1. The update goes into the author’s pod’s database.

2. Your pod is notified over the API.

3. The update is saved in your pod’s database.

4. You look at your activity feed and see that post mixed in with posts from the other people you follow.

Comments work the same way. On any single post, some comments might be from people on the same pod as the post’s author, and some might be from people on other pods. Everyone who has permission to see the post sees all the comments, just as you would expect if everyone were on a single logical server.

Who cares?

There are technical and legal advantages to this architecture. The main technical advantage is fault tolerance.

Here is a very important fault tolerant system that every office should have.

If any one of the pods goes down, it doesn’t bring the others down. The system survives, and even expects, network partitioning. There are some interesting political implications to that — for example, if you’re in a country that shuts down outgoing internet to prevent access to Facebook and Twitter, your pod running locally still connects you to other people within your country, even though nothing outside is accessible.

The main legal advantage is server independence. Each pod is a legally separate entity, governed by the laws of wherever it’s set up. Each pod also sets their own terms of service. On most of them, you can post content without giving up your rights to it, unlike on Facebook. Diaspora is free software both in the “gratis” and the “libre” sense of the term, and most of the people who run pods care deeply about that sort of thing.

So that’s the architecture of the system. Let’s look at the architecture within a single pod.

It’s a Rails app.

Each pod is a Ruby on Rails web application backed by a database, originally MongoDB. In some ways the codebase is a ‘typical’ Rails app — it has both a visual and programmatic UI, some Ruby code, and a database. But in other ways it is anything but typical.

The internal structure of one Diaspora pod

The visual UI is of course how website users interact with Diaspora. The API is used by various Diaspora mobile clients — that part’s pretty typical — but it’s also used for “federation,” which is the technical name for inter-pod communication. (I asked where the Romulans’ access point was once, and got a bunch of blank looks. Sigh.) So the distributed nature of the system adds layers to the codebase that aren’t present in a typical app.

And of course, MongoDB is an atypical choice for data storage. The vast majority of Rails applications are backed by PostgreSQL or (less often these days) MySQL.

So that’s the code. Let’s consider what kind of data we’re storing.

I Do Not Think That Word Means What You Think That Means

“Social data” is information about our network of friends, their friends, and their activity. Conceptually, we do think about it as a network — an undirected graph in which we are in the center, and our friends radiate out around us.

Photos all from rubyfriends.com. Thanks Matt Rogers, Steve Klabnik, Nell Shamrell, Katrina Owen, Sam Livingston-Grey, Josh Susser, Akshay Khole, Pradyumna Dandwate, and Hephzibah Watharkar for contributing to #rubyfriends!

When we store social data, we’re storing that graph topology, as well as the activity that moves along those edges.

For quite a few years now, the received wisdom has been that social data is not relational, and that if you store it in a relational database, you’re doing it wrong.

But what are the alternatives? Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production. Other folks say that document databases are perfect for social data, and those are mainstream enough to actually be used. So let’s look at why people think social data fits more naturally in MongoDB than in PostgreSQL.

How MongoDB Stores Data

MongoDB is a document-oriented database. Instead of storing your data in tables made out of individual rows, like a relational database does, it stores your data in collections made out of individual documents. In MongoDB, a document is a big JSON blob with no particular format or schema.

Let’s say you have a set of relationships like this that you need to model. This is quite similar to a project that come through Pivotal that used MongoDB, and was the best use case I’ve ever seen for a document database.

At the root, we have a set of TV shows. Each show has many seasons, each season has many episodes, and each episode has many reviews and many cast members. When users come into this site, typically they go directly to the page for a particular TV show. On that page they see all the seasons and all the episodes and all the reviews and all the cast members from that show, all on one page. So from the application perspective, when the user visits a page, we want to retrieve all of the information connected to that TV show.

There are a number of ways you could model this data. In a typical relational store, each of these boxes would be a table. You’d have a tv_shows table, a seasons table with a foreign key into tv_shows, an episodes table with a foreign key into seasons, and reviews and cast_members tables with foreign keys into episodes. So to get all the information for a TV show, you’re looking at a five-table join.

We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.

Here’s an example document for one TV show, Babylon 5.

It’s got some title metadata, and then it’s got an array of seasons. Each season is itself a hash with metadata and an array of episodes. In turn, each episode has some metadata and arrays for both reviews and cast members.

It’s basically a huge fractal data structure.

Sets of sets of sets of sets. Tasty fractals.

All of the data we need for a TV show is under one document, so it’s very fast to retrieve all this information at once, even if the document is very large. There’s a TV show here in the US called “General Hospital” that has aired over 12,000 episodes over the course of 50+ seasons. On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes, while retrieval of the equivalent document by ID in MongoDB takes a fraction of a second.

So in many ways, this application presented the ideal use case for a document store.

Ok. But what about social data?

Right. When you come to a social networking site, there’s only one important part of the page: your activity stream. The activity stream query gets all of the posts from the people you follow, ordered by most recent. Each of those posts have nested information within them, such as photos, likes, reshares, and comments.

The nested structure of activity stream data looks very similar to what we were looking at with the TV shows.

Users have friends, friends have posts, posts have comments and likes, each comment has one commenter and each like has one liker. Relationship-wise, it’s not a whole lot more complicated than TV shows. And just like with TV shows, we want to pull all this data at once, right after the user logs in. Furthermore, in a relational store, with the data fully normalized, it would be a seven-table join to get everything out.

Seven-table joins. Ugh. Suddenly storing each user’s activity stream as one big denormalized nested data structure, rather than doing that join every time, seems pretty attractive.

In 2010, when the Diaspora team was making this decision, Etsy’s articles about using document stores were quite influential, although they’ve since publicly moved away from MongoDB for data storage. Likewise, at the time, Facebook’s Cassandra was also stirring up a lot of conversation about leaving relational databases. Diaspora chose MongoDB for their social data in this zeitgeist. It was not an unreasonable choice at the time, given the information they had.

What could possibly go wrong?

There is a really important difference between Diaspora’s social data and the Mongo-ideal TV show data that no one noticed at first.

With TV shows, each box in the relationship diagram is a different type. TV shows are different from seasons are different from episodes are different from reviews are different from cast members. None of them is even a sub-type of another type.

But with social data, some of the boxes in the relationship diagram are the same type. In fact, all of these green boxes are the same type — they are all Diaspora users.

A user has friends, and each friend may themselves be a user. Or, they may not, because it’s a distributed system. (That’s a whole layer of complexity that I’m just skipping for today.) In the same way, commenters and likers may also be users.

This type duplication makes it way harder to denormalize an activity stream into a single document. That’s because in different places in your document, you may be referring to the same concept — in this case, the same user. The user who liked that post in your activity stream may also be the user who commented on a different post.

Duplicate data Duplicate data

We can represent this in MongoDB in a couple of different ways. Duplication is any easy option. All the information for that friend is copied and saved to the like on the first post, and then a separate copy is saved to the comment on the second post. The advantage is that all the data is present everywhere you need it, and you can still pull the whole activity stream back as a single document.

Here’s what this kind of fully denormalized stream document looks like.

Here we have copies of user data inlined. This is Joe’s stream, and it has a copy of his user data, including his name and URL, at the top level. His stream, just underneath, contains Jane’s post. Joe has liked Jane’s post, so under likes for Jane’s post, we have a separate copy of Joe’s data.

You can see why this is attractive: all the data you need is already located where you need it.

You can also see why this is dangerous. Updating a user’s data means walking through all the activity streams that they appear in to change the data in all those different places. This is very error-prone, and often leads to inconsistent data and mysterious errors, particularly when dealing with deletions.

Is there no hope?

There is another approach you can take to this problem in MongoDB, which will more familiar if you have a relational background. Instead of duplicating user data, you can store references to users in the activity stream documents.

With this approach, instead of inlining this user data wherever you need it, you give each user an ID. Once users have IDs, we store the user’s ID every place that we were previously inlining data. New IDs are in green below.

Semi-normalized stream document with user IDs

MongoDB actually uses BSON IDs, which are strings sort of like GUIDs, but to make these samples easier to read I’m just using integers.

This eliminates our duplication problem. When user data changes, there’s only one document that gets rewritten. However, we’ve created a new problem for ourselves. Because we’ve moved some data out of the activity streams, we can no longer construct an activity stream from a single document. This is less efficient and more complex. Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars.

What’s missing from MongoDB is a SQL-style join operation, which is the ability to write one query that mashes together the activity stream and all the users that the stream references. Because MongoDB doesn’t have this ability, you end up manually doing that mashup in your application code, instead.

Simple Denormalized Data

Let’s return to TV shows for a second. The set of relationships for a TV show don’t have a lot of complexity. Because all the boxes in the relationship diagram are different entities, the entire query can be denormalized into one document with no duplication and no references. In this document database, there are no links between documents. It requires no joins.

On a social network, however, nothing is that self-contained. Any time you see something that looks like a name or a picture, you expect to be able to click on it and go see that user, their profile, and their posts. A TV show application doesn’t work that way. If you’re on season 1 episode 1 of Babylon 5, you don’t expect to be able to click through to season 1 episode 1 of General Hospital.

Don’t. Link. The. Documents.

Once we started doing ugly MongoDB joins manually in the Diaspora code, we knew it was the first sign of trouble. It was a sign that our data was actually relational, that there was value to that structure, and that we were going against the basic concept of a document data store.

Whether you’re duplicating critical data (ugh), or using references and doing joins in your application code (double ugh), when you have links between documents, you’ve outgrown MongoDB. When the MongoDB folks say “documents,” in many ways, they mean things you can print out on a piece of paper and hold. A document may have internal structure — headings and subheadings and paragraphs and footers — but it doesn’t link to other documents. It’s a self-contained piece of semi-structured data.

If your data looks like that, you’ve got documents. Congratulations! It’s a good use case for Mongo. But if there’s value in the links between documents, then you don’t actually have documents. MongoDB is not the right solution for you. It’s certainly not the right solution for social data, where links between documents are actually the most critical data in the system.

So social data isn’t document-oriented. Does that mean it’s actually…relational?

That Word Again

When people say “social data isn’t relational,” that’s not actually what they mean. They mean one of these two things:

1. “Conceptually, social data is more of a graph than a set of tables.”

This is absolutely true. But there are actually very few concepts in the world that are naturally modeled as normalized tables. We use that structure because it’s efficient, because it avoids duplication, and because when it does get slow, we know how to fix it.

2. “It’s faster to get all the data from a social query when it’s denormalized into a single document.”

This is also absolutely true. When your social data is in a relational store, you need a many-table join to extract the activity stream for a particular user, and that gets slow as your tables get bigger. However, we have a well-understood solution to this problem. It’s called caching.

At the All Your Base Conf in Oxford earlier this year, where I gave the talk version of this post, Neha Narula had a great talk about caching that I recommend you watch once it’s posted. In any case, caching in front of a normalized data store is a complex but well-understood problem. I’ve seen projects cache denormalized activity stream data into a document database like MongoDB, which makes retrieving that data much faster. The only problem they have then is cache invalidation.

“There are only two hard problems in computer science: cache invalidation and naming things.”

Phil Karlton

It turns out cache invalidation is actually pretty hard. Phil Karlton wrote most of SSL version 3, X11, and OpenGL, so he knows a thing or two about computer science.

Cache Invalidation As A Service

But what is cache invalidation, and why is it so hard?

Cache invalidation is just knowing when a piece of your cached data is out of date, and needs to be updated or replaced. Here’s a typical example that I see every day in web applications. We have a backing store, typically PostgreSQL or MySQL, and then in front of that we have a caching layer, typically Memcached or Redis. Requests to read a user’s activity stream go to the cache rather than the database directly, which makes them very fast.

Typical cache and backing store setup

Application writes are more complicated. Let’s say a user with two followers writes a new post. The first thing that happens (part 1) is that the post data is copied into the backing store. Once that completes, a background job (part 2) appends that post to the cached activity stream of both of the users who follow the author.

This pattern is quite common. Twitter holds recently-active users’ activity streams in an in-memory cache, which they append to when someone they follow posts something. Even smaller applications that employ some kind of activity stream will typically end up here (see: seven-table join).

Back to our example. When the author changes an existing post, the update process is essentially the same as for a create, except instead of appending to the cache, it updates an item that’s already there.

What happens if that step 2 background job fails partway through? Machines get rebooted, network cables get unplugged, applications restart. Instability is the only constant in our line of work. When that happens, you’ll end up with invalid data in your cache. Some copies of the post will have the old title, and some copies will have the new title. That’s a hard problem, but with a cache, there’s always the nuclear option.

Always an option >_<

You can always delete the entire activity stream record out of your cache and regenerate it from your consistent backing store. It may be slow, but at least it’s possible.

What if there is no backing store? What if you skip step 1? What if the cache is all you have?

When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state.

When Diaspora decided to store social data in MongoDB, we were conflating a database with a cache. Databases and caches are very different things. They have very different ideas about permanence, transience, duplication, references, data integrity, and speed.

The Conversion

Once we figured out that we had accidentally chosen a cache for our database, what did we do about it?

Well, that’s the million dollar question. But I’ve already answered the billion-dollar question. In this post I’ve talked about how we used MongoDB vs. how it was designed to be used. I’ve talked about it as though all that information were obvious, and the Diaspora team just failed to research adequately before choosing.

But this stuff wasn’t obvious at all. The MongoDB docs tell you what it’s good at, without emphasizing what it’s not good at. That’s natural. All projects do that. But as a result, it took us about six months, a lot of user complaints, and a lot of investigation to figure out that we were using MongoDB the wrong way.

There was nothing to do but take the data out of MongoDB and move it to a relational store, dealing as best we could with the inconsistent data we uncovered along the way. The data conversion itself — export from MongoDB, import to MySQL — was straightforward. For the mechanical details, you can see my slides from All Your Base Conf 2013.

The Damage

We had eight months of production data, which turned into about 1.2 million rows in MySQL. We spent four pair-weeks developing the code for the conversion, and when we pulled the trigger, the main site had about two hours of downtime. That was more than acceptable for a project that was in pre-alpha. We could have reduced that downtime more, but we had budgeted for eight hours of downtime, so two actually seemed fantastic.

NOT BAD

Epilogue

Remember that TV show application? It was the perfect use case for MongoDB. Each show was one document, perfectly self-contained. No references to anything, no duplication, and no way for the data to become inconsistent.

About three months into development, it was still humming along nicely on MongoDB. One Monday, at the weekly planning meeting, the client told us about a new feature that one of their investors wanted: when they were looking at the actors in an episode of a show, they wanted to be able to click on an actor’s name and see that person’s entire television career. They wanted a chronological listing of all of the episodes of all the different shows that actor had ever been in.

We stored each show as a document in MongoDB containing all of its nested information, including cast members. If the same actor appeared in two different episodes, even of the same show, their information was stored in both places. We had no way to tell, aside from comparing the names, whether they were the same person. So to implement this feature, we needed to search through every document to find and de-duplicate instances of the actor that the user clicked on. Ugh. At a minimum, we needed to de-dup them once, and then maintain an external index of actor information, which would have the same invalidation issues as any other cache.

You See Where This Is Going

The client expected this feature to be trivial. If the data had been in a relational store, it would have been. As it was, we first tried to convince the PM they didn’t need it. After that failed, we offered some cheaper alternatives, such as linking to an IMDB search for the actor’s name. The company made money from advertising, though, so they wanted users to stay on their site rather than going off to IMDB.

This feature request eventually prompted the project’s conversion to PostgreSQL. After a lot more conversation with the client, we realized that the business saw lots of value in linking TV shows together. They envisioned seeing other shows a particular director had been involved with, and episodes of other shows that were released the same week this one was, among other things.

This was ultimately a communication problem rather than a technical problem. If these conversations had happened sooner, if we had taken the time to really understand how the client saw the data and what they wanted to do with it, we probably would have done the conversion earlier, when there was less data, and it was easier.

Always Be Learning

I learned something from that experience: MongoDB’s ideal use case is even narrower than our television data. The only thing it’s good at is storing arbitrary pieces of JSON. “Arbitrary,” in this context, means that you don’t care at all what’s inside that JSON. You don’t even look. There is no schema, not even an implicit schema, as there was in our TV show data. Each document is just a blob whose interior you make absolutely no assumptions about.

At RubyConf this weekend, I ran into Conrad Irwin, who suggested this use case. He’s used MongoDB to store arbitrary bits of JSON that come from customers through an API. That’s reasonable. The CAP theorem doesn’t matter when your data is meaningless. But in interesting applications, your data isn’t meaningless.

I’ve heard many people talk about dropping MongoDB in to their web application as a replacement for MySQL or PostgreSQL. There are no circumstances under which that is a good idea. Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value. If you have an implicit schema — meaning, if there are things you are expecting in that JSON — then MongoDB is the wrong choice. I suggest taking a look at PostgreSQL’s hstore (now apparently faster than MongoDB anyway), and learning how to make schema changes. They really aren’t that hard, even in large tables.

Find The Value

When you’re picking a data store, the most important thing to understand is where in your data — and where in its connections — the business value lies. If you don’t know yet, which is perfectly reasonable, then choose something that won’t paint you into a corner. Pushing arbitrary JSON into your database sounds flexible, but true flexibility is easily adding the features your business needs.

Make the valuable things easy.

The End.

Thanks for reading! Let me sum up how I feel about comments on this post:

Graham Pritchard

November 14, 2013 at 10:16 am

Enjoyed this article very much. Not being an ‘expert’ in this field, I am personally grappling with how organisations can best manage data consolidation from multiple disconnects internal and web-based sources, utilise semantics & triple stores to deliver customer insight, etc

I have happened upon MarkLogic’s XML Platform whose functionality may be able to manage the scenario outlined. Would someone please confirm this.

If MarkLogic can do what I hope it can, then I wouldn’t suggest all rushing back to the relational motherland just yet

Tegiri Nenashi

November 14, 2013 at 10:44 am

Mongo DB is so last decade… The greatest NoSQL minds are currently working on NoSQL standard: CODASQL.

Jerome

November 14, 2013 at 3:33 pm

Did you have a look at Riak? I’m not saying that you should ditch away your RDBMS and use Riak instead, but it “feels” like a document store, except that you can have links between objects, and it’s (relatively) easy to write map/reduce queries to follow those links.

I.e. “start from user Alice, fetch all her friends older than 25, now go to the things that she “liked” if they are of type “movie”, then get all the actors in those movies that have “from country” Germany, and return the resulting set”.

I definitely recommend to have a look at it some day if you like the general idea of MongoDB but were put off by some of its shortcomings

Art S. Kagel

November 14, 2013 at 6:27 pm

I love this article. I have been at development and databases for 30 years and there is always the latest and greatest thing. Everyone thinks that they have invented something new. JSON is just tag/value records something we used in the 80s but set aside when real databases became common. Sara, I agree with your points and your approach. Make sure you understand your problem, ask probing questions, and choose your tools appropriately I think sums it up.

I would mention a fairly new alternative to using only a document store or only a relational database. Informix, one of the early enterprise quality relational databases, recently release a new version (v12.10.xC2) which supports JSON/BSON data as a native type within the relational database framework and fully supports all of the MongoDB APIs so that any application written to the MongoDB protocol can simply be pointed at the Informix server and it will just work. The application will gain all of the advantages of a relational database such as a consistent store, backup procedures, ACID compliance, enterprise quality replication, full sharding support, massive scaleability, fine grained locking and concurrency control, and more. On top of this the IBM engineers (Informix is now an IBM product) extended the JSON type to support documents up to 2GB in size (MongoDB limits documents to 16MB).

Most relevant to your sample projects and to most existing development environments that want to branch out into semi-structured data, Informix fully supports the integration of relational and JSON data for all clients – both MongoDB API and relational API clients. Relational tables can be queried directly and the engine will return JSON data to MongoDB clients. Even general SQL can be passed in using the “.sql.” method and the resulting tupples will be returned as a stream of JSON documents. Collections can be queried using SQL and users can elect to return either full JSON documents or to parse the documents and return relational columns instead. You can even join JSON collections to relational tables and the server supports relational integrity constraints and triggers on JSON collections.

This means that you can develop hybrid applications that user relational tables for what they are good for and collections for – that being data where the schema needs to be flexible, for rapid development, and for … well for documents. Informix isn’t open source, but there is a community supported completely free edition (Informix Innovator Edition) that startups and single developers and students.

Chris Travers

November 15, 2013 at 1:42 am

Great article. As I often tell people, DB’s like MongoDB, CouchDB etc. often have a place in high velocity, or high variability data pipelines…. as a way in or out of the RDBMS. In essence you can use these as a buffer for information pending processing, which can then be processed and inserted into a relational db for more flexibility. You can also use it as a caching layer on the way out, and one that you can share between processes if you need to.

However, the lessons of the past database wise have been hard-won and discarding those in favor of the shiny new tech is rarely wise here.

All this being said there are things that RDBMS DBA’s can take from the popularity of NoSQL. For example in the LedgerSMB project we seek to encapsulate our db behind stored procedures, using conventions which make them application-discoverable. This enables many of the best features of NoSQL db’s in terms while keeping the extensibility of the relational store.

Dennis

November 15, 2013 at 9:02 am

Thanks, entertaining reading. And I think you explained the situation clearly, so that even newbies could understand it.

However, as soon as you as you began describing the “tv_shows” database as an ideal use case for a document database, I thought “but what if the user wants to see all the shows an actor has been in? It is the exact same problem as with your social network.”

I’m glad to see that you acknowledged this later in the article, but I am really shocked that the team didn’t foresee that this feature would be wanted eventually.

I think the lesson really is, you probably aren’t going to foresee everything you want to do with your data up front, but if you chose a relational database you can have high confidence you’ll be able to overcome that.

Carlos

November 15, 2013 at 10:42 am

I believe this documents could be planned differently.
It is not forbidden to reference multiple document inside of MongoDB (relationships), but as mysql normalization optimizes storage based on datatypes (because that it’s natural constrain), schema less approaches optimize based on access and relationships minimization ( not absolute elimination ). Correct indexing and sharding strategies will make relationships even faster.

In some way schema less may even be more design intensive that the traditional relational design simply because there is not this absolute right way to store a given set of data but depends entirely on how you intend to access it.

I do enjoyed this post and it made me think about matters, despite me not agreeing with the conclusions I would still recommend it a good read.

Nils Wiersema

November 15, 2013 at 7:19 pm

Thanks for this fantastic article! Nice story about Diaspora and this read was well worth tonight’s Googling efforts because of my growing doubts about using MongoDB as the backing storage system for a game server. You know, a database with players, inventories and a big, persistent activity stream. All of this data has relations and is wrapped in a well-defined schema anyway. A document-oriented database is far from the ‘right tool for the job’ and we will consider PostgreSQL now.

However, I really liked the ease of inserting and modifying simple POJO’s into MongoDB. MongoDB was very attractive because it’s super easy & working ‘straight out of the box’, Jongo ORM (www.jongo.org) etc. If only Hibernate was this simple…

Bookmarked and will share with friends, thanks again!

Simon

November 16, 2013 at 9:59 am

Great article. I’ve been working in data for many years and I often find that even the simplest data problems turn out to be extremely complex when investigated in detail. I work mostly in BI and so am often thinking about the complex questions that we need to ask about data and it’s in those times that I realise that we need to differentiate between the use of relational principles to describe data and the practical use of an RDBMS. All data can be described in relational terms, and the analytical process of normalising data can be am extremely enlightening. However, the implementation choice could still be document-centric, key-value, RDBMS, graph based or others. All have their advantages and disadvantages, and the more we are open about the limitations of the tools we use (and this article is a great example of that) the better implementation choices we can make as result.

One of the most important roles we have responsibility for as IT professionals is in choosing the best tool for the job. Those who support specific tools with a near-religious zealotry do the industry no good whatsoever and can be the root cause of many failed projects and systems.

Just on a final point, we also have to be open-minded on how we combine tools – you suggest the use of different technologies for a data cache and for a transactional store. The reality is that today, we don’t have single tools that can do everything. The real skill that is in short supply today is that of the true IT architect. Someone who can break down the core business problem into its discrete technology components and then have a broad (and open minded) view on which tool is best for that component.

It seems to me that you are well on the way to fulfilling that role; IMH, you’re no longer ‘just’ a Ruby programmer.

lega

November 17, 2013 at 2:08 pm

> 2) retrieve all the user documents to fill in names and avatars.
You can store names and avatars on the client side (for a session or for a period of time), and you can fill the activity streams on the client side, so often you need an one query on the server side for this.

Gokul

November 19, 2013 at 2:14 am

Thanks for discussing in detail the exact conundrums where I spent a lot of time to decide when we decided to re-architect a system with document stores.

I gone in ‘hybrid’ direction where any unstructured data – before processing, where you are not worried about what it is – placed in the document store and then the processed structured data stored/mapped in the relational tables with references to the document.

This decision seemed to be intuitive at the time I started on the solution design, felt vindicated or strengthened with this article. Thanks again.

Skitted

November 19, 2013 at 2:23 am

Hello,
Awesome article and thank you for sharing this.
Regarding MongoDB I have to add that as for Diaspora the decission to go with MongoDB might have been wrong, its not wrong by “definition”. Choice of database is always a decision for pros and cons and depending on the amount of need for joins you might still be better of choosing MongoDB and having your app handle the joins. In case of 7 Join SQL statements it depends on to how many joins you can reduce it within Mongo and the speed differences. Regarding the problem with MongoDB beeing “Cache” I totally agree that they have to do something about it and inform about it how they improve on that side, but its also something a DBA should be aware about and which can be already be avoided by the features that mongodb currently supports.

Cheers,
Skitted

bseddon

November 19, 2013 at 4:34 am

A couple of year ago I became involved in a project a part of which was to be social. The developers had taken it upon themselves to use MongoDB because of the mantra you include: that social data is not relational, and that if you store it in a relational database, you’re doing it wrong. It was a disaster but the developers – all ‘web developers’ and none had formal training in computer science – chose MongoDB because of this type of hype. There are good use cases for MongoDB. Their flagship story at the time was it’s use at Craigslist. Not as a replacement for the main relational engine (MySQL) but as a backup store. Apparently users can recall past adverts which may be held in a backup store. The Craigslist database schema changes occasionally. It’s bad enough that it takes a long time to update the main database but the real killer was the changing the backups. By using MongoDB they could eliminate the need to change the schema of the backups. Perfect, was a great use case – but a boring work-a-day story, not social or exciting.

What really put me off using MongoDB was that they used an English major to justify why their application should be used. Relational databases are backed by a LOT of math theory and are implemented by some of the brightest engineers. When I’m looking for justification to use a technology, I want to hear it from someone who knows how to write great code and has a solid understanding of the underlying theory. It does not have to scan well or be poetic, it just has to convey a sound explanation of the benefit.

It’s easy to model hierarchical data in a relational database and you don’t do so by using separate tables for each of the levels in the hierarchy. It’s too inflexible. Hierarchical data within relational databases exists in every major (and most not-so-major) organization in the world. All that customer analytic stuff. Hierarchical and in vast quantities. But web developers don’t do back end systems so they’d never know. And god forbid they should ask someone.

Ivan

November 19, 2013 at 5:17 am

Hi Sarah,

Thanks for being brave to highlight an implementation that did not work out as planned.

Just wondering if anyone on the project had gone through the steps of fleshing out ERD or UML models of the data prior to database selection and code implementation?

On a personal opinion, I feel the latest generation of dev folks have forgotten the hard work of the late Edgar Codd.

Dave Johnson

November 19, 2013 at 6:46 am

Thank you for a very informative article. I recently read the book “Seven Databases in Seven Weeks” and MongoDB was one of the dbs discussed. The author writes:

“Mongo was built to scale out, not to run stand-alone. It was built for data consistency and partition tolerance, but sharding data has a cost. If one part of a collection is lost, the whole thing is compromised. What good is querying against a collection of countries that contains only the western hemisphere? Mongo deals with this implicit sharding weakness in a simple manner: duplication. You should rarely run a single Mongo instance in production but rather replicate the stored data across multiple services.”

That would help with the one of the data integrity issues you encountered.

Your points regarding relationships between documents, whether initially planned or later requested, really hit home with me. Thanks.

Where I work, there is a lot of interest in the NoSQL technologies. Most of our data is geospatial in nature. From what I have seen, a lot of the 40 years of progress made with rdbms’ has been ported to the NoSQL world. For geospatial, the available technologies still have some limitations. For point data used for situations like a cell phone user wanting to find all Starbucks within a 5 mile radius, MongoDb (and Solr) are great. It is easier to decide how to shard point data than it is to shard lines and polygons that cover varying geographic extents. Such division of data is best done after all of the data has been ingested, and redone after more data is added, typically against an off-line data store that can later be hot-swapped as the master.

Thanks again for the post. Very informative!

Paolo Umali

November 19, 2013 at 7:27 am

Call me noob but if I need document-style store, I will just have a json field in my MySQL table and store json in there. I’ll just make sure that if I need relationships, I don’t do a json field.

Keef66

November 19, 2013 at 5:31 pm

There is a particular format and schema, it is JSON.
So, by extension, all well-formed XML has a schema even if it doesn’t. Good one.

Anonymous Coward

November 20, 2013 at 2:59 am

Correct me if I’m wrong, but the TL;DR; to this post seems to me: “don’t use a non-relational store when you have relational data.” IMO that’s obvious.

Other than that, there _are_ relational stores provided as a service in the cloud. You don’t have to roll your own. (Then again, you don’t have to roll your own mongo either.)

Granted, some of the relational cloud services are dead slow compared to plain mysql deployed on a single machine, but that’s a cost you have to pay for huge data. OTOH, 1.2E6 rows doesn’t look like a lot to me – IME mysql can serve any data quite well as long as all indexes fit in mem.

What I see a problem here is that your data doesn’t seem to be relational in a classical sense at all – I mean, it is heavily relational, but potentially without a strict schema. Therefore, for an even better fit I’d look for an object database, rather than stay with mysql – an object database might yield an even better fit. (You know, I absolutely dread schema updates in production.)

Leander Conradie

November 20, 2013 at 3:54 am

Right on.

Document databases are not new kids on the block. Lotus Notes did this in the early 90’s and Notes’ lead designer was instrumental in the design of CouchDb, Mongo’s main competitor. After maintaining a system for a few years which was developed in Notes I came to realize that there is actual value in the discipline and structure that a schema imposes on you. The problem is that in the absence of a formal schema, the “schema” is an emergent property which is reflected in the code used to manipulate the database, and with years of cruft and customer requests and changes it becomes a difficult mess. There is no clear specification of what is stores where, it is all a bit wishy washy. This leads to a big unmaintainable mess, link rot and a maintenance nightmare.

Don’t underestimate the years and years of development that went into query optimization for relational databases. If you think your ruby/VB/Python/PHP script mashing up the data is going to outgun a seven-table join just get off your high horse. You are competing with millions of man-hours that have gone into the development of the current relational databases as they stand now. There is no way you are going to outgun this. Read a book about database implementation and query optimization to see what really goes on under the good of a RDBMS. Jeffery Ullman’s Principle of Data and knowledge base Systems, Volume II, Chapter 1 might be a good eye opener.

Normalization is not a curse. It is a way to atomize the data into its basic constituents. Reassembling the data into the form you want with a join is perhaps inefficient at run time but it also means that your queries do not suffer from the Tyrrany of the Dominant Decomposition.

This was a good article, and it is good to see that someone points out that Mongo (and CouchDB and Riak and a lot of other things) are not be-all and end-all tools. Polyglot is way forward. You use multiple data structures in the average program, because they are optimized for different use cases. Why not do the same for you datastore and cache?

Julia Sheehy

November 20, 2013 at 9:02 am

Absolutely BRILLIANT. Thank you, thank you, thank you, thank you, thank you! It would take me ages* to have articulated this to the shiny-new-tool-obsessed VP who thankfully has left and taken most of the the bleeding-edge-technology- addiction with that departure. There are best-use cases for every tool, just like there are exceptional gadgets in many kitchens that serve a select few needs. I can see why MongoDb held the spotlight a while, but even as a novice (only seven years “experience”) I can see why it is so limited, and why I was well advised to avoid taking a contract to update and maintain a MongoDb application abandoned by the original author. Whew! Gonna go thank that person now.

*Even with a failed MongoDB based legacy application everyone in the department opted to let die and rewrite in PHP with an Oracle back end.

Bob Grommes

November 21, 2013 at 6:26 am

I have followed NoSql in general and MongoDB in particular with great interest, but the thing that keeps holding me back is the experience of your TV show database. You can never reliably anticipate all possible future value that may be discovered even within a pie of JSON that seemingly has no external value in its relationships. How can you know that you won’t encounter a client request that they rightly perceive as trivial, only to discover that your use of a document database has rendered it a Manhattan project to make it happen?

I am not really afraid of 7 table joins, I do them every day in applications that I currently support. SQL is quite efficient and highly tunable.

Sarah Mei