Disclaimer: I do not build database engines. I build web applications. I run 4-6 different projects every year, so I build a lot of web applications. I see apps with different requirements and different data storage needs. I’ve deployed most of the data stores you’ve heard about, and a few that you probably haven’t.
I’ve picked the wrong one a few times. This is a story about one of those times — why we picked it originally, how we discovered it was wrong, and how we recovered. It all happened on an open source project called Diaspora.
The project
Diaspora is a distributed social network with a long history. Waaaaay back in early 2010, four undergraduates from New York University made a Kickstarter video asking for $10,000 to spend the summer building a distributed alternative to Facebook. They sent it out to friends and family, and hoped for the best.
But they hit a nerve. There had just been another Facebook privacy scandal, and when the dust settled on their Kickstarter, they had raised over $200,000 from 6400 different people for a software project that didn’t yet have a single line of code written.
Diaspora was the first Kickstarter project to vastly overrun its goal. As a result, they got written up in the New York Times – which turned into a bit of a scandal, because the chalkboard in the backdrop of the team photo had a dirty joke written on it, and no one noticed until it was actually printed. In the NEW YORK TIMES. The fallout from that was actually how I first heard about the project.
As a result of their Kickstarter success, the guys left school and came out to San Francisco to start writing code. They ended up in my office. I was working at Pivotal Labs at the time, and one of the guys’ older brothers also worked there, so Pivotal offered them free desk space, internet, and, of course, access to the beer fridge. I worked with official clients during the day, then hung out with them after work and contributed code on weekends.
They ended up staying at Pivotal for more than two years. By the end of that first summer, though, they already had a minimal but working (for some definition) implementation of a distributed social network built in Ruby on Rails and backed by MongoDB.
That’s a lot of buzzwords. Let’s break it down.
“Distributed social network”
If you’ve seen the Social Network, you know everything you need to know about Facebook. It’s a web app, it runs on a single logical server, and it lets you stay in touch with people. Once you log in, Diaspora’s interface looks structurally similar to Facebook’s:
There’s a feed in the middle showing all your friends’ posts, and some other random stuff along the sides that no one has ever looked at. The main technical difference between Diaspora and Facebook is invisible to end users: it’s the “distributed” part.
The Diaspora infrastructure is not located behind a single web address. There are hundreds of independent Diaspora servers. The code is open source, so if you want to, you can stand up your own server. Each server, called a pod, has its own database and its own set of users, and will interoperate with all the other Diaspora pods that each have their own database and set of users.
Each pod communicates with the others through an HTTP-based API. Once you set up an account on a pod, it’ll be pretty boring until you follow some other people. You can follow other users on your pod, and you can also follow people who are users on other pods. When someone you follow on another pod posts an update, here’s what happens:
1. The update goes into the author’s pod’s database.
2. Your pod is notified over the API.
3. The update is saved in your pod’s database.
4. You look at your activity feed and see that post mixed in with posts from the other people you follow.
Comments work the same way. On any single post, some comments might be from people on the same pod as the post’s author, and some might be from people on other pods. Everyone who has permission to see the post sees all the comments, just as you would expect if everyone were on a single logical server.
Who cares?
There are technical and legal advantages to this architecture. The main technical advantage is fault tolerance.
If any one of the pods goes down, it doesn’t bring the others down. The system survives, and even expects, network partitioning. There are some interesting political implications to that — for example, if you’re in a country that shuts down outgoing internet to prevent access to Facebook and Twitter, your pod running locally still connects you to other people within your country, even though nothing outside is accessible.
The main legal advantage is server independence. Each pod is a legally separate entity, governed by the laws of wherever it’s set up. Each pod also sets their own terms of service. On most of them, you can post content without giving up your rights to it, unlike on Facebook. Diaspora is free software both in the “gratis” and the “libre” sense of the term, and most of the people who run pods care deeply about that sort of thing.
So that’s the architecture of the system. Let’s look at the architecture within a single pod.
It’s a Rails app.
Each pod is a Ruby on Rails web application backed by a database, originally MongoDB. In some ways the codebase is a ‘typical’ Rails app — it has both a visual and programmatic UI, some Ruby code, and a database. But in other ways it is anything but typical.
The visual UI is of course how website users interact with Diaspora. The API is used by various Diaspora mobile clients — that part’s pretty typical — but it’s also used for “federation,” which is the technical name for inter-pod communication. (I asked where the Romulans’ access point was once, and got a bunch of blank looks. Sigh.) So the distributed nature of the system adds layers to the codebase that aren’t present in a typical app.
And of course, MongoDB is an atypical choice for data storage. The vast majority of Rails applications are backed by PostgreSQL or (less often these days) MySQL.
So that’s the code. Let’s consider what kind of data we’re storing.
I Do Not Think That Word Means What You Think That Means
“Social data” is information about our network of friends, their friends, and their activity. Conceptually, we do think about it as a network — an undirected graph in which we are in the center, and our friends radiate out around us.
When we store social data, we’re storing that graph topology, as well as the activity that moves along those edges.
For quite a few years now, the received wisdom has been that social data is not relational, and that if you store it in a relational database, you’re doing it wrong.
But what are the alternatives? Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production. Other folks say that document databases are perfect for social data, and those are mainstream enough to actually be used. So let’s look at why people think social data fits more naturally in MongoDB than in PostgreSQL.
How MongoDB Stores Data
MongoDB is a document-oriented database. Instead of storing your data in tables made out of individual rows, like a relational database does, it stores your data in collections made out of individual documents. In MongoDB, a document is a big JSON blob with no particular format or schema.
Let’s say you have a set of relationships like this that you need to model. This is quite similar to a project that come through Pivotal that used MongoDB, and was the best use case I’ve ever seen for a document database.
At the root, we have a set of TV shows. Each show has many seasons, each season has many episodes, and each episode has many reviews and many cast members. When users come into this site, typically they go directly to the page for a particular TV show. On that page they see all the seasons and all the episodes and all the reviews and all the cast members from that show, all on one page. So from the application perspective, when the user visits a page, we want to retrieve all of the information connected to that TV show.
There are a number of ways you could model this data. In a typical relational store, each of these boxes would be a table. You’d have a tv_shows
table, a seasons
table with a foreign key into tv_shows
, an episodes
table with a foreign key into seasons
, and reviews
and cast_members
tables with foreign keys into episodes
. So to get all the information for a TV show, you’re looking at a five-table join.
We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.
Here’s an example document for one TV show, Babylon 5.
It’s got some title metadata, and then it’s got an array of seasons. Each season is itself a hash with metadata and an array of episodes. In turn, each episode has some metadata and arrays for both reviews and cast members.
It’s basically a huge fractal data structure.
All of the data we need for a TV show is under one document, so it’s very fast to retrieve all this information at once, even if the document is very large. There’s a TV show here in the US called “General Hospital” that has aired over 12,000 episodes over the course of 50+ seasons. On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes, while retrieval of the equivalent document by ID in MongoDB takes a fraction of a second.
So in many ways, this application presented the ideal use case for a document store.
Ok. But what about social data?
Right. When you come to a social networking site, there’s only one important part of the page: your activity stream. The activity stream query gets all of the posts from the people you follow, ordered by most recent. Each of those posts have nested information within them, such as photos, likes, reshares, and comments.
The nested structure of activity stream data looks very similar to what we were looking at with the TV shows.
Users have friends, friends have posts, posts have comments and likes, each comment has one commenter and each like has one liker. Relationship-wise, it’s not a whole lot more complicated than TV shows. And just like with TV shows, we want to pull all this data at once, right after the user logs in. Furthermore, in a relational store, with the data fully normalized, it would be a seven-table join to get everything out.
Seven-table joins. Ugh. Suddenly storing each user’s activity stream as one big denormalized nested data structure, rather than doing that join every time, seems pretty attractive.
In 2010, when the Diaspora team was making this decision, Etsy’s articles about using document stores were quite influential, although they’ve since publicly moved away from MongoDB for data storage. Likewise, at the time, Facebook’s Cassandra was also stirring up a lot of conversation about leaving relational databases. Diaspora chose MongoDB for their social data in this zeitgeist. It was not an unreasonable choice at the time, given the information they had.
What could possibly go wrong?
There is a really important difference between Diaspora’s social data and the Mongo-ideal TV show data that no one noticed at first.
With TV shows, each box in the relationship diagram is a different type. TV shows are different from seasons are different from episodes are different from reviews are different from cast members. None of them is even a sub-type of another type.
But with social data, some of the boxes in the relationship diagram are the same type. In fact, all of these green boxes are the same type — they are all Diaspora users.
A user has friends, and each friend may themselves be a user. Or, they may not, because it’s a distributed system. (That’s a whole layer of complexity that I’m just skipping for today.) In the same way, commenters and likers may also be users.
This type duplication makes it way harder to denormalize an activity stream into a single document. That’s because in different places in your document, you may be referring to the same concept — in this case, the same user. The user who liked that post in your activity stream may also be the user who commented on a different post.
Duplicate data Duplicate data
We can represent this in MongoDB in a couple of different ways. Duplication is any easy option. All the information for that friend is copied and saved to the like on the first post, and then a separate copy is saved to the comment on the second post. The advantage is that all the data is present everywhere you need it, and you can still pull the whole activity stream back as a single document.
Here’s what this kind of fully denormalized stream document looks like.
Here we have copies of user data inlined. This is Joe’s stream, and it has a copy of his user data, including his name and URL, at the top level. His stream, just underneath, contains Jane’s post. Joe has liked Jane’s post, so under likes for Jane’s post, we have a separate copy of Joe’s data.
You can see why this is attractive: all the data you need is already located where you need it.
You can also see why this is dangerous. Updating a user’s data means walking through all the activity streams that they appear in to change the data in all those different places. This is very error-prone, and often leads to inconsistent data and mysterious errors, particularly when dealing with deletions.
Is there no hope?
There is another approach you can take to this problem in MongoDB, which will more familiar if you have a relational background. Instead of duplicating user data, you can store references to users in the activity stream documents.
With this approach, instead of inlining this user data wherever you need it, you give each user an ID. Once users have IDs, we store the user’s ID every place that we were previously inlining data. New IDs are in green below.
This eliminates our duplication problem. When user data changes, there’s only one document that gets rewritten. However, we’ve created a new problem for ourselves. Because we’ve moved some data out of the activity streams, we can no longer construct an activity stream from a single document. This is less efficient and more complex. Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars.
What’s missing from MongoDB is a SQL-style join operation, which is the ability to write one query that mashes together the activity stream and all the users that the stream references. Because MongoDB doesn’t have this ability, you end up manually doing that mashup in your application code, instead.
Simple Denormalized Data
Let’s return to TV shows for a second. The set of relationships for a TV show don’t have a lot of complexity. Because all the boxes in the relationship diagram are different entities, the entire query can be denormalized into one document with no duplication and no references. In this document database, there are no links between documents. It requires no joins.
On a social network, however, nothing is that self-contained. Any time you see something that looks like a name or a picture, you expect to be able to click on it and go see that user, their profile, and their posts. A TV show application doesn’t work that way. If you’re on season 1 episode 1 of Babylon 5, you don’t expect to be able to click through to season 1 episode 1 of General Hospital.
Don’t. Link. The. Documents.
Once we started doing ugly MongoDB joins manually in the Diaspora code, we knew it was the first sign of trouble. It was a sign that our data was actually relational, that there was value to that structure, and that we were going against the basic concept of a document data store.
Whether you’re duplicating critical data (ugh), or using references and doing joins in your application code (double ugh), when you have links between documents, you’ve outgrown MongoDB. When the MongoDB folks say “documents,” in many ways, they mean things you can print out on a piece of paper and hold. A document may have internal structure — headings and subheadings and paragraphs and footers — but it doesn’t link to other documents. It’s a self-contained piece of semi-structured data.
If your data looks like that, you’ve got documents. Congratulations! It’s a good use case for Mongo. But if there’s value in the links between documents, then you don’t actually have documents. MongoDB is not the right solution for you. It’s certainly not the right solution for social data, where links between documents are actually the most critical data in the system.
So social data isn’t document-oriented. Does that mean it’s actually…relational?
That Word Again
When people say “social data isn’t relational,” that’s not actually what they mean. They mean one of these two things:
1. “Conceptually, social data is more of a graph than a set of tables.”
This is absolutely true. But there are actually very few concepts in the world that are naturally modeled as normalized tables. We use that structure because it’s efficient, because it avoids duplication, and because when it does get slow, we know how to fix it.
2. “It’s faster to get all the data from a social query when it’s denormalized into a single document.”
This is also absolutely true. When your social data is in a relational store, you need a many-table join to extract the activity stream for a particular user, and that gets slow as your tables get bigger. However, we have a well-understood solution to this problem. It’s called caching.
At the All Your Base Conf in Oxford earlier this year, where I gave the talk version of this post, Neha Narula had a great talk about caching that I recommend you watch once it’s posted. In any case, caching in front of a normalized data store is a complex but well-understood problem. I’ve seen projects cache denormalized activity stream data into a document database like MongoDB, which makes retrieving that data much faster. The only problem they have then is cache invalidation.
“There are only two hard problems in computer science: cache invalidation and naming things.”
Phil Karlton
It turns out cache invalidation is actually pretty hard. Phil Karlton wrote most of SSL version 3, X11, and OpenGL, so he knows a thing or two about computer science.
Cache Invalidation As A Service
But what is cache invalidation, and why is it so hard?
Cache invalidation is just knowing when a piece of your cached data is out of date, and needs to be updated or replaced. Here’s a typical example that I see every day in web applications. We have a backing store, typically PostgreSQL or MySQL, and then in front of that we have a caching layer, typically Memcached or Redis. Requests to read a user’s activity stream go to the cache rather than the database directly, which makes them very fast.
Application writes are more complicated. Let’s say a user with two followers writes a new post. The first thing that happens (part 1) is that the post data is copied into the backing store. Once that completes, a background job (part 2) appends that post to the cached activity stream of both of the users who follow the author.
This pattern is quite common. Twitter holds recently-active users’ activity streams in an in-memory cache, which they append to when someone they follow posts something. Even smaller applications that employ some kind of activity stream will typically end up here (see: seven-table join).
Back to our example. When the author changes an existing post, the update process is essentially the same as for a create, except instead of appending to the cache, it updates an item that’s already there.
What happens if that step 2 background job fails partway through? Machines get rebooted, network cables get unplugged, applications restart. Instability is the only constant in our line of work. When that happens, you’ll end up with invalid data in your cache. Some copies of the post will have the old title, and some copies will have the new title. That’s a hard problem, but with a cache, there’s always the nuclear option.
You can always delete the entire activity stream record out of your cache and regenerate it from your consistent backing store. It may be slow, but at least it’s possible.
What if there is no backing store? What if you skip step 1? What if the cache is all you have?
When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state.
When Diaspora decided to store social data in MongoDB, we were conflating a database with a cache. Databases and caches are very different things. They have very different ideas about permanence, transience, duplication, references, data integrity, and speed.
The Conversion
Once we figured out that we had accidentally chosen a cache for our database, what did we do about it?
Well, that’s the million dollar question. But I’ve already answered the billion-dollar question. In this post I’ve talked about how we used MongoDB vs. how it was designed to be used. I’ve talked about it as though all that information were obvious, and the Diaspora team just failed to research adequately before choosing.
But this stuff wasn’t obvious at all. The MongoDB docs tell you what it’s good at, without emphasizing what it’s not good at. That’s natural. All projects do that. But as a result, it took us about six months, a lot of user complaints, and a lot of investigation to figure out that we were using MongoDB the wrong way.
There was nothing to do but take the data out of MongoDB and move it to a relational store, dealing as best we could with the inconsistent data we uncovered along the way. The data conversion itself — export from MongoDB, import to MySQL — was straightforward. For the mechanical details, you can see my slides from All Your Base Conf 2013.
The Damage
We had eight months of production data, which turned into about 1.2 million rows in MySQL. We spent four pair-weeks developing the code for the conversion, and when we pulled the trigger, the main site had about two hours of downtime. That was more than acceptable for a project that was in pre-alpha. We could have reduced that downtime more, but we had budgeted for eight hours of downtime, so two actually seemed fantastic.
Epilogue
Remember that TV show application? It was the perfect use case for MongoDB. Each show was one document, perfectly self-contained. No references to anything, no duplication, and no way for the data to become inconsistent.
About three months into development, it was still humming along nicely on MongoDB. One Monday, at the weekly planning meeting, the client told us about a new feature that one of their investors wanted: when they were looking at the actors in an episode of a show, they wanted to be able to click on an actor’s name and see that person’s entire television career. They wanted a chronological listing of all of the episodes of all the different shows that actor had ever been in.
We stored each show as a document in MongoDB containing all of its nested information, including cast members. If the same actor appeared in two different episodes, even of the same show, their information was stored in both places. We had no way to tell, aside from comparing the names, whether they were the same person. So to implement this feature, we needed to search through every document to find and de-duplicate instances of the actor that the user clicked on. Ugh. At a minimum, we needed to de-dup them once, and then maintain an external index of actor information, which would have the same invalidation issues as any other cache.
You See Where This Is Going
The client expected this feature to be trivial. If the data had been in a relational store, it would have been. As it was, we first tried to convince the PM they didn’t need it. After that failed, we offered some cheaper alternatives, such as linking to an IMDB search for the actor’s name. The company made money from advertising, though, so they wanted users to stay on their site rather than going off to IMDB.
This feature request eventually prompted the project’s conversion to PostgreSQL. After a lot more conversation with the client, we realized that the business saw lots of value in linking TV shows together. They envisioned seeing other shows a particular director had been involved with, and episodes of other shows that were released the same week this one was, among other things.
This was ultimately a communication problem rather than a technical problem. If these conversations had happened sooner, if we had taken the time to really understand how the client saw the data and what they wanted to do with it, we probably would have done the conversion earlier, when there was less data, and it was easier.
Always Be Learning
I learned something from that experience: MongoDB’s ideal use case is even narrower than our television data. The only thing it’s good at is storing arbitrary pieces of JSON. “Arbitrary,” in this context, means that you don’t care at all what’s inside that JSON. You don’t even look. There is no schema, not even an implicit schema, as there was in our TV show data. Each document is just a blob whose interior you make absolutely no assumptions about.
At RubyConf this weekend, I ran into Conrad Irwin, who suggested this use case. He’s used MongoDB to store arbitrary bits of JSON that come from customers through an API. That’s reasonable. The CAP theorem doesn’t matter when your data is meaningless. But in interesting applications, your data isn’t meaningless.
I’ve heard many people talk about dropping MongoDB in to their web application as a replacement for MySQL or PostgreSQL. There are no circumstances under which that is a good idea. Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value. If you have an implicit schema — meaning, if there are things you are expecting in that JSON — then MongoDB is the wrong choice. I suggest taking a look at PostgreSQL’s hstore (now apparently faster than MongoDB anyway), and learning how to make schema changes. They really aren’t that hard, even in large tables.
Find The Value
When you’re picking a data store, the most important thing to understand is where in your data — and where in its connections — the business value lies. If you don’t know yet, which is perfectly reasonable, then choose something that won’t paint you into a corner. Pushing arbitrary JSON into your database sounds flexible, but true flexibility is easily adding the features your business needs.
Make the valuable things easy.
The End.
Thanks for reading! Let me sum up how I feel about comments on this post:
Disclaimer: I don’t have mongo in production, so I’m not aware of its prod runtime characteristics of certain features.
Reading the TV search example, I’m curious if you could expand there. It doesn’t mention the aggregation framework or map reduce, which are two features I understand mongodb provides for searching such data. Were they tried but slow, or just too hard to wrap heads around, or…?
Thank you for sharing this bit, it’s thought provoking.
The title here is pretty dishonest. It should be: “Why you should never use a document store for relational data” or: “Why social network data is relational data”.
The amount of detail you go into is good, however, when you step back, everything you say is patently obvious to anyone in the data field. If anything, this article does do a service in that it shows how being a developer has no immediate correlation to understanding the domain of data.
When you choose the wrong tool for the job, you should not be surprised that it doesn’t work out. I disagree that Mongo’s downsides are not readily apparent. Lack of relations has always been known about. It seems that the developers were given a chunk of money far beyond their dreams and fell prey to the “lets do what’s trendy” attitude figuring they could bankroll their way out of any technical obstacles. Choosing a document store for data that is, even under the most cursory examination, heavily relational is poor a design decision. While you did mention that it was a bad design decision, this is minimized by your twisting the argument around in to tool bashing.
Sure it’s an ego check to realize that wrong choices were made, and that those wrong choices lead to many hours of redesign and lost data. This is not Mongo’s fault. I’m not sure if it’s just lazy lack of research, poor testing or if it’s just easier to blame someone/something else but it’s unprofessional to shift blame in anyway from the people who made the choice.
To say that there’s no circumstances where using MongoDB is a good idea seems to be more of the same where ego is causing a lashing out. Do people promote the use MongoDB (or other document stores) for workloads that they shouldn’t? Absolutely, but people also promote relational databases for workloads they are ill suited for and don’t work upon scale. You’ll find people who will promote all manner of things that are poorly suited to solve the problem at hand. This is not new, and does not absolve one of the responsibility of due diligence.
Lastly, I would submit that MySQL is very likely also going to become a poor choice as the data scales. A laying of multiple technologies including a graph database would probably more suited to the type of data in a social site and tracking all the various interconnected bits of data.
Thanks for sharing such a detailed write-up with examples and correct use cases, Sarah.
I have already used MongoDB in two of my recent web applications.
After reading through this article, I realize that Mongo totally doesn’t fit the use-case in one of my apps.
I have faced similar issues with GROUPS and JOINS but tried to apply some workarounds.
I will be making a switch to PostgreSQL.
> I’ve heard many people talk about dropping MongoDB in to their web application as a replacement for MySQL or PostgreSQL.
Sometimes people do and say wrong things: you only replace MySQL with MongoDB when this makes sense to your project.
> There are no circumstances under which that is a good idea.
This statement is false
> Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value.
I wouldn’t say that: the structure of your data is relevant even in NoSQL db’s
Looks like you wanted to use MongoDB as a common relational database. The error was yours…
Please check out this presentation done by Craigslist:
http://www.percona.com/live/mysql-conference-2012/sessions/living-sql-and-nosql-craigslist-pragmatic-approach
Very interesting, thanks a lot for the post. I have thought a lot about using MongoDB (or equivalent) for my web apps and always found it weird to duplicate data. I thought: why? And why can’t I easily reference other documents? Every book can do that? Finally this always led me back to use MySQL (or equivalent). I always thought I didn’t understand NoSQL and maybe I still don’t. But this post proved me my thinking was not completely wrong.
The right tool for the right job.
I’ve been using mongodb for years in a HUGE db what mysql couldn’t stand, why critizing mongodb is so popular?
This is a great war story about choosing between relational and document-based DBs. I’ve run into micro-versions of this on side projects. Fortunately, I’ve never had to migrate gigs of “real” data from one type of DB to another. Thanks for sharing!
Brendon, the takeaway from what I saw/heard at Mongo Days Seattle last week is that the aggregation framework is a little too slow to use for user-queries, and that the map reduce implementation is far, far too slow for it.
Really interesting article as I have a similar social network based project in early stages that I’m looking for the best suitable database to build it on.
I was about to go with MongoDB that I now have some doubts in it. I also took a look at but yet not sure.
I’m not a fan of MongoDB for other reasons, but you seem to be saying that it’s impossible to relate or de-duplicate items among servers which are generating document-oriented records. Surely you can do this with UUIDs? https://en.wikipedia.org/wiki/Universally_unique_identifier
Possibly, if users have “home” servers, you could simply use the URL? http://myserver.diaspora.org/user/12345, for example.
Dealing with updates is nearly impossible in any stream oriented service, but you could simply have a different object type which represented an “update” to an original item, and viewers would fix them up accordingly. It would be wise to include the original (and history) with every update, though, since some viewers won’t get the original.
Agreed, also love the JSON support in PostgreSQL + less learning involved for people already used to MySQL or PostgreSQL
Really interesting read. Don’t know if I agree with your conclusion on MongoDB – there seem to be a few steps missing – but I found the background of Diaspora very very interesting. Thanks
You can definitely maintain consistent and durable data on mongodb. V1.8 (spring 2011) and beyond have journaling you should be able to pull plug from wall and recover automatically and quickly.
Also I think some of the graph products are pretty robust these days, although personally I would only gravitate there if my real world data was innately a graph.
A couple points…
* For the Java developers out there, Spring Data’s MongoTemplate has an @DBRef annotation which abstracts all the relational references away from your code, and it just…works. You don’t need to roll your own code for this.
* Mongo read performance is good enough from what I’ve seen that not having JOIN support doesn’t seem to be a dealbreaker to me. You can normalize your data and still get it out in times close enough to PostgreSQL. And if even PostgreSQL requires an in-memory cache for scaling your app then what’s it matter?
* You can always choose the best of both worlds if your priority is read performance and you model your data intelligently up front. As a rule, don’t embed objects which can stand alone inside other objects. I don’t see this attitude as counter to Mongo’s principles. If read performance degrades, you can later de-normalize, write stand-alone object to their own collections and also embed them in appropriate parent objects in other collections. It’s much harder to refactor embedded objects out than it is to refactor new ones in.
NEVER is a word that should be used sparingly. Ok, you learned the hard way that relational databases still have relevant features. Does that mean JSON data stores are never useful? We use both – and hopefully now you’ve learned when to use the best combination of tools for the job.
I’m in agreement that MongoDB is a poor fit for large, highly-relational Rails apps. Full disclosure, I’ve written and presented about the topic of MongoDB and Developer Happiness:
http://luigimontanez.com/2012/developer-happiness-mongodb/
To counterbalance the positive view in the blog post, I gave some “When Not To Use” tips the last time I gave the talk, which was back in the summer:
https://www.youtube.com/watch?v=ch07bP0WOOc&t=27m12s
I think I echoed lots of the points you made here.
But I do think that MongoDB is valid for more use cases beyond storing arbitrary JSON. If the app is kept small and focused, possibly acting as a service in a larger SOA architecture, then the flexibility of MongoDB is a net win since the problems with JOINs don’t really come up. I know all too well that Rails apps have the tendency to grow into monoliths, so a bit of discipline is required there.
Awesome article. Great simplistic explanation of different data storage techniques, as well as the inside scoop on MongoDB. Very much enjoyed this!
There is something I don’t get, it seems that you started doing the joins INSIDE the Rails App, why not have used MongoDB’s internal map-reduce system (OK, in that case that would have been more of map-expand thing, but still), you would have used MongoDB’s power to do that, it’s really fast and efficiant. I’m actually doing that to avoid duplicates in my documents, it works like a charm.
Or I maybe missed something.
“You Should Never Use MongoDB” it seems as a pretty harsh title. Thanks for sharing your experience and expensive mistakes. Though for me it seems the conclusion is really wrong.
It would be more appropriate to say that MongoDB should be (avoided/not used/carefully considered) in cases when you have a lot of many-to-many relations. In the example of the social network MongDB seems totally off, however:
1) With the TV show a more appropriate approach would have been to take it a bit easy with the duplications and instead of a full duplication of the actor data, you could have used a list/array with actor ids. So taking the information about a tv show would need 2 DB queries: 1. “Get me the TV show data” 2. Get me all the actors with ads in [1,2,…n].
2) Or… if you wish duplicate the data, but still keep an actor identifier and a centralized place where you have a 100% reliable data for that actor (after all actors, don’t change there names or birth date, that often …) a document with actors and all kind of info about them, and in every TV show a list like [{id: 1, name: Some Name”, {id:2, name: Another name},…]
Of course it’s always easy to point someone’s mistake from the distance of time and not that easy to admit your own, thanks for sharing!
For everyone reading this article and saying/thinking “MongoDB is total sh*t”, please consider this as a really good example what should not be done! Take a good care researching the tools you are about to use! Are you killing a fly with a rifle? Or going bear hunting with just a knife? Both can be dangerous.
What a great blog post! Too many critics/fanboys (fangirls?) love to rush to the defense of their beloved stack, but rarely have any real-world examples to back up their claims. This is really a great testimonial on making a concerted effort and being honest about the failures.
Thank you for posting this article, it surely was an eye-opener! I was actually surprised to see you convert to MySQL instead of PostgreSQL for Diaspora, would you mind to elaborate on the background of that choice?
Your criticisms in this article are of the DBMS. I’m not so sure how warranted they are. When I saw how you modeled data, I cringed. Throwing an entire denormalized feed inside one object isn’t a great idea. (What happens when the feed gets too big? Mongo doesn’t store objects > 16 MB) Mongo supports strong consistency, so updates to each object are going to lock the entire database! Yuck!
It doesn’t sound like you vetted out data models as much as you should have. You seemed very dead set on one way of doing things. Why not put every post in a collection called feed, and mess with the indexes so that each find query is blazing fast? Why not make a separate users table? Denormalize does *not* mean put all your data in one collection and duplicate it within on monolithic object.
I could see Cassandra solving your problem as well. This isn’t a Mongo issue.
“Why you should never use MongoDB” doesn’t seem like an apt title to your article. There are more advantages to having a semi-structured data store than you mentioned. Mongo is better than storing arbitrary blobs – you can query them! You can have interesting indexes to speed up queries!
What about a hybrid approach? In our app we have trips that are completely independent of each other (the only way to ‘link’ a trip with another is via a tag or if a user is common between them but you wouldn’t logically navigate trips like that) but there ARE relationships between our users. So we’d store trips in Mongo for easy/fast access but maintain the relationships in a conventional RDBMS?
Thanks for this article! I’m developing a web app with my favorite tools (python+postgresql+stored procedures+redis+…) and this gave me an idea for depends checking on cached items. I’ll be coding that up after I submit this.
Total aside, I went to college with Phil Karlton’s son, David, which lead eventually to my working at Netscape with both of them. I had been to their home several times, he was a great guy (his wife, Jan, was also great). Since this is the second article I’ve seen posting Phil’s quote in as many days, I thought I might mention this old memorial link… http://karlton.hamilton.com/ .
Graph databases are not dead. http://thinkaurelius.github.io/titan/ built with a KV backend
MongoDB serves one very useful purpose– it’s the idea litmus test to see if a project is engineered by competent people or by non-engineers who think they are “hackers” and who pick technologies out of laziness or because they are “cool”.
When I see a project, or a developer, who uses or advocates MongoDB, then I know they are incompetent.
The base requirement of a competent engineer is the ability to know how to pick the correct technology. MongoDB is never the correct technology (because it is not properly distributed, it is not properly engineered, it’s written by incompetents.) Pick Riak or couchbase or several others if you want distributed. Build a massive company on MySQL, and you won’t get condemnation from me. (even though I don’t like MySQL, it’s a valid choice.)
Choosing MongoDB is proof you’re incompetent. (also in this category is node.js, and for large apps only, rails, e.g.: twitter on rails was pure incompetence.) But those three are the only litmus tests I am aware of, because most open source stuff is very good (and rails when correctly used is fine.)
That these are so popular shows how predominant bozos are in engineering departments.
Can you explain why you think there are only two alternatives: fully normalized or fully de-normalized? Why did you reject storing the Name and the Id but not the whole record for some of these links? When the client wanted to be able to show the name what prompted you to try to stuff the whole user record in there?
And why didn’t you also store every relationship in a separate collection (as just a pair of Ids or a pair of Ids and a relationship type)? That would have allowed you to have a ground-truth and to regenerate your cache at any time.
When you’re picking a data store, the most important thing to understand is where in your data — and where in its connections — the business value lies. If you don’t know yet, which is perfectly reasonable, then choose something that won’t paint you into a corner.
This is a great “argument” for just using berkeleydb for everything.
> since graph databases are too niche to be put into production
Amazon, Google, LinkedIn, and Facebook would disagree. 😉
Take a look at Neo4j, or if your data set is too large for one machine, look at the Titan database which can sit on top of HBase or Cassandra (among others) to give real distributed performance.
“Why You Should Never Use MongoDB” is because ‘some arbitrary user(s)’ who spend little time exploring the database system they were going to build their application on, found out it could not be used they way they thought it could.
Bad title. Good points though on why you shouldn’t use MongoDB (optimized for quick writes) for a social network.
Thank you for the final paragraphs though
This is an excellent post, and a very common problem. I agree, every DBMS engine is really good at some things, and not at others. It takes a lot of modeling experience to know which one to use for a given application. Relational DBMS engines still have their place, as do Document or Object databases (like MongoDb and Redis). Either way, the data relationships are there, they exist — and it’s critical to understand how you want to access them before you make your engine selection.
I do MySQL training, and I tell my attendees about non-relational databases is that the process of data modeling is approximately the same as the process of denormalization in an RDBMS. You have the same tradeoffs when you optimize for one query at the expense of other queries, the same risks of inconsistency that come with storing data redundantly, and the same tendency to reinvent relational concepts like constraints and references using application code. That creates more work for your developers, and with more risk that they’ll implement something quickly that has bugs.
I wouldn’t say you should *never* use MongoDB, any more than I say you should never use a denormalized SQL schema. Both have their advantages, and can be the only way to serve a given query load with enough performance. But both non-relational solutions and denormalized SQL schema must be designed to serve specific queries. You can’t start with a data model as you can with relational schema design. You have to know in advance the types of queries you need to run. If you introduce a new query type (as in your example of searching by clicking on an actor), you have to rethink your schema to support the new query. In that way too, non-relational databases resemble denormalized SQL schema.
In my mind there will always be one database for connecting human data together: graph databases, while they are many and esoteric each they happen to best represent the crazy structure that connected human data represents. I’ve built systems that handled billions of rows of human data and as long as I didn’t have to connect each human and their interactions together I loved having it in a relational database. Graph databases take it to a whole new level though, giving you really fast response times for really fast interconnected questions.
That said, there’s no way diaspora could have both been built and used a graph database. I know of only one company that’s tackled graph databases and they’re probably broke now. Stuff is unwieldy and stuck in java land.
As I think you know, I was fishing around for MongoDB use cases on Twitter the other day. I ran across an interesting one, not on Twitter, but in the pre-show for a Wide Teams interview.
The use case is this: saving a snapshot of orders in an ecommerce site, for posterity. The snapshot includes denormalized information about the products in the order exactly as they appeared at the time of the order.
Thoughts on this usage?
Good writeup, Sarah.
Ours is a case where we can work with the documents wholesale. We don’t need multi-table joins or anything like that, so that’s not our problem. Our issue is that each of our documents effectively have hundreds of keys that need to be indexed because we can’t predict what query users are going to generate. Mongo has a limit of 64 indices per collection. Moreover, indexing a large key results in a huge index that can’t fit in a reasonable amount of RAM, even sharded across many hosts.
This slide deck was illuminating: http://blog.mongohq.com/mongodb-scaling-to-100gb-and-beyond/ If we had seen this at the beginning of the evaluation, we might not have considered Mongo in the first place. We’re most likely going to move the evaluation to another NoSQL store or back to a relation DB.
Brendon, Map/Reduce in Mongo is extremely slow (it isn’t made for realtime), and aggregation framework is just a weird way to write SQL. Most of the request could be done with simply queries, well, you will still have to do “join” yourself.
The Diaspora team were lucky enough to spot the problem before it got too big. I’ve seen a lot of relational database projects where either (a) almost no and (b) insane amounts of time were spent developing a ‘business case’ or ‘business rules’. No amount of user testing prior to a go-live date can guarantee a happy client. In virtually every case, once the users get their heads inside the application, they can see lots of ways it could be improved, whether for ease of use or usability [the Technology Adoption Model lives on!], and they feel empowered to make suggestions. As you state in your article, what seem like trivial improvements to the user are often incredibly difficult to implement because they run directly counter to the original data model adopted all those years ago, and the flood of little fixes soon stalls as the client realises that the problems aren’t going away any time soon.
Hi Brendon, my understanding is that the MongoDB aggregation framework out-performs the MapReduce framework in most use cases, but is limited to operations against a single Collection at a time; thus, you cannot perform Aggregation operations that contain references to other documents. Map/Reduce jobs are still mostly single-threaded, but final reduce steps can be run across shards in parallel.
In general, from what I can see, these features do not have a compelling impact on the conclusions Sarah has drawn in her case study.
To be fair, all of these changes made along the way would have made your queries unwieldy and your DB re-written multiple times. Mongo saved you much more time than you gave it credit
It seems to me like the case could be made which is people forget that you don’t have to religiously normalize data in relational databases (or graph databases, which aren’t actually that niche – the better ones tend to be expensive, though). One could call that “and that’s why you need a DBA” 😀
Other than that, spot on
I remember listening to a MongoDB presentation at MySql meetup, and walking away thinking – this is a great log parser, but that’s about it. Agree that random JSON documents is what it is best at. Ended up using Amazon’s document database- at the time it was named Simple DB, for a quick Facebook app, where we needed a list of Vets listed on a virtual monument, and no expectation of traversing or filtering the table. It worked like a charm. But while I keep it in mind, still haven’t found good applications for it.
I’ve run into many of the same problems you’ve outline above; you don’t even mention the performance problems with heavy write loads starving readers if the collections you’re writing to are in the same database. We were considering expanding our use of MongoDB in the early days but have relegated MongoDB to exactly what you’re friend Conrad has done, we store only temporal JSON blobs that have 0 dependencies — MongoDB is now just a JSON cache.
I’ve recently been looking at RethinkDB wish brings the goodness of schema-less design and adds built in JOIN support and is MVCC compliant.
You’re article title is spot on, no-one should be using MongoDB.
Hey Sarah, I’m hacking on diaspora* for ~6 months now and was not aware of this story. If you have some time to write interesting stories like that on the wiki, it would be awesome!
I hope everything is going well for you 😀
We’re working on a strategy whereas we use MongoDB in conjunction with an RDBMS. Django manages the RDBMS and serialises out to MongoDB upon every .save() of the data.
Its encouraging to have something like Django South for migrating our Source of Truth – and using the flexibility of MongoDB with Meteor for Realtime UI interactions.
When using a pure document database you need to query off a different projection of your data.
Awesome write up, I have been interested in testing out Mongo because it seems like an “in” thing to do and purports great speed; but looking back at all of the projects I have worked on, I cannot see a single use case where MongoDB would have been a wise choice. Again thank you for taking the time to write this up – you are saving countless other developers out there many hours of frustration.
Reading this atrial really gave a very detailed view on mongodb and the problem that associated with it. There’s just one thing not very sure about, why is “social data isn’t relational”? Wouldn’t we always want the latest of something. Also, since we are lazy and want to write the thing once and be done with it. It seems like social data is and should relational? Just wondering, if you can to talk a bit more about this?
Thanks for the in depth run down. I’ve never used MongoDB but upon reading this I found it hard to imagine such a blatant lacking.
A few minutes of research yielded this: http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
Is that not the answer to this challenge? Did I misunderstand?
I found the title interesting, thought that it would be bring insight into MongoDB problems that we all could learn from. Unfortunately that title just seems plain wrong and doesn’t describe a MongoDB problem but rather an approach and architecture problem. Correct me if I’m wrong.
From what I read it seems you chose the wrong Database to model a graph. You correctly describe the problems of modelling a graph using a relational model. You quickly comment that graph databases are immature. I find that odd since e.g. Neo4J has been in production for more years than MongoDB.
Also, many of the problems you describe seems to be rooted in the attempt to combine read and write concerns into one model (document). Given the problems you describe a (CQRS based) solution with clear separation of read and write could have helped you a lot. Your read models need to be denormalised to allow for speed and your write models need to guarantee some degree of cosistency (and normalisation). If you made the write model Event sourced then you would have a nice way of recreating read models (like your TV series example) fast and when ever the requirements change and it would also work to signal when you need to invalidate your cache(s). It could even be used as the synchronisation mechanism between pods.