MongoDB 2.0 Released (mongodb.org)
215 points by meghan on Sept 12, 2011 | 76 comments



I think I speak for everybody here when I say that 1.8 + 0.2 = 1.10.


Sorry, but no. You seem to still be trapped in the ancient mindset that an increment in the most significant version number must mean huge changes. Instead, more often than not, it's merely a milestone signifying that all of your intended features that you outlined in the previous major version's days have been completed. So when 1.0 hits (or close to it), you define a bunch of "must-haves" for 2.0, along with all the bugs and smaller improvements that come up along the way. Now you can wait 18 months, only issuing security patches for 1.0.x, and then deliver all of those features all at once with a 2.0 party. Or as each feature is complete, you release a new minor version (1.1 for foo, 1.2 for bar, etc). Once you're able to scratch off the last of your list for 2.0, you bump the version number up and call it a day, from wherever you are along the 1.x line (be it 1.1 or 1.42). With sufficient planning, you can manage to make it so that you never have to go into 1.xx territory (with double digit minor version numbers). I'm sure I'm not alone in finding 1.10 to be aesthetically ugly. If you have to, then by all means. Avoid it if you can, but at the end of the day milestones should always be your metric for release numbers.

Now in Mongo's case they might just be incrementing the major version number for the hell of it. But I'm assuming they actually planned this out.


You seem to still be trapped in the ancient mindset that an increment in the most significant version number must mean huge changes.

Isn't it more about personal preference than an "ancient mindset"?


He's referring to this:

"Please note version 2.0 is a significant new release, but is 2.0 solely because 1.8 + 0.2 = 2.0; for example the upgrade from 1.6 to 1.8 was similar in scope."


You don't speak for everybody here, but that would seem to agree with Semver: http://semver.org/


3rded.

A major version bump should indicate major changes, e.g. that it's time to go through any code that talks to the server. If they just want a number that grows, they should do something like vRELEASE.MINOR.


Considering that some projects such as OpenBSD have a set release schedule that adds 0.1 to the version, I don't think you speak for everyone.

OpenBSD 5.0 is in its final stages before release and the 5.0 simply means it came after 4.9.


Yes indeedy. The "2.0" versioning is seriously annoying me.


I'm very confused about the math you did there.


The 0.2 implies two "point releases". Traditionally these increment the value after the decimal point in a version number.

so:

8 + 2 = 10, therefore 1.10.

It's best to view the version number as a string rather than a decimal number.

This is a somewhat old-school way of versioning and common in open source software. Personally, I prefer something more intuitive and don't see anything wrong with MongoDB being at version 2.0!
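
The arithmetic only works if you treat each component as its own integer. A tiny Python sketch of the idea (the helper name is just for illustration):

  def version_key(v):
      # "1.10" -> (1, 10), which sorts after (1, 8)
      return tuple(int(part) for part in v.split("."))

  assert 1.10 < 1.8                                 # as a decimal, 1.10 is smaller
  assert version_key("1.10") > version_key("1.8")   # as a version, 1.10 comes later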


The point of a major bump is to say to the ops team/sysadmin: "HEY! We actually changed some stuff. You probably want to check with your programmers to see if any of these changes will affect our client code."


I think if they want to keep it purely arithmetic they should drop the period and just use integers.


I think if they want to keep it purely arithmetic they should drop the period. It's MongoDB 20.


I'm interested in learning more about MongoDB so that I can use it for web apps. But some comments I've heard about it (especially here https://plus.google.com/111091089527727420853/posts/WQYLkopE...) seemed to discourage its use. Can someone explain to me why it is "criticized by academics"? And what its pros/cons are?


The biggest criticism, for a long time, was that it wasn't guaranteed consistent; a crash could leave your DB in an inconsistent state, and you were supposed to address this by writing data to multiple nodes before deciding it was okay. This is rather mitigated now with the journaling options (which are on by default).

There are lots of other criticisms, but it's a fine data store, if you're dealing with data that fits it well. The best way to think of it is as a hash store - you can store N-deep hashes in it, index pieces of those hashes so you can find whole records ("documents") quickly, and that sort of thing. You can't perform table joins (or don't have to, depending on who you're talking to), but you mitigate that by designing your data such that you get what you need for a given resource in a single query.

Personally, I'm all on board with it for web apps. I think it fits your typical web app's data requirements far more closely than a traditional RDBMS does, and I'm using it very successfully in multiple production systems.

The biggest remaining fault, in the context of web apps (IMO), is the lack of transactions - if you have an application that requires transactions to ensure proper operation, don't use Mongo. Do recognize, though, that many transactional use cases are covered by Mongo's atomic operation set.
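
Roughly what the hash-store-plus-atomic-operators idea looks like, sketched with a modern pymongo (the collection and field names here are invented, not anything MongoDB prescribes):

  from pymongo import MongoClient

  db = MongoClient()["shop"]

  # Store an N-deep "hash" (document) in one piece.
  db.orders.insert_one({
      "user": {"name": "alice", "address": {"city": "Portland"}},
      "items": [{"sku": "abc", "qty": 2}],
      "status": "new",
  })

  # Index a nested field with dot notation so lookups on it stay fast.
  db.orders.create_index("user.address.city")

  # Atomic operators ($set, $inc, $push, ...) cover many of the cases
  # where you'd otherwise reach for a transaction.
  db.orders.update_one({"status": "new"},
                       {"$set": {"status": "paid"}, "$inc": {"revision": 1}})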


> The biggest remaining fault, in the context of web apps (IMO) is the lack of transactions

While not ACID, "tension" documents are a good way to add a level of transactional support.

For example, if you want to update multiple documents, instead of modifying the original documents directly, you create a new document with instructions on how to update the original documents. Then, you roll through the tension documents and apply their changes. Any failures can be dealt with on subsequent passes, only removing the document from the scan once it has been successfully applied.

It is an eventually consistent pattern, but that is a tradeoff you have already accepted by choosing MongoDB in the first place, so it ends up working well for a lot of transaction use cases.
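
A minimal sketch of the pattern, assuming a modern pymongo and invented collection/field names:

  from pymongo import MongoClient

  db = MongoClient()["app"]

  # Record the intended change instead of touching the originals directly.
  db.tensions.insert_one({"from": "alice", "to": "bob", "amount": 5, "applied": False})

  # A periodic pass applies whatever is still pending.
  for t in db.tensions.find({"applied": False}):
      db.accounts.update_one({"_id": t["from"]}, {"$inc": {"balance": -t["amount"]}})
      db.accounts.update_one({"_id": t["to"]}, {"$inc": {"balance": t["amount"]}})
      # Only removed from the scan once applied; a real implementation would
      # also need to make each step idempotent so retries can't double-apply.
      db.tensions.update_one({"_id": t["_id"]}, {"$set": {"applied": True}})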


It's actually possible to be fully consistent if you include the "tension" documents when fetching the main docs.

For example, in the canonical transaction example of Alice giving Bob $5, you could insert a document in the transactions collection that says Alice gave Bob $5. When you go to fetch Alice's pending balance, you fetch her current document, then you fetch all pending transactions that she is involved in. Same for Bob and all other users. Then each night, you pull that day's transactions and apply them to each user and update an "as of" field to make sure no transaction can ever be applied twice.

Now you could argue that this isn't fully consistent because it is possible for Alice to simultaneously give $5 to both Bob and Charlie even if she only has $7 in her account. However, she will then have a pending balance of $-3, so it immediately reflects her current situation. This is similar to the real-world example of overdrawing your checking account. If you wanted to, you could void one or both of those transactions if you never want to allow a user to go negative.
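
The pending-balance read described above, sketched in Python (pymongo; collection and field names are invented):

  def pending_balance(db, user_id):
      balance = db.users.find_one({"_id": user_id})["balance"]
      # Fold in transactions that haven't been rolled into the user doc yet.
      pending = db.transactions.find({"applied": False,
                                      "$or": [{"from": user_id}, {"to": user_id}]})
      for txn in pending:
          if txn["from"] == user_id:
              balance -= txn["amount"]
          if txn["to"] == user_id:
              balance += txn["amount"]
      return balance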


The system you describe, recording events and replaying events to get the actual state of the system, usually assumes that you can only write one event at a time (writes need to be super cheap). If you can write multiple events in parallel, you won't have isolation and you may read tension documents that will be voided (in your example).

You could use optimistic concurrency control to ensure that only one tension document is written at a time while ensuring that your constraints are respected:

  current_id = atomically increment global state_id
  create tension document (with its active bit set to false)
  check constraints
  atomically:
      if global state_id == current_id:
          set active bit to true
  if active bit is false:
      delete tension document and report error
This would provide isolation (you don't see tension documents that are not yet validated) and consistency (your balance is respected).

The atomic operation in RDBMS is usually implemented in one very simple and fast SQL UPDATE query. I believe mongo must provide something similar too.
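
For a single document it does: the query part of an update doubles as the constraint check, and the match plus modification happen atomically. A sketch (pymongo; field names invented):

  from pymongo import MongoClient

  db = MongoClient()["app"]
  result = db.accounts.update_one(
      {"_id": "alice", "balance": {"$gte": 5}},   # only if she can afford it
      {"$inc": {"balance": -5}},
  )
  if result.modified_count == 0:
      print("insufficient funds (or no such account)")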

Although my bank would allow me to overdraw from my checking account, I have a higher margin/tolerance than my brother, so they must perform some constraint checks ;-)


This is roughly what MVCC does under the covers. Of course, you don't have to implement it, the database just does it for you, and it's not eventually consistent.


That's a really clever idea. I'll have to give that a shot - I wonder how hard it'd be to wrap up that pattern for use in the popular ODMs?



Anecdotally: Mongo is great for developing small projects. It's fun to work with since it's got no defined structure, and will just save whatever you give it wherever you say to.

That flexibility comes back to haunt you in larger projects. A typo in a query can trigger a reindex of a huge collection, render GBs of index data useless, or invalidate other queries that work on the same collection. It also demands that the devs document the data structures thoroughly, since you can't ask mongo to tell you its schema; it's just a collection of documents.

And if you work with a loosely typed language, be sure to type your inserts, as mongo is not a loosely typed datastore (coughPHPcough).
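
A sketch of the kind of surprise meant here (pymongo; names invented): Mongo stores whatever type you hand it, and queries match on type as well as value.

  from pymongo import MongoClient

  db = MongoClient()["app"]
  db.events.insert_one({"count": "5"})       # a string sneaks in, e.g. from a form
  print(db.events.find_one({"count": 5}))    # None: the int 5 doesn't match "5"
  print(db.events.find_one({"count": "5"}))  # finds the document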

But for internal tools, and smaller projects, the speed of development is great.


Some ODMs like Mongoid have a strict mode that requires a field to be mapped in order to save it to the db. This saves you from typos wreaking havoc. You should probably unit test your queries anyway.


Yeah. Mongoose on Node requires you to code and instantiate a document schema before you can do anything with the database.

Gotta sacrifice some flexibility for a bit of safety.


yep, either use an ODM (or some system that guarantees correctness) or make sure your raw queries are well tested.


... and if that wasn't bad enough, the global lock and mmap'd I/O haunt you later when you have any sort of concurrency or data > RAM, or the first time you have to take the site down to perform a compaction. Can you tell I've been burned?


If you search for a second you'll find bad things about every DB out there; my take is that it's better to pick a product that seems designed for your use case and try it first hand. Stress it, socialize with it, and in a few days it will be clear whether it's worth investing more time or not. After all, for every bad blog post about Mongo, Redis, Cassandra, and so forth, there are many good ones as well, so you can treat these blog posts as personal experiences worth ten minutes of your time, but probably not important enough to change your mind about trying it first hand.


I have no idea how anyone who's ever read the mongodb source code can entrust it with their data.


Examples?


A few months ago, I saw an online interactive 'trainer' of sorts for MongoDB, much like Codecademy's JS tutorials. Does anyone know what I'm talking about?

EDIT: Damn Google, you good: http://www.mongly.com/


It was probably either http://try.mongodb.org or http://mongly.com/


A bump like that should be for backward-incompatible changes, but it actually looks like a makeover to appeal more to serious biz customers that would have trouble getting on to a 1.0 technology.


I'm curious as to why MongoDB is so popular. I chose Riak, and my purpose here isn't to bash Mongo, but to understand: what was the key feature that made you choose it?

Is it pure speed on a single machine? Is it the query interface? Something else? Datacenter awareness? Geographic support?

The key features that made me choose riak:

- built-in distribution/clustering with homogenous nodes

- bitcask had the level of reliability/design that I was looking for

- better impedance match for what I was doing than CouchDB (which was what I looked at before choosing riak, but couch does view generation when data is added and I need to be able to do it more dynamically)

What sold you on Mongo? What would you most like to improve?

(Please don't let this be a debate, I'm more interested in understanding the NoSQL market, what other developers priorities are, etc.)


I struggle choosing. I started both at the same time and keep going back and forth.

Riak

  -> I love the fault tolerance
  -> easier to scale with a dreamworld of hash-rings I wish I could have.
  -> no indexes
  -> no Geo-spatial indexes
  -> I don't like link walking... it's not intuitive.
  -> I want to use Riak's (Lucene/Solr-like) full text search, but the docs are poor and you have to read dozens of useless documents to get nowhere very quickly. It's a separate install that requires pre-commit hooks? (which are never explained well enough to start using it).
  -> installing on Mac OS X is painful (homebrew works or doesn't depending on the version; 32-bit or 64? why do I need 32-bit flags with older repos, yet newer ones are 64-bit only and still don't compile... errors, errors, errors... which Erlang?). Such a pain in the ass installing Riak unless you do it with specific build tools.
  -> Only Joyent provides a hosted solution, which is too expensive relative to its offering.
  -> accepts any doc, just set the content type; handles original documents without conversions
  -> easy to load balance behind NGINX
  -> can choose between conflicting writes 
MongoDB

  -> straightforward install; up and running in no time
  -> SQL-like queries
  -> intuitive indexes
  -> Geo-spatial indexes
  -> Affordable hosted services with more than one provider.
  -> no full text searching over documents
  -> have to fiddle with doc handling, GridFS, etc.
  -> can't choose between conflicting writes  (last one wins)
I wish I could have "RiaMong"

  -> straightforward install; up and running in no time
  -> fault tolerance
  -> easier to scale with a dreamworld of hash-rings.
  -> indexes
  -> Geo-spatial indexes
  -> Affordable hosted services with more than one provider.
  -> full text searching for documents
  -> handle original documents without conversions
  -> can choose between conflicting writes 
Oh well maybe someday :)


I work at Basho. Specifically I've been doing a lot of work on Riak Search lately. I absolutely agree that the Search documentation is underwhelming and could use some major love. I take it as a personal challenge to improve the situation before the 1.0 release comes out.

That aside, you might like to know that for the 1.0 release Search is bundled along with the standard Riak package (you just need to set a flag to enable it). You can try using the latest pre-release or even build from source on master or the 1.0 branch. If you don't actually need full-text search then you could also check out Riak's new support for secondary indexes. Please drop a line on the mailing list or IRC if you have questions.


Great list. I think the reason Mongo has become so popular is the line it straddles between SQL and NoSQL, specifically the query language. You feel right at home.

CouchDB is another DB I like quite a bit, but wrapping your head around writing map reduce views in order to index and execute queries is A LOT different from a select.


Thank you very much for your answer, especially since you compared the two. That helps me understand even better.

Just wanted to let you know Basho is addressing some of your wishes. Riak 1.0, which is coming out at the end of this month, has integrated riak search (so it is no longer a separate install) and also has integrated secondary indexes (on numerical values, so not full text search, but integrated nicely). I believe it may be possible to do geo-spatial searches using the new index system, but I've only thought about it, haven't tried it yet. (It's not built in, but I think it's something one could build, and I'd like to build it myself eventually.) They now have binary builds for installing on Mac OS X (IIRC), and I've been able to install via homebrew and from source lately, so they might have fixed that. They've also got a new thing called riak_pipe which is really useful for certain classes of problems (it's new, so not well documented yet).

Anyway, thanks again for your answer, and just wanted to let you know Basho seems to be addressing the issues you ran into.


Things we're using it for here at http://mailgun.net :

* Fast asynchronous writes for non-critical data (logs).

* Replica sets allow for zero-downtime hot failover to a backup server when the primary is taken down or dies.

* Simple queries are crazy fast, assuming your index fits in RAM.

* The find_and_modify() feature is awesome for some tasks (see the sketch below).
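
A sketch of one common find_and_modify() use (with a modern pymongo it's spelled find_one_and_update; the collection and field names here are invented): atomically claim the next pending task so two workers can't grab the same one.

  from pymongo import MongoClient, ReturnDocument

  db = MongoClient()["mail"]
  task = db.tasks.find_one_and_update(
      {"status": "pending"},
      {"$set": {"status": "in_progress"}},
      return_document=ReturnDocument.AFTER,   # hand back the claimed document
  )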

But we're supplementing Mongo with PostgreSQL because of the following Mongo weaknesses:

* Concurrency is so-so: the entire database is locked during writes.

* Weak aggregation capabilities.

* Greatly reduced performance when the index doesn't fit in RAM.

* Wasteful when it comes to allocating space (on disk and in RAM).


With one hand you giveth: "Fast asynchronous writes", and with the other taketh away: "Concurrency is so-so: the entire database is locked during writes."


It's like unsuspecting developers are walking down Developer Lane and see a candy bar sitting there that says "FAST, ASYNCHRONOUS WRITES." So they pick it up, bite into it, and then the candy bar grows a giant mutant global lock head that eats their face.


Actually I don't believe the two are related. And they're working on the concurrency improvements all the time: https://jira.mongodb.org/browse/SERVER-1240


Can you explain how they're not related? I'm not poking you to start a fight, I'm genuinely curious.


My current wait % on the global lock is 0.02%. Deletes and updates will yield. Inserts won't, but inserts happen in memory, which is fast. I think disk saturation on long sync delays is really the only major pain point now (and even that seems better than it used to be).


The problem is: "a pending write lock acquisition will prevent further read lock acquisitions until fulfilled." Inserts aren't actually guaranteed to be in memory, as they're done in a file-backed memory region. A write that page faults can, and frequently does, cause a flush and a hang while that lock is held, particularly as the hot dataset grows relative to physical RAM.


Agreed, I don't understand how writes can both be fast and asynchronous, and also lock the whole database. Does it depend on what kind of write you're doing?


The notion of memory-mapped files (mmap on POSIX) is central to how MongoDB works. It doesn't really manage its own writes or buffers. They just update data structures in memory and let the kernel figure out how to map that to disk. However, IIRC they're forcing it to write dirty pages every 10 seconds by default.

Thus I don't see how that's related to locking. My understanding is that the reasons they still use a global lock are:

a) They're young and haven't gotten around to implementing this yet.

b) Since writes are fast, this hasn't been as big of an issue as people might expect (it was for us, though)

And everything works as expected as long as you're fine with your writes being asynchronous. But once you call getLastError() after every insert to make sure your writes went through, you start seeing the global lock bite.
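
From the driver's side, that's the difference between fire-and-forget and acknowledged writes. A sketch with a modern pymongo, where the w=1 write concern plays the role the explicit getLastError() call played back then (names invented):

  from pymongo import MongoClient

  db_fast = MongoClient(w=0)["app"]   # fire-and-forget: the "fast asynchronous" path
  db_safe = MongoClient(w=1)["app"]   # wait for the server to acknowledge each write

  db_safe.logs.insert_one({"msg": "hello"})  # returns only once the write is confirmed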


The global lock is something that will always exist with their architecture. Removing it isn't a feature to implement; it's a total redesign and rethinking of the fundamentals. Their much-vaunted in-place, memory-mapped I/O architecture becomes useless, because real concurrency means using lock-free data structures. MyISAM in MySQL locks the entire table because it does in-place updates. Row locks can't just be added to MyISAM; it'd be such a redesign to add them that it'd end up being something much more like InnoDB than MyISAM. 10gen would have to eat an awful lot of their own words if they decided to go that route.


Is the replica set failover really zero-downtime for you? In our usage, it seems like it takes a few seconds to elect a new primary during which time no writes can occur.


Replication is trivial when you're locking the entire dataset on writes.


MongoDB is a surprisingly short hop from SQL databases in a lot of ways. It makes retraining your already-SQL-competent developers much easier than with many other datastores.


This is my main reason for using it as well. For me, MongoDB is less about any of the NoSQL features and more about the fact that I can be more productive with it. Pair it with a good library like MongoMapper, and persisting your data is trivial.


I was up and running with MongoDB in minutes. I tried Riak a while ago and the "getting started" experience was very poor IMO.

Also, I really like the MongoDB docs.

Can you point me to a great getting started page for Riak (setting it up, loading in data, doing advanced queries, Python/Ruby libraries etc)?


http://wiki.basho.com/The-Riak-Fast-Track.html

It is being updated now for Riak 1.0.


I haven't tried Riak but I will confirm that MongoDB was really easy to get started with. It's performant enough that I don't really need to look at anything else right now.


I chose Mongo because I wanted to use mapreduce and I had never heard of any alternatives. I'm not particularly proud of that reasoning, but it has worked out well for me (log analysis: I'm uploading files to GridFS, processing them into line documents, and mapreducing them). I am willing to bet the majority of users are in the same boat.
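
For reference, the map/reduce call itself is compact. A sketch assuming an older pymongo that still exposes Collection.map_reduce (it was removed in 4.0); collection and field names are invented:

  from bson.code import Code
  from pymongo import MongoClient

  db = MongoClient()["logs"]

  mapper = Code("function () { emit(this.status, 1); }")
  reducer = Code("function (key, values) { return Array.sum(values); }")

  # Counts line documents per status into the 'status_counts' collection.
  db.lines.map_reduce(mapper, reducer, "status_counts")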


For me:

- Well supported and very active community. It's a project that is clearly going to be moving forward for a long time.

- For my needs (structured data) it's super fast.

- The auto-sharding support is nifty.

- Geo-spatial queries out of the box


One of the draws to Mongo for me is that there are already existing cloud-managed service providers (MongoLab, MongoHQ) that hook in to Heroku, which is my choice for deployment. I haven't found any real cloud-managed services for Riak except Joyent's smart machines (which are expensive for my small project budget). The other reason was Mongoose, a great ODM for Node.JS that makes my life a lot easier.

Of course if anyone has managed to hook into a Riak provider on Heroku I'd love to be proved wrong.

And yes, I could always spin up an EC2 instance and manage a Riak cluster myself, but my time is constrained and I'd rather focus on getting my app out the door than sysadmin tasks.


I actually started in Riak before Mongo, and what made the decision for me was...not having to install Erlang. I realize that's not really a compelling argument, but when I was evaluating alternate DBs, the one that gets up and running the fastest is likely to get a little more attention from me. Once I got spun up on Mongo, I didn't find a compelling reason to switch away from it.

"Download and unpack the package, run the binary, and you can instantly start playing with the examples from the console" is a hell of a good way to get people interested in your software.


Ease of first time developer installation should never be the deciding factor in these decisions. I realize it is pretty rampant, but it's not something we should be proud of.


Yeah but how long should I spend on software that I know nothing about the quality of if it's not easy to get up and running? Do you spend a day? two days?

While it's mostly just from experience, I find that virtually no one makes a decision based on how easy it is to get up and running with a piece of software, but it is an impression that sticks in people's minds and gets repeated when they talk about it later.


I completely agree, and I'm not saying that MongoDB is better because it has an easier install. The question was just "Why did you go to MongoDB over Riak?", and I did because I was able to start evaluating it faster, and discovered that it met my needs handily, removing the need to wrestle with my Erlang install any further.


    wget http://downloads.basho.com/riak/riak-0.14/riak_0.14.2-1_amd64.deb
    dpkg -i riak_0.14.2-1_amd64.deb
Our production configuration and setup/deploy recipes for riak are 241 lines in total, ~150 of which are the stock config files.


This would have been something like 18 months ago, and was on an older Fedora box. It wasn't quite so trivial as that two-liner.


We are currently building some core infrastructure with MongoDB. I looked at Riak, but it was capped collections and ad-hoc querying that really nailed it for us. Riak is a great product, and I loved its replication mechanism where any node was writable, but we didn't require that, so we went with mongo.

We had a 3-node system across a WAN, where local clients would read from a local node and, if they needed to write, would write to the master across the WAN. The issue was that a significant amount of data being written to the master was of no interest to one or two of the nodes most of the time, but it was still being replicated across the WAN (at a cost).

So I investigated how the replication mechanism worked, to see if I could gain greater control over what data was replicated. As it turns out, you can't, but you can emulate the replication yourself if you're interested. This is how we did it:

- the master node has a capped collection (we call it db.messages)

- the slave nodes run a bit of mongo console javascript that executes a tailable cursor query for messages being inserted

- when a message that matches the query is inserted into the messages collection, it gets inserted into a local db.files collection, which clients then read from
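
A rough Python equivalent of that slave-side loop (pymongo; the host, filter, and collection names are invented, and the original used shell JavaScript rather than Python):

  import time
  from pymongo import MongoClient
  from pymongo.cursor import CursorType

  master = MongoClient("master-host")["app"]
  local = MongoClient()["app"]

  # Tail the capped collection and copy matching messages for local clients.
  cursor = master.messages.find({"dest": "node-2"},
                                cursor_type=CursorType.TAILABLE_AWAIT)
  while cursor.alive:
      for msg in cursor:
          local.files.insert_one(msg)
      time.sleep(1)   # nothing new yet; poll again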

The added bonus is that occasionally we do need to replicate additional data to particular nodes, so we just craft a tailable cursor query that finds the messages we want and pulls them across the wire.

we make a fair bit of use of the ad-hoc querying in mongodb, so that was a massive selling point for us

speed was also a major issue - we have an average-write/very-high-read requirement, and it's really, really fast.

finally, the platform was a consideration. We're mostly a windows shop, but we'll use linux where needed, and we were prepared to have riak running on linux if that's what we thought was best, but it just didn't really fit for us.

the only downside was we use Delphi, and the existing Delphi drivers were.. not good, so we wrote our own, which I'm trying to negotiate with my boss so we can push onto github.

good luck!


"speed was also a major issue - we have a average write/very high read requirement, and it's really really fast."

"We're mostly a windows shop"

"we use Delphi"


I missed the point you're trying to make.


He probably never used Delphi and thinks it's slow.

I've worked with Delphi for many years, and it's not far from the speed you can achieve with C++. Delphi is one hell of an underrated language for native programming.

Also, Delphi's standard library, the VCL, is great, and you can learn so much reading the source code, which comes with the IDE. Or does it? Delphi 7 was the last version I used.


Actually, I have used Delphi, wrote tens, maybe hundreds of thousands of lines of it early on in my programming days, many years ago.

Yes, Delphi, the language, which is static and strongly typed, uses reference counting for memory management, and compiles natively ahead of time, can achieve good performance. That's a shocker. The problem is this isn't what it's intended for, and the standard kit is hardly optimized for it.


Reference counting? Except for some basic types, like the string, you have to manually manage memory.

And the other point you are trying to make, I still don't get it. You imply that the guy is doing it wrong for using Delphi where speed is important, and you come here and say that Delphi is, indeed, a fast language.

What if speed is not what the language is intended for? If the language is intended for other purpose, and speed comes as a bonus, I don't know how that could be a bad thing.


Yeah, unfortunately the actual Delphi compiler is stuck in stasis and has received no love in terms of targeting newer CPU instructions like SSE etc., which is why C++ can achieve better performance these days. Back in the day, Delphi was as fast as C++ for general performance, and Delphi absolutely killed it in terms of productivity.

We started out as a TP shop, so there is code dating back to the 1980s that is still used today (=\)

The rumor is Borland/CodeGear/Embarcadero lost the source code to the Delphi compiler, which is the reason there have been no additions to the compiler, but who knows if that's legitimate or should be an entry on snopes.


strings are reference counted

dynamic arrays are reference counted

interfaces are reference counted (mostly for doing com stuff, however I've seen some examples of people trying to use interfaces as a cheap/dirty "i don't have to think about it" memory management tool, badly)

everything else is managed by yours truly

please enlighten us as to what you think Delphi is intended for, and what the standard kit is actually optimised for


Reference counted memory management always involves some level of manual memory management. That's why garbage collection exists, because it's not really possible to get full coverage with ref-counting. Thanks for the education lesson though.

Personally, I think Delphi is for cutters.


Yeah, the RTL source/debug dcus still come with any version from Professional up.


Well said, sir.


I missed the point you are trying to make.


Quite simply: MongoDB is the most SQL-like of all the NoSQL databases out there, i.e. the shortest transition for developers used to SQL databases (MySQL in particular) who need a bit more flexibility.



