I worked at DynamoDB for over 2 years. I can safely say that the team members working on DynamoDB are very skilled and they care deeply about the product. They work hard and come up with interesting solutions to a lot of the problems their biggest customers face, which is great from a product standpoint. There are some pretty smart people working there.
Engineering, however, was a disaster story. Code is horribly written and very few tests are maintained to make sure deployments go without issues. There was too much emphasis on deployment and getting fixes/features out over making sure it won't break anything else. It was a common scenario to release a new feature and put duct tape all around it to make sure it "works". And way too many operational issues. There are a lot of ways to break DynamoDB :)
Overall, though, the product is very solid and it's one of the few databases that you can say "just works" when it comes to scalability and reliability (as most AWS services are)
>Engineering, however, was a disaster story. Code is horribly written and very few tests are maintained to make sure deployments go without issues. There was too much emphasis on deployment and getting fixes/features out over making sure it won't break anything else. It was a common scenario to release a new feature and put duct tape all around it to make sure it "works". And way too many operational issues. There are a lot of ways to break DynamoDB :)
>Overall, though, the product is very solid and it's one of the few databases that you can say "just works" when it comes to scalability and reliability (as most AWS services are)
You throw bodies at it. A small bunch of people will be overworked, stressed, constantly fighting fires and struggling to fight technical debt, implement features, and keep the thing afloat. Production is always a hair away from falling over, but luck and grit keep it running. To the team it's a nightmare; to the business everything is fine.
Yea. I can probably move to a more chill team, but I wouldn't work on anything nearly as cutting edge. I mentally check out for weeks at a time, then get back into it and deliver something large. I'm low key job hunting, but don't entirely trust that it'll be different anywhere else (previous jobs were like this too)
if you want to know why capitalism causes this, start a startup and prioritize quality, do not get to market, do not raise money, do not pass go, watch dumpster fires with millions of betrayed and angry users raise their series d
They both likely have solid 80% solutions (design) and incrementally cover the 20% gap as need arises. This in turn adds to operational complexity.
The alternative would be to attempt a near-'perfect' solution for the product requirements, and that may either hit an impossibility wall or require substantial long-term effort that would impede product development cycles. So the former approach is likely the smarter choice.
Customers care about the outcome, not the internal process. Besides, I’ve never worked at any sizable company in my 20+ year career where I didn’t conclude, “it’s a miracle this garbage works at all.”
Enjoy the sausage, but if you have a weak stomach, don’t watch how it’s made.
(I work for AWS but not on the DynamoDB team and I have no first-hand knowledge of the above claim. Opinions are my own and not those of my employer.)
> Customers care about the outcome, not the internal process.
This is true though there's only so much technical debt and internal process chaos you can create before it affects the outcome. It's a leading indicator, so by the time customers are feeling that pain you've got a lot of work in front of you before you can turn it around, if at all, and customers are not going to be happy for that duration.
Technical debt is not something to completely defeat or completely ignore, instead it's a tradeoff to manage.
One concrete problem with technical debt the article highlights is that it negatively impacts the time to deliver new features. Customers today usually expect not only a great initial feature set from a product, but also a steady stream of improvements and growth, along with responsiveness to feedback and pain points.
> Customers care about the outcome, not the internal process
Additionally, the business cares about the outcome, not the internal process.
Ostensibly, the business should care about process but it actually doesn't matter as long as the product is just good enough to obtain/retain customers, and the people spending the money (managers) aren't incentivized to make costs any lower than previously promised (status quo).
Just curious, why do you mention you work at AWS if you're just disclaiming that fact in the next sentence? Besides, nothing you stated is specific to AWS or any of its products.
I don't work at Amazon, but our company's social media policy requires us to be transparent about a possible conflict of interest when speaking about things "close to" our company/our position in the industry and also need to be clear about whether we're speaking in an official capacity or in a personal capacity.
This is designed to reduce the chances of eager employees going out and astro-turfing or otherwise acting in trust-damaging ways while thinking they're "helping".
If the developers are happy about the code and testing quality of a project, then you waited too long to ship.
If the customers don't have any feedback or missed feature asks at launch, you waited too long to ship.
You know who has great internal code and test quality? Google. Which is why Google doesn't ship. They're a wealth distribution charity for talented engineers. And their competitive advantage is that they lure talented people away from other companies where they might actually ship something and compete with Google, to instead park them, distract them with toys, beer kegs, readability reviews, and monorepo upgrades.
To me the takeaway is that large/interesting/challenging engineering projects are generally pretty close to disasters. Sometimes they actually do become disasters.
On the other hand, if a project looks straight-up designed, neatly put into JIRA stories, and developers deliver code consistently week after week, then it may be a successfully planned and delivered project. But it would mostly be doing stuff that has already been done many times over, and likely by the same people on the team.
At least this has been my experience while working on standardized / templated projects vs something new.
Challenging the cutting edge of your product domain is what I get from this. Easy things are easy and predictable. Hard things and unpredictable, evolving requirements are a tension against the initial system design, which is the foundation of your code base. The larger projects get over time, the further they perhaps deviate from the original design. If you could predict it all up front, in many cases it's not all that interesting or challenging of a problem. Duct tape is fine to use as long as you understand when you've gone too far and might want to re-design from scratch based on prior learnings.
On the other hand, if you don't go solder things and remove the duct tape from time to time, you will always come closer to a disaster, never further away.
Some projects are run like the Doomsday Clock, and nobody can get anything done. Other ones increase and decrease on complexity all the time, and those tend to catch-up to the first set quite quickly.
I worked at a company who re-implemented the entire Dynamo paper and API, and it was exactly the same story. Completely eliminated all my illusions about the supposed superiority of distributed systems. It was a mound of tires held together with duct tape, with a tiki torch in each tire.
They did have 100 million to burn, but my mostly-wild-guess is it was closer to $1.5M/yr. But that gives you an in-house SaaS DB used across a hundred other teams/products/services, so it actually saved money (and nothing else matched its performance/CAP/functionality).
Cassandra is too opinionated and its CAP behavior wasn't great for a service like this, so they built on top of Riak. (This also eliminated any thoughts I had about Erlang being some uber-language for distributed systems, as there were (are?) tons of bugs and missing edge cases in Riak)
Erlang gives you great primitives for building reliable protocols, but they're just primitives, and there are tons of footguns since building protocols is hard.
Because Riak uses vector clocks instead of cell timestamps? Cassandra's ONE/QUORUM/ALL consistency levels otherwise allow tuning for tolerance of CP vs AP, don't they?
To be honest I don't know, I wasn't there for the initial decision, but I know it wasn't just about CAP. It could have been as simple as Riak was easier to use (which I don't know either)
Not Invented Here can run very deep in some branches of an organization. Depending on how engineering performance evaluations work, writing a homebrew database could totally be something that aligns with the company incentives. It might not make a single bit of sense from a business standpoint but hey, if the company rewards such behavior don't be surprised when engineers flush millions down the tube "innovating" a brand new wheel.
It's a shame they don't open source it. It's funny too: being AWS, they really don't have to worry about AWS undercutting them with a cheaper hosted service, so at that point why not open source it?
If you’ve read their paper, there is a lot of detail in it to create your own. Of course they haven’t given out the code but the paper is a pretty solid design document.
I kinda doubt it. It's probably just that open sourcing it won't provide much utility (I bet lots of code is aws specific) and just adds a new maintenance burden for them.
The way you need to write code for a massively scalable service is just different. And the things you need to operate a service are also just different.
So they just rolled out global replication, and I can't for the life of me figure out how they resolve write conflicts without cell timestamps or any other obvious CRDT measures.
Questions were handwaved away with the usual Amazon black-box non-answers, which always smell like they are hiding problems.
Any ideas how this is working? It seems bolt-on and not well thought out, and I doubt they'll ever pay for Aphyr to put it through his torture tests.
Any changes made to any item in any replica table are replicated to all the other replicas within the same global table. In a global table, a newly written item is usually propagated to all replica tables within a second. With a global table, each replica table stores the same set of data items. DynamoDB does not support partial replication of only some of the items. If applications update the same item in different Regions at about the same time, conflicts can arise. To help ensure eventual consistency, DynamoDB global tables use a last-writer-wins reconciliation between concurrent updates, in which DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all replicas agree on the latest update and converge toward a state in which they all have identical data.
Honestly your expectations are too high. Conflict resolution is row-level last-write-wins. It's not a globally distributed database, it's just a pile of regional DynamoDB tables duct taped together... They're not going to hire Aphyr for testing because there's nothing for him to test.
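For anyone wanting the semantics spelled out: here's a minimal sketch (plain Python, made-up field names like write_ts/region) of what last-writer-wins reconciliation looks like. The actual replication internals aren't public, so this only illustrates the behavior described in the docs quote above, not AWS's implementation:

    # Illustration only: made-up item shape with a write timestamp and origin
    # region; DynamoDB's real replication internals are not documented.
    def resolve_last_writer_wins(version_a: dict, version_b: dict) -> dict:
        """Pick the version with the later write; break ties deterministically
        (here by region name) so every replica converges on the same winner."""
        key_a = (version_a["write_ts"], version_a["region"])
        key_b = (version_b["write_ts"], version_b["region"])
        return version_a if key_a >= key_b else version_b

    us = {"pk": "user#1", "name": "Alice",  "write_ts": 1618033988.7, "region": "us-east-1"}
    eu = {"pk": "user#1", "name": "Alicia", "write_ts": 1618033988.9, "region": "eu-west-1"}
    assert resolve_last_writer_wins(us, eu)["name"] == "Alicia"  # later write wins everywhere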
I've seen this kind of thing mentioned many times, which is pretty baffling TBH given Dynamo's pretty good reputation in industry. Are these issues mostly in the stateless components of the product, or do they see data loss?
I can't say more, at the risk of violating some NDA, but a lot of it is internal stuff that customers will never even be aware of, or it would require too much effort for them to break.
There have been times when bad deployments happened and customers were impacted.
I've worked at 5 different tech companies now - this is par for the course. And every single one wished it could go back and do it again, but by that point the product was too successful, so they ran with it.
We're at early stages of planning an architecture where we offload pre-rendered JSON views of PostgreSQL onto a key value store optimised for read only high volume. Considering DynamoDB, S3, Elastic, etc. (We'll probably start without the pre-render bit, or store it in PostgreSQL until it becomes a problem).
When looking at DynamoDB I noticed that there was a surprising amount of discussion around the requirement for provisioning, considering node read/write ratios, data characteristics, etc. Basically, worrying about all the stuff you'd have to worry about with a traditional database.
To be honest, I'd hoped that it could be a bit more 'magic', like S3, and that AWS would take care of provisioning, scaling, sharding, etc. But it seemed, disappointingly, that you'd have to proactively worry about operations and provisioning.
Is that sense correct? Is the dream of a self-managing, fire-and-forget key value database completely naive?
Your example really summarizes the challenge with the AWS paradigm: namely that they want you to believe that the thing to do is to spread the backend of your application across a large number of distinct data systems. No one uses DynamoDB alone: they bolt it onto Postgres after realizing they have availability or scale needs beyond what a relational database can do, then they bolt on Elasticsearch to enable querying, and then they bolt on Redis to make the disjointed backend feel fast. And I'm just talking operational use cases; ignoring analytics here. Honestly it doesn't need to be these particular technologies, but this is the general phenomenon you see in so many companies that adopt a relational database, key/value store (could be Cassandra instead of DynamoDB, e.g. what Netflix does), a search engine, and a caching layer, because they think that that's the only option.
This inherently leads to a complexity debt explosion, fragmentation in the experience, and an operationally brittle posture that becomes very difficult to dig out of (this is probably why AWS loves the paradigm).
Almost every single team at Amazon that I can think of off the top of my head uses DynamoDB (or DDB + S3) as its sole data store. I know that there are teams out there using relational DBs as well (especially in analytics), but in my day-to-day working with a constantly changing variety of teams that run customer-facing apps, I haven't seen RDS/Redis/etc being used in months.
The thing about Amazon is that it is massive. In my neck of the woods, I've got the complete opposite experience. So many teams have the exact DDB-induced infrastructure sprawl described by the GP (e.g. supplemental RDBMS, Elastic, caching layers, etc.).
Which says nothing of DDB. It's a god-tier tool if what you need matches what it's selling. However, I see too many teams reach for it by default without doing any actual analysis (including young me!), thus leading to the "oh shit, how will we...?" soup of ad-hoc supporting infra. Big machines look great on the promo doc tho. So, I don't expect it to stop.
> they bolt it onto Postgres after realizing they have availability or scale needs beyond what a relational database can do, then they bolt on Elasticsearch to enable querying, and then they bolt on Redis to make the disjointed backend feel fast.
This made my head explode. Why would you explicitly join two systems made to solve different issues together? This sounds rather like a lack of architectural vision. Postgres requires essentially zero up-front access-pattern design, which inherently clashes with DynamoDB's model; the same goes for the Elasticsearch scenario: DynamoDB was not made to query everything, it's made to query specifically what you designed to be queried and nothing else. Redis sort of makes sense to gain a bit of speed for some particular access, but you still lack collection-level querying with it.
In my experience, leave DynamoDB alone and it will work great. Automatic scaling is cheaper eventually if you've done your homework about knowing your traffic.
> In my experience, leave DynamoDB alone and it will work great.
My experience agrees with yours and I'm likewise puzzled by the grandparent comment. But just a shout out to DAX (DynamoDB Accelerator), which makes it scale through the roof:
Judging a consistency model as "terrible" implies that it does not fit any use case and therefore is objectively bad.
On the contrary, there are plenty of use cases where "eventually consistent writes" is the perfect fit. To see this, you only have to look at the fact that every major database server offers it as an option - just one example:
I think the main advantage of DDB is being serverless. Adding a server-based layer on top of it doesn't make sense to me.
I have a theory it would be better to have multiple table-replicas for read access. At application level, you randomize access to those tables according to your read scale needs.
Use main table streams and lambda to keep replicas in sync.
Depending on your traffic, this might end up more expensive than DAX, but you remain fully serverless, using the exact same technology model, and have control over the consistency model.
Haven't had the chance to test this in practice, though.
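For what it's worth, the stream-to-replica fan-out described above is only a few lines of Lambda. A rough sketch with boto3, assuming the stream is configured with the NEW_IMAGE view type, the replica table names are hypothetical, and error handling/retries are omitted:

    import boto3
    from boto3.dynamodb.types import TypeDeserializer

    # Hypothetical replica tables kept in sync from the main table's stream.
    REPLICAS = [boto3.resource("dynamodb").Table(name)
                for name in ("my-table-replica-1", "my-table-replica-2")]
    deserialize = TypeDeserializer().deserialize

    def handler(event, context):
        """Lambda handler for the DynamoDB stream: mirror each change to every replica."""
        for record in event["Records"]:
            keys = {k: deserialize(v) for k, v in record["dynamodb"]["Keys"].items()}
            for table in REPLICAS:
                if record["eventName"] == "REMOVE":
                    table.delete_item(Key=keys)
                else:  # INSERT or MODIFY: copy the whole new image
                    image = record["dynamodb"]["NewImage"]
                    table.put_item(Item={k: deserialize(v) for k, v in image.items()})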
I am working with a company that is redesigning an enterprise transactional system, currently backed by an Oracle database with 3000 tables. It’s B2B so loads are predictable and are expected to grow no more than 10% per year.
They want to use DynamoDB as their primary data store, with Postgres for edge cases. It seems to me the opposite would be more beneficial.
At what point does DynamoDB become a better choice than Postgres? I know that at certain scales Postgres breaks down, but what are those thresholds?
You can make Postgres scale, but there is an operational cost to it. DynamoDB does that for you out of the box. (So does Aurora, to be honest, but there is also an overhead to setting up an Aurora cluster to the needs of your business.)
I've found also that in Postgres the query performance does not keep up with bursts of traffic -- you need to overprovision your db servers to cope with the highest traffic days. DynamoDB, in contrast, scales instantly. (It's a bit more complicated than that, but the effect of it is nearly instantaneous.) And what's really great about DynamoDB is that after the traffic levels go down, it does not scale down your table and maintains it at the same capacity at no additional cost to you, so if you receive a burst of traffic at the same throughput, you can handle it even faster.
DynamoDB does a lot of magic under the hood, as well. My favorite is auto-sharding, i.e. it automatically moves your hot keys around so the demand is evenly distributed across your table.
So DynamoDB is pretty great. But to get the best experience from DynamoDB, you need to have a stable codebase, and design your tables around your access patterns. Because joining two tables isn't fun.
> So DynamoDB is pretty great. But to get the best experience from DynamoDB, you need to have a stable codebase, and design your tables around your access patterns. Because joining two tables isn't fun.
More than just joining--you're in the unenviable place of reinventing (in most environments, anyway) a lot of what are just online problems in the SQL universe. Stuff you'd do with a case statement in Postgres becomes some on-the-worker shenanigans, stuff you'd do with a materialized view in Postgres becomes a batch process that itself has to be babysat and managed and introduces new and exciting flavors of contention.
There are really good reasons to use DynamoDB out there, but there are also an absolute ton of land mines. If your data model isn't trivial, DynamoDB's best use case is in making faster subsets of your data model that you can make trivial.
They should be looking at Aurora, not Dynamo. Using Dynamo as the primary store for relational data (3000 tables!) sounds like an awful idea to me. I’d rather stay on Oracle.
It seems to me that what this is saying is that storage has become so cheap that if one database provides even slight advantages over another for some workload, it is likely to be deployed and have all the data copied over to it.
HN entrepreneurs take note, this also suggests to me that there may be a market for a database (or a "metadatabase") that takes care of this for you. I'd love to be able to have a "relational database" that is also some "NoSQL" databases (since there's a few major useful paradigms there) that just takes care of this for me. I imagine I'd have to declare my schemas, but I'd love it if that's all I had to do and then the DB handled keeping sync and such. Bonus points if you can give me cross-paradigm transactionality, especially in terms of coherent insert sets (so "today's load of data" appears in one lump instantly from clients point of view and they don't see the load in progress).
At least at first, this wouldn't have to be best-of-breed necessarily at anything. I'd need good SQL joining support, but I think I wouldn't need every last feature Postgres has ever had out of the box.
If such a product exists, I'm all ears. Though I am thinking of this as a unified database, not a collection of databases and products that merely manages data migrations and such. I'm looking to run "CREATE CASSANDRA-LIKE VIEW gotta_go_fast ON SELECT a.x, a.y, b.z FROM ...", maybe it takes some time of course but that's all I really have to do to keep things in sync. (Barring resource overconsumption.)
> I'd love to be able to have a "relational database" that is also some "NoSQL" databases (since there's a few major useful paradigms there) that just takes care of this for me. I imagine I'd have to declare my schemas, but I'd love it if that's all I had to do and then the DB handled keeping sync and such.
You might be interested in what we're building [0]
It synchronizes your data systems so that, for example, you can CDC tables from your Postgres DB, transform them in interesting ways, and then materialize the result in a view within Elastic or DynamoDB that updates continuously and with millisecond latency. It will even propagate your sourced SQL schemas into JSON schemas, and from there to, say, an equivalent Elasticsearch schema.
I think there was a project like this a few years ago (wrapping a relational DB + ElasticSearch into one box) and I thought it was CrateDB, but from looking at their current website I think I'm misremembering.
The concept didn't appeal to me very much then, so I never looked into it further.
---
To address your larger point, I think Postgres has a better chance of absorbing other datastores (via FDW and/or custom index types) and updating them in sync with its own transactions (as far as those databases support some sort of atomic swap operation) than a new contender has of getting near Postgres' level of reliability and feature richness.
My understanding of the CockroachDB architecture is that it's essentially two discrete components: a key value store that actually persists the data, and a SQL layer built on top.
Although I don’t think it’s recommended or supported to access the key value store directly.
I have no direct experience with scaling DynamoDB in production, so take this with a grain of salt. But it seems to me that the on-demand scaling mode in DynamoDB has gotten _really_ good the last couple of years.
For example, you used to have to manually set RCU/WCU to a high number when you expected a spike in traffic, since the ramp-up for on-demand scaling was pretty slow (could take up to 30 minutes). But these days, on-demand can handle spikes from 10s of requests a minute to 100s/1000s per second gracefully.
The downside of on-demand is the pricing - it's more expensive if you have continuous load. But it can easily become _much_ cheaper if you have naturally spiky load patterns.
> The downside of on-demand is the pricing - it's more expensive if you have continuous load.
True, although you don't have to make that choice permanently. You can switch from provisioned to on demand once every 24 hours.
And you can also set up application autoscaling in provisioned mode, which'll allow you to set parameters under which it'll scale your provisioned capacity up or down for you. This doesn't require any code and works pretty well if you can accept autoscaling adjustments being made in the timeframe of a minute or two.
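To make that concrete, here's roughly what target-tracking autoscaling on a provisioned table's read capacity looks like via boto3 (the table name, bounds, and target utilization are made-up values, not recommendations):

    import boto3

    autoscaling = boto3.client("application-autoscaling")
    resource_id = "table/my-table"  # hypothetical table

    # Let Application Auto Scaling manage read capacity between 5 and 500 RCU.
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # Target tracking: keep consumed/provisioned read utilization around 70%.
    autoscaling.put_scaling_policy(
        PolicyName="my-table-read-target-tracking",
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    )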
We have some regular jobs that require scaling up DynamoDB in advance a few times per day, but Dynamo is only able to scale down 4x per day, so we were probably paying for unnecessary overcapacity (10x or more) for a couple of hours a day.
Now we've just moved to on-demand and let them handle it; works fine.
> Is the dream of a self-managing, fire-and-forget key value database completely naive?
It's not, if you plan it right. Learn about single table design for DynamoDB before you start. There are a lot of good resources from Amazon and the community.
Here is a very accessible video from the community:
If you use single table design, you can turn on all of the auto-tuning features of DynamoDB and they will work as expected and get better and more efficient with more data.
Some people worry that this breaks the cardinal rule of microservices: One database per service. But the actual rule is never have one service directly access the data of another, always use the API. So as long as your services use different keyspaces and never access each other's data, it can still work (but does require extra discipline).
A lot of things that used to be a concern (hot partitions, etc) are not a concern anymore and most have been solved these days :)
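A tiny illustration of what single table design looks like in practice - generic pk/sk attribute names and entirely hypothetical entities, just to show how one table serves several access patterns via key prefixes:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("app-single-table")  # hypothetical

    # One table, several entity types, distinguished entirely by key prefixes.
    table.put_item(Item={"pk": "USER#42", "sk": "PROFILE", "name": "Ada"})
    table.put_item(Item={"pk": "USER#42", "sk": "ORDER#2021-09-01#1001", "total": 37})
    table.put_item(Item={"pk": "USER#42", "sk": "ORDER#2021-09-03#1002", "total": 12})

    # "All orders for user 42 in September" is a single Query: no scan, no join.
    orders = table.query(
        KeyConditionExpression=Key("pk").eq("USER#42")
        & Key("sk").begins_with("ORDER#2021-09")
    )["Items"]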
Put it on on-demand pricing (it'll be better and cheaper for you most likely), and it will handle any load you throw at it. Can you get it to throttle? Sure, if you absolutely blast it without ever having had that high of a need before (and it can actually be avoided[0]).
You will need to understand how to model things for the NoSQL paradigm that DynamoDB uses, but that's a question of familiarity and not much else (you didn't magically know SQL either).
My experience comes from scaling DynamoDB in production for several years, handling both massive IoT data ingestion and user data. We were able to completely replace all the things we thought we would need a relational database for.
My comparison between a traditional RDS setup:
- DynamoDB issues? 0. Seriously. Only thing you need to monitor is billing.
- RDS? Oh boy, need to provision for peak capacity, need to monitor replica lags, need to monitor the Replicas themselves, constant monitoring and scaling of IOPS, suddenly queries get slow as data increases, worrying about indexes and the data size, and much more...
> We're at early stages of planning an architecture where we offload pre-rendered JSON views of PostgreSQL onto a key value store optimised for read only high volume.
If possible, put the json in Workers KV, and access it through Cloudflare Workers. You can also optionally cache reads from Workers KV into Cloudflare's zonal caches.
> To be honest, I'd hoped that it could be a bit more 'magic', like S3
You could opt to use the slightly more expensive DynamoDB On-Demand, or the free DynamoDB Auto-Scaling modes, which are relatively no-config. For a very read-heavy workload, you'd probably want to add DynamoDB Accelerator (a write-through in-memory cache) in front of your tables. Or, use S3 itself (but an S3 bucket doesn't really like it when you load it with a tonne of small files) accelerated by CloudFront (which is what AWS Hyperplane, the tech underpinning ALB and NLB, does: https://aws.amazon.com/builders-library/reliability-and-cons...)
It is a resource that can often be the right tool for the job but you really have to understand what the job is and carefully measure Dynamo up for what you are doing.
It is _easy_ to misunderstand or miss something that would make Dynamo hideously expensive for your use case.
Hot keys are the primary one. They destroy your "average" calculations for your throughput.
Bulk loading data is the other gotcha I've run into. Had a beautiful use case for steady read performance of a batch dataset that was incredibly economical on Dynamo, but the cost/time for loading the dataset into Dynamo was totally prohibitive.
Basically Dynamo is great for constant read/write of very small, randomly distributed documents. Once you are out of that zone things can get dicey fast.
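On the bulk-loading gotcha: boto3's batch_writer handles the BatchWriteItem batching and retries for you, but every item still consumes write capacity, which is exactly where the prohibitive load cost comes from. A sketch with a hypothetical table and synthetic data:

    import boto3

    table = boto3.resource("dynamodb").Table("batch-dataset")  # hypothetical

    # batch_writer groups puts into BatchWriteItem calls and retries unprocessed
    # items, but every item written still consumes write capacity, so loading
    # millions of rows is paid for at full WCU price no matter how you batch it.
    with table.batch_writer() as batch:
        for i in range(1_000_000):
            batch.put_item(Item={"pk": f"ITEM#{i}", "payload": "x" * 100})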
I do not recommend starting off with a decision to use DynamoDB before you have worked with it directly for some time to understand it. You could spend months trying to shoehorn your use case into it before realizing you made a mistake. That said, DynamoDB can be incredibly powerful and inexpensive tool if used right.
Yea, probably, but it is especially true for DynamoDB because it can initially appear as though your use cases are all supported but that is only because you haven't internalized how it works yet. By the time you realize you made a mistake, you are way too far in the weeds and have to start over from scratch. I would venture that more than 50% of DynamoDB users have had this happen to them early on. Anecdotally, just look at the comments on this post. There are so many horror stories with DynamoDB, but they're basically all people who decided to use it before they really understood it.
I believe it used to be static provisioning: you'd set the read and write capacity limits beforehand. Then obviously there is autoscaling of those, but it is still steps of capacity being provisioned.
They now have a dynamic provisioning scheme where you simply don't care, but it is more expensive, so if you have predictable requirements it is still better to use static capacity provisioning. At least there is an option now, though.
DynamoDB also requires the developer to know about its data storage model. While this is generally a good practice for any data storage solution, I feel like Dynamo requires a lot more careful planning.
I also think that most of the best practices, articles etc apply to giant datasets with huge scale issues etc. If you are running a moderately active app, you probably can get away with a lot of stupid design decisions.
My experience with dynamic provisioning has been that it is pretty inelastic, at least at the lower range of capacity. E.g. if you have a few read units and then try to export the data using AWS's cli client, you can pretty quickly hit the capacity limit and have to start the export over again. Last time, I ended up manually bumping the capacity way up, waiting a few minutes for the new capacity to kick in, and then exporting. Not what I had in mind when I wanted a serverless database!
I understand it's not really your point, but if you're actually looking to export all the data from the table, they've got an API call you can give to have DynamoDB write the whole table to S3. This doesn't use any of your available capacity.
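(It needs point-in-time recovery enabled on the table.) Roughly, with boto3 and made-up ARN/bucket names:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Export the whole table to S3 without consuming any read capacity.
    # ARN and bucket are hypothetical; point-in-time recovery must be enabled.
    export = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
        S3Bucket="my-export-bucket",
        ExportFormat="DYNAMODB_JSON",
    )
    print(export["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS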
Yes, you have to learn about all these things upfront. But once you figure it out, test it, and configure it - it will work as you expect. No surprises.
Whereas Relational Databases work until they don't. A developer makes a tiny (even a no-op) change to a query or stored procedure, a different SQL plan gets chosen, and suddenly your performance/latency dramatically reduces, and you have no easy way to roll it back through source control/deployment pipelines. You have to page a DBA who has to go pull up the hood.
It is for now, but it doesn't have to be. Dynamo's design isn't particularly amenable to dynamic and heterogeneous shard topologies, however.
There could exist a fantasy database where you still tell it your hash and range keys, which are roughly how you tell the database which data isn't closely related to each other and which data is (and which you may want to scan), but instead of hard provisioning shard capacity it automagically splits shards when they hotspot and doesn't rely on consistent hashing, so that every shard can be sized differently depending on how hot it is.
Right now such a database doesn't exist AFAICT, as most places that need something that scales big enough also generally have the skill to avoid most of the pitfalls that cause problems on simple databases like Dynamo.
I’d urge you to start writing a prototype, a lot of your assumptions might get thrown out the window. Dynamo is not necessarily good for reading high volume. You’ll end up needing to use a parallel scan approach which is not fast.
I'd say Dynamo is extremely good at reading high volume, with the appropriate access pattern. It's very efficient at retrieving huge amounts of well partitioned data using the data's keys, but scanning isn't so efficient.
You can only ever fetch 1MB of data at a time though, even when using the more efficient query method (as opposed to scan). If your individual entities are not very tiny, it is hard to get for instance 2M items back in a reasonable amount of time.
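Right - and since each Query response is capped at 1 MB, pulling a large result set means following LastEvaluatedKey page by page. A minimal sketch with boto3 and a hypothetical table/key schema:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("events")  # hypothetical

    def query_all(partition_key_value):
        """Follow LastEvaluatedKey until every ~1 MB page has been consumed."""
        kwargs = {"KeyConditionExpression": Key("pk").eq(partition_key_value)}
        while True:
            page = table.query(**kwargs)
            yield from page["Items"]
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]  # next page

    items = list(query_all("device#123"))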
I don't know your scaling needs, but I would highly recommend just using Aurora postgresql for read-only workloads. We have some workloads that are essentially K/V store lookups that were previously slated for dynamodb. On an Aurora cluster of 3*r6g.xlarge we easily handle 25k qps with p99 in the single-digit ms range. Aurora can scale up to 15 instances and up to 24xlarge, so it would not be unreasonable to see 100x the read workload with similar latencies.
Happy to talk more. We're actively moving a bunch of workloads away from DynamoDB and to Aurora so this is fresh on our minds.
The salespeople always promise magic and handwave CAP away.
But data at scale is about:
1) knowing your queries ahead of time (since you've presumably reached the limit of PG/maybesql/o-rackle).
2) dealing with CAP at the application level: distributed transactions, eventual consistency, network partitions.
3) dealing with a lot more operational complexity, not less.
So if the snake oil salesmen say it will be seamless, they are very very very much lying. Either that, or you are paying a LOT of money for other people to do the hard work.
Which is what happens with managing your own NoSQL vs DynamoDB. You'll pay through the roof for DynamoDB at true big data scales.
If you know and understand S3 pretty well, and you purely need to generate, store, and read materialized static views, I highly recommend S3 for this use case. I say this as someone who really likes working with DDB daily and understands the tradeoffs with Dynamo. You can always layer on Athena or (simpler) S3 Select later if a SQL query model is a better fit than KV object lookups. S3 is loosely the fire and forget KV DB you’re describing IMO depending on your use case
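To make the S3-as-KV idea concrete for the pre-rendered JSON views upthread, the whole "database" is a couple of calls. A sketch with boto3 (bucket name and key scheme are made up):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "prerendered-views"  # hypothetical bucket

    def put_view(view_id: str, payload: dict) -> None:
        """Materialize a pre-rendered JSON view under a predictable key."""
        s3.put_object(
            Bucket=BUCKET,
            Key=f"views/{view_id}.json",
            Body=json.dumps(payload).encode("utf-8"),
            ContentType="application/json",
        )

    def get_view(view_id: str) -> dict:
        """Plain KV-style read; CloudFront can sit in front for hot paths."""
        obj = s3.get_object(Bucket=BUCKET, Key=f"views/{view_id}.json")
        return json.loads(obj["Body"].read())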
Plenty of options already exist. DynamoDB has both autoscaling and serverless modes. AWS also has managed Cassandra (runs on top of DynamoDB) which doesn't need instance management.
Azure has CosmosDB, GCP has Cloud Datastore/Firestore, and there are many DB vendors like Planetscale (mysql), CockroachDB (postgres), FaunaDB (custom document/relational) that have "serverless" options.
Exactly. This has been my experience with several AWS technologies. Like with their ElasticSearch service, where I had to constantly fine-tune various parameters, such as memory. I was curious why they couldn't auto-scale the memory, why I had to do that manually. There are several AWS services that should be a bit more magical, but they are not.
There's not really magic with S3 either; you still need to name things with coherent prefixes to spread around the load.
DynamoDB is almost simple enough to learn in a day. And if you're doing nothing with it, you're only really paying for storage. Good luck with your decisions.
I'm not going to speculate on the accuracy of 90% value, but I will say that appropriately prefixed objects substantially help with performance when you have tons of small-ish files. Maybe most orgs don't have that need but in operational realms doing this with your logs make the response faster.
Your impressions are correct: DynamoDB is quite low-level and more like a DB kit than a ready-to-use DB; for most applications it's better to use something else.
If you use the "pay per request" billing model instead of provisioned throughput, DynamoDB scaling is self-managing, and you can treat your DB as a fire-and-forget key/value store. You need to plan how you'll query your data and structure the keys accordingly, but honestly, that applies even more to S3 than it does to Dynamo.
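Concretely, the fire-and-forget setup is just a table created with on-demand billing - there are no capacity numbers anywhere. A sketch with boto3 and a hypothetical table:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # On-demand (pay-per-request) table: no RCU/WCU to provision or monitor.
    dynamodb.create_table(
        TableName="kv-store",  # hypothetical
        AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )

    # An existing provisioned table can be switched the same way
    # (once per 24 hours, as noted elsewhere in the thread):
    # dynamodb.update_table(TableName="kv-store", BillingMode="PAY_PER_REQUEST")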
Exactly my experience. I got sucked into using it more than once, thinking it would be better next time, but there are just so many sharp edges.
At one company, someone accidentally set the write rate very high to transfer data into the DB. This had the effect of permanently increasing the shard count to a huge number, basically making the DB useless.
I think this is a good summary, and it even gets more complicated if you start using the DAX cache. Your read/write provisioning for DAX is totally different than the underlying dynamodb tables. The write throughput for Dax is limited by the size of the master node in the cluster. Can you say bottleneck?
Take a look at Firestore / Google Cloud Datastore. It's pretty much exactly what you describe - fire and forget. There's no concept of "node" (at least not from the outside).
Thinking like this baffles me, but it also makes me happy, because there will always be a need for people like me: infra. AWS is not a magical tool that will replace your infra team; it is a magical tool that will allow your infra team to do more. I am the infra team of my startup and I estimate that only 50% of my time is spent doing infra work. The rest is supporting my peers, working on frameworky stuff, solving dev efficiency issues, bla bla.
Lets say that you operate in an AWS-less environment, with everything bare metal, in a datacenter. Your GOOD infra team has to do the following:
Hardware:
- make sure there is a channel to get new hardware, both for capacity increase and spares. What are you going to do? Buy 1 server and 2 spares? If one of the servers has an issue, isn't it quite likely that the other servers, from the same batch, to have the same issue? Is this affecting you, or not? Where do you store the spares? In a warehouse somewhere, making it harder to deploy? In the rack with the one in use, wasting rackspace/switch space? Are you going to rely on the datacenter to provide you with the hardware? What if you are one of their smaller customers and your requests get pushed back because some larger customer requests get higher priority?
- make sure there is a way to deploy said hardware. You don't want to not be able to deploy a new server because there is no space in the rack, or no space in the switch. Where are your spares? In a warehouse miles away from the datacenter? Do you have access to said warehouse at midnight, on Thanksgiving? Oh shit, someone lost the key to your rack! Oh noes, we don't have any spare network cable/connectors/screws...
Software:
- did you patch your servers? did you patch your switches?
- new server, we need to install the os. And a base set of software, including the agent we use to remote manage the server.
- oh, we also need to run and maintain the management infra, say the control plane for k8.
- oh, we want some read replicas for this db, not only we need the hardware to run the replicas on (and see above for what that means), now you need to add a bunch of monitoring and have plans in place to handle things like: replicas lagging, network links between master and replicas being full, failover for the above, master crapping out yada yada.
I bet there are many other aspects I'm missing.
Choices:
Your GOOD infra team will have to decide things like: how many spares do we need? Is the capacity we have atm enough for the launch of our next world-changing feature that half the internet wants to use? Are we lucky enough to survive a few months without spares or should we get extra capacity in another datacenter? Do we want to have replicas on the west coast or is the latency acceptable?
These are the main areas of what an infra team is supposed to do: Hardware, Software and Choices. AWS (and most other cloud providers) is making the first 2 areas non-issues. For the last area you can do 2 things: get an infra team (could be a full-fledged team, could be 1 person, you could do it) and theoretically you will get choices tailored to what your business needs, OR let AWS do it for you. *AWS might make these choices based on a metric you disagree with, and this is the main reason people complain*.
Every system I've built on DynamoDB just works. The APIs that use it have had virtually 100% uptime and have never needed database maintenance. It is not a replacement for a RDBMS, but for some use cases, it's a killer service.
As a developer, I really have a love-hate relationship with Dynamo. I love how fast and easy it is to setup and get rolling.
The partitioning scheme came off as confusing and opaque but I think that says more about Amazon's documentation than the scheme itself.
I do not like that there's really no third-party tooling integration for querying. Their UI in the console is _so freaking terrible_, yet you have no other way than code to query it. This problem is so bad that I will avoid using it where I can, despite it being a good option, performance-wise.
We realized how great Dynamo was only after we migrated off AWS.
Dynamo was a key factor for us when we were releasing the MVP of our News API [0]. We used Dynamo, ElasticSearch, Lambda and could get it running in 60 days while being employed full-time.
Also, the best tech talk I've seen was given by Rick Houlihan at re:Invent [1]
I highly recommend every engineer watch it: it's a great overview of SQL vs NoSQL
On that thread he criticizes AWS regarding DynamoDB openly.
> I will always love DynamoDB, but the fact is it is losing ground fast because AWS focuses most of their resources on the half baked #builtfornopurpose database strategy. I always hated that idea, I just bit my tongue instead of saying it.
> The problem is the other half-baked database services that all compete for the same business. DocumentDB, Keyspaces, Timestream, Neptune, etc. Databases take decades to optimize, the idea that you can pump them out like web apps is silly.
> I was very tired of explaining over and over again that DynamoDB is actually not the dumbed down Key-Value store that the marketing message implied. When AWS created 6 different NoSQL databases they had to make up reasons for each one and the messaging makes no sense.
Interesting. MongoDB actually came to mind while I was reading the other comment here:
> No one uses DynamoDB alone: they bolt it onto Postgres after realizing they have availability or scale needs beyond what a relational database can do, then they bolt on Elasticsearch to enable querying, and then they bolt on Redis to make the disjointed backend feel fast. And I'm just talking operational use cases; ignoring analytics here.
Today I would choose JSON in Postgres before I would just jump to Mongo, but it certainly serves a purpose for many shops and it is still widely used AFAIK.
It had some operational quirks 10 years ago (allocating giant chunks of space was more of an issue than data loss) and I've not used it directly in many years. We lost some data during an OOM process kill, but it was just Twitter firehose data so not a huge deal.
For anyone else expecting this to be a paper given the domain name, it’s not.
It’s a non-technical interview with a couple of the original paper's authors.
Not bad, just not as exciting as a paper detailing what they've learnt, from a distributed systems perspective, operating Dynamo and then DynamoDB for so long would be.
We don't have a paper on DynamoDB's internals (yet?), but here's a talk you might find interesting from one of the folks who built and ran DDB for a long time: https://www.youtube.com/watch?v=yvBR71D0nAQ
If we did publish more about the internals of DDB, what would you be looking to learn? Architecture? Operational experience? Developer experience? There's a lot of material we could share, and it's useful to hear where people would like us to focus.
To be honest, as a customer, it is hard for me to justify using DynamoDB. Some of this criticism can be out of date:
1. DynamoDB is not as convenient. There are a bit too many dials to turn.
2. DynamoDB does not have a SQL facade on top.
3. DynamoDB is proprietary, I believe there's no OSS API equivalent if you want to migrate out.
4. DynamoDB was kind of expensive. But it has been a while since I last checked the pricing page.
It's simply much better to start with PostgreSQL Aurora and move to a more scalable storage based on specific uses-cases later. For example: Cassandra, Elastic, Druid, or CockroachDB.
I strongly agree that most early stage businesses should be on Postgres. There's simply too much churn in early stage data models. Also, unforeseen esoteric needs constantly jump out of the woodwork that you can knock out with a SQL query instead of having to build a solution. However, this does assume that your development team has a competent understanding of SQL.
I've been in a couple startups that went Dynamo first and development velocity was a pale shadow of velocity with Postgres. When one of those startups dumped dynamo for Postgres velocity multiplied immediately. I'd estimate we were moving at around 1000% and the complete transition took less time than even I expected (about a month). Once the business matures, moving tables onto dynamo and wrapping them in a microservice makes a lot of sense. Dynamo does solve a lot of problems that become increasingly material as the business evolves.
Eventually, SQL's presence declines and transitions into an analytics system as narrower, but easier for ops, options proliferate.
Ad 3. A lot of people don't know about it, but there's an open source, free, DynamoDB-compatible database called ScyllaDB - the API is called Alternator, to be specific.
We landed on DynamoDB when we migrated a monolith to microservice architecture. I have to say that DynamoDB fits fairly well in the microservices world where the service scope is small, query patterns are pretty narrow and don't really change much. Building new things using DynamoDB when query patterns aren't necessarily known is very painful and require tedious migration strategies unless you don't mind paying for GSIs.
Quite a few of the teams that were early adopters of AWS DynamoDB were not prepared for the pricing nuances that had to be taken into consideration when building their solutions.
I remember trying DynamoDB around 2015/2016: you had to specify your expected read and write throughput and you would be billed for that. At that time we had a pretty spiky traffic use case, which made using DynamoDB efficiently impossible.
I had a similar experience, but ultimately wrote a service to monitor our workloads and request increased provisioning during spikes. You could reduce your provisioning like 10 times a day, but after that you could only increase it and would be stuck with the higher rate for a time.
And then on-demand provisioning was released and it was cheap enough to be worth simplifying our workflows.
I was one of these. However I now understand that the pricing nuances reflected a reality that I appreciate. We used DDB in a way that was not the best fit and the cost was a reflection of this.
If you follow Rick Houlihan (@houlihan_rick), all the accolades AWS gets for DynamoDB pale in comparison to its current team and execution; the company seems to not be investing in it, so much so that Rick left to join MongoDB.
Man I love Rick’s talks as much as anyone but let’s be real, he likely left AWS not for his love of first class geographical indexes but because Mongo offered a giant pile of money for him to evangelize their tech.
Though I have no doubt that he actually had a lot of reservations around Dynamo's DX before; he likely has some around MongoDB too, but those won't be the bulk of his content.
At his rank at AWS I don’t know if money was such an issue. He strikes me as a person who cares deeply about the underlying tech. But I have no idea one way or the other.
I think I've seen you post something similar on r/aws about how Rick was "top DynamoDb person at AWS" (apologies if that wasn't you). I think you are overestimating Rick's "rank".
I just looked him up (I had not heard of him before seeing his name mentioned on r/aws a few days ago) and he was an L7 TPM/Practice Manager in AWS's sales organization. That's not really a notably high position, and in the grand scheme of Amazon pay scales, isn't that high up. An L7 TPM gets paid about the same as, or sometimes less than, an L6 software dev (L6 is "senior", which is ~5-10 years of experience).
Also, him being in the sales org means he had practically nothing to do with the engineering of the service. AWS Sales is a revolving door of people. I mean no offense towards Rick (again, I didn't know him or even know of him before I read his name in a comment a few days ago), but I would not read anything at all into the fact that an L7 Sales TPM left for another company.
Actually, I was a direct report to Colin Lazier (https://twitter.com/clazier) who is the GM for DynamoDB, Keyspaces, and Glue Elastic Views. I was the original TPM for DocumentDB before joining the Professional Services team as a Senior Practice Manager to head up the NoSQL Blackbelt team which led the architecture/design effort for Amazon's RDBMS->NoSQL migration. I was brought back to the service team by Jim Scharf to lead the technical solutions team for strategic accounts, but I maintained the org chart role of Senior Practice Manager until I left for MongoDB.
Compensation was a minor issue. I was an org chart aberration already and AWS pulled out all the stops to retain me. I will always appreciate the opportunity that AWS provided me and my time at DynamoDB will always hold a special place in my heart. I really do believe that MongoDB is poised to do great things and my decision had more to do with being a part of that than anything else.
You never heard of Rick Houlihan? He is 90% of DynamoDB evangelism...
At the same time, you are able to do these internal lookups? Do you work with DynamoDB?
AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
https://youtu.be/HaEPXoXVf2k
Maybe that was the problem. He cited that there was seemingly not enough effort going into making DynamoDB better, as evidenced by the many heavily overlapping other DBs that AWS promotes. If Rick had his ear to the ground listening to customers and sending back feedback, but it was falling on deaf ears, that's enough ground for someone as high up and as influential and productive as him to leave. It also speaks to inner AWS turmoil, at least at DynamoDB.
>It also speaks to inner AWS turmoil at least at DynamoDB.
How? Rick wasn't part of the DynamoDB service team. He wasn't an engineer, nor a manager on the team, nor even a product manager. He was a salesperson that specialized in DDB. He most likely had very little interaction, if any, with the engineering team. I don't see how him leaving speaks at all to anything about the inner workings of the engineering teams.
Rick seems cool, and after skimming some of his chats he seems really knowledgeable about the customer-facing side of DDB, and I mean absolutely no disrespect to him. But I think you're making way too many assumptions about his "rank" and "influence" within the company.
I have watched almost all those talks as they are technically dense and full of very good and very useful technical knowledge that I would be much poorer for not watching. These are not sales videos but highly complex instructional content meant for developers on the ground
There are over a thousand breakout sessions at every reinvent every year. Some of the speakers are sales people, some are engineers, some are managers. There are L5 or junior engineers who give reinvent session talks. It's a fun gig, but it doesn't mean that the speaker is some top executive or anything like that.
Rick was in the sales org. His primary job was sales. Reinvent is a sales conference. Speaking at reinvent is a sales pitch. He was a salesperson. I'm not sure why you're so offended by that. Being a salesperson isn't bad, it's just an explanation for why engineers wouldn't have heard of him.
What do you think Solutions Architects and Developer Advocates (between the two groups who do most Re:invent sessions) are?
Hell, what do you think re:Invent is? It's a sales conference.
In any company you have two groups of people: Those that build the product, and those that sell it. Ultimately, solutions architects and developer advocates are there to help sell the product.
Of course Amazon is customer obsessed. And genuinely interested in ensuring customers have a good experience, and their technical needs are met - through education, support, and architectural guidance. But ultimately, that's what it is.
No, I haven't. There are thousands of reinvent sessions every year. I don't watch them all (I hardly watch any of them, and most people I know in Amazon watch a couple of breakout sessions at most. Some don't even watch the keynotes). Their targeted audience is AWS customers, not internal engineers. Reinvent itself is a sales conference. If internal Amazonians want to learn about something like DDB, there are internal talks and documents given by the engineering leaders that we watch.
>At the same time, you are able to do these internal lookups?
I looked him up on LinkedIn. Nothing internal about it.
unless he posts here about it we can't really know -- we can only speculate but I think he had a higher amount of influence than his title/rank might suggest. I think Rick's influence with respect to DynamoDB is akin to that of Kelsey Hightower's influence over k8s at Google.
The only experience I've had with DynamoDB has been in AWS: we set up a testing DB with defaults... then we left it there... two months later we realized we'd lost about $1,500. Our mistake was to use the setup defaults (which had some sort of auto-scaling). I hope AWS has corrected this in the meanwhile (we did let them know).
I haven't run into anyone who uses Dynamo for anything other than managing Terraform backend state locking. And I think that's still the best use case for it: you just want to store a couple random key-values somewhere and have more functionality than AWS Parameter Store. Trying to build anything large-scale with it will probably leave you wanting.
DynamoDB for me is the perfect database for my serverless / graphql API. My only gripe is the limitation of items in a transaction of 25. I've had to resort to layering another transaction mgmt system on the top of it.
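For anyone who hasn't hit it: a single TransactWriteItems call was capped at 25 actions (at the time of this thread), so anything bigger has to be split into separate calls that are no longer atomic with each other - hence the extra transaction-management layer. A sketch of one maximal call with boto3 and hypothetical items:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # One atomic transaction: at most 25 actions per TransactWriteItems call
    # (the limit discussed above). Larger batches must be split into separate
    # calls, which are then NOT atomic with respect to each other.
    actions = [
        {
            "Put": {
                "TableName": "orders",  # hypothetical table
                "Item": {
                    "pk": {"S": f"ORDER#{i}"},
                    "status": {"S": "NEW"},
                },
            }
        }
        for i in range(25)
    ]
    dynamodb.transact_write_items(TransactItems=actions)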
we use DynamoDB like a big hash table of S3 file locations... we look up these locations via a key (at the time, it sounded like a pretty good use case for it). I suppose we could have used some other managed Redis or memcached thing, but being an AWS shop, it was, and is, pretty useful. I have to say, it's been pretty effortless to configure... read/write units are really the only thing we've had to configure (other than the base index). The rest of it has been easy. It has about 100 million entries that are read or written pretty quickly.
I remember talking to someone who was playing with AWS stuff for the first time and they had a similar architecture, using Dynamo for a lookup store. It still seems a bit odd to me though. It's been a long time since I've worked with the S3 API, so maybe it just doesn't support the same sort of thing, but wouldn't it be nicer to just query S3 with some key and get back either the path/URL to render a link, or the content itself? Why the Dynamo intermediary? (And on the other side, if you don't need to render a link to serve the content, why not use Dynamo as the actual document store and skip S3? Storage cost?)
Moreover, storing a large binary file (in this case) in Dynamo is probably not the best use case for it... most likely you would have to convert to base64 in and out of it.
This will sound flippant, but that's not what Dynamo is for. If you want to do freeform relational queries like that then put it in a relational database.
Dynamo is primarily designed for high volume storage/querying on well understood data sets with a few query patterns. If you want to be able to query information on employees based on their name and city you'll need to build another index keyed on name and city (in practice Dynamo makes that reasonably simple by adding a secondary index).
Amazon has a perfect use case for this. You click on a product in the search results, that url contains a UUID, that UUID is used to search Dynamo and returns an object that has all the information on the product, from that you build the page.
If what you are trying to do looks more like "Give me all the customers that live in Cuba and have spent more than $10 and have green eyes", Dynamo isn't for you. You can query that way but after you put all the work in to get it up and running, you'd probably be better off with Postgres.
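And to make the parent's "build another index keyed on name and city" concrete: with such a GSI in place, the lookup stays a cheap Query. A sketch with boto3 (table and index names are made up):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("employees")  # hypothetical

    # Query a hypothetical GSI "city-name-index" (hash key: city, range key: name).
    matches = table.query(
        IndexName="city-name-index",
        KeyConditionExpression=Key("city").eq("Havana") & Key("name").begins_with("Mar"),
    )["Items"]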
If that's one of 12 or fewer query patterns you need, I can write you a simple Dynamo table for it.
Dynamo's limitation is that it can only support n different query patterns, and you have to hand-craft an index for each one (well, sometimes you can get multiple on one index).
Alternatively, practice single table design: structure your table keys in such a way that they can represent all (or at least most) of the queries you need to run.
This is often easier said than done, but it can be far less expensive and more performant than adding an index for each search.
That's not what DynamoDB is for. If you need to run queries like that, you should be using an RDBMS. DynamoDB should only really be used for use cases where the queries are known up-front. There are ways to design your data model in Dynamo so that you could actually run queries like that, but you would have had to do that work from day 1. You won't be able to retroactively support queries like that.
The most reliable way to build a system with DynamoDB is to plan queries upfront. Trying to use it like a SQL database and make Ad-Hoc queries won't work because it's not a SQL DB.
Data should be stored in the fashion you wish for it to be read, and storing the same data in more than one configuration is acceptable.
DynamoDB is not meant for ad-hoc query patterns; as others have said, plan your indexes around your access patterns.
However, so long as you add a global secondary index (GSI) with name, city as the key, you can certainly do such things (see the sketch after this list). But be aware for large-scale solutions:
1. There's a limit of 20 GSIs per table. You can increase with a call to AWS support.
2. GSIs are updated asynchronously; read-after-write is not guaranteed, and there is no "consistent read" option on a GSI like there is with tables.
3. WCUs on GSIs should match (or surpass) the WCUs on the original table, else throughput limit exceeded exceptions will occur. So, 3 GSIs on a table means you pay 4x+ in WCU costs.
4. The keys of the GSI should be evenly distributed, just like the PK on a main table. If not, there is additional opportunity for hot partitions on write.
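For reference (as mentioned above the list), adding such a GSI to an existing table is one update_table call. A boto3 sketch with made-up names; note how the GSI carries its own provisioned throughput, which is point 3 above:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Add a hypothetical GSI keyed on (city, name) to an existing table.
    # Per point 3 above, the GSI gets its own write capacity, billed on top
    # of the base table's (omit ProvisionedThroughput on on-demand tables).
    dynamodb.update_table(
        TableName="employees",  # hypothetical
        AttributeDefinitions=[
            {"AttributeName": "city", "AttributeType": "S"},
            {"AttributeName": "name", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "city-name-index",
                    "KeySchema": [
                        {"AttributeName": "city", "KeyType": "HASH"},
                        {"AttributeName": "name", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": 10,
                        "WriteCapacityUnits": 10,
                    },
                }
            }
        ],
    )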
I struggled at first but I watched Advanced Design Patterns for DynamoDB[0] a few times and it clicked. As other responses have suggested, generally you define your access patterns first and then structure the data later to fit those access patterns.
DynamoDB (and other dynamo-like systems like Cassandra, Bigtable) are just advanced key/value stores. They support multiple levels of keys->values but fundamentally you need the key to find the associated value.
If you want to search by parameters that aren't keys then you need to store your data that way. Most of these systems have secondary indexes now, and that's basically what they do for you automatically in the backend, storing another copy of your records using a different key.
If you need adhoc relational queries then you should use a relational database.
Note that this is just a new syntax for the existing querying capabilities. If you query something that's not in the hash/sort key, you still need to filter on the client side, and the 1 MB data set size limit etc. still applies.
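Assuming the "new syntax" here is PartiQL: a statement that constrains the partition key is still a Query underneath, and one that doesn't is still a Scan. A sketch with boto3 and a hypothetical table:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # SQL-looking, but because the WHERE clause pins the partition key this is
    # still a Query underneath; the usual 1 MB page size and pagination apply.
    resp = dynamodb.execute_statement(
        Statement='SELECT * FROM "orders" WHERE pk = ?',  # hypothetical table
        Parameters=[{"S": "USER#42"}],
    )
    items = resp["Items"]               # first page
    next_token = resp.get("NextToken")  # present when more pages remain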
I haven't used DynamoDB in a couple of years, so I'd be curious to know how querying compares if anyone can share some light that has used both Cosmos and Dynamo recently.