
What specifically are you running into? I may have trouble seeing any gaps in the documentation because I've been running Rails in a containerized environment for years and it seems to just work.


I'm still in the planning stages of migrating Rails from EC2 to ECS.

I asked because the parent commenter said they were running into problems. But if your experience has been smooth sailing so far, then that's great.


You should have zero fear instilled when pressing any button. The system or process has failed if a single person pressing a button can bring something down unintentionally. Fix the system/process, don't "instill fear" in the person. That's toxic, and now you have to make sure every newly onboarded person has "fear instilled", which is hugely unproductive.


That's precisely my point. A lot of people have no fear because they're complacent or ignorant, not because the button is well engineered.

But to get there you need to fear the bad outcomes.


I’m sorry, but this really hasn’t been my experience at all in web technology or managing on-prem systems either.

I used to be extremely fearful of making mistakes, and used to work in a zero-tolerance fear culture. My experience and the experience of my teammates on the DevOps team? We did everything manually and slowly because we couldn’t see past our own fear to think creatively on how to automate away errors. And yes, we still made errors doing everything carefully, with a second set of eyes, late night deployments, backups taken, etc.

Once our leadership changed to espouse a culture of learning from mistakes, and of not being afraid to make one as long as you can recover and improve something, we reduced a lot of risk and actually automated away many of the errors we typically made, errors initially caused by the fear culture.

Just my two cents.


I'm not talking about fear culture. It's a more personal thing. Call it a risk-management culture if that helps, which is the inverse.

Manual is wrong for a start: it increases the probability of making a mistake and thus increases risk. The mitigations are automation, review, and testing.

I agree with you. Perhaps fear was the wrong term. I treat it as my personal guide to how uneasy I feel about something on the quality front.


I recalled Akimov pressing the AZ-5 button in Chernobyl...


_credit card_ payments are like that, but there is no technical reason for payments to work like that anymore.


Now you need to queue up events for some time, reorder them by timestamp, and then process them. It's possible, but it has overhead in both performance and custom code you'll have to maintain. And if there is no guarantee of order, two separate systems consuming the same events might get different results depending on the implementation, which can be problematic.
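
A minimal sketch of what that custom reordering code tends to look like, in Go (the Event type, grace period, and all names here are hypothetical, just to show the buffer-sort-release shape you'd be signing up to maintain):

    // Hypothetical sketch: buffer out-of-order events, release them in
    // timestamp order once they are older than a grace period.
    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type Event struct {
        Timestamp time.Time
        Payload   string
    }

    type Reorderer struct {
        grace  time.Duration // how long we wait for stragglers
        buffer []Event
    }

    func (r *Reorderer) Add(e Event) {
        r.buffer = append(r.buffer, e)
    }

    // Flush returns buffered events older than the grace period, in
    // timestamp order. Events arriving even later than that are the
    // problem case the comment above alludes to.
    func (r *Reorderer) Flush(now time.Time) []Event {
        sort.Slice(r.buffer, func(i, j int) bool {
            return r.buffer[i].Timestamp.Before(r.buffer[j].Timestamp)
        })
        cutoff := now.Add(-r.grace)
        n := sort.Search(len(r.buffer), func(i int) bool {
            return r.buffer[i].Timestamp.After(cutoff)
        })
        ready := r.buffer[:n]
        r.buffer = append([]Event(nil), r.buffer[n:]...)
        return ready
    }

    func main() {
        r := &Reorderer{grace: 5 * time.Second}
        now := time.Now()
        r.Add(Event{Timestamp: now.Add(-8 * time.Second), Payload: "second"})
        r.Add(Event{Timestamp: now.Add(-10 * time.Second), Payload: "first"}) // arrived late
        for _, e := range r.Flush(now) {
            fmt.Println(e.Payload) // prints "first", then "second"
        }
    }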


I think the key difference here is that there is no network in between components in a componentized monolith; each component runs the entire “monorepo”.


Whether there’s actually network between components is something a platform team can handle based on their best judgement. Having collections of containers that always run together is a common pattern.


Certainly, but such a system is not a monolith. A core trait of the monolith is that there are no network calls between components.


This negative-space definition of "monolith" is unhelpful to the point of irrelevance. It's unreasonable, in the sense that adopting it gives us nothing to reason about, as with the comment above. By such a standard the last monolithic in-service system was a Burroughs mainframe ca. 1975. I've got statically linked binaries that would fail this definition.

Even the plainest Rails application depends on network traffic, including to communicate with parts of itself. It cannot function without an operating system, which is also talking to parts of itself via network protocols, and this runs on a server whose internal buses are themselves a distributed system.

It's networks, all the way down, and a heads-in-the-sand attitude doesn't help us reason about performance, reliability, scalability, maintainability et cetera.

Put this in a "Falsehoods programmers believe ...": calling a stateless function in a stack-based language to compute an immutable result won't lead to a network call.

Monolithic applications are defined by something they are, not something they don't do, and what they are is a single unit of code for development and deployment purposes that includes everything necessary to fulfil an entire system's purpose. The issue of intentionally crossing a network boundary, and when, and why, is a separate topic in comparative systems architecture; it's analytically orthogonal.


Is there really that big of an advantage to avoiding the network boundary though?


Absolutely:

* Avoid network and JSON serialization overhead

* Perform larger refactorings or renamings without considering deployment staggering or API versioning

* Testing locally is far easier

* Debugging in production is far easier

* Useful error stack traces are included for free

* Avoid a dependency on SecOps to make network changes to support a refactoring or to introduce new components (probable in my experience, at least in larger security-software organizations)

If an organization is pursuing or will pursue FedRAMP certification, as I understand it, that organization must propose and receive approval every time data may hit a network. Avoiding the network in that case may be the difference between a 50-line MR that's merged before lunch and a multi-week process involving multiple people.


FWIW, I think that gRPC/protobufs have pretty compelling answers to each of the historically-valid complaints you've listed here.

- CPU cycle overhead: this is valid if the overhead is very high or very important. Otherwise, most companies would love to trade off CPU cycles for dev productivity.

- Refactorings/renamings without deployment staggering: protobufs were specifically designed with this in mind, insofar as they support deprecating fields and the like. However, writing a deprecatable API is a skill, even with protos. If you have many clients and want to redo everything from scratch, you will have problems.

- "Testing locally" (which I take to mean integration testing locally) is the only one that requires some imagination to solve, assuming all your traffic is guarded by short-term-lease certs issued by Vault or something similar. But even this is quite achievable.

- Error stack traces included for free: may I introduce you to context.abort(). It's not a stack trace by default, but you can wrap the stack trace into the message if you care to (see the sketch after this list). OpenTracing isn't quite free in a performance sense, but in a required-eng-time-to-set-up-and-maintain sense it is pretty cheap.

- Dependency on SecOps to make network changes: I've never encountered this, but I bet a good platform team can provide a system where application teams effectively don't need to worry about it. It's impossible to overcome this challenge in an existing company that's used to doing things this way, though.
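
On the stack-traces-in-errors point, a rough Go-flavored sketch (the thread's context.abort() is the Python gRPC API; status.Errorf and codes below are grpc-go's real equivalents, while doWork and the handler shape are made up for illustration):

    // Hypothetical sketch: surface internal error detail through a gRPC
    // status message instead of relying on client-side stack traces.
    package main

    import (
        "context"
        "fmt"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func doWork() error { return fmt.Errorf("disk full") }

    func Handler(ctx context.Context) error {
        if err := doWork(); err != nil {
            // %+v would include a stack trace if err came from a
            // stack-capturing library such as github.com/pkg/errors.
            return status.Errorf(codes.Internal, "work failed: %+v", err)
        }
        return nil
    }

    func main() {
        fmt.Println(Handler(context.Background()))
    }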


> cpu cycle overhead

The original poster's point was CPU and network overhead. A local procedure/function call or message send takes on the order of one to a few nanoseconds. Depending on how you organize things, an IPC is going to be in the microsecond or even millisecond range. That's a lot of orders of magnitude, and it's latency that you just aren't going to get back, no matter what extra resources you throw at it. [1][2]

In the early noughties, a rewrite of a very SOA/microservice-y BBC backend system, which I re-architected as a monolith, became around 1000x faster. [3]

In addition, in-process calls are essentially 100% reliable. Network calls, and the various processes attached to them, not so much (see [1] again). The BBC system didn't just become a lot faster; it also became roughly 100 times more reliable, and that's probably lowballing it a bit. It essentially didn't fail for internal reasons after we learned about Java VM parameters. And it was less code, did more, and was easier to develop for.

[1] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115...

[3] https://link.springer.com/chapter/10.1007%2F978-1-4614-9299-...
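
To make the orders-of-magnitude point concrete, a small self-contained Go sketch (hypothetical, unrelated to the BBC system) that times an in-process call against a loopback TCP round trip; on typical hardware the former lands around a nanosecond and the latter in the tens of microseconds, before adding serialization, TLS, proxies, or actual network distance:

    // Hypothetical sketch: in-process call vs. loopback round trip.
    // Error handling is elided for brevity.
    package main

    import (
        "bufio"
        "fmt"
        "net"
        "time"
    )

    func add(a, b int) int { return a + b }

    func main() {
        // In-process call: on the order of nanoseconds.
        const n = 1000000
        start := time.Now()
        sum := 0
        for i := 0; i < n; i++ {
            sum = add(sum, i)
        }
        fmt.Printf("local call: %v/op (sum=%d)\n", time.Since(start)/n, sum)

        // Loopback TCP echo: microseconds per round trip, and loopback
        // is the best case a networked call will ever see.
        ln, _ := net.Listen("tcp", "127.0.0.1:0")
        go func() {
            c, _ := ln.Accept()
            r := bufio.NewReader(c)
            for {
                line, err := r.ReadBytes('\n')
                if err != nil {
                    return
                }
                c.Write(line)
            }
        }()
        conn, _ := net.Dial("tcp", ln.Addr().String())
        r := bufio.NewReader(conn)
        start = time.Now()
        for i := 0; i < 1000; i++ {
            conn.Write([]byte("ping\n"))
            r.ReadBytes('\n')
        }
        fmt.Printf("loopback round trip: %v/op\n", time.Since(start)/1000)
    }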


Ah, gotcha. Thank you for locking in on the issue. You're absolutely right that network hops introduce overhead (I was intending to wrap I/O blocking on network calls under the banner of CPU cycles, adjacent to serialization).

Like any other design decision, there's a trade-off here. (see my other comments in this tree, about how many 9's in reliability/latency you're targeting).

If you're working in an environment where sub-5ms latency to the 4th or 5th 9 is critical, inter-machine communication is not for your application, period.

Reliability, as an orthogonal concern, is one that has improved incredibly since the early aughts. The "transport" and error-handling layer of open-source RPC frameworks has improved by orders of magnitude. I'd recommend taking a long look at the experiences of companies built on gRPC.

It's much easier to build a reliable SOA-esque system today than it was even 5 years ago. It's been an area of rapid progress.


Yes, obviously these are trade-offs.

However, I find the way you framed these trade-offs decidedly...odd, in terms of "who needs that kind of super-high performance and reliability????", as if achieving that were only possible through herculean effort that just isn't worth it for most applications.

The fact of the matter is that a local message send is also a helluva lot easier than any kind of IPC. It's also easier to deploy, as it comes in the same binary and so is already there, and easier to monitor (no monitoring needed).

So the trade-off is more appropriately framed as follows: why on earth would you want to spend significant extra effort in coding, deployment and monitoring, for the dubious "benefit" of frittering away 3-6 orders of magnitude of performance and perfect reliability?

Of course there can be benefits that outweigh all these negatives in effort and performance/reliability, but those benefits have to be pretty amazing to be worth it.


> as if achieving that were only possible through herculean effort

I encourage you to reread my comments; I'm not suggesting anywhere that high performance requires exceptional effort.

In fact, I'm actively admitting that for applications where high-performance is required, IPCs/RPCs are not an option.

> just isn't worth it for most applications

Performance is valuable, but it's one dimension of value.

My premise is that, given the maturity of RPC frameworks and network tooling in 2020, most already-networked applications can afford to trade the performance hit of additional hops on the backend.

Whether what you get in exchange for that performance hit is valuable is mostly a function of the quality of your eng platform.

> a local message-send is also a helluva lot easier [on the programmer?] than any kind of IPC

This strongly depends on your engineering org, although it seems like this is the point that's hardest to imagine for some people.

If you're on a team that depends on the availability of data maintained by N other teams (given the maturity of RPC frameworks and network tooling in 2020, again), it is much easier to apply SLOs and SLAs to an interface that's gated by an RPC service.

> spend significant extra effort in coding, deployment and monitoring

The extra effort here is made completely negligible by the existence of a decent platform team.

FWIW, I wouldn't be able to imagine it if I hadn't experienced it myself.

> benefits have to be pretty amazing to be worth it

I still think you're overestimating some of the costs (see above). FWIW, I've worked in an RPC-oriented environment for years now, and reliability has never been a concern. Our platform team is pretty good, but we are not a Google-esque company (200 engineers, including eng managers).

The performance trade-off has been demonstrably worthwhile, because we've used it to purchase a degree of team independence that would not have been otherwise possible.


>In fact, I'm actively admitting that for applications where high-performance is required, IPCs/RPCs are not an option.

But you're framing it as "...for applications where high-performance is required", as if taking the performance, expressiveness and reliability hits should obviously be the default, unless you have very special circumstances.

My point is, and continues to be, that it's the other way around: you should go for simplicity, reliability and performance unless you have, and can demonstrate you have, very special requirements.


Thrift or protobuf is a huge step up from the alternatives, but you still have a lot of overhead. Generics are limited and you're essentially forced to "defunctionalise the continuation" everywhere: any time you want to pass a callback around you have to turn it into a command object instead.


I don't disagree with you; this actually sounds like the beginning of a super interesting conversation.

Can you share some examples of the generics problem and "defunctionalizing the continuation"?

Does google's `any` package help with the generics problem you describe? (Acknowledging that it's obviously clunky)


> Can you share some examples of the generics problem and "defunctionalizing the continuation"?

Well, the generics problem is that you don't have generics. So you just can't define a lot of general-purpose functions in gRPC, and you have to make specific versions of them instead. Even something like "query for objects like this and then apply this transform to the results" just can't be done, because there's no way to pass the transformation over the wire, so you have to come up with a data structure that represents all the transformations you want to do instead. "Defunctionalizing the continuation" is the technique for doing that (https://www.cis.upenn.edu/~plclub/blog/2020-05-15-Defunction... is an example), but it's a manual process that requires creative effort each time.
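
A minimal hypothetical sketch of the pattern in Go (the Transform type and its operations are invented for illustration): since a closure can't cross the wire, the supported transforms are reified as data, the "command object", and interpreted on the other side.

    // Hypothetical sketch: instead of passing func(string) string to the
    // server, enumerate the transforms you support as a data type.
    package main

    import (
        "fmt"
        "strings"
    )

    // Transform is what would go in the RPC request body. Every new kind
    // of transformation means extending this type and redeploying, where
    // a local caller would simply pass a function.
    type Transform struct {
        Op  string // "uppercase", "trim_prefix", ...
        Arg string
    }

    func apply(t Transform, s string) (string, error) {
        switch t.Op {
        case "uppercase":
            return strings.ToUpper(s), nil
        case "trim_prefix":
            return strings.TrimPrefix(s, t.Arg), nil
        default:
            return "", fmt.Errorf("unknown transform %q", t.Op)
        }
    }

    func main() {
        out, _ := apply(Transform{Op: "trim_prefix", Arg: "user:"}, "user:jurre")
        fmt.Println(out) // "jurre"
    }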

> Does google's `any` package help with the generics problem you describe? (Acknowledging that it's obviously clunky)

Not really, because you don't have the type information at compile time. Erased generics are fine in a well-typed language, but with just an any type you can't even express something like a function that takes two values of the same type.
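
A small Go sketch of that last point (names are hypothetical; the generic version needs Go 1.18+, which postdates this thread, but it shows exactly what an any type can't express):

    // Hypothetical sketch: interface{} (like proto's Any) cannot state
    // "these two values have the same type"; a generic signature can.
    package main

    import "fmt"

    // Nothing stops callers from mixing types here.
    func looseEqual(a, b interface{}) bool { return a == b }

    // "Same type" is enforced by the compiler here.
    func strictEqual[T comparable](a, b T) bool { return a == b }

    func main() {
        fmt.Println(looseEqual(1, "1")) // compiles, always false at runtime
        fmt.Println(strictEqual(1, 2))  // false, but type-checked
        // strictEqual(1, "1")          // would not compile
    }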


People who are downvoting the parent comment: I’d love to know why? I won’t claim expertise here, but it doesn’t strike me as clearly incorrect.


How are you getting around API versioning with independently deployable components?


If you call a piece of functionality from within your own single deployable that you are refactoring, it's much more like refactoring a function call than if it were an independent microservice across a network.


What is a network?


Any application boundary that requires that you serialize your calls/requests to the other service/component in some form.

Any form of IPC basically.


I think there used to be, before "off-the-shelf" RPC frameworks, service discovery, and the like were mature. There still are, for very small companies.

In 2020, if you have an eng count of >50: you use gRPC, some sort of service discovery solution (Consul, Envoy, whatever), and you basically never have to think about the costs of network hops. OpenTracing is also pretty mature these days, although in my experience it's never been necessary when I can read the source of the services my code depends on.

Network boundaries are really useful for enforcing interface boundaries, because we should trust N>50 programmers to correctly implement bounded contexts about as much as we trust PG&E to maintain the power grid.

That being said, if you have a small, crack team, bounded contexts will take you all the way there and you don't need network boundaries to enforce them.


It depends on your speed requirements and whether calls are being sent async or not. Also keep in mind that even with internal APIs, an API call usually crosses multiple network boundaries (service1 --> service2 (potential DNS lookup) --> WAF/security proxy --> firewall --> load balancer --> SSL handshake --> server/container firewall --> server/container). Then you get into whether the service you're calling calls other APIs, etc. You can quickly burn 50ms or more with multiple hops. If you're trying to return responses within 200ms, you now have very little margin.


Acknowledging that there are indeed many hops, I think it might be a bit disingenuous to say 50ms is easy to burn, depending on which percentile we're talking about.

IIRC, a typical round-trip service call at my current place of work (gRPC with protobufs, Vault/SSL for verification, Consul for DNS, etc.) carries a p99 minimum latency (i.e., returning a constant) of around 2ms.

A cold round trip obviously takes longer (because of DNS, SSL, etc.).

It depends on how many 9's you want within 10ms, but there are various simple tricks (transparent to the application developer) that a platform team can apply to get you there.

As a sidenote on calling other APIs, my anecdata suggests that most companies' microservice call graphs are at most 3-4 services deep, with the vast majority being 1-2 services deep.

This doesn't show the call graph, but it does demonstrate how many companies end up building a handful of services that gatekeep the core data models, and the rest simply compose over those services: https://twitter.com/adrianco/status/441883572618948608/photo...


It depends, but it means you don't have to serialize/deserialize data, or deal with slow connections, retries, network failures, circuit breakers, etc.


Isn't the fact that you've seen a bunch of people already hand-roll this a good indication that there's a need for something like this, so no-one else needs to hand-roll it anymore?

It doesn't need to be novel right? It just needs to be helpful.


I tried to address that gently in my second paragraph. Every one of those was abandoned. The most complete one's developer said "what the hell was I doing?" once he finally gave Angular 1 a shot.


Plenty of people have hand rolled crypto too. "Lots of people do it" doesn't always point to an actual gap in the available tools.


> it would be nice to override

One of the big advantages I have experienced with the Golang formatter is that you cannot configure it, period. It took a little getting over myself, but eventually the fact that every single Golang codebase adheres to the same rules is so nice. Plus, you don't have to think about it, ever.

In this case I think it would be best if Ecto just recommends the new default style in their docs.


That's fair - thanks for the comment. It makes sense: then there are no opinions or variation, so "this is the way" and all code looks like that.


> Kafka is terribly complex to run

I read this quite often, but we run a relatively small Kafka cluster on GCP and it's pretty much hassle-free. We also run some ad-hoc clusters in Kubernetes from time to time, which also works well.

What exactly have you found complex about running Kafka?


>What exactly have you found complex about running Kafka?

I run a small 2-node Kafka cluster that processes up to 10 million messages/hr (not large at all), and it's been very stable for almost a year now. However, what was complex was:

* Setup. We wanted it managed by Mesos/Marathon, and having to figure out BROKER_IDs took a couple of hours of trial and error.

* Operations. Adding queues and checking on consumers isn't amazing to do from the command line.

* Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into Kafka's own consumer paradigm. Even still, there are more metrics I would like to have about our cluster that I don't care to put the time in to retrieve.


Another thing I found "complex" was the Java/Scala knowledge requirement. I wanted Kafka-like functionality for a Node.js project, but my limited Java and Scala knowledge made me concerned about my ability to deal with any problems I might run into.

In other words, I could probably get everything up and running (especially with the various Kafka-in-Docker projects I found), but what happens if (when) something goes wrong?


What do you mean by "Java/Scala knowledge requirement"? I don't know much C/C++, but I use Postgres just fine. There is a bunch of stuff in the software ecosystem in a bunch of languages; if I had to know it all, I wouldn't progress much.


I've never had to dive into any Java or Scala to maintain our Kafka cluster.


> Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into kafka's own consumer paradigm.

Would you be willing to write a bit (or point to a post with) more about this? What do you find useful?


Like I mentioned, our Kafka setup is relatively small. We moved from RabbitMQ to Kafka because of the sheer size (as in byte size) of the messages we needed to process (~10 million/hr), where each message could be 512-1024 KB, which caused RabbitMQ to blow up unpredictably.

Secondly, due to the difference in speed between the consumer and producer, we typically have an offset lag of around 10MM, and it's important for us to monitor this lag, because if it gets too high it means we are falling behind (our consumers scale up and down through the day to mitigate this).
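
(For the curious, a rough sketch of how such a lag check can be done with Sarama; the broker address, topic, and group names are made up, and error handling is minimal:)

    // Hypothetical sketch: total consumer lag = newest broker offset
    // minus the group's committed position, summed over partitions.
    package main

    import (
        "fmt"
        "log"

        "github.com/Shopify/sarama"
    )

    func main() {
        client, err := sarama.NewClient([]string{"kafka-1:9092"}, sarama.NewConfig())
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        om, err := sarama.NewOffsetManagerFromClient("my-consumer-group", client)
        if err != nil {
            log.Fatal(err)
        }
        defer om.Close()

        partitions, err := client.Partitions("my-topic")
        if err != nil {
            log.Fatal(err)
        }
        var total int64
        for _, p := range partitions {
            newest, err := client.GetOffset("my-topic", p, sarama.OffsetNewest)
            if err != nil {
                log.Fatal(err)
            }
            pom, err := om.ManagePartition("my-topic", p)
            if err != nil {
                log.Fatal(err)
            }
            next, _ := pom.NextOffset() // next offset this group would consume
            pom.Close()
            if next >= 0 { // negative means no offset committed yet
                total += newest - next
            }
        }
        fmt.Printf("total lag for my-topic: %d messages\n", total)
    }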

Next, we use Go, which is not an official language supported by the project, but it has a library written by Shopify called Sarama. Sarama's consumer support had been in beta for a while, and in the past had caused some issues where not every partition of a topic was being consumed.

Lastly, at the time we thought creating new topics would be a semi-regular event, and that we might have dozens of them (this didn't pan out), but having a simple overview of the health of all of our topics and consumers was thought to be good too.

We found Yahoo's Kafka Manager[1], which has ended up being really useful for us in standing up and managing the cluster without resorting to the command line. It's been great, but it wasn't exactly super obvious for me to find at the time.

Currently the only metrics I don't have are plottable things like processed/incoming msgs/sec (per topic), lag over time, and disk usage. I'm sure these are now easily ingested into Grafana; I just haven't had the time to do it.

All of this information is great to have, but requires some setup, tuning, and elbow grease that is probably batteries included in a managed service. At the same time however, this is something you get almost out of the box with RabbitMQ's management plugin.

[1] https://github.com/yahoo/kafka-manager


Yes, I do agree with these (except Mesos is not a requirement for us). Is any of this significantly better with hosted Kafka or Kinesis, though? I have no experience with either.


Yes, not having to worry about any of that is the primary reason for managed services.


What is the point of a 2-node cluster?


Topic sharding. The messages were pretty large, and at the time we set this up we were on a DO-like platform where the only way to get more disk space was to buy a larger instance. We didn't need the extra CPU power but needed extra disk space, and it was cheaper to opt for two nodes instead of upgrading to n+2.


Running Kafka is just fine; the issues arise when a node fails or when you need to add data and re-partition a topic. However, it is not that hard once you know what to do. Kinesis is simpler, but it is expensive as shit.


At small scale Kinesis is far less expensive. There is definitely a point where Kinesis becomes more expensive, especially if you consider the operational and human costs involved.


I get shit for free daily!


They don't call them unit tests though, do they?


It is! It's part of a tutorial.

