They did (in a typical non-fault-admitting way). They didn't escalate to the Lambda engineering team; they just said this is a code issue and that they should move to EC2 or Fargate, which is the polite way of saying "you can't do that on Lambda. It's your issue. No refunds. Try Fargate."
OP seems to be fixated on wanting microVM logs from AWS to help them correlate their "crash", but there are likely no logs support can share with them. The microVM is suspended in a way you can't really replicate or test locally. "Just don't assume you can do background processing." Also, to be clear, AWS used to let the microVM run for a bit after the response completed, so anything small like that would usually get done.
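For anyone who hasn't hit this: the failure mode being discussed is roughly the sketch below (hypothetical handler and helper names, not OP's code). Anything the handler kicks off but doesn't await before returning is at the mercy of the freeze.

```typescript
// Minimal sketch of the pattern that bites people: work started but not awaited
// before the handler returns. Once the response is returned, Lambda freezes the
// execution environment, so the pending promise may stall indefinitely and only
// resume (if ever) on a later invocation of the same instance.
export const handler = async (event: { id: string }) => {
  void sendAnalytics(event.id); // fire-and-forget "background" work, NOT awaited

  return { statusCode: 200, body: "ok" }; // execution is effectively frozen after this returns
};

// Stand-in for an HTTP call, queue write, or log flush.
async function sendAnalytics(id: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 200));
  console.log(`analytics sent for ${id}`); // may only appear during a later, unrelated invocation
}
```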
It's a nondeterministic part of the platform. You usually don't run into it until some library starts misbehaving for one reason or another. To be clear, it does break applications and it's a common support topic. The main support response is "move to EC2 or Fargate" if it doesn't work for you. Trying to debug and diagnose the Lambda code with the customer is out of support's scope.
I know for a fact that’s not true. You must have misunderstood the issue.
This is one of the main technical differences between Azure Functions and Google Cloud Run vs Lambda. Azure and GCP offer serverless as a billing model rather than an execution model precisely to avoid this issue. (Among many other benefits*)
In both Azure and GCP you listen for and handle SIGTERM (on Linux at least; Azure has a Windows offering where you use a different Windows mechanism that I'm forgetting), and you can control and handle shutdown. There is no "suspend". "Suspending Node.js" is not a thing. This is a very AWS Lambda-specific behavior that is not replicable outside AWS (not easily, at least).
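A rough sketch of what that looks like on Cloud Run or any other SIGTERM-based platform; the port handling and shutdown steps are assumptions for illustration, not anything from the thread:

```typescript
// Graceful shutdown on a SIGTERM-based platform (e.g. Cloud Run): the platform tells
// you it's about to stop the instance, and you decide how to wind down. Nothing is
// "suspended" behind your back.
import * as http from "http";

const port = Number(process.env.PORT) || 8080; // Cloud Run injects PORT

const server = http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("ok");
});

server.listen(port);

process.on("SIGTERM", () => {
  // Finish in-flight requests, flush buffers, close DB pools, etc., then exit.
  server.close(() => process.exit(0));
});
```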
The main thing I do is review cloud issues mid-size companies run into. Most of them were startups that transitioned into a mid-size company and now need to get off the "spend, spend, spend" mindset and rein in cloud costs and services. The first thing we almost always have to untangle is their Lambdas/SF, and it's always the worst thing to untangle or split apart, because you will forever find your code behaving differently outside of AWS.

Maybe it's different if you have the most excellent engineers working with the most excellent processes and the most excellent code. But in reality, all Lambda code takes a complete dependency on the facts that Lambda will kill your process after a timeout, that Lambda will run only 1 request through your process at any given time, and that Lambda will "suspend" and "resume" your process. 99% of the Lambdas I've helped move out of AWS had a year+ trail of bugs where those services couldn't run for any length of time before corrupting their state. They rely on the process restarting every few requests to clear up a ton of security-impacting cross-contamination.
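A hedged sketch of the cross-contamination pattern I mean (hypothetical handler, not any client's code): module-level mutable state that is only "safe" because Lambda runs one request per process at a time.

```typescript
// Per-request data stashed in module scope. On Lambda this "works" because the
// process never handles two requests concurrently; on a normal concurrent server,
// two in-flight requests interleave and read each other's data.
let currentUser: string | undefined;

export async function handler(event: { user: string }): Promise<string> {
  currentUser = event.user; // fine on Lambda: no concurrent invocation shares this process
  await doSomeAsyncWork();  // elsewhere, another request can overwrite currentUser right here
  return `processed for ${currentUser}`; // may now name the wrong user outside Lambda
}

// Stand-in for a DB or API call.
async function doSomeAsyncWork(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 50));
}
```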
* I might be biased, but I much, much prefer GCR or AZF to Lambda for running a service. Lambda shines if you resell it. Reselling GCR or AZF as a feature in your application is not straightforward; reselling a Lambda in your application is very, very easy.
It may just be such a basic tenet of the platform that no one thought to. You stop getting billed after the Lambda returns a response, so why would you expect computation to continue? This guy expected a free lunch.
I don't see that on the main AWS Lambda pages. It just says that you pay for what you use. It would make sense that the time billed would be until there is no more code to execute.
So many times I've caught myself thinking "I don't want to understand this shit, I just wanna fix it," as I've grudgingly opened up whatever docs I've avoided reading; almost as many times as I've wondered "what fucking moron wrote this code" before immediately `git blame`ing myself.
Fair enough. I guess this just seems like a bold assumption to make, since an explicit handler function is a cornerstone of Lambda, as opposed to running module-level code and having the end of execution detected automatically.
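For anyone unfamiliar with the model, the contrast is roughly this (a minimal sketch, nothing from the original post):

```typescript
// Module-level code runs once per cold start, during initialization - not per request -
// and nothing "detects the end" of work you start outside the handler's await chain.
console.log("init: runs once per cold start");

// The exported handler is the explicit unit of work; whatever must happen for a
// request has to complete (or be awaited) before this function returns.
export const handler = async (_event: unknown) => {
  return { statusCode: 200, body: "done" };
};
```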
Based on the way the document is written, it seems very likely that several people did realize exactly what the misunderstanding was and try to explain it to them.
No AWS employee knows "AWS well". AWS has a huge surface area. If you work on one of the service teams - i.e. the teams that maintain the different AWS services - you are very unlikely to know the entire AWS surface area; you'll be focused on your team and surrounding services.
It used to be the case that if you were interviewing for an SDE position, you were specifically told not to mention specific AWS services in the system design rounds and to speak of generic technologies instead.
I remember interviewing a guy who worked at Amazon - he commented that he started trusting AWS much less after getting to know some of the developers who worked on AWS.
Yep, that. We are a fairly high-spend customer, which gives us direct access to engineering managers there. Some of the non-core products are run by what looks like two guys in a trailer, protected by a layer of NDAs and finger crossing. We've had some pretty crazy problems even with their core stuff over the years, which absolutely knocked my confidence.
At the same time, I've had two separate Apple SREs in the last 5 years tell me that I should never trust their cloud services.
If you can't see it and can't control it yourself, then you accept these things silently.
My distrust of this goes so far that the only things actually subcontracted out for me are Fastmail, DNS, and domains, and the mail is backed up hourly.