AWS Has a Lambda Problem

I used AWS in the early days, before cloud computing was a buzzword, let alone a reality. The company where I was employed was developing a new website for selling products, and it had a massive number of SKUs and tons of high-resolution product images. The old website was hosted at a colocation facility adjacent to the local telco hotel, a common arrangement in those days. We had recently revamped our WAN and had plenty of bandwidth, but we had not planned anything with respect to the new website or bringing it in-house. Someone (I assume the fly-by-night CEO who had been hired, a man with a history of job hopping without ever staying long enough to see the results of his strategies) made decisions without consulting the people who would know whether they were feasible.

It was a huge mess with many issues. We hired a development house the CEO was familiar with to build the site, but someone else had decided which language we would use, and the developers were experts in a different one. They were simply instructed to learn the language we wanted… what could go wrong? Many things. The session data wasn’t being stored on the server-side… it was all being passed in every request. To make it clear… I mean ALL of the session data… not just the ID. This data was approaching somewhere around 35MB… for every request. Having been long out of my web development days, I didn’t expect to be troubleshooting their code, but it happened. Because they lacked experience in the language, they didn’t realize they only needed a simple configuration change to store the session data in memory on the web server and pass just the ID back and forth. And the images were another problem entirely.
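
The fix amounts to the standard server-side session pattern: keep the state on the server and hand the browser only an opaque ID. Here is a minimal, framework-agnostic sketch of the idea (not the actual configuration change we made back then, just an illustration of the pattern):

```python
import secrets

# In-memory session store: the browser only ever carries the key,
# never the (potentially huge) session payload itself.
_sessions = {}


def create_session(data):
    """Store session data server-side and return the opaque ID for the cookie."""
    session_id = secrets.token_hex(16)
    _sessions[session_id] = data
    return session_id


def load_session(session_id):
    """Look up the server-side state from the ID the client sent back."""
    return _sessions.get(session_id, {})
```

The cookie shrinks from tens of megabytes to a 32-character token; most web frameworks expose this behavior as a one-line session-backend setting.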

Since the images were static content, we evaluated using AWS S3 as a cheap CDN. It worked fantastically. The dynamic content for the site still lived on our servers, but the static content lived in the cloud and consumed someone else’s bandwidth (unless you were sitting on our network, in which case you didn’t query S3 for the content at all). These things are relatively simple. It also highlights that more complex CDN solutions are often overkill unless you’re a content company on the scale of something like Netflix.
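
Serving images that way needs little more than public objects with sensible cache headers. A minimal boto3-style sketch (the bucket name is a placeholder, and the client is injectable here purely for illustration):

```python
def publish_static_asset(key, body, content_type, s3=None,
                         bucket="example-product-images"):
    """Upload one static asset so it can be served straight from S3.

    The long Cache-Control max-age is what makes S3 behave like a
    cheap CDN for rarely-changing product images.
    """
    if s3 is None:
        import boto3  # deferred so the sketch can be exercised without AWS
        s3 = boto3.client("s3")

    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,
        CacheControl="public, max-age=86400",
    )
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```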

The simple solutions are straightforward and well understood by many. Once you move beyond them, problems start to emerge.

Context

I have been focused more on AWS recently, but I have years of focus in Azure behind me. Sure, there can be some bias, but I will attempt to be objective and constructive. I do think this is more an issue of “you don’t know what you don’t know”: getting an outsider to take a look exposes issues that people have simply been conditioned to accept as “the way it is.”

Also, Azure has plenty of issues that could stand to be improved, so this isn’t a “dump AWS and go to Azure” sort of discussion… more of a plea to fix some fundamental issues.

Azure went live in February 2010 as Windows Azure, having begun its development cycle in 2008 under the codename Red Dog. The story about the website predates all of this; AWS has been around longer, with S3 and EC2 launching in 2006. I wasn’t working with Azure in those days, so I don’t recall whether Azure Service Management (ASM) was the platform’s first API, but it quickly became problematic. Microsoft moved to a new API called Azure Resource Manager (ARM) in 2014 and rebranded the platform to Microsoft Azure. This brought many changes: not only a new API, but a significant change to the way Microsoft architected datacenters, with many lessons learned from AWS. Only next month, September 2023, will the last vestiges of ASM finally be removed; some resources have lived on as “Classic” resources this entire time. But Microsoft recognized that a significant change was required, and they were willing to tear it down and build it back up. For something done at such a scale, it worked impressively well.

AWS really needs to do this. Nothing makes it clearer than all of the reference architectures that depend on Lambda. Having the capability to use Lambda to extend automation and connect resources is necessary; that is not the issue. However, the architecture should not ultimately depend on Lambda as often as it does (at least from the customer’s point of view). When you pick a solution that is supposedly “managed,” it is an immediate letdown to approach deploying it with Infrastructure as Code and realize that the Management Console is the principal means by which it is a “managed” solution. When you start reading CloudFormation documentation and writing Terraform code, you see that you are deploying each of the constituent parts yourself, not using a managed solution.

It seems like lately, so many things that should be simple are nearly impossible, or outright impossible, because of API limitations, and they all need Lambda.

Issues

Here is a quick list of things in the past few weeks:

  1. Want to trigger an AWS CodePipeline via a webhook? CodePipeline has webhooks, but the means to control them are so limited that reference architectures first call a Lambda via a webhook so that the Lambda can trigger the pipeline! Or, if you use AWS CodeCommit as your git platform, you monitor for events (commits, merges, etc.) with AWS CloudWatch to trigger a Lambda to trigger the pipeline! Sure, if you need to do something that isn’t handled by routine means, using Lambda is great. But the events in these architectures are routine things that shouldn’t require a customer to plumb them together (maybe AWS uses Lambda on the backend, but that is for them to support).
  2. Want to assign a group to an AWS SageMaker Domain? You can do that in the Management Console. You cannot do it through any other means! No API call, no CloudFormation, no Terraform! So, instead of exposing group assignment via the API so that an extremely reasonable thing can be done… set up AWS CloudTrail to monitor group membership changes, then use EventBridge to react to the change, then trigger a Lambda to enumerate the group members and add a user profile for each user!
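
To make the first workaround concrete, here is roughly what the glue looks like: a minimal Lambda handler that does nothing except start the pipeline (the pipeline name is a placeholder, and the client is injectable here purely so the sketch can be exercised without AWS):

```python
import json


def lambda_handler(event, context, codepipeline=None):
    """Invoked by an API Gateway webhook; starts a CodePipeline run.

    This is the entire job of the function: the webhook cannot reach
    the pipeline directly, so a Lambda has to sit in between.
    The pipeline name below is a placeholder.
    """
    if codepipeline is None:
        import boto3  # deferred so the sketch can be exercised without AWS
        codepipeline = boto3.client("codepipeline")

    response = codepipeline.start_pipeline_execution(name="my-pipeline")
    return {
        "statusCode": 200,
        "body": json.dumps({"executionId": response["pipelineExecutionId"]}),
    }
```

That a whole function, role, and trigger must exist just to forward one event is exactly the problem.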

This is not “well-architected,” pun intended. This is popsicle sticks and hot glue. Or duct tape and baling wire. Or paint and putty. Expose the assignment of groups in the API; it is a simple thing. Clearly the actual API call exists, because that must be how the Management Console assigns the group… just give it to customers.

A Glimmer of Hope, But Maybe Later

A year ago, AWS announced the Cloud Control API. It seemed like it might be a new API to replace the old ones. It got some fresh attention at HashiTalks Build this week, where AWS discussed the AWS Cloud Control Provider for Terraform. I thought maybe this would be a path forward. But diving into the documentation and using the AWS CLI to see how it works reveals that it is essentially a standardized way to access the existing APIs; nothing new is exposed. In fact, getting deep into the documentation takes you straight over to the CloudFormation documentation. It was rather disappointing.
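
For reference, the uniform interface looks roughly like this sketch: every resource is addressed by a CloudFormation-style type name plus an identifier (`AWS::S3::Bucket` is just an example; the client is injectable here purely for illustration):

```python
import json


def get_resource_properties(type_name, identifier, client=None):
    """Fetch a resource through the Cloud Control API's uniform interface.

    Every resource type is addressed the same way: a CloudFormation-style
    TypeName plus an identifier. The Properties document comes back as a
    JSON string, which this helper parses into a dict.
    """
    if client is None:
        import boto3  # deferred so the sketch can run without AWS credentials
        client = boto3.client("cloudcontrol")

    response = client.get_resource(TypeName=type_name, Identifier=identifier)
    return json.loads(response["ResourceDescription"]["Properties"])
```

The `TypeName` is the same CloudFormation type registry, which is why any gap in CloudFormation coverage (like SageMaker group assignment) is a gap here too.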

Why? Seriously, Why?

Experiencing these issues leaves me wondering why they exist. Surely other people find this approach reminiscent of a Rube Goldberg contraption? I started searching to see if discussions existed. They do, but they are few and far between. This creates a negative feedback loop, because the lack of discussion signals to AWS that everything is fine. And since they think it is fine, they don’t work to improve it. Constructive criticism is how innovation happens for an existing player. Without it, the problems get fixed by the existing players being disrupted by new ones. So, grow or die.