Scraping without JavaScript using Chromium on AWS Lambda: The Novel
UPDATE: 2022-01-17 16:33 CST
Forget the below. Just do this instead!
UPDATE: 2022-01-15 16:43 CST
It appears that Docker, as configured within the runners provided by GitHub Actions, does not natively support building ARM images. However, you can use qemu-user-static to emulate instructions for other CPU architectures to get around this. This image uses binfmt_misc to tell the host’s Linux kernel to hand binaries in formats it doesn’t recognize to a third-party application (in this case, qemu) for execution. In our case, we are telling the x86_64 GitHub Actions hosts to send executables built for arm64 or aarch64 to qemu, which runs them in an emulated environment. You can see this behavior happen here. It is definitely slower, but it is fairly reliable!
To enable this functionality, do the following:
1. Add binfmt_misc and qemu-user-static to your Docker image. With an Ubuntu or Debian base image, you’d add this to your Dockerfile:

   ```dockerfile
   RUN apt -y install qemu binfmt-support qemu-user-static
   ```

2. Before you build your arm64 Docker image, add this command to your deploy script to enable this translation:

   ```sh
   docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
   ```
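In a GitHub Actions workflow, the second step boils down to something like this before your build/deploy step (a sketch; the step name is arbitrary):

```yaml
# Register qemu handlers so the x86_64 runner can execute arm64 binaries.
- name: Enable qemu emulation for arm64 builds
  run: docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```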
TL;DR
At >1,700 words, this is a long post. Here’s a summary if you’re short on time!
- Chrome is too big to fit into a Lambda layer unless it’s compressed with Brotli. You’ll need to write your own decompression logic if you’re not using Node.
- If you don’t want to do that, you’ll need to create your own custom Docker image.
- The AWS-provided base images do not play nicely with Chromium on M1 MacBooks. You’ll need to use your own base image and ensure that it handles loading the AWS Lambda Runtime Interface Client correctly; see the link above for more.
- Ensure that Chromium is started with these flags: --single-process, --disable-gpu, --disable-dev-shm-usage, --no-sandbox
Setting the Scene
Say that you’re a developer in 2022 who is looking to scrape your favorite website for interesting data. Since you weren’t doing anything terribly heavy, you used PhantomJS to take advantage of a truly-headless browser that was ultra lightweight and ran on WebKit (i.e. most sites would work with it like they would with a normal browser).
Let’s, further, assume that PhantomJS stopped working with your favorite website sometime in 2020 because the web moves on while PhantomJS did not.
Next, let’s assume that you wrote a non-JavaScript app to scrape that website. That app ran on AWS Lambda via API Gateway so that you could use it conveniently by way of your favorite web browser or iOS Shortcut. When PhantomJS stopped working, so, too, did your app and its creature comforts.
Finally, let’s say that you purchased an ultra-fast M1 MacBook Pro and are now doing all of your development against ARM64-compiled binaries and toolchains.
What are your options?
- Use Chromium Headless, or
- Abandon your project.
No, really, those are your options.
You could use Firefox and Marionette/geckodriver. Good luck finding a pre-existing ARM-compiled geckodriver, though! You’ll need even more luck if you run into problems during initialization or rendering since the “open web” has mostly decided to gravitate to Chromium/Google Chrome for everything.
No other headless WebKit browsers exist. So, yeah, Chrome is your only option unless your site is lucky enough to expose its API traffic.
But there’s a problem…
Chrome is big.
Okay, Chrome isn’t really that big. At the time of writing, the latest Chromium build is about 100 MB compressed and about 150-200 MB uncompressed. For most computers, this won’t be a problem.
Unfortunately, Lambda is not “most” computers.
In order to have Lambda serve your code, you need to compress your code into a ZIP file, upload it to AWS S3, then tell Lambda where it is through its function definition. The filesystem onto which your ZIP is decompressed is called a “layer”.
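For reference, that flow looks roughly like this with the AWS CLI (the bucket, key, and function name below are placeholders):

```sh
# Hypothetical names; substitute your own bucket, key, and function.
zip -r function.zip .
aws s3 cp function.zip s3://my-serverless-bucket/function.zip
aws lambda update-function-code \
  --function-name my_function \
  --s3-bucket my-serverless-bucket \
  --s3-key function.zip
```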
Lambda layers can’t be more than 50 MB.
This is, obviously, an issue for what we’re trying to do.
Some projects, such as chrome-aws-lambda, a popular package for NodeJS that simplifies all of this (more on that later), vendor a special build of Chrome with lots of stuff removed, compressed with Google Brotli. This nets you a ~46MB archive that fits nicely into the Lambda runtime with some space left over for other stuff.
However, our app wasn’t written in JavaScript and doesn’t run in Node. Since Lambda doesn’t support Brotli-compressed archives out of the box, you’ll need to write a function that decompresses this archive on function startup and, ideally, caches it somewhere (in S3) for faster retrieval in the future, compromising cold-start times in the process.
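If you did go down that road, the cold-start logic would look roughly like this. This is a sketch only: the bucket, key, and target path are placeholders, and it assumes the aws-sdk-s3 and brotli gems plus a single Brotli-compressed Chromium binary.

```ruby
# Hypothetical cold-start decompression; bucket, key, and paths are placeholders.
require 'aws-sdk-s3'
require 'brotli'

CHROMIUM_PATH = '/tmp/chromium'

def ensure_chromium!
  # /tmp survives across warm invocations, so only decompress on cold starts.
  return CHROMIUM_PATH if File.executable?(CHROMIUM_PATH)

  s3 = Aws::S3::Client.new
  compressed = s3.get_object(bucket: 'my-scraper-assets', key: 'chromium.br').body.read
  File.binwrite(CHROMIUM_PATH, Brotli.inflate(compressed))
  File.chmod(0o755, CHROMIUM_PATH)
  CHROMIUM_PATH
end
```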
This is a huge downer if you’re like our hypothetical developer (definitely 100% NOT ME) who just wants to get their previously-working app working again.
Docker to the rescue!
It’s easy to think that there is no solution and give up at this point. Fortunately, the AWS Lambda engineering team knew that this was a huge restriction for many workflows that were well-suited for the serverless revolution.
In December 2020, AWS announced support for running Docker containers from OCI images hosted in AWS Elastic Container Registry (ECR) inside of Lambda. Moreover, Lambda supports Docker images up to 10 GB, which is fit for just about anything!
The power and ephemerality of Lambda with the flexibility of Docker. Best of both worlds, and, most importantly, a perfect solution for this exact problem!
Let’s do it!
Our imaginary developer loves to use the Serverless Framework for doing anything with AWS Lambda, Azure Functions, or any of the other serverless platforms out there. Let’s use that in our micro-tutorial here.
This demo isn’t guaranteed to work. I wrote this to demonstrate the plight of trying to scrape with headless Chromium without becoming a JavaScript developer.
First, I’m going to create a Dockerfile
that will run my Ruby function.
I’m going to use the images provided by AWS since they are already configured
to wire up to Lambda:
FROM public.ecr.aws/lambda/ruby:2.7
RUN yum -y install amazon-linux-extras
RUN amazon-linux-extras install epel -y
RUN yum -y install chromium chromedriver
COPY . "${LAMBDA_TASK_ROOT}"
RUN bundle install
CMD ["my_app.my_function"]
I’m, then, going to create a Gemfile
that will install Capybara and
Selenium so that I can scrape my web page:
source 'https://rubygems.org'
gem 'capybara'
gem 'selenium-webdriver'
Finally, let’s create our app at my_app.rb
:
# my_app.rb
# frozen_string_literal: true
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'
require 'json'
# yes, you need ALL of these options. None of them are typos.
CHROMIUM_ARGS = %w[headless
enable-features=NetworkService,NetworkServiceInProcess
no-sandbox
disable-dev-shm-usage
disable-gpu]
def my_function(event:, context:)
session = init_capybara
# always prints 'success'
session.visit('http://detectportal.firefox.com')
{
statusCode: 200,
body: { message: session.body }.to_json
}
end
def init_capybara
Capybara.register_driver :headless_chrome do |app|
caps = ::Selenium::WebDriver::Remote::Capabilities.chrome(
"goog:chromeOptions": {
args: CHROMIUM_ARGS
}
)
Capybara::Selenium::Driver.new(app,
browser: :chrome,
capabilities: caps)
end
Capybara.default_driver = :headless_chrome
Capybara.javascript_driver = :headless_chrome
Capybara::Session.new :headless_chrome
end
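If you have Chromium and chromedriver installed locally, you can sanity-check the handler before wiring anything up to Lambda. (The Lambda Ruby runtime invokes handlers with event: and context: keyword arguments, which is why the one-liner below passes them explicitly.)

```sh
# Quick local smoke test of the handler; prints the response hash.
ruby -r ./my_app -e 'puts my_function(event: {}, context: nil)'
```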
Next, I’m going to tell serverless how to deploy this with a
serverless.yml
file:
provider:
  name: aws
  runtime: ruby2.7
  region: us-east-2
  deploymentBucket:
    name: my-serverless-bucket
  deploymentPrefix: serverless
  # This is what tells Serverless about what images to build.
  # It even builds them for you...kind of. More on that later.
  ecr:
    images:
      app:
        path: .
functions:
  my_function:
    image:
      name: app
      command: my_app.my_function
    events:
      - http:
          path: myApp
          method: get
Next, I create my bucket with the AWS CLI
aws s3 mb s3://my-serverless-bucket # Highly unlikely to work, as names are global
And then I’m off to the races!
docker run -v $PWD:/app -w /app carlosnunez/serverless:latest deploy --stage v1
But wait! I can test locally! Because Docker! Or can I?
Well…that depends.
If you’re using an Intel Mac or an Intel machine in general, everything works fine.
However, our hypothetical developer is fancy schmancy and is using an M1 MacBook.
This is where things get a little complicated.
Let’s revisit our Dockerfile:
FROM public.ecr.aws/lambda/ruby:2.7
RUN yum -y install amazon-linux-extras
RUN amazon-linux-extras install epel -y
RUN yum -y install chromium chromedriver # LIES!
COPY . "${LAMBDA_TASK_ROOT}"
RUN bundle install
CMD ["my_app.my_function"]
If you docker build
this Dockerfile right now, you’ll likely get something
like this:
sh-4.2# yum -y install chromium chromedriver
Loaded plugins: ovl
epel/aarch64/metalink | 17 kB 00:00:00
epel | 5.4 kB 00:00:00
(1/3): epel/aarch64/group_gz | 88 kB 00:00:00
(2/3): epel/aarch64/updateinfo | 1.0 MB 00:00:00
(3/3): epel/aarch64/primary_db | 6.6 MB 00:00:01
No package chromium available.
No package chromedriver available.
Error: Nothing to do
sh-4.2#
WTF?
Here’s the deal. The pre-baked images provided by Amazon inherit from Amazon Linux 2, which is derived from Red Hat Enterprise Linux (RHEL) 7.5, released in 2018. If you look at the default repository for CentOS 7 (the FOSS equivalent of RHEL 7.5), you’ll see that it only offers x86_64 packages. Since the Extra Packages for Enterprise Linux (EPEL) repository follows the OS release, it, too, only offers x86_64 binaries. This means that neither repository hosts any arm64-compatible binaries of Chromium or Chromedriver, hence this error.
But RHEL 8/CentOS 8 do have arm64 binaries! What if I just use those?
Then you’ll enter the second trap door: glibc.
Amazon Linux 2 ships with glibc 2.26. If you look at the list of dependencies for Chromium 96 (which is outdated at this time of writing), you’ll see that some of its libraries require glibc 2.27 or higher. You’ll discover as much if you try to install the RPM directly:
Error: Package: chromium-common-96.0.4664.110-2.el8.aarch64 (/chromium-common-96.0.4664.110-2.el8.aarch64)
Requires: libm.so.6(GLIBC_2.27)(64bit)
Error: Package: chromium-common-96.0.4664.110-2.el8.aarch64 (/chromium-common-96.0.4664.110-2.el8.aarch64)
Requires: libz.so.1(ZLIB_1.2.9)(64bit)
Error: Package: chromium-common-96.0.4664.110-2.el8.aarch64 (/chromium-common-96.0.4664.110-2.el8.aarch64)
Requires: libc.so.6(GLIBC_2.28)(64bit)
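If you want to confirm the glibc version that ships in the base image yourself, a quick check like this should do it (note the --entrypoint override, since the Lambda base images default to the runtime client):

```sh
# Print the glibc version bundled with the AWS-provided Ruby base image.
docker run --rm --entrypoint sh public.ecr.aws/lambda/ruby:2.7 -c 'ldd --version'
```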
Since upgrading glibc isn’t easy or recommended, this approach is a non-starter.
Custom Docker images and AWS RIC/RIE to the rescue!
If you’re like our poor developer here, you’re ready to throw in the towel and never browse the web again. Fortunately, they discovered that Lambda also supports running custom Docker images. Our journey isn’t over yet!
AWS open-sourced both the client used by the Lambda runtime (the Runtime Interface Client) and an emulator of the Lambda runtime environment (the Runtime Interface Emulator). This makes it very easy to create Docker containers on your machine that behave as if they’re running in Lambda.
This is massive. Previously, the only way to do this was to use the lambci/lambda Docker image (which is, also, only x86_64-compatible). This image was GIGANTIC and was a best-guess approximation of the Lambda runtime environment.
This is almost the exact same thing. Almost. Unfortunately. More on that later.
What does this look like in practice? All of the languages that Lambda supports have their own runtime interface clients (RICs). Therefore, your Dockerfile will need to download the appropriate RIC and use an entrypoint script to determine whether it needs to run the client inside of an emulated Lambda runtime (if you’re running it on your own machine) or standalone (if you’re running it in Lambda).
The link above gives a good example of how to do this. For Ruby, it would look something like this:
#!/usr/bin/env sh
# entrypoint.sh
if test -z "$AWS_LAMBDA_RUNTIME_API"
then
exec /usr/local/bin/aws_lambda_rie aws_lambda_ric "$@"
else
aws_lambda_ric "$@"
fi
# Dockerfile
FROM ruby:2.7-alpine3.15
ENV AWS_LAMBDA_RIE_URL_ARM64=https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie-arm64
ENV AWS_LAMBDA_RIE_URL_AMD64=https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie
RUN echo "@testing http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories && \
apk update
RUN apk add libffi-dev readline sqlite build-base \
    libc-dev linux-headers libxml2-dev libxslt-dev readline-dev gcc libc-dev \
    freetype fontconfig gcompat chromium@testing chromium-chromedriver@testing
RUN mkdir /app
COPY Gemfile /app
WORKDIR /app
RUN bundle install
RUN gem install aws_lambda_ric
RUN apk add curl
RUN if uname -m | grep -Eiq 'arm|aarch'; \
then curl -Lo /usr/local/bin/aws_lambda_rie "$AWS_LAMBDA_RIE_URL_ARM64"; \
else curl -Lo /usr/local/bin/aws_lambda_rie "$AWS_LAMBDA_RIE_URL_AMD64"; \
fi && chmod +x /usr/local/bin/aws_lambda_rie
RUN mkdir -p /app
COPY . /app
RUN bundle install
COPY include/entrypoint.sh /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
This is awesome because we can now download and install a modern version of Chromium on our own terms and our own operating system!
Testing your function in your local Lambda environment is easy. First, start
the container in the background (with the -d
switch)…
docker build -t local-lambda . &&
docker run -d --rm -it --publish 8080:8080 local-lambda my_app.my_function
…then invoke a new function run:
curl -X POST -d '{}' localhost:8080/2015-03-31/functions/function/invocations
Make sure that you don’t forget -X POST -d '{}'
. Lambda functions are always
provided with a payload (even if the API Gateway endpoint from which they are
called only takes GET
requests). Failing to provide one will crash the
runtime client, which you won’t know happened because the RIE will continue
to send back 200s regardless.
Well, almost to the rescue.
What, you thought we were done?!
Our developer got their custom Docker image written. They’re able to build it, and they’ve confirmed that it can spin up Chromium and talk to Selenium locally. It’s super fast because the M1 is super fast. They deploy it up into Lambda with Serverless. Serverless builds the image, stores it in ECR, and pushes the function into S3 and its definition into Lambda. All is good.
They try to run their function through API Gateway with a shiver of anticipation…and they get this:
'{"message":"Internal server error"}'
YOU’VE GOT TO BE KIDDING ME!
Why does a Docker container that works locally not work in Lambda? The whole point of Docker is to obtain consistent behavior regardless of where the container’s running! This doesn’t make any sense!
Since there isn’t an easy way to SSH into a Lambda instance to debug
Chromium directly, one (slow) way to debug this would be to create a simple
function that simply invokes chromium
with the same flags that Selenium
uses, like this:
# my_app.rb
# Rest of the code
require 'English' # provides $CHILD_STATUS

def test_chromium(event:, context:)
  # Build "--flag" strings without mutating the (frozen) CHROMIUM_ARGS literals.
  args = CHROMIUM_ARGS.map { |arg| "--#{arg}" }.join(' ')
  output = `2>&1 chromium #{args} https://example.website`
  rc = $CHILD_STATUS
  { statusCode: 200, body: { message: "rc: #{rc}, opts: #{args}, output: #{output}" }.to_json }
end
…and then in serverless.yml
:
provider:
  name: aws
  runtime: ruby2.7
  region: us-east-2
  deploymentBucket:
    name: my-serverless-bucket
  deploymentPrefix: serverless
  # This is what tells Serverless about what images to build.
  # It even builds them for you...kind of. More on that later.
  ecr:
    images:
      app:
        path: .
functions:
  debug_chromium:
    image:
      name: app
      command: my_app.test_chromium
  my_function:
    image:
      name: app
      command: my_app.my_function
    events:
      - http:
          path: myApp
          method: get
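With that deployed, you can invoke the debug function directly instead of going through API Gateway (a hedged example; run it however you normally run the Serverless CLI, e.g. through the Docker image used earlier):

```sh
serverless invoke --function debug_chromium --stage v1
```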
Upon doing this, I (okay, IT WAS ME ALL ALONG!) found that Chromium was crashing due to this error:
[35941:0821/171720.038162:FATAL:gpu_data_manager_impl_private.cc(415)] GPU process isn't usable. Goodbye.
This is odd, given that Docker containers don’t normally gain access to GPUs
unless you use --privileged
or manually specify its capabilities. As it
happens, the Chromium team all but
deprecated
the --disable-gpu
switch (to improve performance), so Chromium will try
to find a usable GPU on startup anyway.
For reasons unclear to me, the only way around this is to use the
--single-process
switch. (The reasons are unclear to me because
the docs
make it clear that the browser and the GPU run in a single process, but
I would think that this would still require a GPU to be present, which would
force the check that’s failing.) Once I added that to my list of flags, the
crashes stopped and rendering worked once again!
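For reference, the argument list from the earlier my_app.rb then becomes (this is just the original constant with single-process added):

```ruby
# yes, you need ALL of these options. None of them are typos.
CHROMIUM_ARGS = %w[headless
                   single-process
                   enable-features=NetworkService,NetworkServiceInProcess
                   no-sandbox
                   disable-dev-shm-usage
                   disable-gpu]
```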
THE FINAL BOSS: mismatched architectures
Now that our app is working, since we intend on ~~never scraping websites again~~ updating this app when our website changes, we want to have CI that deploys our function:
# .github/workflows/main.yml
---
name: Deploy function
on:
  schedule:
    - cron: "0 13 * * *"
jobs:
  sanity:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Deploy!
        run: >-
          docker run --rm
          -v $PWD:/app
          -w /app
          -e ARCHITECTURE=linux/arm64
          carlosnunez/serverless:v2.69.1
          serverless deploy
You commit the workflow, push, then set and forget…until GitHub tells you that it failed.
After quickly going into your Terminal to see what happened, you see this in your CloudWatch logs:
exec format error
Welp.
At this point we know that our computer and our Lambda function are
both running on ARM CPUs. However, GitHub Actions
only provides x86_64
runners.
Consequently, this code snippet in our Dockerfile becomes a problem:
# Dockerfile
FROM ruby:2.7-alpine3.15
ENV AWS_LAMBDA_RIE_URL_ARM64=https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie-arm64
ENV AWS_LAMBDA_RIE_URL_AMD64=https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie
# Rest of code
RUN if uname -m | grep -Eiq 'arm|aarch'; \
then curl -Lo /usr/local/bin/aws_lambda_rie "$AWS_LAMBDA_RIE_URL_ARM64"; \
else curl -Lo /usr/local/bin/aws_lambda_rie "$AWS_LAMBDA_RIE_URL_AMD64"; \
fi && chmod +x /usr/local/bin/aws_lambda_rie
The architecture of our Lambda Runtime Client depends on the architecture
of the Docker container that built it. By default, Docker will create
containers with the same platform as their host. (It
is
possible
to run Docker containers with other platforms, but it’s not the default
behavior.) When you build images manually with docker build
, you can
work around this by providing the --platform
option:
docker build --platform linux/arm64 ...
Fortunately, Serverless also supports this flag:
# rest of config
ecr:
  images:
    app:
      platform: linux/arm64
      path: .
So to work around this, we can change our platform to be an environment variable:
# serverless.yml
# rest of config
ecr:
  images:
    app:
      platform: "${env:ARCHITECTURE}"
      path: .
Then modify our CI to provide that variable to the container running Serverless, which is exactly what the -e ARCHITECTURE=linux/arm64 flag in the workflow above does.
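Locally, the same variable lets you deploy from the M1 machine without touching the config (a sketch that reuses the carlosnunez/serverless image from earlier):

```sh
# ARCHITECTURE drives the ecr.images.app.platform value in serverless.yml.
ARCHITECTURE=linux/arm64 docker run --rm -v $PWD:/app -w /app \
  -e ARCHITECTURE carlosnunez/serverless:v2.69.1 serverless deploy --stage v1
```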
Lessons Learned
This was a heck of a journey. It took me days to work through all of this, and there were several moments where I contemplated giving up on using Lambda for web scraping like this. However, just like a steep climb on a bike ride or a super heavy lift, finishing is always worth the struggle.
Here’s what I learned from all of this:
- Getting Chromium working on Lambda is a gigantic pain in the rear.
- The only way to get this working with the least amount of pain and without picking up JavaScript is to run Docker containers inside of Lambda and roll your own base images.
- Make sure that your function uses at least 2GB RAM. (Anything less will cause random timeouts.)
- Also make sure that you use --no-sandbox, --disable-dev-shm-usage, --disable-gpu, and --single-process.