Ruby on AWS Lambda: Layer Dependencies

Joel Hayhurst - July 13, 2020

This article is part of our Ruby on AWS Lambda blog series. A recent project had us migrating an existing PDF document processing system from Sidekiq in a Rails application to AWS Lambda. The processing includes OCR, creating preview images, splicing the PDF, and more. Moving to Lambda made processing as much as 300 times faster in some cases.

This series of articles will serve less as a step-by-step guide to getting serverless OCR infrastructure up and running and more as a highlight reel of our "Aha!" moments. In part one, we talk about creating an AWS Lambda Layer with Docker.

Building dependencies for Lambda can be confusing. They need to be compiled, zipped up, and then made into a layer. We also have to keep the total size of the dependencies under Lambda's limit. Here is how we did that using a Dockerfile.

You can check out the full Dockerfile here.

Dockerfile details

One question when building the dependencies was which Docker image to use. We first tried using the amazonlinux image, but this actually resulted in some build problems with one of our dependencies. We later found the LambCI images and ended up using lambci/lambda:build-ruby2.7, because we are using Ruby. This worked perfectly for us, and it has the benefit of already having build tools installed, making for faster Docker builds.

# Use AWS Lambda ruby2.7 build environment
FROM lambci/lambda:build-ruby2.7

AWS Lambda currently has a 250 MB limit on the unzipped size of a function and its layers. If you are building sizable dependencies, as we were, then you are likely to hit this limit. It is therefore very important to compile your dependencies in a way that reduces file size. By using the -Os compile-time option, we were able to reduce the size of our binaries by over 90%.

# Optimize compilation for size to try and stay below Lambda's 250 MB limit
# This reduces filesize by over 90% (!) compared to the default -O2
ENV CFLAGS "-Os"
ENV CXXFLAGS $CFLAGS

While slightly less performant than the default -O2, the massively reduced file size is worth it in this situation.

Next up is building Leptonica.

WORKDIR /root

# Leptonica image-reading dependencies
RUN yum install -y libjpeg-devel libpng-devel libtiff-devel

RUN curl -O http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
RUN tar zxvf leptonica-1.79.0.tar.gz

WORKDIR leptonica-1.79.0
RUN ./configure --prefix=/opt
RUN make install

You will need to install the -devel packages when compiling, but you won't need the -devel variants when providing the dependencies at runtime.
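
If you want a quick way to see which runtime libraries a compiled artifact actually needs, you can inspect it inside the build container. This is just a sketch, not part of the original Dockerfile, and it assumes Leptonica's shared library was installed as /opt/lib/liblept.so:

# Optional sanity check: list the shared libraries Leptonica links against,
# which hints at which non-devel packages the layer will need at runtime
RUN ldd /opt/lib/liblept.so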

Then Tesseract:

WORKDIR /root

# Optional Tesseract foreign language training dependencies
# The libicu-devel available from yum is too old (version 50; version 52 is required)
# These are also not really necessary for our usage.
#RUN yum install -y libicu-devel pango-devel cairo-devel

RUN curl -Lo tesseract-4.1.1.tar.gz \
  https://github.com/tesseract-ocr/tesseract/archive/4.1.1.tar.gz
RUN tar zxvf tesseract-4.1.1.tar.gz

WORKDIR tesseract-4.1.1
RUN ./autogen.sh --prefix=/opt
# These ENV vars have to be set or it will not build
ENV LEPTONICA_CFLAGS -I/opt/include/leptonica
ENV LEPTONICA_LIBS -L/opt/lib -lleptonica
RUN ./configure --prefix=/opt
RUN make install

# English training data
WORKDIR /opt/share/tessdata
RUN curl -LO https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
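
As a quick confidence check (not in the original Dockerfile), you can verify that the freshly built binary runs against the libraries installed under /opt. The paths below assume the --prefix=/opt layout used above:

# Optional: confirm the Tesseract binary can find Leptonica under /opt/lib
RUN LD_LIBRARY_PATH=/opt/lib /opt/bin/tesseract --version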

GhostScript was technically installable via RPM, but the number of dependencies was too great; more on that later. We decided to just compile it with a minimal dependency set.

WORKDIR /root

RUN curl -LO \
  https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs952/ghostscript-9.52.tar.gz
RUN tar zxvf ghostscript-9.52.tar.gz

WORKDIR ghostscript-9.52
RUN ./configure --prefix=/opt
RUN make install

Ironically, we end up installing ghostscript-devel so that ImageMagick can be built. It might be possible to use the prior GhostScript installation here, but this was simple enough for build purposes.

WORKDIR /root

RUN yum install -y ghostscript-devel

RUN curl -Lo ImageMagick-7.0.10-6.tar.gz \
  https://github.com/ImageMagick/ImageMagick/archive/7.0.10-6.tar.gz
RUN tar zxvf ImageMagick-7.0.10-6.tar.gz

WORKDIR ImageMagick-7.0.10-6
RUN ./configure --prefix=/opt
RUN make install

We considered using libvips instead of ImageMagick, but it would have added scope to the project. Nonetheless, I left some commented-out code in the Dockerfile for building libvips in case we decide to switch to it in the future.

For Ruby gems, all we need is a Gemfile in the same directory listing all of the gems.

source 'https://rubygems.org'

# PDF processing gems
gem 'simhash2'
gem 'phashion'
gem 'rtesseract'
gem 'mini_magick'
gem 'pdf-reader'
gem 'hexapdf'

# Other gems used in these files
gem 'procto'
gem 'adamantium'
gem 'concord'
gem 'activesupport', '~> 6.0.2'
gem 'chronic'
gem 'activestorage'

Note that a little bit of trickery is necessary to modify the gem paths for loading in Lambda.

WORKDIR /root

# Phashion dependencies
# Can skip this step because they are already installed above for Leptonica
#RUN yum install -y libjpeg-devel libpng-devel

# Copy Gemfile from host into container's current directory
COPY Gemfile .

RUN bundle config set path vendor/bundle
RUN bundle

# Modify directory structure for Lambda load path
WORKDIR vendor/bundle
RUN mkdir ruby/gems
RUN mv ruby/2.* ruby/gems
RUN mv ruby /opt
WORKDIR /root
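
The end result of those moves is a gem layout under /opt/ruby/gems/2.7.0, which is where the Lambda Ruby 2.7 runtime looks for gems provided by layers. A quick way to double-check (not in the original Dockerfile):

# Optional: the bundled gems should now live under /opt/ruby/gems/2.7.0
RUN ls /opt/ruby/gems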

Now for the RPM packages. I left the RPM installation of GhostScript and its dependencies in as a comment, mainly so you can see how many packages I had to specify manually.

Note that you will have to specify the x86_64 variants of these packages when downloading them with these tools.

WORKDIR /root

# Install yumdownloader and rpmdev-extract
RUN yum install -y yum-utils rpmdevtools

RUN mkdir rpms
WORKDIR rpms

# Download dependency RPMs
RUN yumdownloader libjpeg-turbo.x86_64 libpng.x86_64 libtiff.x86_64 \
  libgomp.x86_64 libwebp.x86_64 jbigkit-libs.x86_64
# GhostScript and dependencies
# To reduce dependencies, we are compiling GhostScript from source instead
# RUN yumdownloader ghostscript.x86_64 cups-libs.x86_64 fontconfig.x86_64 \
#   fontpackages-filesystem freetype.x86_64 ghostscript-fonts jasper-libs.x86_64 \
#   lcms2.x86_64 libICE.x86_64 libSM.x86_64 libX11.x86_64 libX11-common \
#   libXau.x86_64 libXext.x86_64 libXt.x86_64 libfontenc.x86_64 libxcb.x86_64 \
#   poppler-data stix-fonts urw-fonts xorg-x11-font-utils.x86_64 avahi-libs.x86_64 \
#   acl.x86_64 audit-libs.x86_64 cracklib.x86_64 cracklib-dicts.x86_64 cryptsetup-libs.x86_64 \
#   dbus.x86_64 dbus-libs.x86_64 device-mapper.x86_64 device-mapper-libs.x86_64 \
#   elfutils-default-yama-scope elfutils-libs.x86_64 gzip.x86_64 kmod.x86_64 kmod-libs.x86_64 \
#   libcap-ng.x86_64 libfdisk.x86_64 libpwquality.x86_64 libsemanage.x86_64 \
#   libsmartcols.x86_64 libutempter.x86_64 lz4.x86_64 pam.x86_64 qrencode-libs.x86_64 \
#   shadow-utils.x86_64 systemd.x86_64 systemd-libs.x86_64 ustr.x86_64 util-linux.x86_64 \
#   expat.x86_64 xz-libs.x86_64 libgcrypt.x86_64 libgpg-error.x86_64 elfutils-libelf.x86_64 \
#   bzip2-libs.x86_64

# Extract RPMs
RUN rpmdev-extract *.rpm
RUN rm *.rpm

# Copy all extracted package files into /opt
RUN cp -vR */usr/* /opt

# The x86_64 packages extract as lib64, we need to move these files to lib
RUN yum install -y rsync
RUN rsync -av /opt/lib64/ /opt/lib/
RUN rm -r /opt/lib64

Notice some more path management. We used rsync for copying because cp gave us some problems.
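
Before zipping, it can also be worth checking the total unzipped size of /opt against the 250 MB limit mentioned earlier. A minimal check, not part of the original Dockerfile, could be:

# Optional: report the unzipped size of the layer contents
RUN du -sh /opt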

Now we just need to zip up the dependencies.

WORKDIR /opt
RUN zip -r /root/ProcessDocumentLayer.zip *

And lastly, the entrypoint for the Dockerfile, which copies the zip file to an output directory.

ENTRYPOINT ["/bin/cp", "/root/ProcessDocumentLayer.zip", "/output"]

Now we just need the Docker commands to build this. I put them at the very top of the file under a "Usage" section.

# Usage:
# docker build -t lambda .
# docker run -v $(pwd):/output lambda
# ./publish_layer.sh

The publish_layer.sh script is a small one we wrote that uses the AWS CLI to upload and publish the layer. You'll have to authenticate with AWS for it to work; I used aws configure for this purpose, but you can check out this article for more info.

#!/bin/sh
aws s3 cp ProcessDocumentLayer.zip s3://process-document-layers
aws lambda publish-layer-version --layer-name ProcessDocumentLayer --description "Process Document dependencies" \
  --content S3Bucket=process-document-layers,S3Key=ProcessDocumentLayer.zip --compatible-runtimes ruby2.7
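
Publishing the layer version only makes it available; it still has to be attached to your function. Assuming a hypothetical function named process-document and the LayerVersionArn returned by the command above, that could look something like this:

# Hypothetical example: attach the newly published layer version to a function
aws lambda update-function-configuration --function-name process-document \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:ProcessDocumentLayer:1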

And that's it. With this Dockerfile, we are able to easily build and publish a dependency layer for our OCR system on Lambda. We hope this was useful for you!

Joel Hayhurst

Joel has been programming in Rails since before version 1.0. He spends his free time with nature, family, and music.
