This article is part of our Ruby on AWS Lambda blog series. A recent project had us migrating an existing PDF document processing system from Rails Sidekiq to AWS Lambda. The processing includes OCR, creating preview images, splicing the PDF, and more. Moving to Lambda reduced processing time by a factor of 300 in some cases.
This series of articles will serve less as a step-by-step process to get OCR serverless infrastructure up and running and more as a highlight reel of our "Aha!" moments. In part one, we talk about creating an AWS Lambda Layer with Docker. Check out the other posts in the series:
Building dependencies for Lambda can be confusing. They need to be compiled, zipped up, and then made into a layer. We also have to keep the file size for the dependencies under Lambda's limit. Here is how we did that using a Dockerfile.
You can check out the full Dockerfile here.
One question when building the dependencies was which Docker image to use. We first tried the `amazonlinux` image, but it caused build problems with one of our dependencies. We later found the LambCI images and ended up using `lambci/lambda:build-ruby2.7`, since we are using Ruby. This worked perfectly for us, and it has the benefit of already having build tools installed, making for faster Docker builds.
```dockerfile
# Use AWS Lambda ruby2.7 build environment
FROM lambci/lambda:build-ruby2.7
```
AWS Lambda currently limits the unzipped size of a function and all of its layers to 250 MB. If you are building significantly larger dependencies, as we were, then you are likely to hit this limit, so it is important to compile your dependencies in a way that reduces file size. By using the `-Os` compile-time option, we were able to reduce the size of our binaries by over 90%.
```dockerfile
# Optimize compilation for size to try and stay below Lambda's 250 MB limit
# This reduces filesize by over 90% (!) compared to the default -O2
ENV CFLAGS "-Os"
ENV CXXFLAGS $CFLAGS
```
While slightly less performant than the default `-O2`, the massively reduced file size is worth it in this situation.
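To keep an eye on the 250 MB limit during development, a quick size check on the layer contents can help. This is a hypothetical helper, not part of our published Dockerfile; inside the build container you would pass `/opt` (the layer root), and it defaults to the current directory so it runs anywhere:

```shell
#!/bin/sh
# Hypothetical sanity check: report a directory's size against Lambda's
# 250 MB unzipped limit. Pass the layer root (e.g. /opt) as the first
# argument; defaults to the current directory.
dir="${1:-.}"
size_mb=$(du -sm "$dir" | cut -f1)
echo "${dir}: ${size_mb} MB (limit: 250 MB)"
```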
Next up is building Leptonica.
```dockerfile
WORKDIR /root
# Leptonica image-reading dependencies
RUN yum install -y libjpeg-devel libpng-devel libtiff-devel
RUN curl -O http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
RUN tar zxvf leptonica-1.79.0.tar.gz
WORKDIR leptonica-1.79.0
RUN ./configure --prefix=/opt
RUN make install
```
You will need to install the `-devel` packages when compiling, but you won't need the `-devel` variants when providing the dependencies at runtime.
```dockerfile
WORKDIR /root
# Optional Tesseract foreign language training dependencies
# libicu-devel on Yum is of insufficient version (50, 52 is required)
# These are also not really necessary for our usage.
#RUN yum install -y libicu-devel pango-devel cairo-devel
RUN curl -Lo tesseract-4.1.1.tar.gz \
  https://github.com/tesseract-ocr/tesseract/archive/4.1.1.tar.gz
RUN tar zxvf tesseract-4.1.1.tar.gz
WORKDIR tesseract-4.1.1
RUN ./autogen.sh --prefix=/opt
# These ENV vars have to be set or it will not build
ENV LEPTONICA_CFLAGS -I/opt/include/leptonica
ENV LEPTONICA_LIBS -L/opt/lib -lleptonica
RUN ./configure --prefix=/opt
RUN make install
# English training data
WORKDIR /opt/share/tessdata
RUN curl -LO https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
```
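It can be worth verifying that the freshly built binary actually runs against the Leptonica in `/opt` before moving on. This smoke-test step is hypothetical (it is not in our published Dockerfile), but a failing `RUN` would abort the build early rather than surfacing as a broken layer later:

```dockerfile
# Hypothetical smoke test: fails the build if tesseract can't find its libs.
# Lambda also puts /opt/lib on LD_LIBRARY_PATH at runtime.
ENV LD_LIBRARY_PATH /opt/lib
RUN /opt/bin/tesseract --version
```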
GhostScript was technically installable via RPM, but the number of dependencies it pulled in was too great (more on that later). We decided to just compile it with a minimal dependency set.
```dockerfile
WORKDIR /root
RUN curl -LO \
  https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs952/ghostscript-9.52.tar.gz
RUN tar zxvf ghostscript-9.52.tar.gz
WORKDIR ghostscript-9.52
RUN ./configure --prefix=/opt
RUN make install
```
Ironically, we end up installing `ghostscript-devel` so that ImageMagick can be built. It might be possible to use the prior GhostScript installation here, but this was simple enough for build purposes.
```dockerfile
WORKDIR /root
RUN yum install -y ghostscript-devel
RUN curl -Lo ImageMagick-7.0.10-6.tar.gz \
  https://github.com/ImageMagick/ImageMagick/archive/7.0.10-6.tar.gz
RUN tar zxvf ImageMagick-7.0.10-6.tar.gz
WORKDIR ImageMagick-7.0.10-6
RUN ./configure --prefix=/opt
RUN make install
```
We considered using libvips instead of ImageMagick, but it would have added scope to the project. Nonetheless, I left some commented-out code in the Dockerfile for building libvips, in case we decide to switch to it in the future.
For Ruby gems, all we need is a Gemfile in the same directory that lists all of the gems.
```ruby
source 'https://rubygems.org'

# PDF processing gems
gem 'simhash2'
gem 'phashion'
gem 'rtesseract'
gem 'mini_magick'
gem 'pdf-reader'
gem 'hexapdf'

# Other gems used in these files
gem 'procto'
gem 'adamantium'
gem 'concord'
gem 'activesupport', '~> 6.0.2'
gem 'chronic'
gem 'activestorage'
```
Note that a little bit of trickery is necessary to modify the gem paths for loading in Lambda.
```dockerfile
WORKDIR /root
# Phashion dependencies
# Can skip this step because they are already installed above for Leptonica
#RUN yum install -y libjpeg-devel libpng-devel
# Copy Gemfile from host into container's current directory
COPY Gemfile .
RUN bundle config set path vendor/bundle
RUN bundle
# Modify directory structure for Lambda load path
WORKDIR vendor/bundle
RUN mkdir ruby/gems
RUN mv ruby/2.* ruby/gems
RUN mv ruby /opt
WORKDIR /root
```
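The reason for this directory shuffle: Lambda's ruby2.7 runtime includes `/opt/ruby/gems/2.7.0` on the gem path, so gems vendored into that structure can be required directly without Bundler. You can inspect the directories RubyGems will search with a one-liner (runnable anywhere; the `/opt` entry only appears inside Lambda):

```ruby
# Print the directories RubyGems searches for installed gems.
# On Lambda's ruby2.7 runtime, GEM_PATH includes /opt/ruby/gems/2.7.0,
# which is why the Dockerfile moves vendor/bundle/ruby/2.x under ruby/gems.
require 'rubygems'
puts Gem.path
```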
Now for the RPM packages. I left the installation of GhostScript and its dependencies from RPM in as a comment, mainly so you can see how many packages I had to specify manually.
Note that you will have to specify the `x86_64` variants of these packages when using these tools.
```dockerfile
WORKDIR /root
# Install yumdownloader and rpmdev-extract
RUN yum install -y yum-utils rpmdevtools
RUN mkdir rpms
WORKDIR rpms
# Download dependency RPMs
RUN yumdownloader libjpeg-turbo.x86_64 libpng.x86_64 libtiff.x86_64 \
  libgomp.x86_64 libwebp.x86_64 jbigkit-libs.x86_64
# GhostScript and dependencies
# To reduce dependencies, we are compiling GhostScript from source instead
# RUN yumdownloader ghostscript.x86_64 cups-libs.x86_64 fontconfig.x86_64 \
#   fontpackages-filesystem freetype.x86_64 ghostscript-fonts jasper-libs.x86_64 \
#   lcms2.x86_64 libICE.x86_64 libSM.x86_64 libX11.x86_64 libX11-common \
#   libXau.x86_64 libXext.x86_64 libXt.x86_64 libfontenc.x86_64 libxcb.x86_64 \
#   poppler-data stix-fonts urw-fonts xorg-x11-font-utils.x86_64 avahi-libs.x86_64 \
#   acl.x86_64 audit-libs.x86_64 cracklib.x86_64 cracklib-dicts.x86_64 cryptsetup-libs.x86_64 \
#   dbus.x86_64 dbus-libs.x86_64 device-mapper.x86_64 device-mapper-libs.x86_64 \
#   elfutils-default-yama-scope elfutils-libs.x86_64 gzip.x86_64 kmod.x86_64 kmod-libs.x86_64 \
#   libcap-ng.x86_64 libfdisk.x86_64 libpwquality.x86_64 libsemanage.x86_64 \
#   libsmartcols.x86_64 libutempter.x86_64 lz4.x86_64 pam.x86_64 qrencode-libs.x86_64 \
#   shadow-utils.x86_64 systemd.x86_64 systemd-libs.x86_64 ustr.x86_64 util-linux.x86_64 \
#   expat.x86_64 xz-libs.x86_64 libgcrypt.x86_64 libgpg-error.x86_64 elfutils-libelf.x86_64 \
#   bzip2-libs.x86_64
# Extract RPMs
RUN rpmdev-extract *.rpm
RUN rm *.rpm
# Copy all extracted package files into /opt
RUN cp -vR */usr/* /opt
# The x86_64 packages extract as lib64; we need to move these files to lib
RUN yum install -y rsync
RUN rsync -av /opt/lib64/ /opt/lib/
RUN rm -r /opt/lib64
```
Notice some more path management. We used `rsync` for copying because `cp` gave us some problems.
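The appeal of `rsync` here is that it merges one tree into another in place, preserving whatever already exists at the destination. A toy reproduction of the `lib64`-into-`lib` merge (directory names are made up, and it assumes `rsync` is installed):

```shell
#!/bin/sh
# Toy demo of merging lib64 into lib, the same shape as the
# rsync step in the Dockerfile above.
mkdir -p demo/lib64/sub demo/lib/sub
echo "from-lib64" > demo/lib64/sub/a.so
echo "already-here" > demo/lib/sub/b.so
rsync -a demo/lib64/ demo/lib/   # merge into the existing tree
rm -r demo/lib64
ls demo/lib/sub
```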
Now we just need to zip up the dependencies.
```dockerfile
WORKDIR /opt
RUN zip -r /root/ProcessDocumentLayer.zip *
```
And lastly, the entrypoint for the Dockerfile, which copies the zip file to an output directory.
```dockerfile
ENTRYPOINT ["/bin/cp", "/root/ProcessDocumentLayer.zip", "/output"]
```
Now we just need the Docker commands to build this. I put them at the very top of the file under a "Usage" section.
```dockerfile
# Usage:
#   docker build -t lambda .
#   docker run -v $(pwd):/output lambda
#   ./publish_layer.sh
```
The `publish_layer.sh` script is a small one we wrote that uses `awscli` to upload and publish the layer. You'll have to authenticate with AWS for it to work; I used `aws configure` for this purpose, but you can check out this article for more info.
```sh
#!/bin/sh
aws s3 cp ProcessDocumentLayer.zip s3://process-document-layers
aws lambda publish-layer-version --layer-name ProcessDocumentLayer \
  --description "Process Document dependencies" \
  --content S3Bucket=process-document-layers,S3Key=ProcessDocumentLayer.zip \
  --compatible-runtimes ruby2.7
```
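One step the script does not cover is attaching the published layer to a function. A sketch using `awscli` (the function name, region, account ID, and version number are placeholders you would replace with your own; it also requires AWS credentials):

```sh
# Hypothetical: point a function at the newly published layer version.
aws lambda update-function-configuration \
  --function-name ProcessDocument \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:ProcessDocumentLayer:1
```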
And that's it. With this Dockerfile, we are able to easily build and publish a dependency layer for our OCR system on Lambda.
We hope this was useful for you! Don't forget to check out part two of this series, Planning and Architecture (coming soon).