<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Alden Page's Tech Blog</title><link href="https://alden.page/" rel="alternate"/><link href="https://alden.page/feed.xml" rel="self"/><id>urn:uuid:8151999f-3260-3fe7-b8b5-03ffe951a5c6</id><updated>2024-05-13T00:00:00Z</updated><author><name/></author><entry><title>Turn Your Dead Personal Blog Into a Tor WebTunnel in 5 Minutes</title><link href="https://alden.page/blog/run-bridges/" rel="alternate"/><updated>2024-05-13T00:00:00Z</updated><author><name>Alden Page</name></author><id>urn:uuid:240a10b6-0a4e-3347-8d8b-fa7169c84c9d</id><content type="html">&lt;p&gt;&lt;img src="onions.jpg" alt="A photo of some chopped onions"&gt;&lt;/p&gt;
&lt;p&gt;Like many people in the software world, I have a personal website that I have zero time to maintain. Outside of hosting a &lt;a href="../../contact"&gt;vanity email address&lt;/a&gt; and an &lt;a href="../../crawler"&gt;old whitepaper&lt;/a&gt;, there's not a whole lot going on here. My VPS spends 99% of its cycles idling. The other 1% is probably being spent on serving drive-by WordPress vulnerability probes and &lt;a href="https://openai.com/"&gt;crawl bots trawling for training data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With just a few minutes of my time, I turned my dead website into something more productive: a censorship-busting gateway to the internet. This is possible thanks to the new Tor WebTunnel pluggable transport, which hides Tor bridge traffic behind your HTTP proxy of choice.&lt;/p&gt;
&lt;p&gt;If you already know what Tor is and how it works, just &lt;a href="#the_good_stuff"&gt;skip to the good stuff&lt;/a&gt; to learn how to set up your WebTunnel bridge relay.&lt;/p&gt;
&lt;h2&gt;Tor Crash Course&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.torproject.org/"&gt;Tor&lt;/a&gt; is a decentralized and volunteer-run network used to access the internet anonymously. Anyone can download the Tor Browser Bundle and have a reasonably secure and private link to the rest of the internet. A person's traffic gets bounced around three different &lt;em&gt;relays&lt;/em&gt;, none of which has a complete picture of what the user is doing thanks to the power of &lt;a href="https://en.wikipedia.org/wiki/Onion_routing"&gt;onion routing&lt;/a&gt;. Besides browsing the "clearnet", you can also use Tor to access onion services, which offer a number of interesting security properties. You can read &lt;a href="http://aldenp5fkdeagzwb7j4snypyxm76tucru2bm2b4bwdfd76k2dfti4tad.onion/blog/run-bridges"&gt;this very post&lt;/a&gt; through an onion service.&lt;/p&gt;
&lt;p&gt;Tor can't be accessed from some countries because of government censorship. By hosting a special kind of relay, you can help Tor's most vulnerable users gain access to the uncensored internet. Think of running a Tor Bridge as a type of volunteerism like Folding@Home. You give a bit of your expertise, CPU cycles, and bandwidth to make the world a slightly better place.&lt;/p&gt;
&lt;p&gt;A full overview of how Tor works is outside of the scope of this post, but those interested in diving deeper should start with Ben Collier's excellent &lt;a href="https://direct.mit.edu/books/oa-monograph/5761/TorFrom-the-Dark-Web-to-the-Future-of-Privacy"&gt;&lt;em&gt;Tor: From the Dark Web to the Future of Privacy&lt;/em&gt;&lt;/a&gt; for an accessible history of the project.&lt;/p&gt;
&lt;h3&gt;Relays and Bridges&lt;/h3&gt;
&lt;p&gt;Relays are the backbone of the Tor network. Anybody, including you, can run a Tor relay. The more &lt;a href="https://blog.torproject.org/strength-numbers-measuring-diversity-tor-network/"&gt;diverse&lt;/a&gt; the network becomes, the stronger the privacy guarantees become.&lt;/p&gt;
&lt;p&gt;Hosting regular relays (&lt;a href="https://community.torproject.org/relay/types-of-relays/"&gt;guards, middle nodes, and exits&lt;/a&gt;) has some caveats, but pretty much anybody can safely run a bridge relay with minimal effort. A bridge is a special kind of relay that is hard for adversaries to block, and can even be safely hosted on a home network, if you're willing to spend some time wrestling with NAT. Now, with the latest pluggable transport introduced by the Tor Project, it's possible to host a particularly difficult-to-block type of bridge with a garden-variety web server.&lt;/p&gt;
&lt;p&gt;&lt;div id="the_good_stuff"&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h2&gt;The New WebTunnel Pluggable Transport&lt;/h2&gt;
&lt;p&gt;Evading blocking is an unending arms race between censors and Tor developers. Pluggable transports are supplemental software that helps bridges disguise Tor traffic from censors. The newest one is called WebTunnel, and it works by proxying HTTPS-ified Tor traffic through your web server to the rest of the Tor network. If you have a web domain, a VPS, and a web server set up, you're essentially 90% finished setting up a WebTunnel bridge already.&lt;/p&gt;
&lt;p&gt;&lt;img src="tor_webtunnels_small.jpg" alt="A pencil sketch illustrating the architecture of a WebTunnel bridge relay. A reverse proxy routes traffic to both the dead blog and the Tor Network, mediated through the WebTunnel executable and the Tor Daemon."&gt;
&lt;em&gt;&lt;center&gt;Tor users can connect to the Tor network using a hidden path on your web server.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Set up your WebTunnel in 5 minutes&lt;/h2&gt;
&lt;p&gt;I'm writing this under the assumption that you have a domain, a VPS, and a web server serving HTTPS responses to the internet at your disposal. If you're starting from scratch, you need to &lt;a href="https://community.torproject.org/relay/setup/webtunnel/"&gt;read the official documentation instead&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's what you need to do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Generate a random string.&lt;/li&gt;
&lt;li&gt;Configure your web server to route traffic to the WebTunnel port.&lt;/li&gt;
&lt;li&gt;Install and configure WebTunnel and Tor on your VPS.&lt;/li&gt;
&lt;li&gt;Start the Tor daemon and read the logs to catch any mistakes you made.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Be sure to have automated security updates turned on as well&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This guide is written for any apt-based Linux distribution, but can be trivially adapted to work with any web server on any Unix-like system.&lt;/p&gt;
&lt;h3&gt;1. Generate a random string.&lt;/h3&gt;
&lt;p&gt;Your tunnel hides behind a reasonably long random path string on your web server. Run this command to generate an eligible string:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;cat&lt;span class="w"&gt; &lt;/span&gt;/dev/urandom&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;tr&lt;span class="w"&gt; &lt;/span&gt;-cd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;qwertyuiopasdfghjklzxcvbnmMNBVCXZLKJHGFDSAQWERTUIOP0987654321&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;head&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
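&lt;p&gt;If you'd rather not pipe &lt;code&gt;/dev/urandom&lt;/code&gt; through &lt;code&gt;tr&lt;/code&gt;, here's a rough Python equivalent. It's a minimal sketch using the standard library's &lt;code&gt;secrets&lt;/code&gt; module. The names &lt;code&gt;random_path&lt;/code&gt; and &lt;code&gt;ALPHABET&lt;/code&gt; are mine, not anything WebTunnel requires, and the character set is all ASCII letters and digits, which is close to (but not exactly) the set in the command above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import secrets
import string

# ASCII letters and digits; an assumption that any URL-safe alphabet will do.
ALPHABET = string.ascii_letters + string.digits


def random_path(length=24):
    """Return a random path string drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_path())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Either approach should work; the only thing that matters is that the string is long and unguessable.&lt;/p&gt;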
&lt;h3&gt;2. Configure your web server to route traffic to the WebTunnel port.&lt;/h3&gt;
&lt;p&gt;Now, we're going to take that random string we generated and configure our web server to use it as a path. I happen to use nginx, but if you don't, you can figure out how to route a request to localhost:15000 on your own HTTPS-enabled webserver easily enough.&lt;/p&gt;
&lt;p&gt;Let's create a new virtual host for serving WebTunnel users. Paste this in &lt;code&gt;/etc/nginx/sites-available/webtunnel-vhost&lt;/code&gt;. You will need to customize &lt;code&gt;server_name&lt;/code&gt;, &lt;code&gt;ssl_certificate&lt;/code&gt;, &lt;code&gt;ssl_certificate_key&lt;/code&gt;, and the &lt;code&gt;location&lt;/code&gt; line.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Change &lt;code&gt;$SERVER_ADDRESS&lt;/code&gt; on the &lt;code&gt;server_name&lt;/code&gt; line to your domain.&lt;/li&gt;
&lt;li&gt;Change &lt;code&gt;ssl_certificate&lt;/code&gt; and &lt;code&gt;ssl_certificate_key&lt;/code&gt; to point to your existing TLS certificate and private key. If you use Let's Encrypt, it will look similar to what I have below.&lt;/li&gt;
&lt;li&gt;On the &lt;code&gt;location&lt;/code&gt; line, change &lt;code&gt;$YOUR_RANDOM_STRING&lt;/code&gt; to the random string you generated in step 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;listen&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;[::]:443&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ssl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;http2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;listen&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;443&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ssl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;http2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;server_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$SERVER_ADDRESS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;#ssl on;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# certificates issued by Let&amp;#39;s Encrypt&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_certificate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/etc/letsencrypt/live/your.site/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# managed by Certbot&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/etc/letsencrypt/live/your.site/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# managed by Certbot&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_session_timeout&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_protocols&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;TLSv1.2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;TLSv1.3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_ciphers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ECDHE-ECDSA-AES128-GCM-SHA256&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;ECDHE-RSA-AES128-GCM-SHA256&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;ECDHE-ECDSA-AES256-GCM-SHA384&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;ECDHE-RSA-AES256-GCM-SHA384&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;ECDHE-ECDSA-CHACHA20-POLY1305&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;ECDHE-RSA-CHACHA20-POLY1305&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;DHE-RSA-AES128-GCM-SHA256&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;DHE-RSA-AES256-GCM-SHA384&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_prefer_server_ciphers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_session_cache&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;shared:MozSSL:50m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;#ssl_ecdh_curve secp521r1,prime256v1,secp384r1;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;ssl_session_tickets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;add_header&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;Strict-Transport-Security&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;max-age=63072000&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kn"&gt;location&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$YOUR_RANDOM_STRING&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_pass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:15000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_http_version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="s"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;### Set WebSocket headers ###&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;Upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$http_upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;upgrade&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;### Set Proxy headers ###&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;Host&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;X-Real-IP&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;X-Forwarded-For&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$proxy_add_x_forwarded_for&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;X-Forwarded-Proto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$scheme&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;add_header&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="s"&gt;Front-End-Https&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;proxy_redirect&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;access_log&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kn"&gt;error_log&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Enable the virtual host and test your configuration.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;ln&lt;span class="w"&gt; &lt;/span&gt;-s&lt;span class="w"&gt; &lt;/span&gt;/etc/nginx/sites-available/webtunnel-vhost&lt;span class="w"&gt; &lt;/span&gt;/etc/nginx/sites-enabled/
sudo&lt;span class="w"&gt; &lt;/span&gt;nginx&lt;span class="w"&gt; &lt;/span&gt;-t
sudo&lt;span class="w"&gt; &lt;/span&gt;systemctl&lt;span class="w"&gt; &lt;/span&gt;reload&lt;span class="w"&gt; &lt;/span&gt;nginx
&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;3. Install and configure WebTunnel and Tor on your VPS.&lt;/h3&gt;
&lt;p&gt;You have two choices for installing WebTunnel: compiling from source, or &lt;a href="https://community.torproject.org/relay/setup/webtunnel/docker/"&gt;installing from Docker&lt;/a&gt;. I don't want to deal with setting up Docker on my server, and compiling Go programs is super easy, so we're going to install from source.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;golang
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/tmp/
git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/webtunnel
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;webtunnel/main/server
go&lt;span class="w"&gt; &lt;/span&gt;build
sudo&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;server&lt;span class="w"&gt; &lt;/span&gt;/usr/local/bin/webtunnel
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, we need to install Tor through your package manager. To ensure you're using a bleeding-edge version of Tor, the Tor developers recommend that you use their repositories.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Add Tor repositories&lt;/span&gt;
&lt;span class="nv"&gt;CODENAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;lsb_release&lt;span class="w"&gt; &lt;/span&gt;-cs&lt;span class="k"&gt;)&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;EOF | sudo tee /etc/apt/sources.list.d/tor.list &amp;gt;/dev/null&lt;/span&gt;
&lt;span class="s"&gt;   deb     [signed-by=/usr/share/keyrings/tor-archive-keyring.gpg] https://deb.torproject.org/torproject.org $CODENAME main&lt;/span&gt;
&lt;span class="s"&gt;   deb-src [signed-by=/usr/share/keyrings/tor-archive-keyring.gpg] https://deb.torproject.org/torproject.org $CODENAME main&lt;/span&gt;
&lt;span class="s"&gt;EOF&lt;/span&gt;

&lt;span class="c1"&gt;# Import Tor developers&amp;#39; package signing key&lt;/span&gt;
wget&lt;span class="w"&gt; &lt;/span&gt;-qO-&lt;span class="w"&gt; &lt;/span&gt;https://deb.torproject.org/torproject.org/A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89.asc&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gpg&lt;span class="w"&gt; &lt;/span&gt;--dearmor&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;tee&lt;span class="w"&gt; &lt;/span&gt;/usr/share/keyrings/tor-archive-keyring.gpg&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;/dev/null

&lt;span class="c1"&gt;# Install Tor&lt;/span&gt;
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;tor&lt;span class="w"&gt; &lt;/span&gt;deb.torproject.org-keyring&lt;span class="w"&gt; &lt;/span&gt;-y
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's configure the Tor daemon to run as a bridge relay. Replace the contents of &lt;code&gt;/etc/tor/torrc&lt;/code&gt; with the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ORPort 127.0.0.1:auto
AssumeReachable 1
ServerTransportPlugin webtunnel exec /usr/local/bin/webtunnel
ServerTransportListenAddr webtunnel 127.0.0.1:15000
Log notice file /var/log/tor/notices.log
ExtORPort auto
SocksPort 0
BridgeRelay 1

ContactInfo &amp;lt;address@email.com&amp;gt;
ServerTransportOptions webtunnel url=https://your.site/$YOUR_RANDOM_STRING
Nickname $YOUR_RELAY_NICKNAME
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pay special attention to the last three lines, which you need to customize.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ContactInfo: Your email address. This is public information. Consider using an inbox dedicated to relay operation.&lt;/li&gt;
&lt;li&gt;ServerTransportOptions: You need to change this to point to your domain and random path string.&lt;/li&gt;
&lt;li&gt;Nickname: A fun name for identifying your bridge.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;AppArmor settings&lt;/h4&gt;
&lt;p&gt;Finally, we need to allow WebTunnel to inherit capabilities from the Tor daemon. This is achieved using the &lt;a href="https://www.novell.com/documentation/apparmor/apparmor201_sp10_admin/data/bx5bmls.html"&gt;Inherit Execute (ix)&lt;/a&gt; access mode in the Tor AppArmor profile.&lt;/p&gt;
&lt;p&gt;Open &lt;code&gt;/etc/apparmor.d/system_tor&lt;/code&gt; and add the access mode &lt;code&gt;/usr/local/bin/webtunnel ix,&lt;/code&gt; to the system_tor profile inside of the curly braces. Use &lt;a href="system_tor_apparmor.txt"&gt;my settings&lt;/a&gt; as a reference (avoid copying and pasting it; there might be upstream changes).&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;.&lt;span class="w"&gt; &lt;/span&gt;.&lt;span class="w"&gt; &lt;/span&gt;.
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;# WebTunnel-specific access mode&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;/usr/local/bin/webtunnel&lt;span class="w"&gt; &lt;/span&gt;ix,
.&lt;span class="w"&gt; &lt;/span&gt;.&lt;span class="w"&gt; &lt;/span&gt;.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once you've made the change, reload the system_tor AppArmor profile.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;apparmor_parser&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;/etc/apparmor.d/system_tor
&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;4. Start the Tor daemon and monitor the logs.&lt;/h3&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Start Tor&lt;/span&gt;
sudo&lt;span class="w"&gt; &lt;/span&gt;systemctl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;enable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--now&lt;span class="w"&gt; &lt;/span&gt;tor.service
&lt;span class="c1"&gt;# Tail the logs and make sure the server starts successfully&lt;/span&gt;
sudo&lt;span class="w"&gt; &lt;/span&gt;journalctl&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;-u&lt;span class="w"&gt; &lt;/span&gt;tor@default&lt;span class="w"&gt; &lt;/span&gt;--follow
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you see a message saying your server has been fully bootstrapped, you're done. If you're stuck, I recommend referring to the &lt;a href="https://community.torproject.org/relay/setup/webtunnel/source/"&gt;full how-to guide on the Tor website for additional support&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;That's it!&lt;/h2&gt;
&lt;p&gt;In a few days, your server fingerprint will be distributed to users of the Tor network, and traffic will start flowing in.&lt;/p&gt;
&lt;p&gt;Ideally, to make it harder to categorize and block your site, you won't announce your WebTunnel to the world like I did. In addition to torifying &lt;a href="https://alden.page"&gt;https://alden.page&lt;/a&gt;, I've started up bridge relays on a handful of other servers that I own as well.&lt;/p&gt;
</content></entry><entry><title>How to Politely Crawl and Analyze Half a Billion Images</title><link href="https://alden.page/blog/how-to-politely-crawl-and-analyze-half-a-billion-images/" rel="alternate"/><updated>2020-09-01T00:00:00Z</updated><author><name>Alden Page</name></author><id>urn:uuid:022e61a0-6ffe-3a55-aa47-df7bdb0005c4</id><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;A few years ago, &lt;a href="https://creativecommons.org/"&gt;Creative Commons&lt;/a&gt; tasked me with building a web crawler capable of downloading 500 million images.&lt;/p&gt;
&lt;p&gt;Crawling anything beyond a few thousand URLs demands a fast distributed system. Moreover, it's not enough to be fast; moral, legal, and practical considerations demand that a crawler be &lt;em&gt;polite&lt;/em&gt;: a crawler must be carefully designed to avoid exhausting the resources of its targets. Finally, there is the matter of analyzing and indexing the dataset produced by the crawler. Achieving these aims on the scale of several hundred million images is a major challenge; the problems of rate limiting and task scheduling become far more difficult when state is spread across multiple nodes.&lt;/p&gt;
&lt;p&gt;In this article, I discuss the process of designing, implementing, and deploying a large scale image crawler, with a few code snippets and diagrams along the way. The full source code is available &lt;a href="https://github.com/cc-archive/image-crawler"&gt;on GitHub&lt;/a&gt; under the MIT License.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;With CC Search (now &lt;a href="https://openverse.org"&gt;Openverse&lt;/a&gt;), Creative Commons (CC) set out to index all of the CC-licensed works on the internet, starting with images. We indexed over 500 million images, which we believe is roughly 36% of all open content by &lt;a href="https://creativecommons.org/2018/05/08/state-of-the-commons-2017/"&gt;our last count&lt;/a&gt;. Recently, we reached a point where improving the quality of the search results demanded crawling and analyzing a copy of every image in our system.&lt;/p&gt;
&lt;p&gt;Originally, when we discovered an image and inserted it into CC Search, we didn't even bother downloading it; we stuck the URL in our database and embedded the image in our search results. This approach has a lot of problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Embedding third-party content is fraught. What if the other party's server goes down, the images disappear due to link rot, or a result's TLS certificate expires? Each of these situations results in broken images appearing in the search results or browser alerts about degraded security.&lt;/li&gt;
&lt;li&gt;The dimensions and compression quality of images are unknown. We have no way to lower the rank of poor quality images, and filtering our search results by resolution is impossible.&lt;/li&gt;
&lt;li&gt;Without the images themselves, it is not possible to perform more sophisticated analysis such as tagging.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We solved (1) by setting up a &lt;a href="https://github.com/willnorris/imageproxy"&gt;caching thumbnail proxy&lt;/a&gt; between images in the search results and their third-party origin, and by adding some last-minute liveness checks to make sure that the image hasn't 404'd.&lt;/p&gt;
&lt;p&gt;(2) and (3), however, are not possible to solve without actually downloading the image and performing some analysis on the contents of the file. To reproduce the features that users take for granted in image search, we're going to need a fairly powerful crawling system.&lt;/p&gt;
&lt;p&gt;On the scale of several thousand images, it would be easy to cobble together a few scripts to spit out this information, but with half a billion images, there are a lot of hurdles to overcome.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We want to crawl &lt;a href="https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy"&gt;politely&lt;/a&gt;; however, the concentration and quantity of images mean that we have to hit some sources with a high crawl rate in order to have any hope of finishing the crawl in a reasonable period of time. Our data sources range from non-profit museums with a single staff IT person to tech companies with their own data centers and thousands of employees; the crawl rate has to be tailored to download quickly from the big players but not overwhelm small sources. At the same time, we need to be sure that we are not overestimating any source's capacity and watch for signs that our crawler is straining the server.&lt;/li&gt;
&lt;li&gt;We need to keep the time to process each image as low as possible. This means that the crawling and analysis tasks need to be distributed to multiple machines in parallel.&lt;/li&gt;
&lt;li&gt;The crawler will produce a lot of metadata. Integrating it with our internal systems should not interfere with processing incoming metadata. That suggests that a message bus will be necessary to buffer messages before they are written into our data layer, where writes can be expensive.&lt;/li&gt;
&lt;li&gt;We need a way to understand how the crawl is progressing. We should have summaries of error counts, status codes, and crawl rates broken down by source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In summary, the challenge isn't so much making a fast crawler as it is tailoring the crawl speed to each source. At a minimum, we'll need to deal with concurrency and parallelism, provisioning and managing the life cycle of crawler infrastructure, pipelines for capturing output data, a way to monitor the progress of the crawl, a suite of tests to make sure the system behaves as expected, and a reliable way to enforce a so-called "politeness policy". That's not a trivial project, particularly for our tiny three-person tech team (of which only one person is available to do all of the crawling work). Can't we just use an off-the-shelf open source crawler?&lt;/p&gt;
&lt;h2&gt;What about existing open source crawlers?&lt;/h2&gt;
&lt;p&gt;Any decent software engineer will consider existing options before diving into a project and reinventing the wheel. My assessment was that although there are a lot of open source crawling frameworks available, few of them focus on images, some are not actively maintained, and all would require extensive customization to meet the requirements of our crawl strategy. Further, many solutions are more complex than our use case demands and would significantly expand our use of cloud infrastructure, resulting in higher expenses and operational headaches. I experimented with Apache Nutch, Scrapy Cluster, and Frontera; none of the existing options looked quite right for this use case.&lt;/p&gt;
&lt;p&gt;As a reminder, we want to eventually crawl every single Creative Commons work on the internet. Effective crawling is central to the capabilities that our search engine is able to provide. In addition to being central to achieving high quality image search, crawling could also be useful for discovering new Creative Commons content of any type on any website. In my view, that's a strong argument for spending some time designing a custom crawling solution where we have complete end-to-end control of the process, as long as the feature set is limited in scope. In the next section, we'll assess the effort required to build a crawler from the ground up.&lt;/p&gt;
&lt;h2&gt;Designing the crawler&lt;/h2&gt;
&lt;p&gt;We know we're not going to be able to crawl 500 million images with one virtual machine and a single IP address, so it is obvious from the start that we are going to need a way to distribute the crawling and analysis tasks over multiple machines. A basic queue-worker architecture will do the job: when we want to crawl an image, we can dispatch the URL to an inbound images queue, and a worker eventually pops that task out and processes it. Kafka will handle all of the hard work of partitioning and distributing the tasks between workers.&lt;/p&gt;
&lt;p&gt;The worker processes do the actual analysis of the images, which entails downloading the image, extracting interesting properties, and sticking the resulting metadata back into a Kafka topic for later downstream processing. The worker will also have to include some instrumentation for conforming to rate limits and error reporting.&lt;/p&gt;
&lt;p&gt;We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our prescribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key-value store like Redis is a sensible choice.&lt;/p&gt;
&lt;p&gt;Finally, we need a supervising process that centrally controls the crawl. This key governing process will be responsible for making sure our crawler workers are behaving properly by moderating crawl rates for each source, taking action in response to errors, and reporting statistics to the operators of the crawler. We'll call this process the crawl monitor.&lt;/p&gt;
&lt;p&gt;Here's a rough sketch of how things will work:&lt;/p&gt;
&lt;p&gt;&lt;img src="image_crawler_simplified.png" alt="Diagram"&gt;&lt;/p&gt;
&lt;p&gt;At a high level, the problem of building a fast crawler seems solvable for our team, even on the scale of several hundred million images. If we can sustain a crawl and analysis rate of 200 images per second, we could crawl all 500 million images in about a month.&lt;/p&gt;
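&lt;p&gt;As a quick sanity check on that estimate, the arithmetic can be written out directly. The helper name below is made up for illustration, and the calculation assumes the rate is sustained around the clock with no downtime.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_crawl(total_images, images_per_second):
    """Return how many days a crawl takes at a constant, sustained rate."""
    return total_images / images_per_second / SECONDS_PER_DAY


# 500 million images at 200 images per second
print(round(days_to_crawl(500_000_000, 200), 1))  # 28.9
```

&lt;p&gt;Halving the sustained rate doubles the duration, so any time spent halted or rate limited pushes the finish date out proportionally.&lt;/p&gt;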
&lt;p&gt;Next, we'll examine some of the key components that make up the crawler.&lt;/p&gt;
&lt;h2&gt;Detailed breakdown&lt;/h2&gt;
&lt;h3&gt;Concurrency with &lt;code&gt;asyncio&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Crawling is an I/O bound task. The workers need to maintain lots of simultaneous open connections with internal systems like Kafka and Redis as well as 3rd party websites holding the target images. Once we have the image in memory, performing our actual analysis task is easy and cheap. For these reasons, an asynchronous approach seems more attractive than using multiple threads of execution. Even if our image processing task grows in complexity and becomes CPU bound, we can get the best of both worlds by offloading heavyweight tasks to a process pool. See "&lt;a href="https://docs.python.org/3/library/asyncio-dev.html#running-blocking-code"&gt;Running Blocking Code&lt;/a&gt;" in the &lt;code&gt;asyncio&lt;/code&gt; docs for more details.&lt;/p&gt;
&lt;p&gt;Another reason that an asynchronous approach may be desirable is that we have several interlocking components which need to react to events in real-time: our crawl monitoring process needs to simultaneously control the rate limiting process and interrupt crawling if errors are detected, while our worker processes need to consume crawl events, process images, upload thumbnails, and produce events documenting the metadata of each image. Coordinating all of these components through inter-process communication could be difficult, but breaking up tasks into small pieces and yielding to the event loop is comparatively easy; we won't need to worry so much about tripping over Python's global interpreter lock or numerous other multithreading pitfalls.&lt;/p&gt;
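&lt;p&gt;To make the process pool idea concrete, here is a minimal sketch of offloading a CPU-heavy step with &lt;code&gt;loop.run_in_executor&lt;/code&gt;. The &lt;code&gt;expensive_analysis&lt;/code&gt; function and its repeated hashing are hypothetical stand-ins for real image processing, not code from the crawler.&lt;/p&gt;

```python
import asyncio
import hashlib
from concurrent.futures import ProcessPoolExecutor


def expensive_analysis(image_bytes):
    """Hypothetical stand-in for a CPU-heavy step: hash the payload repeatedly."""
    digest = image_bytes
    for _ in range(1000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def analyze_all(images, executor=None):
    """Run expensive_analysis on every image without blocking the event loop.

    With executor=None, asyncio falls back to its default thread pool; pass a
    ProcessPoolExecutor to spread CPU-bound work across cores.
    """
    loop = asyncio.get_running_loop()
    jobs = [loop.run_in_executor(executor, expensive_analysis, img) for img in images]
    return await asyncio.gather(*jobs)


def main():
    images = [b"first image", b"second image"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = asyncio.run(analyze_all(images, pool))
    for fingerprint in results:
        print(fingerprint)


if __name__ == "__main__":
    main()
```

&lt;p&gt;While the pool is busy, the event loop stays free to keep servicing network I/O, which is the "best of both worlds" arrangement described above.&lt;/p&gt;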
&lt;h3&gt;The resize task&lt;/h3&gt;
&lt;p&gt;This is the most vital part of our crawling system: the part that actually does the work of fetching and processing an image. As established previously, we need to execute this task concurrently, so we need to make extensive use of the &lt;code&gt;async&lt;/code&gt; and &lt;code&gt;await&lt;/code&gt; keywords to allow the event loop to multitask. The actual task itself is otherwise straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the remote image and load it into memory.&lt;/li&gt;
&lt;li&gt;Extract the resolution and compression quality.&lt;/li&gt;
&lt;li&gt;Thumbnail the image for later computer vision analysis and upload it to S3.&lt;/li&gt;
&lt;li&gt;Write the information we've discovered to a Kafka topic.&lt;/li&gt;
&lt;li&gt;Report success/errors to Redis in aggregate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;See &lt;a href="https://github.com/creativecommons/image-crawler/blob/master/worker/image.py"&gt;image.py&lt;/a&gt; for the nitty-gritty details.&lt;/p&gt;
&lt;h2&gt;Rate limiting with token buckets and error circuit breakers&lt;/h2&gt;
&lt;h3&gt;How do we determine the rate limit?&lt;/h3&gt;
&lt;p&gt;Oftentimes, when designing highly concurrent software, the goal is to maximize the throughput and push servers to their absolute limit. The opposite is true with a web crawler, particularly when you are operating under a non-profit organization completely reliant on the goodwill of others to exist. We want to be as certain as reasonably possible that we aren't going to knock a resource off of the internet with an accidental &lt;a href="https://en.wikipedia.org/wiki/Denial-of-service_attack"&gt;DDoS&lt;/a&gt;. At the same time, we need to crawl as quickly as possible against sources with adequate resources to withstand a heavy crawl, or else we'll never finish. How can we match our crawl rate to a site's capabilities?&lt;/p&gt;
&lt;p&gt;Originally, my plan was to determine this through an adaptive rate limiting strategy, where we would start with a low rate limit and use a hill climbing algorithm to determine the optimal rate. We could track metrics like &lt;a href="https://en.wikipedia.org/wiki/Time_to_first_byte"&gt;time to first byte&lt;/a&gt; (TTFB) and bandwidth speed to determine the exact moment that we have started to strain upstream servers. A few concerns deterred me from implementing this design:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The assumption that performance will steadily degrade is flawed. What if TTFB holds steady until suddenly the site goes down?&lt;/li&gt;
&lt;li&gt;How can we detect whether we are the cause of the performance issue? We could get stuck at a suboptimal rate limit due to normal fluctuations in traffic.&lt;/li&gt;
&lt;li&gt;Recording TTFB in Python requires low level access to connection data not readily exposed by &lt;code&gt;aiohttp&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Eventually, I realized this would be too much of a hassle and decided to use a simpler strategy.&lt;/p&gt;
&lt;p&gt;It turns out that the size of a website is typically correlated with infrastructure capabilities. The reasoning behind this is that if you are capable of hosting 450MM images, you are probably able to handle at least a couple hundred requests per second for serving traffic. In our case, we already know how many images a source has, so it's easy for us to peg our rate limit between a low minimum for small websites and a reasonable maximum for large websites, and then interpolate everything in-between.&lt;/p&gt;
&lt;p&gt;Of course, this is only a rough heuristic for approximating a site's capacity. We have to allow the possibility that we set our rate limit too aggressively in spite of our precautions.&lt;/p&gt;
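&lt;p&gt;The size-based heuristic could be sketched roughly like this. The post only says that we interpolate between a minimum and a maximum, so the linear interpolation and all four constants below are assumptions chosen for illustration, not the crawler's actual settings.&lt;/p&gt;

```python
# All four constants are illustrative assumptions, not the crawler's real settings.
MIN_RATE = 1.0              # requests/second for the smallest sources
MAX_RATE = 200.0            # requests/second for the largest sources
SMALL_SOURCE = 10_000       # image count at or below which MIN_RATE applies
LARGE_SOURCE = 500_000_000  # image count at or above which MAX_RATE applies


def rate_limit_for(image_count):
    """Linearly interpolate a requests-per-second limit from a source's size."""
    # Clamp the size so tiny and huge sources sit exactly on the bounds.
    clamped = min(max(image_count, SMALL_SOURCE), LARGE_SOURCE)
    fraction = (clamped - SMALL_SOURCE) / (LARGE_SOURCE - SMALL_SOURCE)
    return MIN_RATE + fraction * (MAX_RATE - MIN_RATE)


print(rate_limit_for(500))          # 1.0
print(rate_limit_for(250_005_000))  # 100.5
print(rate_limit_for(10**9))        # 200.0
```

&lt;p&gt;Interpolating on the logarithm of the image count instead would give mid-sized sources a larger share of the range; which curve fits better depends on the sources being crawled.&lt;/p&gt;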
&lt;h3&gt;Backing off with circuit breakers&lt;/h3&gt;
&lt;p&gt;If our heuristic fails to correctly approximate the bandwidth capabilities of a site, we are going to start encountering problems. For one, we might exceed the server-side rate limit, which means we will see &lt;code&gt;429 Too Many Requests&lt;/code&gt; and &lt;code&gt;403 Forbidden&lt;/code&gt; errors instead of the images we're trying to crawl. Worse yet, the upstream source might continue to happily serve requests while we suck up all of their traffic capacity, resulting in degraded quality for other users. Clearly, in either scenario, we need to reduce our crawl rate or even give up crawling the source entirely if it appears that we are impacting their uptime.&lt;/p&gt;
&lt;p&gt;To handle these situations, we have two tools in our toolbox: a sliding window recording the status code of each request we've made to each domain in the last 60 seconds, and a list of the last 50 statuses for each website. If more than 10% of the requests in our one-minute window are errors, something is wrong; we should wait a minute before trying again. If we have encountered many errors in a row, however, that suggests that we're having trouble with a particular site, so we ought to give up crawling the source and raise an alert.&lt;/p&gt;
&lt;p&gt;Workers can keep track of this information in Redis. For the sliding error window, we'll use a sorted set scored by each request's timestamp, which makes it easy and cheap for us to expire status codes beyond the sliding window interval. Maintaining a list of the last N response codes is even easier; we just push the status code onto a list associated with the source and trim it to length.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StatsManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;known_sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_record_window_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Insert a status into all sliding windows. &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Time-based sliding windows&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stat_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;WINDOW_PAIRS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stat_key&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Delete events from outside the window&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zremrangebyscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;-inf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# &amp;quot;Last n requests&amp;quot; window&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAST_50_REQUESTS&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAST_50_REQUESTS&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Collecting status codes in aggregate&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Meanwhile, the crawl monitor process can keep tabs on the contents of each error threshold.&lt;/p&gt;
&lt;p&gt;When more than 10% of the requests made to a source in the last minute are errors, we'll set a halt condition in Redis and stop replenishing rate limit tokens (more on that below).&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;one_minute_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zrangebyscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;one_minute_window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;-inf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;one_minute_window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;EXPECTED_STATUSES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;successful&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;tolerance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ERROR_TOLERANCE_PERCENT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;successful&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;successful&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tolerance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEMP_HALTED_SET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Detecting elevated crawl errors for a source&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For detecting more serious errors, where we've seen 50 failed requests in a row, we'll set a permanent halt condition. That will give us the chance to tune our software before resuming the crawl.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;last_50_statuses_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;statuslast50req:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;last_50_statuses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_50_statuses_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_50_statuses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;_every_request_failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_50_statuses&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HALTED_SET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Detecting persistent crawl errors&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In practice, keeping a sliding window for tracking error thresholds and setting reasonable crawl rates worked well enough that the circuit breaker never activated.&lt;/p&gt;
&lt;h3&gt;Enforcing rate limits with token buckets&lt;/h3&gt;
&lt;p&gt;It's one thing to set a policy for crawling; it's another thing entirely to actually enforce it. How can we coordinate our multiple crawling processes to prevent them from overstepping our rate limit?&lt;/p&gt;
&lt;p&gt;The answer is to implement a distributed token bucket system. The idea behind this is that each crawler has to obtain a token from Redis before making a request. Every second, the crawl monitor sets a variable containing the number of requests that can be made against a source. Each crawler process decrements the counter before making a request. If the decremented result is zero or greater, the worker is cleared to crawl. Otherwise, the bucket is empty: the rate limit has been reached, and the worker has to wait for the next refill and try again.&lt;/p&gt;
&lt;p&gt;The beauty of token buckets is their simplicity, performance, and resilience against failure. If our crawl monitor process dies, crawling halts completely; making a request is not possible without first acquiring a token. This fail-closed design is far more desirable than the guard rails completely disappearing with the crawl monitor and allowing unbounded crawling. Further, since decrementing a counter and retrieving the result is an atomic operation in Redis, there's no risk of race conditions and therefore no need for locking. That matters, because the overhead of coordinating and blocking on every single request would rapidly bog down our crawling system.&lt;/p&gt;
&lt;p&gt;To ensure that all crawling is performed at the correct speed, I wrapped &lt;code&gt;aiohttp.ClientSession&lt;/code&gt; with a rate limited version of the class.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimitedClientSession&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aioclient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aioclient&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;token_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CURRTOKEN_PREFIX&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token_acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Out of tokens&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;token_acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token_acquired&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;token_acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;token_acquired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token_acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Meanwhile, the crawl monitor process is filling up each bucket once per second.&lt;/p&gt;
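&lt;p&gt;The monitor's side of the bargain might look something like the sketch below. To keep it runnable without a Redis server, it uses a tiny in-memory &lt;code&gt;FakeRedis&lt;/code&gt; class; that class, the key names, and the function names are all hypothetical and only mirror the behavior described in this post.&lt;/p&gt;

```python
import asyncio

CURRTOKEN_PREFIX = "currtokens:"         # assumed prefix, mirroring the worker snippet
HALT_KEYS = ("halted", "temp_halted")    # assumed names for the two halt sets


class FakeRedis:
    """In-memory stand-in exposing only the two calls this sketch needs."""

    def __init__(self):
        self.store = {}

    async def set(self, key, value):
        self.store[key] = value

    async def sismember(self, key, member):
        return member in self.store.get(key, set())


async def replenish_once(redis, rate_limits):
    """Reset each bucket to its per-second allowance, skipping halted sources."""
    for source, per_second in rate_limits.items():
        halted = [await redis.sismember(key, source) for key in HALT_KEYS]
        if any(halted):
            continue  # a tripped circuit breaker gets no new tokens
        # Setting (rather than incrementing) means unused tokens never pile up.
        await redis.set(f"{CURRTOKEN_PREFIX}{source}", per_second)


async def replenish_forever(redis, rate_limits, interval=1.0):
    """Refill every bucket once per interval until cancelled."""
    while True:
        await replenish_once(redis, rate_limits)
        await asyncio.sleep(interval)


async def main():
    redis = FakeRedis()
    redis.store["temp_halted"] = {"slow.example"}
    await replenish_once(redis, {"big.example": 200, "slow.example": 1})
    print(redis.store)


if __name__ == "__main__":
    asyncio.run(main())
```

&lt;p&gt;Because a halted source simply stops receiving tokens, its workers drain whatever is left in the bucket and then wait, which is exactly the fail-closed behavior we want.&lt;/p&gt;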
&lt;h3&gt;Scheduling tasks somewhat intelligently&lt;/h3&gt;
&lt;p&gt;The final pitfall in the design of our crawler is that we want to crawl every single website at the same time at its prescribed rate limit. That sounds almost tautological, like something that we should be able to take for granted after implementing all of this logic for preventing our crawler from working too quickly, but it turns out our crawler's processing capacity itself is a limited and contended resource. We can only schedule so many tasks simultaneously on each worker, and we need to ensure that tasks from a single website aren't starving other sources of crawl capacity.&lt;/p&gt;
&lt;p&gt;For instance, imagine that each worker is able to handle 5000 simultaneous crawling tasks, and every one of those tasks is tied to a tiny website with a very low rate limit. That means that our entire worker, which is capable of handling hundreds of crawl and analysis jobs per second, is stuck making one request per second until some faster tasks appear in the queue.&lt;/p&gt;
&lt;p&gt;In other words, we need to make sure that each worker process isn't jamming itself up with a single source. We have a &lt;a href="https://en.wikipedia.org/wiki/Scheduling_(computing%29"&gt;scheduling problem&lt;/a&gt;. We've naively implemented first-come-first-serve and need to switch to a different scheduling strategy.&lt;/p&gt;
&lt;p&gt;There are innumerable ways to address scheduling problems. Since there are only a few dozen sources in our system, we can get away with using a stupid scheduling algorithm: give each source equal capacity in every worker. In other words, if there are 5000 tasks to distribute and 30 sources, we can allocate 166 simultaneous tasks to each source per worker. That's plenty for our purposes. There are obvious drawbacks of this approach in that eventually there will be so many sources that we start starving high rate limit sources of work. We'll cross that bridge when we come to it; it's better to use the simplest possible approach we can get away with instead of spending all of our time on solving hypothetical future problems.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_schedule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;raw_sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;smembers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inbound_sources&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_sources&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;num_sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# A source never gets more than 1/4th of the worker&amp;#39;s capacity. This&lt;/span&gt;
        &lt;span class="c1"&gt;# helps prevent starvation of lower rate limit requests and ensures&lt;/span&gt;
        &lt;span class="c1"&gt;# that the first few sources to be discovered don&amp;#39;t get all of the&lt;/span&gt;
        &lt;span class="c1"&gt;# initial task slots.&lt;/span&gt;
        &lt;span class="n"&gt;max_share&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_TASKS&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="n"&gt;share&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_TASKS&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_sources&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_share&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;to_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;num_unfinished&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_unfinished_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_schedule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;num_to_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;share&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;num_unfinished&lt;/span&gt;
            &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;source_msgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_consume_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_to_schedule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;to_schedule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_msgs&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;to_schedule&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Scheduling tasks for every source&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
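&lt;p&gt;To make the arithmetic concrete, here is a minimal standalone sketch of the share calculation. The helper names &lt;code&gt;per_source_share&lt;/code&gt; and &lt;code&gt;slots_to_fill&lt;/code&gt; are made up for illustration and are not part of the crawler; the zero-source guard and the clamp to zero are extra assumptions on top of the method above.&lt;/p&gt;

```python
def per_source_share(max_tasks, num_sources, max_fraction=4):
    """Return how many task slots each source gets on one worker.

    Hypothetical helper: an equal split across sources, capped so that
    no single source gets more than max_tasks / max_fraction slots.
    """
    if num_sources == 0:
        # Assumption: with no known sources there is nothing to schedule.
        return 0
    cap = max_tasks // max_fraction
    return min(max_tasks // num_sources, cap)


def slots_to_fill(share, num_unfinished):
    """Free slots left for a source; clamped so it is never negative."""
    return max(share - num_unfinished, 0)


print(per_source_share(5000, 30))  # 166: the equal split is below the cap
print(per_source_share(5000, 2))   # 1250: the 1/4 cap kicks in
print(slots_to_fill(166, 200))     # 0: the source is already over its share
```

&lt;p&gt;Integer division keeps every result a whole number of tasks, which is convenient when the result is used as a count of messages to consume.&lt;/p&gt;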
&lt;p&gt;The one implementation detail to deal with here is that our workers can't draw from a single inbound images queue anymore; we need to partition each source into its own queue so we can pull tasks from each source when we need them. This partitioning process can be handled transparently by the crawl monitor.&lt;/p&gt;
&lt;p&gt;&lt;img src="image_crawler.png" alt="A more complete diagram"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;A more complete diagram showing the system with a queue for each source&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
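&lt;p&gt;The partitioning idea can be sketched in memory, without Kafka. This is only an illustration: it assumes each message is a dict with a &lt;code&gt;source&lt;/code&gt; key, which may not match the real message format, and the function names are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict, deque


def partition_by_source(messages):
    """Split one inbound stream into a FIFO queue per source.

    Assumes each message is a dict with a "source" key; the real
    message format may differ.
    """
    queues = defaultdict(deque)
    for msg in messages:
        queues[msg["source"]].append(msg)
    return queues


def take(queue, n):
    """Pop up to n messages from the front of one source's queue."""
    taken = []
    while queue and n > len(taken):
        taken.append(queue.popleft())
    return taken


inbound = [
    {"source": "flickr", "url": "https://example.com/1.jpg"},
    {"source": "animaldiversity", "url": "https://example.com/2.jpg"},
    {"source": "flickr", "url": "https://example.com/3.jpg"},
]
queues = partition_by_source(inbound)
batch = take(queues["flickr"], 1)
print(len(batch), len(queues["flickr"]))  # 1 1
```

&lt;p&gt;With one queue per source, a scheduler can ask each source for exactly as many tasks as its share allows, instead of taking whatever happens to be at the head of a shared queue.&lt;/p&gt;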
&lt;h3&gt;Designing for testability&lt;/h3&gt;
&lt;p&gt;It is difficult to test I/O-heavy systems because of their many interactions with external systems. Often it is necessary to write complex integration tests or run manual tests to be certain that the software works. The problem is that integration tests are hard to maintain and take a long time to execute. Relying on integration tests exclusively would make maintaining the crawler far more difficult. Instead, we should build a suite of unit tests. How can we simulate the crawler realistically without writing a full-blown integration test suite?&lt;/p&gt;
&lt;p&gt;The solution to this problem is to use dependency injection, which is a fancy way of saying that we never do I/O directly from within our application. Instead, we delegate I/O to external objects that can be passed in at run-time. This makes it easy to pass in fake objects that approximate real world behavior without real world consequences.&lt;/p&gt;
&lt;p&gt;For example, the crawl monitor usually has to talk to our CC Search API (for assessing source size), Redis, and Kafka to do its job of regulating the crawl; instead of setting up a brittle and complicated integration test with all of those dependencies, we just instantiate some mock objects and pass them in. Now we can easily test individual components such as the error circuit breaker.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@pytest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;source_fixture&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Mocks the /v1/sources endpoint response. &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;source_name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;example&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;image_count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;display_name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Example&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;source_url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;example.com&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;source_name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;another&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;image_count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;display_name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Another&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;source_url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;whatever&amp;quot;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_mock_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FakeAioResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FakeAioSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FakeRedis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;regulator_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit_regulator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regulator_task&lt;/span&gt;


&lt;span class="nd"&gt;@pytest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_error_circuit_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_fixture&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_fixture&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_mock_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;statuslast50req:example&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;500&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;statuslast50req:another&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;200&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;run_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example&amp;#39;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;halted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;another&amp;#39;&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;halted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Testing our crawl monitor's circuit breaking functionality with mock dependencies&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The main drawback of dependency injection is that initializing your objects takes more ceremony. See the &lt;a href="https://github.com/creativecommons/image-crawler/blob/00b59aba9a15faccf203a53d73a98e8c06cb69e8/worker/scheduler.py#L162"&gt;initialization of the crawl scheduler&lt;/a&gt; for an example of wiring up an object with a lot of dependencies. You might also find that constructors end up with a lot of arguments if care isn't taken to bundle external dependencies together. In my opinion, the price of a few extra lines of initialization code is well worth the benefits gained from testability and modularity, and the number of arguments can be pared down with &lt;a href="https://docs.python.org/3/library/dataclasses.html"&gt;data classes&lt;/a&gt;.&lt;/p&gt;
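&lt;p&gt;As a rough illustration of bundling dependencies with a data class, consider the sketch below. &lt;code&gt;MonitorDeps&lt;/code&gt;, &lt;code&gt;Monitor&lt;/code&gt;, and &lt;code&gt;FakeSetStore&lt;/code&gt; are hypothetical names invented for this example; they are not the crawler's actual classes.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MonitorDeps:
    """Bundle of external collaborators (hypothetical names)."""
    session: Any  # HTTP client, or a fake in tests
    redis: Any    # Redis client, or a fake in tests
    kafka: Any    # Kafka client, or a fake in tests


class Monitor:
    """Toy component that takes one bundle instead of many arguments."""

    def __init__(self, deps: MonitorDeps):
        self.deps = deps

    def halt(self, source: str) -> None:
        # All I/O goes through the injected object, never a global client.
        self.deps.redis.sadd("halted", source)


class FakeSetStore:
    """In-memory stand-in that only implements sadd."""

    def __init__(self):
        self.store = {}

    def sadd(self, key, value):
        self.store.setdefault(key, set()).add(value)


fake_redis = FakeSetStore()
monitor = Monitor(MonitorDeps(session=None, redis=fake_redis, kafka=None))
monitor.halt("example")
print(fake_redis.store)  # {'halted': {'example'}}
```

&lt;p&gt;The constructor signature stays at one argument no matter how many collaborators are added, and a test only has to fill in the fields it actually exercises.&lt;/p&gt;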
&lt;h2&gt;Smoke testing&lt;/h2&gt;
&lt;p&gt;Even with our unit test coverage, we still need to do some basic small-scale manual tests to make sure our assumptions hold up in the real world. We'll need to write &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; modules that provision a working version of the real system. Sadly, our Terraform infrastructure repository is private for now, but here's a taste of what the infra code looks like.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;image-crawler&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;../../modules/services/image-crawler&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;prod&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;docker_tag&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.25.0&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.aws_access_key_id}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.aws_secret_access_key}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;zookeeper_endpoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${module.kafka.zookeeper_brokers}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;kafka_brokers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${module.kafka.kafka_brokers}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;worker_instance_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;m5.large&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;worker_count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;Initialization of crawler Terraform module in our production environment&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;aws_instance&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;crawler-workers&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;ami&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.ami}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;instance_type&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.worker_instance_type}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;user_data&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${data.template_file.worker_init.rendered}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;subnet_id&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${element(data.aws_subnet_ids.subnets.ids, 0)}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;vpc_security_group_ids&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${aws_security_group.image-crawler-sg.id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="w"&gt;                   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.worker_count}&amp;quot;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;tags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;image-crawler-worker-${var.environment}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.environment}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:environment&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.environment == &amp;quot;dev&amp;quot; ? &amp;quot;staging&amp;quot; : &amp;quot;production&amp;quot;}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:product&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cccatalog-api&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:purpose&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Image crawler worker&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:team&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc-search&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;aws_instance&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;crawler-monitor&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;ami&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.ami}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;instance_type&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;c5.large&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;user_data&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${data.template_file.monitor_init.rendered}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;subnet_id&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${element(data.aws_subnet_ids.subnets.ids, 0)}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;vpc_security_group_ids&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${aws_security_group.image-crawler-sg.id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;tags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;image-crawler-monitor-${var.environment}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.environment}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:environment&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${var.environment == &amp;quot;dev&amp;quot; ? &amp;quot;staging&amp;quot; : &amp;quot;production&amp;quot;}&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:product&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cccatalog-api&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:purpose&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Image crawler monitor&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc:team&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cc-search&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;center&gt;An excerpt of the crawler module definition&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One &lt;code&gt;terraform plan&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt; cycle later, we're ready to feed a few million test URLs to the inbound image queue and see what happens. By my recollection, testing uncovered many glaring issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Basic network security configuration problems prevented communication between key components.&lt;/li&gt;
&lt;li&gt;The scheduling algorithm had to be completely overhauled.&lt;/li&gt;
&lt;li&gt;Workers exceeded the Redis maximum connection limit.&lt;/li&gt;
&lt;li&gt;Workers crashed after hitting the open file limit.&lt;/li&gt;
&lt;li&gt;A nasty memory leak in the &lt;code&gt;pykafka&lt;/code&gt; consumer prompted a late switch to &lt;code&gt;confluent-kafka&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After patching all of those issues and performing a larger smoke test, we're ready to start crawling on a large scale.&lt;/p&gt;
&lt;h3&gt;Monitoring the crawl&lt;/h3&gt;
&lt;p&gt;Regrettably, we can't just kick back and relax while the crawler does its thing for a few weeks. We need some idea of what the crawler is doing so we can be alerted when something breaks.&lt;/p&gt;
&lt;p&gt;How quickly are we crawling each website? What's our target rate limit for each source? How many errors have occurred? How many images have we successfully processed? Are we crawling right now, or are we finished?&lt;/p&gt;
&lt;p&gt;Ideally, we would build a reporting dashboard for this, but in the interest of time, we'll dump a giant JSON blob to &lt;code&gt;STDOUT&lt;/code&gt; every 5 seconds and call it a day. When we want to check on crawl progress, we read the logs. Since JSON is easy for both humans and machines to read, we can build a more sophisticated monitoring system later should the need arise.&lt;/p&gt;
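&lt;p&gt;A periodic JSON status line of this kind might be sketched as follows. This is not the crawler's actual logging code: &lt;code&gt;format_update&lt;/code&gt;, &lt;code&gt;log_forever&lt;/code&gt;, and the shape of the stats dict are assumptions made for illustration, and the demo uses a very short interval so that it finishes quickly.&lt;/p&gt;

```python
import asyncio
import json
import sys
from datetime import datetime, timezone


def format_update(stats, now=None):
    """Render one monitoring update as a single JSON line (hypothetical shape)."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "event": "monitoring_update",
        "time": now.isoformat(),
        "general": stats,
    })


async def log_forever(get_stats, interval=5.0, out=sys.stdout):
    """Write a JSON status line every `interval` seconds until cancelled."""
    while True:
        out.write(format_update(get_stats()) + "\n")
        out.flush()
        await asyncio.sleep(interval)


async def demo():
    # Short interval so the demo ends quickly; the real cadence is 5 seconds.
    task = asyncio.create_task(
        log_forever(lambda: {"num_resized": 13224}, interval=0.01)
    )
    await asyncio.sleep(0.05)
    task.cancel()


asyncio.run(demo())
```

&lt;p&gt;Because each update is one self-contained line, the log can later be fed to any tool that reads newline-delimited JSON.&lt;/p&gt;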
&lt;p&gt;Here's an example log line from one of our smoke tests, indicating that we've crawled 13,224 images successfully and nothing else is happening.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;event&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;monitoring_update&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2020-04-17T20:22:56.837232&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;general&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;global_max_rps&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;193.418869804698&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;error_rps&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;processing_rate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;success_rps&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;circuit_breaker_tripped&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;num_resized&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;resize_errors&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;split_rate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;specific&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;flickr&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;successful&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13188&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;last_50_statuses&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;200&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;rate_limit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;178.375147633876&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;animaldiversity&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;last_50_statuses&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;200&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;successful&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;rate_limit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.206215440554406&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;phylopic&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;rate_limit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;successful&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;last_50_statuses&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;200&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now that we can see what the crawler is up to, we can schedule the larger crawl and start collecting production-quality data.&lt;/p&gt;
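&lt;p&gt;As a rough illustration of how the status output above could be put to work, here is a minimal sketch that summarizes each source's health from the &lt;code&gt;specific&lt;/code&gt; section. The function name &lt;code&gt;summarize_sources&lt;/code&gt; is hypothetical and not part of the crawler, and the sketch assumes that &lt;code&gt;successful&lt;/code&gt; and &lt;code&gt;error&lt;/code&gt; are running counts of completed and failed requests.&lt;/p&gt;

```python
import json

# Sample trimmed from the status output above; only the "specific" section is used.
SAMPLE = """
{
  "specific": {
    "flickr": {"successful": 13188, "last_50_statuses": {"200": 50},
               "rate_limit": 178.375147633876, "error": 0},
    "animaldiversity": {"successful": 18, "last_50_statuses": {"200": 18},
                        "rate_limit": 0.206215440554406, "error": 0},
    "phylopic": {"successful": 18, "last_50_statuses": {"200": 18},
                 "rate_limit": 0.2, "error": 0}
  }
}
"""


def summarize_sources(status):
    """Map each source to (overall error rate, share of recent 200 responses).

    Assumption: "successful" and "error" are request counts, and
    "last_50_statuses" maps an HTTP status code to how often it was seen.
    """
    summary = {}
    for source, stats in status.get("specific", {}).items():
        errors = stats.get("error", 0)
        total = stats.get("successful", 0) + errors
        error_rate = errors / total if total else 0.0

        recent = stats.get("last_50_statuses", {})
        recent_total = sum(recent.values())
        recent_ok = recent.get("200", 0) / recent_total if recent_total else 0.0

        summary[source] = (error_rate, recent_ok)
    return summary


if __name__ == "__main__":
    for source, (err, ok) in summarize_sources(json.loads(SAMPLE)).items():
        print(f"{source}: error rate {err:.2%}, recent 200s {ok:.0%}")
```

&lt;p&gt;A summary like this could feed a simple alert when a source's error rate climbs, though the thresholds worth alerting on would depend on the source.&lt;/p&gt;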
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The result of our efforts is a lightweight, modular, highly concurrent, and polite distributed image crawler built in roughly 1,300 lines of Python.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;alden:~/code/image_crawler$ &lt;/span&gt;cloc&lt;span class="w"&gt; &lt;/span&gt;.
&lt;span class="go"&gt;      48 text files.&lt;/span&gt;
&lt;span class="go"&gt;      43 unique files.                              &lt;/span&gt;
&lt;span class="go"&gt;      25 files ignored.&lt;/span&gt;

&lt;span class="go"&gt;github.com/AlDanial/cloc v 1.81  T=0.02 s (1667.4 files/s, 130887.8 lines/s)&lt;/span&gt;
&lt;span class="go"&gt;------------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Language                     files          blank        comment           code&lt;/span&gt;
&lt;span class="go"&gt;------------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Python                          16            244            242           1324&lt;/span&gt;
&lt;span class="go"&gt;Markdown                         5             79              0            219&lt;/span&gt;
&lt;span class="go"&gt;YAML                             3              2              4             61&lt;/span&gt;
&lt;span class="go"&gt;XML                              3              0              0             18&lt;/span&gt;
&lt;span class="go"&gt;Bourne Shell                     1              0              1              4&lt;/span&gt;
&lt;span class="go"&gt;------------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;SUM:                            28            325            247           1626&lt;/span&gt;
&lt;span class="go"&gt;------------------------------------------------------------------------------&lt;/span&gt;

&lt;span class="gp"&gt;alden:~/code/image_crawler$ &lt;/span&gt;tree&lt;span class="w"&gt; &lt;/span&gt;.
&lt;span class="go"&gt;.&lt;/span&gt;
&lt;span class="go"&gt;├── architecture.png&lt;/span&gt;
&lt;span class="go"&gt;├── CODE_OF_CONDUCT.md&lt;/span&gt;
&lt;span class="go"&gt;├── CONTRIBUTING.md&lt;/span&gt;
&lt;span class="go"&gt;├── crawl_monitor&lt;/span&gt;
&lt;span class="go"&gt;│   ├── __init__.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── monitor.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── rate_limit.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── README.md&lt;/span&gt;
&lt;span class="go"&gt;│   ├── settings.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── source_splitter.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── structured_logging.py&lt;/span&gt;
&lt;span class="go"&gt;│   └── tsv_producer.py&lt;/span&gt;
&lt;span class="go"&gt;├── docker-compose.yml&lt;/span&gt;
&lt;span class="go"&gt;├── Dockerfile-monitor&lt;/span&gt;
&lt;span class="go"&gt;├── Dockerfile-worker&lt;/span&gt;
&lt;span class="go"&gt;├── __init__.py&lt;/span&gt;
&lt;span class="go"&gt;├── LICENSE&lt;/span&gt;
&lt;span class="go"&gt;├── Pipfile&lt;/span&gt;
&lt;span class="go"&gt;├── Pipfile.lock&lt;/span&gt;
&lt;span class="go"&gt;├── publish_release.sh&lt;/span&gt;
&lt;span class="go"&gt;├── README.md&lt;/span&gt;
&lt;span class="go"&gt;├── test&lt;/span&gt;
&lt;span class="go"&gt;│   ├── corrupt.jpg&lt;/span&gt;
&lt;span class="go"&gt;│   ├── __init__.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── mocks.py&lt;/span&gt;
&lt;span class="go"&gt;│   ├── test_image.jpg&lt;/span&gt;
&lt;span class="go"&gt;│   ├── test_monitor.py&lt;/span&gt;
&lt;span class="go"&gt;│   └── test_worker.py&lt;/span&gt;
&lt;span class="go"&gt;└── worker&lt;/span&gt;
&lt;span class="go"&gt;    ├── image.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── __init__.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── message.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── rate_limit.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── scheduler.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── settings.py&lt;/span&gt;
&lt;span class="go"&gt;    ├── stats_reporting.py&lt;/span&gt;
&lt;span class="go"&gt;    └── util.py&lt;/span&gt;

&lt;span class="go"&gt;3 directories, 34 files&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We now have loads of useful information about images that we previously lacked. The next step is to integrate this metadata into our search engine and to perform deeper analysis of the images using computer vision.&lt;/p&gt;
&lt;h2&gt;Acknowledgments &amp;amp; Legal&lt;/h2&gt;
&lt;p&gt;I'd like to give special thanks to my colleagues at Creative Commons who took the time to review this article and give useful feedback, including Brent Moran, Kriti Godey, and Zack Krida.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://opensource.creativecommons.org/blog/entries/crawling-500-million/"&gt;This article was originally published on the Creative Commons Open Source Blog on August 17th, 2020&lt;/a&gt;. The version appearing on this page has been revised and expanded. It was written by Alden Page and released under a &lt;a href="https://creativecommons.org/licenses/by/4.0/"&gt;Creative Commons Attribution 4.0 International license&lt;/a&gt;.&lt;/p&gt;
</content></entry></feed>