rdap crawler monitor

automated packet provenance for blacklisting impolite web crawlers

2024-08-09

As part of a long train of dependencies and distractions involved in setting the CORS headers for ocularium's schematics, I discovered a number of web crawlers accessing pages my robots.txt told them not to, so I decided to IP blacklist them in my reverse proxy.

tcpdump -> wireshark

I don't generally leave HTTP access logs on, and I figured there was probably a less invasive and less stateful way to capture the source IPs I was interested in than updating my nginx config. I started with this tcpdump invocation, inspecting the HTTP traffic behind my TLS termination in wireshark and looking for X-Forwarded-For:

$ ssh $REMOTE_SYSTEM "tcpdump -s 0 -U -n -w - -i $IFACE" > out.pcap; wireshark out.pcap

This became time-consuming: there were a lot of subnets to block, all Huawei Cloud (Singapore) IPs, and I had to rerun the capture, analyze it, and extend my blacklist in repeated batches.
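For context, each blacklist entry is a one-line nginx deny rule. One wrinkle: since the real client address arrives in X-Forwarded-For behind the TLS terminator, nginx's realip module has to rewrite the client address first so that deny/allow match the real source. A sketch, using two of the ranges that show up in the captures (the trusted-proxy address is illustrative):

```nginx
# ngx_http_realip_module: trust the upstream TLS terminator and take
# the client address from X-Forwarded-For, so deny matches the real IP.
set_real_ip_from 10.0.0.1;       # address of the terminating proxy (illustrative)
real_ip_header   X-Forwarded-For;

# one deny per offending subnet
deny 94.74.80.0/20;
deny 159.138.80.0/20;
allow all;
```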

tshark

Once I'd had enough of this manual process, I found tshark, a command-line interface to wireshark's functionality, and was able to stream WHOIS results:

$ ssh $REMOTE_SYSTEM "tcpdump -s 0 -U -n -w - -i $IFACE" | \
    tshark \
        -i - -Y 'http.request.uri.path contains "/interesting" && http.x_forwarded_for' \
        -T fields -e 'http.x_forwarded_for' | \
    xargs -n1 whois | \
    egrep -i 'route|CIDR|Organization|Country|netname' --line-buffered
Country: SG
route: 94.74.80.0/20
Country: ZZ
Country: SG
netname: Huawei-Cloud-SG
...

This hacked-together approach worked, but the results were ugly and imprecise, and by this point the volume was consistent enough that it seemed worth going the whole nine yards: packaging everything up into a systemd service whose logs I can grep at my leisure.

full automation

First, tshark can take over tcpdump's responsibilities:

$ tshark -i "$IFACE" -Y "$FILTER" -T fields -e 'http.x_forwarded_for' -l

Then, to make the results nicer than the whois -> grep hack above (easy to read, with a clear ip -> owner association): it turns out RDAP is the modern iteration of WHOIS, basically the same information served as a REST JSON API, and there are plenty of free services out there. I came across RIPE's first and wrote a quick Python script that reads IPs off of stdin and logs (IP, CIDR, owner, country), with a Bash script to feed it:

#!/usr/bin/env bash

set -euo pipefail

tshark -i "$IFACE" -Y "$FILTER" -T fields -e "$EXPR" -l | rdap
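The rdap script isn't published yet; here's a minimal sketch of how one might look, assuming the free rdap.org bootstrap redirector (rather than whatever endpoint the real script uses) and the cidr0 extension most RIRs include in their responses:

```python
#!/usr/bin/env python3
"""Read one IP per line on stdin, log (ip, cidrs, owner, country) via RDAP."""
import json
import sys
import urllib.request

# rdap.org redirects to the authoritative RIR for the address; an
# assumption here, not necessarily the endpoint the original uses.
RDAP_URL = "https://rdap.org/ip/{}"


def summarize(rdap: dict) -> dict:
    """Pull the interesting fields out of an RDAP 'ip network' response."""
    cidrs = [
        f"{c.get('v4prefix') or c.get('v6prefix')}/{c['length']}"
        for c in rdap.get("cidr0_cidrs", [])
    ]
    return {
        "cidrs": cidrs,
        "country": rdap.get("country", "??"),
        "name": rdap.get("name", "?"),
    }


def main() -> None:
    for line in sys.stdin:
        # X-Forwarded-For can be a comma-separated chain; the client is first.
        ip = line.strip().split(",")[0].strip()
        if not ip:
            continue
        with urllib.request.urlopen(RDAP_URL.format(ip)) as resp:
            info = summarize(json.load(resp))
        # systemd's journal supplies timestamps, so only log the fields.
        print(f"cidrs={info['cidrs']} country={info['country']} "
              f"ip={ip} name={info['name']}", flush=True)


if __name__ == "__main__":
    main()
```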

And a systemd service unit, expressed as a NixOS module.
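The module isn't published yet either; roughly, it amounts to a few lines of systemd glue. A sketch, with every name, path, and value here being illustrative rather than the real module:

```nix
{ config, pkgs, ... }:

{
  systemd.services.rdap-crawler-monitor = {
    description = "log RDAP provenance for suspicious requests";
    wantedBy = [ "multi-user.target" ];
    after = [ "network-online.target" ];
    path = [ pkgs.wireshark-cli ];  # provides tshark
    environment = {
      IFACE = "eth0";
      FILTER = ''http.request.uri.path contains "/interesting" && http.x_forwarded_for'';
      EXPR = "http.x_forwarded_for";
    };
    serviceConfig = {
      ExecStart = "${./monitor.sh}";  # the Bash wrapper above, copied into the store
      # tshark needs capture privileges on the interface
      AmbientCapabilities = [ "CAP_NET_RAW" "CAP_NET_ADMIN" ];
      Restart = "on-failure";
    };
  };
}
```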

Results from the first few hours (I had already blocked the worst-offending subnets -- this trickle is comparatively slow):

22:20:59 cidrs=['94.74.80.0/20'] country=SG ip=94.74.93.108 name=Huawei-Cloud-SG
22:22:28 cidrs=['114.119.172.0/22'] country=SG ip=114.119.174.108 name=Huawei-Cloud-Singapore
22:24:14 cidrs=['166.108.192.0/19'] country=SG ip=166.108.201.248 name=Huawei-Cloud-SG
23:43:56 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:11:57 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:17:50 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:38:07 cidrs=['159.138.80.0/20'] country=SG ip=159.138.88.165 name=Huawei-SG-CLOUDS
01:39:10 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
02:02:30 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
02:42:54 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS

I plan to publish the scripts and NixOS module sooner or later.

postscript: robots.txt

As it turns out, I had originally misconfigured nginx: robots.txt was only served on the default HTTP vhost, meaning none of the crawlers I was looking at ever saw it -- amazonbot, at least, probably wasn't being impolite.