rdap crawler monitor
automated packet provenance for blacklisting impolite web crawlers
As part of a long train of dependencies and distractions involved in setting the CORS headers for ocularium's schematics, I discovered a number of web crawlers accessing pages my robots.txt told them not to, so I decided to IP blacklist them in my reverse proxy.
tcpdump -> wireshark
I don't generally leave http access logs on, and I figured there was probably a
less invasive and less stateful way to capture the source IPs I was interested in
than updating my nginx config. I went with a tcpdump capture initially,
inspecting the HTTP traffic behind my TLS termination in wireshark and looking
for X-Forwarded-For.
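Something along these lines, writing the decrypted backend traffic to a pcap for wireshark (the interface, backend port, and output file are stand-ins for my actual setup):

# capture the plaintext HTTP traffic between the TLS-terminating proxy and the backend
tcpdump -i lo -U -w /tmp/crawlers.pcap 'tcp port 8080'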
This became time-consuming: there were a lot of subnets to block, all coming from Huawei Cloud (Singapore) IPs, and I had to rerun the capture, analyze it, and add to my blacklist in repeated batches.
tshark
Once I'd had enough of this manual process I found tshark, a command-line
interface to wireshark's functionality, and was able to stream WHOIS results.
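The pipeline had roughly this shape (the capture filter, the Wireshark field name, and the grep pattern are approximations, not the exact commands):

# stream the capture straight through tshark and whois instead of saving a pcap
tcpdump -i lo -U -w - 'tcp port 8080' 2>/dev/null | \
  tshark -i - -l -Y http.request -T fields -e http.x_forwarded_for | \
  xargs -r -n1 whois | \
  grep -iE '^(netname|descr|country):'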
This hacked-together approach worked, but the results were pretty ugly and imprecise, and by this point I was consistently seeing enough volume that it seemed worth going the whole nine yards and packaging it up into a systemd service whose logs I can grep at my leisure.
full automation
First, tshark can take over tcpdump's responsibilities.
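Something like this captures live and prints the X-Forwarded-For of each request as it arrives (interface and backend port are again stand-ins; -l flushes output per packet so it streams):

tshark -i lo -f 'tcp port 8080' -l -Y http.request -T fields -e http.x_forwarded_for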
And then, to make the results nicer than the whois -> grep above and make the
ip -> owner association easy to read: it turns out RDAP is the new iteration of
WHOIS (basically the same info, just served as a REST JSON API), and there are
plenty of free services out there. I came across RIPE first and wrote a quick
Python script that reads IPs off stdin and logs (IP, CIDR, owner, country),
plus a Bash script to feed it.
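In sketch form (the RIPE endpoint, the cidr0_cidrs/name/country response fields, and the rdap-log.py name are placeholders rather than the published script):

#!/usr/bin/env python3
# Read one IP per line from stdin, look it up over RDAP, and print its ownership.
import json
import sys
import urllib.request

RDAP_URL = "https://rdap.db.ripe.net/ip/{}"  # RIPE's RDAP endpoint; swap in another RIR or rdap.org as needed

def lookup(ip):
    with urllib.request.urlopen(RDAP_URL.format(ip), timeout=10) as resp:
        data = json.load(resp)
    # cidr0_cidrs is the NRO extension carrying the covering prefixes, if present
    cidrs = [
        f"{c.get('v4prefix') or c.get('v6prefix')}/{c['length']}"
        for c in data.get("cidr0_cidrs", [])
    ]
    return f"cidrs={cidrs} country={data.get('country')} ip={ip} name={data.get('name')}"

for line in sys.stdin:
    ip = line.strip()
    if not ip:
        continue
    try:
        print(lookup(ip), flush=True)
    except Exception as exc:
        print(f"lookup failed ip={ip} err={exc}", file=sys.stderr, flush=True)

And the feeder, with the capture details equally placeholder: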
#!/usr/bin/env bash
# interface, port, and the logger's path are stand-ins for my actual setup
tshark -i lo -f 'tcp port 8080' -l -Y http.request -T fields -e http.x_forwarded_for \
  | ./rdap-log.py
And a systemd service unit, expressed as a NixOS module.
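A minimal sketch of that module, assuming the feeder lives at a fixed path (service name, path, and package list are placeholders):

{ pkgs, ... }:

{
  systemd.services.rdap-crawler-monitor = {
    description = "RDAP provenance logging for crawler source IPs";
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];
    path = [ pkgs.bash pkgs.tshark pkgs.python3 ];
    serviceConfig = {
      ExecStart = "/etc/rdap-monitor/feed.sh";
      Restart = "on-failure";
    };
  };
}

With that running, the log lines below are a journalctl -u rdap-crawler-monitor away.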
Results of the first few hours (I had already blocked the worst-offending subnets -- this is comparatively slow):
22:20:59 cidrs=['94.74.80.0/20'] country=SG ip=94.74.93.108 name=Huawei-Cloud-SG
22:22:28 cidrs=['114.119.172.0/22'] country=SG ip=114.119.174.108 name=Huawei-Cloud-Singapore
22:24:14 cidrs=['166.108.192.0/19'] country=SG ip=166.108.201.248 name=Huawei-Cloud-SG
23:43:56 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:11:57 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:17:50 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
00:38:07 cidrs=['159.138.80.0/20'] country=SG ip=159.138.88.165 name=Huawei-SG-CLOUDS
01:39:10 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
02:02:30 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
02:42:54 cidrs=['159.138.80.0/20'] country=SG ip=159.138.85.139 name=Huawei-SG-CLOUDS
I plan to publish the scripts and NixOS module sooner or later.
postscript: robots.txt
As it turns out, I had originally misconfigured nginx and only had robots.txt
for the default http vhost, meaning none of the crawlers I was looking at ever
saw it -- amazonbot probably wasn't being impolite.