
On the subject of raw computing power, if you live in the US you've probably heard about some NSA or CIA data facility being installed in your general region, and how the local power company built new infrastructure just to power the building. If Google can throw 2000 cores at securing software, how many can a government throw at breaking it, e.g. in preparation for the next iteration of Stuxnet?
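For context, the corpus-driven fuzzing the article describes boils down to something like the sketch below: take a known-good .swf, flip a few bytes, run it through the player, and watch for crashes, then repeat that across thousands of cores. This is only an illustration; the corpus path and the standalone "flashplayer" binary name are assumptions, not details from the article.

    import random
    import subprocess

    # Minimal mutation-fuzzing sketch. The seed path and player binary are
    # hypothetical; a real campaign would track coverage and minimize inputs.
    def mutate(data: bytes, flips: int = 8) -> bytes:
        buf = bytearray(data)
        for _ in range(flips):
            buf[random.randrange(len(buf))] ^= 1 << random.randrange(8)
        return bytes(buf)

    seed = open("corpus/seed.swf", "rb").read()

    for i in range(1000):
        with open("fuzz.swf", "wb") as f:
            f.write(mutate(seed))
        try:
            proc = subprocess.run(["flashplayer", "fuzz.swf"],
                                  capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            continue  # a hang, not a crash; move on to the next mutation
        if proc.returncode < 0:  # killed by a signal -> likely a crash
            print(f"iteration {i}: crash, signal {-proc.returncode}")
            break

The interesting part is that the loop itself is trivial; the scarce resources are the seed corpus and the cores to run it on, which is the point being made above.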


The cluster is interesting, but not as interesting as the giant corpus of SWF files Google got to use. Do you think the government has a crawl as complete as Google's under its hat? How? People notice when the Googlebot does new things. Wouldn't we have noticed the Fedbot?


Quite true. Google does have a lot of data. But I'd wager the NSA has just as much, just from different sources. Maybe they couldn't fuzz Flash with the optimal set of .swf files, but they could mine vast numbers of voice conversations for correlations.

Additionally, years ago a friend I'd lost contact with caught up with me and told me he'd found a cached copy of a website I'd taken down in his employer's equivalent of the Wayback Machine. His employer was a branch of the federal government. I know my anecdote doesn't prove anything, let alone address the difficulty of crawling the web without anyone noticing (intercepting all HTTP traffic in transit?), but the fact remains that there are literally tons of computers doing something for the government.


Perhaps Fedbot crawls in a less deterministic manner, uses a lot of different IPs, and sets its user agent to IE? A quick sketch of that last trick is below.
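A crawler that wants to blend in just sends whatever User-Agent header it likes; something along these lines would do it (the IE string and example.com URL are placeholders, and urllib is used purely for illustration):

    import urllib.request

    # Fetch a page while masquerading as Internet Explorer instead of
    # announcing a crawler identity. The UA string is an example IE8 token.
    req = urllib.request.Request(
        "http://example.com/",
        headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()
    print(len(html), "bytes fetched")

Spread that across enough IP ranges and it looks like ordinary browser traffic rather than a crawl.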


I suspect "fedbot" works by calling up google and saying "Hi, it's us again, we've got another white van on the way to the googleplex, have a petabyte or two of the Internet ready for us to collect in 20 minutes. thanks"



