I'm interested in the technology behind your crawler; you could potentially use it to discover many more things, like which sites use popular APIs, etc. What language is it written in? What do you use for the backend/DB? How fast is it?
I'm working on a project involving similar large scale crawling and I would love to know more.
It's a C# console application that logs to a SQL Server Express database. It's fairly primitive and the source isn't anything I'd want to advertise. I'd be happy to share it with you if you don't mind C# and want to give it a whirl yourself, though.
That site is nice, but they also want to charge me close to $2000 for a list of sites using a single technology. I could definitely do this myself for much less.
To be honest, I didn't dig into that site deeply enough to notice that. I pulled this out of an email I was sent just yesterday.
Sure, I'd be interested in collaborating if there's not one out there already.
The spider part is easy: you just need a web client (e.g. Ruby's or Python's Mechanize, Java's HTTPClient, even just wget or curl) coupled with an HTML parser (Hpricot, Nokogiri, Tidy, etc., or even some basic regular expressions). You can readily hack something rough together in an hour or two. Gabriel might have a lot of the data, and certainly the code, used to produce DuckDuckGo, but he may have good reasons to keep that private.
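To make that concrete, here's a rough Python sketch of the kind of thing I mean. The seed list, user agent, and output are just placeholders; a real crawler would obviously also need politeness (robots.txt, rate limiting), retries, and persistence.

    # Minimal spider sketch: fetch each site's front page and pull out
    # external script URLs with a basic regular expression.
    import re
    import urllib.request

    SEEDS = ["http://example.com", "http://example.org"]  # placeholder domains

    SCRIPT_SRC = re.compile(r'<script[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)

    def fetch(url, timeout=10):
        req = urllib.request.Request(url, headers={"User-Agent": "tech-survey/0.1"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")

    for site in SEEDS:
        try:
            html = fetch(site)
        except Exception as exc:
            print(site, "error:", exc)
            continue
        for src in SCRIPT_SRC.findall(html):
            print(site, "loads script:", src)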
The harder part, and the part I wonder whether BuiltWith gets right, is the technology detection itself. Things like JavaScript libraries or CSS frameworks may be fairly easy to detect, but reliably detecting some of the server-side technologies is not trivial. I recently put together a script to survey the operating system and web server in use at a large number of domains from Alexa's top million list (similar to what Netcraft does), and plenty of servers make even that difficult, let alone determining whether a site is built with Ruby, Java, or PHP. There are HTTP headers that could tell you, but not everyone uses them. There are certain signatures that give a pretty good clue, but those aren't always present and can be downright misleading. (I've seen sites that migrated from ASP to Java Servlets, for example, but kept .aspx URLs to avoid breaking links.)
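To illustrate, here's a rough sketch of the header/signature approach in Python. The hint tables are just a handful of examples I'd start from, and as I said above, any of them can be absent or misleading.

    # Rough technology guessing from response headers and session-cookie
    # names. Incomplete by design, and easily fooled (e.g. a site that
    # kept .aspx URLs after moving off ASP.NET).
    import urllib.request

    POWERED_BY_HINTS = {"php": "PHP", "asp.net": "ASP.NET", "servlet": "Java"}
    COOKIE_HINTS = {"phpsessid": "PHP", "jsessionid": "Java", "asp.net_sessionid": "ASP.NET"}

    def guess_stack(url, timeout=10):
        req = urllib.request.Request(url, headers={"User-Agent": "tech-survey/0.1"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            guesses = set()
            server = resp.headers.get("Server")
            if server:
                guesses.add("server: " + server)
            powered = (resp.headers.get("X-Powered-By") or "").lower()
            for needle, tech in POWERED_BY_HINTS.items():
                if needle in powered:
                    guesses.add(tech)
            cookies = " ".join(resp.headers.get_all("Set-Cookie") or []).lower()
            for needle, tech in COOKIE_HINTS.items():
                if needle in cookies:
                    guesses.add(tech)
            return guesses

    print(guess_stack("http://example.com"))  # placeholder URL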
If I remember correctly, someone posted a JavaScript framework survey based on a similar spidering approach on HN a while back; you might be able to find it on searchyc.