
Paul, I made something similar - http://bardagjy.com/?p=1639

I had the same problem with JavaScript, so I used Selenium to drive Chrome and take a screencap of the page. Then I used k-means clustering with EM to reduce each page to its constituent colors.
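For anyone curious what that clustering step looks like, here's a minimal sketch of hard-assignment k-means over RGB pixels. This is just an illustration, not the parent's actual code: it initializes from the first k distinct pixels (so the toy example is deterministic) instead of random init, and skips the soft EM weighting.

```python
def kmeans_colors(pixels, k=3, iters=20):
    """Cluster a list of (r, g, b) tuples into k dominant colors.

    Plain hard-assignment k-means, initialized from the first k
    distinct pixels so this sketch is deterministic.
    """
    centers = []
    for p in pixels:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    for _ in range(iters):
        # Assignment step: each pixel joins its nearest center.
        clusters = [[] for _ in centers]
        for p in pixels:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(ch) / len(members)
                                   for ch in zip(*members))
    return centers
```

Feeding it, say, 50 red, 50 blue, and 100 white pixels recovers those three colors as the cluster centers.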

I scraped the top 100 and top 1,000 of the Alexa top 1M. Cool to see another approach, great work!



Replying to myself with a quote from my article:

"It’s easy to notice a bug when examining the colors for Google (note, this is normal google.com not a doodle). Notice how the three colors are light gray, dark gray, and white – not the typical red, green, blue, yellow color scheme. Why? Well, when the image screenshot is resized to 320 x 240 pixels for processing, the colors are dithered. The number of pixels in the new image that lie between red, green, blue, yellow and white – the dominant background color – is much larger than the number of pixels that are colored. Because of dithering, those between pixels are closer to shades of gray, than colors, and thus the k-means clustering (with EM) finds shades of gray and white to be the “color of Google”. I’m not sure if this is a bug.. what do you think?"
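The averaging the quote describes is easy to reproduce with a toy box-filter downscale (a simplification of what a real resize does). Averaging a checkerboard of red and white yields a washed-out in-between color, which is exactly the kind of pixel that pulls k-means centers toward gray:

```python
def box_downscale(pixels, w, h, factor):
    """Downscale a row-major list of (r, g, b) pixels by averaging
    factor x factor blocks (a crude stand-in for image resizing)."""
    out = []
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            block = [pixels[(by + dy) * w + (bx + dx)]
                     for dy in range(factor) for dx in range(factor)]
            # The block mean lies between its colors, e.g. red + white
            # averages to a desaturated pink.
            out.append(tuple(sum(ch) / len(block) for ch in zip(*block)))
    return out
```

A 2x2 red/white checkerboard collapses to a single pixel of (255, 127.5, 127.5), neither red nor white.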


Hey Andy,

That's awesome! I figured someone else must have had the same idea before me. :)

I think your screenshot scraping technique is probably more accurate than my text parsing. I also like that you used a larger sample size. I plan to experiment with groups of 100 and 1000.

Thanks for sharing! It's always interesting to see how different people achieve similar goals.

I'd like to begin scraping the images on the sites soon too. When I've got a good chunk of time I'll look through your source code for inspiration. Mind if I reach out with questions when I do?

EDIT: I also really enjoy those woodblock prints! Now I want to somehow print my data for the top ten sites onto canvas.


Sure - I think the git repo is dead, I'll resurrect it if you're interested.


Yeah, that would be great. Thanks!


Rather than resizing to 320x240, pick that number of pixels randomly. For even better results, use some method of variance reduction, e.g. divide the screen into n squarish rectangles and pick N/n pixels from each rectangle.
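That stratified sampling could be sketched like this (assuming, for simplicity, that n is a perfect square so the grid divides evenly):

```python
import random

def stratified_sample(w, h, n_cells, total, seed=0):
    """Pick ~total pixel coordinates spread evenly over a grid of
    n_cells rectangles (n_cells assumed to be a perfect square)."""
    rng = random.Random(seed)
    side = int(n_cells ** 0.5)
    per_cell = total // (side * side)
    coords = []
    for gy in range(side):
        for gx in range(side):
            # Bounds of this grid cell in pixel coordinates.
            x0, x1 = gx * w // side, (gx + 1) * w // side
            y0, y1 = gy * h // side, (gy + 1) * h // side
            for _ in range(per_cell):
                coords.append((rng.randrange(x0, x1),
                               rng.randrange(y0, y1)))
    return coords
```

Sampling pixel values at those coordinates avoids the resize step entirely, so no dithered in-between colors are introduced, and the grid guarantees every region of the page contributes equally.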


I'd second using selenium / webdriver for this. You wouldn't need a great deal of code to get the raw pixel data.

The area I suspect will need thought is how one gives "value" to distinct colours. It's tempting to go by the amount of screen space taken up, but that ignores the way the eye and mind work - a red dash in a sea of another colour still needs to register, as the red could be critical. And then there are sections where the colours are probably low in importance (e.g. the colour of legal text in a footer).
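One crude way to push past pure screen-area weighting is to give saturated pixels extra weight, so a small splash of vivid red counts for more than an equal area of near-grey text. This is purely a hypothetical heuristic to illustrate the idea, not anything from the thread:

```python
def weighted_color_counts(pixels):
    """Tally (r, g, b) colours, boosting each pixel's weight by its
    saturation (hypothetical heuristic: weight = 1 + saturation)."""
    counts = {}
    for r, g, b in pixels:
        mx, mn = max(r, g, b), min(r, g, b)
        # HSV-style saturation: 0 for greys, 1 for fully vivid colours.
        saturation = 0.0 if mx == 0 else (mx - mn) / mx
        counts[(r, g, b)] = counts.get((r, g, b), 0.0) + 1.0 + saturation
    return counts
```

With equal pixel counts of mid-grey and pure red, the red ends up with twice the weight. A real saliency model would be far more involved, but even this would keep a small vivid accent from vanishing behind a large neutral background.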


Yeah, that seems like a great option. I'm not sure how to get around the dithering bug that dynode ran into, though. I'll have to do some research before I get started on V2.




