There is no stupid "hey I've an idea!" comment in this thread, so I'll offer one.
An alternative (probably already proposed?) could be the following: if you have a large set of human-tagged images or videos, you could show one of these images to the user along with, say, eight sets of tags, and ask which set of tags best applies to the image above (of course only one set is really about the image; the other sets are random). That is three bits per image; do this a few times and the probability of a computer passing by random guessing is very low.
Every time you show an image you could crop and rotate it a bit and apply a filter, so that manually building a lookup table is hard; but if you have a big set of images, like Google could have, maybe this is not needed.
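To make the "three bits per image" arithmetic concrete, here is a rough sketch of how such a challenge could be built. The `tagged_images` dictionary, the function names, and the image ids are all made up for illustration; nothing here comes from an existing captcha library.

```python
import random

def make_challenge(image_id, tagged_images, n_choices=8, rng=random):
    """Build one challenge: the real tag set hidden among random decoys.

    tagged_images is a hypothetical dict mapping image id -> tuple of tags.
    Returns (choices, correct_index).
    """
    correct = tagged_images[image_id]
    # Decoys are the tag sets of other images in the corpus.
    others = [tags for key, tags in tagged_images.items() if key != image_id]
    choices = rng.sample(others, n_choices - 1) + [correct]
    rng.shuffle(choices)
    return choices, choices.index(correct)

def random_guess_success(n_choices=8, rounds=1):
    """Probability that uniform random guessing passes every round."""
    return (1.0 / n_choices) ** rounds

if __name__ == "__main__":
    corpus = {f"img{i}": (f"tag{i}a", f"tag{i}b") for i in range(20)}
    choices, answer = make_challenge("img3", corpus)
    print(choices[answer])
    print(random_guess_success(8, 4))  # 8 choices = 3 bits, 4 rounds = 12 bits
```

With eight choices and four rounds a blind guesser passes once in 4096 tries. This only covers random guessing; a bot that can actually recognise the image content is a different problem.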
I think the problem we will always run into is that any human-performable task will either be cracked by someone writing a bot, or made trivially inexpensive by apps that charge as little as $0.00139 per solved captcha (see [1]). Microsoft has implemented a tagging-type captcha, ASIRRA (see [2]), for which you can hire out the solving for $0.004 per solved captcha [3].
I think the only real solution is to make it cost real money (say $0.25 or $0.10) to perform whatever action you are protecting, so that repeated attempts are prohibitively expensive, but one or two attempts by a legitimate user are not too expensive. Otherwise, financially-driven spammers will always find a way to inexpensively circumvent the protection.
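For a sense of scale, here is a quick back-of-the-envelope calculation using the prices quoted in this thread. The one-million-attempt campaign size is just an assumed figure for illustration, and `campaign_cost` is a made-up helper name.

```python
from decimal import Decimal

# Per-action prices mentioned in the thread (US dollars).
PRICES = {
    "captcha solving service": Decimal("0.00139"),
    "ASIRRA solving service": Decimal("0.004"),
    "proposed $0.10 fee": Decimal("0.10"),
    "proposed $0.25 fee": Decimal("0.25"),
}

def campaign_cost(price_per_action, attempts):
    """Total cost of performing the protected action `attempts` times."""
    return price_per_action * attempts

if __name__ == "__main__":
    attempts = 1_000_000  # assumed spam campaign size
    for label, price in PRICES.items():
        print(f"{label}: ${campaign_cost(price, attempts):,.2f}")
```

A million attempts cost $1,390 at the cheapest solver price but $100,000 with a $0.10 fee, while a legitimate user making two attempts pays $0.20.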
I wondered if it would be possible to use something like Hashcash to require a certain level of CPU usage from the user agent? It doesn't stop spam, but it's unobtrusive to the user and would slow spammers down.
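Roughly, the idea could look like the sketch below. This is not the real Hashcash stamp format, just a simplified proof-of-work in the same spirit: the client must find a nonce whose SHA-256 hash, combined with a server-issued challenge, starts with a given number of zero bits. The challenge string, the function names, and the difficulty value are assumptions for illustration.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()

def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force a nonce (about 2**bits hashes on average)."""
    for nonce in count():
        if leading_zero_bits(_digest(challenge, nonce)) >= bits:
            return nonce

def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: a single hash is enough to check the work."""
    return leading_zero_bits(_digest(challenge, nonce)) >= bits

if __name__ == "__main__":
    challenge = "comment-form-1234"  # hypothetical per-form token from the server
    nonce = solve(challenge, bits=16)
    print(nonce, verify(challenge, nonce, bits=16))
```

The asymmetry is the point: solving takes many hashes, checking takes one. The server would also need to issue a fresh challenge per form and reject reused ones, otherwise a spammer could solve it once and replay the answer.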
Once I found a nice captcha replacement (can't remember where it was, though). It worked like this: at the end of the form, there were 6 to 9 playing cards and the text "Click on the seven of spades". This fitted the theme of the website, which had something to do with poker or magic tricks (can't really remember what it was), but this can be done with a lot of other stuff that is well known by people and not by robots (e.g. 6 photos of animals and the question "Click on the dog", etc.)