
What exactly does that 96% mean, though? It means that on some fixed dataset you're achieving 96% accuracy. I'm baffled by the stupidity of claiming results (even high-profile researchers do this) based on fixed datasets, with models that are nowhere near as robust as the actual intelligence we take as the reference: humans. Take the model that makes you think "sentiment analysis is at 96%", come up with your own examples to apply a narrow Turing test to the model, and see if you still think sentiment analysis (or any NLP task) is anywhere near being solved. Also see: [1].
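
To make the "come up with your own examples" part concrete, here is a minimal sketch of the kind of narrow Turing test I mean. It assumes the HuggingFace transformers library and whatever default model its sentiment pipeline ships with; the probe sentences are my own, so treat it as illustrative only:

    # Hand-rolled "narrow Turing test": feed a supposedly ~96%-accurate
    # sentiment model a few probes a human would never get wrong.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")  # downloads the library's default model

    probes = [
        "The movie wasn't bad at all, I actually loved it.",       # negation
        "Great, another three-hour meeting. Just what I needed.",  # sarcasm
        "I expected to hate it, but the acting won me over.",      # contrast
    ]

    for text in probes:
        result = classifier(text)[0]
        print(f"{result['label']:>8}  ({result['score']:.2f})  {text}")

    # If the labels look random on probes like these, the headline 96% was
    # telling you about the dataset, not about sentiment being "solved".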

I think continual Turing testing is the only way of concluding whether an agent exhibits intelligence or not. Consider the philosophical problem of the existence of other minds. We believe other humans are intelligent because they consistently show intelligent behavior. Things that people claim to be examples of AI right now lack this consistency (possibly excluding a few very specific examples such as AlphaZero). It is quite annoying to see all these senior researchers along with graduate students spend so much time pushing numbers on those datasets without paying enough attention to the fact that pushing numbers is all they are doing.

[1]: As a concrete example, consider the textual entailment (TE) task. In the deep learning era of TE there are two commonly used datasets on which the current state-of-the-art has been claimed to be near or exceeding human performance. What these models perform exceptionally well on is not the general task of TE; it is the task of TE as evaluated on these fixed datasets. A recent paper by McCoy, Pavlick, and Linzen (https://arxiv.org/abs/1902.01007) shows just how brittle these systems are; at this point, the only sensible response to those who insist we are nearing human performance in AI is to laugh.
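
To give a flavour of the kind of probe that paper uses, here is a toy sketch. The sentences are my own (not the paper's data), and the "model" below is literally the lexical-overlap shortcut that the paper shows trained systems tend to fall back on:

    # HANS-style probes in the spirit of McCoy, Pavlick & Linzen (2019):
    # every word of the hypothesis also appears in the premise, so anything
    # that learned a lexical-overlap shortcut answers "entailment" even when
    # the correct label is "non-entailment".
    import re

    def words(sentence):
        return set(re.findall(r"[a-z]+", sentence.lower()))

    def overlap_heuristic(premise, hypothesis):
        return "entailment" if words(hypothesis) <= words(premise) else "non-entailment"

    probes = [
        ("The doctor near the actor danced.", "The actor danced."),
        ("The lawyer was advised by the judge.", "The lawyer advised the judge."),
        ("If the senator wins, the banker will resign.", "The senator wins."),
    ]

    for premise, hypothesis in probes:  # the correct label is "non-entailment" in every case
        print(f"{overlap_heuristic(premise, hypothesis):15s}  {premise} -> {hypothesis}")

    # The shortcut gets all three wrong; the paper's point is that
    # state-of-the-art TE models behave much the same way on such cases.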


> I think continual Turing testing is the only way of concluding whether an agent exhibits intelligence or not.

So you think it's impossible to ever determine that a chimpanzee, or even a feral child, exhibits intelligence? This seems rather defeatist.


No, interpreting "continual" the way you did would mean I should believe that we can't conclude our friends to be intelligent either (I don't believe that). Maybe I should've said "prolonged" rather than "continual".

Let me elaborate on my previous point with an example. If you look at the recent work in machine translation, you can see that the commonly used evaluation metric of BLEU is being improved upon at least every few months. What I argue is that it's stupid to look at this trend and conclude that soon we will reach human performance in machine translation. Even comparing against the translation quality of humans (judged again by BLEU on a fixed evaluation set) and showing that we can achieve higher BLEU than humans is not enough evidence. Because you also have Google Translate (let's say it represents the state-of-the-art), and you can easily get it to make mistakes that humans would never make. I consider our prolonged interaction with Google Translate to be a narrow Turing test that we continually apply to it. A major issue in research is that, at least in supervised learning, we're evaluating on datasets that are not different enough from the training sets.
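
For anyone who hasn't played with BLEU directly, here is a toy sketch of why pushing it up is not the same as pushing up translation quality. It assumes NLTK; the reference sentence and the two candidates are my own construction:

    # BLEU rewards n-gram overlap with a fixed reference, nothing more.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
    reference = ["the patient should not take this medication".split()]

    candidates = {
        "faithful human paraphrase": "patients must not take this drug".split(),
        "dropped negation (catastrophic)": "the patient should take this medication".split(),
    }

    for name, hypothesis in candidates.items():
        score = sentence_bleu(reference, hypothesis, smoothing_function=smooth)
        print(f"{name:32s} BLEU = {score:.2f}")

    # On toy cases like this the meaning-reversing candidate can out-score the
    # faithful paraphrase: a mistake no human translator would ever make, yet
    # invisible to the metric we keep celebrating improvements on.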

Another subtle point is that we have strong priors about the intelligence of biological beings. I don't feel the need to Turing test every single human I meet to determine whether they are intelligent; it's a safe bet at this point to just assume that they are. The output of a machine learning algorithm, on the other hand, is wildly unstable with respect to its input, and we have no solid evidence to assume that it exhibits consistent intelligent behavior; often it is easy to show that it doesn't.

I don't believe that research in AI is worthless, but I think it's not wise to keep digging in the same direction that we've been moving in for the past few years. With deep learning, while accuracies and metrics are pushed further than before, I don't think we're significantly closer to general, human-like AI. In fact, I personally consider only AlphaZero to be an unambiguous win for this era of AI research, and it's not even clear whether it should be called AI or not.


My comment was not on ‘continual’ but on ‘Turing test’.

If you gave 100 chimps of the highest calibre 100 attempts each, not a single one would pass a single Turing test. Ask a feral child to translate even the most basic children's book, and their mistakes will be so systematic that Google Translate will look like professional discourse. ‘Humanlike mistakes’ and stability with respect to input in the sense you mean here are harder problems than intelligence, because a chimp is intelligent yet functionally incapable of juggling more than the most primitive syntaxes in a restricted set of forms.

I agree it is foolish to just draw a trend line through a single weak measure and extrapolate to infinity, but the idea that no collation of weak measures has any bearing on fact rules out ever measuring weak or untrained intelligence. That is what I called defeatist.


I see your point, but you're simply contesting the definition of intelligence that I assumed we were operating with, which is humanlike intelligence. Regardless of its extent, I think we would agree that intelligent behavior is consistent. My main point is that the current way we evaluate artificial agents does not emphasize their inconsistency.

Wikipedia defines Turing test as "a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human". If we want to consider chimps intelligent, then in that context the definition of the Turing test should be adjusted accordingly. My point still stands: if we want to determine whether a chimp exhibits intelligence comparable to a human, we do the original Turing test. If we want to determine whether a chimp exhibits chimplike intelligence, we test not for, say, natural language but for whatever we want our definition of intelligence to include. If we want to determine whether an artificial agent has chimplike intelligence, we do the second Turing test. Unless the agent can display as consistent an intelligence as chimps, we shouldn't conclude that it's intelligent.

Regarding your point on weak measures: If I can find an endless stream of cases of failure with respect to a measure that we care about improving, then whatever collation of weak measures we had should be null. Wouldn't you agree? I'm not against using weak measures to detect intelligence, but only as long as it's not trivial to generate failures. If a chimp displays an ability for abstract reasoning when I'm observing it in a cage but suddenly loses this ability once set free in a forest, it's not intelligent.


I'm not interested in categorizing for the sake of categorizing, I'm interested in how AI researchers and those otherwise involved can get a measure of where they're at and where they can expect to be.

If AI researchers were growing neurons in vats and those neurons were displaying abilities on par with chimpanzees I'd want those researchers to be able to say ‘hold up, we might be getting close to par-human intelligence, let's make sure we do this right.’ And I want them to be able to do that even though their brains in vats can't pass a Turing test or write bad poetry or play basic Atari games and the naysayers around them continue to mock them for worrying when their brains in vats can't even pass a Turing test or write bad poetry or play basic Atari games.

Like, I don't particularly care that AI can't solve or even approach solving the Turing test now, because I already know it isn't human-par intelligent, and more data pointing that out tells me nothing about where we are and what's out of reach. All we really know is that we've been doing the real empirical work with fast computers for 20ish years now and gone from no results to many incredible results, and in the next 30 years our models are going to get vastly more sophisticated and probably four orders of magnitude larger.

Where does this end up? I don't know, but dismissing our measures of progress and improved generality with ‘nowhere near as robust as [...] humans’ is certainly not the way to figure it out.

> If I can find an endless stream of cases of failure with respect to a measure that we care about improving, then whatever collation of weak measures we had should be null. Wouldn't you agree?

No? Isn't this obviously false? People can't multiply thousand-digit numbers in their heads; why should that in any way invalidate their other measures of intelligence?


>no results to many incredible results

What exactly is incredible (relatively) about the current state of things? I don't know how up-to-date you are on research, but how can you be claiming that we had no results previously? This is the kind of ignorance of previous work that we should be avoiding. We had the same kind of results previously, only with lower numbers. I keep trying to explain that increasing the numbers is not going to get us there because the numbers are measuring the wrong thing. There are other things that we should also focus on improving.

>dismissing our measures of progress and improved generality with ‘nowhere near as robust as [...] humans’ is certainly not the way to figure it out.

It is the way to save this field from wasting so much money and time on coming up with the next small tweak to get that 0.001 improvement in whatever number you're trying to increase. It is not a naive or spiteful dismissal of the measures, it is a critique of the measures since they should not be the primary goal. The majority of this community is mindlessly tweaking architectures in pursuit of publications. Standards of publication should be higher to discourage this kind of behavior. With this much money and manpower, it should be exploding in orthogonal directions instead. But that requires taste and vision, which are unfortunately rare.

>People can't multiply thousand-digit numbers in their heads; why should that in any way invalidate their other measures of intelligence?

Is rote multiplication a task that we're interested in achieving with AI? You say that you aren't interested in categorizing for the sake of categorizing, but this is a counterexample for the sake of giving a counterexample. Avoiding this kind of example is precisely why I said "a measure that we care about improving".


> What exactly is incredible (relatively) about the current state of things?

Compared to 1999?

Watch https://www.youtube.com/watch?v=kSLJriaOumA

Hear https://audio-samples.github.io/#section-4

Read https://grover.allenai.org/

These are not just ‘increasing numbers’. These are fucking witchcraft, and if we didn't live in a world with 5 inch blocks of magical silicon that talk to us and giant tubes of aluminium that fly in the sky the average person would still have the sense to recognize it.

> It is the way to save this field from [...]

For us to have a productive conversation here you need to either respond to my criticisms of this line of argument or accept that it's wrong. Being disingenuous because you like what the argument would encourage if it were true doesn't help when your argument isn't true.

> Is rote multiplication a task that we're interested in achieving with AI?

It's a measure for which improvement would have meaningful positive impact on our ability to reason, so it's a measure we should wish to improve all else equal. Yes, it's marginal, yes, it's silly, that's the point: failure in one corner does not equate to failure in them all.


>These are not just ‘increasing numbers’. These are fucking witchcraft, and if we didn't live in a world with 5 inch blocks of magical silicon that talk to us and giant tubes of aluminium that fly in the sky the average person would still have the sense to recognize it.

What about generative models is really AI, other than the fact that they rely on some similar ideas from machine learning that are found in actual AI applications? Yes, maybe to an average person these are witchcraft, but any advanced technology can appear that way: Deep Blue beating Kasparov probably was witchcraft to the uninitiated. This is curve fitting, and the same approaches in 1999 were also trying to fit curves; it's just that we can fit them far better now. Even the exact methods used to produce your examples are not fundamentally new; they are just the same old ideas with the same old weaknesses. What we have right now is a huge hammer, and a hammer is surely useful, but it is not the only thing needed to build AI. Calling these witchcraft is a marketing move that we definitely don't need: it creates unnecessary hype and hides the simplicity and the naivete of the methods used to produce them. If anybody else reads this: these are just increasing numbers, not witchcraft. But as the numbers increase, it takes a little more effort and knowledge to debunk them.

I'm not dismissing things for the fun of it, but it pains me to see this community waste so many resources in pursuit of a local minimum due to lack of a better sense of direction. I feel like not much more is to be gained from this conversation, although it was fun, and thank you for responding.


I appreciate you're trying to wind it down so I'll try to get to the point, but there's a lot to unpack here.

I'm not evaluating these models on whether they are AGI, I am evaluating them on what they tell us about AGI in the future. They show that even tiny models, some 10,000 to 1,000,000 times smaller than what I think are the comparable measures in the human brain, trained with incredibly simple single-pass methods, manage to extract semirobust and semantically meaningful structure from raw data, are able to operate on this data in semisophisticated ways, and do so vastly better than their size-comparable biological controls. I'm not looking for the human, I'm looking for small-scale proofs of concept of the principles we have good reasons to expect are required for AGI.

The curve fitting meme[1] has gotten popular recently, but it's no more accurate than calling Firefox ‘just symbols on the head of a tape’. Yes, at some level these systems reduce to hugely dimensional mathematical curves, but the intuitions this brings are pretty much all wrong. I believe this meme has gained popularity due to adversarial examples, but those are typically misinterpreted[2]. If you can take a system trained to predict English text, prime it (not train it) with translations, and get nontrivial-quality French-English translations, dismissing it as ‘just’ curve fitting is ‘just’ the noncentral fallacy.
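
For concreteness, here is roughly what I mean by priming. This is a sketch using the publicly released GPT-2 weights via the transformers library as a small stand-in for the models this effect was reported with, so expect the output to be rough:

    # "Priming, not training": give a plain language model a few
    # French -> English pairs as context and let it continue the pattern.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prompt = (
        "French: le chat dort sur le canapé. English: the cat is sleeping on the couch.\n"
        "French: il pleut depuis ce matin. English: it has been raining since this morning.\n"
        "French: je voudrais un café, s'il vous plaît. English:"
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        do_sample=False,                      # greedy decoding, for reproducibility
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))

    # No gradient step touches the weights; whatever translation behaviour
    # appears comes entirely from conditioning on the prompt.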

Fundamental to this risk evaluation is the ‘simplicity and the naivete of the methods used in producing them’. That simple systems, at tiny scales, with only inexact analogies to the brain, based on research younger than the people working on it, are solving major blockers in what good heuristics predict AGI needs is a major indicator of the non-implausibility of AGI. AGI skeptics have their own heuristics instead, with reasons those heuristics should be hard to satisfy, but when you calibrate against the only existence proof we have of AGI development—human evolution—those heuristics are clearly and overtly bad heuristics that would have failed to trigger. Thus we should ignore them.

[1] Similar comments apply to ‘the same approaches in 1999’, another meme that is only true at the barest surface level. Scale up 1999 models and you get poor results.

[2] See http://gradientscience.org/adv/. I don't agree with everything they say, since I think the issue relates more to the NN's structure encoding the wrong priors, but that's an aside.


It is no longer great (it used to be, around Mavericks). The current PDFKit is broken and renders PDFs blurry, with the effect more pronounced on low PPI screens. This holds for iOS as well, since the same PDFKit is used, but because those devices have extremely high PPI screens the bug goes unnoticed by all but those looking for it or those sensitive to such imperfections. I think the state of PDF rendering on macOS right now is unacceptable.


Do you have screenshots that can demonstrate this? I'm surprised to hear it, I'm fairly picky about visual quality but I've never noticed an issue with PDFs on MacOS or iOS. Never used a low PPI screen with them though.


Some of the many reports:

https://apple.stackexchange.com/questions/295427/what-causes...

https://discussions.apple.com/thread/8531638

https://twitter.com/deivi_dragon/status/938663751158501376

https://www.reddit.com/r/apple/comments/72lxjx/macos_high_si...

It is less visible on a high PPI screen, although it is still there and noticeable if you know it exists. Unfortunately I know it exists, can see it, and am bothered by it, so I stopped using my MacBook and left it to collect dust.


I've never seen this issue myself. I'm on High Sierra now.

In Yosemite PDF rendering performance was crap for me in some high resolution documents for print, but other than that it has been flawless for as long as I've been using macOS.

I've only seen the issue shown in the Reddit post (which looks like some rendering lag) in Android actually.


As I said, it's difficult to see on a high PPI screen unless you're looking for it. High Sierra, I think, is the first version where this bug was introduced, so you almost certainly have it; it just isn't a drastic enough change to bother most people on a MacBook's screen. The rendering lag is a separate issue that was introduced along with the one I'm referring to, and the two are probably related.


I have to admit I haven't used a low DPI screen in macOS since Maverick times, so you might be right. PDFs in Preview look perfectly fine to me.

Maybe Apple actually optimized rendering for hiDPI displays since probably that's where the majority of users are these days.


Oh wow, that definitely looks terrible. I totally understand not liking MacOS due to that. Thankfully it's much less pronounced on high DPI screens? I've never owned a low DPI macbook so I've never really encountered it before.


Don't you ever present anything on a projector?

Many professors in our department have an MBP, and their LaTeX presentations look bad, simply because macOS renders them badly. I notice it every single time, and sometimes (without me saying anything; I just tolerate it and don't make a sound) they notice it too, wondering whether their eyesight has gotten that much worse or something.

The only macOS device I have is my girlfriend's hand-me-down 2015 MBA. PDFs look like crap in Preview and in many other ".app"s I have tried. SumatraPDF running under Wine renders them properly, though. Yeah, I'd say Preview simply does not work properly at this point. A shame, but also fun to watch from the Windows side.


Many Windows users consider macOS font rendering at low DPI blurry, and Windows font rendering crisp.

It's just what they are used to.


This has nothing to do with that. In fact I am a "Mac user" driven to Windows solely because of this issue.

Aside from the blurry PDFs: I also used to like the font rendering in OS X and preferred it to Windows, but unfortunately that too got screwed up in Mojave after they disabled subpixel rendering.


Are you serious about the PDF viewers? Windows has Sumatra, Xodo, and Drawboard, all of which are fine, and the first two are free.

Mac OS on the other hand, has had a broken pdfkit implementation since around Sierra, and any PDF viewer that relies on it (Preview, Skim) has broken and blurry rendering of which there are several reports online. This makes any PDF reader other than Adobe's (which is an abomination) unusable. This has gone mostly unnoticed or ignored as the blur is reduced by the high PPI screens, although it is there and becomes much more apparent on a lower PPI external screen.

This single fact makes Mac OS unusable for me. Windows has its own faults, but at least I can read crisp PDFs on it.


Although it's a bit of a pig, Acrobat Reader DC is still the best thing out there. I bought and really love Drawboard, but since there's no reliable way to gracefully make it persist as the default PDF app for the OS (and consequently, the browsers), I just went back to DC. (This is a general problem with "modern" apps - they can't really be made to act as reasonable default apps for Win32 apps, AFAIK.)

That said, for all its warts, the Win10/WSL combo is light-years ahead of MacOS. (I had to use a Mac for a recent client engagement, and truly felt like I'd been cast back to the turn of the century. Seriously, MacOS is really primitive, especially for dev, compared to Win10/WSL.) But more than that, I'll just never go back to a computer without really awesome pen and touch support. The only problem I've had with several versions of Surface Pro hardware is Intel's execrable Skylake power management cluster-foxtrot.

