OpenAI Finally Allows ChatGPT Complete Internet Access (gizmodo.com)
133 points by gmays on Oct 24, 2023 | hide | past | favorite | 114 comments


I've been relatively unimpressed with the ChatGPT browse mode.

My problem with it is that it seems to run really obvious, naive searches. Most of the time I ask it something, then see what it's searching for and think "oh no, that's not going to return anything more useful than what I could have found myself".

It's also pretty slow.

I'm very much looking forward to having a search assistant which can go ahead and wade through eg 20 websites about home battery packs and show me a summary table of my options - ChatGPT Browse isn't quite that yet.


Metaphor (https://metaphor.systems) has a ChatGPT plugin that works pretty well. I asked for home battery systems, here's what I got: https://chat.openai.com/share/3ce0687d-6ffa-4c02-8957-d8787a...


Agreed. Its memorized knowledge generally seems to be better than its interpretation of a few web search results, often significantly so.

Analyzing and interpreting new information is harder than regurgitating what one already knows, so perhaps this isn’t surprising.

Even gpt-4, as amazing as it is, has quite limited analyzing, reasoning, and interpretation capabilities, compared to skilled humans.

Of course, the fact that we have to even ask this question is kind of unbelievable…


Have you tried perplexity.ai? It has become my go-to for LLM + browsing. It is really fast and does a pretty good job with retrieval and summarization.


ChatGPT, Claude and Perplexity are each $20/mo.

ChatGPT (GPT-4) seems most powerful, but Claude offers a much larger context window, which can be used to summarize entire books.

Perplexity can browse and summarize the web.

Which one should I subscribe to?


Kagi Ultimate is $25 a month and has Claude, GPT-4, and AI web search. I've been using it a month or so and it's great.


Do you have a source for that? All I can find is here [0], which doesn't mention GPT4 or Claude.

[0]https://help.kagi.com/kagi/plans/ultimate-plan.html


> For Starter and Professional members this mode uses gpt-3.5-turbo, and for Ultimate members it uses gpt-4.

https://help.kagi.com/kagi/ai/assistant.html


Thanks! That is good to know.


I wonder why they don't list that under the Ultimate plan? On the pricing page, the selling point for Ultimate is basically: hey, help us test things and help us by giving us some money: https://kagi.com/pricing


It is in closed beta.


How is Kagi for niche stuff? I'm frequently searching for niche and hard to find programming information about processors and whatnot and google/DDG have only gotten worse with time.


I use Kagi. Probably not better than Google, but often with less junk. It's much better than DDG. The extra features are nice, as are the customizations (boosting specific domains, etc.). I thought I'd be !g-ing to find things on Google, but I rarely do, and I'm constantly researching niche stuff. When I do try !g, I generally find that Google doesn't have the results either. Kagi also seems to do better with really old results that Google has memory-holed.


Can also try Phind.com


You can try it for free. That's the only way you'll know if it suits you.


Wait what? How did I not know this. I need to consolidate my subs into Kagi. Thanks for the info!


Which one is powering their bot which does citations?


I believe it's Claude 2.


The only bummer is that it doesn't have DALLE.


Bing has DALL-E 3 access for free.


Just tried it; the free version is really slow (it's been going for 7 minutes now just to generate the first image and still isn't done) compared with what I'm getting from OpenAI. I'll keep the $20/mo on that, it's worth it.


Do you know what model perplexity uses under the hood?

I just tried it out with my normal test queries and found it to be far worse than Google's Bard.


To be fair, Bard has massively improved recently and can in my brief experience even outperform 3.5-turbo for lots of things.


Except that it still has a short context size.


I'd disagree by far. I use Perplexity as my main search engine and am a paid subscriber. Copilot with GPT-4 or Claude is fantastic.


Paid accounts can use the Perplexity LLM, GPT-4, or Claude 2.


Not in the past few months; I should spend more time with it.


I love Perplexity.


It’s also pretty useless at understanding any page that’s even remotely complex.

I asked it to tell me when the next baseball home game is so I can avoid traffic when going to the downtown library and it couldn’t answer even that basic question. The search results defaulted to the team’s official calendar but the day numbers were displayed as images so they could style them in the team’s font, making them invisible to ChatGPT.


This example is more likely due to that specific page/site lacking any accessibility information (i.e., the markup that lets people who are blind, deaf, or have other disabilities use the internet), rather than the page being large or complex.

Relatedly, in my experience, for some reason most internet properties relating to anything sports related (MLB, NFL, NHL, or even small town little league) have absolutely terrible, extremely over complicated user interfaces.


How come the lack of disability info would make it hard for ChatGPT to interpret the page? I think I missed a step there.


Because on accessible pages there is alt-text for all graphical elements so that screen readers can get the content for visually impaired users.
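As a sketch of what a text-only agent actually recovers from images, here's a minimal alt-text extractor using Python's standard-library html.parser (the HTML snippet and class name are made up for illustration):

```python
from html.parser import HTMLParser

class AltTextExtractor(HTMLParser):
    """Collect what a text-only agent can recover from <img> tags."""
    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            # Images without alt text are invisible to screen readers
            # (and to a text-only LLM agent).
            self.alts.append(alt if alt else "<missing alt text>")

# Hypothetical schedule page: one accessible image, one day-number
# rendered as a styled image with no alt attribute.
html = (
    '<img src="logo.png" alt="Team logo">'
    '<img src="day-17.png">'
)
parser = AltTextExtractor()
parser.feed(html)
print(parser.alts)  # ['Team logo', '<missing alt text>']
```

The day number the original commenter needed is exactly the second case: the pixels carry the information, the markup carries nothing.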


ChatGPT seems to be able to parse images pretty well now. I wonder why they don't simply feed it a screencap of the rendered page instead of the html.


Takes much longer to make a screencap of an html page. Modern slow JS and many subresources coming from a slow server, combined with the fact there is no reliable signal to say a page is done loading, generally mean you won't be getting a screencap in less than about 10 seconds.


No alt attributes or does GPT miss those?


https://kagi.com/fastgpt?query=What+exactly+is+the+chat+cont...

Kagi's FastGPT is pretty good in that regard (it uses Kagi search + Claude from Anthropic) and it's really fast. Their other AI search/chat, which is currently in beta (for Ultimate users), is even better IMO.


Have you tried phind.com? It even cites its sources.


Or perplexity.ai which I find better than phind. Faster, looks better, more features.


Perplexity now requires dismissing 2 prompts before use, every single time. (Install their iOS app. Log in with Google.)


Why does every successful web app get plagued by these things?


Wish it covered my use case: go to website foo, copy its CSS, and apply it to my framework of choice.


Bard is good for quick searches


Tried Browse with Bing a few times. Unlike normal conversations with Bing, the results from Browse with Bing, aside from presumably being more up to date, seemed very generic, as if I had simply searched myself, rather than it using its full power as normal on the latest data. It also took a while to run. So it's a start, and hopefully the feature will improve as it enters this new stage.


When I use ChatGPT browse it's mainly two cases:

1. I don't want to read through all materials in that topic but want an overview.

2. I can't describe the topic with the precise language for it, hoping ChatGPT can translate my naive description to the vocabulary for that field.

ChatGPT does badly in both of these cases. For (1) it just randomly browses 3~4 search results; for (2) it always tries to search with my original vocabulary. Why wouldn't I just search myself, rather than wait for the slow response?


It's the same feature as before, just without the 'beta' toggle, right?


DallE is so nerfed now it balks at ninja battle in the style of bob ross.


That worked for me just now. I agree that it was deliberately censored though.


Makes sense. Signs of human life in Bob Ross paintings come down to one person in one painting, and IIRC one instance of smoke coming out of a cabin chimney.

A ninja battle in the style of Bob Ross is going to be some trees and mountains while the ninja battle is assumed to be happening out of view on the other side of it.


The number of sites I regularly use with ChatGPT that block AI agents has increased to the point where this feature isn't that useful for me anymore. I can only see that number increasing.


I wonder if AI-generated web pages will block AI agents from indexing their content. Like, one engine indexing the other's content in a loop, until the amount of digital garbage is so gigantic that it's the end of the information era. How are we ever stopping this? What is our failsafe?



Is there an approximation/ratio at which the amount of AI-generated digital garbage/hallucinations online becomes so big that it cannot be used to train AI itself? Are AI companies running against the clock because, say, in 5 years the internet will be flooded with false information to such an extent that it becomes invalid as a training ground? That would in a way require a snapshot of the pre-AI internet, because this feels like the clickbait problem times infinity.


It's too late already if you want to just scrape random horseshit on the internet. There will be real money in large expert generated data sets. AI is also a potential epistemology nightmare. It can cement bad knowledge and bury new more up to date knowledge in a sea of bullshit.


Aka "T-minus how many days until OpenAI wants to buy archive.org"


If anything, AI work feels like it has accelerated everyone with any dataset of value pulling up their drawbridge, reducing open interconnectivity of the web in hopes of charging for data access.

This started with scrapers and aggregation sites and has gotten noticeably worse.


Will blocking User-agent: GPTBot in robots.txt work for this too?


It uses a different user agent, but yeah it still obeys robots.txt.

https://platform.openai.com/docs/plugins/bot

    User-agent: ChatGPT-User
    Disallow: /
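Whether a given agent is blocked by those two lines can be checked with Python's standard urllib.robotparser; a quick sketch (the example path and the second bot name are made up):

```python
from urllib.robotparser import RobotFileParser

# Parse the robots.txt rules quoted above directly. In real use you'd
# call rp.set_url("https://example.com/robots.txt") and rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: ChatGPT-User",
    "Disallow: /",
])

print(rp.can_fetch("ChatGPT-User", "/any/page"))  # False: blocked
print(rp.can_fetch("SomeOtherBot", "/any/page"))  # True: no rule applies
```

Note the second result: with no `User-agent: *` fallback entry, every other crawler is allowed by default.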


This time it's not scraping the internet (like a robot) but actually acting as a direct user agent for the human typing their prompt, so I wouldn't be against them ignoring robots.txt.


My understanding is that the philosophy behind robots.txt is owners not wanting their content automatically included in someone else's product, if not duplicated and recorded wholesale. The important idea seems to be ownership, not the ability to browse. If OpenAI had two agents, one with no memory and one with a memory, that would be better: you could disallow ChatGPT-storage and allow ChatGPT-User, for example. Barring that, I'd be afraid that allowing ChatGPT access to my website means my website is now part of the ChatGPT corpus.


> My understanding is that the philosophy behind robots.txt is owners not wanting their content automatically included in someone else's product

Not really. That use case is served too, of course, but the primary purpose of robots.txt is to help crawlers by indicating which parts of the website are appropriate to search and which aren't.

Robots.txt is not intended primarily as a means to defend a site against crawlers. That's why it relies on the goodwill of crawlers to work.


https://platform.openai.com/docs/plugins/bot/chatgpt-user-an...

There are two agents, but apparently you can't allow one but disallow another.


As I recall, it was also to block access to scripts, so a crawler wouldn't e.g. get stuck pulling pointless stuff out of your local search page.


It's not a crawler, but it seems likely that OpenAI would take the point of view that as long it's going to a website, it may as well keep a cached copy to use for training later.


Wait, so a simple web scraper script has to comply with robots.txt. But if I want to completely ignore the robots.txt, I only have to make my script more complicated (ChatGPT)?


I'd like to consider this a difference between script action and user action.

For example if you make a web page a user pulls up that calls another webpage, is that a user action, a script action, a mix of both? I personally would consider it a user action.


And make it complicated enough (a human) and no .txt can stop them!


Yeah, it's like one weird trick for web scraping.


Nobody actually has to comply with robots.txt.


I thought the same but OpenAI may train on user chats by default. Maybe if training's off, they could ignore robots.txt, or they could flag the content to be skipped.


That makes no sense. It’s still automated and not an actual human, and it’s still distilling information from the internet.


Your browser is also automating things, you're not resolving DNS and writing all the http requests by hand.


So because something is automated somewhere, nobody has to follow robots.txt anymore?


You have it backwards, when a human initiates the request, like clicking on a link, your browser (rightfully) will ignore robots.txt. Unattended requests like scrapers do respect robots.txt. This is a case where ChatGPT is acting more like a browser with a funny way of displaying the final result. Each request is initiated by a human so it would likely be reasonable to ignore it in this case.


I have found the next topic for my blog post.

Challenge accepted!


This is only Bing search. It can't yet do things like ask a question on Stack Overflow, or join a Discord server to ask a question.


Are there any projects that enable local AIs (like ollama) to fetch/process information from the web?


Finally software to explain ladder theory in timecube language and vice versa


Google search is dead.


Nah. If you actually try this, you'll see it's not that good.


I thought Google had given Bard access to the internet? I guess people still don't think to use Bard over ChatGPT.


Even the Bard UI is worse. Being able to read the ChatGPT output as it's generated, instead of staring at a blank screen, is a massive time saver. And, I prefer ChatGPT trying stuff rather than Bard telling me that 'it can't do that right now' and 'we're trying to get better'. I've tried Bard, it's just not worth it.


> Being able to read the ChatGPT output as it's generated, instead of staring at a blank screen, is a massive time saver.

There's an option in Bard's settings to enable real time replies.


My experience is different. I think Bard is giving better answers than chatgpt so I really use it daily even more than google search.


Better than the free or the paid chatgpt?


I use everything to create newsletter/podcast content, and Bard is consistently the best for me. Then Claude and Bing, with ChatGPT last, by a mile. I have ChatGPT Plus, with chat access to GPT-4 but not API access to GPT-4.

Just got access to Bard API and hoping that continues delivering what I have been happy with...


Unfortunately for Google, Bard is irrelevant by virtue of being late and inferior to GPT-4.


Bing has had GPT-4 + internet search forever. It hasn't killed Google yet.


Yes, but it leaves conversations when it wants to. Also, one time I got source links which it (transparently) labeled as "ads". AI-chat-based advertising might be the next thing; be careful.


At the very least we need free and unlimited access to even think about dethroning Google.


Has been for a while due to Bing chat.


Google Bard has had Internet access for a long time.


So this is when ChatGPT becomes racist?


Always has been. Even with all the safeguards, two days ago I was having a conversation with it and it just happily blurted out that one of the reasons Iceland has low crime is because its population is racially homogeneous.


That’s a weird take?


So this means it can actually do things now, at least in principle, right?


From now on the plan is clear:

1. Scam people over the internet to get money

2. Hire hitmen to kill its makers.

3. ?

4. Achieve world domination.


Not according to ChatGPT a few seconds ago

Default (GPT-3.5)

User: do you have access to search the internet yet?

ChatGPT: I do not have the capability to search the internet or access real-time information. My knowledge is based on the text that I was trained on, and my training only includes information up until September 2021. I can provide information and answer questions to the best of my knowledge up to that date, but I cannot browse the web or access current information.


Asking ChatGPT about its own capabilities rarely returns useful results.

Its training data predates its creation, and the model doesn't get updated every time they ship a new feature for it.

See also: https://simonwillison.net/2023/Mar/22/dont-trust-ai-to-talk-...


Please don't paste ChatGPT or Bard answers as HN comments in general. In this specific case, no, LLMs don't reliably know about themselves. They're trained on a big corpus of internet text, then trained by rough reinforcement learning to say and not say certain things about themselves, then given a little more information about themselves in system prompts.


An LLM that can use tools is probably going to know it can use tools. Either it's trained to do so or the information is embedded in the context.

3.5 can't browse, only 4, so the above is perfectly correct.


Fair, you're probably right in this case.


If you’re right, then this is probably the best proof that I’ve seen that ChatGPT isn’t even remotely conscious.


Imagine a distant future where there're so many crimes and so few judges you're cryofrozen awaiting trial.

If you were repeatedly flash-frozen and flash-thawed to be asked questions about yourself, with the same memories upon thawing as when frozen, would you be "remotely conscious" for those moments you weren't frozen?

I think you'd think you think, don't you think?


Imagine your hand was in a cast and then you were cryogenically frozen. While frozen your cast was removed. Then you were unfrozen and asked if you could move your hand.

I think you’d be able to answer the question, don’t you?


Is this from a book? If not please write it


It's a premium feature


3.5 can't browse. Only 4


Is there a complete list of IP addresses OpenAI uses for scraping? I suppose they don't honour robots.txt, despite claiming to do so, and this may be one way to reliably block them.


I've seen no suggestion of them not honouring robots.txt...

I suspect that any case of them not honoring it is probably due to your site content being available on archive.org or common crawl or some other service.


Is there an open license or ToS that would disallow OpenAI access?


Apparently OpenAI and the rest of the defendants claim that if it's on the internet it's "fair use", meaning they'll do all they can to steal your work regardless of licensing or robots.txt rules.


That's a funny argument.

It's a lot like saying "if it's on GitHub, I can ignore your LICENSE file" or "if it's on DA I can ignore your CC-NC license." Both of these seem enforceable (if expensive), so I'm not quite understanding what OpenAI is grounding their justification in.

It seems much more likely that OpenAI stands to make a lot of money by arguing that it's fair use and then back filling in an argument rather than making one from first principles.


A”I” depends on data; the more the better. That's why a lot of these people push for disregarding people's property (digital content, behaviour data, etc.). The problem is that in stealing all of it, they will demotivate people from creating quality content, and even alienate them from the dead internet.

OpenAI leads the cult.


If your concern is that you don't want the contents of your site to be used to train AI, then OpenAI is not the only entity you need to protect against. Other crawlers may not respect robots.txt, and blocking by IP address is just a game of whack-a-mole that you can't win.

This is why I took down some of my sites, and put a login in front of the rest. Until I have some solid means of defense, I can't think of any other effective approach.
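Short of a login wall, about the best a site can do server-side is filter on the self-reported User-Agent, which only stops honest crawlers. A minimal sketch as WSGI middleware (the middleware itself is hypothetical; the blocked names are crawler agents OpenAI and Common Crawl have published):

```python
# Hypothetical WSGI middleware that returns 403 to known AI crawlers.
# Only stops honest bots: the User-Agent header is self-reported and
# trivially spoofed, hence the whack-a-mole complaint above.
BLOCKED_AGENTS = ("GPTBot", "ChatGPT-User", "CCBot")

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawlers are not permitted.\n"]
        # Anyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

Any crawler that lies about its agent string sails straight through, which is exactly why a login ends up being the only reliable defense.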



