I've been relatively unimpressed with the ChatGPT browse mode.
My problem with it is that it seems to run really obvious, naive searches. Most of the time I ask it something, then see what it's searching for and think "oh no, that's not going to return anything more useful than what I could have found myself".
It's also pretty slow.
I'm very much looking forward to having a search assistant which can go ahead and wade through eg 20 websites about home battery packs and show me a summary table of my options - ChatGPT Browse isn't quite that yet.
Have you tried perplexity.ai? It has become my go-to for LLM + browsing. It is really fast and does a pretty good job with retrieval and summarization.
I wonder why they don't list that under the Ultimate plan? On the pricing page the selling point for Ultimate is basically: hey, help us test things and support us by giving some money: https://kagi.com/pricing
How is Kagi for niche stuff? I'm frequently searching for niche and hard to find programming information about processors and whatnot and google/DDG have only gotten worse with time.
I use Kagi. Probably not better than Google, but often with less junk. It's much better than DDG. The extra features are nice, as are the customizations (boost specific domains, etc.). I thought I'd be !g-ing to find things on Google, but I rarely do, and I'm constantly researching niche stuff. When I do try !g, I generally find that Google doesn't have the results either. Kagi also seems to do better with really old results that Google has memory-holed.
Just tried it; the free version is really slow (it has been going for 7 minutes now just to generate the first image and still isn't complete) compared with what I'm getting from OpenAI. I'll keep the $20/mo on that; it's worth it.
It’s also pretty useless at understanding any page that’s even remotely complex.
I asked it to tell me when the next baseball home game is so I can avoid traffic when going to the downtown library and it couldn’t answer even that basic question. The search results defaulted to the team’s official calendar but the day numbers were displayed as images so they could style them in the team’s font, making them invisible to ChatGPT.
This example is more likely due to that specific page/site lacking any kind of accessibility information (e.g. the markup that lets people who are blind, deaf, or have other disabilities use the page), rather than it being large or complex.
Relatedly, in my experience, for some reason most internet properties relating to anything sports-related (MLB, NFL, NHL, or even small-town little league) have absolutely terrible, extremely overcomplicated user interfaces.
It takes much longer to make a screencap of an HTML page. Modern slow JS and many subresources coming from a slow server, combined with the fact that there is no reliable signal saying a page is done loading, generally mean you won't be getting a screencap in less than about 10 seconds.
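To make that concrete, here's a minimal sketch of the "when is the page done?" problem using Playwright. The network-idle heuristic, the timeout values, and the extra settle delay are my assumptions about how a browse agent might approach it, not how ChatGPT actually renders pages:

    # Heuristic screenshot of a JS-heavy page using Playwright (assumed
    # installed via `pip install playwright && playwright install chromium`).
    # There is no definitive "page finished loading" signal, so this falls
    # back on heuristics.
    from playwright.sync_api import sync_playwright

    def screenshot(url: str, path: str = "page.png") -> None:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Wait until the network has been quiet for a bit, capped by a
            # hard timeout so slow or chatty pages don't hang forever.
            page.goto(url, wait_until="networkidle", timeout=15_000)
            page.wait_for_timeout(1_000)  # let late-running JS settle
            page.screenshot(path=path, full_page=True)
            browser.close()

    screenshot("https://example.com")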
Kagi's FastGPT is pretty good in that regard (it uses Kagi search + Claude from Anthropic) and it's really fast. Their other AI search/chat, which is currently in beta (for Ultimate users), is even better IMO.
I tried Browse with Bing a few times, but unlike normal conversations with Bing, the results from Browse with Bing (other than presumably being more up to date) seemed very generic, as if I had simply searched myself, rather than it using its full power as normal plus the latest data. It also took a while to run. So it's a start, and hopefully that feature will improve as it enters this new stage.
1. I don't want to read through all the materials on that topic but want an overview.
2. I can't describe the topic in the precise language of the field, and I'm hoping ChatGPT can translate my naive description into that field's vocabulary.
ChatGPT does badly in both of these cases. For (1) it just randomly browses 3-4 search results; for (2) it always tries to search with my original vocabulary. Why wouldn't I just search myself then, rather than wait for the slow response?
Makes sense. Signs of human life in Bob Ross paintings come down to one person in one painting, and IIRC one instance of smoke coming out of a cabin chimney.
A ninja battle in the style of Bob Ross is going to be some trees and mountains while the ninja battle is assumed to be happening out of view on the other side of it.
The number of sites I regularly use with ChatGPT that block AI agents has increased to the point where this feature is not that useful for me anymore. I can only see that number increasing.
I wonder whether AI-generated web pages will block AI agents from indexing their content. Otherwise you get one engine indexing the other's content in a loop until the amount of digital garbage is so gigantic that it's the end of the information era. How are we ever stopping this? What is our failsafe?
Is there an approximate ratio at which the amount of AI-generated garbage/hallucinations online becomes so large that the web can no longer be used to train AI itself? Are AI companies racing against the clock because, say, in 5 years the internet will be flooded with false information to such an extent that it becomes an invalid training ground, effectively requiring a snapshot of the pre-AI internet? It feels like the clickbait problem times infinity.
It's already too late if you want to just scrape random horseshit off the internet. There will be real money in large, expert-generated data sets. AI is also a potential epistemological nightmare: it can cement bad knowledge and bury newer, more up-to-date knowledge in a sea of bullshit.
If anything, AI work feels like it has accelerated everyone with a dataset of value pulling up their drawbridge, reducing the open interconnectivity of the web in the hope of charging for data access.
This started with scrapers and aggregation sites and has gotten noticeably worse.
This time it's not scraping the internet (like a robot) but actually acting as a direct user agent for the human typing their prompt, so I wouldn't be against them ignoring robots.txt.
My understanding is that the philosophy behind robots.txt is owners not wanting their content automatically included in someone else's product, if not duplicated and recorded wholesale. The important idea seems to be ownership, not the ability to browse. If OpenAI had two agents, one with no memory and one with a memory, that would be better: you could disallow ChatGPT-storage and allow ChatGPT-user, for example. Barring that, I'd be afraid that allowing ChatGPT access to my website means my website is now part of the ChatGPT corpus.
> My understanding is that the philosophy behind robots.txt is owners not wanting their content automatically included in someone else's product
Not really. That use case is covered, of course, but the primary purpose of robots.txt is to help crawlers by indicating which parts of the website are appropriate to crawl and which aren't.
Robots.txt is not intended primarily as a means to defend a site against crawlers. That's why it relies on the goodwill of crawlers to work.
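To illustrate the goodwill point, here's a minimal sketch using Python's standard urllib.robotparser; the "ExampleBot" user agent and URLs are hypothetical. Note that nothing in this mechanism prevents the fetch itself: a crawler only stays out if it chooses to run the check.

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (hypothetical URL).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/private/page.html"
    if rp.can_fetch("ExampleBot", url):
        print("robots.txt says this path is fine to crawl")
    else:
        # A well-behaved crawler stops here; an ill-behaved one simply
        # skips this check and fetches the page anyway.
        print("robots.txt asks crawlers to stay out of this path")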
It's not a crawler, but it seems likely that OpenAI would take the point of view that as long as it's going to a website, it may as well keep a cached copy to use for training later.
Wait, so a simple web scraper script has to comply with robots.txt,
but if I want to completely ignore robots.txt, I just have to make my script more complicated (ChatGPT)?
I'd like to consider this a difference between script action and user action.
For example if you make a web page a user pulls up that calls another webpage, is that a user action, a script action, a mix of both? I personally would consider it a user action.
I thought the same, but OpenAI may train on user chats by default. Maybe if training is off they could ignore robots.txt, or they could flag the content to be skipped.
You have it backwards: when a human initiates the request, like clicking on a link, your browser (rightfully) ignores robots.txt. Unattended requests like scrapers do respect robots.txt. This is a case where ChatGPT is acting more like a browser with a funny way of displaying the final result. Each request is initiated by a human, so it would likely be reasonable to ignore robots.txt in this case.
Even the Bard UI is worse. Being able to read the ChatGPT output as it's generated, instead of staring at a blank screen, is a massive time saver. And, I prefer ChatGPT trying stuff rather than Bard telling me that 'it can't do that right now' and 'we're trying to get better'. I've tried Bard, it's just not worth it.
I use everything to create newsletter/podcast content, and Bard is consistently the best for me. Then Claude and Bing, with ChatGPT last, by a mile. I have ChatGPT Plus, with chat access to GPT-4 but not API access to GPT-4.
Just got access to the Bard API and I'm hoping it continues delivering what I've been happy with...
Yes, but it leaves conversations when it wants to. Also, one time I got source links that it (transparently) labeled as "ads". AI-chat-based advertising might be the next thing; be careful.
Always has been. Even with all the safeguards, two days ago I was having a conversation with it and it just happily blurted out that one of the reasons Iceland has low crime is because its population is racially homogeneous.
User
do you have access to search the internet yet?
ChatGPT
I do not have the capability to search the internet or access real-time information. My knowledge is based on the text that I was trained on, and my training only includes information up until September 2021. I can provide information and answer questions to the best of my knowledge up to that date, but I cannot browse the web or access current information.
Please don't paste ChatGPT or BARD answers as HN comments in general? In this specific case, no, LLMs don't reliably know about themselves. They're trained on a big corpus of internet text, then trained by rough reinforcement learning to say and not say certain things about themselves, then given a little more information about themselves in system prompts.
Imagine a distant future where there are so many crimes and so few judges that you're cryofrozen awaiting trial.
If you were repeatedly flash-frozen and flash-thawed to be asked questions about yourself, with the same memories upon thawing as when frozen, would you be "remotely conscious" for those moments you weren't frozen?
Imagine your hand was in a cast and then you were cryogenically frozen. While frozen your cast was removed. Then you were unfrozen and asked if you could move your hand.
I think you’d be able to answer the question, don’t you?
Is there a complete list of IP addresses OpenAI uses for scraping? I suppose they don't honour robots.txt, despite claiming to do so, and this may be one way to reliably block them.
I've seen no suggestion of them not honouring robots.txt...
I suspect that any case of them not honoring it is probably due to your site content being available on archive.org or common crawl or some other service.
Apparently OpenAI and the rest of the defendants claim that if it's on the internet it's "fair use", meaning they'll do all they can to steal your work regardless of licensing or robots.txt rules.
It's a lot like saying "if it's on GitHub, I can ignore your LICENSE file" or "if it's on DA, I can ignore your CC-NC license." Both of these seem technically enforceable (if expensive), so I'm not quite understanding what OpenAI is grounding their justification in.
It seems much more likely that OpenAI stands to make a lot of money by arguing that it's fair use and then back filling in an argument rather than making one from first principles.
A"I" depends on data. The more the better. That's why a lot of these people push for a disregard of people's property (digital content, behaviour data, etc.). The problem is that in stealing all of it they will demotivate people from creating quality content and even alienate them from the dead internet.
If your concern is that you don't want the contents of your site to be used to train AI, then OpenAI is not the only entity you need to protect against. Other crawlers may not respect robots.txt, and blocking by IP address is just a game of whack-a-mole that you can't win.
This is why I took down some of my sites, and put a login in front of the rest. Until I have some solid means of defense, I can't think of any other effective approach.
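For what it's worth, a lighter-weight option than chasing IP addresses is filtering on the User-Agent header, since OpenAI and some other vendors document distinct strings for their crawlers. Below is a minimal sketch, assuming the token list has been verified against each vendor's current docs; it only works against clients that identify themselves honestly.

    # Hypothetical WSGI middleware that blocks self-identified AI crawlers.
    # The token list is an assumption; check each vendor's documentation
    # for the user-agent strings they actually use.
    BLOCKED_UA_TOKENS = ("GPTBot", "CCBot")

    def block_ai_crawlers(app):
        def wrapper(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(token.lower() in ua.lower() for token in BLOCKED_UA_TOKENS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Automated AI crawlers are not permitted.\n"]
            return app(environ, start_response)  # pass everyone else through
        return wrapper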