It's roughly the same problem as letting a search engine build indexes (with previews!) of sites without authentication. It's kinda crazy that things were allowed to go this far with such a fundamental flaw.
Yep. Many years ago I worked at one of the top brokerage houses in the United States. They had a phenomenal in-house Google search engine that made it really easy to navigate the whole company and find information.
Then someone discovered production passwords on a site that was supposed to be secured but wasn’t.
Similar things turned up in several other places.
The solution was to make search work only for websites that had explicitly opted in.
After that, internal search was effectively broken and useless.
All because a few actors did not think about or care about proper authentication and authorization controls.
I'm unclear on what the "flaw" is - isn't this precisely the "feature" that search engines provide to both sides and that site owners put a ton of SEO effort into optimizing?
If you have public documents, you can obviously let a public search engine index them and show previews. All is good.
If you have private documents, you can't let a public search engine index them and show previews, even if you add an authentication wall for normal users who try to open the document directly. They could still see part of the document in Google's preview.
My explanation sounds silly because surely nobody is that dumb, but this is exactly what they have done. They gave an AI access to ALL documents, both public and private, and then got surprised when the AI leaked some private document details. They thought they were safe because users would be faced with an authentication wall if they tried to open the document directly. But that doesn't help if Copilot simply tells you all the secrets in its own words.
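To make that concrete, here is a toy sketch in Python of the difference between checking permissions only when a document is opened and checking them before a preview is ever produced. Everything in it (Document, can_read, search, the trim flag, the sample data) is hypothetical and made up for illustration; it is not how Copilot or any real search product is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Hypothetical ACL: an empty set means the document is public.
    allowed_users: set = field(default_factory=set)


def can_read(doc, user):
    """Return True if the document is public or the user is on its ACL."""
    return not doc.allowed_users or user in doc.allowed_users


def search(index, query, user, trim=True):
    """Return (doc_id, preview) pairs for documents matching the query.

    With trim=False the ACL is ignored, which models an index that relies
    on an authentication wall at open time only. With trim=True, documents
    the user cannot read are dropped before any preview is generated.
    """
    results = []
    for doc in index:
        if query.lower() not in doc.text.lower():
            continue
        if trim and not can_read(doc, user):
            continue
        results.append((doc.doc_id, doc.text[:40]))
    return results


index = [
    Document("handbook", "Password reset steps are in the public handbook."),
    Document("prod-creds", "Password for prod db: hunter2", {"alice"}),
]

# mallory is not on the ACL of "prod-creds".
leaky = search(index, "password", "mallory", trim=False)  # preview leaks the secret
safe = search(index, "password", "mallory")               # only the public document
admin = search(index, "password", "alice")                # alice sees both
```

The point of the sketch is only that the check has to happen before the preview (or the AI's summary) is built. An authentication wall on the document itself does nothing for the leaky case.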
You say that, but it happens — "Experts Exchange", for example, certainly used to try to hide the answers from users who hadn't paid while encouraging search engines to index them.
That's not quite the same. Experts Exchange wanted the content publicly searchable, and explicitly allowed search engines to index it. In this case, many customers probably aren't aware that there is a separate search index containing much of the data from their private documents, and that this index may be searchable and accessible by entities that otherwise shouldn't have access.