I think there's a widely held misconception that anything you paste into GPT-4 will be used as raw training data by the model.
Some people even seem to believe that it's learning continuously, so something you paste in could show up in an answer for another user a few minutes later.
My mental model of how this works is somewhat different:
- It takes months to train a model on raw data, and OpenAI train new ones (that get released to the public) quite infrequently.
- OpenAI DO NOT WANT your private data in their training data. They put a great deal of work into stripping out PII from the training data that they do use already (this is described in their papers). They're not going to just paste in anything that anyone typed into that box.
Here's the problem though: they DO use ChatGPT interactions to "improve" their services. I don't think that means piping the data directly into training, but they clearly log everything and use those interactions as part of subsequent rounds for things like fine-tuning and RLHF.
Also, they had that embarrassing bug a few weeks ago where some users could see the titles of other users' conversations.
So it's not irrational to worry about pasting data into GPT-4 - it gets logged, and it could leak by accident.
But I'm confident that data passed to ChatGPT isn't being piped in as raw training data for subsequent versions of their live models.
(I hope I'm right about this though - I thought about blogging it, but OpenAI's transparency isn't good enough that I'd feel comfortable staking my reputation on this)
I'm absolutely horrified by people's willingness to submit private information (personal or corporate) even if it's not used for training. Data breaches happen all the time (targeted or accidental), and OpenAI is becoming a juicier target by the day.
You're right that OpenAI doesn't want the information. Consequently, OpenAI won't have security policies and processes geared toward anonymization or toward handling financial and health data, because those aren't design goals. If I were an attacker, I'd go for the raw data rather than try to glean information from the model (in the hypothetical where user input were used for training).
Healthcare companies usually require BAAs (business associate agreements) from IT vendors before the vendors start getting paid. That doesn't mean the vendors are claiming their systems are HIPAA compliant.
Someone might ask it:
"How do you I figure out if this person killed someone?"
and it responds:
"I can't be certain if they killed them but last week they asked me where they should hide the body."
But seriously, I think the best argument for this is that the EU (or individual European nations) would not hesitate to go after a US company for collecting user data in violation of their data privacy laws. Even in the US, certain professionals are required to maintain confidentiality of certain records or face rather extreme penalties. OpenAI also doesn't have FAANG capital to grease Washington with yet, and we know how kleptocrats love to leverage justice against newly emergent companies with valuable IP.
So if they say they don't, they had better not be doing it, or it would likely be the end of them.
> last week they asked me where they should hide the body.
ChatGPT is a static model and has zero memory. It can't even "remember" anything word-to-word as it generates its output! It starts its processing over from scratch for each word.
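To make that concrete, here's a minimal sketch (assuming the pre-1.0 openai Python client): the API is stateless, so any "memory" only exists because the caller resends the whole conversation on every request.

```python
# Minimal sketch, assuming the legacy openai-python (<1.0) ChatCompletion API.
# The model keeps no state between calls; the caller must resend the entire
# conversation history every time, or the earlier messages simply don't exist.
import openai

history = [{"role": "user", "content": "My name is Alice."}]
reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message["content"]})

# This only "remembers" the name because we included the earlier turns ourselves:
history.append({"role": "user", "content": "What's my name?"})
reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
print(reply.choices[0].message["content"])
```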
I think it's more about being against the principle of piping potentially sensitive data to any third party.
True, OpenAI doesn't have any real motivation to randomly pluck your data and decide to do something horrible to you with it... but they could. More importantly, circumstances can and will change as time goes on. If your logs change hands as part of a buyout or cyberattack, you'll have no recourse.
> OpenAI doesn’t have any real motivation to randomly pluck your data and decide to do something horrible to you with it
They do have a motivation to use it for training, which could result in it being externally exposed to third parties, and those third parties, unlike OpenAI, might well have the motivation to do something horrible to you with it when they encounter it.
Yes re treating interactions as RLHF data. I could imagine them developing a flow to automatically catalog interactions as successful and unsuccessful, then cluster those by domain + interaction flow. If someone has a successful interaction in a cluster that is normally unsuccessful, treat that as a 'wild-type' prompt engineering innovation that needs to be domesticated into the model.
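Purely speculative, but a crude version of that cataloguing flow could look something like the sketch below; the field names and the thumbs-up success signal are my own assumptions, not anything OpenAI has described.

```python
# Hypothetical sketch of the catalogue-and-cluster idea above. Field names and
# the thumbs-up success signal are assumptions, not OpenAI's actual pipeline.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def find_wildtype_prompts(interactions, n_clusters=20):
    """interactions: list of dicts like {"prompt": str, "thumbs_up": bool}."""
    texts = [i["prompt"] for i in interactions]
    features = TfidfVectorizer(max_features=5000).fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

    wildtype = []
    for cluster in range(n_clusters):
        members = [i for i, l in zip(interactions, labels) if l == cluster]
        if not members:
            continue
        success_rate = sum(m["thumbs_up"] for m in members) / len(members)
        # A successful prompt inside a cluster that normally fails is the
        # interesting "wild-type" innovation worth a closer look.
        if success_rate < 0.2:
            wildtype.extend(m for m in members if m["thumbs_up"])
    return wildtype
```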
I think you're right that blindly training on chats would bring back the olden days of google bombing ('santorum')
And also that any company with 'improve' in their TOS isn't committing to perfect privacy
> OpenAI DO NOT WANT your private data in their training data
But they do want it. I can see many old chat logs.
Data is a liability. Does "clear conversations" in chat.openai.com actually remove them, or just mark them as "deleted" while they remain in a database? I just did a data export, then cleared my conversations, then did another data export. The second export was empty, which seems suspiciously fast to me.
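For reference, the comparison was roughly this (assuming the export zip contains a conversations.json listing each conversation, which is what mine contained; the format isn't documented):

```python
# Rough comparison of two ChatGPT data exports. Assumes each export zip holds a
# conversations.json that is a list of conversation objects with an "id" field;
# that matches the exports I got, but the format is undocumented.
import json
import zipfile

def conversation_ids(export_zip):
    with zipfile.ZipFile(export_zip) as z:
        return {c["id"] for c in json.loads(z.read("conversations.json"))}

before = conversation_ids("export-before-clear.zip")
after = conversation_ids("export-after-clear.zip")
print("gone from the export:", before - after)  # absent from the export...
print("still listed:", after)                   # ...which says nothing about the database
```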
I'm genuinely trying to understand (based on this and another comment above): wouldn't storing data for pre-training vs fine-tuning carry the same risks?
If you mean the risk that OpenAI will have their own security hole that leaks that stored data then yes.
If you mean the risk that someone will ask a question about your company and ChatGPT will answer with some corporate secrets then no.
This all depends very much on what they are using the ChatGPT data for. My theory is that they treat it very carefully to avoid "facts" from it being absorbed into the model - so even "fine tuning" may be inaccurate terminology here.
I really, really wish they would be more transparent about how they use this data.
> But I'm confident that data passed to ChatGPT isn't being piped in as raw training data for subsequent versions of their live models.
Right now, sure, but they are almost certainly saving that data to send you targeted ads down the line. Maybe not this company... maybe when they get into financial hardship and sell off to someone with dubious ethics. Maybe not ads, but something like that.
Why do you think your email is private? Is your email provider more aligned with your interests or more secure than OpenAI? I doubt either Google or Microsoft care about your privacy (no difference).
Of course Google and Microsoft are more secure than OpenAI.
And Google and Microsoft don't care about your privacy but they do care about being the only ones to profit off your data. OpenAI don't make profit off of your data, but they are collecting it; how they choose to make profit off of it in the future could be completely orthogonal to your interests.
Hell, China or any other of the numerous wealthy baddies could buy OpenAI and have access to all the data they're storing.
Homie, Bard from Google is trained on your Google emails. They read your emails and build data profiles based on that shit and sell it. What are you on exactly? The US government is more of a direct threat to you and me than the CCP, and they actively buy your data and were reading all your emails not too long ago.
Huh? Google Bard is trained on your email data, and Microsoft de facto controls OpenAI and by extension ChatGPT via their 49% investment in the company.
Mine isn't private. I hand my email out to anyone who wants it, including search engines and presumably AIs. It's right there on my website. If you want my email, I'll happily give it to you.
Emails are "personally identifiable".
Your email address can be used to link together almost every online purchase you've ever made, for example. That is what makes email addresses dangerous, and it's what we need to change to improve privacy. It should be possible for companies to send invoices and shipping notices without linking the order to the customer's email address (or their name, or street address, or any other personally identifiable information).
We're a long way from being able to do that with invoices and shipping notifications, but there are a lot of other systems where emails aren't necessary and shouldn't be associated with a record, even though emails are not private.
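One possible shape for that, as a sketch rather than a description of any existing service: the merchant only ever stores a random per-order alias, and a relay the customer trusts maps aliases back to real addresses and forwards the notices.

```python
# Sketch of per-order email aliasing. The relay names and flow here are
# illustrative assumptions, not a description of any real service.
import secrets

alias_to_address = {}  # held only by the relay, never by the merchant

def alias_for_order(real_address: str) -> str:
    alias = f"order-{secrets.token_hex(8)}@relay.example"
    alias_to_address[alias] = real_address
    return alias

def forward(alias: str, message: str) -> None:
    real_address = alias_to_address[alias]
    print(f"relay -> {real_address}: {message}")  # stand-in for actually sending mail

merchant_contact = alias_for_order("alice@example.com")  # the shop only sees this
forward(merchant_contact, "Your order #1234 has shipped.")
```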
I'm only sending GPT message snippets, which Gmail's API makes available (probably using a language model, lol). So there is very little PII in the API calls to GPT. That being said, I think OpenAI has a very large incentive to carefully destroy any query data sent through its APIs. The last thing they need is for some employee to quit and spill the beans about how Sam Altman stays up late at night laughing at all the API calls revealing the trivial problems of humans with an IQ of less than 140.
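Roughly what that looks like, if it helps (a sketch, assuming an already-authorised googleapiclient Gmail service object and the legacy openai client):

```python
# Sketch: pull only the short "snippet" preview the Gmail API exposes for each
# message and send that, not the full body, to GPT. Assumes `service` is an
# already-authorised googleapiclient Gmail service and the legacy openai client.
import openai

def summarise_recent_snippets(service, query="newer_than:7d", limit=20):
    listing = service.users().messages().list(
        userId="me", q=query, maxResults=limit).execute()
    snippets = []
    for ref in listing.get("messages", []):
        msg = service.users().messages().get(userId="me", id=ref["id"]).execute()
        snippets.append(msg.get("snippet", ""))  # ~one-line preview, not the full email

    prompt = "Summarise these email snippets:\n" + "\n".join(snippets)
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message["content"]
```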
To do this, you had to feed your email into GPT-4, right?