
We block ChatGPT, as do most federal contractors. I think it’s a horrible exploit waiting to happen:

- there's no way they're manually scrubbing out sensitive data, so it's bound to spill out of the training data when prompting the model

- OpenAI is openly storing all the data they collect, to the extent that they've had several leaks now where people can see others' conversations and data. We are one step away, if it hasn't already happened, from an exploit of their systems (which were likely built with scale and performance, not security, as the top priority) that could leak a monumental amount of user data.

In the most innocent case they could leak the personal info of naive users. But if LinkedIn is any indication, the business world is filled with dopes who genuinely believe the AI is free-thinking and better than their employees. For every org that restricts ChatGPT use, there are fifty others that don't, most of which have at least one of said dopes ready to upload confidential data at a moment's notice.

Wouldn’t even put it past military personnel putting S/TS information into it at this point. OpenAI should include much more prominent warnings against providing this type of data if they want to keep up the facade of "we can't release it because ethics", because cybersecurity is a much more real liability than a supervised LM turning into Terminator.



This really depends on the cost/benefit tradeoff for the entity in question. If using ChatGPT makes you X% more productive (shipping faster / lowers labor costs / etc), but comes with Y% risk of data leakage, is that worth it in expectation or not? I would argue that there definitely exist companies for which it's worth the tradeoff.

By the way, OpenAI says it won't use data submitted through its API for model training - https://techcrunch.com/2023/03/01/addressing-criticism-opena...


To anyone who may be pasting prompts along the lines of 'convert this SQL table schema into a [pydantic model|JSON Schema]' where you're pasting in the actual text: just ask it instead to write you a [python|go|bash|...] function that reads in a text file and 'converts an SQL table schema to output x' or whatever (a rough sketch of what that might look like is below). Related/not-related: a great pandas docs replacement is another great and safe use-case.

Point is, for a meaningful subset of high-value use-cases you don't need to move your important private stuff across any trust boundaries, and it still can be pretty helpful...so just calling that out in case that's useful to anyone...
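For illustration, the kind of throwaway converter you might ask it to write (so your real schema never leaves your machine) could look roughly like this; the regex and type map are just a sketch for trivial DDL, not a full parser:

    import json
    import re
    import sys

    # Rough sketch of a "generic" converter: parse a simple CREATE TABLE
    # statement from a file and emit a JSON Schema, so the real data never
    # has to be pasted anywhere. Only handles trivial column definitions.
    SQL_TO_JSON = {"int": "integer", "bigint": "integer", "varchar": "string",
                   "text": "string", "boolean": "boolean", "numeric": "number"}

    def sql_table_to_json_schema(sql: str) -> dict:
        table = re.search(r"create\s+table\s+(\w+)", sql, re.I).group(1)
        body = sql[sql.index("(") + 1 : sql.rindex(")")]
        props = {}
        for line in body.splitlines():
            m = re.match(r"\s*(\w+)\s+(\w+)", line)
            if m and m.group(1).lower() not in ("primary", "foreign", "constraint", "unique"):
                props[m.group(1)] = {"type": SQL_TO_JSON.get(m.group(2).lower(), "string")}
        return {"title": table, "type": "object", "properties": props}

    if __name__ == "__main__":
        # usage: python schema_to_jsonschema.py my_table.sql
        print(json.dumps(sql_table_to_json_schema(open(sys.argv[1]).read()), indent=2))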


At first I was impressed by how easy it was to arrive at a data model with ChatGPT; then I laughed as I tried to tweak it and use it. I realized it didn't really have any concept of the model and was just drawing on its various knowledge bases.

I am unsure whether the so-called AI can think in models; so far it seems not, but it's still an impressive assisting tool if you mind its limitations.

Another area where it falls short is logic. My daughter has a lot of fun with the book "What Is the Name of This Book?", but she was struggling with the "map of Baal" explanation: for her the answer was a certain map, the book had another answer, and I had a third one because I interpreted one of the propositions differently. I never got an answer from ChatGPT without a contradiction in its reasoning, and the book had been mistranslated into French so that one of its propositions was changed (C: both A and B were knaves) but not the answer.


> At first I was impressed by how easy it was to arrive at a data model with ChatGPT; then I laughed as I tried to tweak it and use it. I realized it didn't really have any concept of the model and was just drawing on its various knowledge bases.

> I am unsure whether the so-called AI can think in models; so far it seems not, but it's still an impressive assisting tool if you mind its limitations.

I don't know. I'm using it for exactly that ("here's a problem, come up with a data model") and it gives a great starting point.[0]

Not perfect, but after that it's easy to tweak it the old-fashioned way.

I find its data modelling capabilities (in the domain I'm using it for - API services) to be roughly on par with a mid-level developer (for a handwavy definition of "mid-level").

[0] https://apibakery.com/demo/ai/


Did you prime it before asking, so it was answering in the appropriate context?


I've been doing that since day one. I can't believe people are pasting real data into these corporate black boxes.


What about Google Docs, Office 365, Github, AWS, Azure, Google Cloud, JIRA, Zendesk, etc?

What is different about ChatGPT (if anything)?


We have data standards and agreements with those companies, we pay them to have expectations. Even then, we're strict about what touches vendor servers and it's audited and monitored. Accounts are managed by us and tied into onboarding and offboarding. If they have a security incident, they notify, there's response and remediation.

ChatGPT seems to be used more like a fast stackoverflow, except people aren't thinking of it like a forum where others will see their question so they aren't as cautious. We're just waiting for some company's data to show up remixed into an answer for someone else and then plastered all over the internet for the infosec lulz of the week.


> We have data standards and agreements with those companies, we pay them to have expectations. Even then, we're strict about what touches vendor servers and it's audited and monitored. Accounts are managed by us and tied into onboarding and offboarding.

For every company like yours there are hundreds that don't. People use free Gmail addresses for sensitive company stuff, paste random things into random pastebins, put their private keys in public repos, etc.

Yes, data leaks from OpenAI are bound to happen (again), and they should beef up their security practices.

But thinking people are using only ChatGPT in an insecure way vastly overestimates their security practices elsewhere.

The solution is education, not avoiding new tools.


Doesn't OpenAI explicitly say that your Q/A on the free ChatGPT are stored and sent to human reviewers to be put in their RL database? Now of course we can't be sure what Google, AWS, etc. do with the data on the disks there, but it would be a pretty big scandal if some whistleblower eventually came out and said that Google employees sit and laugh at private bucket contents on GCP or private Google Docs. So there's a difference in stated intention, at least.


Who in their right mind is using free ChatGPT through that shitty, no-good web interface of theirs, the one that can barely handle two queries-and-replies before grinding to a halt? Surely everyone is using the pay-as-you-go API keys and any one of the alternative frontends or integrations?

And, IIRC, pay-as-you-go API requests are explicitly not used for training data. I'm sad GPT-4 isn't there yet - except for those who won the waitlist lottery.
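For anyone who hasn't made the jump, the pay-as-you-go path is a couple of lines with the official openai Python package; this is a minimal sketch using the pre-1.0 ChatCompletion interface that was current at the time, with the usual env var for the key:

    import os
    import openai

    # Pay-as-you-go call to the model behind free ChatGPT (gpt-3.5-turbo),
    # billed per token against your own API key instead of the web UI.
    openai.api_key = os.environ["OPENAI_API_KEY"]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain backpressure in two sentences."}],
    )
    print(resp.choices[0].message.content)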


It's really funny to see these types of comments. I would assume the vast majority of users are using the web interface, particularly in a corporate context where getting an account for the API could take ages or not be approved at all.

If people were smart and performed according to best practices, articles like this one would not be necessary.


I mean, if you're using a free web interface in corporate context, you may just as well use a paid API with your personal account - either way, you're using it of your own volition, and not as approved by your employer. And getting API keys to ChatGPT equivalent (i.e. GPT-3.5) takes... a minute, maybe less.

I am honestly confused how people can use this thing with the interface OpenAI runs. The app has been near-unusable for me, for months, on every device I tried it on.


> and any one of the alternative frontends or integrations?

And what sort of understanding do you have with the alternative frontends/integrations about how they handle your API keys and data? This might be a better solution for a variety of reasons but it doesn't automatically mean your data is being handled any better or worse than by openai.com


I wonder what the distribution of tokens / sec at OpenAI is between the free ChatGPT, paid ChatGPT, and APIs. I’d have to think the free interface is getting slammed. Quite the scaling project, and still nowhere near peaking.


To quote a children's TV show: "Which of these things is not like the others?"

Some of those are document tools working on language / knowledge. Others are infrastructure, working on ... whatever your infra does, and your infra manages your data (knowledge).

If you read their data policies, you'll find they are not the same.


I wouldn't put sensitive work data/employer IP in a personal Google Doc (et al.) either, no?


Don't use any of it


To your average user who interfaces with these figurative black boxes with a black box in their hand, how is this particular black box any different than the other black boxes that this user hands their data to every second of every day?


there are plenty of disallowed 'black boxes' within the federal sphere; ChatGPT is just yet another.

to take a stab at your question, though: my cell phone doesn't learn to get better by absorbing my telecommunications; it's just used as a means to spy on my personal life by The Powers That Be. The primary purpose of my cell phone is the conveyance of telecommunications.

ChatGPT hoards data for training and self-improvement in its current state. Its whole modus operandi involves the capture of data, rather than capturing it only tangentially. It could not meaningfully exist without training on something, and at this stage of the game the trend is to self-train on user data.

Until that trend changes people should probably be a bit more suspect about what kind of stuff gets thrown into the training bin.


Those typically have MSAs with legalese where the parties stipulate what they will and will not do, often whether or not it's zero-knowledge, and often the option to bring your own instance encryption keys.

If people are using the free version of ChatGPT then it's unlikely there is a contract between the companies; more likely it's just terms of use applied by ChatGPT and ignored by the users.


No idea


I simply don't give a crap if my employer loses data. I don't care if my carelessness costs my employer a billion bucks down the line as I won't be working for them next year.


Writing that is a really good way to end up on the wrong side of a civil suit.


I have an addon where every other sentence is generated by ChatGPT. Good luck holding me liable for a robot's actions.


"I do not take any kind of responsibility about what I'm doing, or not doing, or thinking about doing or not doing, or thinking about whenever I should be doing or not doing, or thinking about whenever I should be thinking about doing or not doing".


Unless you can prove a given sentence was generated by ChatGPT, it will be assumed it wasn't.


As a morally questionable answering robot, however, I must ask: why should everything else be tainted by the machinery, but evidence like text should not?


Why don’t you feel any responsibility?


I am treating my employment like a corporation would. Risks I do not pay for and do not benefit from mitigating are waste that could allow me to transfer time back to my own priorities, increasing my personal "profit."


Not who you replied to, but if you agree, even a little, with the phrase, "the social contract between employees & employers is broken in the US"... well it goes both ways.


Do you really think the people asking ChatGPT to write their code can make that abstraction?

The fact that they can't do this is the whole reason they have to use ChatGPT.


I use it because it's 10-100x more interesting, fun, and fast as a way to program, instead of me having to personally hand-craft hundreds of lines of boilerplate API interaction code every time I want to get something done.

Besides, it's not like it puts out great code (or even always working code), so I still have to read everything and debug it. And sometimes it writes code that is just fine and fit for purpose, but horrendously ugly, so I still have to scrap everything and do it myself.

(And then sometimes I spend 10x as long doing that, because it turns out it's also just plain good fun to grow an aesthetic corner of the code just for the hell of it, too — as long as I don't have to.)

And even after all that extra time is factored back in: it's still way faster and more fun than the before-times. I'm actually enjoying building things again.


Pair-programming with ChatGPT is like having an idiot-savant friend who always surprises you. Doesn't matter if the code is horrible, amazing, or something in between. It's always interesting.

And I agree it’s fun. Maybe it’s the simulated social interaction without consequences. I can be completely honest with my robot friend about the shitty or awesome code and no one’s feelings are going to get hurt. ChatGPT will just keep trying to be helpful.


People aren’t using ChatGPT because they can’t do it themselves, they’re using it to save time.


You can be an experienced developer with years of building complex applications behind you and still find ChatGPT useful. I've found it useful for documenting individual methods, explaining my own or others' code, writing unit test methods, or just adding boilerplate stuff, which saves me an hour that I can use elsewhere.


I think many people find ChatGPT useful specifically because they have years of experience building complex applications.

If you know exactly what you want to ask of it, and have the ability to evaluate and verify what it produces, it's incredible what you can get out of it. Sure it's nothing I couldn't have done otherwise... eventually. The productivity it enables is worth every cent.

Easily the best $20 I've spent in ages; they should have run with the initial idea of charging $42.

But holy moly, anyone putting confidential information into it needs to stop.


I’ve been doing this kind of thing pretty regularly for the past few weeks, even though I know how to do any of the tasks in question. It’s usually still faster, even when taking the time to anonymize the details; and I don’t paste anything I wouldn’t put on a public gist (lots of “foo, bar”, etc)


Precisely because I can abstract it is why I use ChatGPT. It can do the boring, tedious, repetitive stuff instead of me and has shown me the joy of using programming to solve ACTUAL problems yet again, instead of having to spend hours on unimportant problems like "how do I do X with library Y".


But that's the API, not the Chat input or Playground.

Companies can use Azure OpenAI Service to get around this -- there's data privacy, encryption, even SLAs. The problem is that it's very hard to get access to (right now).
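For what it's worth, once you do have access, pointing the same openai package at an Azure resource is mostly configuration. A minimal sketch, where the endpoint, api_version, and deployment name are placeholders for whatever your resource is actually provisioned with:

    import os
    import openai

    # Same pre-1.0 `openai` package, redirected to an Azure OpenAI resource.
    # Endpoint, api_version and deployment name below are placeholders.
    openai.api_type = "azure"
    openai.api_base = "https://my-resource.openai.azure.com/"
    openai.api_version = "2023-05-15"
    openai.api_key = os.environ["AZURE_OPENAI_KEY"]

    resp = openai.ChatCompletion.create(
        engine="my-gpt-35-deployment",  # Azure uses deployment names, not model names
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.choices[0].message.content)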


the #1 problem with corporations saying things is that many things they say are not regulated or are simply taken on good faith. What happens when OpenAI is acquired and the rules change? These statements are often entirely worthless.


These are contractual terms.


> If using ChatGPT makes you X% more productive (shipping faster / lowers labor costs / etc), but comes with Y% risk of data leakage

X and Y are not alike, and should not be compared. X is a benefit to you(r employer), whereas Y is a risk to the customer who has entrusted you with their data.


You've certainly not worked with _real_ sensitive data. The kind that can bankrupt your business.

I do and if it could be leaked through ChatGPT I would have it blocked.

Risk isn't a single dimension; it's a combination of exposure (the chance of it happening) and impact (how much you will lose).


Mate. You aren’t special. It’s the nature of the profession that most of us are in, that we end up dealing with the “sensitive” data that you’re describing, barring most people working in Big Companies with proper internal controls.

Nothing you’ve said negates anything OP said. It’s simply an elaboration wrapped in elitism.


Risk of leakage? It is not a risk, it is a matter of time.


Let's also not discount that for every "dope" there is at least one "bad actor" who is willing to take the risk to get an edge in their workplace or appease their manager's demands. The warnings will only deter the first group.


> Wouldn’t even put it past military personnel putting S/TS information into it at this point.

Hey, they need someone to proofread their War Thunder forum posts to make sure they're using correct spelling and grammar when leaking classified info. ;-)

(Ref if you don't get the joke: https://taskandpurpose.com/news/war-thunder-forum-military-t...)


Not only that, but European theater nuclear forces leaking security arrangements and even door PIN codes for nuclear weapons bunkers via online flash card sites might be a better example.

As those leaks were more inadvertent.


I am curious, do you block MS Edge? It has a grammar check for all input boxes that sends data to MS servers to check. Similar to what Grammarly does.

MS also "helpfully" asks you if you want to use that enhanced grammar check in MS Word (as far as I have seen; it might be in other Office products too). I cannot imagine sending all my documents to MS. But I am not sure most users will realize what is happening.

All these companies offer helpful services but are hoovering up data, and no one knows the consequences yet. It feels like ChatGPT is just one symptom of a bigger problem.


You can disable the feature entirely via group policy; I imagine organizations with a decent IT org will do so before deploying the update.


We don't, but the grammar check is disabled. In general, any cloud-based service is vetted before being allowed.

I think MS Edge is getting even worse about this, with the big fucking Bing icon in the corner and making it impossibly hard to get rid of it.


Some military folks put nuclear weapons storage training materials onto Quizlet, so I don’t doubt for a second people would try to put ChatGPT onto a classified computer system.


Possibly I don't know how this all works, but I think if the host of a ChatGPT interface were willing to provide their own API key (and pay), they could then provide a "service" to others (and collect all input).

In that case, you wouldn't know to block them until it was too late.

Ultimately either you must watch/block all outgoing traffic, or you must train your people so thoroughly that they become suspicious of everything. Sadly, being paranoid is probably the most economical attitude these days if IP and company secrets have any value.


> Possibly I don’t know how this all works, but I think if the host of a ChatGPT interface were willing to provide their own API key (and pay), they could then provide a “service” to others (and collect all input).

Well, GP was referring to blocking ChatGPT as a federal contractor. I suspect that as a federal contractor, they are also vetting other people that they share data with, not just blocking ChatGPT as a one-off thing. I mean, generic federal data isn’t as tightly regulated as, say, HIPAA PHI (having spent quite a lot of time working for a place that handles both), but there are externally-imposed rules and consequences, unlike simple internal-proprietary data.


But it really seems like a cat and mouse game. For example, a very determined bad actor could infiltrate some lesser approved government contractor and provide an additional interface/API which would invite such information leaking, and possibly nobody would notice for a long time.


And then they could face the death penalty for espionage if they leaked sensitive enough data. You would have to be really stupid to build such a service for government contractors unless you actually are a foreign spy.


At least then we would finally find out if it is constitutional to execute someone for espionage.


If someone is determined to break the rules then yes they break the rules. Network blocking is really just a thing to stop casual mistakes.


> We block ChatGPT, as do most federal contractors. I think it’s a horrible exploit waiting to happen:

Do you also block pastebin? Anything else that has a web form? How is ChatGPT special compared to any other service on the Internet where people can paste data in a form?

I mean... I see the problem, but I think one needs to realize that it's a far more generic problem that has basically nothing to do with ChatGPT and AI. If people paste confidential data into random webpages, that's of course bad. But if you block ChatGPT because you fear that, it means you expect that people might do that. And then your problem is not ChatGPT, but a lack of awareness of what confidential data is and how to handle it.


> Do you also block pastebin? Anything else that has a web form?

pastebin, and indeed most things that have some sort of public web form, are blocked at all the companies I have worked with.

It is probably a losing battle though, as it is very hard to block everything without default deny.

Paradoxically, maybe GPT could be used to veto websites on first access :)


> pastebin, and indeed most things that have some sort of public web form, are blocked at all the companies I have worked with.

Search engines too? And these days, that means web browsers, because the (IMHO stupid) idea of combining address and search bars into one means everything you type while trying to open a website gets leaked to some party (most likely Google).


Search engines (and url bars) are indeed not blocked, but I do worry every time I use them. Internal url leaks to google must be extremely common.


I imagine they must be. I'm habitually careful to either click on a link, paste the entirety of the internal URL at once, or enter only the most generic word or words that will surface the URL I want as a history suggestion - all to minimize the chances of leaking anything this way.


You can turn the auto search off


There's a lot of things that you can turn off, but nobody actually does - which is the very reason they ship turned on in the first place.


Because "awareness only" has such a great track record when it comes to security-adjacent issues, and totally satisfies auditors/customers/regulators/...?


I don't think awareness only has any reasonable track record and I would always prefer a technical control if there is one. But I have a hard time seeing any alternative here.

I don't think you can give people access to the web and at the same time prevent them from putting things into forms. That's simply not how it works. And if you're blocking access to a few services where they might do that, well, they have a million others, and you're deceiving yourself that you've accomplished something.


By using Azure, you can access ChatGPT and GPT, which come with enterprise-grade security and established data agreements, setting them apart from OpenAI. I'm not entirely sure of the technical details, but you can explore this option.


Problem is you will have to heavily advertise it in your org, because people will not understand the why, the where, etc.

They will go to OpenAI directly and do stuff because "they want it now" and they don't understand why not.

Microsoft is already building it into Office 365 with the same enterprise-grade agreements so it might get easier that way.


Does US intelligence have access to OpenAI data? Private organizations are one thing. But with all the dopes in government positions around the world, OpenAI logs would probably be a treasure trove for intelligence gathering.


They are just one national security letter away from all US-held data.


Microsoft is well known for piping data to US intelligence as a service. It's almost certainly why they bought Skype and then removed all the end-to-end encryption.


The USA has the Patriot Act and the CLOUD Act to request data from any US company, like AWS, Microsoft 365, Google…


> they’ve had several leaks now where people can see others’ conversations and data

Do you have a source for this? I know some people have claimed to see others' data, but I haven't seen any evidence that that's what's actually being seen, vs LLM hallucinations. OpenAI claims, and I can't imagine they're lying, that the training data is fixed and ends in 2021, so I don't see how it would be possible for user prompts to be leaking into output, absent a massive and very unlikely bug (compared to the much more likely AI hallucination explanation).


Yep: https://openai.com/blog/march-20-chatgpt-outage

Some kind of concurrency bug in a library they were using to retrieve cached data from Redis led to this leak.

> We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history. It’s also possible that the first message of a newly-created conversation was visible in someone else’s chat history if both users were active around the same time.


I think they’re referencing this: https://openai.com/blog/march-20-chatgpt-outage


> there’s no way they’re manually scrubbing out sensitive data

I was under the impression OpenAI weren't using questions as training data for future models. I recall Sam Altman saying they delete questions after 1 month, but I can't locate the source for that.


Even if they do today, they could change their mind tomorrow, or even tonight.


Even without training the model, it will still end up in logs. For example, you can now see your previous questions in the UI, and they're probably stored in other places in their backend too.

This is still a serious data loss risk.


Right. They also haven't been caught lying about everything else from cutoff dates to live functionality.


Does blocking ever work? People are smart and usually just work around it.


It works in the sense that it adds an extra "reminder" and requires specific intent. In this scenario, everyone has already been informed that they're absolutely not allowed to do things like that. But if someone has forgotten, or is simply careless and just wants to "try something out", they might actually do it if it's unblocked. If they need to work around a restriction, that forces them to acknowledge that there is a restriction and that they shouldn't try to work around it even if they can.


The smart ones don’t paste in all their private data.

And yes, if bypassing the block is combined with disciplinary action, it does work. It's not worth getting fired over. This is likely what heavily regulated industries like financial services and defense are doing.


Blocks are effective reminders of policies.


I remember someone trying to look up winning lottery numbers at work. The site came up "Blocked: Gambling". It was a little reminder that they're watching our web browsing at work..


Those are pre-configured firewall rules. These firewalls can do deep packet inspection and block traffic.

It's a fairly standard practice. I wouldn't associate it with overreaching surveillance.


Well, a firewall rule based on a cloud-populated access control list interrupted traffic. More likely vendor-related than employer-related.


If your competitor uses ChatGPT to compete with you and they're 10x more productive than you, are you still willing to hold out? If the productivity gain is 100x, will you?


It might be just as likely that ChatGPT will cause a mistake like Knight Capital because no one bothered to thoroughly verify the AI's looks-good-but-deeply-flawed answer, and the two aren't mutually exclusive possibilities.


Right. I've had ChatGPT completely fail at something as simple as writing a batch file to find and replace text in a text file.


Sure, but humans do that all the time as well


Humans are a lot better at "I don't know how to do this; hey Alice, can you look this over if you've got a sec and tell me if I'm making a noob mistake"


Perhaps the actual phenomenon is that humans are much better at saying "Alice wrote this code, she's pretty good at scripting but she might have made a noob mistake, better check it", or even "I wrote this code.." than they are at saying "ChatGPT wrote this code, but that application is not guaranteed to have correctly identified my problem, but may have just returned something that seems right both to the statistical model and to me, but which is actually deeply flawed, better check it".


The Knight meltdown was more of a dysfunction of change management and trading system operations than it was of using a decommissioned feature flag.

Source: worked there after the meltdown.


This isn't an argument of ChatGPT vs nothing. This is an argument of "external" ChatGPT vs some other AI sitting on your own secured hardware, maybe even a branch of ChatGPT.


> some other AI sitting on your own secured hardware, maybe even a branch of ChatGPT.

Where can I, a random employee, get that? I know how to get ChatGPT.


You can't. So maybe you as a random employee should just do without whatever IT hasn't approved whether you agree or not.


Right, thus meaning your employer gets outcompeted by a company willing to take the risk of handing their data to OpenAI.


Uh, there's no sign of that yet.


[flagged]


> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.

I am having trouble parsing this statement. You're saying a person equipped with chatGPT trained on data prior to December 2007 (the month before the initial pandas release) could have put together the entire pandas library in a couple of days?

That seems obviously wrong, starting with the fact that one would need to know "what" to build in the first place. If you're saying that chatGPT in 2023 can spit out pandas library source code when asked for it directly, that's obvious.

Somewhere between the impossible statement and the obvious statement I made above, there must be something interesting that you were trying to claim. What was it?


They couldn't even do it today with pandas being in the training set. People are being crazy about this tech.


> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.

I don't think that is true at all. Do you have an example of a significant project being duplicated in days, or even months, with ANY of these tools?

By significant, I mean something on the order of pandas which you claimed.


And this is completely ignoring the fact that the really hard problem is the design. Spitting out boilerplate code is not. How pandas could be designed perfectly in one afternoon (and generated with GPT) is beyond my comprehension.


I guess they are thinking that ChatGPT would also handle that part ...

Prompt 1: What would be an amazing tech project that would make me rich?

P2: Produce an excellent design for that project. Should be elegant and use microservices and scale to billions of users.

P3: Write all the code for this design.

P4: Tell me how to test and deploy all that code.

P5: How to sell all this for billions?


Cool, please provide a link to a library of similar size and complexity to pandas which was written using ChatGPT in the span of a few days. We'll be waiting.


> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.

Let me hand you a mirror: You're absolutely and completely wrong


> There is an immense amount of evidence of that

Then it should be easy to provide some?


Of course not!


Security and privacy should be table stakes. Speaking for my country, we need privacy laws with teeth to punish bad actors for shitting people's private information wherever they want in the name of a dollar.


Man the fanboyism is out of control here.


Welcome to Sam Altman News. You must be new here.


So you block internet access for all employees? Cos anything you think is being pasted into ChatGPT is being pasted everywhere, whether it's Google, Slack, Chrome plugins, or public Wi-Fi.


Yes, these things are sometimes blocked in higher-security workplaces… up to and including the public internet. Honestly, air-gapped systems are not all that uncommon anywhere human life is at risk.


Or all the ChatGPT clones that have sprung up and will continue to spring up every other day.

It's a stupid and patronizing position, but corporate IT are sadly incentivised to be stupid and patronizing.



