That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use Excel and other similar spreadsheet software, which exposes us and them to risk regardless of where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.
"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that is unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.
Hey, I'm the author of the linked article, cool to see this is still getting passed around.
Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.
An imperfect solution is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.
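To make the interface idea concrete, here's a rough sketch of what such a pair of functions could look like in Python. Only the two function names come from the comment above; the helper names, the set of trigger characters, and the space-prefix escaping are assumptions for illustration, not a vetted defense.

```python
import csv
import io

# Assumed set of characters that make spreadsheet clients treat a cell
# as a formula; the thread itself does not list them.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def _neutralize(cell):
    # Hypothetical escaping: prefix a space so the cell no longer
    # starts with a trigger character.
    return " " + cell if cell.startswith(FORMULA_TRIGGERS) else cell


def _write(rows, transform):
    # Serialize rows to CSV text, applying `transform` to every cell.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([transform(str(cell)) for cell in row])
    return buf.getvalue()


def export_csv_for_eventual_import(rows):
    """The data is going back into a program: leave every cell untouched."""
    return _write(rows, lambda cell: cell)


def export_csv_for_spreadsheet_viewing(rows):
    """A person will open this in a spreadsheet: neutralize risky cells."""
    return _write(rows, _neutralize)
```

The point is that the caller states the intent by picking a function, so the escaping decision is made once, at the place that actually knows what the export is for.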
Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.
Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.
Appreciate your work! Your piece was pivotal in changing my mind about whether this should be considered in our purview to address.
The intention-based philosophy of all this makes a lot of sense, was eye opening, and I agree it should be the first approach. Unfortunately, after considering our use cases, we quickly realized that we'd have no way of knowing how customers intend to use the CSV exports they've requested - we've talked to some of them and it's a mix. We could approach things case by case, but we really just want a setup which works well 99% of the time and mitigates known risk. We settled on the prefixing approach and have yet to receive any complaints about it. Specifically, we use a space character, with the idea that something unobtrusive (e.g. easily strippable) but also visible would be best, to avoid quirks stemming from something completely hidden.
Thanks again for your writing and thoughts. Like I said above, I haven't found much else of quality on the topic.
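For anyone curious, the space-prefix idea boils down to something like the sketch below. This is not our production code: the function names and the list of trigger characters (`=`, `+`, `-`, `@`) are assumptions for illustration, and which characters a given spreadsheet client treats as a formula start may differ.

```python
# Assumed trigger characters; adjust for the clients you care about.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def prefix_cell(value: str) -> str:
    """Prefix a space when the cell starts with a trigger character."""
    if value.startswith(FORMULA_TRIGGERS):
        return " " + value
    return value


def strip_prefix(value: str) -> str:
    """Undo prefix_cell for consumers that want the original value back."""
    if value.startswith(" ") and value[1:].startswith(FORMULA_TRIGGERS):
        return value[1:]
    return value
```

Two trade-offs worth knowing: plain negative numbers like `-5` get prefixed too, and `strip_prefix` can't tell a prefixed cell apart from data that originally started with a space followed by a trigger character, so the round trip is only exact when the source data has no such cells.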
I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat.
Then, the user doesn’t have Excel as the default program for opening the file and has to jump through a couple safety hoops.
If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.
I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.
How is .csv a niche dev-focused format? Our customers use our exports for a mix of purposes, some of them involving spreadsheet clients (not just Excel) and some of them integrating with their own data pipelines. CSV conveniently works with these use cases across the board, without explanation, and is inconveniently saddled with these legacy security flaws in Excel (and probably other clients).
If xlsx works for all your use cases that's great, a much better solution than trying to sidestep these issues by lightly modifying the data. It's not an option for us, or (I'd imagine) for a large contingent of export tools which can't make assumptions about downstream usage.