Software Arts published an official technical specification of the DIF file format in 1983. [1] (to save a file in the DIF format, /S#S in VC)
It is certainly an interesting format, and Excel completely mangles the data (whether it was intentional is up for debate). For example, the value
1,0
"0.3"
represents the literal string 0.3 in VisiCalc (the 1,0 indicates that the next line is a string). Excel ignores the typing and tries to parse as a value, interpreting it as the number 0.3. Like with CSV, there's an awkward formula workaround for generating a file with the text "0.3":
Maybe a story from a maintainer would help. To contextualize, the main SheetJS open source project https://github.com/SheetJS/sheetjs has over 28K stars.
tl;dr: the project involves "crowdsourced research" which benefits from popularity.
The main social goal with the project is data preservation and integrity. Large-scale economic and political decisions are made from data and analyses in spreadsheets. For example, last year in the UK, COVID cases were underreported thanks to Excel minutiae https://www.bbc.com/news/technology-54423988
Due to various corporate stratagems, the older data representations were intentionally obfuscated. To support Excel, many developers poked around at Excel files and guessed at the structures.
In this environment, the biggest challenge is finding worksheets with random corner cases. These types of files are not easy to create and fuzzing has limited effectiveness. This is where open source and popularity come into play. The open source and JS nature of the project helps reduce testing friction (https://oss.sheetjs.com/ runs in the web browser, no need to install anything) and encourage bug reports with test cases.
There will always be "entitled users" and "low quality bug reports" but that comes with the territory. There are also meaningful issues and code contributions. Efforts at trying to prevent the low quality contributions also discourage higher quality contributions.
Pre-Unicode issues still haunt us today, kept alive by various file formats that rely on system encoding.
Under the Apple "Mac-Roman" encoding [1], the standard MacOS encoding before OSX switched to Unicode, byte 0xBD currently is capital omega (U+03A9 Ω). However, in the original 1994 release of the character set, they erroneously mapped to the ohm sign (U+2126 Ω) Apple eventually fixed this in 1997, as noted in the changelog:
# n04 1997-Dec-01 Update to match internal utom<n3>, ufrm<n22>:
# Change standard mapping for 0xBD from U+2126
# to its canonical decomposition, U+03A9.
However, in 1996, Microsoft copied over the mac encoding to CP10000 using the incorrect character [2]. Unfortunately the codepage was not corrected when Apple realized their mistake.
This discrepancy leads to a huge number of strange issues with various versions of Excel for Mac (BOM-less CSV, SYLK and other plaintext formats default to system encoding) and other software that use Microsoft's interpretation of Apple's Mac-Roman encoding rather than Apple's official character set mapping.
Our story is very similar. I wrote a small library for converting XLSX and XLS files to CSV. Over the years, that grew into one of the most popular open source libraries on npm/github: https://github.com/SheetJS/sheetjs
Excel has to contend with nearly 40 years of backwards compatibility (MultiPlan, the predecessor to Excel, was released in 1982) and a deep userbase that literally has decades of experience and muscle memory with the software. The Symbolic Link "SYLK" file format introduced in MultiPlan is still supported in recent versions of Excel, leading to the infamous CSV "ID" issue.
Many of our users still run very old versions of Excel and Windows (e.g. Excel 5.0 on Windows 95) because a change in a future version of Excel caused problems or gave different results.
When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.
If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:
We have a few customers in reinsurance, and for the most part the goal is to do the opposite of what the python solutions try to do. Instead of integrating foreign stuff into existing workbooks, the goal is to retain the existing worksheets as source of truth and build modern tools around the files. The most common use case is building out a web interface to replicate the Excel formula engine.
In the python space, there are libraries like openpyxl and xlrd, but the real hurdle is introducing python into an ecosystem which otherwise has no natural knowledge. JavaScript is the language of choice for modern Excel addins as Excel provides an actual API for it https://docs.microsoft.com/en-us/office/dev/add-ins/referenc...
> The most common use case is building out a web interface to replicate the Excel formula engine.
This is one of the projects that I'd worked on. We implemented a pretty thorough version of the Excel engine in JS. Load data and expressions as 2d arrays and get a nice api for the output.
In principle I agree strongly with this approach, especially when an extant "working" solution already exists. However, some of the cargo-cultery that goes on almost defies belief.
I once heard of a company that created a database table with columns "workbook","sheet name","row","col","value" that they would extract all of their spreadsheets into as a "Database backend" for their spreadsheets.
Thanks! Haven't sunk my teeth into the new round of features yet, but have found great utility from the last batch (unique, filter, sort) and the prior batch.
SheetJS was created because of a Microsoft licensing issue with a library (https://github.com/stephen-hardy/xlsx.js/issues/8). The other project had a nonstandard license with a clause that only applied to browsers run on Microsoft Windows, which is really bizarre for a JS library that can run on any browser. Apparently the original developer was working for Microsoft at the time, and Microsoft mandated the license clause.
In a funny twist of fate, Microsoft now uses SheetJS open source to power some Excel exports in Office 365! https://tasks.office.com/License.html is the license disclosure, and you can actually see it in action in the exported files.
We were blown away when we found out. It's the ultimate endorsement! Oftentimes we just wonder what would have happened if Microsoft just let the original project adopt a standard open source license.
It is certainly an interesting format, and Excel completely mangles the data (whether it was intentional is up for debate). For example, the value
represents the literal string 0.3 in VisiCalc (the 1,0 indicates that the next line is a string). Excel ignores the typing and tries to parse as a value, interpreting it as the number 0.3. Like with CSV, there's an awkward formula workaround for generating a file with the text "0.3":