I'll add that for Haskell, the library everyone uses for JSON parses numbers into a Scientific type with almost unlimited size and precision. I say almost unlimited because it uses a decimal coefficient-and-exponent representation where the exponent is a 64-bit integer.
The documentation is quite paranoid that if you are dealing with untrusted input, you could parse two JSON numbers from the untrusted source just fine, and then performing an addition on them could cause your memory to fill up. Exciting new DoS vector.
Of course, in practice people end up parsing them into custom types with 64-bit integers, so this is only a problem if you are manipulating JSON directly, which is very rare in Haskell.
I was attempting to solve this very problem in the Rust BigDecimal crate this weekend. Is it better to just let it crash with an out-of-memory error, or to have a compile-time constant limit (I was thinking ~8 billion digits) and panic with a more specific error message if any operation would exceed that limit (does that mean it's no longer arbitrary-precision?)? Or keep some kind of overflow-state/NaN, but then the complexity is shifted into checking for NaNs, which I've been trying to avoid.
Sounds like Haskell made the right call: put warnings in the docs and steer the user in the right direction. Keeps implementation simple and users in control.
To the point of the article, serde_json support is improving in the next version of BigDecimal, so you'll be able to decorate your BigDecimal fields and it'll parse numeric fields from the JSON source, rather than json -> f64 -> BigDecimal.
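In the meantime, something like this sketch works if you enable serde_json's `arbitrary_precision` feature (the struct and field names here are invented for illustration, not the crate's upcoming API):

// Assumes serde_json is built with the `arbitrary_precision` feature, so
// serde_json::Number keeps the original token instead of collapsing to f64.
use std::str::FromStr;

use bigdecimal::BigDecimal;
use serde::{Deserialize, Deserializer};

fn bigdecimal_from_number<'de, D: Deserializer<'de>>(de: D) -> Result<BigDecimal, D::Error> {
    let n = serde_json::Number::deserialize(de)?;
    // With arbitrary_precision, Display gives back the source digits.
    BigDecimal::from_str(&n.to_string()).map_err(serde::de::Error::custom)
}

#[derive(Deserialize)]
struct Invoice {
    #[serde(deserialize_with = "bigdecimal_from_number")]
    amount: BigDecimal,
}

Without `arbitrary_precision` the same code compiles, but the Number has already been through f64, so the extra digits are gone before you ever see them.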
Serde has an interface that allows failing. That one should fail. There is also another that panics, and AFAIK it will automatically panic on any parser that fails.
Do not try to handle huge values, do not pretend your parser is total, and do not pretend it's a correct value.
If you want to create a specialized parser that handles huge numbers, that's great. But any general one must fail on them.
This isn't about parsing so much as letting the users do "dangerous" math operations. The obvious one is dividing by zero, but when the library offers arbitrary precision, addition becomes dangerous with regard to allocating all the digits between a small and a large value.
It's tough to know where to draw the lines between "safety", "speed", and "functionality" for the user.
[EDIT]: Oh I see, fix the parser to disallow such large numbers from entering the system in the first place, then you don't have to worry about adding them together. Yeah that could be a good first step towards safety. Though, I don't know how to parametrize the serde call.
If you are using a library with this kind of number representation, computing any rational number with a repeating decimal representation will use up all your memory. 1/3=0.33333… It will keep allocating memory to store infinite copies of the digit 3. (In practice it stores it using binary representation but you get the idea.)
For the Rust crate, there is already an arbitrary limit (defaulting to 100 digits) for "unbounded operations" like square root, inversion, and division. That's a compile-time constant. And there's a Context object for runtime configuration that you can set with a precision (stop after `prec` digits).
But for addition, the idea is to give the complete number if you do `a + b`; otherwise you can use the context to keep the result bounded with `ctx.add(a, b)`. But after the discussions here, maybe this is too unsafe... and it should use the default precision (or a slightly larger one) in the name of safety? With a compile-time flag to disable it? hmm...
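To make the bounded path concrete with what already exists (a small, untested sketch; `with_prec` is in the current crate):

use bigdecimal::BigDecimal;

fn main() {
    // Division already stops at the default precision instead of running forever.
    let third = BigDecimal::from(1) / BigDecimal::from(3);
    // Round further down if you only need a handful of digits.
    println!("{}", third.with_prec(10));
}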
I'd strongly recommend against this default - it's a major blocker for using the Haskell library with web APIs, as it transforms JSON RPC into readily available denial-of-service attacks.
8 billion digits (~100 bits?) is far more than should be used.
Would it be possible to use const generics to expose a `BigDecimal<N>` or `BigDecimal<MinExp, MaxExp, Precision>` type with bounded precision for serde, and disallow this unsafe `BigDecimal` entirely?
If not, I expect BigDecimal will be flagged in a CVE in the near future for causing a denial of service.
I think that's the use-case for the rust_decimal crate, which is a 96-bit floating number (~28 decimal digits) which is safer and faster than the bigdecimal crate (which at its heart is a Vec<u64>, unbounded, and geared more for things like calculating sqrt(2) to 10000 places, that kind of thing). Still, people are using it for serialization, and I try to oblige.
Having user-set generic limits would be cool, and something I considered when const generics came out, but there's a lot more work to do on the basics, and I'm worried about making the interface too complicated. (And I don't want to reimplement everything.)
I also would like a customizable parser struct, with things like localization, allowing grouping delimiters and such (1_000_000 or 1'000'000 or 10,00,000). That could also return some kind of OutOfRange parsing error to reject "suspicious" out-of-range values. I'm not sure how to make that generic with the serde parser, but I may add some safe limits to the auto-serialization code.
Especially with JSON, I'd expect there's only two kinds of numbers: normal "human" numbers, and exploit attempts.
I think Haskell's warning-in-the-doc approach is not strong enough. I'd be in favor of distinguishing small and huge values using the type system. Have a Rust enum that contains either a small-ish number (the absolute value being 10^100 or less, but the threshold should be configurable preferably as a type parameter) or a huge number. Then the user will be required to handle it. Most of the time the user does not want huge numbers, so they will fail the parse explicitly when they do a match and find it.
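Roughly the shape I have in mind (a loose sketch; the names and the 10^100 bound are just placeholders):

use bigdecimal::BigDecimal;

// A parse result that refuses to hand back a huge value silently.
enum JsonNumber {
    // Within the configured bound, e.g. |x| <= 10^100.
    Small(BigDecimal),
    // Raw source text preserved; the caller must opt in before materialising it.
    Huge(String),
}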
I don't think there is any "sensible limit" which is big enough for everyone's needs, but low enough you won't blow out memory.
An 8 billion digit number is 2.5G? (Did I do my maths right?) All I need to do is shove 1,000 of those in a JSON array, and I'll cause an out-of-memory anyway.
On the other hand, any limit low enough that I can't blow up memory by making an array of 100K or so is going to be too low for some people (including me; I often work with numbers that have a few million digits).
Providing some method of putting a limit on seems sensible, but maybe just make a LimitedBigDecimal type, so then through the whole program there is a limit on how much memory BigDecimals can take up? (I haven't looked at the library in detail, sorry).
If I understand the situation correctly, in Haskell an unbounded number is the default that you get if you do something similar to JSON.parse(mystr). That means you can have issues basically anywhere. Whereas in Rust with Serde you would only get an unbounded number if you explicitly ask for one. That's a pretty major difference. Only a small number of places will explicitly ask for BigDecimal, and in those cases they probably want an actual unbounded number. And they should be prepared to deal with the consequences of that.
Nope, you didn't understand the situation correctly. First, almost nobody directly parses from a string to a JSON AST: people almost always parse into a custom type using either Template Haskell or generics. Second, parsing isn't the issue; doing arithmetic on the number is the issue.
It just handles it natively. The internal representation is coefficient and exponent. Parsing `1e100` results in storing 1 and 100 separately. That's why parsing huge JSON numbers is not a problem. The problem comes when you do arithmetic on it, which is when it needs to convert the number into the libgmp representation.
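A stripped-down illustration of that representation (not the actual Scientific type; treat it as pseudocode-ish Rust):

// Parsing "1e100" just stores (1, 100); nothing large is allocated until you
// normalise the value or do arithmetic on it.
struct Sci {
    coefficient: i64, // the real type uses an arbitrary-size integer here
    exponent: i64,
}

fn parse_simple(s: &str) -> Option<Sci> {
    let (c, e) = s.split_once('e')?;
    Some(Sci {
        coefficient: c.parse().ok()?,
        exponent: e.parse().ok()?,
    })
}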
One of the first Ajax projects I worked on was multi tenant, and someone decided to solve the industrial espionage problem by using random 64 bit identifiers for all records in the system. You have about a .1% chance of generating an ID that gets truncated in JavaScript, which is just enough that you might make it past MVP before anyone figures out it’s broken, and that’s exactly what happened to us.
So we had to go through all the code adding quotes to all the ID fields. That was a giant pain in my ass.
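The failure mode is easy to reproduce anywhere doubles get involved, not just in the browser; e.g. in Rust:

fn main() {
    let id: u64 = 9_007_199_254_740_993; // 2^53 + 1, a perfectly valid 64-bit id
    let through_double = id as f64 as u64; // what a double-based JSON parser does to it
    assert_ne!(id, through_double); // comes back as 9_007_199_254_740_992
}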
I've been burned by a similar issue too. Lesson here is never to use numbers for things you are not planning to do math on. Ids should always be strings.
Isn't the lesson only that ids shouldn't be floats? If they were integers everything would be fine, but JS numbers aren't integers, even if they look like them sometimes.
Nah, the lesson is broader than that, cause numbers as IDs have a whole bunch of problems and this is just one of them. Eg Twitter has incrementing number IDs and back when they had this whole ecosystem of 3rd party twitter apps (that they have since ruined), half the apps failed when the IDs became too large to fit into a 32-bit int.
If it looks like a number, and it quacks like a number, sooner or later people are going to treat it like a number.
We're talking about deserialising JSONs in the application server here, nobody stops you from treating ids as numbers on the database side of things.
But also, this sounds like a premature optimisation. Most applications will never reach a level where their performance is actually impacted by string comparison, and when you reach that stage, you've likely already thrown out a lot of other common sense stuff like db normalisation to get there, and we shouldn't judge "regular people" advice because it doesn't usually apply to you anyway.
Out of curiosity, have you ever seen an application that was meaningfully impacted by this? How gigantic was it?
----
Scratch that. I've actually thought about it some more, and now I'm not 100% sure it's premature, I have to investigate further to be sure. Question still stands though.
I work primarily in data analytics. In my experience it tends to become noticeable as soon as you're at a few million records[0] on at least one side of a relationship. Especially as we see more columnar databases in analytics, the row count matters more than the total data size for this sort of thing.
Due to the type of aggregate queries that typify analytics workloads, almost everything turns into a scan, whether it be of a table, field, or index. Strings occupy more space on disk, or in RAM, so scanning a whole column or table simply takes longer, because you have to shovel more bytes through the CPU. This doesn't even take into account the relative CPU time to actually do the comparisons.
I've never personally worked with a system that has string keys shorter than 10 [1][2] characters. At that point, regardless of how you pack characters into a register, you're occupying more bits with two strings of character data than you would with two 64-bit integers[3]. This shows through in join time.
[0]: Even modestly sized companies tend to have at least a few tables that get into the millions of records.
[1]: I've heard of systems with shorter string keys
[2]: Most systems with string keys I've encountered have more than 10 characters.
[3]: The vast majority of systems I've seen since the mid-2010s use 64-bit integers for keys for analytics. 32-bit integers seemed to phase out for new systems I've seen since ~2015, but were more common prior to that.
> when you reach that stage, you've likely already thrown out a lot of other common sense stuff like db normalisation to get there
Don't most databases set a length limit on ID strings?
If you're setting a length limit, and it's made out of digits with no leading zeroes, then you might as well store it as a number. Is there a downside?
If you're using a 64 bit integers because you've got some super high precision math you need to do over an enormous space of addressable numbers, like maybe you're firing unguided kinetic energy weapons at enemies on other planets... sure, use big numbers. I'm sure you've got some clever libraries able to do such things reliably, and I won't question why you're using json as your serialization format.
If you're using 64 bit numbers as a high cardinality identity that can be randomly generated without concern for collision (like a MAC address with more noise) -- well, that's an identity and doesn't need to have math applied to it. For example: "What's the mean IP address that's connected to cloudflare in the last 10 minutes" or "what's the sum of every mac address in this subnet?" are both nonsense properties because these "numbers" are identities not numbers, and using a data type that treats them as numbers invites surprising, sometimes unpleasantly so, results.
Of course, because these are computers, all strings are ultimately numbers but their numberness is without real meaning.
UUIDs are great for this. It’s really just a random 128-bit integer, which makes comparisons about as fast as variable-length integers on modern hardware. And they decode to strings which means no application code or API end-user code is going to assume it’s a number.
Absolutely agreed on all points. I like UUIDs. There is still a surprising number of data processing systems which don't have support for 128 bit integers. This makes me sad.
With 10 times the memory usage and 100 times the compute power, maybe you could replace floats with something that behaves more like real numbers and covers mostly the same range.
But the resulting type is still going to have its own limitations and sharp edges. Floats are not the right tool for every job but they are quite good at the jobs they are right for. Learning how they work is more useful than lamenting their existence.
With densely packed decimals (3 digits in 10 bits), you can reduce the space overhead to as little as 2.4% (1024/1000). The IEEE has even standardized various base-10 floating-point formats (e.g. decimal64). I'd suspect that with dedicated hardware, you could bring down the compute difference to 2-3x binary FP.
However I read the post I responded to as decrying all floating-point formats, regardless of base. That leaves only fixed-point (fancy integers) and rationals. To represent numbers with the same range as double precision, you'd need about 2048 bits for either alternative. And rational arithmetic is really slow due to heavy reliance on the GCD operation.
Long story short: don't use JSON numbers to represent money or monetary rates. Always use decimals encoded as string. It's surprising how many APIs fall short of this basic bar.
We've used XML for interchange of order-like data. Customers have started demanding JSON, so I built a tool to generate XML <-> JSON converters, along with a JSON Schema file, based on an XSD, so we could continue to use our existing XML infrastructure on the inside.
I must admit I totally forgot about the JSON number issue. Our files include fields for various monetary amounts and similar, and in XML we just used "xs:decimal".
Most will be less than a million and require less than four decimal digits. But I guess someone might decide to use it for the equivalent of a space station or similar, ya never know...
Precision should be part of the spec for integrations. With an integer multiple of a minimal unit, it's clear in the API what the value is.
e.g. it doesn't make sense to support billing in sub-currency unit amounts just by allowing it in your API definition, as you're going to need to batch that until you get a billable amount which is larger than the fee for issuing a bill. Even for something like $100,000.1234, the bank doesn't let you do a transfer for 0.34c.
For cases where sub-currency unit billing is a thing, it should be agreed what the minimal unit is (e.g. advertising has largely standardised on millicents)
Yeah, I am more laughing that once encoded in JSON as { "p": 2256, "dp": 2 } you are using 2 floating-point numbers. But JSON, and indeed JS, wasn't designed.
To be clear, I wasn't advocating for flexible decimal points. There is no "dp" parameter in the solution I was proposing. It's just documented in the API that "price" is denominated in cents (or satoshis or whatever you want)
Then you should store the time as well, because the number of decimals in a currency can change (see ISK). Also, some systems disagree on the number of decimals, so be careful. And of course prices can have more decimals. And then you have cryptocurrencies, so make sure you use bigints
You store it as an integer, but as we just saw in the OP, for general interop with any system that parses JSON you have to assume that it will be parsed as a double. So to avoid precision loss you are going to have to store it as a string anyway. At that point it's up to you whether you want to reinvent the wheel and implement all the required arithmetic operations for your new fixed-point type. Or you could just use the existing decimal type that ships on almost every mature platform: Java, C#, Python, Ruby, etc.
In dollars, what do you get up to with a double of cents without precision loss? It's in the trillions, I figure? So a very large space of applications where smallest-denomination-as-JSON-number is going to be fine.
Depends on the language. On the JVM you are fine. With Javascript, doing math on big numbers is probably going to end in tears unless you know what you are doing. Either way, have some tests for this and make sure your code is doing what you expect.
Encoding numbers as strings because you are using a language and parser that can't deal with numbers properly (even 64-bit doubles) is a bit of a hack. Basically the rest of the world giving up because Javascript can't get its shit together is not a great plan.
Accounting for a lowest common denominator that has a huge share is always a great plan. Every trading platform out there uses the "+-ddd.ddd" format; even binary-born protocols completely unrelated to JS have used it since forever.
> RFC 8259 raises the important point that ultimately implementations decide what a JSON number is.
Any implementation dealing with money or monetary rates should know that it needs to deal with precision and act accordingly. If you want to use JavaScript to work with money, you need to get a library that allows you to represent high precision numbers. It's not unreasonable to also expect that you get a JSON parsing library that supports the same.
The only problem with this attitude is that JSON APIs are meant to be interoperable, and as the OP showed, you can't rely on the systems you interoperate with to uniformly have the same understanding of JSON numbers, and misinterpreting numbers because of system incompatibilities will cause some really bad headaches that are totally avoidable by just forcing everyone to interop in terms of decimal numbers encoded as strings.
I tend to end up encoding everything as an integer (multiply by 1000, 10000 etc) and then turn it back into a float/decimal on decode. For instance if I am building a system dealing with dollar amounts I will store cent amounts everywhere, communicate cent amounts over the wire, etc. then treat it as a presentation concern to render it as a dollar amount.
It's worth bearing in mind when you do that that the largest integer that is "generally safe" in JSON is 2^53-1, so if you scale by a factor of 10000 you're taking 13-14 more bits off that maximum. That leaves you about 2^40, or about a trillion, before you may start losing precision or seeing systems disagree about the decoded values. Whether that's a problem depends on your domain.
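Back-of-the-envelope for that, assuming a scale factor of 10,000:

fn main() {
    let max_safe: u64 = (1 << 53) - 1;       // 9_007_199_254_740_991, the 2^53 - 1 limit
    let max_whole_units = max_safe / 10_000; // ~9.0e11, i.e. just under a trillion
    println!("{max_safe} {max_whole_units}");
}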
But do note that there are multiple actively used currencies with zero, three, five (rare), or even eight (BTC) decimals, and that some currencies can't be divided into every sub-unit (e.g. only steps of 0.5).
Point being: floats are dangerously naive for currency. But integers are naive too. You'll most probably want a "currency" or "money" type. Some Value Object, or even Domain Model.
XML offered all this, but in JSON there's little to convey it, other than some nested "object" with at least the decimal amount (as an int) and the ISO 4217 currency. And maybe (depending on how HATEOAS you wanna be) a formatting string to be used in locales, a rule on divisibility, and/or how many decimal places your int or decimal might have.
(FWIW, I built backends for financial systems and apps. It gets worse than this if you do math on the currencies. Some legislations or bookkeeping rules state that calculation uses more or fewer decimals, e.g. that ($10/3)*3 == $10 vs == $9.99, or that $0.03/2 == $0.01 + $0.02, e.g. when splitting a bill. This stuff is complex, but it's real domain logic.)
When I say dangerously naive, I mean in a way that people can go to jail¹ for "losing" or "inventing" cents. Which your software will do if you use floats.
¹IANAL. But this was told when legal people looked at our architecture.
Your software will still "lose" cents if you use integers, for operations such as dividing a bill (e.g. divide by 3), or applying 3% APR in monthly increments.
The goal is not to avoid rounding errors (which would be quite difficult when the true account value can be an irrational number, as with 3% APR compounding monthly), but to have the exact same rounding errors that are prescribed by the accounting practices. Which may vary depending on legislation.
A decimal floating point is usually a better starting point than integers are.
> for operations such as dividing a bill (e.g. divide by 3), or applying 3% APR in monthly increments.
Which is why passing around ints is not the solution. And why I specifically mention Domain Models and/or Value Object.
A domain model would throw an exception or otherwise disallow certain divisions, for example. What I often do is something like `expense.amount.divide_over(3, leftover_to_last)` or `savings.balance_at(today).percentage_of(3.1337)`.
Sometimes, in simpler setups and when the language allows, I'll re-implement operators like *, / and even + and -. But when actual business logic is needed, I'll avoid these and implement actual domain methods that use the language the business uses.
But never, ever, do I allow just math-ing over the inner values.
So, I disagree: both decimal floating point and integers are just as "bad". Maybe for the inner values in the domain model or value object they are fine, but there, integers are often a slightly better starting point because they make rounding and leftovers very explicit.
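For what it's worth, the `divide_over(3, leftover_to_last)` style can be a tiny amount of code once the inner value is an integer; a toy sketch (not any real crate's API, and it assumes a non-negative total):

// Split a bill without losing cents: every share gets the floor, and the
// leftover goes onto the last share so the total still adds up exactly.
fn divide_over(total_cents: u64, parts: usize) -> Vec<u64> {
    let base = total_cents / parts as u64;
    let leftover = total_cents % parts as u64;
    let mut shares = vec![base; parts];
    *shares.last_mut().expect("parts >= 1") += leftover;
    shares
}

fn main() {
    assert_eq!(divide_over(3, 2), vec![1, 2]); // the $0.03 split-by-two case above
}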
The problem with that (which I have seen in practice) is that you are essentially hard coding the maximum precision you will accept for every client that needs to interpret your JSON.
For example, you say you store monetary amounts as cents. What if you needed to store US gas prices, which are normally priced in amounts ending in 9/10ths of a cent? If you want to keep your values as integers you need to change your precision, which will likely mess up a lot of your code.
and different currencies have different default precisions. So if you're dealing with multiple currencies, now you need both client and server to have a map of all currency precisions for formatting purposes that they agree on.
What's worse is that these things can also change over time and there is sometimes disagreement over what the canonical value is.
E.g. ISO 4217 (used by Safari, Firefox and NodeJS) will say that the Indonesian Rupiah (IDR) uses 2 decimal digits, while Unicode CLDR (used by Chrome) will say that they use 0 decimal digits. The former is the more "legalistic" definition, while the latter matches how people use the currency in reality.
This is not a real issue if you transfer amounts as decimal strings and then pass those to the Intl API for formatting (the formatting will just be different but still correct), but it's catastrophic if you use scaled-up integers (all amounts will be off by magnitudes).
For this reason I would always store currency amounts in an appropriate DECIMAL type in the DB and send currency amounts as strings over the wire.
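If you're on serde, sending the amount as a string is a one-line attribute (a sketch; the struct and field names are invented):

use bigdecimal::BigDecimal;
use serde::{Serialize, Serializer};

fn as_decimal_string<S: Serializer>(v: &BigDecimal, s: S) -> Result<S::Ok, S::Error> {
    // Emit "12.34" instead of a bare JSON number, so no consumer is tempted
    // to read it back as a double.
    s.serialize_str(&v.to_string())
}

#[derive(Serialize)]
struct Price {
    currency: String, // e.g. "IDR"
    #[serde(serialize_with = "as_decimal_string")]
    amount: BigDecimal,
}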
Pedantically, IEEE 754 defines decimal floating point formats (like decimal128) which are appropriate for representing currency. Representing currency in non-integer values in any of the binary floating point formats is indeed a recipe for disaster though.
I have tried to encode all non-trivial numbers as strings. If it's too big (or small), or if it's a float, I'll have to change my JSON schema. Bake the need to decode numbers into the transforms for consistency.
This is great as long as you always make clear which value is pre-encoding and which is post-encoding. I remember one of my first production bugs was giving users 100 times the credit they actually bought. Oops.
I would guess that even most of the time, people using epsilon don't understand it. It's not like there is a universal constant error with floating-point numbers. I feel that saying "just use epsilon" is not much better than x == 0, and it could be harder to find bugs if it sometimes works and other times does not.
I think, funnily enough, a sure sign of an inexperienced programmer in bigco application programming is the other way around: that they wrongly learn a mental model of "floating point is approximate, never ever do ==" in school.
I often store it as smaller than cents, because anything with division or a basket of summed parts with taxes can start to get funky if you round down (and some places have laws about that.)
My opinion is that a safe approach is to use either 52-bit integers or 64-bit floating-point numbers to keep JavaScript compatibility. JavaScript is too important and, at the same time, the errors are too terrific (JS will silently round large integers to the nearest representable double, which could lead to various exploits) to skip on that. If you need anything else, just use strings.
I have added this note, thanks! In the blog I am mostly trying to show the behavior you get using the (maybe de facto) stdlib with its default configuration, but this is useful data to call out.
FWIW, when I've cared about interop and controlled the schema, I've specified JSON strings for numbers, along with the range, precision, and representation. This is no worse (nor better) than using RFC 3339 for dates.
I'm just a JS guy trying to understand the world around me and documenting what I find, not trying to be discourteous (or even courteous). I'll add the note about Python, thanks for calling it out. FWIW JS does not have a similar capability so I can't add a note there.
> FWIW JS does not have a similar capability so I can't add a note there
This example on MDN seems to indicate that you can, am I misunderstanding it?
const bigJSON = '{"gross_gdp": 12345678901234567890}';
const bigObj = JSON.parse(bigJSON, (key, value, context) => {
  if (key === "gross_gdp") {
    // Ignore the value because it has already lost precision
    return BigInt(context.source);
  }
  return value;
});
The optional `context` parameter is a tc39 proposal. The feature compatibility matrix on the bottom of the MDN page is really confusing because it's showing only when `JSON.parse` was added, not whether the optional `context` parameter is supported.
In JS, it's a good idea anyway to use some JSON parsing library instead of JSON.parse.
With Zod, you can use z.bigint() parser. If you take the "parse any JSON" snippet https://zod.dev/?id=json-type and change z.number() to z.bigint(), it should do what you are looking for.
I think the spec just means, assume IEEE 754. In the case of 0.1, which cannot be represented exactly, software should assume that `0.1` will be represented as `0.100000000000000005551115123126`. Depending on `0.1` being parsed as the exact value `0.1` is not widely interoperable.
Relatedly, what about integers like 9007199254740995. Is that a legal integer since it rounds to 9007199254740996?
It does seem unclear what it means to exceed precision (given rounding is such an expected part of the way we use these numbers). Magnitude feels easier as at least you definitely run out of bits in the exponent.
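Both examples upthread are quick to check; 9007199254740995 sits exactly halfway between two representable doubles, and round-half-to-even sends it up:

fn main() {
    let x = 9007199254740995_i64 as f64;
    println!("{}", x as i64);    // prints 9007199254740996
    println!("{:.30}", 0.1_f64); // prints 0.100000000000000005551115123126
}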
I think the spec is saying that it is the message that should not express greater magnitude or precision, not 'the number'.
So including the string "0.1" in a message is fine because v = 0.1 implies 0.05 < v < 0.15, but including 0.100000000000000000000000000000000000 would not be.
The first thing I check before using a JSON parser library is whether it lets me get the number as a string and do my own conversion. I usually pass on libraries that insist on treating the number as a double or that bring in a large bigint/decimal dependency.
If you need a specific exotic JSON parser to parse the numbers you have correctly, I would argue that you should serialise them as strings and not as numbers.
When I wrote my jsonptr tool a few years ago, I noticed that some JSON libraries (in both C++ and Rust) don't even do "parse a string of decimal digits as a float64" properly. I don't mean that in the "0.3 isn't exactly representable; have 0.30000000000000004 instead" sense.
I mean that rapidjson (C++) parsed the string "0.99999999999999999" as the number 1.0000000000000003. Apart from just looking weird, it's a different float64 bit-pattern: 0x3FF0000000000000 vs 0x3FF0000000000001.
Similarly, serde-json (Rust) parsed "122.416294033786585" as 122.4162940337866. This isn't as obvious a difference, but the bit-patterns differ by one: 0x405E9AA48FBB2888 vs 0x405E9AA48FBB2889. Serde-json does have a "float_roundtrip" feature flag, but it's opt-in, not enabled by default.
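If anyone wants to check their own build, the difference is a single bit, and which one you get depends on how serde_json was compiled:

fn main() {
    // Output depends on whether serde_json was built with the
    // `float_roundtrip` feature (it's off by default).
    let v: f64 = serde_json::from_str("122.416294033786585").unwrap();
    println!("{} {:#018x}", v, v.to_bits());
}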
This requires multiple-precision arithmetic to do properly and isn't useful most of the time. It's odd to describe this as "not properly"; you might say "with exact rounding", but that makes it clearer that this isn't that useful a feature, especially since we usually expect floats to be inexact in the first place.
With JSON, there's essentially no such thing as "properly" when it comes to parsing numbers, since the spec doesn't limit the ability of the implementation to constrain width and precision. It only says that float64 is common and therefore "good interoperability can be achieved by implementations that expect no more precision or range than these provide", but note the complete absence of any guarantees in that wording.
The only sane thing with JSON is to avoid numbers altogether and just use decimal-encoded strings. This forces the person parsing it on the other end to at least look up the actual limits defined by your schema.
I think the thing folk miss is when there’s an error like divide by zero, or the calculation would return NaN. I feel like this is the main gap/concern with using JSON and it seems to be rarely discussed.
Agreed, this can be a pain. Python by default serializes and deserializes the `NaN` literal, making you pay some cleanup cost once you need to interop with other systems (same for `Inf`).
Say what you want about NaN, but IEEE 754 is the de facto way of dealing with floating point in computers, and even if NaNs and Infs are a bit "fringe", it's unfortunate that the most popular serialization format cannot represent them.
There are so many things that are poorly thought out or underspecified in JSON, it's amazing that it got so widely adopted for interop. No wonder that it became a perpetual source of serialization bugs.
Especially annoying given that they could have been easily adopted. Infinity could've been encoded as `1/0` (among many other possibilities). NaN could've been encoded as `0/0` (again, among many other possibilities). JSON doesn't allow all possible JavaScript literals anyway, so these encodings might have worked if they were somehow standardized.
I like to think of floating-point values as noisy analog voltages, with the extra property that they can store small integers perfectly, and that they can be copied within code but not round-trip serialized and deserialized without noise.
They're not really noisy, but if an application would work with some random noise added, it will probably work with floats; and if it wouldn't work with noise added, it's probably easier to just not use floats than to expect people to reason about IEEE details while risking subtle bugs if different float representations get mixed.
Of course I'm not doing a lot of high performance algorithms, I would imagine in some applications you really do need to reason about floats.
I think a good decision regarding numbers in an API (the one made in my project) is to put meaningful decimal numbers into strings and let them be handled by an exact decimal calculation framework, e.g. BigDecimal in Java etc.
Does anyone have any idea why Crockford decided that at least one digit is required after the decimal point, as opposed to JavaScript which has zero or more?
It's tolerated because the JSON spec explicitly allows it:
This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
available and widely used, good interoperability can be achieved by
implementations that expect no more precision or range than these
provide, in the sense that implementations will approximate JSON
numbers within the expected precision. A JSON number such as 1E400
or 3.141592653589793238462643383279 may indicate potential
interoperability problems, since it suggests that the software that
created it expects receiving software to have greater capabilities
for numeric magnitude and precision than is widely available.
Note that when such software is used, numbers that are integers and
are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.
And yes, this is completely insane for a format that is supposed to be specifically for serialization and interop. Needless to say, the industry has enthusiastically adopted it to the point where it became the standard.
I miss XML these days. Sure, it was verbose and had a bunch of different and probably excessive numeric types defined for XML Schema... but at least they were well-defined (https://www.w3.org/TR/xmlschema-2/#built-in-datatypes). And, on the other hand, without a schema, all you had were strings. Either way, no mismatched expectations.
That's true for every floating point number in every programming language you have ever used, though.
$ python3
Python 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 100000.000000000017
100000.00000000001
Python 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from decimal import Decimal
>>> Decimal('100000.000000000017')
Decimal('100000.000000000017')
And not every programming language offers a Decimal type, and on most of those that do, there's usually a performance penalty associated with it, not to mention issues of interoperability and developer knowledge of its existence. For financial calculations, the usual approach is integers with an implicit decimal offset (e.g., US currency amounts being expressed in cents rather than dollars), while other contexts will often determine that the inherent inaccuracy of IEEE floating-point types is a non-issue. The biggest potential problem lies in treating values that act kind of like numbers and look like numbers as numbers, e.g., Dewey Decimal classification numbers or the topic in a Library of Congress classification.¹
⸻
1. This is a bit on my mind lately as I discovered that LibraryThing’s sort by LoC classification seems to be broken so I exported my library (discovering that they export as ISO8859-1 with no option for UTF-8) and wrote a custom sorter for LOC classification codes for use in finally arranging the books on my shelves after my move last year.
> That's true for every floating point number in every programming language you have ever used, though.
Alright, if "you" have only ever used python. In C, for example, we have hexadecimal floating point literals that represent all floats and doubles exactly (including infinities and nans that make the json parser fail miserably).
Yes, I was talking about what is described in your resources. You can do this:
#include <stdio.h>

int main(void) {
    // define a floating-point literal in hex and print it in decimal
    float x = 0x1p-8; // x = 1.0/256
    printf("x = %g\n", x); // prints 0.00390625

    // define a floating-point literal in decimal and print it in various ways
    float y = 0.3; // non-representable, rounded to closest float
    printf("y = %g\n", y); // 0.3 (the %g format does some heuristics)
    printf("y = %.10f\n", y); // 0.3000000119
    printf("y = %.20f\n", y); // 0.30000001192092895508
    printf("y = %a\n", y); // 0x1.333334p-2
    return 0;
}
Your question is ambiguous for two different reasons. First, this value is not representable as a floating-point number, so there's no way that you can even store it in a float. Second, once you have a float variable, you can print it in many different ways. So, the answer to your question is, irremediably, "it depends what you mean by exact value".
If you print your variable with the %a format, then YES, the exact value is preserved and there is no loss of information. The problem is that the literal that you wrote cannot be represented exactly. But this is hardly a fault of the floats. Ints have exactly the same problem:
int x = 2.5; // x gets the value 2
int y = 7/3; // same thing
If you would like to contribute Swift tests, I would be happy to take it! You can send a PR into this document, updating the data tables and adding a code sample at the end: https://github.com/bterlson/blog/blob/main/content/blog/what.... No need to test openapi-tools swift codegen unless you really want to!
JS is likely to get a hook to be able to handle serialization/deserialization of such values without swapping out the entire implementation[1]. Native support for these types, without additional code or configuration, would likely break the Internet badly, so is unlikely to happen unfortunately.
Much more valuable than any such extension would be a way to annotate types and byte lengths of keys and values so that parsers could work more efficiently. I’ve spent a lot of time making a fast JSON parser in Java and the thing that makes it so hard is you don’t know how many bytes anything is, or what type. It’s hard to do better than naive byte-by-byte parsing.
If you control the underlying data, I must recommend Amazon Ion! Its text format is a strict superset of JSON, but they also maintain a binary format that will round-trip data and is designed for efficient scanning. There are even prefixed annotations if you want them :)
It also specs proper decimal values, mitigating the issues presented in the OP.
"ID numbers start from 2^53 and are allocated sequentially including odd numbers that are not compatible with "double" types. Please ensure you are reading this value as a 64-bit integer."