What even is a JSON number? (trl.sn)
179 points by bterlson on April 1, 2024 | 147 comments


I'll add that for Haskell, the library everyone uses for JSON parses numbers into Scientific types with almost unlimited size and precision. I say almost unlimited because they use a decimal coefficient-and-exponent representation where the exponent is a 64-bit integer.

The documentation is quite paranoid about untrusted inputs: you could parse two JSON numbers from an untrusted source just fine, and then performing an addition on them could fill up your memory. Exciting new DoS vector.

Of course in practice people end up parsing them into custom types with 64-bit integers, so this is only a problem if you are manipulating JSON directly which is very rare in Haskell.


I was attempting to solve this very problem in the Rust BigDecimal crate this weekend. Is it better to just let it crash with an out-of-memory error, or to have a compile-time constant limit (I was thinking ~8 billion digits) and panic with a more specific error message if any operation would exceed it (though does that mean it's no longer arbitrary-precision?)? Or keep some kind of overflow-state/NaN, but then the complexity is shifted into checking for NaNs, which I've been trying to avoid.

Sounds like Haskell made the right call: put warnings in the docs and steer the user in the right direction. Keeps implementation simple and users in control.

To the point of the article, serde_json support is improving in the next version of BigDecimal, so you'll be able to decorate your BigDecimal fields and it'll parse numeric fields from the JSON source, rather than json -> f64 -> BigDecimal.

    #[derive(Serialize, Deserialize)]
    pub struct MyStruct {
      #[serde(with = "bigdecimal::serde::json_num")]
      value: BigDecimal,
    }
Whether or not this is a good idea is debatable[^], but it's certainly something people have been asking for.

[^] Is every part of your system, or your users' systems, going to parse with full precision?


It's best if your parser fails.

Serde has an interface that allows failing. That one should fail. There is also another that panics, and AFAIK it will automatically panic on any parser that fails.

Do not try to handle huge values, do not pretend your parser is total, and do not pretend it's a correct value.

If you want to create a specialized parser that handles huge numbers, that's great. But any general one must fail on them.


This isn't about parsing so much as letting the users do "dangerous" math operations. The obvious one is dividing by zero, but when the library offers arbitrary precision, addition becomes dangerous with regard to allocating all the digits between a small and a large value:

  1e10 + 1e-10 = 10000000000.0000000001
  1e10000000000000000000 + 1e-10000000000000000000 = ...
It's tough to know where to draw the lines between "safety", "speed", and "functionality" for the user.

[EDIT]: Oh I see, fix the parser to disallow such large numbers from entering the system in the first place, then you don't have to worry about adding them together. Yeah that could be a good first step towards safety. Though, I don't know how to parametrize the serde call.


If you are using a library with this kind of number representation, computing any rational number with a repeating decimal representation will use up all your memory. 1/3=0.33333… It will keep allocating memory to store infinite copies of the digit 3. (In practice it stores it using binary representation but you get the idea.)


For the Rust crate, there is already an arbitrary limit (defaults to 100 digits) for "unbounded operations" like square_root, inverting, division. That's a compile time constant. And there's a Context object for runtime-configuration you can set with a precision (stop after `prec` digits).

But for addition, the idea is to give the complete number if you do `a + b`; otherwise you could use the context to keep the numbers bounded via `ctx.add(a, b)`. But after the discussions here, maybe this is too unsafe... and it should use the default precision (or a slightly larger one) in the name of safety? With a compile-time flag to disable it? hmm...
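
For comparison, Python's decimal module takes roughly this bounded-context approach (a rough sketch, nothing to do with the Rust crate): arithmetic results are rounded to the context's precision, and absurd exponents trip an Overflow trap instead of allocating digits.

    from decimal import Decimal, Overflow, getcontext

    getcontext().prec = 28                     # the default context precision
    print(Decimal(1) / Decimal(3))             # 0.3333333333333333333333333333 (stops at 28 digits)
    print(Decimal("1e30") + Decimal("1e-30"))  # 1.000000000000000000000000000E+30 (tiny term rounds away)

    try:
        Decimal("1e1000000") * Decimal("1e1000000")  # result exponent exceeds the context's Emax
    except Overflow:
        print("Overflow signalled instead of allocating digits")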


I'd strongly recommend against this default - it's a major blocker for using the Haskell library with web APIs, as it transforms JSON RPC into readily available denial-of-service attacks.

8 billion digits (~100 bits?) is far more than should be used.

Would it be possible to use const generics to expose a `BigDecimal<N>` or `BigDecimal<MinExp, MaxExp, Precision>` type with bounded precision for serde, and disallow this unsafe `BigDecimal` entirely?

If not, I expect BigDecimal will be flagged in a CVE in the near future for causing a denial of service.


I think that's the use-case for the rust_decimal crate, which is a 96-bit floating number (~28 decimal digits) which is safer and faster than the bigdecimal crate (which at its heart is a Vec<u64>, unbounded, and geared more for things like calculating sqrt(2) to 10000 places, that kind of thing). Still, people are using it for serialization, and I try to oblige.

Having user-set generic limits would be cool, and something I considered when const generics came out, but there's a lot more work to do on the basics, and I'm worried about making the interface too complicated. (And I don't want to reimplement everything.)

I also would like a customizable parser struct, with things like localization, allowing grouping delimiters and such (1_000_000 or 1'000'000 or 10,00,000). That could also return some kind of OutOfRange parsing error to disallow "suspicious" values that are out of range. I'm not sure how to make that generic with the serde parser, but I may add some safe limits to the auto-serialization code.

Especially with JSON, I'd expect there's only two kinds of numbers: normal "human" numbers, and exploit attempts.


I think Haskell's warning-in-the-doc approach is not strong enough. I'd be in favor of distinguishing small and huge values using the type system. Have a Rust enum that contains either a small-ish number (the absolute value being 10^100 or less, but the threshold should be configurable preferably as a type parameter) or a huge number. Then the user will be required to handle it. Most of the time the user does not want huge numbers, so they will fail the parse explicitly when they do a match and find it.


That seems to be the sentiment here. I'll take it into consideration. Thanks.


I don't think there is any "sensible limit" which is big enough for everyone's needs, but low enough you won't blow out memory.

An 8 billion digit number is 2.5G? (Did I do my maths right?) All I need to do is shove 1,000 of those in a JSON array, and I'll cause an out-of-memory anyway.

On the other hand, any limit low enough that I can't blow up memory by making an array of 100K or so is going to be too low for some people (including me; I often make numbers with a few million digits).

Providing some method of putting a limit on seems sensible, but maybe just make a LimitedBigDecimal type, so then through the whole program there is a limit on how much memory BigDecimals can take up? (I haven't looked at the library in detail, sorry).


If I understand the situation correctly, in Haskell an unbounded number is the default that you get if you do something similar to JSON.parse(mystr). That means you can have issues basically anywhere. Whereas in Rust with Serde you would only get an unbounded number if you explicitly ask for one. That's a pretty major difference. Only a small number of places will explicitly ask for BigDecimal, and in those cases they probably want an actual unbounded number. And they should be prepared to deal with the consequences of that.

My 2cent anyway.


Nope you didn't understand the situation correctly. First, almost nobody directly parses from a string to JSON AST: people almost always parse into a custom type using either Template Haskell or generics. Second, parsing isn't the issue; doing arithmetic on the number is the issue.


Surely the generics approach would go via an aeson Value as an intermediate format, and thus possibly store an unbounded Scientific.


Storing it isn't the problem.


How does it handle exponent notation?

https://news.ycombinator.com/item?id=36027871


It just handles it natively. The internal representation is coefficient and exponent. Parsing `1e100` results in storing 1 and 100 separately. That's why parsing huge JSON numbers is not a problem. The problem comes when you do arithmetic on it, which is when it needs to convert the number into the libgmp representation.


One of the first Ajax projects I worked on was multi tenant, and someone decided to solve the industrial espionage problem by using random 64 bit identifiers for all records in the system. You have about a .1% chance of generating an ID that gets truncated in JavaScript, which is just enough that you might make it past MVP before anyone figures out it’s broken, and that’s exactly what happened to us.

So we had to go through all the code adding quotes to all the ID fields. That was a giant pain in my ass.


I've been burned by a similar issue too. Lesson here is never to use numbers for things you are not planning to do math on. Ids should always be strings.


Isn't the lesson only that ids shouldn't be floats? If they were integers everything would be fine, but JS numbers aren't integers, even if they look like them sometimes.


Nah, the lesson is broader than that, cause numbers as IDs have a whole bunch of problems and this is just one of them. Eg Twitter has incrementing number IDs and back when they had this whole ecosystem of 3rd party twitter apps (that they have since ruined), half the apps failed when the IDs became too large to fit into a 32-bit int.

If it looks like a number, and it quacks like a number, sooner or later people are going to treat it like a number.


> If it looks like a number, and it quacks like a number, sooner or later people are going to treat it like a number.

Which is perfectly fine; just don't treat it like an int32. :-)


Until you want faster joins, in which case, comparisons of integers tend to be much faster on hardware I am aware of than string comparisons.


We're talking about deserialising JSONs in the application server here, nobody stops you from treating ids as numbers on the database side of things.

But also, this sounds like a premature optimisation. Most applications will never reach a level where their performance is actually impacted by string comparison, and when you reach that stage, you've likely already thrown out a lot of other common-sense stuff like db normalisation to get there, and we shouldn't judge "regular people" advice because it doesn't usually apply to you anyway.

Out of curiosity, have you ever seen an application that was meaningfully impacted by this? How gigantic was it?

----

Scratch that. I've actually thought about it some more, and now I'm not 100% sure it's premature, I have to investigate further to be sure. Question still stands though.


I work primarily in data analytics. It tends to become noticeable in my experience as soon as you're at a few million records[0] on at least one side of a relationship. Especially as we see more columnar databases in analytics, the row count accounts for more than total data size for this sort of thing.

Due to the type of aggregate queries that typify analytics workloads, almost everything turns into a scan, whether it be of a table, field, or index. Strings occupy more space on disk, or in RAM, so scanning a whole column or table simply takes longer, because you have to shovel more bytes through the CPU. This doesn't even take into account the relative CPU time to actually do the comparisons.

I've never personally worked with a system that has string keys shorter than 10 [1][2] characters. At that point, regardless of how you pack characters into a register, you're occupying more bits with two strings of character data than you would with two 64-bit integers[3]. This shows through in join time.

[0]: Even modestly sized companies tend to have at least a few tables that get into the millions of records.

[1]: I've heard of systems with shorter string keys

[2]: Most systems with string keys I've encountered have more than 10 characters.

[3]: The vast majority of systems I've seen since the mid-2010s use 64-bit integers for keys for analytics. 32-bit integers seemed to phase out for new systems I've seen since ~2015, but were more common prior to that.


People use integer data types for primary keys in databases all the time. There is nothing wrong with it.


Mostly a deal of defaults on our stack. Tweaked a couple of things in a few places to stop the bleeding. Then had to fix all of the tests.


> when you reach that stage, you're likely have already thrown out a lot of other common sense stuff like db normalisation to get there

Don't most databases set a length limit on ID strings?

If you're setting a length limit, and it's made out of digits with no leading zeroes, then you might as well store it as a number. Is there a downside?


Don't care.

A numeric identity is an identity and so is a string.

If you want to math it, it is a number, otherwise... string.

"Will you ever want the 95th percentile PID? Then it is not a number. Move on."


Double precision floats can't represent every 64-bit integer. If you want to math it, what kind of number will you accept?


They are saying not to use numbers unless you need to do math with the thing.

If you need to do math with the thing, use an appropriate type of number, of course.


If you're using a 64 bit integers because you've got some super high precision math you need to do over an enormous space of addressable numbers, like maybe you're firing unguided kinetic energy weapons at enemies on other planets... sure, use big numbers. I'm sure you've got some clever libraries able to do such things reliably, and I won't question why you're using json as your serialization format.

If you're using 64 bit numbers as a high cardinality identity that can be randomly generated without concern for collision (like a MAC address with more noise) -- well, that's an identity and doesn't need to have math applied to it. For example: "What's the mean IP address that's connected to cloudflare in the last 10 minutes" or "what's the sum of every mac address in this subnet?" are both nonsense properties because these "numbers" are identities not numbers, and using a data type that treats them as numbers invites surprising, sometimes unpleasantly so, results.

Of course, because these are computers, all strings are ultimately numbers but their numberness is without real meaning.


UUIDs are great for this. It’s really just a random 128-bit integer, which makes comparisons about as fast as variable-length integers on modern hardware. And they decode to strings which means no application code or API end-user code is going to assume it’s a number.


Absolutely agreed on all points. I like UUIDs. There is still a surprising number of data processing systems which don't have support for 128 bit integers. This makes me sad.


> You have about a .1% chance of generating an ID that gets truncated in JavaScript

I don't follow. 1 - (Number.MAX_SAFE_INTEGER / 2^63) ~ 99.9%, so don't you have a >99% chance of generating an ID that gets truncated in js?


IEEE 754 can represent integers larger than MAX_SAFE_INTEGER, just not all of them:

https://en.wikipedia.org/wiki/Double-precision_floating-poin...

That's still going to be a greater than 0.1% chance of hitting a non-representable value though.
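
A rough Monte Carlo check (a Python sketch, just to put a number on it): around 99.7% of uniformly random 64-bit integers change when round-tripped through a double.

    import random

    trials = 1_000_000
    altered = 0
    for _ in range(trials):
        x = random.getrandbits(64)   # a uniformly random 64-bit ID
        if int(float(x)) != x:       # round trip through an IEEE 754 double
            altered += 1
    print(altered / trials)          # roughly 0.997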


It’s been a long long time. I may be remembering the ratio wrong, or we might have been clipping the range a bit.


> or we might have been clipping the range a bit

Well it's a pretty abrupt change. 53 bits work fine, at 54 bits a quarter of numbers get truncated, at 55 it's half.


Why would the value get truncated?


FTA: JavaScript's built-in JSON implementation is limited to the range and precision of a double.

Obviously, not all int64 values are representable in float64 (double).


We have ample computing power today to be rid of floats altogether and use integers, fractions and natural numbers.


With 10 times the memory usage and 100 times the compute power, maybe you could replace floats with something that behaves more like real numbers and covers mostly the same range.

But the resulting type is still going to have its own limitations and sharp edges. Floats are not the right tool for every job but they are quite good at the jobs they are right for. Learning how they work is more useful than lamenting their existence.


It’s about 2.5 times as much memory, if you do base 10: 65k takes a little less than 5 bytes of decimal digits to represent 2 bytes' worth of value.

But floats are not the right representation for values that need to exactly match, like an ID, to be sure.

If I’m off by half a cent it’s annoying. If I’m off by half a row I get nothing.

The thing is that almost all of the problems we had in my initial story came from choosing system defaults. All except the PK algorithm.


With densely packed decimals (3 digits in 10 bits), you can reduce the space overhead to as little as 2.4% (1024/1000). The IEEE has even standardized various base-10 floating-point formats (e.g. decimal64). I'd suspect that with dedicated hardware, you could bring down the compute difference to 2-3x binary FP.

However I read the post I responded to as decrying all floating-point formats, regardless of base. That leaves only fixed-point (fancy integers) and rationals. To represent numbers with the same range as double precision, you'd need about 2048 bits for either alternative. And rational arithmetic is really slow due to heavy reliance on the GCD operation.


I was speaking in the context of JSON, where all numbers are decimal.


not all numbers are representable as the particular type of floating point number that js uses

nice pics here: https://en.wikipedia.org/wiki/Floating-point_arithmetic


Long story short: don't use JSON numbers to represent money or monetary rates. Always use decimals encoded as string. It's surprising how many APIs fall short of this basic bar.


We've used XML for interchange of order-like data. Customers have started demanding JSON, so I built a tool to generate XML <-> JSON converters, along with a JSON Schema file, based on an XSD, so we could continue to use our existing XML infrastructure on the inside.

I must admit I totally forgot about the JSON number issue. Our files include fields for various monetary amounts and similar, and in XML we just used "xs:decimal".

Most will be less than a million and require less than four decimal digits. But I guess someone might decide to use it for the equivalent of a space station or similar, ya never know...


No. Use integers to store the smallest money decimal, and store the currency name alongside.


What happens if you’re sure that four decimal places is the smallest, then suddenly a partner system starts sending you 6 decimal places?


Precision should be part of the spec for integrations. With the integer multiple of minimal unit, that makes it clear in the API what it is.

e.g. it doesn't make sense to support billing in sub-currency unit amounts just by allowing it in your API definition, as you're going to need to batch that until you get a billable amount which is larger than the fee for issuing a bill. Even for something like $100,000.1234, the bank doesn't let you do a transfer for 0.34c.

For cases where sub-currency unit billing is a thing, it should be agreed what the minimal unit is (e.g. advertising has largely standardised on millicents)


Just a note that precision is a part of the standard if you’re using the ISO 4217 standard, which defines minor unit fractions.

https://en.wikipedia.org/wiki/ISO_4217

Or choose a different standard, I don’t know what else is out there, but you probably should choose an existing one.


Implementation of fixed point decimals, using multiple integer representations encoded within a floating point system. Nice.


Well I mean, if your minimal unit is 1c, then a price like $22.56 should be encoded as 2256 cents.

If you're doing ads and going for millicents, something like $0.01234 should be encoded as 1234 millicents.

Obviously you have to agree on what you're measuring in the API, you can't have some values be millicents and others cents.


Yeah I am more laughing that once encoded in JSON as { "p": 2256, "dp": 2 } you are using 2 floating point numbers. But JSON, and indeed JS wasn't designed.


To be clear, I wasn't advocating for flexible decimal points. There is no "dp" parameter in the solution I was proposing. It's just documented in the API that "price" is denominated in cents (or satoshis or whatever you want)


Then you should store the time as well, because the number of decimals in a currency can change (see ISK). Also, some systems disagree on the number of decimals, so be careful. And of course prices can have more decimals. And then you have cryptocurrencies, so make sure you use bigints


You store it as an integer, but as we just saw in the OP, for general interop with any system that parses JSON you have to assume that it will be parsed as a double. So to avoid precision loss you are going to have to store it as a string anyway. At that point it's up to you whether you want to reinvent the wheel and implement all the required arithmetic operations for your new fixed-point type. Or you could just use the existing decimal type that ships on almost every mature platform: Java, C#, Python, Ruby, etc.


In dollars, what do you get up to with a double of cents without precision loss? It's in the trillions, I figure? So a very large space of applications where smallest-denomination-as-JSON-number is going to be fine.
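
Back-of-the-envelope (a quick Python check, not from the thread): every integer up to 2^53 is exact in a double, so a cent count stays exact up to roughly ninety trillion dollars.

    >>> 2 ** 53              # all integers up to here are exactly representable in a double
    9007199254740992
    >>> 2 ** 53 / 100        # in dollars, if the integer counts cents
    90071992547409.92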


Prices can certainly have more decimals than cents.

If you just store cents you can't represent them. You either have to guess at the beginning the smallest unit or store the precision along with it.

Just use strings, it's much simpler.


Depends on the language. On the JVM you are fine. With Javascript, doing math on big numbers is probably going to end in tears unless you know what you are doing. Either way, have some tests for this and make sure your code is doing what you expect.

Encoding numbers as string because you are using a language and parser that can't deal with numbers properly (even 64 bit doubles), is a bit of a hack. Basically the rest of the world giving up because Javascript can't get its shit together is not a great plan.


Accounting for the lowest common denominator that has a huge share in it is always a great plan. Every trading platform out there uses "+-ddd.ddd" format, even binary-born protocols completely unrelated to js used it since forever.


Yeah I disagree.

As the article said

> RFC 8259 raises the important point that ultimately implementations decide what a JSON number is.

Any implementation dealing with money or monetary rates should know that it needs to deal with precision and act accordingly. If you want to use JavaScript to work with money, you need to get a library that allows you to represent high precision numbers. It's not unreasonable to also expect that you get a JSON parsing library that supports the same.

oh, TIL that you can support large numbers with the default JavaScript JSON library https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


The only problem with this attitude is that JSON APIs are meant to be interoperable, and as the OP showed, you can't rely on the systems you interoperate with to uniformly have the same understanding of JSON numbers, and misinterpreting numbers because of system incompatibilities will cause some really bad headaches that are totally avoidable by just forcing everyone to interop in terms of decimal numbers encoded as strings.


I tend to end up encoding everything as an integer (multiply by 1000, 10000 etc) and then turn it back into a float/decimal on decode. For instance if I am building a system dealing with dollar amounts I will store cent amounts everywhere, communicate cent amounts over the wire, etc. then treat it as a presentation concern to render it as a dollar amount.


It's worth bearing in mind when you do that that the largest integer that is "generally safe" in JSON is 2^53-1, so if you scale by a factor of 10000 you're taking 13-14 more bits off that maximum. That leaves you about 2^40, or about a trillion, before you may start losing precision or seeing systems disagree about the decoded values. Whether that's a problem depends on your domain.


For money, that's a sane setup.

But do note, that in currency, there are multiple, actively used currencies that have zero, three, five (rare) or even eight (BTC) decimals. That some decimals cannot be divided by all numbers (e.g. only 0.5)

Point being: floats are dangerously naive for currency. But integers are naive too. You'll most probably want a "currency" or "money" type. Some Value Object, or even Domain Model.

XML offered all this, but in JSON there's little to convey this, other than some nested "object" with at least the decimal amt (as int), and the ISO4217 currency. And maybe -depending on how HATEOAS you wanna be- a formatting string to be used in locales, a rule on divisibility and/or how many decimal places your int or decimal might be.

(FWIW, I built backends for financial systems and apps. It gets worse than this if you do math on the currencies. Some legislations or bookkeeping rules state that a calculation uses more or fewer decimals. E.g. that ($10/3)*3 == $10 vs == $9.99, or that $0.03/2 == $0.01 + $0.02, e.g. when splitting a bill. This stuff is complex, but real domain logic)


When I say dangerously naive, I mean in a way that people can go to jail¹ for "losing" or "inventing" cents. Which your software will do if you use floats.

¹IANAL. But this was told when legal people looked at our architecture.


Your software will still "lose" cents if you use integers, for operations such as dividing a bill (e.g. divide by 3), or applying 3% APR in monthly increments.

The goal is not to avoid rounding errors (which would be quite difficult when the true account value can be an irrational number, as with 3% APR compounding monthly), but to have the exact same rounding errors that are prescribed by the accounting practices. Which may vary depending on legislation.

A decimal floating point is usually a better starting point than integers are.


> for operations such as dividing a bill (e.g. divide by 3), or applying 3% APR in monthly increments.

Which is why passing around ints is not the solution. And why I specifically mention Domain Models and/or Value Object.

A domain model would throw an exception or otherwise disallow certain divisions, for example. What I often do is something like `expense.amount.divide_over(3, leftover_to_last)` or `savings.balance_at(today).percentage_of(3.1337)`.

Sometimes, in simpler setups and when the language allows, I'll re-implement operators like *, / and even + and -. But when actual business logic is needed, I'll avoid these and implement actual domain methods that use the language the business uses.

But never, ever, do I allow just math-ing over the inner values.

So, I disagree: Both decimal floating point and integers are just as "bad". Maybe for the inner values in the domain model or value object, they are fine, but often there integers are a slightly better starting point because they make rounding and leftovers very explicit.


The problem with that (which I have seen in practice) is that you are essentially hard coding the maximum precision you will accept for every client that needs to interpret your JSON.

For example, you say you store monetary amounts as cents. What if you needed to store US gas prices, which are normally priced in amounts ending in 9/10ths of a cent? If you want to keep your values as integers you need to change your precision, which will likely mess up a lot of your code.


and different currencies have different default precisions. So if you're dealing with multiple currencies, now you need both client and server to have a map of all currency precisions for formatting purposes that they agree on.

What's worse is that these things can also change over time and there is sometimes disagreement over what the canonical value is.

E.g. ISO 4217 (used by Safari, Firefox and NodeJS) will say that the Indonesian Rupiah (IDR) uses 2 decimal digits, while Unicode CLDR (used by Chrome) will say that they use 0 decimal digits. The former is the more "legalistic" definition, while the latter matches how people use the currency in reality.

This is not a real issue if you transfer amounts as decimal strings and then pass those to the Intl API for formatting (the formatting will just be different but still correct), but it's catastrophic if you use scaled-up integers (all amounts will be off by magnitudes).

For this reason I would always store currency amounts in an appropriate DECIMAL type in the DB and send currency amounts as strings over the wire.


This is a good point.

It's not widely known, but US gasoline prices are actually in a defined currency unit, the mill (https://en.m.wikipedia.org/wiki/Mill_(currency)).

For most purposes, using mills as the base unit would be sufficient resolution.


So basically you use fixed-point numbers. Especially for currency that's a very good idea anyway, because of rounding errors, even more so in IEEE 754.


Pedantically, IEEE 754 defines decimal floating point formats (like decimal128) which are appropriate for representing currency. Representing currency in non-integer values in any of the binary floating point formats is indeed a recipe for disaster though.


I have tried to encode all non-trivial numbers as strings. If it's too big (or small), or if it's a float, I'll have to change my JSON schema. Bake the need to decode numbers into the transforms for consistency.


This is great as long as you always make clear which values are pre- and which are post-encoding. I remember one of my first production bugs was giving users 100 times the credit they actually bought. Oops.


Makes sense for dollars, but for anything like graphics or physics I'd consider a power of two like 1,024 as the fixed-point factor instead.

My intuition tells me that "x * 1000 / 1000 == x" might not be true for all numbers if you're using floats.
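
That intuition is easy to test mechanically; here's a small Python sketch (hypothetical, not from the thread) that searches for a counterexample:

    # brute-force search for a value whose round trip x * 1000 / 1000 differs from x
    for i in range(1, 10_000_000):
        x = i / 1_000_000
        if x * 1000 / 1000 != x:
            print("counterexample:", x)
            break
    else:
        print("no counterexample in this range")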


A sure sign of an inexperienced programmer in numerical computing is when they check for equality to zero of a floating-point number as

if (x == 0) ...

instead of something like

if (abs(x) < eps) ...

where eps is a suitably defined small number.


Sometimes it is fine. For example, reference BLAS will check if the input scalars in DGEMM are exactly zero, for

    C <- alpha*AB + beta*C 
If beta is exactly 0, you don’t have to read C, just write to it.

The key here is that beta is likely to be an exact value that is entered as a constant, and detecting it allows for a worthwhile optimization.


I would guess even most of the time people using epsilon don't understand it. It's not like there is a universal constant error with floating-point numbers. I feel that saying "just use epsilon" is not much better than x == 0, and could make bugs harder to find if it sometimes works and other times does not.


Funnily enough, I think a sure sign of an inexperienced programmer in bigco application programming is the other way around: they wrongly learn a mental model of "floating point is approximate, never ever do ==" in school.


I often store it as smaller than cents, because anything with division or a basket of summed parts with taxes can start to get funky if you round down (and some places have laws about that.)


My opinion is that a safe approach is to use either 53-bit integers or 64-bit floating-point numbers to keep JavaScript compatibility. JavaScript is too important and, at the same time, the errors are too terrifying (JS will silently round to the nearest representable double, which could lead to various exploits) to skip on that. If you need anything else, just use strings.


I think the description for Go is inaccurate/incomplete. You can call this function to instruct the decoder to leave numbers in unparsed string form:

https://pkg.go.dev/encoding/json#Decoder.UseNumber

That allows you to capture/forward numbers without any loss of precision.


I have added this note, thanks! In the blog I am mostly trying to show the behavior you get using the (maybe defacto) stdlib with its default configuration, but this is useful data to call out.


If you're going to extend Go the courtesy of customizing the parser, oughtn't you do the same for Python (and all the languages)?

To wit, Python's json module has `parse_float` and `parse_int` hooks:

https://docs.python.org/3/library/json.html#encoders-and-dec...

Example:

  >>> json.loads('{"int":12345,"float":123.45}', parse_int=str, parse_float=str)
  {'int': '12345', 'float': '123.45'}
FWIW, when I've cared about interop and controlled the schema, I've specified JSON strings for numbers, along with the range, precision, and representation. This is no worse (nor better) than using RFC 3339 for dates.


I'm just a JS guy trying to understand the world around me and documenting what I find, not trying to be discourteous (or even courteous). I'll add the note about Python, thanks for calling it out. FWIW JS does not have a similar capability so I can't add a note there.


> FWIW JS does not have a similar capability so I can't add a note there

This example on MDN seems to indicate that you can, am I misunderstanding it?

  const bigJSON = '{"gross_gdp": 12345678901234567890}';
  const bigObj = JSON.parse(bigJSON, (key, value, context) => {
    if (key === "gross_gdp") {
      // Ignore the value because it has already lost precision
      return BigInt(context.source);
    }
    return value;
  });
[0]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


> am I misunderstanding it?

The optional `context` parameter is a tc39 proposal. The feature compatibility matrix on the bottom of the MDN page is really confusing because it's showing only when `JSON.parse` was added, not whether the optional `context` parameter is supported.

I've confirmed it's available in:

  - V8 11.4.31
  - Node 20.12.0 (with `--harmony`)
  - Node 21.7.1 (without requiring `--harmony`)
  - Chrome 123.0.6312.107
But not available in:

  Firefox 124.0.2 (ironically) 
  Safari 17.3.1
The original blog post linked to the proposal:

https://tc39.es/proposal-json-parse-with-source/

https://github.com/tc39/proposal-json-parse-with-source

This issue links to the various browser engine tracking bugs:

https://github.com/tc39/proposal-json-parse-with-source/issu...

Which are:

• Chrome/V8: https://bugs.chromium.org/p/v8/issues/detail?id=12955

• Firefox/SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1658310

• Safari/JavaScriptCore: https://bugs.webkit.org/show_bug.cgi?id=248031


The MDN PR documenting the optional `context` parameter was merged just 2 weeks ago:

https://github.com/mdn/content/pull/32697/files


In JS, it's a good idea anyway to use some JSON parsing library instead of JSON.parse.

With Zod, you can use z.bigint() parser. If you take the "parse any JSON" snippet https://zod.dev/?id=json-type and change z.number() to z.bigint(), it should do what you are looking for.


Fair enough! Thank you for the writeup.


Try to get a DECIMAL value out of a Postgres database into a JSON API response and you’ll learn all this and more in the most painful way possible!


Other values one could test for:

- “+1” (not a valid number, according to ECMA-404 and RFC-8259)

- “+0” (also not a valid number, but trickier than “+1” because IEEE floats have “+0” and “-0”)

- “070” (not a valid number, but may get parsed as octal 56)

- “1.” (not a valid number in json)

- “.1” (not a valid number in json)

- “0E-0” (a valid number in json)

There probably are others.


> I-JSON messages SHOULD NOT include numbers that express greater magnitude or precision than an IEEE 754 double precision number provides

I'm confused by this.

What is the precision of 0.1, relative to IEEE 754?

If I read it correctly, that statement is saying:

  json_number_precision(json_number) <= ieee_754_precision
^ How do I calculate these values?


I think the spec just means, assume IEEE 754. In the case of 0.1, which cannot be represented exactly, software should assume that `0.1` will be represented as `0.100000000000000005551115123126`. Depending on `0.1` being parsed as the exact value `0.1` is not widely interoperable.
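
For reference, a quick Python check shows the exact value of the double nearest to 0.1, which is what the spec's wording is alluding to:

    >>> from decimal import Decimal
    >>> Decimal(0.1)   # the exact value of the double closest to 0.1
    Decimal('0.1000000000000000055511151231257827021181583404541015625')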


Relatedly, what about integers like 9007199254740995. Is that a legal integer since it rounds to 9007199254740996?

It does seem unclear what it means to exceed precision (given rounding is such an expected part of the way we use these numbers). Magnitude feels easier as at least you definitely run out of bits in the exponent.


I think the spec is saying that it is the message that should not express greater magnitude or precision, not 'the number'.

So including the string "0.1" in a message is fine because v = 0.1 implies 0.05 < v < 0.15, but including 0.100000000000000000000000000000000000 would not be.


The first thing I check before using a JSON parser library is whether it lets me get the number as a string and do my own conversion. Libraries that insist on treating the number as a double, or that bring in a large bigint/decimal library, usually get a pass from me.


If you need a specific exotic JSON parser to parse the numbers you have correctly, I would argue that you should serialise them as strings and not as numbers.

That's what prometheus is doing for example. https://prometheus.io/docs/prometheus/latest/querying/api/


That only works if you are the one who serializes the json in the first place.


When I wrote my jsonptr tool a few years ago, I noticed that some JSON libraries (in both C++ and Rust) don't even do "parse a string of decimal digits as a float64" properly. I don't mean that in the "0.3 isn't exactly representable; have 0.30000000000000004 instead" sense.

I mean that rapidjson (C++) parsed the string "0.99999999999999999" as the number 1.0000000000000003. Apart from just looking weird, it's a different float64 bit-pattern: 0x3FF0000000000000 vs 0x3FF0000000000001.

Similarly, serde-json (Rust) parsed "122.416294033786585" as 122.4162940337866. This isn't as obvious a difference, but the bit-patterns differ by one: 0x405E9AA48FBB2888 vs 0x405E9AA48FBB2889. Serde-json does have an "float_roundtrip" feature flag, but it's opt-in, not enabled by default.

For details, look for "rapidjson issue #1773" and "serde_json issue #707" at https://nigeltao.github.io/blog/2020/jsonptr.html
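
For comparison, a correctly rounded parse of the first string gives exactly 1.0 (a Python sketch; Python's float parses with correct rounding):

    import struct

    x = float("0.99999999999999999")
    print(x)                                                  # 1.0
    print(hex(struct.unpack("<Q", struct.pack("<d", x))[0]))  # 0x3ff0000000000000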


Oh wow. So serde_json doesn't roundtrip floats by default, it uses some imprecise faster algorithm https://github.com/serde-rs/json/issues/707

Good thing there's msgpack I guess.


This requires multiple precision to do properly and isn't useful most of the time. It's odd to describe this as "not properly". You might say "with exact rounding", but that makes it clearer that this isn't that useful a feature, especially since we usually expect floats to be inexact in the first place.


With JSON, there's essentially no such thing as "properly" when it comes to parsing numbers, since the spec doesn't limit the ability of the implementation to constrain width and precision. It only says that float64 is common and therefore "good interoperability can be achieved by implementations that expect no more precision or range than these provide", but note the complete absence of any guarantees in that wording.

The only sane thing with JSON is to avoid numbers altogether and just use decimal-encoded strings. This forces the person parsing it on the other end to at least look up the actual limits defined by your schema.


Rounding by more than an ULP is pretty bad. I don't think it's odd at all to describe rapidjson's behavior as improper.

At least 122.416294033786585 is between ...888 and ...889, though it's much closer to the former.


I think the thing folk miss is when there’s an error like divide by zero, or the calculation would return NaN. I feel like this is the main gap/concern with using JSON and it seems to be rarely discussed.


Agreed, this can be a pain. Python by default serializes and deserializes the `NaN` literal, making you pay some cleanup cost once you need to interop with other systems. (Same for `Inf`.)

Say what you want about NaN, but IEEE 754 is the de facto way of dealing with floating point in computers, and even if NaNs and Infs are a bit "fringe", it's unfortunate that the most popular serialization format cannot represent them.


There are so many things that are poorly thought out or underspecified in JSON, it's amazing that it got so widely adopted for interop. No wonder that it became a perpetual source of serialization bugs.


Especially annoying given that they could have been easily adopted. Infinity could've been encoded as `1/0` (among most other possibilities). NaN could've been encoded as `0/0` (again, among most other possibilities). JSON doesn't allow all possible JavaScript literals anyway, so these encodings might have been worked if they were somehow standardized.


I like to think of floating-point values as noisy analog voltages, with the extra property that they can store small integers perfectly, and that they can be copied within code but not round-trip serialized and deserialized without noise.

They're not really noisy, but if an application would work with some random noise added, it will probably work with floats; and if it wouldn't work with noise added, it's probably easier to just not use floats than to expect people to reason about IEEE details while risking subtle bugs if different float representations get mixed.

Of course I'm not doing a lot of high performance algorithms, I would imagine in some applications you really do need to reason about floats.


The font choice for inline text is so distracting

This "Averia Serif Libre" is unreadable for me.


I still get a laugh of ecma 404. The first time I looked it up I refreshed the page a large number of times before I realized it wasn't an error.


I think a good decision regarding numbers in an API (as was made in my project) is to put meaningful decimal numbers into strings and let them be handled by an exact decimal calculation framework, e.g. BigDecimal in Java etc.


Does anyone have any idea why Crockford decided that at least one digit is required after the decimal point, as opposed to JavaScript which has zero or more?


It's weird that any parser that loses digits is tolerated. A parser that forces strings into uppercase US-ASCII never would be.


It's tolerated because the JSON spec explicitly allows it:

   This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
   available and widely used, good interoperability can be achieved by
   implementations that expect no more precision or range than these
   provide, in the sense that implementations will approximate JSON
   numbers within the expected precision.  A JSON number such as 1E400
   or 3.141592653589793238462643383279 may indicate potential
   interoperability problems, since it suggests that the software that
   created it expects receiving software to have greater capabilities
   for numeric magnitude and precision than is widely available.

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.
And yes, this is completely insane for a format that is supposed to be specifically for serialization and interop. Needless to say, the industry has enthusiastically adopted it to the point where it became the standard.

I miss XML these days. Sure, it was verbose and had a bunch of different and probably excessive numeric types defined for XML Schema... but at least they were well-defined (https://www.w3.org/TR/xmlschema-2/#built-in-datatypes). And, on the other hand, without a schema, all you had were strings. Either way, no mismatched expectations.


That's true for every floating point number in every programming language you have ever used, though.

    $ python3
    Python 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 100000.000000000017
    100000.00000000001


This is why Decimal exists:

  Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
  [GCC 9.4.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from decimal import Decimal
  >>> Decimal('100000.000000000017')
  Decimal('100000.000000000017')
For example:

  >>> import json
  >>> json.loads('{"a": 100000.000000000017}')
  {'a': 100000.00000000001}
  >>> json.loads('{"a": 100000.000000000017}', parse_float=Decimal)
  {'a': Decimal('100000.000000000017')}


And not every programming language offers a Decimal type, and on most of those that do, there's usually a performance penalty associated with it, not to mention issues of interoperability and developer knowledge of its existence. For financial calculations, using integers with an implicit decimal offset (e.g., US currency amounts being expressed in cents rather than dollars) is usually sufficient, while other contexts will often determine that the inherent inaccuracy of IEEE floating types is a non-issue. The biggest potential problem lies in treating values that act kind of like numbers and look like numbers as numbers, e.g., Dewey Decimal classification numbers or the topic in a Library of Congress classification.¹

1. This is a bit on my mind lately as I discovered that LibraryThing’s sort by LoC classification seems to be broken so I exported my library (discovering that they export as ISO8859-1 with no option for UTF-8) and wrote a custom sorter for LOC classification codes for use in finally arranging the books on my shelves after my move last year.


Decimal is not arbitrary precision, though. It has many of the same issues, you'll just see them in different places.

  >>> Decimal('100000.00000000000000000000017') + Decimal('1')
  Decimal('100001.0000000000000000000002')


but serializing/deserializing decimal using the json module is futile


Why is it futile? It can be serialized/deserialized perfectly through its string representation.
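
A minimal round trip through the string representation (a sketch; the value travels as a JSON string rather than a bare number):

    import json
    from decimal import Decimal

    amount = Decimal("100000.000000000017")

    payload = json.dumps({"a": str(amount)})     # '{"a": "100000.000000000017"}'
    restored = Decimal(json.loads(payload)["a"])
    assert restored == amount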


> That's true for every floating point number in every programming language you have ever used, though.

Alright, if "you" have only ever used python. In C, for example, we have hexadecimal floating point literals that represent all floats and doubles exactly (including infinities and nans that make the json parser fail miserably).


If you use the same syntax as OP, C’s parser will also round that literal. The existence of a hex literal for floats is something orthogonal


> we have hexadecimal floating point literals that represent all floats and doubles exactly

How do you do that?

A couple of resources I found but which I’m not sure if are about exactly what you speak of

https://stackoverflow.com/questions/65480947/is-ieee-754-rep...

https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html

Furthermore, what exactly do you mean by “all floats and doubles exactly”?


Yes, I was talking about what is described in your resources. You can do this:

    // define a floating-point literal in hex and print it in decimal
    float x = 0x1p-8;          // x = 1.0/256
    printf("x = %g\n", x);     // prints 0.00390625
    
    // define a floating point literal in decimal and print it in various ways
    float y = 0.3;             // non-representable, rounded to closest float
    printf("y = %g\n", y);     // 0.3 (the %g format does some heuristics)
    printf("y = %.10f\n", y);  // 0.3000000119
    printf("y = %.20f\n", y);  // 0.30000001192092895508
    printf("y = %a\n", f);     // 0x1.333334p-2


So for example if you make a variable that has the value parent commenter used

100000.000000000017

And then you print it.

Does it preserve the exact value?


Your question is ambiguous for two different reasons. First, this value is not representable as a floating-point number, so there's no way that you can even store it in a float. Second, once you have a float variable, you can print it in many different ways. So, the answer to your question is, irremediably, "it depends what you mean by exact value".

If you print your variable with the %a format, then YES, the exact value is preserved and there is no loss of information. The problem is that the literal that you wrote cannot be represented exactly. But this is hardly a fault of the floats. Ints have exactly the same problem:

    int x = 2.5;   // x gets the value 2
    int y = 7/3;   // same thing


So in other words, is it fair to say that this situation is not much different from what you get with Python?


https://0.30000000000000004.com/

Although it would be good to move in the direction of using a BigDecimal equivalent by default when ingesting unknown data.


JSON is a notation. It's syntax. The semantics are left up to the implementation. The question has no answer.


That font >.<


It does something to my eyes


It’s missing Swift tests, but otherwise it’s a great post.


If you would like to contribute Swift tests, I would be happy to take it! You can send a PR into this document, updating the data tables and adding a code sample at the end: https://github.com/bterlson/blog/blob/main/content/blog/what.... No need to test openapi-tools swift codegen unless you really want to!


I’m having a lot on my plate currently, but I’m adding this to my TODO list!


Since JSON is so widely used it should be modified to support more types - Mongo DB's Extended JSON supports all the BSON (Binary) types:

    Array
    Binary
    Date
    Decimal128
    Document
    Double
    Int32
    Int64
    MaxKey
    MinKey
    ObjectId
    Regular Expression
    Timestamp
https://www.mongodb.com/docs/manual/reference/mongodb-extend...


JS is likely to get a hook to be able to handle serialization/deserialization of such values without swapping out the entire implementation[1]. Native support for these types, without additional code or configuration, would likely break the Internet badly, so is unlikely to happen unfortunately.

1: https://github.com/tc39/proposal-json-parse-with-source


Much more valuable than any such extension would be a way to annotate types and byte lengths of keys and values so that parsers could work more efficiently. I’ve spent a lot of time making a fast JSON parser in Java and the thing that makes it so hard is you don’t know how many bytes anything is, or what type. It’s hard to do better than naive byte-by-byte parsing.


If you control the underlying data, I must recommend Amazon Ion! Its text format is a strict superset of JSON, but they also maintain a binary format that will round-trip data and is designed for efficient scanning. There are even prefixed annotations if you want them :)

It also specs proper decimal values, mitigating the issues presented in the OP.

https://amazon-ion.github.io/ion-docs/


JSON is not the place to be so fussy about number widths, and things like MaxKey and 24-hex-value ObjectId would be ridiculous.


Or maaaybe use XML for such cases


There are no numbers in JSON. There are only strings.


"ID numbers start from 2^53 and are allocated sequentially including odd numbers that are not compatible with "double" types. Please ensure you are reading this value as a 64-bit integer."


A little off topic, but fun to see that someone else has adopted that magical CSS theme! (https://css.winterveil.net/)



