This paper is one of the most interesting published in the last couple of years IMO. I remain surprised at how many people have overlooked it. I'm not sure why this blog post makes no mention of the work; in the past the Cockroach folks have been quite explicit about crediting the research documents they're drawing from.
Second, Cloudera has a patent on this work: http://www.freepatentsonline.com/y2015/0156262.html
This blog post doesn't mention HLC because the explanation didn't require it. HLC boils down to a mechanism for taking the maximum physical wall time seen across two or more nodes, while still providing monotonically increasing time by incrementing the logical component of the hybrid logical timestamp. In the blog post, this is referred to as "taking the maximum timestamp across requests". A segue into HLC would have served only to further complicate the explanation.
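For anyone who hasn't seen HLC before, here's a toy sketch of that mechanism in Go. It's my own simplification for illustration, not CockroachDB's implementation; the `Timestamp`/`Clock` names are made up.

```go
package hlc

import (
	"sync"
	"time"
)

// Timestamp is a hybrid logical timestamp: physical wall time plus a
// logical counter used to break ties and preserve monotonicity.
type Timestamp struct {
	WallTime int64 // nanoseconds since the Unix epoch
	Logical  int32
}

// Clock hands out monotonically increasing hybrid timestamps.
type Clock struct {
	mu   sync.Mutex
	last Timestamp
}

// Now returns a timestamp that is >= every timestamp previously returned
// or observed via Update.
func (c *Clock) Now() Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	wall := time.Now().UnixNano()
	if wall > c.last.WallTime {
		c.last = Timestamp{WallTime: wall}
	} else {
		c.last.Logical++ // physical clock hasn't advanced; bump the logical part
	}
	return c.last
}

// Update folds in a timestamp observed on an incoming request, taking the
// maximum so that causally related events never appear to go backwards.
func (c *Clock) Update(remote Timestamp) Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	wall := time.Now().UnixNano()
	switch {
	case wall > c.last.WallTime && wall > remote.WallTime:
		c.last = Timestamp{WallTime: wall}
	case remote.WallTime > c.last.WallTime:
		c.last = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.last.WallTime > remote.WallTime:
		c.last.Logical++
	default: // equal wall times: take the larger logical value and tick once
		if remote.Logical > c.last.Logical {
			c.last.Logical = remote.Logical
		}
		c.last.Logical++
	}
	return c.last
}
```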
The work done by Sandeep Kulkarni, Murat Demirbas, David Alves, Todd Lipcon, Vijay Garg, and others was instrumental to our early design efforts. I am dismayed to see there's a patent pending, though both our design and open source implementation predate the patent filing.
FYI on the date issue, lest anyone think we filed the patent trying to steal the work done by others, the patent application says:
"This application claims to the benefit of U.S. Provisional Patent Application No. 61/911,720, entitled “HYBRIDTIME and HYBRIDCLOCKS FOR CLOCK UNCERTAINTY REDUCTION IN A DISTRIBUTED COMPUTING ENVIRONMENT”, which was filed on Dec. 4, 2013, which is incorporated by reference herein in its entirety."
(which predates the creation of the cockroachdb repo and the hybrid logical clock paper).
I know I'm dipping my toe in some history here, but is there a sense of how the patent situation is going to shake out? I think this general family of algorithms is very important.
Agreed -- personally I'm against offensive use of patents like this as well, and it's my understanding that Cloudera doesn't intend to use this patent offensively. If it did, I would be upset and would consider leaving the company - I know many other employees feel the same way. The reason I agree to help write patent applications as an engineer is that I've seen the distraction and damages caused by patent trolls (or even other companies) and the importance of having a defensive portfolio.
Disclaimer: Obviously I'm not speaking for the company or making any promises here :)
I don't understand why 7ms is considered a good bound for atomic clocks?
Hafele and Keating Experiment:
"During October, 1971, four cesium atomic beam clocks were flown on regularly scheduled commercial jet flights around the world twice, once eastward and once westward, to test Einstein's theory of relativity with macroscopic clocks. From the actual flight paths of each trip, the theory predicted that the flying clocks, compared with reference clocks at the U.S. Naval Observatory, should have lost 40+/-23 nanoseconds during the eastward trip and should have gained 275+/-21 nanoseconds during the westward trip ... Relative to the atomic time scale of the U.S. Naval Observatory, the flying clocks lost 59+/-10 nanoseconds during the eastward trip and gained 273+/-7 nanosecond during the westward trip, where the errors are the corresponding standard deviations. These results provide an unambiguous empirical resolution of the famous clock "paradox" with macroscopic clocks."
So if we are losing nanoseconds per day, couldn't we fly clocks around the datacenters and resync every month? 7ms seems beatable, and not a terrible operational overhead for a fast globally consistent database.
In this context, 7ms refers to the size of the TrueTime uncertainty interval, which is different from the typical instantaneous clock error. TrueTime returns an interval that is guaranteed to contain the absolute idealized physical "now." So this interval represents a worst case boundary, and if a system using TrueTime is able to separate events by the current interval, for example by forcing a committer to wait it out, then it can provide a temporal order that reflects the absolute real world physical order, despite any clock uncertainty. It's a bit different to think about than how we typically work with time, but it's a powerful primitive in a globally distributed system.
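Here's a toy sketch of how a system can use such an interval to do that wait (my own simplification, not Spanner's actual commit-wait code; the `now()` helper just pads the local clock with an assumed worst-case 7ms of uncertainty on either side):

```go
package truetime

import "time"

// Interval is a pair of bounds guaranteed to contain the "true" current time.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// now would be backed by a TrueTime-like service; here it just pads the local
// clock with an assumed worst-case uncertainty.
func now() Interval {
	const uncertainty = 7 * time.Millisecond
	t := time.Now()
	return Interval{Earliest: t.Add(-uncertainty), Latest: t.Add(uncertainty)}
}

// commitWait picks a commit timestamp and then blocks until that timestamp is
// guaranteed to be in the past everywhere, i.e. until now().Earliest is past it.
// After the wait, any later transaction anywhere observes a strictly larger time.
func commitWait() time.Time {
	commitTS := now().Latest
	for !now().Earliest.After(commitTS) {
		time.Sleep(time.Millisecond)
	}
	return commitTS
}
```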
Indeed. The machines that do the actual work will still use an inaccurate clock and get periodic corrective updates. That being said, you should be able to get much higher precision using PTP and some decent network cards. There are plenty of factors that come into play (network latency, congestion, etc.), but 7 ms sets the bar low.
I always thought it would be an interesting physics exam question to ask: "If a clock were landed on Halley's comet, and retrieved on its next orbit, what would be the expected difference in time relative to an earth-bound clock?" It's tricky because all orbiting bodies experience a range of velocities as they orbit their star. But I gather Halley's comet has a particularly eccentric orbit (0.9), with a perihelion speed of around 70 km/s, an aphelion speed of about 1 km/s, and about 75 years for a single orbit (ΔT' = ΔT / sqrt(1 - (v²/c²))).
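Just for a rough sense of magnitude, here's the special-relativity factor at those two speeds (ignoring gravity, and of course the comet only briefly moves that fast near perihelion, so these are not the answer to the exam question):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const c = 299792458.0             // speed of light, m/s
	const orbit = 86400 * 365.25 * 75 // ~75-year orbital period, in seconds
	for _, v := range []float64{70e3, 1e3} { // rough perihelion and aphelion speeds, m/s
		gamma := 1 / math.Sqrt(1-(v*v)/(c*c))
		// (gamma - 1) is the fractional rate difference; scale it to a full
		// orbit as if the comet held this speed the whole time.
		fmt.Printf("v = %2.0f km/s: gamma-1 ≈ %.1e (~%.2f s over 75 years at constant speed)\n",
			v/1e3, gamma-1, (gamma-1)*orbit)
	}
}
```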
Best method I can think of is to measure the proper time of both the trajectory of the earth and the trajectory of Halley's comet, and compare, but with the combined effects of gravity and a changing speed that could be quite challenging.
It's even worse if you insist on using geodesics instead of elliptical orbits, or if you decide to include the rotation of the earth in your calculations.
Absolute velocity does not affect time, simply because there is no such thing as absolute velocity. Relative velocity, though, does. For instance, if you look at a clock on a GPS satellite, moving fast relative to you, you can see it run slower than the one on your wrist. Similarly, someone stationary on the surface of the sun would see us here on earth moving in slow motion.
However, someone moving on the satellite would see the clock on the satellite run normally, and us on earth moving in slow motion, because they are in the same frame of reference as the satellite.
So no, time does not operate differently on different parts of the universe per se. It all depends on how it's moving relative to the observer.
> For instance, if you look at a clock on a GPS satellite, moving fast relative to you, you can see it run slower than the one on your wrist.
Actually, the general relativity effects of weaker gravitational field dominate the special relativity effects of velocity[1]. So the GPS satellite clock actually runs faster, not slower.
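Back-of-the-envelope, using the standard weak-field approximations and rough figures for the GPS orbit (radius ~26,560 km, speed ~3.87 km/s), and ignoring the earth's rotation:

```go
package main

import "fmt"

func main() {
	const (
		c   = 299792458.0 // speed of light, m/s
		GM  = 3.986004e14 // Earth's gravitational parameter, m^3/s^2
		rE  = 6.371e6     // Earth's mean radius, m
		rS  = 2.656e7     // GPS orbital radius, m (~20,200 km altitude)
		v   = 3.874e3     // GPS orbital speed, m/s
		day = 86400.0
	)
	// Gravitational (GR) rate difference: clocks higher in the potential run faster.
	gr := GM * (1/rE - 1/rS) / (c * c)
	// Velocity (SR) rate difference: the moving satellite clock runs slower.
	sr := v * v / (2 * c * c)
	fmt.Printf("GR: +%.1f us/day, SR: -%.1f us/day, net: +%.1f us/day\n",
		gr*day*1e6, sr*day*1e6, (gr-sr)*day*1e6)
}
```

That works out to roughly +46 microseconds/day from the weaker gravity and -7 microseconds/day from the orbital speed, i.e. the net ~38 microseconds/day figure quoted elsewhere in this thread.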
There's a really neat thing about this: the relationship balances out at a certain orbital height before it flips over, so there's an orbit where you're chronologically in sync with the ground.
That's only true with respect to a given locus of points on the ground, not the full surface. I.e., the relative velocity of the SV isn't the same for all points that may be measuring it.
In practice the GR effect is compensated for in the satellite at manufacture, and the SR effect is handled in the receiver - for just that reason.
The velocity-time relationship is explained by Special Relativity (SR).
The Sun example is a bit more complicated, because there's the velocity-time effect due to SR, but also the fact that you're deep in a big gravity well, which has effects on your measured time due to General Relativity (GR).
SR says that "moving clocks run slow", but deep in the Sun's gravity well, we should also be running slow relative to the Earth. Not sure what the relative size of the two effects is.
EDIT> I assume the SR effect is larger than the GR effect, simply because SR was obvious before GR.
Yes, time is slower/faster in different parts of the solar system. Atomic clocks on the GPS satellites tick 38 microseconds faster each day than clocks on the surface of the earth (which they correct for).
It does. There is no such thing as "now" unless you also include a "here". Everyone experiences a different time. Even us, who live in different points on the surface of the planet experience time differently.
This is based on my limited physics education, but I believe there are two things at play:
1) Velocity doesn't affect time, acceleration does. So if your twin flies to Mars and back, you age more quickly. But if your twin flies to Mars, and you join him after a year (following the same trajectory), you age the same. This is special relativity.
2) Acceleration due to gravity doesn't count. In fact, the opposite is true -- by standing on the ground, you are accelerating up at 32 feet per second squared. This is general relativity.
So, to answer your question. On the one hand, yes, if I am accelerating (not counting gravity), then I observe time pass more slowly. On the other hand, most of the universe is dominated by gravitational forces, so most systems wouldn't notice these effects.
The fascinating thing is that both (1) and (2) happen because of a very basic physics principle: if I can't use an experiment to tell two frames of reference apart, then the physics in the two frames of reference are identical.
For (1), I observe the same speed of light as someone moving at velocity v relative to me. Einstein used this simple axiom to derive special relativity. For (2), I can't tell the difference between free fall and being at rest (alternatively, I can't tell the difference between being on Earth and being in an elevator in space). This is because the inertial mass equals gravitational mass, a "coincidence" that dates back to Newton's law of gravitation. From this (and a lot of math) Einstein derived general relativity. Beautiful!
That's incorrect as well. You are mixing two relativistic effects: from the velocity of something relative to an observer, and from the gravity on that spot of the universe (which is what I assume you mean by acceleration).
Also wrong: as you dig deeper into a planet you experience less gravity, not more.
Velocity and acceleration are different things, and talking about less or more gravity is a poor descriptor.
At the center of the moon you would 'float', i.e. experience no acceleration relative to the moon. You would still be orbiting the earth, sun, etc.
However, by being at the bottom of a gravity well, you get time dilation relative to someone in the same orbit on the other side of the earth.
However, it's important to note that LEO means high orbital velocity, which counters being higher in the gravity field. Similarly, standing on the surface of the earth you have time dilation from the earth's rotation which you would not have at the center.
PS: One way to think about it is that at the center, spacetime is pulled by all the mass around you, which is a stronger pull than standing beside a planet. However, the pull is in balance, like a tug-of-war game nobody is winning.
These are conservative upper bounds and are meant to encompass long tail offsets. You can usually get NTP down to < 10ms offsets, but if you rely on that, you'll likely run into problems which end up violating your database guarantees.
Nice writeup. I disagree that we need chip-scale atomic clocks. My idea was a dedicated, battery-backed piece of hardware that reliably stored time plus could sync other machines. Plugs into an interconnect with ultra-low latency. One for each datacenter.
You can plug it into each machine in the cluster periodically to sync them. Or you can plug it into a master node that connects over a low-latency management interface separate from the main data line. Occasionally, the time server gets exclusive access to that line, assesses latency, and then syncs its time. The time server might be custom-built to avoid its own skew, or keep one of the timekeeping devices attached. Those are periodically shipped to a central location to resync themselves against an atomic clock or each other.
> My idea was a dedicated, battery-backed piece of hardware that reliably stored time plus could sync other machines. Plugs into an interconnect with ultra-low latency. One for each datacenter.
Google have a variant on this, where they use a GPS receiver in each data centre to provide an accurate time source for local machines.
No, they use GPS with a 7ms time spread for servers, as you said. What I'm describing is a custom device set against an atomic clock to nanosecond or microsecond accuracy, which then does the same for the servers via low-latency interconnects. Optionally with a time server & dedicated networks for reduced admin overhead.
It should do a lot better than 7ms, which has real performance implications.
That's already a thing: GPS-based NTP servers with a TCXO or Rubidium reference (just Google them). You can buy one for ~$2000 that will use GPS or cellular as a time reference, and you can add options like a temperature-controlled crystal oscillator (TCXO) or a Rubidium reference (basically, a small atomic reference).
Interesting. So, are you saying it's a start toward my idea, or exactly what I'm proposing? As in, can it currently sync the servers in multiple datacenters to the point where they could operate with microsecond spreads?
Chip-scale atomic clocks are now about $1500.[1] This just gets you a 10.0000000000 MHz oscillator; something else has to count time and provide output.
Passing around the max sequence number as a monotonic sequence indicator has a risk. A bogus sequence number near the max value can cause serious problems.
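One common mitigation, sketched here in the spirit of the toy HLC upthread (the `maxOffset` knob is my own illustrative name; I believe CockroachDB enforces something along these lines, but check the source), is to refuse remote wall times that are further ahead of the local clock than the configured offset bound:

```go
package hlc

import (
	"fmt"
	"time"
)

// checkRemote rejects a remote wall time that is further ahead of the local
// clock than the configured maximum offset, so that a single bogus timestamp
// near the max value can't ratchet every clock in the cluster forward.
func checkRemote(remoteWallTime int64, maxOffset time.Duration) error {
	local := time.Now().UnixNano()
	if remoteWallTime > local+maxOffset.Nanoseconds() {
		return fmt.Errorf("remote wall time %d is %v ahead of local clock %d; rejecting",
			remoteWallTime, time.Duration(remoteWallTime-local), local)
	}
	return nil
}
```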
When executing a transaction with a given timestamp on some node, doesn't the node have to guarantee that it will no longer accept commits with a smaller timestamp? Without that commitment, you could read a piece of data that is later updated by a transaction that occurred logically before you, breaking serializability and/or snapshot isolation.
The Spanner paper is unclear about how they deal with this, but my guess is that since they have accurately synchronized clocks, you'll never have to block long for that commitment to hold. Spanner also uses pessimistic locking, so for an R/W transaction, you can rely on locking reads to prevent the anomaly.
With cockroachdb, wouldn't this commitment imply that poorly synchronized clocks would lead to poor performance?
Cockroach enforces this guarantee on a per-key basis as opposed to for the entire node. If a key has been read at time t, it may only be subsequently written at time > t. CockroachDB accomplishes this using a timestamp cache at the leader node for a range. But this doesn't cause writes to block. Instead, the write timestamp is pushed to the most recent read + 1 logical tick.
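Roughly, that rule looks like this (a toy per-key version for illustration with made-up type names; as I understand it, the real cache works on key ranges and ages out old entries):

```go
package tscache

// Timestamp mirrors the hybrid timestamp discussed elsewhere in this thread:
// wall time plus a logical counter.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Next returns the timestamp one logical tick later.
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

// Cache remembers the latest read timestamp per key.
type Cache struct {
	lastRead map[string]Timestamp
}

func New() *Cache { return &Cache{lastRead: make(map[string]Timestamp)} }

// NoteRead records that key was read at ts.
func (c *Cache) NoteRead(key string, ts Timestamp) {
	if cur, ok := c.lastRead[key]; !ok || cur.Less(ts) {
		c.lastRead[key] = ts
	}
}

// ClampWrite returns the timestamp a write to key must use: if the proposed
// timestamp is at or below the latest read of that key, it is pushed to one
// logical tick past that read instead of blocking.
func (c *Cache) ClampWrite(key string, proposed Timestamp) Timestamp {
	if last, ok := c.lastRead[key]; ok && !last.Less(proposed) {
		return last.Next()
	}
	return proposed
}
```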
Do reads execute through raft? If not, how do you guarantee that behavior - relying on leader leases? If so, does the lease period guarantee disjointness in the timestamps a leader can process, or do you rely on the leader abdicating control via timeout?
Since it can push the write timestamp, does that mean the write transaction aborts if it was in serializable mode? Does that cause problems updating a value that is heavily read?