Logs Are Streams, Not Files (2011) (adam.herokuapp.com)
52 points by fbuilesv on July 23, 2014 | 20 comments


From the article:

> a better conceptual model is to treat logs as time-ordered streams

At scale it's probably better still to re-think logs as weakly ordered, lossy streams. One form of weak ordering is the inevitable jitter that comes with having multiple processes, threads, or machines; without some kind of global lock (which would cripple performance) it stops being possible to establish a true before/after relationship between individual log entries.

Another form of weak ordering is that it's very common for log entries to be recorded only at the end of an operation, irrespective of its duration; so a single instantaneous entry really represents a time-span of activity with all sorts of fuzzy before/after/concurrent-to relationships to other entries.

But maybe the most overlooked kind of weak ordering is one that is rarely found in logging systems but is highly desirable: log streams should ideally be processed in LIFO order. If you're building some kind of analytical or visualisation system or near-real-time processor for log data, you care most about "now". Inevitably there are processing queues and batches and so on to deal with; but practically every logging system just orders the entries by "time" and handles those queues as FIFO. If a backlog arises, you must wait for the old data to process before seeing the new. Change these queues and batching systems to LIFOs and you get really powerful behavior: recent data always takes priority, but you can still backfill historical gaps. Unix files are particularly poorly suited to this pattern, though - even though a stack is a simple data structure, it's not something you can easily emulate with a file system and command-line tools.
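To make the idea concrete, here is a minimal Python sketch of a batch queue that serves "now" first and backfills the rest (all names invented for illustration):

    import collections

    class LifoBatchQueue:
        """Newest log batches are processed first; older ones backfill later."""
        def __init__(self):
            self._stack = collections.deque()

        def push(self, batch):
            self._stack.append(batch)   # newest batch goes on top

        def pop(self):
            return self._stack.pop()    # always take the most recent batch

    q = LifoBatchQueue()
    for minute in ("09:00", "09:01", "09:02"):
        q.push(f"entries from {minute}")
    print(q.pop())  # -> "entries from 09:02", even though it arrived last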


You can have strongly-ordered logs too. The fact that nobody actually builds robust logging into their application doesn't mean it has to be that way :)

As for the order, it's really a matter of how you are getting the data [right now] and what tools you have to work with. If you're just receiving an unending network stream, you have no choice but FIFO. If you have log files you can start from the bottom and work up, but it's usually not as efficient as top-down. If you have clustered infrastructure you can process the same log(s) faster using parallel jobs. If you build log processing tools into each system from the start you can distribute the simple parts of data processing to each node. You have a lot of freedom in finding new ways to make logging more efficient.

LIFO is nice for old data you want to see now, but isn't as useful for new data being processed constantly.


You could write some simple code that seeks ahead whenever there is sufficient data between the current file position and the end, and puts the skipped gap on a queue for later processing. You need to be able to find end-of-record markers, but that's it.
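A rough Python sketch of that idea, under the assumption of newline-terminated records (names and threshold are made up):

    import os

    GAP_THRESHOLD = 1 << 20  # more than ~1 MiB behind? skip ahead

    def process(record):
        pass  # hand the record to the real-time consumer

    def read_recent(path, pos, gap_queue):
        """Read from pos; if far behind end-of-file, queue the gap for later."""
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            if size - pos > GAP_THRESHOLD:
                f.seek(size - 4096)                # jump near the end...
                f.readline()                       # ...skipping a partial record
                gap_queue.append((pos, f.tell()))  # backfill this byte range later
                pos = f.tell()
            f.seek(pos)
            for record in f:                       # newline = end-of-record marker
                process(record)
            return f.tell()                        # position for the next call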


>it stops being possible to have a true before/after relationship between individual log entries.

Timestamps solve this easily. Synchronizing the time between machines isn't hard.


> Synchronizing the time between machines isn't hard

Well, maybe "hard" can be relative to scope and scale. It took Google engineers some trial and error before settling on the concept of "smearing a leap second" across the entire day. Relying on plain-vanilla NTP sync and GPS satellites was not enough.

http://googleblog.blogspot.com/2011/09/time-technology-and-l...
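The post describes modulating their NTP servers with a cosine-shaped "lie" that adds the leap second gradually. As a quick illustration of that curve (the window length here is just an example value):

    import math

    def smear_offset(t, w):
        """Seconds of 'lie' at time t into a smear window of length w,
        following the cosine curve described in the post."""
        return (1.0 - math.cos(math.pi * t / w)) / 2.0

    w = 20 * 3600  # say, a 20-hour window
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"{frac:.0%} through window: {smear_offset(frac * w, w):.3f}s added")

By the end of the window the full extra second has been absorbed, with no step change at any single instant.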


> Timestamps solve this easily. Synchronizing the time between machines isn't hard

I can't tell if this is snark or not. Either way, it made me giggle.


> Programs that send their logs directly to a logfile lose all the power and flexibility of unix streams.

That's because if they push their data to stdout and the pipe it's connected to isn't being read, the program will block once the OS pipe buffer fills.
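You can observe the limit directly; a small Python sketch (buffer sizes vary by OS; 64 KiB is typical on Linux):

    import os

    r, w = os.pipe()            # nothing ever reads from r
    os.set_blocking(w, False)   # non-blocking, so we see the limit rather than hang
    written = 0
    try:
        while True:
            written += os.write(w, b"x" * 4096)
    except BlockingIOError:
        print(f"pipe filled after {written} bytes; a blocking writer would stall here")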

> How many programs end up re-writing log rotation, for example?

This one is because files over a certain size can break logging on certain file systems or kernels. If the program didn't rotate the files, the system would become unresponsive at worst, or the program would go down at best. Plus, if you take care of rotation and compression yourself (either directly or through a logrotate conf), you don't have to worry about filling a disk and causing an outage.
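Python's standard library is one example of rotation being re-implemented inside the program rather than left to logrotate:

    import logging
    import logging.handlers

    # Rotation handled by the program itself: roll over at 10 MiB,
    # keeping five old files (app.log.1 ... app.log.5).
    handler = logging.handlers.RotatingFileHandler(
        "app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.info("this entry may trigger a rollover")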

In short, logging is hard, because systems are managed by people. And people rarely get the logging setups right the first time.


sink/drain, not source/sink? Does anybody use "sink" to mean the place where stuff comes out (from a particular system's perspective) rather than the place where stuff goes?


In the last 'streaming' applications we wrote, some for audio processing, others for image processing, we used the source/processor/sink notion (didn't even think of using drain): source = origin of data, sink = destination for data, processor = data goes in and comes out again, so basically a sink at the input end and a source at the output end. But I don't think I ever saw "sink" used to mean a source of data.


nope


It's not like there's really a difference between the two, under *nix.


Yes, but you're using a technical definition of FILE and STREAM (pipes).[1]

However, the author was talking about a conceptual difference instead of the technical one. He's recommending that people think of logs as streams instead of files (again, "files" as a conceptual abstraction, not as a FILE* or Unix file descriptor). The shift in thinking would trigger a different approach to writing log data.

In any case, I think the Jay Kreps article on logs has much more information -- especially with distributed systems.[2]

[1] E.g., the C library's fopen() returns a FILE* pointer and can open either an actual file on disk or a transient named pipe (mkfifo).

[2]http://engineering.linkedin.com/distributed-systems/log-what...


People should educate themselves and stop thinking of files as, whatever it is that makes them think that logfiles are bad.

Do they seriously think a file is like a magical stone being dropped wherever it may be? A self-contained unit, like a bit? Well, they're wrong. Files are streams of bits.

Streams of bits can be in transition or stationary. For example, moving bits from here to there, over the network or on the hard drive: bits in transition, a collection called a file. A bunch of bits on a hard drive or in RAM is still a stream, just not one in the process of being copied or moved elsewhere.


I see what you are getting at, but my experience as one of the "mailing list" people for Fluentd (an open source log collector) has been a bit different, especially when a lot of data is written to a file rapidly. For example, when you try to tail a log file with log rotation reliably, you run into all sorts of edge cases. Such issues shouldn't exist if files and streams were truly interchangeable in the context of logging.


What issues are you experiencing with Fluentd's in_tail plugin? (It already supports log rotation.)


Sorry that my point wasn't clear: it's not so much that I experience issues with in_tail. in_tail is actually solid at detecting log rotation, etc. My point was that this is not a trivial problem, and a lot of the code inside in_tail.rb is there to detect and handle log rotation gracefully.
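For a feel of why it's non-trivial, here's a bare-bones Python sketch of the usual inode-tracking approach (not Fluentd's actual code; POSIX-only, and it ignores the window where the renamed-away file still holds unread data):

    import os

    def tail_across_rotation(path, state):
        """Follow a file across rename-style rotation by watching its inode.
        state holds {"inode": ..., "pos": ...}; returns newly read records."""
        st = os.stat(path)
        if st.st_ino != state.get("inode"):
            state["inode"], state["pos"] = st.st_ino, 0  # path points at a new file
        elif st.st_size < state["pos"]:
            state["pos"] = 0                             # truncated in place: restart
        with open(path, "rb") as f:
            f.seek(state["pos"])
            records = f.readlines()
            state["pos"] = f.tell()
        return records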


> when you try to tail a log file with log rotation reliably,

Then you do tail --follow -f yourfile.

Files and streams are the same thing.


I think you mean tail -F yourfile.

--follow and -f are the same thing.


No, I meant --follow=name because the default is descriptor.

-f is short for --follow=descriptor


Files are streams.



