Learn Datalog Today

yawgmoth · on May 16, 2021

I learned LogiQL [1] in 2012, which is a Datalog-inspired language. We were writing an OLAP system and wrote some pieces by hand but a large chunk of our star schema was generated via Python + templates.

[1] https://www.routledge.com/LogiQL-A-Query-Language-for-Smart-...

Vaslo · on May 16, 2021

I used it in a Data Curation course. Extremely difficult to learn and found it pretty useless. Entire class struggled with it. You can often use recursive SQL to some of what it does.

Sorry to insult anyone or the creator but just being honest.

YeGoblynQueenne · on May 16, 2021

Logic programming is a difficult subject to teach and I've heard similar accounts to yours ("entire class struggled with it") for Prolog which I'm more familiar with. To be honest, I find datalog simple compared with Prolog -the only real complexity is in understanding fixpoint semantics which can be hairy at first- and so I'm surprised to hear your class struggled with it.

Of course I'm surprised only because I've been using and studying logic programming for a good ten years now and I have to struggle to remember how it was when I first started learning the subject. It was hard :)

I think though teaching Datalog as part of a databases - related (?) class makes it even harder. Datalog was basically created from Prolog's rib and it doesn't make a lot of sense without at least some background in logic programming, which I suspect is not given in a databases course. I also suspect your tutor would have gone from relational calculus to Datalog, that's a surefire way to maximise confusion.

Anyway logic programming is so far from most programmers' experience that it's hard to know where to start with its teaching. I've struggled with the question often because I help mark papers for Prolog courses at my uni and I can see the carnage and my heart goes out to the poor students, and the tutor.

So, yeah, you're not insulting anyone. Everyone understands it's a hard subject to teach.

Fellshard · on May 16, 2021

Did the class learn SQL before they learned Datalog? Did they know SQL before entering the class? Did they learn the principles behind SQL, or just understand how to use the language to get things done?

Finding Datalog useless and frustrating comes from those who have never expanded the depth of their knowledge beyond the basic popular languages and tools of the day. Slight exposure to logic programming and base-level relational model theory makes this tractable.

chmod600 · on May 16, 2021

"Finding Datalog useless and frustrating comes from those"

There are many reasons that people might get frustrated. If you assume without further evidence that it's purely due to user ignorance then you aren't likely to learn about any real shortcomings.

Fellshard · on May 17, 2021

I am willing to make that assumption when we're talking about people in an academic context.

raspasov · on May 17, 2021

This is an amazing website. That's how I learned the basics of Datalog.

escherize · on May 17, 2021

Same here. Good stuff

iimblack · on May 16, 2021

This is one of those things that has always sounded really interesting but I haven’t had the opportunity to use. Does anybody have experience with it? I’m also trying to find some benchmarks but don’t see much.

maweki · on May 16, 2021

As a real database language with ad-hoc-queries you're fine with SQL. Data-intensive graph analysis (the reason soufflé, souffle-lang.org, was developed) where you have a fixed set of deduction rules and change the database instance is an ideal playground for Datalog.

And soufflé is incredibly fast and for the usual benchmarks, like the transitive closure in comparison to recursive CTEs in SQL, blows everything out of the water. (my paper on that http://ceur-ws.org/Vol-2368/paper3.pdf, Figure 3)

I'm researching using Datalog as a general purpose programming language (specifically for use in microcontrollers, like Arduino) for my PhD and it can be very expressive.

discreteevent · on May 16, 2021

Have you come across Eve? It was a general purpose language described as a variant of datalog. They ran out of money in 2018. Everything was a record, even the browser Dom. It's a shame we can't work in a world as consistent and simple as they tried to make it. http://witheve.com/

maweki · on May 16, 2021

> Have you come across Eve?

Apparently that link was already marked as "visited" in my browser history and it seems familiar.

There seems to be a strong connection between reactive programming (especially the functional-reactive style) and Datalog. I believe that is the case because Datalog (simplification incoming) is a logic programming language that, contrary to ASP, for example, always returns a single model and can therefore always be seen as a function from input-relations (FRP cells) to output-relations where the deductive steps and intensional predicates are the streams.

The issue is - and that's not really surprising - that implementing useful call-conventions for your application on top of a datalog-engine is not easier than using a fixed call-convention (like a (f)RP-framework) and implementing your deduction with simpler semantics.

In the end, you have to bind your input and output-relations to your view and write down the rules and they are somewhat simple. With reactive programming you have a somewhat easier time binding your data to the views and the rules (functional mappings) are somewhat more difficult.

That is why I abandoned the reactive style of using Datalog as a general purpose programming language early on. If you really want to do this, you can already use differential dataflow and some plumbing code and you would not have lost a thing.

My approach is closer to Logic Production Systems.

kixiron · on May 16, 2021

Full disclosure, I do work for VMWare on the DDlog project

If you want a sort of FRP-backed datalog, I highly recommend Differential Datalog https://github.com/vmware/differential-datalog It's a datalog flavor/variant that's based on the Timely Dataflow and Differential Dataflow libraries, it's focused on incremental computation over streaming data. Coincidentally it's also used on network switches if that's relevant to your need for deploying on micro-controllers

maweki · on May 16, 2021

As I said, one can easily use differential dataflow with some plumbing code. And this is fine and no worse than the eve language shown above. The DDlog-project and its applications are a nice example for that.

Where I think this approach fails a bit is the following: There is a language-gap issue where your output-relations need to be mapped in a host language to outputs. And you have to call some functions to collect all data for the input relations. Especially in the context of robotics the latter often is not free of side-effects. Not being able to explicitly control those calls can be problematic.

Having a powerful host-language and explicitly embedding a Datalog-implementation is useful, as the decision procedures for your interactive application are not just a series of if-statements but proper condition-action-rules.

My approach is the other way around. I think embedding strict call-conventions for external functions and explicitly modelling side-effects into the language allows for an expressive LP-language that handles the interactive parts as well as the logic-deductive parts.

krasi0 · on May 17, 2021

Then why not embed a Prolog engine which being Turing complete is much more powerful and expressive than Datalog?

maweki · on May 17, 2021

For one, memory management is much more difficult with term structures.

But mainly Prolog semantics with side-effects are a whole different beast. This is a different thing I am researching, but basically the whole idea stems from robotics control with Prolog and how handling of side-effects just does not work nicely with SLD-resolution, specifically backtracking over side-effects.

I have a paper on that as well ;)

YeGoblynQueenne · on May 16, 2021

Is that an embedded Datalog then, that runs on microcontrollers?

I'm researching approaches that learn Datalog programs from examples for my PhD. In practice, the algorithms I study can learn Prolog programs but Datalog is a nice, decidable subset of Prolog that also has the advantage that clause ordering doesn't matter so a learner only needs to learn the clauses in a program, without also learning to order them.

There's a bunch of work from the program synthesis community on learning Datalog programs from examples. Here's a recent-ish paper I found interesting (and whose authors I'm not affiliated with):

https://www.semanticscholar.org/paper/Syntax-guided-synthesi...

maweki · on May 16, 2021

The approach is an iterative deduction with strict call-conventions. In some cases I compile the rules as-is into a C-Program, but if the used memory is provably bounded, I compile to a finite state machine + some registers. Had to ditch the extensional database for that. Not strictly necessary, as it turns out.

You can read it here (https://doi.org/10.1007/978-3-030-68446-4_11, behind paywall) or hit me up on researchgate for that paper. Or as a short overview: https://dbs.informatik.uni-halle.de/microlog/

The main problem, which I claim to solve, is the handling of side-effects. For the microcontroller (or a robot) it is vitally important to not have side-effects or at least have them explicit for the developer and the language semantics. I believe this is where most approaches fail, that they model effectful input-gathering just as input-relations, but you need to control your input as well as your output in an interactive application.

YeGoblynQueenne · on May 16, 2021

Oh, so you publish in logic program synthesis then? Your field and mine (Inductive Logic Programming) are practically neighbours! :)

The paper looks interesting, I'll have a look.

(Although I guess your work is more on the "transformation" part of "logic-based program synthesis and transformation").

>> I believe this is where most approaches fail, that they model effectful input-gathering just as input-relations, but you need to control your input as well as your output in an interactive application.

To be honest, I struggle a bit with the terminology here because I'm woefully behind with my knowledge of modern Datalog (I only need to use the bog-standard language from the '80s to achieve my goals). For example, I'm not sure what counts as input and output in Datalog parlance. I think "input" is your intensional and extensional definitions, then output is the ground facts derived from them?

I'm interested in this also:

>> Had to ditch the extensional database for that. Not strictly necessary, as it turns out.

Is that explained in the paper?

maweki · on May 16, 2021

Usually (using differential dataflow, for example) you have input relations (EDB) and you have some output relations (subset of IDB). You would loop over the output relation using your host-language to call some effectful functions using the result (printing, in the simplest case). Only deducing facts without calling an external function with those facts is not useful for an interactive application.

>>> Had to ditch the extensional database for that. Not strictly necessary, as it turns out. > Is that explained in the paper?

Yeah, but only in short.

Usually you would provide a database as input. But if you have a strict call-convention (call to external functions), you can use their result to load your data dynamically from "within" you logic program, instead of the facts just "being there" or "appear".

This makes static analysis much easier and given a microcontroller is basically without storage, where would you load a database from anyways, that you wouldn't know beforehand?

YeGoblynQueenne · on May 17, 2021

Well I'd like to know more about ditching the extensional database, although I kiind of see how it might work, if, for example, you're generating facts by calling external functions- maybe functions getting user input, or collecting data from sensors etc?

Thinking of it further, an intensionally defined program is, in a sense, just a compressed version of an extensionally defined one (in older logic programming terminology, the intenstionally defined program is a "name" for the predicates it defines), and an intensionally defined program can still "work" if you change the facts it relies on. For example, if you have a program that defines family relations you can always apply it to different individuals, of a different family etc.

So I can see how it might work- but I'd like to read more about it, because as you say storing an extensional datbase takes up space and that's always a consideration even outside of microcontrollers.

But I have more questions so I'll make some time to read your paper.

maweki · on May 17, 2021

You can always interpret the extensional predicate p/1 as a call to a function without arguments and set-valued return. That's basically just a LOAD-instruction. But then you're back at square one considering your static analysis.

So I also limit throughput in that functions are not allowed to be set-valued (which is fine in microcontrollers, as there is no standard way to encode this in memory-limited C anyways), instead you are allowed to call this function again and again over time.

As a function call is a side-effect, you get new values.

I am looking forward to further correspondence.

atombender · on May 17, 2021

Are languages like Datalog and Soufflé suitable for rewriting graphs, e.g. for optimization? Every use of Datalog I've seen refers to deducing facts about graphs, but I've never seen examples of morphing graphs. For example, converting a boolean expression into CNF form.

This is the sort of thing functional languages are great at, but if you're writing something in a different language (JavaScript, Go, Rust, etc.) then it'd be great to be able to plug in a declarative rule-based rewriting engine without rewriting everything in, say, OCaml.

maweki · on May 17, 2021

For rewriting there is the general problem of "consumption". You consume a piece of information during a rewriting step to obtain new information. The old information is then gone. Say you have some edge in a graph and you want that removed, then you have to write down a relation R={all-edges-without-that-edge}. And the number of your relations is fixed. You are allowed to use negation, but only in a restricted (non self-referential) manner.

Also, if your rewriting relationship is not confluent, Datalog - as it only ever has a single model - can not solve the problem, as there is no choice-construct.

Datalog is more akin to graph completion.

What you may want to look into is Answer Set Programming. You could describe your rewriting relationship as a logic program in this manner. ASP is strictly more powerful than Datalog and modern solvers compile to SAT and SAT solvers are fast. And even non-confluence is not a problem, as you get multiple answer sets.

The only issue with Logic Programming in general is value invention. The lack of value invention is usually the termination argument for any solving algorithm. If you need fresh values/names, stuff gets hard really quick, meaning you get infinitely sized models. Non-termination.

wuschel · on May 17, 2021

What usable implementations/libraries of datalog would you recommend?

maweki · on May 17, 2021

Depends on your use case

I would suggest soufflé (https://souffle-lang.github.io/) if you're batching data through for static analysis.

DDLog (https://github.com/vmware/differential-datalog) if you want a rule-engine within your interactive application, to encode decision procedures and business rules.

If you actually want this to drive the views of a (web-)application, I don't think we're really there yet. We have to wait at least for the next generation of developments. But you can give it a try with datascript https://github.com/tonsky/datascript

refset · on May 16, 2021

Disclosing benchmark results is forbidden by the licensing, which isn't uncommon for proprietary DBs, e.g. https://stackoverflow.com/questions/12115397/is-it-against-l...

Incidentally, I was writing up some notes earlier today about the nightly benchmarks we run for Crux (another Clojure-flavoured Datalog DB), which may be of interest: https://opencrux.com/articles/benchmarks.html

benrbray · on May 17, 2021

> Disclosing benchmark results is forbidden by the licensing

How on earth is this enforceable? Talk about anti-competitive!

detaro · on May 17, 2021

> How on earth is this enforceable?

It might not be, but few people care enough about publishing numbers to go through the trouble of finding out (and potentially betting wrong). Although comparisons to "leading proprietary industry database" winkwink nudgenudge might happen, at least in academic circles.

gillh · on May 17, 2021

At Netsil (microservice observability startup), API analytics and stream-processor was built using a dialect of Datalog. Netsil was acquired by Nutanix in 2018 and the technology was incorporated into a couple of product lines. The Datalog engine at Netsil is pretty sophisticated, capable of fine-grained parallelism (built on task-stealing architecture) - Datalog programs are chained together in a pipeline, such that the output of one program is an input to another. These programs (stages in the pipeline) can be run in serial, parallel or context_ordered modes. In context_ordered mode the tuples are sent to the execution engine ordered by their context_keys -- e.g. captured packets from the same flow are processed serially and in parallel across flows.

In addition, it incorporates SQLite embedded DB for materialized tables and state. The raw events are aggregated using Datalog rules & tables and SQL queries are auto-generated from high-level Datalog rules. We also extended Datalog with the concept of "Accumulators", stateful objects in rule head that emit a tuple when certain conditions are met - e.g. we build Windowed Counters, TCP re-assembly from network packets etc. using Accumulators.

In addition to all this, one could distribute Datalog programs across machines -- Tuples serialize to protobuf and sent across machines over ZeroMQ etc. We used distributed features by sending raw metrics from the Host based Agents on customer's machines to the cloud service where the Datalog based stream processor joins the incoming metrics/events tuples with metadata collected from external sources (e.g. K8S tags).

The Datalog engine also synchronized some tables with Zookeeper for distributing config updates in real-time such as metadata tables that were used in stream-joins. E.g. each time a new K8S pod was detected in customer environment, an update tuple was sent to cloud, which updated the Zookeeper backed materialized table thereby synchronizing with all other stream-processor pods in near-realtime.

raspasov · on May 17, 2021

Yes, I've been using DataScript (which uses this flavor of Datalog) to store and query the data locally in my iOS app in the last few months. DataScript works well but you need to know how to use it.

Benchmarks would entirely depend on the implementation of the database/language.

pm90 · on May 16, 2021

Heh, I read it as "learn datadog today". Which would also be an interesting article, as it took me quite a while to understand how its supposed to work.

tannhaeuser · on May 16, 2021

Unfortunately, the term "Datalog" has been used inflationary to mean anything to anyone as an attempt to market proprietary DBs at this point.

See [1] in case you're wondering what Datalog actually is.

[1]: https://en.m.wikipedia.org/wiki/Datalog

capableweb · on May 16, 2021

> Unfortunately, the term "Datalog" has been used inflationary to mean anything to anyone as an attempt to market proprietary DBs at this point.

Is it? You have any specific examples? The only one mentioned in the website is Datomic, and it is using Datalog. The very second link on the website is also the Datalog Wikipedia page.