Policy Engines: Open Policy Agent vs. AWS Cedar vs. Google Zanzibar

tptacek · on Aug 17, 2023

I thought this was a pretty weak writeup. I'm somewhat familiar with Zanzibar and less familiar with OPA or Cedar, and the coverage of Zanzibar was odd and superficial. Zanzibar is a large-scale distributed system whose motivation is handling intricately related sets of ACLs while avoiding vulnerabilities that come from disrespecting causal ordering.

One big "con" to using it is that it's an internal Google system! If you're going to compare it to open-source policy engines, the sane thing to do would be to pick one of the open-source Zanzibar-inspired systems, and compare that.

evancordell · on Aug 17, 2023

I've seen the sentiment in this article pop up in a few places, which I'd summarize as: policy languages like OPA and Cedar are fast to evaluate and simple to write, so you should use them for all of your authorization needs.

But policy engines are only really fast and simple if they already have all of the data they need at evaluation time.

If you look at the examples in the Cedar playground[0], they require you to provide a list of "entities" to Cedar at eval-time. These entities are some (potentially large) chunk of your application's data. And while the policy evaluation over that data may be fast, the round trip to your database is probably not. And then you start to think about caching, data consistency, and so on, and suddenly you're thinking about a lot of the problems that Zanzibar was designed to address (but you're on your own to build it out).
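To make the data-loading point concrete, here is a minimal Python sketch of the shape of the problem. It is not Cedar's API: `DB`, `load_entities`, `is_in`, and `is_authorized` are hypothetical names, and the "policy" is hard-coded. The only thing it is meant to show is that the evaluator can decide nothing until the caller has gathered the right slice of application data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uid: str
    parents: frozenset = frozenset()


# Hypothetical stand-in for the application's database.
DB = {
    "User::alice": Entity("User::alice", frozenset({"Group::editors"})),
    "Group::editors": Entity("Group::editors"),
    "Doc::readme": Entity("Doc::readme", frozenset({"Folder::docs"})),
    "Folder::docs": Entity("Folder::docs"),
}


def load_entities(*uids):
    """Simulates the database round trip: fetch each entity and its ancestors."""
    loaded, stack = {}, list(uids)
    while stack:
        uid = stack.pop()
        if uid in loaded or uid not in DB:
            continue
        loaded[uid] = DB[uid]
        stack.extend(DB[uid].parents)
    return loaded


def is_in(entities, uid, ancestor):
    """True if uid is ancestor, or reaches it via parent links in the supplied entities."""
    seen, stack = set(), [uid]
    while stack:
        cur = stack.pop()
        if cur == ancestor:
            return True
        if cur in seen or cur not in entities:
            continue
        seen.add(cur)
        stack.extend(entities[cur].parents)
    return False


def is_authorized(principal, action, resource, entities):
    """Toy policy: members of Group::editors may view anything under Folder::docs."""
    return (
        action == "view"
        and is_in(entities, principal, "Group::editors")
        and is_in(entities, resource, "Folder::docs")
    )


entities = load_entities("User::alice", "Doc::readme")
print(is_authorized("User::alice", "view", "Doc::readme", entities))
print(is_authorized("User::alice", "view", "Doc::readme", {}))
```

The evaluation itself is a cheap in-memory walk; the same request is denied when the entity slice is empty, which is why the fetch (and how fresh it is) ends up dominating the design.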

IMO policy engines are best suited for ambient request data: things you already know about a request because of a session, a route, or a network path, and policies that make sense to manage on the same lifecycle as your application.

Disclaimer: I work on SpiceDB[1], a Zanzibar implementation, but I do also like policy engines.

[0]: https://www.cedarpolicy.com/en/playground

[1]: https://github.com/authzed/spicedb

gemanor · on Aug 17, 2023

> And then you start to think about caching, data consistency, and so on

If you are looking at OPA or Cedar as a standalone engine, this is the correct assumption. To avoid this hassle, there is an open-source tool called OPAL[1] that will let you run the policy engines with all the sync work without any investment in custom solutions. OPAL has a ready mechanism for data fetching and synchronization, so you can plug it into your application's data and not worry about the data.

Disclaimer: I'm one of the OPA maintainers.

[1] https://github.com/permitio/opal

evancordell · on Aug 17, 2023

The article was comparing OPA/Cedar to Zanzibar, which is why my head went there. I did go looking for info on how OPAL deals with caching and consistency and found these:

- Authz data is kept in memory, so what you can authorize over is limited by the memory of the box you run OPAL/OPA. The docs also mention sharding, but I'm not clear on how you actually do that with OPA. [0] Maybe there's another doc that I missed.

- You can get a token representing the last time data was synced to the cache in an OPAL health check, but I'm not clear on how you'd use it to ensure consistency in your application since hydrating the cache is asynchronous. [1]

Anyway, those are the types of things Zanzibar is concerned with, so that comparison (instead of Cedar) would've made more sense to me. Without spending more time on it, I'm not sure if I've represented OPAL correctly above, that's just what I found when I went looking.

[0]: https://docs.opal.ac/faq/#handling-a-lot-of-data-in-opa

[1]: https://docs.opal.ac/faq/#how-does-opal-guarantee-that-the-p...

gemanor · on Aug 17, 2023

> I'm not clear on how you actually do that with OPA

The sharding is managed from the OPAL control plane: when you configure the data sources, you also configure how the sharding works.

> ensure consistency in your application since hydrating the cache is asynchronous.

OPAL uses eventual consistency for cache reliability; you can know that data has changed even before you know what changed.

jwineinger · on Aug 18, 2023

> If you look at the examples in the Cedar playground[0], they require you to provide a list of "entities" to Cedar at eval-time. These entities are some (potentially large) chunk of your application's data.

This is a primary reason we stopped looking at AWS Cedar. If you don't know all of the policies that might apply to your request (b/c policy authors might be different than dev teams), how do you know what entities need to be sent in the request context? And in an authz system with many different entity types (and stores), gathering them all, even if you know which ones to get, would be non-trivial. Repeat for every system using Cedar, or build some SPOFish thing in the middle.

That, and pricing seemed pretty terrible for us.

orweis · on Aug 17, 2023

On Zanzibar - the article refers to OSS implementations like SpiceDB or Ory. It's a follow-up to a more in-depth article (1), and is meant to be a lighter starting point.

- 1: https://www.permit.io/blog/zanzibar-vs-opa

tptacek · on Aug 17, 2023

It refers to Zanzibar as a "graphical" system, which I think was the first thing that snagged me on this. Your post does too; I assume this is a language snag? "Graphical" doesn't connote "graph-based" in American idiom, but rather "visual".

I don't think your writeup really captures OPA vs. Zanzibar especially well either, for the reasons given by the SpiceDB person upthread. It just sort of defines away the problem Zanzibar is trying to solve, while claiming that Zanzibar-type systems aren't deployable at the edge --- which is pretty clearly not true?

orweis · on Aug 17, 2023

Re: "Graphical" - I can see how that would have that effect :)

To be fair it doesn't really say that, it reads: "Graph-based authorization systems utilize a graphical representation to illustrate relationships between users and resources"

Still, I think Daniel (post author) could have picked better phrasing - I'll ask him to change it.

> "while claiming that Zanzibar-type systems aren't deployable at the edge"

For most companies it's extremely impractical; and for a developer (the audience of this article) who simply wants to add performant permissions to their app without embarking on a whole devops adventure, it's as good as undeployable.

Dowwie · on Aug 17, 2023

Another big "con" is its complexity and difficulty to understand/implement. It's the kind of thing that once you have a handle on, you go into business trying to sell it as a service because you look behind you and see a moat.

academia_hack · on Aug 18, 2023

It's a pity the state of the art here is still so dire for basic use cases. Things like managing group access to files in directories, or giving people field-level permissions to specific columns on specific database records.

All of these languages (OPA, Cedar) are nifty but assume you have a whole team of devops engineers dedicated to replicating slices of your database state and getting them to the policy engine at the right time without discrepancies. Likewise, systems like Permit.io, OPAL, Topaz and SpiceDB claim to solve this problem, but they do it by adding an extra API call to everything your application does and then potentially having a permissions database which is in a divergent state from your real application.

I was really hopeful about Oso for this, since it is a policy engine that operates over the application's data as a library (rather than a service), but development on the actual library appears basically abandoned in favor of their SaaS, and it's currently full of weird quirks, bugs and outdated dependencies - especially around the filtering tasks (e.g. show me all the blog posts a user is allowed to read). I'd happily pay for an Oso library that worked, but it seems like they've hyper-focused on SaaS.

Really would love to see one of these languages solve the data access layer for regular applications, everyone immediately seems to run to kubernetes at Google scale when I'd kill for a nice way to write permissions that can sit on the same server as my application and do the job reasonably well.

jreynoldsdev · on Aug 17, 2023

What I still struggle to understand with these systems is they seem great for single resource authorization, but how do you perform bulk queries? For example, a user wants to query all blogs they have access to (assuming there are large amounts of them), does that require separate authorization logic in the DB?

jen20 · on Aug 17, 2023

Zanzibar in particular is designed to be able to answer the question "what can this user access?", or "who can access resource X?" as well as "can user Y access resource X?".

This article from OSO [1] explains how, with references to tweets from Lea Kissner (one of the authors of the paper and implementors) which are unfortunately less useful now that Twitter threads have been vandalised.

[1]: https://www.osohq.com/post/zanzibar

jzelinskie · on Aug 17, 2023

Full disclosure: I'm a maintainer of SpiceDB, the most mature open source project inspired by Zanzibar

For this exact use case, SpiceDB created two APIs not available in Zanzibar: LookupSubjects and LookupResources. For other scenarios, there's also a BulkCheck API for performing many checks with less request overhead. The sibling comment here is correct that there isn't filtering/sorting available in SpiceDB yet.
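For readers who want a feel for what "check" versus "lookup" means over relationship data, here is a toy in-memory Python sketch. It is not SpiceDB's API or schema language: the tuple layout, the hard-coded "viewers of a parent can view the child" rule, and the functions `check`, `lookup_resources`, and `lookup_subjects` are all assumptions made for illustration, and the lookups brute-force over every known object, which a real system would avoid.

```python
# Relation tuples in the spirit of Zanzibar: (resource, relation, subject).
# A subject like "group:eng#member" means "everyone who is a member of group:eng".
TUPLES = [
    ("folder:eng", "viewer", "group:eng#member"),
    ("group:eng", "member", "user:alice"),
    ("doc:spec", "parent", "folder:eng"),
    ("doc:notes", "viewer", "user:bob"),
]


def check(resource, relation, user, tuples=TUPLES, _seen=None):
    """Can `user` reach `relation` on `resource`? Depth-first walk of the tuples."""
    seen = _seen if _seen is not None else set()
    if (resource, relation) in seen:  # already explored, or a cycle
        return False
    seen.add((resource, relation))
    for res, rel, subj in tuples:
        if res != resource:
            continue
        if rel == relation:
            if subj == user:
                return True
            if "#" in subj:  # userset: recurse into e.g. group:eng#member
                obj, subrel = subj.split("#", 1)
                if check(obj, subrel, user, tuples, seen):
                    return True
        elif rel == "parent" and relation == "viewer":
            # Hard-coded toy rule: viewers of the parent can view the child.
            if check(subj, "viewer", user, tuples, seen):
                return True
    return False


def lookup_resources(user, relation, prefix, tuples=TUPLES):
    """Which resources of a given type can `user` reach? Brute force over known resources."""
    resources = sorted({res for res, _, _ in tuples if res.startswith(prefix)})
    return [r for r in resources if check(r, relation, user, tuples)]


def lookup_subjects(resource, relation, tuples=TUPLES):
    """Which users can reach `relation` on `resource`? Brute force over known users."""
    users = sorted({s for _, _, s in tuples if s.startswith("user:")})
    return [u for u in users if check(resource, relation, u, tuples)]


print(check("doc:spec", "viewer", "user:alice"))
print(lookup_resources("user:alice", "viewer", "doc:"))
print(lookup_subjects("doc:notes", "viewer"))
```

Note that the lookups return an unordered set of IDs; filtering, sorting, and pagination over application fields still have to happen somewhere else, which is the gap discussed in this thread.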

Additionally, there are folks using SpiceDB today by replicating denormalized checks back into their database (e.g. Postgres) or search index (e.g. Elastic) so that you can filter them natively. This is the combination of the aforementioned Lookup APIs with our Watch API. While this strategy requires moving parts, it is necessary beyond a particular scale which is well beyond the point at which policy engines typically fall over.

While I'm biased, I do find this article somewhat misleading when describing Zanzibar-inspired systems; it presents opinion without any evidence or examples to justify the claim and concludes it as fact, but that might be because they're leaning on their previous article. Zanzibar is novel because it is fundamentally designed to be run at the edge and solves the difficult problem of keeping the view of data at the edge consistent. This article conveniently leaves out how other systems get data to the edge while still keeping it consistent for their authorization logic. Latency is also brought up, but we recently managed to scale SpiceDB to >1M requests per second with 100B relationships while maintaining a 5ms p95 measured at the client application[0]. The claim that you absolutely need a service to run a Zanzibar system is a provably false claim based on the number of clusters in the wild running SpiceDB or Ory's Keto project.

[0]: https://authzed.com/blog/google-scale-authorization

turtles3 · on Aug 18, 2023

> Additionally, there are folks using SpiceDB today by replicating denormalized checks back into their database (e.g. Postgres) or search index (e.g. Elastic) so that you can filter them natively. This is the combination of the aforementioned Lookup APIs with our Watch API. While this strategy requires moving parts, it is necessary beyond a particular scale which is well beyond the point at which policy engines typically fall over.

Would you say that because of this, Zanzibar engines like spicedb only become useful on systems of a certain size / complexity? Fundamentally you run into data synchronization issues whether you are syncing denormalized data back to your db via Watch or whether you write the relationships to both data stores in the first place. This article[0] on the latter topic touches on this, but brushes over some trickier parts of implementing such a thing correctly (e.g. the two-writes section only covers insert, which is generally the less harmful case for a ghost write that persists in spicedb, and not update or delete; the streaming-updates section brushes over some major footguns).

Granted there's nothing unique to spicedb in this sort of complexity, but by nature of being a db, using spicedb mandates that users must take on the complexity.

Is it then fair to say that it is appropriate to use spicedb once a project reaches a certain size / complexity, or would you expect a startup to adopt it from the beginning?

[0] https://authzed.com/blog/writing-relationships-to-spicedb

jzelinskie · on Aug 21, 2023

>Fundamentally you run into data synchronization issues whether you are syncing denormalized data back to your db via Watch

The Watch and Lookup APIs emit revisions so that any replicated data can include revisions to guarantee consistency. The linked article covers replicating data into SpiceDB and not the other way around; this is generally done for brown-field projects and does come with consistency trade-offs.
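One way to picture "replicated data can include revisions" is the Python sketch below. It is an assumption-heavy illustration, not SpiceDB's API: revisions are modeled as plain increasing integers (real tokens are typically opaque), and `PermissionReplica`, `apply`, `can_read`, and `StaleReplica` are hypothetical names. The idea shown is only that a replica remembers the revision it has caught up to, so a reader can demand "at least as fresh as revision N" instead of silently reading stale data.

```python
class StaleReplica(Exception):
    """Raised when the replica has not yet caught up to the requested revision."""


class PermissionReplica:
    """Hypothetical denormalized copy of (user, resource) read permissions."""

    def __init__(self):
        self.revision = 0
        self.allowed = set()

    def apply(self, revision, added=(), removed=()):
        """Apply one change event; events at or below the current revision are ignored."""
        if revision <= self.revision:
            return
        self.allowed |= set(added)
        self.allowed -= set(removed)
        self.revision = revision

    def can_read(self, user, resource, at_least=0):
        """Answer from the replica only if it is at least as fresh as `at_least`."""
        if self.revision < at_least:
            raise StaleReplica(f"replica at {self.revision}, need {at_least}")
        return (user, resource) in self.allowed


replica = PermissionReplica()
replica.apply(5, added=[("alice", "doc:1")])
print(replica.can_read("alice", "doc:1", at_least=5))
```

In this sketch the application would carry the revision returned by its last permission write and pass it as `at_least`; on `StaleReplica` it can wait, retry, or fall back to asking the authorization service directly.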

It's true that this complexity isn't unique to SpiceDB. The important part is that SpiceDB makes this _possible_ because if you architect a solution where it isn't, you'll find one day you've backed yourself into a corner.

>Is it then fair to say that it is appropriate to use spicedb once a project reaches a certain size / complexity, or would you expect a startup to adopt it from the beginning?

I briefly touch on this subject in this post[0]. Unfortunately, there's no dead simple answer. We do have customers that are startups in various stages, but they all deeply considered the implications of focusing on authorization before they jumped in. IME, startups really need to find product market fit first. Build your MVP using whatever it takes and only move on to thinking about authorization when it becomes critical. When is it critical, but not too late? I think that's once you start noticing that each PR implementing a feature request is also touching authorization code/SQL. There are also other big signals: microservices architecture or enterprise customers are almost certain indicators that your authorization logic isn't going to remain a small library in your monolith.

[0]: https://authzed.com/blog/authz-must-scale

orweis · on Aug 17, 2023

Jimmy I truly think you're awesome (And so is SpiceDB), but the irony here stands out: "it presents opinion without any evidence or examples to justify the claim and concludes it as fact"

You mean stuff like:

1) "SpiceDB, the most mature open source project inspired by Zanzibar" (though I'd vouch for that one)

2) "it is necessary beyond a particular scale which is well beyond the point at which policy engines typically fall over."

3) "Zanzibar is novel because it is fundamentally designed to be run at the edge"

4) "we recently managed to scale SpiceDB to >1M requests per second with 100B relationships while maintaining a 5ms p95 measured at the client application" - to be fair, you should bundle that statement with the caveat that you need to set it up within your own VPC.

5) "The claim that you absolutely need a service to run a Zanzibar system is a provably false claim based on the number of clusters in the wild running SpiceDB or Ory's Keto project" - how many clusters? :)

Re: "This article conveniently leaves out how other systems get data to the edge while still keeping it consistent for their authorization logic" The article actually does mention OPAL [0]

[0]: https://www.permit.io/blog/introduction-to-opal

jzelinskie · on Aug 17, 2023

Your critique of my comment is quite fair; we're both guilty of making claims, but not including all the supporting evidence for brevity's sake. I think we can both agree that everyone working in this space is doing awesome work and bringing authorization the attention that it has sorely needed.

orweis · on Aug 17, 2023

Agree 100%. <3 And as I told Joey many times - I'd love to collaborate more with you as well.

bfeynman · on Aug 17, 2023

What do you mean separate authorization logic? There are many layers to auth and usually they act as interceptors in request that go very fast. If you have blanket permissions to list, you are able to list resources you have access to... that's trivial. However `Blog` resources might have explicit deny policies on them as well, so yes those are also evaluated. Not sure how else you'd expect it to work sans caching like current state of resources and access.

ahoka · on Aug 17, 2023

Yes, you need to consider authorization at every layer. You can blanket deny a lot of things in a midlayer, but sooner or later you need to start interpreting business logic to do the rest.

gemanor · on Aug 17, 2023

Data filtering is approached differently by each of the policy engines mentioned. For OPA, custom Rego code can return the allowed data, and the caching mechanism will ensure its consistency and reliability. For Zanzibar, since the policy is derived from the data relations, data filtering is an inherent part of the paper's design. I recommend the following article for more information about policy as code and policy as data to understand the context better - https://www.permit.io/blog/zanzibar-vs-opa

turtles3 · on Aug 17, 2023

This, especially when you combine it with pagination, filtering and ordering requirements.

Zanzibar implementations (eg. Spicedb, keto etc) offer functionality for listing resources accessible to a given principal, but as far as I can see none have a coherent solution for filtering and ordering.

The only solution I can see to this is as you suggest maintaining a shadow copy of the relationships in your db so you can answer the question with a regular SQL query. This obviously comes with a lot of headaches, and is the sole factor preventing us from adopting one of these systems, so I really hope I'm wrong about this.
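For what it's worth, here is a minimal Python/SQLite sketch of that shadow-copy approach, so the trade-off is concrete. The `posts` and `post_readers` tables and the `readable_posts` function are hypothetical, and the sketch says nothing about the hard part, which is keeping `post_readers` in sync with the authorization system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, created_at TEXT);
    -- Hypothetical shadow table: one row per (user, post) the user may read,
    -- kept in sync from the authorization system by some external process.
    CREATE TABLE post_readers (
        user_id TEXT,
        post_id INTEGER,
        PRIMARY KEY (user_id, post_id)
    );
    INSERT INTO posts VALUES
        (1, 'Alpha', '2023-08-01'),
        (2, 'Beta', '2023-08-02'),
        (3, 'Gamma', '2023-08-03');
    INSERT INTO post_readers VALUES ('alice', 1), ('alice', 3), ('bob', 2);
    """
)


def readable_posts(user_id, title_like="%", limit=10, offset=0):
    """Filter, order, and paginate in SQL, restricted to posts the user may read."""
    return conn.execute(
        """
        SELECT p.id, p.title
        FROM posts AS p
        JOIN post_readers AS r ON r.post_id = p.id
        WHERE r.user_id = ? AND p.title LIKE ?
        ORDER BY p.created_at DESC
        LIMIT ? OFFSET ?
        """,
        (user_id, title_like, limit, offset),
    ).fetchall()


print(readable_posts("alice"))
print(readable_posts("alice", limit=1, offset=1))
```

The query side is trivial once the join table exists; the headaches are all in populating it correctly and deciding what to do when it lags behind the source of truth.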

random3 · on Aug 17, 2023

From short scans of the papers, at least with Zanzibar, AFAIK you can define entities and relations (think groups of users and directories) and infer rights based on those. I'm assuming Zanzibar backs the actual Google 360 document sharing so presumably it would scale for that use-case.

RandomBK · on Aug 17, 2023

The google paper refers to the existence of some 'permissions-aware index' (paraphrasing) that's used to answer range queries like this, but doesn't cover how this index would work.

I know various Zanzibar implementations have exposed APIs to solve this problem, but I still don't have a great intuitive understanding of how they work beyond 'push the ACL logic into the data layer', which brings us back to a pre-zanzibar world.

mdaniel · on Aug 17, 2023

Since the team seems to be hanging out in the comments: it appears the article was written out of order and "ReBAC" was used without the parenthetical definition in the OPA part but only defined down in the "Advantages of Google Zanzibar" section. I did what I usually do on an unusual acronym and "right-click, search" but .. that went poorly. The actual definition of "Relationship-based Access Control (ReBAC)" did show up about 4 results in, so fine, but it would be cheaper for the reader if you were to pull that upward to "define on first use"

I would pay $1 if it was clarified about how that differs from RBAC or ABAC but maybe I just haven't gone through enough blogs to already know

orweis · on Aug 18, 2023

Hi! Fair point. We got two articles coming this month: RBAC vs ReBAC, and RBAC vs ReBAC vs ABAC - we'll post those here / in the article itself when ready.

For now, in short: RBAC (role-based) is a simple identity-to-role-to-permission mapping. ABAC (attribute-based) maps conditions on attributes to permissions (technically it can implement anything; it's mostly used for things like time-based rules, quotas, location, etc.). ReBAC (relationship-based) maps relations between identities and resources to permissions (e.g. if a user is related as an owner to a folder, and the folder contains a file, the user is the owner of the file); it's commonly used for resource and organization hierarchies.
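A rough Python sketch of the three models side by side, in case it helps. Everything here is made up for illustration (the users, roles, the 9-to-17 rule, and the function names `rbac_allows`, `abac_allows`, `rebac_is_owner`); it is not taken from any of the products discussed.

```python
from datetime import time

# RBAC: identity -> role -> permission.
USER_ROLES = {"alice": {"editor"}, "bob": {"viewer"}}
ROLE_PERMISSIONS = {"editor": {"read", "write"}, "viewer": {"read"}}


def rbac_allows(user, permission):
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


# ABAC: conditions on attributes -> permission.
def abac_allows(user_attrs, permission, now):
    """Made-up rule: only the eng department may write, and only from 09:00 to 17:00."""
    if permission == "write":
        return user_attrs.get("department") == "eng" and time(9) <= now <= time(17)
    return permission == "read"


# ReBAC: relations between identities and resources -> permission.
OWNERS = {("alice", "folder:reports")}
CONTAINS = {("folder:reports", "file:q3.pdf")}  # (parent, child); assumed acyclic


def rebac_is_owner(user, resource):
    """A user owns a resource directly, or by owning a folder that contains it."""
    if (user, resource) in OWNERS:
        return True
    return any(
        child == resource and rebac_is_owner(user, parent)
        for parent, child in CONTAINS
    )


print(rbac_allows("bob", "write"))
print(abac_allows({"department": "eng"}, "write", time(10)))
print(rebac_is_owner("alice", "file:q3.pdf"))
```

The practical difference shows up in what data each check needs: RBAC needs a role table, ABAC needs attributes at request time, and ReBAC needs the relationship graph, which is the part that grows with your application data.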

orweis · on Aug 17, 2023

Founder of Permit.io here- cool that this article grabbed some love. For those of you not sure which is the best from the article- Permit combines all 3 together.

- OPA/Rego or Cedar at the edge, for quick, efficient, zero-latency policies

- And Zanzibar at the cloud control plane to manage the overall picture and relationships