
I started doing this for a niche area: US and European regulations and guidance documents for Good Laboratory Practice, and later for Canadian Cannabis regulations. Basically I created a standard XML schema for regulations and parsed them into XML [1]. This allowed for e.g. presenting tables of contents and section folding, pulling and linking definitions into their own search engine, etc. [2]

I thought that I could easily write a parser for each jurisdiction's formats, and then get predicate rules and related regulations for free.

I was wrong: a) there are many jurisdictions and sub-groups all doing their own thing; and b) most don't have any standard document formatting or tagging, let alone a defined structure. Even in the most structured formats (like the US eCFR's XML) the focus is on display rather than content. In the worst cases, whoever wrote up the Word document simply chose how to number and format chapters, sections, etc.

There were so many special cases that it was a huge amount of work to add or update each document, and I ended up doing a lot of categorization and fixing by hand.

[1] I know people hate XML on HN, but I did my research and had specific reasons for choosing it at the time, including human readability, nested sections, and being able to easily publish and validate a schema.

[2] See ReadtheRegs.com. You can browse the definitions page without an account.
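
To make the idea concrete, here is a minimal sketch of the kind of thing a structured format allows: building a table of contents and a definitions index from nested sections. The element and attribute names (`regulation`, `section`, `definition`, `num`, `heading`, `term`) and the sample text are made up for illustration; they are assumptions, not the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: element and attribute names are illustrative only,
# not the ones used by the real schema discussed above.
SAMPLE = """
<regulation title="Example GLP Regulation">
  <section num="1" heading="General Provisions">
    <section num="1.1" heading="Definitions">
      <definition term="test facility">A site where a study is conducted.</definition>
      <definition term="study director">The person responsible for a study.</definition>
    </section>
    <section num="1.2" heading="Scope"/>
  </section>
  <section num="2" heading="Organisation and Personnel"/>
</regulation>
"""


def table_of_contents(element, depth=0):
    """Walk nested <section> elements and return (depth, num, heading) rows."""
    rows = []
    for section in element.findall("section"):  # direct children only
        rows.append((depth, section.get("num"), section.get("heading")))
        rows.extend(table_of_contents(section, depth + 1))
    return rows


def definitions_index(root):
    """Map each defined term to (number of the enclosing section, text)."""
    index = {}
    for section in root.iter("section"):  # every section at any depth
        for definition in section.findall("definition"):
            term = definition.get("term")
            index[term] = (section.get("num"), definition.text.strip())
    return index


root = ET.fromstring(SAMPLE)
toc = table_of_contents(root)
definitions = definitions_index(root)

# Print an indented table of contents.
for depth, num, heading in toc:
    print("  " * depth + f"{num} {heading}")
```

Because the nesting is explicit in the markup, section folding and a definitions search fall out of a simple tree walk rather than per-document heuristics.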




This looks great! I share your sentiment: I looked into the XML files for the published German legal texts[1], and they seem to be made for display purposes only.

[1] Table of contents for XML files: https://www.gesetze-im-internet.de/gii-toc.xml


Crazy isn't it?

I actually pitched to the American Society of Quality Assurance a few years ago that we should be going to the various governing jurisdictions with a schema and encouraging them to publish regulations in a standard format.

The benefits of treating regulations as data are enormous - not only do you have a standard way of consuming and linking regulatory requirements like in an API, you also get discoverability, the ability to make tools (syntax highlighting in legalese!), compile requirements over multiple jurisdictions, and more!

I had difficulty selling the idea among the non-computer-savvy (but technical) regulatory professionals, but I'm sure a few of you on HN can imagine the benefits of having a tree-sitter for legal code...

I could have pushed it further, taking the lead to pitch to the various regulators I work with in my consulting business, but in the end it was just too much work for a side project without interest from my peers.


> having a tree-sitter for legal code

This! I think this would enable so many services that make the legal system more approachable for many people.


I completely agree: in a lot of domains, freeform human language provides far more expressive power than you actually need, or want, for communicating ideas. My IANAL understanding of legalese is that it's an attempt to constrain the use of language to be more precise, but from an outsider's point of view it looks needlessly complicated.

Could be a https://xkcd.com/793/ situation though.


In this case I wasn't attempting to constrain the language, but rather to capture the structure already implicit in the system - hierarchy of chapters, sections, clauses and sub-clauses, attributes such as definitions and exceptions, cross references, repeals and previous versions, interpretation notes, etc.
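
As a rough sketch of what capturing that structure could look like as data (the `Provision` class, its field names and the sample numbering are all hypothetical, chosen only for illustration), a tree of provisions with explicit cross references makes it trivial to check for links that point nowhere:

```python
from dataclasses import dataclass, field


@dataclass
class Provision:
    """One node in the hierarchy: chapter, section, clause or sub-clause.

    Hypothetical model for illustration; not an actual published schema.
    """
    kind: str
    number: str
    text: str = ""
    cross_refs: list[str] = field(default_factory=list)
    children: list["Provision"] = field(default_factory=list)


def index_by_number(node, index=None):
    """Flatten the tree into a {number: provision} lookup table."""
    if index is None:
        index = {}
    index[node.number] = node
    for child in node.children:
        index_by_number(child, index)
    return index


def unresolved_refs(root):
    """Return (source number, target number) pairs whose target is missing."""
    index = index_by_number(root)
    missing = []
    for node in index.values():
        for ref in node.cross_refs:
            if ref not in index:
                missing.append((node.number, ref))
    return missing


chapter = Provision(
    kind="chapter",
    number="1",
    children=[
        Provision(kind="section", number="1.1", text="Definitions"),
        Provision(
            kind="section",
            number="1.2",
            text="Subject to 1.1 and 3.4, ...",
            cross_refs=["1.1", "3.4"],  # 3.4 does not exist in this tree
        ),
    ],
)

dangling = unresolved_refs(chapter)
print(dangling)
```

Repeals, previous versions and interpretation notes could hang off the same nodes as extra fields; none of this touches the wording of the provisions themselves.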

While the programmer/engineer in me likes the idea of trying to codify and constrain standard legal terms and grammar to some consistent interpretation, I do think this is an XKCD style oversimplification of a very complex system.

Though IANAL, I am a "regulatory QA professional" who has to interpret intent, wording and current enforcement of various food, drug and cannabis regulations every day. It's a complete mess of spaghetti code and undefined behaviour, and worse, it's the implied, imprecise and badly worded parts that turn out to be the most important.

It's a moving target of guidance documents, published inspection findings that reveal "the current thinking of the inspectorate" and "industry best practices" with no single point of reference. Not to mention the pharmacopoeia and published standards. Though there are so many ways we could improve things, I doubt you could ever actually get that ideal constrained language without turning it into a billion special cases.

It can be very frustrating to work with, especially trying to convince management why they can't do something that isn't expressly forbidden in the regulations! But this does show exactly why there's so much leaning on intent rather than precise requirements - much like tax code, organisations would and do find money-saving loopholes all the time that might put people at risk, hence the moving target of interpretation and best practices.



