Hey HN,
Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.
In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.
My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!
Fantastic work, and really appreciate the write up. It's quite timely for me - I'm from a tech background and have just started studying Australian law, and was thinking about doing exactly this - so you are years ahead of me :).
> For someone interested in using the data (and help out with bugs/issues), where would you suggest starting?
I think the best place to start is by downloading the Corpus (visit https://huggingface.co/datasets/umarbutler/open-australian-l... , and then click "Files and versions" and then "corpus.jsonl"). You can then use my Python library orjsonl to parse the dataset (you'd run, `corpus = orjsonl.load('corpus.jsonl')`). At that point, there's any number of applications you could use the dataset for. You could pretrain a model like BERT, ELECTRA, etc... and share it on HuggingFace. You could connect the dataset to GPT and do RAG over it. Etc...
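If you'd rather not pull in another dependency, the same JSON Lines format can be streamed with the standard library (orjsonl is just a faster drop-in for this pattern, and streaming avoids holding the whole ~5GB file in memory). A minimal sketch — the `text` and `citation` field names in the usage comment are taken from the Corpus schema, but treat them as assumptions to verify against the dataset card:

```python
import json

def stream_jsonl(path):
    """Yield one parsed document per line of a JSON Lines file.

    Streaming line by line avoids loading the whole corpus into
    memory at once; orjsonl.stream() is a faster drop-in equivalent.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical usage, with corpus.jsonl downloaded from Hugging Face:
# for doc in stream_jsonl("corpus.jsonl"):
#     print(doc["citation"], len(doc["text"]))
```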
Fantastic work here! I was griping to my team just last night about how painful it is to develop a chunking strategy for Australian legislation: while there's (generally) layout consistency within a piece of legislation, that's not true across pieces of legislation... so I can imagine the pain of trying to collate legislation across jurisdictions.
I've reached out via your LinkedIn profile - would be great if there was an opportunity to collaborate.
> Fantastic work here! I was griping to my team just last night about how painful it is to develop a chunking strategy for Australian legislation: while there's (generally) layout consistency within a piece of legislation, that's not true across pieces of legislation... so I can imagine the pain of trying to collate legislation across jurisdictions.
Absolutely, there's a lack of consistency even within the same jurisdiction and document type. It only gets worse once you add multiple jurisdictions and different types of documents. My best strategy so far has been recursive chunking, where you begin splitting at the largest runs of newlines. Ideally, though, you want some form of semantic chunking where you already know which parts of the document represent Parts, Divisions, Schedules, Sections, Sub-sections, etc...
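For what it's worth, the recursive strategy described above can be sketched in a few lines. This is a toy illustration rather than the exact code used for the Corpus, and the chunk size and separator list are placeholders:

```python
def recursive_chunk(text, max_len=1000, separators=("\n\n\n", "\n\n", "\n", " ")):
    """Split text on the largest run of newlines first, recursing with
    progressively smaller separators until every chunk fits max_len.

    Note: separators are dropped, so chunks won't rejoin to the exact
    original text -- fine for embedding/RAG, not for round-tripping.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split at max_len as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_chunk(part, max_len, rest))
    return chunks
```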
> I've reached out via your LinkedIn profile - would be great if there was an opportunity to collaborate.
Yes, this. I've been trying to find a general way to automatically semantically chunk various legislation for a while now. Partly so as to diff various versions/amendments, but also to graph connections to other referenced legislation.
Most of the time I end up having to just take half an hour to manually regex and format plain text.
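For the diffing half at least, the standard library gets you surprisingly far once you have plain text. A sketch using difflib (the file labels are placeholders):

```python
import difflib

def diff_versions(old, new, fromfile="as-made", tofile="amended"):
    """Produce a unified diff between two plain-text versions of a provision."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=fromfile,
        tofile=tofile,
    ))
```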
A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured. They are usually hosted on the relevant government entity's website as PDFs.
One last question/comment, have you considered adding some additional reference info like the federal list of entities?[1]
> A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured. They are usually hosted on the relevant government entity's website as PDFs.
It's possible that they're in my database. I have included the as-made version of all bills on the Federal Register of Legislation. However, if they haven't had a first reading yet, then probably not.
For processing PDFs, I recommend using `pdfplumber`, which is what I used to build the Corpus. Happy to discuss further if you'd like.
> One last question/comment, have you considered adding some additional reference info like the federal list of entities?
Do you mean adding additional metadata? At the moment, I've kept the number of metadata attributes as low as possible. Every attribute added equates to more work to keep it standardised across all the jurisdictions and document types. My plan is to slowly add more attributes as I have time. I'd really like to associate a date with documents but even that is a hurdle. I have to decide what date should be the date of a document (is it the time it was issued, the time it was published, the time it came into force, the time the latest version was issued, etc... and what happens when a document doesn't have a date? should I extract it from its citation? how do I preserve time zone information? etc...).
I've used a number of pdf libraries in python and C# over the years, none have worked reliably as needed (that's just pdf I guess), but haven't used pdfplumber, I'll be sure to give it a go, thanks for the suggestion.
Yes, additional metadata. Totally understand it adds in a lot of complexity but could help for fine-tuning an LLM.
With regards to dates, I'm not a lawyer, but for Federal legislation I would go with "Start Date"; it's always the day following the End Date of the previous compilation. The Date of Assent (well, the year at least) is in the title, and is also the first start date. The registration date can be either before or after the start date, depending. [1][2]
The tricky part is when sections have different commencement dates that are detailed in the text. I don't know anywhere that is easily accessible. And, if you think about it, usually the most important information for say businesses being regulated.
I wouldn't worry about time zones per se; they're relative to each particular state.[3] i.e. that's why polling in a federal election closes at 6pm in each state rather than being coordinated with the ACT.
Are you open to pushing one of your Work in Progress(WIP) models to https://ollama.ai/library to show off and let others try it out and provide feedback?
Unfortunately, I haven't been able to train a model on my database just yet (although I have seen great results using it as a RAG data source for GPT-4). I'll keep ollama in mind once I have something to share.
Australia has had free, searchable collections of Australian Law for 25+ years. Austlii is a prime example. There are Federal and State collections as well.
The author was conscientious enough to read the scraping policies (or was blocked by anti-scraping tools) rather than simply feeding one of these sites into his LLM.
As I point out in the introduction, there are a few free-to-access legal databases available in Australia, but none are truly open in the sense of being free from copyright restrictions. Neither AustLII nor Jade are licensed under an open source copyright licence such as CC BY 4.0 (which is what the majority of my Corpus is under).
Love what you're doing! Being able to more easily bring LLMs and other AI in will democratise the law quite a bit. Agreed that even though Austlii exists, it needs to be under a creative commons license, and it takes doing the legwork and dealing with bureaucrats to get it there
This is cool and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?
Yeah I was also disappointed to hear that. Afaik (I asked someone who previously worked in an adjacent field), it sounds like there’s no central system for publishing the judgements, it’s all published by each individual courthouse in a way that suits them, so lots of tedious individual scraping would be involved I’d imagine.
Good work reaching out to, and trying to get along with Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many people in charge of operating these systems should not be.
I'm floored by what you accomplished in a year. Here in Canada, case law is under an iron grip by Canlii, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing - it's refreshing to hear that one person can make a difference in a very similar setting (Australia)
Nice, but I have to ask how does it compare with https://austlii.edu.au/, especially for completeness?
The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public at least.)
Trivia, the only reason I found out about it was when I did some work for an Aus govt agency and found out that they shared their web site with austlii! This was back in the early 2000s.
So AustLII and Jade are what I was referring to when I said "While there were certainly a few free-to-access legal databases, none were truly open ...". They are free but not open, in the sense of being licensed under an open source licence and being free to download in a raw format. Whereas everything in my corpus except decisions from the Federal Court of Australia is licensed under CC BY 4.0. And even decisions from the FCA are under a licence that permits both non-commercial and commercial usage.
In terms of completeness, however, AustLII and Jade win out. They seem to have almost everything if not everything. Their data is also much richer than mine. I must give props to AustLII for how they're able to hyperlink terms defined within legislation to their definition. I think they're an invaluable resource for members of the public.
The audience of my database was more so those who want to play around with raw legal data and want to feel secure that they are not breaching any laws in the process. The fact that it is stored in plain text is also beneficial for anyone trying to build ML models that only accept raw text.
In my experience, one of the devilish details is continuously keeping such a database updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.
Absolutely. Luckily, the websites I scrape don't change very often (ie, every 5-10 years based on the Wayback Machine), and a couple of them use the same underlying legislation management system and APIs.
Ultimately, however, I’m hoping that in the long term, the Australian Government will see the use in this project and decide to maintain it.
It actually looks quite clean. Certainly a lot better than some of our legal databases. I guess the only suggestion I'd have is that it seems like you can have an account with the website, but there's no login or register button on the front page (or maybe I'm just not seeing it?).
One other point, and this is not specific to CanLII (I haven't checked whether this is the case) but I've seen that a lot of legislation databases have poor SEO. In fact, AustLII is usually ranked higher than governments' own websites when searching for laws. I think it's something important to get right because a lot of people just use Google/Bing/Kagi to search websites nowadays rather than using internal search engines.
GitHub might be difficult as they impose constraints on the size of repositories, and the Corpus is around 5GB. The Internet Archive is a good idea, however, I'll have a look into that. I've also been thinking about sticking it on Kaggle to increase its reach.
You could also consider one or more of the scientific data repositories like Zenodo, FigShare, DataDryad, etc. 5GB is small potatoes for those folks and they have serious data retention policies. As a bonus, they'll also allocate you a citable DOI.
It would be lovely, I think, if you used ML to help people ask questions, so the law becomes more accessible and they can understand what lots of things mean.
This could be applied to multiple countries around the world. The world's laws at your fingertips.
Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to re-licence it as CC BY 4.0?
The majority of the source data was already licensed under CC BY 4.0. Additionally, the Corpus, as a work constituting a curated and post-processed collation of other works, is also licensed under CC BY 4.0.
There is! It’s called the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law), and it’s actually what inspired this project. The only caveat is that it doesn’t appear to be regularly updated, and so is more of a snapshot of US law rather than a semi-live copy. Also I’m not entirely sure how comprehensive it is (ie, whether there’s anything missing). For those interested in building a true US equivalent of my corpus, I think it could be a great starting point, particularly since they published the code they used to build it.
The Feds have good sources for all the various admin code/statutes/slip laws/etc, but there's not a great unified source for case law.
There’s nothing at the state level right now. I’ve been considering setting up a statute scraper under the openstates umbrella, but it’s a bit of a daunting project to start. Lots of yeoman’s work parsing gnarly websites or evading Lexis scraper protections.
no, there's a lot that we need before we can even begin to improve this
even with a database of current laws as they exist right now, the laws to change them primarily come in 2 forms:
1. verbatim additional laws
2. instructions that are essentially diffs to the current law: what words to change or strike out, sections to re-arrange and modify, as well as entirely new lines. these have to be spliced into the prior state of the law
and after we have all that, laws often follow different logic. like logic gates. one set of laws may be using "and" as a set of conditions that must all be satisfied at once, but it could also be using "and" as an "exclusive or", a set of conditions where only one has to be satisfied. but when writing it, those things all flowed grammatically, and harmonization of laws wasn't prioritized.
there's a whole lot that can be improved that we don't have the infrastructure to do just yet. someone could do it, but that's the first step.
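as a toy illustration of the second form above — amendments written as diffs — splicing a simple "omit X, substitute Y" instruction into consolidated text might look like this (purely hypothetical helper; real amendments target specific sections and positions, not the whole text):

```python
import re

def apply_amendment(text, omit, substitute):
    """Splice a simple 'omit X, substitute Y' amendment into a provision,
    replacing whole-word occurrences only."""
    return re.sub(rf"\b{re.escape(omit)}\b", substitute, text)
```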
You can find my database on HuggingFace (https://huggingface.co/datasets/umarbutler/open-australian-l...) and the code used to create it on GitHub (https://github.com/umarbutler/open-australian-legal-corpus-c...).