Show HN: how I built the largest open database of Australian law (umarbutler.com)
172 points by ubutler on Oct 29, 2023 | 61 comments


Hey HN! Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.

In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go on a year-long journey of trying to find the right data!

You can find my database on HuggingFace (https://huggingface.co/datasets/umarbutler/open-australian-l...) and the code used to create it on GitHub (https://github.com/umarbutler/open-australian-legal-corpus-c...).


Fantastic work, and really appreciate the write up. It's quite timely for me - I'm from a tech background and have just started studying Australian law, and was thinking about doing exactly this - so you are years ahead of me :).

Just one note - the link in your Github readme to https://umarbutler.com/open-australian-legal-corpus doesn't seem to go anywhere.

For someone interested in using the data (and helping out with bugs/issues), where would you suggest starting?


> Just one note - the link in your Github readme to https://umarbutler.com/open-australian-legal-corpus doesn't seem to go anywhere.

Thanks for the heads up! I've fixed that now.

> For someone interested in using the data (and helping out with bugs/issues), where would you suggest starting?

I think the best place to start is by downloading the Corpus (visit https://huggingface.co/datasets/umarbutler/open-australian-l... , then click "Files and versions" and then "corpus.jsonl"). You can then use my Python library orjsonl to parse the dataset (you'd run `corpus = orjsonl.load('corpus.jsonl')`). At that point, there's any number of applications you could use the dataset for. You could pretrain a model like BERT, ELECTRA, etc... and share it on HuggingFace. You could connect the dataset to GPT and do RAG over it. Etc...
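For example, here's a minimal sketch of getting started (the 'citation' and 'text' field names are assumptions for illustration; check the dataset card for the actual schema):

    import orjsonl

    # Load the entire corpus into memory as a list of dicts.
    # The file is around 5GB, so make sure you have the RAM for it.
    corpus = orjsonl.load('corpus.jsonl')

    # 'citation' and 'text' are assumed field names for illustration.
    for document in corpus[:3]:
        print(document['citation'], len(document['text']))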


Hey Umar,

Fantastic work here! I was griping to my team just last night about how painful developing a chunking strategy for Australian legislation is: while there's (generally) layout consistency within a piece of legislation, that's not true across pieces of legislation... so I can imagine the pain of trying to collate legislation across jurisdictions.

I've reached out via your LinkedIn profile - would be great if there was an opportunity to collaborate.


> Fantastic work here! I was griping to my team just last night about how painful developing a chunking strategy for Australian legislation is: while there's (generally) layout consistency within a piece of legislation, that's not true across pieces of legislation... so I can imagine the pain of trying to collate legislation across jurisdictions.

Absolutely, there's a lack of consistency even within the same jurisdiction and document type. It only gets worse once you want to add multiple jurisdictions and different types of documents. My best strategy so far has been recursive chunking, where you begin splitting at the largest runs of newlines. Ideally, though, you want some form of semantic chunking where you already know which parts of the document represent Parts, Divisions, Schedules, Sections, Sub-sections, etc...
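For those curious, here's a minimal sketch of that recursive approach, assuming arbitrary separators and an arbitrary chunk size (not the exact code used for the Corpus):

    # Recursive chunking: split on the largest run of newlines first,
    # falling back to progressively weaker separators for oversized pieces.
    SEPARATORS = ['\n\n\n', '\n\n', '\n', ' ']

    def chunk(text: str, max_chars: int = 2000, depth: int = 0) -> list[str]:
        if len(text) <= max_chars:
            return [text]
        if depth == len(SEPARATORS):
            # No separators left: hard-split as a last resort.
            return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        chunks = []
        for piece in text.split(SEPARATORS[depth]):
            chunks.extend(chunk(piece, max_chars, depth + 1))
        return chunks

A production version would also merge adjacent small pieces back together so chunks stay close to the size limit.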

> I've reached out via your LinkedIn profile - would be great if there was an opportunity to collaborate.

Great! Always happy to connect.


Yes, this. I've been trying to find a general way to automatically semantically chunk various legislation for a while now. Partly so as to diff various versions/amendments, but also to graph connections to other referenced legislation.

Most of the time I end up having to just take half an hour to manually regex and format plain text.

A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured; they are usually hosted on the relevant government entity's website as PDFs.

One last question/comment, have you considered adding some additional reference info like the federal list of entities?[1]

[1] https://www.finance.gov.au/government/managing-commonwealth-...


> A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured; they are usually hosted on the relevant government entity's website as PDFs.

It's possible that they're in my database. I have included the as-made version of all bills on the Federal Register of Legislation. However, if they haven't had a first reading yet, then probably not.

For processing PDFs, I recommend using `pdfplumber`, which is what I used to build the Corpus. Happy to discuss further if you'd like.
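As a minimal sketch of both points together (the file names here are hypothetical, and this isn't the exact pipeline I used for the Corpus):

    import difflib
    import pdfplumber

    def extract_lines(path: str) -> list[str]:
        """Extract plain text from a PDF, page by page."""
        with pdfplumber.open(path) as pdf:
            text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
        return text.splitlines()

    # Diff two releases of a draft bill to surface the changed passages.
    old = extract_lines('draft_v1.pdf')
    new = extract_lines('draft_v2.pdf')

    for line in difflib.unified_diff(old, new, fromfile='v1', tofile='v2', lineterm=''):
        print(line)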

> One last question/comment, have you considered adding some additional reference info like the federal list of entities?

Do you mean adding additional metadata? At the moment, I've kept the number of metadata attributes as low as possible. Every attribute added equates to more work to keep it standardised across all the jurisdictions and document types. My plan is to slowly add more attributes as I have time. I'd really like to associate a date with documents, but even that is a hurdle. I have to decide what the date of a document should be (is it the time it was issued, the time it was published, the time it came into force, the time the latest version was issued, etc.?), what happens when a document doesn't have a date (should I extract it from its citation?), how to preserve time zone information, etc...


I've used a number of PDF libraries in Python and C# over the years, and none have worked as reliably as needed (that's just PDF, I guess), but I haven't used pdfplumber. I'll be sure to give it a go, thanks for the suggestion.

Yes, additional metadata. I totally understand it adds a lot of complexity, but it could help for fine-tuning an LLM.

With regard to dates (not a lawyer, but) for Federal I would go with the "Start Date"; it's always the day following the End Date of the previous compilation. The Date of Assent (well, the year at least) is in the title, and is also the first start date. The registration date can be either before or after the start date, depending. [1][2]

The tricky part is when sections have different commencement dates that are detailed in the text. I don't know of anywhere that information is easily accessible. And, if you think about it, it's usually the most important information for, say, the businesses being regulated.

I wouldn't worry about time zones per se; it's relative to each particular state [3], which is, for example, why polling in a federal election closes at 6pm in each state rather than being coordinated with the ACT.

[1] Section 12 of the Legislation Act 2003: https://www.legislation.gov.au/Details/C2023C00213

[2] Section 4 of the Acts Interpretation Act 1901: https://www.legislation.gov.au/Details/C2023C00213

[3] Section 37 of the Acts Interpretation Act 1901: https://www.legislation.gov.au/Details/C2023C00213


How can you be contacted? I would like to sponsor a project that does the same for smaller jurisdictions. You can email me at my username at yahoo.com.


Are you open to pushing one of your work-in-progress (WIP) models to https://ollama.ai/library to show it off and let others try it out and provide feedback?


Unfortunately, I haven't been able to train a model on my database just yet (although I have seen great results using it as a RAG data source for GPT-4). I'll keep ollama in mind once I have something to share.


Just FYI my work's network has blocked your site as "Malicious"

(Symantec Endpoint Protection chrome extension)


Weird. Thanks for the heads up. I'm using SiteGround to host it, which is shared hosting, so perhaps that's why? I'll have to investigate.


Awesome work!


Australia has had free, searchable collections of Australian law for 25+ years; AustLII is a prime example. There are Federal and State collections as well. The author was either conscientious enough to heed the scraping policies, or was blocked by anti-scraping tools, from feeding one of these sites into his LLM.


As I point out in the introduction, there are a few free-to-access legal databases available in Australia, but none are truly open in the sense of being free from copyright restrictions. Neither AustLII nor Jade are licensed under an open source copyright licence such as CC BY 4.0 (which is what the majority of my Corpus is under).


Love what you're doing! Being able to more easily bring LLMs and other AI in will democratise the law quite a bit. Agreed that even though Austlii exists, it needs to be under a creative commons license, and it takes doing the legwork and dealing with bureaucrats to get it there


This is cool and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?


Yeah, I was also disappointed to hear that. Afaik (I asked someone who previously worked in an adjacent field), it sounds like there's no central system for publishing the judgements; it's all published by each individual courthouse in a way that suits them, so lots of tedious individual scraping would be involved, I'd imagine.


Victoria is notoriously difficult to work with. Lots of chauvinism/exceptionalism and copyright squatting.


Great work and congratulations on your tenacity dealing with bureaucrats. Open access and machine readable formats should be widely available.


Good work reaching out to, and trying to get along with Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many people in charge of operating these systems should not be.


I'm floored by what you accomplished in a year. Here in Canada, case law is under an iron grip by Canlii, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing - it's refreshing to hear that one person can make a difference in a very similar setting (Australia)


Nice, but I have to ask how it compares with https://austlii.edu.au/, especially for completeness.

The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public, at least).

Trivia: the only reason I found out about it was when I did some work for an Aus govt agency and discovered that they shared their website with AustLII! This was back in the early 2000s.


So AustLII and Jade are what I was referring to when I said "While there were certainly a few free-to-access legal databases, none were truly open ...". They are free but not open in the sense of being licensed under an open source licence and being free to download in a raw format. Whereas everything in my corpus except decisions from the Federal Court of Australia is licensed under CC BY 4.0. And even decisions from the FCA are under a licence that permits both non-commercial and commercial usage.

In terms of completeness, however, AustLII and Jade win out. They seem to have almost everything if not everything. Their data is also much richer than mine. I must give props to AustLII for how they're able to hyperlink terms defined within legislation to their definition. I think they're an invaluable resource for members of the public.

The audience for my database is more those who want to play around with raw legal data and want to feel secure that they are not breaching any laws in the process. The fact that it is stored in plain text is also beneficial for anyone trying to build ML models that only accept raw text.


In my experience, one of the devilish details is continuously keeping such a database updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.


Absolutely. Luckily, the websites I scrape don’t change very often (ie, every 5-10 years based on the Wayback Machine), and a couple of them use the same underlying legislation management system and APIs.

Ultimately, however, I’m hoping that in the long term, the Australian Government will see the use in this project and decide to maintain it.


What do you think of the Canadian legal case law website CanLII? What could it do better, or do you think it's done well?

Is it overdue for innovation?


It actually looks quite clean. Certainly a lot better than some of our legal databases. I guess the only suggestion I'd have is that it seems like you can have an account with the website, but there's no login or register button on the front page (or maybe I'm just not seeing it?).

One other point, and this is not specific to CanLII (I haven't checked whether this is the case) but I've seen that a lot of legislation databases have poor SEO. In fact, AustLII is usually ranked higher than governments' own websites when searching for laws. I think it's something important to get right because a lot of people just use Google/Bing/Kagi to search websites nowadays rather than using internal search engines.


Like you're saying that there's no place to register on the front page?


Interesting, https://www.canlii.org/en/ looks a bit like https://austlii.edu.au/

They probably copied from or cooperated with each other, which is common among Commonwealth countries.


Correct. They are a member of WorldLII which was founded by AustLII: http://www.worldlii.org/


Any other closing arguments? ;)


Would it be worth getting your corpus replicated into other venues as well, such as the Internet Archive or on GitHub itself?


GitHub might be difficult, as they impose constraints on the size of repositories and the Corpus is around 5GB. The Internet Archive is a good idea, however; I’ll have a look into that. I’ve also been thinking about sticking it on Kaggle to increase its reach.


You could also consider one or more of the scientific data repositories like Zenodo, FigShare, DataDryad, etc. 5GB is small potatoes for those folks and they have serious data retention policies. As a bonus, they'll also allocate you a citable DOI.


And thanks again for the Zenodo clue!

I now have my first two DOIs, one (data) via Dryad and one (code) via Zenodo.


Thanks for the suggestions! Distributing the Corpus widely will be my next focus.


This is DOI, because I just learned about it:

https://www.doi.org/


Woot! Thanks for the DataDryad info. I am lodging a tiny test data set with them now...


There are also national and university data repositories that might be interested, and for which 5GB is not even noticeable!


I think it would be lovely if you used ML to help people ask questions, so that the law is more accessible and they can understand what lots of things mean.

This could be applied to multiple countries around the world. The world's laws at your fingertips.

It’s an interesting concept


So good! It's crazy how legal information is such a spread-out mess.

What's worse is that git is such a perfect solution for legislation.


The French constitution has been on github for quite a while.

https://github.com/legifrance

Probably others, haven't looked.


There's a bit of a difference between a few pages of founding documents and a complete history of precedent setting legal cases.


Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to re-licence it as CC BY 4.0?


The majority of the source data was already licensed under CC BY 4.0. Additionally, the Corpus, as a work constituting a curated and post-processed collation of other works, is also licensed under CC BY 4.0.


Incredible effort.

These types of projects have the potential to influence a nation.


I think in law-related subjects there is huge potential for digitalisation. In Germany, the law texts are online, but the paragraphs are not linked.


Great work, thank you so much.

Fwiw, for getting formatted text from HTML, did you try:

    lynx -dump url >> file.plaintext


That's really neat. Such a shame VIC couldn't be included though.


Insanely great. Amazing work.


Is there a U.S. equivalent?


There is! It’s called the Pile of Law (https://huggingface.co/datasets/pile-of-law/pile-of-law), and it’s actually what inspired this project. The only caveat is that it doesn’t appear to be regularly updated, and so is more of a snapshot of US law rather than a semi-live copy. Also, I’m not entirely sure how comprehensive it is (ie, whether there’s anything missing). For those interested in building a true US equivalent of my corpus, I think it could be a great starting point, particularly since they published the code they used to build it.


The Feds have good sources for all the various admin code/statutes/slip laws/etc., but there’s not a great unified source for case law.

There’s nothing at the state level right now. I’ve been considering setting up a statute scraper under the openstates umbrella, but it’s a bit of a daunting project to start. Lots of yeoman’s work parsing gnarly websites or evading Lexis scraper protections.


No, there's a lot that we need before we can even begin to improve this.

Even with a database of current laws as they exist right now, the laws that change them primarily come in two forms:

1. Verbatim additional laws.

2. Instructions that are essentially diffs to the current law: what words to change or strike out, which sections to rearrange and modify, as well as new lines of code. These have to be spliced into the prior state of the law (see the sketch at the end of this comment).

And after we have all that, laws often follow different logic, like logic gates. One set of laws may use "and" for a set of conditions that must all be satisfied together, but another could use "and" as an "exclusive or", a set of conditions where only one has to be satisfied. When they were written, those things all flowed grammatically, and harmonization of laws wasn't prioritized.

There's a whole lot that can be improved that we don't have the infrastructure to do just yet. Someone could do it, but that's the first step.
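As a toy illustration of point 2, an amendment like "omit X, substitute Y" is really a patch that has to be applied to the prior text (the instruction format below is invented for illustration):

    # A hypothetical, simplified amendment format: each instruction names
    # a section, the words to omit and the words to substitute.
    amendments = [
        {'section': '5(1)', 'omit': '12 months', 'substitute': '2 years'},
    ]

    law = {'5(1)': 'The licence remains in force for 12 months.'}

    # Splice each amendment into the prior state of the law.
    for a in amendments:
        law[a['section']] = law[a['section']].replace(a['omit'], a['substitute'])

    print(law['5(1)'])  # -> The licence remains in force for 2 years.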


PRO (Public.Resource.Org) is a prominent player in the space.

https://public.resource.org/

Notably, it’s thanks to them that in 2020 the Supreme Court ruled Georgia’s legal code, including annotations, is uncopyrightable.


Govinfo.gov and the House provide the entire US law corpus, but it's weird.


Nice. Thanks for sharing.


Well done!



