I worked on a tool that infers a regex from sample strings, by building and then reducing an NDFA.
The NDFA->regex construction worked by "collapsing" a node in the NDFA. Choose a victim node. Construct a Kleene star for all "self loops." Lastly, form all triples of (incoming, self-loop, outgoing) edges; these become the labels of new edges. Now you can discard the victim node. Repeat until you're down to a single start -> goal node, and you're done (the edges just become alternations).
Surprisingly the length of a regex would vary dramatically by the choice of nodes to collapse. Finding the minimum regex was an unexpected challenge. Has anyone explored regexp minimization?
Incidentally it's not the case that Thompson and subset construction "are the basis of every lexer and regexp engine there is." It's not even true of the regexp engine inside the browser you're using now!
>frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible. It is available as a command line utility and for the browser as a JavaScript library.
This sounds exactly like what I was hoping to learn more about.
I wanted to create an IRC client where I would add/remove modules at runtime. These modules might respond to things said in-chat. As an example, you might have a greeter module that responds to common things like "hi" and "hello". The module would register expressions for things it triggers on. In parallel, all these modules would have registered expressions. Instead of testing all these expressions in series, I wanted to feed these to some code that would decompose and reduce them into a simplified expression. It'd be like if you arranged the regexps in a nested series of if-conditions.
Finding an appropriate middle ground for testing expensive regular expressions was the objective of all that, but I didn't get very far as I lost myself in Lua's LPeg project.
What a beautiful goddamn project. So unique and amazing.
I am not sure this is what you are looking for, but I have been working on Aude (AUtomata DEmystifier), an open source pedagogical application targeted at CS teachers and students that works in browsers without installation (but one can download and run it offline too).
The aim is to visualise and manipulate automata, including conversions between them and regular expressions. Happy to take feedback!
Well, I was a student and needed to practice the related lesson, "Languages and Automata", before the final exam. So I implemented the algorithms of the lesson and used Graphviz to render the result. The thing worked in a browser but ran on the server (using D!). I figured my fellow students or the teacher may find it useful, so I sent the link to the teacher (in a post-scriptum of a very long mail in which I was asking for help, ah ah).
"Your tool seems really interesting, what about you continue developing this during an internship this summer?"
Hell, yes. And Aude started.
Since then, I was lucky to mentor several interns to work on Aude with this teacher.
The NDFA->regex construction worked by "collapsing" a node in the NDFA. Choose a victim node. Construct a Kleene star for all "self loops." Lastly, form all triples of (incoming, self-loop, outgoing) edges; these become the labels of new edges. Now you can discard the victim node. Repeat until you're down to a single start -> goal node, and you're done (the edges just become alternations).
Surprisingly the length of a regex would vary dramatically by the choice of nodes to collapse. Finding the minimum regex was an unexpected challenge. Has anyone explored regexp minimization?
Incidentally it's not the case that Thompson and subset construction "are the basis of every lexer and regexp engine there is." It's not even true of the regexp engine inside the browser you're using now!