Regex are quite simple and useful, but my only issue is with those recursive things. Like, how do you match balanced brackets? I have a regex (PCRE) copy-pasted for it, but for the life of me I don't get it, or maybe I nod my head but instantly un-understand it. I wish there was a simple-to-understand doc that teaches me how I can match something like:
"(this is inside a bracket (and this is nested or (double nested)))"
P.S. I know token parsing is better for these things but still I just want to learn the other thing too.
Balanced parentheses are not a regular language, so it's theoretically impossible to match them with regular expressions.
In practice, most regexp implementations you see are more powerful than regular expressions. For instance, .NET has a balancing groups feature [0] for exactly this use case.
$str = "(this is inside a bracket (and this is nested or (double nested)))";
do {
    preg_match_all('~\(((?:[^\(\)]++|(?R))*)\)~', $str, $matches);
    echo $str = $matches[1][0] ?? '', "\n";
} while($str);
Outputs this [1]:
> this is inside a bracket (and this is nested or (double nested))
> and this is nested or (double nested)
> double nested
You're right that there is more processing involved (e.g. while loop) but I still don't understand this part
First, the "~" characters aren't really part of the regular expression. As far as I can tell, they are delimiters that mark where the pattern starts and stops. Often you will see "/" used for this purpose.
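If it helps, here is a small sketch of the delimiter point. The test string is made up; the only thing it is meant to show is that the same pattern behaves identically with "~" or "/" around it:

```php
<?php
// The same pattern, "one or more digits", written with two different delimiters.
$subject = 'order 42 shipped';

preg_match('~\d+~', $subject, $withTilde);
preg_match('/\d+/', $subject, $withSlash);

// Both calls find the same text; the delimiter is not part of the pattern.
var_dump($withTilde[0] === $withSlash[0]); // bool(true)
echo $withTilde[0], "\n";                  // 42
```

Picking "~" is mostly a convenience when the pattern itself contains slashes, since the delimiter character would otherwise need escaping.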
Next is:
\( ... \)
This matches a pattern that starts with the literal character '(' and ends with ')', where what comes between them matches the elided portion. Since parentheses have special meaning in regex, we need to escape these characters.
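A quick way to try this on its own, as a sketch: here ".*" is just my stand-in for the elided middle part, and the test strings are invented.

```php
<?php
// \( and \) match literal parentheses; .* stands in for the elided middle part.
$pattern = '~\(.*\)~';

var_dump(preg_match($pattern, 'see (this) please', $m)); // int(1)
echo $m[0], "\n";                                        // (this)

// No parentheses in the subject, so nothing matches.
var_dump(preg_match($pattern, 'no brackets here'));      // int(0)
```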
Continuing our way inward, we see:
( ... )
Which are non-escaped parentheses. This is a pattern group, and is used to treat the pattern within it as a single unit. For example, the pattern "ab*" would match "abbb" but not "ababab" (as a whole), because the "*" (repeat) modifier only applies to "b". However, "(ab)*" matches "ababab" but not "abbbb". In this case there is no modifier, so these parentheses have no effect on what string matches the overall expression. However, many implementations also use parentheses to define matching groups, which means they will return whatever is captured within the parentheses as a match. Essentially, the pattern of:
\(( ... )\)
means, find a string that starts with '(' and ends with ')', and pull out everything in the middle.
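Both roles of plain parentheses (grouping for a quantifier, and capturing) can be checked with a short sketch. The "^" and "$" anchors are my addition so that the patterns have to match the whole string, and ".*" again stands in for the elided middle:

```php
<?php
// Grouping changes what a quantifier applies to. ^ and $ force a whole-string match.
var_dump(preg_match('~^ab*$~', 'abbb'));     // int(1): * repeats only "b"
var_dump(preg_match('~^ab*$~', 'ababab'));   // int(0)
var_dump(preg_match('~^(ab)*$~', 'ababab')); // int(1): * repeats the group "ab"
var_dump(preg_match('~^(ab)*$~', 'abbb'));   // int(0)

// Unescaped parentheses also capture: $m[0] is the whole match, $m[1] the group.
preg_match('~\((.*)\)~', 'call (me maybe) later', $m);
echo $m[0], "\n"; // (me maybe)
echo $m[1], "\n"; // me maybe
```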
Next comes a similar construct:
(?:...)*
There are 2 things going on here. This matches whatever is being elided by ..., however the library does not return it as a separate result. This is used when you need to group things together within a regular expression, but do not want that specific grouping returned as part of the result. The "*" here means that the entire pattern can be matched any number (including 0) of times, and should be matched as many times as possible.
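To see the difference between a capturing and a non-capturing group in the returned array, here is a minimal sketch with a made-up subject string:

```php
<?php
$subject = 'ababab!';

// Capturing group: the group's last repetition is returned as $capturing[1].
preg_match('~(ab)*!~', $subject, $capturing);
// Non-capturing group: same overall match, but no extra entry in the result.
preg_match('~(?:ab)*!~', $subject, $nonCapturing);

print_r($capturing);    // [0] => ababab!  [1] => ab
print_r($nonCapturing); // [0] => ababab!
```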
Next is
[^\(\)]
The square brackets indicate that you should match any character within a particular set. The "^" at the beginning of the square brackets means that you are inverting the selection, so you will match any character except those specified. The remaining characters are parenthesis literals.
The first "+" indicates that the pattern should match 1 or more of the previous entity. In the case of [^\(\)]+, this would mean that it can match one or more non-parenthesis characters.
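A short sketch of the negated set with a single "+", using invented test strings. It also shows, as far as I know, that the backslashes are optional inside square brackets:

```php
<?php
// [^\(\)]+ grabs a run of one or more characters that are not parentheses.
preg_match('~[^\(\)]+~', 'aaa(bbb)', $m);
echo $m[0], "\n"; // aaa

// Inside a character class the backslashes are optional; [^()] means the same.
preg_match('~[^()]+~', 'aaa(bbb)', $m2);
echo $m2[0], "\n"; // aaa

// With + (one or more) there is no match at all if every character is a parenthesis.
var_dump(preg_match('~[^()]+~', '(())')); // int(0)
```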
The second "+" is different. Since a quantifier cannot apply to another quantifier, the above meaning does not apply, and the language was free to overload the symbol. In PCRE, a "+" after a quantifier makes it possessive: a plain "+" is already greedy (it consumes as many characters as possible, e.g. everything up to the next parenthesis), and "++" additionally never gives any of those characters back when the engine backtracks. I don't think this is technically needed in this case, but it probably improves efficiency.
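In PCRE terms, "++" is a possessive quantifier: it grabs as much as a plain "+" would, but refuses to hand anything back afterwards. A sketch with made-up digit strings makes the difference visible, and the last part checks that in the bracket case both forms find the same text:

```php
<?php
// Plain +: \d+ first takes "12345", then gives back "5" so the literal 5 can match.
var_dump(preg_match('~^\d+5~', '12345'));  // int(1)

// ++: \d++ takes "12345" and never gives anything back, so the literal 5 fails.
var_dump(preg_match('~^\d++5~', '12345')); // int(0)

// In the bracket regex nothing after [^()]++ could ever use one of those
// characters, so + and ++ match the same text there.
preg_match('~[^()]+\)~', 'abc)', $plain);
preg_match('~[^()]++\)~', 'abc)', $possessive);
var_dump($plain[0] === $possessive[0]);    // bool(true)
```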
The next component is "|", which means to match either the pattern on the left, or the right.
The next step is not a regular expression in the strict sense, but one of those "more powerful" additions I mentioned. (?R) is a recursive match, and matches whatever the overall expression matches. E.g., when your expression runs into a nested parenthesis, it recurses and parses the substring as a balanced parenthesis string.
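Here is a sketch of the recursion by itself, without the capturing group, on two invented strings. The second one has an unclosed outer bracket, to show what the engine falls back to:

```php
<?php
// \((?:[^()]++|(?R))*\) : "(", then any mix of plain text or a nested
// balanced group (the pattern calling itself), then ")".
$pattern = '~\((?:[^()]++|(?R))*\)~';

preg_match($pattern, 'x (a (b) c) y', $m);
echo $m[0], "\n"; // (a (b) c)

// An unclosed outer bracket cannot match as a whole, so the engine
// settles for the first balanced piece it can find further in.
preg_match($pattern, '(a (b) c', $m2);
echo $m2[0], "\n"; // (b)
```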
Putting this all together (ignoring whitespace and adding comments, which most major regex engines have an option to allow):
\( #Start with an open parenthesis
( #This is the beginning of the region I want to extract
(?: #Group the following pattern together, but don't save the matching substring
[^\(\)]++ #Match until a parenthesis character, assuming that would match at least 1 character
| #Or
(?R) #Match a string with balanced parentheses (assuming that is what the overall regex does)
)* #Repeat the preceding pattern as many times as necessary
) #End the region I want to extract
\) #The next character should be a close parenthesis
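In PHP the option in question is the "x" modifier after the closing delimiter. Below is a sketch of the same pattern in that form; the comment wording is shortened, and I put it in a nowdoc so that nothing inside needs PHP-level escaping:

```php
<?php
// The x modifier after the closing delimiter makes the engine ignore
// whitespace in the pattern and treat # to end-of-line as a comment.
$pattern = <<<'REGEX'
~
\(                 # opening parenthesis
(                  # start of the captured region
  (?:              # group without capturing
    [^\(\)]++      # a run of non-parenthesis characters
    |              # or
    (?R)           # a nested balanced group
  )*               # repeated as often as needed
)                  # end of the captured region
\)                 # closing parenthesis
~x
REGEX;

preg_match($pattern, '(aaa(bbb))', $m);
echo $m[0], "\n"; // (aaa(bbb))
echo $m[1], "\n"; // aaa(bbb)
```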
Looking at an example of this:
(aaa(bbb))
First, we match "(". Then we try to match (?:[^\(\)]++|(?R))* as a matching group.
This matches [^\(\)]++|(?R) as many times as necessary.
At this point, our remaining string is "aaa(bbb))".
Since the pattern we are matching this against is an "|" pattern, we have 2 options: we can either match against [^\(\)]++, which would match "aaa", or we could match against (?R), which would fail, since the first character is not '('. As such, we match "aaa". Since this grouping was defined using (?:) instead of (), we do not save "aaa" as a separate result.
Next, since the group is modified by "*", we can either match another instance of it, or move on to match the closing ")". The next character is not ')', so our only option is to match another instance of "[^\(\)]++|(?R)".
At this point the remaining string is "(bbb))", so [^\(\)]++ fails to match, since it requires at least one character before the '('. However, now (?R) works and matches "(bbb)".
Now our remaining string is ")" and our options are again to match either "[^\(\)]++|(?R)", or ')'. At this point, neither [^\(\)]++ nor (?R) work, so the only option is to leave the repetition and match the closing ')'.
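If it helps to see those steps as ordinary code, here is a rough hand-written equivalent. To be clear, matchBalanced is a hypothetical helper I made up for illustration, not how PCRE works internally; it only mirrors the walk above for a match that has to start at a given position.

```php
<?php
// Hypothetical helper, not part of PCRE: tries to match one balanced group
// starting at $pos and returns the position just past its ")" or null.
function matchBalanced(string $s, int $pos): ?int
{
    if (($s[$pos] ?? '') !== '(') {
        return null;                          // \( fails
    }
    $pos++;
    while (true) {
        $run = strcspn($s, '()', $pos);       // length of [^()]++ at $pos
        if ($run > 0) {
            $pos += $run;                     // left branch matched
            continue;
        }
        $next = matchBalanced($s, $pos);      // right branch: (?R)
        if ($next === null) {
            break;                            // neither branch: leave the * loop
        }
        $pos = $next;
    }
    return (($s[$pos] ?? '') === ')') ? $pos + 1 : null; // \)
}

var_dump(matchBalanced('(aaa(bbb))', 0)); // int(10)
var_dump(matchBalanced('(aaa(bbb)', 0));  // NULL
```

The while loop plays the role of the "*", the two attempts inside it are the two sides of the "|", and the recursive call is the (?R).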
Wow thanks for explaining it to me so wonderfully. Your explanation for the double ++ really helped me since that part never made sense to me before. I guess the ?R probably only works with PHP? I will try to make some more examples for the ?R to try out today so I can learn the full power of it.
Again I'm so grateful to you for the explanation. One more thing I've learned from it is next time a regex makes my head explode, I'll just break each character in one line and write a comment next to it!
I guess I don't understand. Mind throwing up an example with multiple test strings on regex101.com ? I'd like to take a look and see if I can make a regex which does what you want.
So if you could write the examples there, and then a description like you would tell your mom of what you want I'll see what I can do.