Bit Of A Hack

Home Archives
logo

Keep up to date with Bit Of A Hack...

Flex is too greedy. (How to make flex lazy) [Permalink]

Flex uses regexes. A really handy feature of regexes is that you can make sub-patterns greedy or not greedy. For example, to parse a pascal style string literal you might use a regex like this:

'((.*)'')*(.*)'

This effectively says "Match a single quote, followed by any number of occurrences of {any character followed by two single quotes} followed by any number of any character, followed by another single quote". This will match the following two literals which are valid pascal string literals:

'foo bar'
'foo''bar'

The second one reduces to foo'bar in pascal.

This is not particularly useful however as it is greedy. The following pattern matches as a single string literal (while it is actually invalid pascal syntax);

'foo''bar' 'waa?'

This happens because the dot . also matches a single quote. The regex engine continues to look forward after finding the fourth single quote in the string above because this single quote could match the last dot just as well as it can match the end single quote. So the solution is to make the .* non-greedy. aka lazy. This makes the regex engine do something like "I could stop here and match the pattern so I'll not bother going any further." You do this by adding a question mark after the star and you get a regex like this:

'((.*?)'')*(.*?)'

Suddenly you have an ideal regex for your pascal style string literal. YAY! But there is a snag. Flex will not do non-greediness. :( So what's the solution now? Well, the problem was that the dot matched the single quote. if it didn't that would be fine. So how about a character class that is everything except a single quote in place of each dot?

'(([^']*)'')*([^']*)'

Suddenly the question marks are superfluous. This now matches properly. HOORAY! I really don't like this 'feature' of flex.

By .

comments powered by Disqus

This website uses cookies. If you don't like this, please stop using this site.