
**whenever** · Jul. 20, 2009, 03:35 AM

Quote:10a Loops -- Limiting expression scopes.
You can use "+" loops to isolate subexpressions, removing their
capability to look ahead.

No, "+" doesn't look ahead, it just repeats the preceding expression blindly. The help file says:

Quote:An important point to make about + is that it's a "blind" run. This means it repeats at long as the condition it's testing is true regardless of anything the follows it!

"Regardless of anything the follows it" means it doesn't look ahead. BTW, the wording should probably be "regardless of anything that follows it".

Quote:Example:
Say we want to match <foo ... >, but only if the following tag isn't </foo >

<foo*>(^*</foo >)
... wouldn't work, because "*>" doesn't stop at the first match but is looking
ahead.

<foo[^>]+>(^[^<]+</foo >)
... would work, but [^...] forces inspection of each character.

<foo(*>)+{1}(^(*<)+{1}/foo >)
... does what we want, quickly. "*>", "*<" are not looking ahead anymore.

I think "*" is equal to "?++" in prox language while the help file has:

Quote:A double-plus acts much like the single "+" plus except it also pays attention to what comes afterwards (it can "see" so to speak).

So the "*" itself looks ahead.

To match a string up to the next ">", "[^>]+>" is faster than "*>", because "+" just blindly repeats the "[^>]" while "*" has to check after each matched character whether it is followed by a ">".

Quote:10b Avoiding superfluous tests in OR conditions.
10c However, +/++ loops remove the uniqueness of the string under test, even if followed by {1,*}.

I don't understand what the examples are trying to show. A more detailed example might help. :)

***sidki3003*** · (This post was last modified: Jul. 20, 2009 06:59 AM by sidki3003.)

(Jul. 20, 2009 03:35 AM)whenever Wrote: No, "+" doesn't look ahead, it just repeats the preceding expression blindly. The help file says:

Sure it doesn't. It *removes* look-ahead capability from subexpressions like "*>". Have a look at the examples below that statement.

Quote:I think "*" is equal to "?++" in prox language while the help file has:

Effectively yes. Speedwise the difference is around an order of magnitude.

Quote:So the "*" itself looks ahead.

Right. That's what we need to get rid of in the example situations mentioned.

Quote:To match a string up to the next ">", "[^>]+>" is faster than "*>", because "+" just blindly repeats the "[^>]" while "*" has to check after each matched character whether it is followed by a ">".

Now that we have made "*" blind, it's much faster than "[^>]+>".

A more accurate expression for "making blind" is: Limiting the subexpression's scope, so that - after the initial match - there is nothing left to look ahead.

Quote:I don't understand what the examples are trying to show.

\*\*\*+{98} instead of \*+{100} makes the expression start with two unique chars, which is what we want.

Quote:A more detailed example might help.

I couldn't think of any.

I should also note that techniques.txt is addressing advanced filter writers, who know the help files (i know them too ;)), as well as the prox-list discussions. That allows it to go straight to the point, without repeating any Proxomitron basics.
(No one else would be interested in such things anyway.)

Thanks for looking at that draft. :)

I assume that its content is logically correct (all statements have been tested, of course), but maybe some wordings and/or examples could be improved to make it easier to understand.

**whenever** · Jul. 20, 2009, 07:54 AM

(Jul. 20, 2009 06:11 AM)sidki3003 Wrote:
(Jul. 20, 2009 03:35 AM)whenever Wrote: No, "+" doesn't look ahead, it just repeats the preceding expression blindly. The help file says:

Sure it doesn't. It *removes* look-ahead capability from subexpressions like "*>".

Yes, the subexpression "*>" doesn't look ahead when you suffix it with a "+"; that is similar to atomic grouping in general regex flavors. But I think the "*" *within* the subexpression still looks ahead until it finds the first ">". That's where I think it is slower than "[^>]+>".
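
For readers who know common regex flavors better than Prox, the <foo ...> example from 10a can be sketched in Python. This is only an analogy under stated assumptions: Python's `.*?` stands in for Prox's "*", a negative lookahead stands in for "(^...)", and the `(?=(...))\1` idiom emulates an atomic group. The negative test is written the same way in all three patterns to keep the focus on the first "*>". The sample strings and all names are made up, and the sketch says nothing about Proxomitron's speed.

```python
import re

# Hypothetical sample inputs, not taken from techniques.txt.
FOLLOWED_BY_CLOSE = "<foo a></foo ><b>"  # next tag IS </foo >: we want no match
FOLLOWED_BY_OTHER = "<foo a><b>"         # next tag is not </foo >: we want "<foo a>"

# Rough analogy to <foo*>(^*</foo >): on failure the lazy run is extended
# to a later ">", so the negative test can be satisfied further right.
BACKTRACKING = re.compile(r"<foo.*?>(?![^<]*</foo >)")

# Rough analogy to <foo[^>]+>(^[^<]+</foo >): the class can never cross a ">".
NEGATED_CLASS = re.compile(r"<foo[^>]*>(?![^<]*</foo >)")

# Rough analogy to <foo(*>)+{1}(^...): the lookahead captures the first
# ".*?>" once and \1 consumes it. Python does not backtrack into a
# completed lookahead, so the run cannot be extended afterwards.
SCOPE_LIMITED = re.compile(r"<foo(?=(.*?>))\1(?![^<]*</foo >)")


def matched_text(pattern, text):
    """Return the matched substring, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pat in [("backtracking", BACKTRACKING),
                      ("negated class", NEGATED_CLASS),
                      ("scope limited", SCOPE_LIMITED)]:
        print(name, "|", matched_text(pat, FOLLOWED_BY_CLOSE),
              "|", matched_text(pat, FOLLOWED_BY_OTHER))
```

With these inputs the backtracking pattern matches "<foo a></foo >" on the first string, which is the unwanted result, while the other two find no match there. All three match "<foo a>" on the second string.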

Quote:\*\*\*+{98} instead of \*+{100} makes the expression start with two unique chars, which is what we want.

That's interesting, although I still don't understand why.

Better change the doc to:

Quote:Example:
To test for 100 asterisk symbols anywhere in a document:
\*\*\*+{98} instead of \*+{100}
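
That the two spellings describe the same strings can be checked with a small sketch in Python regex syntax. It assumes that "+{98}" reads as 98 repetitions and writes it as `{98}`; the names are made up. It checks equivalence only and does not model why Prox benefits from an expression that starts with literal chars.

```python
import re

# Python analogues (an assumption): \*+{100} -> \*{100}, \*\*\*+{98} -> \*\*\*{98}
PLAIN = re.compile(r"\*{100}")
WITH_LITERAL_PREFIX = re.compile(r"\*\*\*{98}")  # two literal "*" plus 98 more


def first_span(pattern, text):
    """Return (start, end) of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.span() if m else None


def same_result(text):
    """True if both spellings find the same first match, or both find none."""
    return first_span(PLAIN, text) == first_span(WITH_LITERAL_PREFIX, text)


if __name__ == "__main__":
    print(same_result("ab" + "*" * 100 + "cd"))
    print(same_result("x" + "*" * 99 + "y"))
```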

***sidki3003*** · Jul. 20, 2009, 08:18 AM

(Jul. 20, 2009 07:54 AM)whenever Wrote: Yes, the subexpression "*>" doesn't look ahead when you suffix it with a "+"; that is similar to atomic grouping in general regex flavors. But I think the "*" *within* the subexpression still looks ahead until it finds the first ">". That's where I think it is slower than "[^>]+>".

Significantly faster. Which is the point of the entire chapter 10. Test it. ;)

Quote:Better change the doc to:

Quote:Example:
To test for 100 asterisk symbols anywhere in a document:
\*\*\*+{98} instead of \*+{100}

Done (although basic Prox). :)

**whenever** · (This post was last modified: Jul. 21, 2009 01:40 AM by whenever.)

(Jul. 20, 2009 08:18 AM)sidki3003 Wrote: Significantly faster. Which is the point of the entire chapter 10. Test it.

A test proved I was wrong. :(

"*>" is much faster than "[^>]+>"; even "?++>" is faster than "[^>]+>". This is totally different from what I know about greedy vs. lazy behavior in common regex flavors. It seems "*" in Prox is not simply the ".*?" of common regex flavors, and Scott made a special optimization for it.
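
For comparison, the "common regex flavor" side of that statement could be measured along these lines. A minimal Python sketch, assuming `.*?>` as the counterpart of "*>" and `[^>]*>` as the counterpart of "[^>]+>". The sample text, the names and the repeat count are arbitrary, and whatever numbers it prints apply to Python's engine only, not to Proxomitron.

```python
import re
import timeit

LAZY = re.compile(r".*?>")           # assumed counterpart of "*>"
NEGATED_CLASS = re.compile(r"[^>]*>")  # assumed counterpart of "[^>]+>"

# Hypothetical input: a long run of characters before the closing ">".
SAMPLE = "a" * 10_000 + ">" + "tail"


def time_pattern(pattern, text=SAMPLE, repeats=200):
    """Return total seconds for `repeats` anchored matches of pattern on text."""
    return timeit.timeit(lambda: pattern.match(text), number=repeats)


if __name__ == "__main__":
    # Both patterns should consume the same text before being timed.
    assert LAZY.match(SAMPLE).group(0) == NEGATED_CLASS.match(SAMPLE).group(0)
    print("lazy:         ", time_pattern(LAZY))
    print("negated class:", time_pattern(NEGATED_CLASS))
```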

"Look around" in common regex flavors doesn't consume characters. I don't think that is what you mean by "look ahead" in your docs, so my suggestion is as follows:

Quote:10a "+" -- Suppressing expression match attempts
You can use "+" loops to suppress match attempts *within* the preceding subexpression.

... wouldn't work, because "*>" doesn't stop at the first match and is matching forward

... does what we want, quickly. "*>", "*<" are not trying new match attempts anymore.

prefix(-possible_suffix|)\1*some_string
... would cause the filter to attempt a match twice ("-possible_suffix" AND empty) before it fails to match:
"prefix-possible_suffix ... no_match"

***sidki3003*** · Jul. 21, 2009, 06:02 PM

Good points!

Rephrasing 10b was easy:

Code:

prefix(-possible_suffix|)\1*some_string
... would cause the filter to attempt twice (first with "-possible_suffix"
then with "") to match ...
"prefix-possible_suffix ... no_match"
... before failing.

I do have problems integrating the changes into 10a. Although your suggestion is more exact, i perceive it as harder to understand than before. Also, i'd like to keep the "scope limiting" part, because it describes the actual process nicely (and because Scott was using it frequently too).

I'll revisit it later.