Matching text that's NOT in a tag
Nov. 13, 2004, 12:59 PM
Post: #16
 
Hi all,

Ok, after looking over JD's filter set again, I saw the following filter:
Code:
Name = "Mark:  CSS {R}        [s]"
Active = TRUE
URL = "$TYPE(css)$SET(css=1)|$TYPE(htm)"
Limit = 8
Match = "<(style$SET(CSS=1)"
"|/style>$SET(CSS=))PrxNeverMatch"

Borrowing that idea & modifying it, I came up with this:
Code:
Name = "Mark: Tag Start"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 256
Match = "(<[a-z]+{1,*}(\s[a-z]+{1,*}=$AV(*))$SET(bTag=1)|>$SET(bTag=))PrxNeverMatch"

The idea here is to set a flag (bTag) when the start of a tag with attributes is detected, then clear the flag when the end of the tag is detected.

Since the match ends with the text "PrxNeverMatch", the match should never be true but the setting & clearing of the flag should still work. This means it shouldn't interfere with other filters or flood the log window.
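For anyone who wants to see the flag logic outside Proxomitron, here is a minimal Python sketch of the same idea. It is only an illustration, not how Proxomitron works internally: the function name `bold_outside_tags` and the sample part number are made up, and it simplifies the filter above by setting the flag on any "<" rather than only on tags with attributes.

```python
def bold_outside_tags(html, part_numbers):
    """Wrap listed part numbers in <b>...</b>, but only outside tags.

    `in_tag` plays the role of the bTag flag: set on '<', cleared on '>'.
    """
    out = []
    in_tag = False
    i = 0
    while i < len(html):
        ch = html[i]
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        elif not in_tag:
            # Only try the part-number list while the flag is clear.
            hit = next((pn for pn in part_numbers if html.startswith(pn, i)), None)
            if hit:
                out.append("<b>" + hit + "</b>")
                i += len(hit)
                continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    page = '<a href="/p/1234-5678">1234-5678</a>'
    print(bold_outside_tags(page, {"1234-5678"}))
```

The part number inside the href is left alone because the flag is still set there; the one in the link text gets bolded.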

Placement of this filter might be a bit tricky.

I suppose it would have to go before any web filters that match a tag & modify it without multi being true, otherwise it wouldn't "see" the start of the tag.

Conversely, it should be after any web filters that remove tags, otherwise it could "set" when it shouldn't.

That being said, something like this could be used for "bolding" the matching part numbers:
Code:
Name = "Bold Matching PNs"
Active = TRUE
URL = "$TYPE(htm)"
Match = "($LST(PartNumbers))\0(^$TST(bTag=1))"
Replace = "<b>\0</b>"

Without having a url of a page to test it on, I can't be sure it works.

Mike
Nov. 13, 2004, 07:12 PM
Post: #17
 
Mike,

Good research! Cheers

Quote:
Oddeysey Wrote:If (^within a tag && match on $LST) then
.... bold the text
I agree with this also, however, I don't think you can reverse this test. Doing so would reverse the scope.
As I understand scope in this sense, it encompasses the entire filter, not just parts of it. Under that guise, it wouldn't matter which order was used. However, I did first come up with the idea in the reverse form, and said to myself:

"If I have found the text, the first part is true, so I must check the next part. If I am not within a tag, then the second part is true, and I can bold the text." Here, things get murky for me. If I have already progressed in the first part to the point of finding the text, am I too far along the line of text to go back and test for an opening tag? I don't think so, but only an expert can answer that for sure. For that reason, as well as my earlier assertion about "most likely condition first", I chose the "not a tag" portion to be the first part of the overall test.

Looking at your modified JD filter, I'm left with the impression that you're gonna end up with two filters again. I'm not against that, you understand, I'm just curious. Are you proposing one filter or two?

If we are considering two-filter solutions, then let's think really outside the box. (I did some more research, too. [smoke]) Consider my earlier two filters: the first was intended to mark every occurrence of the part number with a bold, the second was to scour every tag, and remove any bolds found therein. That required Multi to be turned on. Let's go see what Scott had to say about Multi, shall we? From his Proxomitron Help file:

Scott R. Lemmon Wrote:Normally, when a rule is matched the result is sent directly to the web browser - no other rules are allowed to process the matched section. This is mainly for efficiency, as it saves quite a bit of work, but it's also a useful way to give certain filters priority over others - essentially it's first come, first served.
With that under our belts, I'm now gonna propose that we reverse my two filters - #1 becomes the tag filter, and vice versa.

So picture this: If filter #1 examines a tag, finds a match, and replaces it exactly as it was found, then according to the quote above, that tag won't be filtered again. Remember, Multi is off by default. (OK, it's FALSE, come on, gimme a break here, please.) For the remainder of the solution, it should be easy to comprehend that filter #2 will bold everything that it can find, but because the tags have already been Matched and Replaced, they shouldn't be Matched again, thus leaving them un-bolded. Cheers
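The "first come, first served" idea can be sketched in Python with a single regex alternation, where the tag alternative is tried first and puts the tag back unchanged. This is only an analogy under my own assumptions (the name `bold_with_tag_skipper` is hypothetical, and Python's `re` engine is not Proxomitron's matcher), but it shows why text consumed by the tag rule is never offered to the bolding rule.

```python
import re


def bold_with_tag_skipper(html, part_numbers):
    """Bold part numbers in text; whole tags are matched first and kept as-is."""
    if not part_numbers:
        return html
    # Longest alternatives first so a longer part number wins over a prefix.
    alternatives = "|".join(
        re.escape(pn) for pn in sorted(part_numbers, key=len, reverse=True)
    )
    # Group 1 plays the "tag skipper" (filter #1), group 2 the bolding filter (#2).
    pattern = re.compile(r"(<[^>]*>)|(" + alternatives + r")")

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)  # tag replaced with exactly itself
        return "<b>" + match.group(2) + "</b>"

    return pattern.sub(replace, html)


if __name__ == "__main__":
    print(bold_with_tag_skipper('<img alt="1234-5678"> 1234-5678', {"1234-5678"}))
```

Because the tag is consumed as one match, the part number in the alt attribute is never seen by the second alternative.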

If we're gonna use a two-filter solution, this strikes me as the most logical. Providing that Scott's description of "for the sake of efficiency" is the same as mine. [lol]

Whaddya think o' them apples, Mr. Z? [wha]


Oddysey

I'm no longer in the rat race - the rats won't have me!
Nov. 14, 2004, 08:27 AM
Post: #18
 
Hi Oddeysey,

Well, I think that would work also. In fact, if you look closely at the filters in my first post, you'll see that was how I initially approached it (check the last test in the "tag skipper" filter).

In my configuration, I use a "Tag Skipper" filter that matches certain tags & replaces them with the exact same thing & it works quite well. However, it's a short list of tags that it can match. It actually seems to speed things up a bit & I don't get a lot of hits filling the log window.

One problem with matching ALL the tags like that is flooding the log window with matches. Wolfram mentioned slowdowns, especially if the log window was open. That's why I stopped trying to match tags & went with the one filter approach.

I thought that if the right delimiters were used, the part number could be matched without checking to see if it was inside a tag. Unfortunately, I was unable to discover a delimiter match string that worked.

That's why I went back to the two filter approach in my last post. However, this time, I should think that the tag checking filter would be faster. It doesn't have to match as much text before setting the variable, & since it never really matches, there's no time spent replacing text and no entry in the log window.

I always prefer a one filter solution. Sometimes, I just can't figure out how to do it. Smile!

Mike
Nov. 14, 2004, 09:11 AM
Post: #19
 
Mike;

If I read your post correctly, you seem to imply that merely finding a tag that matches on the Bounds is sufficient to prevent any other filter from re-examining that tag (providing Multi is off). I'm under the impression that the Match portion must be matched exactly, not the Bounds. Or have I missed something, again?

And since the only tags a part number will be showing up in are the anchor and image tags, the filter can be limited to only those two tags, speeding things up considerably. (Log window or no. Wink )

Finally, for the record, riddle me this - do you (or you, Wolfram) normally surf the 'net with the Log open all the time? I don't, and I'm fairly certain I'm the normal one here. <_< I'm concerned about your concern over the speed while the Log window is open. :P


Oddysey

I'm no longer in the rat race - the rats won't have me!
Nov. 14, 2004, 10:45 AM
Post: #20
 
Hi Oddeysey,

Oddeysey Wrote:I'm under the impression that the Match portion must be matched exactly, not the Bounds.

You are correct. If "Bounds" is used, then the "Match" expression must exactly match the bounds. I did not mean to imply otherwise.

Since Wolfram didn't post links to pages he was trying to match on, I don't think that matching only img or anchor tags would be good enough. He only used links as an example, but surely there have to be other tags on that page. But yes, if only a limited set of tags had to be matched, I'm sure that would speed things up.

Oddeysey Wrote:Finally, for the record, riddle me this - do you (or you, Wolfram) normally surf the 'net with the Log open all the time? I don't, and I'm fairly certain I'm the normal one here.

LOL, I'm sure you are. Smile!

But for the record, I always have my log window open. Smile!

Mike
Nov. 14, 2004, 06:30 PM
Post: #21
 
Mike;

A parable for you.

After 55 years of wearing a wristwatch on my arm, my last one finally wore out. I swore I was gonna get an exact duplicate replacement, but when I saw the cost, I "adjusted" my thinking, and investigated the toy watches (gadgets galore, telling the time is an afterthought).

That was almost a year ago, and I still haven't found what I might like. Know why? Because I found that I don't need one. Even after five and a half decades, I was still able to go cold turkey, and give them up entirely - no regrets. Sure, I was semi-forced into it, but "any road to get to the destination", eh? :o

The moral is, you can give up your Log window. Really, you can. No more preaching atchya, just thought I'd speak my piece about my concerns, and then let it drop. <thud>

Now.....
Quote:If "Bounds" is used, then the "Match" expression must exactly match the bounds.

Uhhh, sorry, but that wasn't my question. I had asked if you had used Match at all. Here's the line from your post that confused me:
Quote:I should think that the tag checking filter would be faster. It doesn't have to match as much text before setting the variable, & since it never really matches, there's no time spent replacing text and no entry in the log window.

emphasis added

The implication I got was that Bounds was the only thing being checked here. It seems to me that if a Match isn't made, then the text under scrutiny is still available for examination by another filter. Assuming that we're checking tags in our PartNumbers filter set, shouldn't we be doing our best to achieve a match?

And finally....
Quote:..... I don't think that matching only img or anchor tags would be good enough. [Wolfram] only used links as an example, but surely there has to be other tags on that page.

I'm thinking like an HTML coder here - what tags must I use in order to accomplish the mission? Any ideas, besides the two we've already mentioned?

Is this clear at all? It's too early in the AM. I gotta go get a cuppa. :P Later.


Oddysey

I'm no longer in the rat race - the rats won't have me!
Nov. 16, 2004, 10:11 AM
Post: #22
 
Hi Oddeysey,

Sorry about the confusion. The filter I was referring to is "Mark: Tag Start" at the top of this page.

Mike
Nov. 17, 2004, 06:57 AM
Post: #23
 
Mike;

Gotchya, that made sense. <whew>

Now, if only whumann would show up again, we could prevail upon him to give us a link for testing purposes, and perhaps 5 or 10 lines from his $LST file.

Wolfram, are you out there?


Oddysey

I'm no longer in the rat race - the rats won't have me!
Dec. 03, 2004, 12:23 PM
Post: #24
 
Hi Mike and Oddysey,

I thought the discussion was over -- now I see I missed most of it. Thanks to Oddysey for the heads-up email.
I'm afraid I can't give you a URL to test because these are database query results in the company's intranet. As I did before, I can only post "anonymized" snippets to clarify certain problems. I don't think it's sensitive information but I have to stick to the rules. But I will test any solutions I like and report back if they work.

In fact I have used the 3-filter-set from my Nov 5 2004, 09:13 AM post ever since and it works quite well. Let me try to summarize the proposed improvements and new ideas I've seen in your posts:

1. $SET can not only be used in "Replace" but also in "Match". I didn't think about this because in my head "Match" is the IN-part of the filter and "Replace" is the OUT-part.

2. If $SET is used in the "Match" section, the filter doesn't need to match for the $SET to be effective. In fact, if I want to prevent the matches from flooding my log, I can explicitly prevent the match.

3. Using an "or"-condition, I can go from a 3-filter-set to a 2-filter-set.

What I don't understand is the "Match" condition in "Mark: Tag Start". Why can't it be stripped down to
Code:
Name = "Mark: Tag Start"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 256
Match = "(<$SET(bTag=1)|>$SET(bTag=))PrxNeverMatch"

Another question, looking at the previous ideas: The following bounds should be functionally equal. The second one is more elegant but which one is faster?:
Code:
Bounds = "[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Bounds = "[0-9]+{4}-[0-9]+{4}"

Additional remarks about flooding the log window:
  • Normally, I have my log window closed.
  • I know that flooding the log creates a performance issue when the log window is open. I know I don't "feel" any performance issue when it's closed. What I don't know is whether or not there is *any* performance impact with the log window closed.
  • Besides performance issues, flooding the log makes it useless for checking when and how certain filters work.
Wolfram
Dec. 03, 2004, 02:03 PM
Post: #25
 
Hi Wolfram,

Good to hear back from you.

My match for setting the flag was a bit more complicated since I was concerned a single "<" might match that wasn't in the context of an HTML tag. But after thinking about it, so what if it did? This shouldn't bother anything and you should get faster matches since there's less to test.

Wolfram Wrote:CODE

Bounds = "[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Bounds = "[0-9]+{4}-[0-9]+{4}"

As for which is faster, I don't know. Perhaps you can paste a large section of text in the "test" window and profile the filter using both styles. All things being equal, I prefer the second style, as for some reason, it's easier on my head. Smile!
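The same "profile both styles" experiment can be sketched in Python with the `timeit` module. To be clear about the assumptions: this measures Python's `re` engine, not Proxomitron's matcher, so the numbers say nothing definite about the real filter. The counted form is written as `{4}` because that is the Python spelling, and the sample text is invented.

```python
import re
import timeit

# The two spellings of "four digits, a dash, four digits".
explicit = re.compile(r"[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
counted = re.compile(r"[0-9]{4}-[0-9]{4}")

# Invented test text: one real part number and one near miss per repetition.
sample = "part 1234-5678 and filler text 12-34 " * 2000


def count_matches(pattern):
    """Scan the whole sample and return how many part numbers were found."""
    return len(pattern.findall(sample))


if __name__ == "__main__":
    # Both styles must find the same matches before timing means anything.
    assert count_matches(explicit) == count_matches(counted)
    for name, pattern in (("explicit", explicit), ("counted", counted)):
        seconds = timeit.timeit(lambda: count_matches(pattern), number=20)
        print(f"{name}: {seconds:.4f} s for 20 scans")
```

Run it a few times before drawing any conclusion; small differences are usually noise.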

Let us know how things work out.

Mike
Dec. 03, 2004, 04:48 PM
Post: #26
 
Hi Mike,

It's true, "<" and ">" appearing at strange places might confuse the filter. I haven't encountered such a case and will deal with it when I do. The bad thing is: Unless I manually check all the bold and non-bold partnumbers against the list, I might not notice if such a thing happens...

I tried the new 2-filter-set and so far it works very well. The result as displayed by the browser seems to be the same as with the 3-filter-set but now I can open the Log window again and get entries only for the matches where some partnumber has been made bold. Very nice :-)
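A small Python sketch of that failure mode, assuming the simplified rule "set on <, clear on >". The helper names and sample strings are made up for illustration; the point is only that a stray "<" in ordinary text leaves the flag set until the next ">", so a part number in between would silently stay un-bolded.

```python
def flag_after_each_char(text):
    """Return the inside-tag flag value after each character is scanned."""
    flags = []
    inside = False
    for ch in text:
        if ch == "<":
            inside = True
        elif ch == ">":
            inside = False
        flags.append(inside)
    return flags


def is_flagged_inside(text, needle):
    """True if the flag is set at the point where `needle` starts in `text`."""
    pos = text.index(needle)
    return pos > 0 and flag_after_each_char(text)[pos - 1]


if __name__ == "__main__":
    well_formed = "<p>Qty 3, part 1234-5678</p>"
    stray = "<p>Qty < 5, part 1234-5678</p>"
    print(is_flagged_inside(well_formed, "1234-5678"))  # flag clear, would be bolded
    print(is_flagged_inside(stray, "1234-5678"))        # flag stuck, would be skipped
```

A check like this over a saved copy of a result page could stand in for comparing bold and non-bold part numbers by hand.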

Thanks again,
Wolfram
Dec. 03, 2004, 08:19 PM
Post: #27
 
Hi Wolfram

Glad to hear it seems to be working. Smile!

Thanks for posting back.

Mike
Dec. 03, 2004, 08:24 PM
Post: #28
 
Mike;

It appears that you are logged in at the same time, just as I post this! Cheers


Wolfram;

Understood about your need to follow the rules, vis-a-vis our request for a test link. Oh well. Sad Can you possibly post the two (or three) filter set you've finally settled on? That way, we can refer back to this thread in the future, should someone else have a similar question.

Thanks.


Oddysey

I'm no longer in the rat race - the rats won't have me!
Dec. 03, 2004, 10:39 PM
Post: #29
 
Almost forgot.........

(OK, I did forget. Don't remind me, please. Sad )

Quote:Another question, looking at the previous ideas: The following bounds should be functionally equal. The second one is more elegant but which one is faster?:
Code:
Bounds = "[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Bounds = "[0-9]+{4}-[0-9]+{4}"

Wolfram, there's an old joke that starts out something like "A funny thing happened on the way to the .....", and you pick out the destination. In this case, that would be a lead-in to my following dissertation, which could be called "A funny thing happened on the way to executing machine-language instructions". It goes like this:

Sometimes when you write a piece of code that looks to you like it would execute faster as a loop than otherwise, you find that it just doesn't turn out to be true. This happens because when all is said and done, you've actually caused the processor, via the compiler/interpreter, to spend a lot of overhead on constructing and managing that loop.

Using your examples, we'll investigate what happens. Line One asks the processor (again, via the compiler/interpreter) to assign each operand to a register, and calculate a total result. That result dumps out at the far end of the pipe, and the next instruction is loaded. This is called "brute force", and thanks to the low overhead, it works pretty quickly.

In the case of Line Two, we find that the processor must first determine how many registers to initialize for the loop. Right away, we've got a management function, don't we? Now, not only is the loop executing instructions on one piece of data at a time, but each interim result must be stored somewhere. That's another management function. And of course, the final result must be calculated and returned. All of those operations are at the mercy of the compiler's author.

If that worthy was on the ball, the compiler is said to be "optimized", and will be able to tell when to use a loop, and when to just plow straight ahead, regardless of what the "higher language" coder actually wrote. If the author was not smart enough to make these distinctions, then the "higher language" coder will have to do some fine-tuning him- or herself. For that reason, you might see a difference in execution times between Line One and Line Two. If they appear to require about the same amount of time, then you are probably using an optimized compiler (or interpreter). Thank your lucky stars. Smile!

As usual, when I go off on a tangent like this, everything I've just written makes perfectly good sense to me. However, if anyone reading this has questions, don't hesitate to ask. I'll be glad to clarify my points, because I may (repeat, "may" Wink) have been more pedantic than I should have been. It would be tragic if I'd spent the last 15 minutes writing this, and someone walked away with a glassy look in their eye because I sounded like I'd bite their head off for asking a question. Nothing could be further from the truth - ask away!

OK, back to the scratching post! [lol]


Oddysey

I'm no longer in the rat race - the rats won't have me!
Dec. 06, 2004, 10:15 AM
Post: #30
 
Hi Oddysey,

I have to admit that sometimes your posts tend to provide ample explanation and discussion but leave it to the reader to draw a set of to-the-point conclusions. That may or may not work depending on that reader's lingual and logical capabilities. Wink
In the present case I assume the implied conclusion to be:
The second bounds-statement may be slower than the first one or -- in the best case -- require equal time. Therefore the first one is preferable.

Finally here's the code of my final filter set (company specifics removed):

Code:
Name = "Test: Inside a Tag?"
Active = TRUE
Multi = TRUE
URL = "OnlyCertainURLs/*"
Limit = 16
Match = "(<$SET(InsideTag=1)|>$SET(InsideTag=0))PrxNeverMatch"

Name = "Parts-List"
Active = TRUE
URL = "OnlyCertainURLs/*"
Bounds = "[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Limit = 16
Match = "($LST(Parts))\1$TST(InsideTag=0)"
Replace = "<b>\1</b>"

Thanks again for your help, Mike and Oddysey,
Wolfram