Hi there,

I want to match occurrences of a term that do not appear within XHTML tags. In the following example I want to match only the second occurrence of font:

 <font color="red">font might be red</font>                             

Of course, I need to be flexible, and "font" might be replaced by anything. I just want matches that are not within an XHTML tag <....>

I tried this:
 rex = new Regex("(?:?<=<)(?:?!>)\\b" + strHighlight + "\\b(?:?!=>)(?:?=<)", RegexOptions.IgnoreCase);                             

But that does not seem to work, probably because it is not the right way to say that strHighlight should not be within a <....> tag...

Thanks for your help!

Amkick

asked 12/04/2011 08:13

Amkick ♦♦


20 Answers:
If the text you are searching for is always just the tag name, then you should be able to use the following:

answered

kaufmed

Try pattern:
font(?![^<>]*>)

answered 2011-12-05 at 04:43:00

TerryAtOpus

Hi guys, thanks for the comments. I guess I did not make myself clear. I am implementing a highlighting function that marks found terms within an HTML document. As users can query anything, they might also query words that appear between < and > inside tags. I don't want those occurrences to be matched. So I am looking for a single regex that matches only the second occurrence of the word font, both in the example above and in this one:

<tag name="whatever font you like"/> stuf about a font and other stuff.


The word font is a variable in my script.

answered 2011-12-05 at 12:35:31

Amkick

I'm clear on that - I thought you would just modify my pattern to include a variable like this:

rex = new Regex(strHighlight + "(?![^<>]*>)", RegexOptions.IgnoreCase);

answered 2011-12-05 at 13:25:11

TerryAtOpus

My suggestion would be:

rex = new Regex("(?i)(?!<<[^<]*)font(?![^>]*>)")

answered 2011-12-05 at 13:29:27

kaufmed

Nix that.

answered 2011-12-05 at 13:52:21

kaufmed

Kaufmed's post just now reminded me that the word boundaries are important - this would give:

rex = new Regex("\\b" + strHighlight + "\\b(?![^<>]*>)", RegexOptions.IgnoreCase);
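
For reference, here is a minimal C# sketch of how this pattern could drive the highlighting itself. The class and method names and the [highlight] markers are placeholders (the markers are borrowed from later in this thread), and Regex.Escape is an addition so that a user-supplied term containing regex metacharacters is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareHighlighter
{
    // Wraps every whole-word occurrence of `term` that is NOT followed by
    // a '>' without an intervening '<' or '>', i.e. not inside <...>.
    public static string Highlight(string html, string term)
    {
        // Regex.Escape is an assumption added here, not part of the posted pattern.
        string pattern = @"\b" + Regex.Escape(term) + @"\b(?![^<>]*>)";

        return Regex.Replace(
            html,
            pattern,
            m => "[highlight]" + m.Value + "[/highlight]",
            RegexOptions.IgnoreCase);
    }
}
```

Called with the first example from the question and the term "font", only the occurrence in the text between the tags gets wrapped; the tag names themselves are left alone.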

answered 2011-12-05 at 13:53:39

TerryAtOpus

You might try this, but I still can't guarantee it. HTML is difficult to parse with regex.

(?i)(?<=<(?=(\S+))[^<]*>[^>]*)font(?=(?:[^<]*</\1>)?)

answered 2011-12-05 at 13:56:52

kaufmed

Terry,

Yours would fail to find:

<tag name="whatever font you like"/> the size of the font is > 5


There's probably little chance of receiving such text, though.
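
A quick way to see this limitation is to count matches, assuming Terry's word-boundary pattern from above. CountMatches is a hypothetical helper written only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnencodedGtDemo
{
    // Counts whole-word occurrences of `term` that the lookahead
    // does not consider to be inside a tag.
    public static int CountMatches(string html, string term)
    {
        var rex = new Regex(
            @"\b" + Regex.Escape(term) + @"\b(?![^<>]*>)",
            RegexOptions.IgnoreCase);
        return rex.Matches(html).Count;
    }
}
```

With a raw > after the word, the lookahead sees "font is >" as if it were the inside of a tag, so the text occurrence is skipped. If the same character is written as &gt;, the occurrence is found.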

answered 2011-12-05 at 13:59:28

kaufmed

Wouldn't the > 5 actually be &gt; 5 ?

answered 2011-12-05 at 14:02:10

TerryAtOpus

Possibly. It depends on the doctype of the HTML document, I believe, as to whether an unencoded gt is allowed. If that's the case, then it's fine--as you know  ; )

answered 2011-12-05 at 14:06:04

kaufmed

kaufmed, your latest pattern requires a tag to be found before and after the word being searched for - that's ok for complete HTML pages, but not so good if we're doing something like searching the user defined content of a CMS.

Ironically, I think your latest pattern will fail if you have a > character (must be > rather than &gt;) *before* the word you're looking for:
<tag name="whatever font you like"/> the size of the something is > 5 but I want to find font please

One other thing I'm not sure about - are (x)html tags valid if they have leading spaces inside the tag? eg < strong>
You'd just need to add a \s* to your pattern to fix that though.

There's nothing like peer review to keep you on your toes... lol

answered 2011-12-05 at 14:09:21

TerryAtOpus

Oh, and does ASP.NET allow wildcards in lookbehinds? PHP doesn't I believe.

answered 2011-12-05 at 14:24:19

TerryAtOpus

Oh, and does ASP.NET allow wildcards in lookbehinds?

Yes, it does.

answered 2011-12-05 at 14:25:50

kaufmed

Ironically, I think your latest pattern will fail if you have a > character (must be > rather than &gt;) *before* the word you're looking for:

I didn't check it extensively, but it seems to handle your suggested scenario:

answered 2011-12-05 at 14:29:06

kaufmed

Sorry - you're right, I see it does - this part of the pattern:
[^<]*>
matches:
name="whatever font you like"/> the size of the something is >

and:
[^>]*
matches:
5 but I want to find

(Was it intentional to work that way though?)

answered 2011-12-05 at 14:31:15

TerryAtOpus

No, I'd have to say that was dumb luck. There's probably some way to break my pattern.

answered 2011-12-05 at 14:38:44

kaufmed

Hmmm. As (even) you two are having doubts this will work, I have changed things to a more controlled situation. I now want to remove all instances that appear within a name attribute. So:

<a name="this is a [highlight]test[/highlight]">[highlight]test[/highlight]


should come back as

<a name="this is a test">[highlight]test[/highlight]


because the regex would match on the first two occurrences of [/?highlight] only. Can you assist with this one too? Thanks so much.

answered 2011-12-05 at 14:45:47

Amkick

Try replacing:
(?<=<\s*[a-z]+[^>]*name\s*=\s*"[^["]*)\[[^\]]*\]
with an empty string.
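
A minimal C# sketch of that replacement follows; the class and method names are placeholders. One deliberate difference, which is an assumption and not Terry's pattern: as far as I can tell, the class [^["] in the posted lookbehind also excludes [, so a single pass would stop after the first marker inside the attribute. The sketch uses [^"] instead so that both markers are removed in one pass.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameAttributeCleaner
{
    // Variable-length lookbehind (supported by .NET): the marker must be
    // preceded by "<tag ... name=" plus an opening quote, with no closing
    // quote in between, i.e. it sits inside the name attribute value.
    // NOTE: [^""] (any char but a quote) replaces the posted [^[""]; see above.
    private static readonly Regex MarkerInName = new Regex(
        @"(?<=<\s*[a-z]+[^>]*name\s*=\s*""[^""]*)\[[^\]]*\]",
        RegexOptions.IgnoreCase);

    public static string StripMarkers(string html)
    {
        // Replace each matched [..] marker with an empty string.
        return MarkerInName.Replace(html, string.Empty);
    }
}
```

On the example from the previous post, this leaves the markers around the link text in place and removes only the two inside the name attribute.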

answered 2011-12-14 at 07:03:18

TerryAtOpus

Regexes with variable-length lookbehinds and such are the closest things to magic I know. My thanks go out to these two wizards.

answered 2011-12-14 at 13:01:42

Amkick

Asked: 12/04/2011 08:13

Seen: 303 times

Last updated: 12/14/2011 03:46