Regex techniques

Dave Runyan

I am still struggling with the use of the RegExp object in the
VBScript_RegExp_55 library, in Excel/VBA.

In my "old" regex processor, BKReplacem, I could use special characters such
as \n in the REPLACEMENT string, not just the search string. But that does
not seem to work in VBScript... I would get the literal "\n" in the output,
for example.

Perhaps I am trying to do things beyond this flavor of regex's capabilities.
I am trying to substantially reform and add information to multiple HTML
docs, using a set of search and replace rules stored in a spreadsheet for
ease of maintenance.

I note that I have never seen a VBS-regex example involving multiple lines
of text, but isn't that why they have the multi-line and global options?

Does anyone know of a source of VBS-regex documentation or examples that are
more oriented to text files, as opposed to field-validation applications?

I should mention that I am reading my text files in from a text stream
object into a single string variable, so that I can process the entire scope
of the document - line by line will not suffice for what I need to do.

To recap my questions:
1. Can I use special regex characters in the VBS replace string?
2. Can I process entire multi-line documents, and if so is the one-string
approach the right one in VBS?
3. Is there a source of documentation on more substantial multi-line VBS
regexes?

Thanks!
 
Dave Runyan

Thanks Tim, but I don't think those VB expressions will be recognized by the
RegExp object. With any regex processor I have to use one of the "official"
regex expressions like "\r\n" or "\x0d\x0a" or "\cM\cJ" which should all produce
the equivalent of "vbCrLf" in the output string. I could produce the
equivalent HTML control expressions, but I am trying to be able to interpret
and reproduce multi-line expressions, not just create a new line by
outputting (say) "<BR>". The difference is subtle, but for what I am doing
it is important.

VBS regex is supposed to treat end-of-line characters as valid characters
that will match the ".", when multi-line mode is set true. Mine does not,
and I can't understand why.

For example, with multi-line option = TRUE,
the pattern ^<CENTER>.*?</CENTER> should match both:

<CENTER>Hello World</CENTER> and

<CENTER>
Hello World
</CENTER>

BUT it only seems to match the first, one-line, string. I can't seem to
match multi-line strings no matter what I do.
 
Ron Rosenfeld

In VBScript regular expressions, the dot "." does NOT match \n, regardless of
how Multiline is set.

The Multiline property changes how ^ and $ are interpreted.

If you want to match all characters, including \n, you need to use something
like [\s\S]

So if you want to match both options, above, use the pattern:

^<CENTER>[\s\S]*?</CENTER> or
^<CENTER>[\s\S]*</CENTER>
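
A minimal VBA sketch of the difference, assuming late binding to VBScript.RegExp and a made-up sample string holding one single-line block and one multi-line block (the procedure name DemoDotVersusClass is only illustrative):

```vba
Sub DemoDotVersusClass()
    ' Late-bound RegExp, so no library reference is required.
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")

    ' Hypothetical sample: a one-line block followed by a multi-line block.
    Dim src As String
    src = "<CENTER>Hello World</CENTER>" & vbCrLf & _
          "<CENTER>" & vbCrLf & "Hello World" & vbCrLf & "</CENTER>"

    re.Global = True        ' find every match, not just the first
    re.MultiLine = True     ' ^ and $ also match at line breaks
    re.IgnoreCase = True

    ' "." stops at the line feed, so only the one-line block should match.
    re.Pattern = "^<CENTER>.*?</CENTER>"
    Debug.Print "dot:    "; re.Execute(src).Count

    ' [\s\S] matches any character, so both blocks should match.
    re.Pattern = "^<CENTER>[\s\S]*?</CENTER>"
    Debug.Print "[\s\S]: "; re.Execute(src).Count
End Sub
```

Note that with Global set, the lazy form [\s\S]*? stops at the first closing tag, while the greedy form [\s\S]* runs to the last closing tag in the whole string, so the lazy form is usually the safer choice when a document holds several blocks.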


--ron
 
Dave Runyan

Okay, that makes sense, but I still don't understand two things:

1. Why couldn't I explicitly match \r\n in the input text - I confirmed that
the input text stream contained crlf at the right point.
2. Can I use \r\n etc. in the REPLACE string to GENERATE those characters in
the result?

Thanks!

Ron Rosenfeld

Dave Runyan said:
1. Why couldn't I explicitly match \r\n in the input text - I confirmed that
the input text stream contained crlf at the right point.

Most likely, somewhere between the input text stream and the input to the VB
routine, the \r is getting stripped out.

Dave Runyan said:
2. Can I use \r\n etc. in the REPLACE string to GENERATE those characters in
the result?

I'm pretty certain you cannot, although others more knowledgeable about
VBScript may have a workaround. I believe that the replace string can only
contain literals or references to subexpressions (such as $1).

You can use CHAR(10) in a worksheet formula (Chr(10) or vbLf in VBA code).

For example,

Given your source string:

<CENTER>Hello World</CENTER>

and you want to insert \n as in your second example (before and after the
CENTER'd string), you could do something like this:

Pattern = "(>)(.*?)(</)"

Replace String = "$1"&CHAR(10)&"$2"&CHAR(10)&"$3"
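
In VBA code the same idea might look like the sketch below. It assumes late binding to VBScript.RegExp. ExpandEscapes is a hypothetical helper, not part of the RegExp object. It naively converts \r, \n and \t in a rule stored as plain text (for instance in a worksheet cell) into real characters before the rule is passed to Replace. It does not handle an escaped backslash such as \\n.

```vba
' Hypothetical helper: turns the two-character sequences \r, \n and \t in a
' replacement rule into the real control characters. Naive: "\\n" is not
' treated specially.
Function ExpandEscapes(ByVal rule As String) As String
    rule = Replace(rule, "\r", vbCr)
    rule = Replace(rule, "\n", vbLf)
    rule = Replace(rule, "\t", vbTab)
    ExpandEscapes = rule
End Function

Sub DemoReplaceWithLineBreaks()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "(>)(.*?)(</)"

    Dim src As String
    src = "<CENTER>Hello World</CENTER>"

    ' Build the line breaks in VBA; the replace string itself only gets
    ' $-style references such as $1 expanded.
    Debug.Print re.Replace(src, "$1" & vbCrLf & "$2" & vbCrLf & "$3")

    ' Same replacement, starting from a rule stored as plain text.
    Debug.Print re.Replace(src, ExpandEscapes("$1\r\n$2\r\n$3"))
End Sub
```

Both calls should put the opening tag, the text, and the closing tag on separate lines. Use vbLf instead of vbCrLf if the target file uses bare line feeds.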


--ron
 
