Regex Pattern

G

Geoff K

Hi
The aim is to identify anomalous strings in user data using a Regex pattern
but as a complete newbie I have got bogged down.
My test code is below but as you will see there are some "bad" results.
The criteria - a string can be:
text only = Head, Dept Head
text plus numerics = Class 8-10, Class 8, 6th form, Head KS6
a single dot
If numerics are included then:
a single or double digit must be followed by a hyphen or an ordinal =
1st, 10-12, 15th
there must be no lead zeros = 0 or 01270
ordinals must not be hyphenated = 2nd-3rd, 2nd-4

My Regex doesn't yet include a test for text only and wondered if that had
to be a separate process?

I would be grateful of any help to complete the pattern.

T.I.A.

Geoff
Sub TestPattern()

Dim objRegExp As Object
Dim i As Long

'''some test and result strings
Dim arrString() As Variant
Dim arrResult() As Variant
arrString = Array("String", "0", "class 1-", "8-", "8-10", _
"12", "12-15", "165", "12-165", "0 12", "Class 8", "1st class", _
"2nd class", "3rd class", "6th form", "40th class", "400th class", _
"Head", "Head KS6", ".")
arrResult = Array("Result", "ok", "ok", "ok", "ok", "ok", "ok", _
"ok", "bad", "bad", "bad", "ok", "ok", "ok", "ok", "ok", "ok", _
"bad", "bad", "bad")

'''set pattern
Set objRegExp = CreateObject("Vbscript.RegExp")
With objRegExp
.Global = True
.IgnoreCase = True
.MultiLine = True
.Pattern = "\b[1-9](-|st|nd|rd|th)\b|\b[1-9][0-9](|-|st|nd|rd|th)\b"
End With

With Sheets(1)

'''setup test strings
.Range("a:c").ClearContents
.Range("a:c").NumberFormat = "@"
.Cells(1, "B") = "Regex Test"
For i = LBound(arrString) To UBound(arrString)
.Cells(i + 1, "A") = arrString(i)
.Cells(i + 1, "C") = arrResult(i)
Next

'''run test pattern
For i = LBound(arrString) + 1 To UBound(arrString)
If objRegExp.Test(.Cells(i + 1, 1)) Then
.Cells(i + 1, "B") = "valid"
Else
.Cells(i + 1, "B") = "invalid"
End If
Next

End With

End Sub
 
R

Ron Rosenfeld

Hi
The aim is to identify anomalous strings in user data using a Regex pattern
but as a complete newbie I have got bogged down.
My test code is below but as you will see there are some "bad" results.
The criteria - a string can be:
text only = Head, Dept Head
text plus numerics = Class 8-10, Class 8, 6th form, Head KS6
a single dot
If numerics are included then:
a single or double digit must be followed by a hyphen or an ordinal =
1st, 10-12, 15th
there must be no lead zeros = 0 or 01270
ordinals must not be hyphenated = 2nd-3rd, 2nd-4

My Regex doesn't yet include a test for text only and wondered if that had
to be a separate process?

I would be grateful of any help to complete the pattern.

T.I.A.

Geoff


It shouldn't be that hard to do, but your criteria are not congruent with your
examples.

For example, you give as an example in the text plus numerics category
"Class 8" and "Head KS6". But your criteria state that if numerics are
included, a single or double digit must be followed by a hyphen or an ordinal.
That is not present in these examples.

You also give "Class 8-10" as an example. But the 10 is not followed by a
hyphen or an ordinal.

In addition, are these single entries within a cell; or are there multiple
entries within a cell as shown on the various lines you have provided?





--ron
 
G

Geoff K

You are right. I should qualify that criteria by adding "unless the string
is terminated or if a digit is followed by a space then the next character
should not be numeric" as in "0 12".

The strings are as extracted from real records. Data input is not under my
control. What I am trying to do is simply identify typo errors and wrong
data being entered in wrong fields for example you would not expect to find a
telephone number attached to a job title or "6th Form Head" as a Surname. It
is important to trap as many errors as possible at this stage and using
numerics s an approach which would appear to trap many of them. Having said
that I have also seen "Please select one…" and "Other" appear under the field
name "Job Title" and numerics won't trap those kind of mistakes. <g>
 
R

Ron Rosenfeld

You are right. I should qualify that criteria by adding "unless the string
is terminated or if a digit is followed by a space then the next character
should not be numeric" as in "0 12".

The strings are as extracted from real records. Data input is not under my
control. What I am trying to do is simply identify typo errors and wrong
data being entered in wrong fields for example you would not expect to find a
telephone number attached to a job title or "6th Form Head" as a Surname. It
is important to trap as many errors as possible at this stage and using
numerics s an approach which would appear to trap many of them. Having said
that I have also seen "Please select one…" and "Other" appear under the field
name "Job Title" and numerics won't trap those kind of mistakes. <g>


See how this regex does in your error checking routine:

"(^|[^-])[1-9]\d?(st|nd|rd|th)(?!-)|((^|\D)[1-9]\d?-[1-9]\d?(?=\D|$))|(^(\D+|\.)$)|([a-z]\s?)[1-9]\d?$"

Not perfect, but maybe a start.
--ron
 
R

r

for the check the ordinals try Is_Ordinals ...

Sub test()
Dim i As Long
For i = 0 To 99
Cells(i + 1, 1) = Ordinals(i)
Cells(i + 1, 2) = Is_Ordinals(CStr(Cells(i + 1, 1)))
Next
End Sub

Function Ordinals(l As Long) As String
'da 0 a 99
Dim re As Object
Dim s As String

s = VBA.Strings.Right(CStr(l), 2)
Set re = CreateObject("vbscript.regexp")
re.Pattern = _
"^((1|[2-9]1)|(2|[2-9]2)|(3|[2-9]3)|([04-9]|1[0-9]|[2-9]0|[2-9][4-9]))$"
s = re.Replace(s, "$2st$3nd$4rd$5th")
re.Pattern = "\d{1,2}[a-z]{2}"
If re.test(s) Then
Ordinals = re.Execute(s)(0)
End If
End Function

Function Is_Ordinals(s As String) As Boolean
'da 0 a 99
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Pattern = _

"^((1|[2-9]1)st|(2|[2-9]2)nd|(3|[2-9]3)rd|([04-9]|1[0-9]|[2-9]0|[2-9][4-9])th)$"
If re.test(s) Then
Is_Ordinals = True
End If
End Function

regards
r

Il mio ultimo lavoro ...
http://excelvba.altervista.org/blog/index.php/Excel-VBA/UsedRange-eccezioni-e-alternative.html


Ron Rosenfeld said:
You are right. I should qualify that criteria by adding "unless the string
is terminated or if a digit is followed by a space then the next character
should not be numeric" as in "0 12".

The strings are as extracted from real records. Data input is not under my
control. What I am trying to do is simply identify typo errors and wrong
data being entered in wrong fields for example you would not expect to find a
telephone number attached to a job title or "6th Form Head" as a Surname. It
is important to trap as many errors as possible at this stage and using
numerics s an approach which would appear to trap many of them. Having said
that I have also seen "Please select one…" and "Other" appear under the field
name "Job Title" and numerics won't trap those kind of mistakes. <g>


See how this regex does in your error checking routine:

"(^|[^-])[1-9]\d?(st|nd|rd|th)(?!-)|((^|\D)[1-9]\d?-[1-9]\d?(?=\D|$))|(^(\D+|\.)$)|([a-z]\s?)[1-9]\d?$"

Not perfect, but maybe a start.
--ron
 
G

Geoff K

Gosh - what a pattern! As a newbie to Regex I am impressed.
On first run this performs as I would expect. However if you are able then
please check back again as I have a library of more than 500 databases to
evaluate against.

What is enticing is the prospect of reducing processing time. I run the
exception report immediately after I have compiled a file often from more
than 50 workbooks in xls. This typically represents in excess of 15,000 rows
of 8 fields and often exceeds 40,000 rows.
My current code for validation is based entirely on finding any numeric in
several fields. This is placed into a Collection for each field then dumped
into an anomaly report. However it is no where near refined
enough as it picks up all sorts of valid stuff and the aim is exception
reporting to assist the user, when later in CSV format, to do a search and
replace exercise. Because of the 1000 limit in unique items in an xls
Autofilter I produced my own exception report using the code below.
However this is slow (because I use an array of all numbers) compared to the
main process and I resolved to research Regex capabilities. Excuse my use of
On Error as I rely on the properties of a Collection to only accept unique
entries as opposed to a Dictionary.

Geoff

'''get strings containing numerics by column
bRogueNum = False
arrRogueNum = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
If Not .Cells(.Rows.Count, "A").End(xlUp).Row = 1 Then
Set tbl = Union(.Range("C2:F" & .Cells(.Rows.Count, "A").End _(xlUp).Row),
..Range("H2:H" & .Cells(.Rows.Count, "A").End _(xlUp).Row))
For j = 1 To 6
If Not j = 5 Then
Set colRogueNum = New Collection
For i = LBound(arrRogueNum) To UBound(arrRogueNum)
With tbl.Columns(j)
Set rogueNum = .Find(arrRogueNum(i), , xlValues, _
xlPart, xlByRows, xlNext)
If Not rogueNum Is Nothing Then
bRogueNum = True
firstAddress = rogueNum.Address
Do
'''create collection of unique anomalies
On Error Resume Next
colRogueNum.Add rogueNum.Text, CStr(rogueNum)
On Error GoTo 0
Set rogueNum = .FindNext(rogueNum)
If rogueNum Is Nothing Then Exit Do
If rogueNum.Address = firstAddress Then Exit Do
Loop
End If
End With
Next i
If colRogueNum.Count > 0 Then '''write output to Log
With omegaWS
rnum = .Cells(.Rows.Count, j + 15).End(xlUp).Row
For k = 1 To colRogueNum.Count
.Cells(k + rnum, j + 15) = colRogueNum(k)
Next
'''autofit all anomalies
.Range("O:S").EntireColumn.AutoFit
End With
End If
End If
Next j
End If
 
R

Ron Rosenfeld

Gosh - what a pattern! As a newbie to Regex I am impressed.
On first run this performs as I would expect.

I think you are responding to me, so thank you for the feedback. As written,
the regex is designed to be used in a routine that will give a True/False
response (e.g. re.test(string).

Please note that I forgot to mention that the regex object needs to be set to
be case insensitive (re.ignorecase = true)
However if you are able then
please check back again as I have a library of more than 500 databases to
evaluate against.

What is enticing is the prospect of reducing processing time. I run the
exception report immediately after I have compiled a file often from more
than 50 workbooks in xls. This typically represents in excess of 15,000 rows
of 8 fields and often exceeds 40,000 rows.
My current code for validation is based entirely on finding any numeric in
several fields. This is placed into a Collection for each field then dumped
into an anomaly report. However it is no where near refined
enough as it picks up all sorts of valid stuff and the aim is exception
reporting to assist the user, when later in CSV format, to do a search and
replace exercise. Because of the 1000 limit in unique items in an xls
Autofilter I produced my own exception report using the code below.
However this is slow (because I use an array of all numbers) compared to the
main process and I resolved to research Regex capabilities. Excuse my use of
On Error as I rely on the properties of a Collection to only accept unique
entries as opposed to a Dictionary.

Geoff

'''get strings containing numerics by column
bRogueNum = False
arrRogueNum = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
If Not .Cells(.Rows.Count, "A").End(xlUp).Row = 1 Then
Set tbl = Union(.Range("C2:F" & .Cells(.Rows.Count, "A").End _(xlUp).Row),
.Range("H2:H" & .Cells(.Rows.Count, "A").End _(xlUp).Row))
For j = 1 To 6
If Not j = 5 Then
Set colRogueNum = New Collection
For i = LBound(arrRogueNum) To UBound(arrRogueNum)
With tbl.Columns(j)
Set rogueNum = .Find(arrRogueNum(i), , xlValues, _
xlPart, xlByRows, xlNext)
If Not rogueNum Is Nothing Then
bRogueNum = True
firstAddress = rogueNum.Address
Do
'''create collection of unique anomalies
On Error Resume Next
colRogueNum.Add rogueNum.Text, CStr(rogueNum)
On Error GoTo 0
Set rogueNum = .FindNext(rogueNum)
If rogueNum Is Nothing Then Exit Do
If rogueNum.Address = firstAddress Then Exit Do
Loop
End If
End With
Next i
If colRogueNum.Count > 0 Then '''write output to Log
With omegaWS
rnum = .Cells(.Rows.Count, j + 15).End(xlUp).Row
For k = 1 To colRogueNum.Count
.Cells(k + rnum, j + 15) = colRogueNum(k)
Next
'''autofit all anomalies
.Range("O:S").EntireColumn.AutoFit
End With
End If
End If
Next j
End If

I'm not sure what your question is. If it is how to speed up your routine
above, there are others better qualified to do that.
--ron
 
G

Geoff K

Ron
In running through the databases with the new pattern I now realise the
significance of KS3, KS4, KS5 etc.
KS is an acroynym for Key Stage and this appears quite frequently in the
lists. It would be very useful to "allow" this through and I wondered if you
would be kind enough to amend the pattern accordingly.
It can take the form of KS3 Manager, Manager KS4 or expanded to Key Stage 5
Manager or Manager Key Stage 6.

Is this possible?

Re speed, I don't have a problem.

Thank you so far.

Geoff
 
R

Ron Rosenfeld

Ron
In running through the databases with the new pattern I now realise the
significance of KS3, KS4, KS5 etc.
KS is an acroynym for Key Stage and this appears quite frequently in the
lists. It would be very useful to "allow" this through and I wondered if you
would be kind enough to amend the pattern accordingly.
It can take the form of KS3 Manager, Manager KS4 or expanded to Key Stage 5
Manager or Manager Key Stage 6.

Is this possible?

Re speed, I don't have a problem.

Thank you so far.

Geoff

The entries where the string ended with a digit should have been being flagged
as OK.

I don't have time to work on this for the next few days, but if your valid
phrases are all as above, you could just add, via alternation, to the existing
regex.

e.g. (for one digit KS entries):

(^|[^-])[1-9]\d?(st|nd|rd|th)(?!-)|((^|\D)[1-9]\d?-[1-9]\d?(?=\D|$))|(^(\D+|\.)$)|([a-z]\s?)[1-9]\d?$|KS[1-9]|Stage\s+[1-9]

(for two digit KS entries, just add the \d? after each [1-9] as in

KS[1-9]\d?
--ron
 
G

Geoff K

Ron

Great.

Thank you for the help and example. That is excellent advice. I can
progress from there - i hope <g>

Geoff
 
R

Ron Rosenfeld

Ron

Great.

Thank you for the help and example. That is excellent advice. I can
progress from there - i hope <g>

Geoff

Glad to help. Thanks for the feedback.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top