Compare two strings and find longest common sub-string

D

David Turner

The problem seems relatively straightward to the human eye but I imagine it's
quite difficult to do in VBA without prior knowledge of what the strings
contain.
If I have two strings like:
The quick red fox jumped over the lazy brown cow
The quick red fox just stood there
How would I compare them and identify "quick red fox" as the longest common
sub-string?
I suppose I would have to start by reading one of the strings into an array
and comparing its elements against the second string? Or somehow use the
Filter function to send the matching items to another array? But then I can
only seem to be able compare each element against one word ("quick", "red" or
"fox") which I would have to know in advance.
There's no doubt some better way.
Any advice greatly appreciated.

Sub CompareStrings()
Dim Array1() As String
Dim string1 As String
Dim string2 As String
Dim InString() As String
Dim i As Integer
Dim j As Integer

string1 = "The quick red fox jumped over the lazy brown cow"
string2 = "The quick red fox just stood there"

For i = 0 To UBound(Split(string1, " ")) - 1
ReDim Preserve Array1(i)
Array1(i) = Split(string1, " ")(i)
'MsgBox Array1(i)
Next

InString = Filter(Array1, "red", True)

For j = 0 To UBound(InString)
MsgBox InString(j)
Next j

End Sub
 
K

Karl E. Peterson

David said:
The problem seems relatively straightward to the human eye but I imagine it's
quite difficult to do in VBA without prior knowledge of what the strings
contain.
If I have two strings like:
The quick red fox jumped over the lazy brown cow
The quick red fox just stood there
How would I compare them and identify "quick red fox" as the longest common
sub-string?

Perhaps it's not even very easy to the human eye? (The longest common
substring was actually "The quick red fox j", not "quick red fox".)
I suppose I would have to start by reading one of the strings into an array
and comparing its elements against the second string? Or somehow use the
Filter function to send the matching items to another array? But then I can
only seem to be able compare each element against one word ("quick", "red" or
"fox") which I would have to know in advance.
There's no doubt some better way.
Any advice greatly appreciated.

I'd suggest you start by clearly enumerating your actual requirements.
Is this a character comparison or a word comparison? Do the matches
need to start at the same position, or may they be anywhere within the
string. Etc, etc, etc...

If you can't define the problem, you'll never solve it.
 
N

Neil Humphries

I would put them in a For Next loop incremented from 1 to the length of the
shortest string.
Use the Mid() function to extract the characters in the shortest string one
by one starting at the first.
Use the Instr() function to search for the character in the longer string.
As long as the strings are a match, increment and keep looping. When they
don't match, use the index of the loop to tell you the position.
Use the Left() function to return the portion of the string that matches.

This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle.
 
F

Fumei2 via OfficeKB.com

I would agree with Karl. As, technically, the largest sub-string is indeed
"The quick red fox j", you need to spell out EXACTLY the requirements.

Neil: "This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle."

That is putting it mildly.

This is a very complex task. Quite likely do-able, but precise requirements
are needed.

Neil said:
I would put them in a For Next loop incremented from 1 to the length of the
shortest string.
Use the Mid() function to extract the characters in the shortest string one
by one starting at the first.
Use the Instr() function to search for the character in the longer string.
As long as the strings are a match, increment and keep looping. When they
don't match, use the index of the loop to tell you the position.
Use the Left() function to return the portion of the string that matches.

This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle.
The problem seems relatively straightward to the human eye but I imagine it's
quite difficult to do in VBA without prior knowledge of what the strings
[quoted text clipped - 36 lines]
 
D

David Turner

Karl E. Peterson said:
I'd suggest you start by clearly enumerating your actual requirements.
Is this a character comparison or a word comparison? Do the matches
need to start at the same position, or may they be anywhere within the
string. Etc, etc, etc...

If you can't define the problem, you'll never solve it.

Sorry. Word comparison, the matches can start anywhere in the strings and
the strings could be of variable length, possibly ending in punctuation: two
sentences for example.
I now realise it's a much more complex problem than I first thought.
 
D

Doug Robbins - Word MVP

It is probably several orders of magnitude more complex than you may have
first though.

I am wondering however is some of the purpose-built plagiarism detection
software might not be the way to go.

--
Hope this helps,

Doug Robbins - Word MVP

Please reply only to the newsgroups unless you wish to obtain my services on
a paid professional basis.
 
K

Karl E. Peterson

Doug said:
It is probably several orders of magnitude more complex than you may have
first though.

It's sounding fairly simple, actually, as brute force tasks go. And,
likewise, as brute force tasks go, it's going to take exponentially
longer to execute as the string size(s) grow. It's not unlike cracking
the combination to a lock without knowing how many numbers are in the
combination, although we are rewarded with partial success results
which actually makes it possible.

You'd "just" start comparing the first character to every character in
the other string, noting if you have a match of one. If you don't find
the first char, you scan the second string for the 2nd char in the
first string, and so on. Once you have a match of one, you start
looking for a match of two. You'd scan the second string for the first
two chars in the first. If they're not there, you'd start scanning the
second string for the 2nd and 3rd chars in the first, and so on.
I am wondering however is some of the purpose-built plagiarism detection
software might not be the way to go.

That's not a bad thought.
 
P

Peter Jamieson

In the simple case where you match character by character, white space
has to match exactly, etc. you're probably better off starting by
comparing the endpoints of the longest possible match, then reducing the
comparison length. If the endpoints don't match, the contents can't
match so you do not actually need to compare them.

Can't prove it though. But I would have thought there was good coverage
of this kind of algorithm in standard works such as Knuth's Sorting and
Searching volume. I don't know where you find that stuff online.


Peter Jamieson

http://tips.pjmsn.me.uk
 
K

Karl E. Peterson

Peter said:
In the simple case where you match character by character, white space has to
match exactly, etc. you're probably better off starting by comparing the
endpoints of the longest possible match, then reducing the comparison length.
If the endpoints don't match, the contents can't match so you do not actually
need to compare them.

Yeah, that's a potentially good option, too. A bit more complex, but
that happens when you start trying to finesse brute force attacks. :)
Can't prove it though. But I would have thought there was good coverage of
this kind of algorithm in standard works such as Knuth's Sorting and
Searching volume. I don't know where you find that stuff online.

You'd think, yeah. But there aren't many folks who still think in 2N
versus N^2 terms. Most just throw a framework at it, and see what
sticks.
 
D

David Turner

I cobble this together no idea where to go, if anywhere(!), from here:

Sub LongestMatch()
Dim strArray1() As String
Dim strArray2() As String
Dim string1 As String
Dim string2 As String
Dim i As Integer
Dim j As Integer
Dim m As Integer
Dim n As Integer


string1 = "quick red fox jumped"
string2 = "red fox stood"

For m = 0 To UBound(Split(string1, " ")) - 1
ReDim Preserve strArray1(m)
strArray1(m) = Split(string1, " ")(m)
'MsgBox strArray1(m)
Next

For n = 0 To UBound(Split(string2, " ")) - 1
ReDim Preserve strArray2(n)
strArray2(n) = Split(string2, " ")(n)
'MsgBox strArray2(n)
Next

For m = 0 To UBound(strArray1)
For n = 0 To UBound(strArray2)
If strArray1(m) = strArray2(n) Then
MsgBox strArray1(m) ' don't know what to do next
ElseIf strArray1(m) <> strArray2(n) Then
MsgBox strArray2(n) ' don't know what to do next
End If
Next
Next

End Sub
 
O

Office PC Developer

This should do the trick...

Option Explicit

Public Function Main()

Dim aStr As String, bStr As String
aStr = "The quick red fox jumped over the lazy brown cow"
bStr = "The quick red fox just stood there"

Dim longestPhrases As Collection, longPhrase As Variant
Set longestPhrases = FindLongestMatchingPhrase(aStr, bStr)
For Each longPhrase In longestPhrases
Debug.Print "Longest Matching Phrase: " & longPhrase
Next

End Function

Public Function FindLongestMatchingPhrase(aStr As String, bStr As String) As
Collection

Dim longestPhrases As Collection
Set longestPhrases = New Collection

Dim aWords As Variant, bWords As Variant
aWords = Split(aStr, " ")
bWords = Split(bStr, " ")

Dim aPhrases As Collection, bPhrases As Collection
Set aPhrases = BuildPhrases(aWords)
Set bPhrases = BuildPhrases(bWords)

Dim curPhrase As Variant, longestPhraseLen As Integer
longestPhraseLen = 0
For Each curPhrase In aPhrases
If (FoundInOtherPhrases(curPhrase, bPhrases)) Then
If (Len(curPhrase) > longestPhraseLen) Then
longestPhraseLen = Len(curPhrase)
Set longestPhrases = New Collection
AddDistinctPhrase curPhrase, longestPhrases
ElseIf (Len(curPhrase) = longestPhraseLen) Then
AddDistinctPhrase curPhrase, longestPhrases
End If
End If
Next
For Each curPhrase In bPhrases
If (FoundInOtherPhrases(curPhrase, aPhrases)) Then
If (Len(curPhrase) > longestPhraseLen) Then
longestPhraseLen = Len(curPhrase)
Set longestPhrases = New Collection
AddDistinctPhrase curPhrase, longestPhrases
ElseIf (Len(curPhrase) = longestPhraseLen) Then
AddDistinctPhrase curPhrase, longestPhrases
End If
End If
Next

Set FindLongestMatchingPhrase = longestPhrases

End Function

Public Function BuildPhrases(wordList As Variant) As Collection

Dim phrases As Collection
Set phrases = New Collection

Dim firstIndex As Integer, secondIndex As Integer, curPhrase As Variant
For firstIndex = 0 To UBound(wordList)
curPhrase = ""
For secondIndex = firstIndex To UBound(wordList)
If (secondIndex > firstIndex) Then
curPhrase = curPhrase & " "
End If
curPhrase = curPhrase & wordList(secondIndex)
AddDistinctPhrase curPhrase, phrases
Next
Next

Set BuildPhrases = phrases

End Function

Public Sub AddDistinctPhrase(curPhrase As Variant, phrases As Collection)

On Error GoTo NotFound
Dim existingPhrase As Variant
existingPhrase = phrases(curPhrase)
Exit Sub

NotFound:
phrases.Add Key:=curPhrase, Item:=curPhrase
Exit Sub

End Sub

Public Function FoundInOtherPhrases(curPhrase As Variant, phrases As
Collection) As Boolean

On Error GoTo NotFound
Dim existingPhrase As Variant
existingPhrase = phrases(curPhrase)
FoundInOtherPhrases = True
Exit Function

NotFound:
FoundInOtherPhrases = False
Exit Function

End Function
 
D

David Turner

Awesome! Only takes a second even with quite long strings. Will do my best to
try and understand the code. Many thanks.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top