Fast way of comparing strings?

  • Thread starter Guillermo López-Anglada
  • Start date
G

Guillermo López-Anglada

Hi,

I have the following code to compare strings:

Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _
ReferenceString As String)

Dim nWord As Long
Dim WordsFound As Long

For nWord = 0 To UBound(ArrayToCompare())

'palabra contenida en cadena...
If InStr(ReferenceString, ArrayToCompare(nWord)) > 0 Then

'aumentamos cómputo de palabras contenidas...
WordsFound = WordsFound + 1
End If
'siguiente palabra...
Next

GetNumOfMatchingWords = WordsFound
End Function

I basically convert a string into an "array of words" and then check whether each of these words is
contained within another string. Then I perform the same operation the other way around in order to find
the highest percentage of matching words between the two strings. This approach works for my purpose, but
as I have thousands of strings to compare, it isn't very efficient. Could someone suggest a faster method?

The whole process involves reading a text file (in binary mode), splitting the read buffer into lines,
then the strings into words and then calling this routine. I think this one is the bottleneck. I've tried
the Like operator as well, but it isn't flexible enough.

Regards,

Guillermo
 
K

Klaus Linke

Hi Guillermo,

Not sure...

Do you compare every string to every other string? Are the strings you
compare and the reference strings from the same list (text file)?

Maybe you could put all words from all strings into a two-dimensional array:
The word itself in the first column, and a number indicating which string
it's from, "1", "2"..., in the second.

Use a (fast) sorting algorithm.

Now you have a list like
Aardvark 7
Aberdeen 3
Aberdeen 5
Aberdeen 12
and 2
and 5
and 7
....

Now you could build a two-dimensional integer matrix (both dimensions equal
to the number of strings) by running through this list once, and
incrementing every time a word appears in two strings:
Aardvark: can be ignored (since it only appears in one sentence),
Aberdeen: increment matrix(3,5), matrix(3,12), matrix(5,12)
and: increment matrix(2,5), matrix(2,7), matrix(5,7) ...

When you're finished, matrix elements with a high number indicate a pair of
sentences that has a lot of words in common.

Regards,
Klaus
 
T

Tony Strazzeri

In addition to Klaus' reply you could look at another approach.

When I need to check an item against a large list I don't like
iterating through the list to find if the item exists. I have
developed an approach using either a listbox control or a collection
object. Both of these allow you to add only unique items to their
lists. If you try to add an item that already exists they generate an
error. My technique involves testing for the error to determine if
the item exists. I think (expect/hope) this may be faster than
iterating an array but have not tested it.

Maybe you could use this method. I can't post example code at this
time because I am not at a PC where I have the code but I will do so
when I can.

Hope this helps.

Cheers
TonyS.
 
R

Russ

What your ultimate goal?
Or is it to find the words or the number of words that are common to two
lists?
Or is it to find a total word count for each word?
Or is it to find which group of words consistently appear more often in
other groups of word?
(Then use what I say below.)


In your code below you are counting the number of words that are in common
with both. If you reverse the process, like you say you do between the same
two lists, then you should come up with the same number of common words.



Looking at your code below, change 'If InStr' to 'While InStr'...<> 0 if you
want to count every match in the string. Then if you reverse the process,
you could come up with a different number of matches, if the lists are not
unique and allow the same word to repeat.
 
R

Russ

Guillermo,
Message below in text.
What your ultimate goal?
Or is it to find the words or the number of words that are common to two
lists?
Or is it to find a total word count for each word?
Or is it to find which group of words consistently appear more often in
other groups of word?
(Then use what I say below.)


In your code below you are counting the number of words that are in common
with both. If you reverse the process, like you say you do between the same
two lists, then you should come up with the same number of common words.



Looking at your code below, change 'If InStr' to 'While InStr'...<> 0 if you
want to count every match in the string. Then if you reverse the process,
you could come up with a different number of matches, if the lists are not
unique and allow the same word to repeat.

You would do the While loop like this:

Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _
ReferenceString As String)

Dim nWord As Long
Dim WordsFound As Long

nWord = 1

nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord))
While nWord <> 0
'aumentamos cómputo de palabras contenidas...
WordsFound = WordsFound + 1
nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord))
Wend

GetNumOfMatchingWords = WordsFound
End Function
 
R

Russ

Guillermo,
Use this instead. To avoid an endless loop, I have to increment nWord, too.

Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _
ReferenceString As String)

Dim nWord As Long
Dim WordsFound As Long

nWord = 1

nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord))
While nWord <> 0
nWord = nWord + 1
'aumentamos cómputo de palabras contenidas...
WordsFound = WordsFound + 1
nWord = InStr( nWord, ReferenceString, ArrayToCompare(nWord))
Wend

GetNumOfMatchingWords = WordsFound
End Function
 
R

Russ

Guillermo,
I wasn't going through ArrayToCompare, so use this instead.


Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _
ReferenceString As String)
Dim iCounter As Long
Dim nWord As Long
Dim WordsFound As Long

iCounter = 1
For nWord = 0 To UBound(ArrayToCompare())
iCounter = InStr( iCounter, ReferenceString, ArrayToCompare(nWord))
While iCounter <> 0
'aumentamos cómputo de palabras contenidas...
WordsFound = WordsFound + 1
iCounter = InStr( iCounter + 1, ReferenceString, _
ArrayToCompare(nWord))
Wend
Next nWord
GetNumOfMatchingWords = WordsFound
End Function
 
R

Russ

Guillermo,
Sorry, I'm not testing this function. But as I look at it again, I see that
I needed to move the iCounter inside the For loop.

Private Function GetNumOfMatchingWords(ArrayToCompare() As String, _
ReferenceString As String)
Dim iCounter As Long
Dim nWord As Long
Dim WordsFound As Long

For nWord = 0 To UBound(ArrayToCompare())
iCounter = 1
iCounter = InStr( iCounter, ReferenceString, ArrayToCompare(nWord))
While iCounter <> 0
'aumentamos cómputo de palabras contenidas...
WordsFound = WordsFound + 1
iCounter = InStr( iCounter + 1, ReferenceString, _
ArrayToCompare(nWord))
Wend
Next nWord
GetNumOfMatchingWords = WordsFound
End Function
 
T

Tony Strazzeri

Here is the explanation and sample code to use a collection to
determine is an item exists in a list.

I wrote the notes below as aprt of my operations and coding
practices. This is the first time they will have been read by someone
else. I hope they make sense..



VBA does not have a method for quickly determining if a item is a
member of a list. For example in an array of names does the array
contain the name fred. Conversely determine if fred is not a member
of the array of names.
The usual way to determine this is to iterate through the entire array
until the item is found (or not found). Although tedious this is a
quick way to check a small number of items to see if they are part of
the list.
A good technique that I have developed is to use a collection to
achieve this.

============================
Collection Object
A Collection object is an ordered set of items that can be referred to
as a unit.

Remarks
The Collection object provides a convenient way to refer to a related
group of items as a single object. The items, or members, in a
collection need only be related by the fact that they exist in the
collection. Members of a collection don't have to share the same data
type.
A collection can be created the same way other objects are created.
For example:
Dim objCollection As New Collection

Once a collection is created, members can be added using the Add
method and removed using the Remove method. Specific members can be
returned from the collection using the Item method, while the entire
collection can be iterated using the For Each...Next statement.
===================================


When adding items you specify the item value and a key for the item.
I usually use the same value for the key as for the item. If the
item is a long string the may be wasteful of memory in which case
alternatives should be sought
Attempting to add an item with the same key as an existing item key
generates a' runtime error 457'. The error can be trapped or simply
ignored.
To ignore the error simply wrap the add with an error trap. Eg.

On Error Resume Next
objCollection.Add Item, key
On Error GoTo 0

This will avoid adding the duplicate leaving the original item in the
collection. If we want to update an existing value (where we don't
know if the item is in the list we can use a similar technique but
this time attempting to remove the item then adding the new item.

On error resume next
objCollection.remove "key"
On error goto 0
objCollection.Add Item, key

The following example will determine if the item is a member of the
list. If the item exists the variable temp will contain its value.

temp = ""
On Error Resume Next
temp = objCollection.Item(key)
On Error GoTo 0

Things to be aware of.
· Item numbers and item count are dynamic.
· If an item is removed all subsequent items will shift up and the
item count will reduce by 1.
· Items can be removed either by key or item number

Since you can have a collection of virtually anything you can address/
store multiple data properties of the item by storing an object in the
collection
Limitation
You can't return an item's index without iterating the list.


METHOD USING A LISTBOX IN A USERFORM
===================================
If programming in a userform we can use the properties of a ListBox
(which can remain hidden) to more easily return the index of an item
if it exists in the list.
Although we can use the same technique as for a collection (above).
The listbox has the advantage that it can easily return the item's
index in the list.
Using a listbox called lbxCTItem

With lbxCTItem
.ListIndex = -1 'unselects any selected item
On Error Resume Next
lbxCTItem = str
On Error GoTo 0
thisItem = .ListIndex
End With


'If the item does not exist thisItem retains a value of -1
'Otherwise it returns the items index number

If thisItem < 0 Then
Msgbox "The item does not exist in the list."
End If


You can store additional data about the item as columns of the
listbox. By changing the listcolumn number you can even find an item
using that column. But beware of duplicates.

I Would be interested in hearing comments on this technique. Maybe
some of the more savvy developers can comment/test on whether this is
quicker than other methods. If not I will get around to it one day.
Thill then I suspec that even with several hundred items in a list
most methods will be equally effective in this environment (Document
automation as distinct from number crunching).


Cheers
TonyS.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top