Find matches in 2 cols using Collection vs Dictionary

R

Ron Rosenfeld

A Collection object uses the Key to make its decisions. The Item is just along for the ride.
You can stick almost anything into the Item.

That is a good point about the Collection object, if you are using it only to make the decision, and don't really need it to also store data.

And thanks for the pointer about the difference between Dim x as New yyy: Set x = yyy vs. Dim x as yyy: Set x = New yyy
 
G

GS

Jim Cone explained on 1/18/2012 :
Chip Pearson has some comments on "Don't Use Auto-Instancing Object
Variables" at
http://www.cpearson.com/excel/DeclaringVariables.aspx

A Collection object uses the Key to make its decisions. The Item is just
along for the ride.
You can stick almost anything into the Item.

Thanks, Jim.
I looked at Chip's article and I agree with it without reservation. I
reiterate, though, my comments regarding non-intrinsic objects since
his comments ref the external Scripting lib. Nonetheless, since we do
not work with the Collection object in the ways he points out, my
position still stands for using auto-instancing because the object
doesn't get instantiated until needed in the function and doesn't
persist to exist outside the function. I guess it's a matter of knowing
when it's okay to use auto-instancing and when not to use it. I suppose
it's also good practice to be consistent in how one handles this,
though in my world there's room for variable consistency<g> when
warranted. In the context of this function the Collection is also
auto-destroyed and so no need to add extra Set statements<IMO>. I'm a
strong supporter of explicitly destroying any objects we explicitly
create, and so...

Dim cRngB As Collection
Set cRngB = New Collection

...should be explicitly destroyed as a point of 'good practice' when
we're done with it...

Set cRngB = Nothing

...so VBA doesn't have to do the extra processing involved with implicit
destruction of the object variable. Not a big deal on a one-by-one
basis but the performance effect can be accumulative over the life of a
project's runtime.

So the variable consistency I use is...

Create explicitly; destroy explicitly
Auto-instance; auto-destroy
--


I'm okay with the way the Key/Item is handled since both are required
inputs and so must be some value when added. I don't see any advantage
in swapping the same value as Key with vbNullString for Item. I could,
however, swap using Item for the check to using Key since Key should be
unique while Item can be anything. I realize we use the same value for
both but the logical test in the real world will 'usually' be made on
Key if known, Item otherwise. Ron's implementation is good either way
IMO.
 
G

GS

Jim Cone explained :
Also, you should find code is faster when adding to the collection, if you
fill the Item with vbNullstring.
The collection is just being used to dump duplicates and you can iterate the
Keys just as easy as the Items.

After giving this more thought I decided to try using vbNullString just
to eliminate the 2 CStr() functions. It improved performance by 1 sec,
which is 12.5% based on the time using the 2 CStr() functions.

As stated previously, at first I didn't think it would be an advantage
but forgot I had to use CStr() because I deliberately left the cells
numeric so they'd be usable in formulas/calcs. (There was no need for
CStr() using Dictionary) Good catch, Jim! ..thanks for pointing this
out!

<FWIW>
I tested Chip's theory about being able to test an auto-instance object
and found a discrepancy in the results. Unless I misread his comments,
we can work with an auto-instance object same as an explicitly created
object using the Set statement.
 
G

GS

Thanks Jim & Ron for your feedback and helpful input. I modified this
project as follows...

The caller sub:
--I added an optional notification that asks the user if they want to
run a process on a returned list when the number of matches is
reported. This can be swapped out for the former notification via
'Commenting'. (Code for the process to call needs to be added as
appropriate to user's needs)

The StripDupes() function:
--In the case where no matches are found, the function makes no changes
to the worksheet.
--Iteration of the Collection acts on 'Key'.
--The 2 CStr() functions for adding values to 'Item' were replaced with
'vbNullString'


Final drafts...

Sub CompareCols_StripDupes()
Dim bSuccess As Boolean, lMatchesFound As Long
Dim vAns As Variant, sMsg As String

sMsg = _
"Do you want to remove any duplicate items in the non-matches?" _
& vbLf & vbLf & "(Doing so will return a list of unique items)"
vAns = MsgBox(sMsg, vbYesNo + vbQuestion)

If vAns = vbNo Then
bSuccess = StripDupes(lMatchesFound) '//dupes allowed
Else
bSuccess = StripDupes(lMatchesFound, False) '//no dupes allowed
End If 'vAns = vbNo

If lMatchesFound = 0 Then MsgBox "No matches found!": Exit Sub
If lMatchesFound < 0 Then
sMsg = "Both columns must have more than 1 item!" _
& vbLf & vbLf _
& "Please try again: specify different columns!"
MsgBox sMsg, vbExclamation
Exit Sub
End If 'lMatchesFound < 0

If bSuccess Then
sMsg = Format(CStr(lMatchesFound), "#,##0") _
& " Matches were found"
If vAns = vbYes Then _
sMsg = sMsg & " (including non-match duplicates)"
MsgBox sMsg '//comment out if using option below

'Optional: Ask to run a process on the new list
' sMsg = sMsg & vbLf & vbLf _
' & "Do you want to process the new list?"
'
' vAns = MsgBox(sMsg, vbYesNo + vbQuestion)
' If vAns = vbYes Then
' 'Code... ('Call' a process to act on the new list)
' End If 'vAns = vbYes
Else
MsgBox "An error occured!"
End If 'bSuccess
End Sub

Function StripDupes(Matches As Long, _
Optional AllowDupes As Boolean = True) As Boolean
' Compares 2 user-specified cols and removes matches found.
' User can also specific target col to receive revised list.
'
' Args In: Matches: ByRef var to return number of matches found to
' the caller.
'
' AllowDupes: True by default. Keeps duplicate non-match
' values in col to remove dupes from. If passing False,
' duplicate items in non-match col are removed.
'
' Returns: True if matches found and no error occurs;
' False if a: matches not found --OR-- error occurs;
' b: either input col has less than 2 items.
'
' Sources: Ron Rosenfeld, Jim Cone, GS (Garry Sansom)

Dim i&, j&, lMatchesFound& 'as long
Dim vRngA, vRngB, vRngOut(), vAns 'as variant
Dim sRngOut As String

'Get the label of the columns to act on
Const sMsg As String = "Please enter the label of the column"
'Column to filter
vAns = Application.InputBox(sMsg _
& " to remove duplicates from", Type:=2)
If vAns = False Or vAns = "" Then Exit Function
vRngA = Range(vAns & "1:" & vAns _
& Cells(Rows.Count, vAns).End(xlUp).Row)
sRngOut = vAns '//output goes here unless specified below
'Column to be checked
vAns = Application.InputBox(sMsg _
& " to check for duplicates", Type:=2)
If vAns = False Or vAns = "" Then Exit Function
vRngB = Range(vAns & "1:" & vAns _
& Cells(Rows.Count, vAns).End(xlUp).Row)

'Make sure lists contain more than 1 item
If Not IsArray(vRngA) Or Not IsArray(vRngB) Then _
Matches = -1: Exit Function

'Column to receive the results
vAns = Application.InputBox(sMsg _
& "where the new list is to go" & vbLf _
& "(Leave blank or click 'Cancel' to use column " _
& UCase$(sRngOut) & ")", Type:=2)
If vAns = False Or vAns = "" Then sRngOut = sRngOut _
Else sRngOut = vAns

Debug.Print Now()
Dim cRngB As New Collection
On Error Resume Next
For j = LBound(vRngB) To UBound(vRngB)
cRngB.Add Key:=CStr(vRngB(j, 1)), Item:=vbNullString
Next 'j
Err.Clear: On Error GoTo ErrExit

If AllowDupes Then '//fastest
On Error GoTo MatchFound
For i = LBound(vRngA) To UBound(vRngA)
If cRngB.Key(CStr(vRngA(i, 1))) <> "" Then _
vRngA(i, 1) = "": lMatchesFound = lMatchesFound + 1
skipit:
Next 'i

Else '//slowest
On Error GoTo MatchFound
For i = LBound(vRngA) To UBound(vRngA)
cRngB.Add Key:=CStr(vRngA(i, 1)), Item:=vbNullString
Next 'i
End If 'AllowDupes
Err.Clear: On Error GoTo ErrExit

j = 0: ReDim vRngOut(UBound(vRngA) - lMatchesFound, 0)
For i = LBound(vRngA) To UBound(vRngA)
If Not vRngA(i, 1) = "" Then _
vRngOut(j, 0) = vRngA(i, 1): j = j + 1
Next 'i

If lMatchesFound > 0 Then '//only write if lMatchesFound > 0
Range(sRngOut & ":" & sRngOut).ClearContents
With Range(sRngOut & "1").Resize(UBound(vRngOut), 1)
.Value = vRngOut
.NumberFormat = "0000000000000" '//optional
.EntireColumn.AutoFit '//optional
End With
End If 'lMatchesFound > 0

Debug.Print Now()
ErrExit:
Matches = lMatchesFound: StripDupes = (Err = 0)
Exit Function

MatchFound:
If AllowDupes Then Resume skipit
vRngA(i, 1) = "": lMatchesFound = lMatchesFound + 1: Resume Next

End Function 'StripDupes()

Enjoy!
 
G

GS

I've reworked this utility as follows:

A bug in the AllowDupes feature is fixed
Prompts have been moved to the caller
Parameters are passed to the function via an array
Function supports:
- removing matches or non-matches
- returning a list with or without duplicate values


<code> - watch for line wrapping

The new caller:
Sub CompareCols_FilterMatches()
Dim bSuccess As Boolean, lMatchesFound As Long
Dim vAns As Variant, vCriteria(5) As Variant, sMsg As String

'Get the label of the columns to act on
Const MSG As String = "Please enter the label of the column"

tryagain:
'Column to filter
sMsg = MSG & " to be filtered": vAns = Application.InputBox(sMsg,
Type:=2)
If vAns = False Or vAns = "" Then Beep: Exit Sub
vCriteria(0) = Range(vAns & "1:" & vAns & Cells(Rows.Count,
vAns).End(xlUp).Row).Address
'Output goes in the column being filtered unless specified otherwise
below
vCriteria(2) = UCase$(vAns)

'Column to be checked
sMsg = MSG & " to check for matches": vAns =
Application.InputBox(sMsg, Type:=2)
If vAns = False Or vAns = "" Then Beep: Exit Sub
vCriteria(1) = Range(vAns & "1:" & vAns & Cells(Rows.Count,
vAns).End(xlUp).Row).Address

'Make sure lists contain more than 1 item
If Not Range(vCriteria(0)).Cells.Count > 1 _
Or Not Range(vCriteria(1)).Cells.Count > 1 Then
sMsg = "Columns MUST have more than one value!" & vbLf & vbLf
sMsg = sMsg & "Please try again with a different set of
columns"
MsgBox sMsg, vbCritical: GoTo tryagain
End If

'Column to receive the results
sMsg = MSG & "where the new list is to go" & vbLf _
& "(Leave blank or click 'Cancel' to use column '" &
vCriteria(2) & "')"
vAns = Application.InputBox(sMsg, Type:=2)
If Not (vAns = False) And (vAns <> "") Then vCriteria(2) =
UCase$(vAns)

'Return or remove matches?
sMsg = "Do you want to return the matches found instead of removing
them?"
vAns = MsgBox(sMsg, vbYesNo + vbQuestion)
If (vAns = vbYes) Then vCriteria(3) = 1 Else vCriteria(3) = 0

'Return a unique list?
sMsg = "Do you want only unique items in the returned list?" & vbLf &
vbLf & "(No duplicates)"
vAns = MsgBox(sMsg, vbYesNo + vbQuestion) '//YES = no dupes allowed
If (vAns = vbYes) Then vCriteria(4) = 0 Else vCriteria(4) = 1
bSuccess = FilterMatches(lMatchesFound, vCriteria())

If lMatchesFound = 0 Then MsgBox "No matches found!": Exit Sub
If lMatchesFound < 0 Then
sMsg = "Both columns must have more than 1 item!"
sMsg = sMsg & vbLf & vbLf & "Please try again: specify different
columns!"
MsgBox sMsg, vbExclamation: Exit Sub
End If 'lMatchesFound < 0

If bSuccess Then
sMsg = Format(CStr(lMatchesFound), "#,##0") _
& " Matches were found"
If vAns = vbYes Then _
sMsg = sMsg & " (including non-match duplicates)"
MsgBox sMsg '//comment out if using option below

'Optional: Ask to run a process on the new list
' sMsg = sMsg & vbLf & vbLf _
' & "Do you want to process the new list?"
'
' vAns = MsgBox(sMsg, vbYesNo + vbQuestion)
' If vAns = vbYes Then
' 'Code... ('Call' a process to act on the new list)
' End If 'vAns = vbYes
Else
MsgBox "An error occured!"
End If 'bSuccess
End Sub


The new function:
Function FilterMatches(Matches As Long, Criteria() As Variant) As
Boolean
' Compares 2 user-specified cols and filters matches found.
' User can also specific target col to receive resulting list.
' Optionally supports returning a unique list or allow duplicates.
' Optionally supports returning matches or non-matches.
'
' Args In: Matches: ByRef var to return number of matches found to
the caller.
'
' vCriteria(): A variant array containing the filtering
parameters.
' Criteria(0) - Address of the values to be filtered
' Criteria(1) - Address of the values to check
' Criteria(2) - Label of the column to put the filtered
list
' Criteria(3) - Numeric value to determine if we return
matches or non-matches
' Criteria(4) - Numeric value to determine if we return a
unique list or allow dupes
'
' Returns: True if matches found and no error occurs;
' False if a: matches not found --OR-- error occurs;

Dim i&, j& 'as long
Dim vFilterRng, vCheckRng, vResult, vaMatches(), vaNoMatches(),
vaDataOut() 'as variant
Dim bReturnMatches As Boolean, bMatch As Boolean, bDupesAllowed As
Boolean
Dim cItemsToCheck As New Collection, sMsg$, sRngOut$ 'as string

'Load the filtering criteria
vFilterRng = Range(Criteria(0)): vCheckRng = Range(Criteria(1)):
sRngOut = Criteria(2)
bReturnMatches = (Criteria(3) = 1): bDupesAllowed = (Criteria(4) = 1)
ReDim vaMatches(UBound(vFilterRng)): ReDim
vaNoMatches(UBound(vFilterRng)): j = 0

'Load the Collection with the values to be checked.
'Collections only allow unique keys so use OERN (no need to check if
they already exist)
Set cItemsToCheck = New Collection: On Error Resume Next
For i = LBound(vCheckRng) To UBound(vCheckRng)
cItemsToCheck.Add Key:=CStr(vCheckRng(i, 1)), Item:=vbNullString
Next 'i
Err.Clear

'Check the Collection for matches
On Error GoTo MatchFound
For i = LBound(vFilterRng) To UBound(vFilterRng)
bMatch = False '..reset
cItemsToCheck.Add Key:=CStr(vFilterRng(i, 1)), Item:=vbNullString
If bMatch Then
If bReturnMatches Then vaMatches(j) = vFilterRng(i, 1): j = j + 1
Else
vaNoMatches(j) = vFilterRng(i, 1): j = j + 1
cItemsToCheck.Remove CStr(vFilterRng(i, 1)) '..so dupes of it
don't get counted
End If 'bMatch
Next 'i

'Initialize the return list
If bReturnMatches Then vResult = vaMatches Else vResult = vaNoMatches

'Return a list of unique values?
If Not bDupesAllowed Then
On Error GoTo UniqueList
Dim cUniqueList As New Collection
For i = LBound(vResult) To UBound(vResult)
cUniqueList.Add Key:=CStr(vResult(i)), Item:=vbNullString
Next 'i
End If 'Not bDupesAllowed
Err.Clear: On Error GoTo ErrExit

'Make the list to return contiguous.
ReDim vaDataOut(UBound(vResult), 0): j = 0
For i = LBound(vResult) To UBound(vResult)
If Not vResult(i) = "" Then vaDataOut(j, 0) = vResult(i): j = j + 1
Next 'i

If Matches > 0 Then '..only write if Matches > 0
Columns(sRngOut).ClearContents
With Range(sRngOut & "1").Resize(UBound(vaDataOut) + 1, 1)
.Value = vaDataOut
.NumberFormat = "0000000000000" '..optional
.EntireColumn.AutoFit '..optional
End With
End If 'Matches > 0


ErrExit:
' If bReturnMatches Then Matches = UBound(vResult) ' + 1
FilterMatches = (Err = 0): Exit Function

MatchFound:
bMatch = True: Matches = Matches + 1: Resume Next

UniqueList:
vResult(i) = "": Matches = Matches + 1: Resume Next

End Function 'FilterMatches()
 
G

GS

I forgot to post the performance details...

There are 4 possible results using this utility:

1. Returns a list of matches with duplicates
2. Returns a list of non-matches with duplicates
3. Returns a unique list of matches (no duplicates)
4. Returns a unique list of non-matches

You can choose where to put the returned list. Tested on two cols x
500,000 rows of data, depending on which return options are selected
the new list generated in about 10 to 12 seconds. This might improve if
Calculation/EnableEvents/ScreenUpdating are toggled off/on, but I doubt
by much since the return list gets 'dumped' into the worksheet in one
shot. This produces a slight flicker that's reasonably acceptible IMO.
 
G

GS

I made a few changes that seem to improve performance so that 2 cols x
500000 rows processes in 6 to 8 secs on my machine now...


Function FilterMatches(Matches As Long, Criteria() As Variant) As
Boolean
' Compares 2 user-specified cols and filters matches found.
' User can also specific target col to receive resulting list.
' Optionally supports returning a unique list or allow duplicates.
' Optionally supports returning matches or non-matches.
'
' Args In: Matches: ByRef var to return number of matches found to
the caller.
'
' vCriteria(): A variant array containing the filtering
parameters.
' Criteria(0) - Address of the values to be filtered
' Criteria(1) - Address of the values to check
' Criteria(2) - Label of the column to put the filtered
list
' Criteria(3) - Numeric value to determine if we return
matches or non-matches
' Criteria(4) - Numeric value to determine if we return a
unique list or allow dupes
'
' Returns: True if matches found and no error occurs;
' False if a: matches not found --OR-- error occurs;

Dim i&, j& 'as long
Dim vFilterRng, vCheckRng, vResult, vaMatches(), vaNoMatches(),
vaDataOut() 'as variant
Dim bReturnMatches As Boolean, bMatch As Boolean, bDupesAllowed As
Boolean
Dim cItemsToCheck As New Collection, sMsg$, sRngOut$ 'as string

'Load the filtering criteria
vFilterRng = Range(Criteria(0)): vCheckRng = Range(Criteria(1)):
sRngOut = Criteria(2)
bReturnMatches = (Criteria(3) = 1): bDupesAllowed = (Criteria(4) = 1)
ReDim vaMatches(UBound(vFilterRng)): ReDim
vaNoMatches(UBound(vFilterRng)): j = 0

'Load the Collection with the values to be checked.
'Collections only allow unique keys so use OERN (no need to check if
they already exist)
Set cItemsToCheck = New Collection: On Error Resume Next
For i = LBound(vCheckRng) To UBound(vCheckRng)
cItemsToCheck.Add Key:=CStr(vCheckRng(i, 1)), Item:=vbNullString
Next 'i
Err.Clear

Debug.Print Now()
'Check the Collection for matches
On Error GoTo MatchFound
For i = LBound(vFilterRng) To UBound(vFilterRng)
bMatch = False '..reset
cItemsToCheck.Add Key:=CStr(vFilterRng(i, 1)), Item:=vbNullString
If bMatch Then
If bReturnMatches Then vaMatches(j) = vFilterRng(i, 1): j = j + 1
Else
vaNoMatches(j) = vFilterRng(i, 1): j = j + 1
cItemsToCheck.Remove CStr(vFilterRng(i, 1)) '..so dupes of it
don't get counted
End If 'bMatch
Next 'i

'Initialize the return list
If bReturnMatches Then vResult = vaMatches Else vResult = vaNoMatches

'Return a list of unique values?
If Not bDupesAllowed Then
On Error GoTo UniqueList
Dim cUniqueList As New Collection
For i = LBound(vResult) To UBound(vResult)
cUniqueList.Add Key:=CStr(vResult(i)), Item:=vbNullString
Next 'i
ReDim vaDataOut(cUniqueList.Count - 1, 0): j = 0
Else
ReDim vaDataOut(UBound(vResult), 0): j = 0
End If 'Not bDupesAllowed
Err.Clear: On Error GoTo ErrExit

'Make the list to return contiguous.
For i = LBound(vaDataOut) To UBound(vaDataOut)
If Not vResult(i) = Empty Then vaDataOut(j, 0) = vResult(i): j = j
+ 1
Next 'i

If Matches > 0 Then '..only write if Matches > 0
Columns(sRngOut).ClearContents
With Range(sRngOut & "1").Resize(UBound(vaDataOut) + 1, 1)
.Value = vaDataOut
.NumberFormat = "0000000000000" '..optional
.EntireColumn.AutoFit '..optional
End With
End If 'Matches > 0
Debug.Print Now()


ErrExit:
' If bReturnMatches Then Matches = UBound(vResult) ' + 1
FilterMatches = (Err = 0): Exit Function

MatchFound:
bMatch = True: Matches = Matches + 1: Resume Next

UniqueList:
vResult(i) = Empty: Matches = Matches + 1: Resume Next

End Function 'FilterMatches()
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top