Find and replace the 1st occurrence for many strings

J

Jac Tremblay

Hi everyone
I have 40000 row of data and need to delete duplicate rows. Of course, I have checked the Excel programming workgroups and Chip Pearson's site. I found many interesting ideas but I need a bit more. I found out that when I use the advanced filter on the key row, it is a significant gain on time processing. In that particular case, what I do is load each unique value extracted in a dynamic array. Then I select the data column and, for each unique key, modify the first occurrence found by adding a string of 3 "#" in front of it. After that, I replace all the other occurrences by "---" (suggestion of CPearson). Finally, I replace all "###" with an empty string. I can then sort the rows, find out the number of rows containing "---" and delete them. That way of doing works faster when there are more than 1000 rows. But the one part where I loop to replace the first occurrence of each key is too slow. I would like to know if there is some faster way to do that job
The Sub procedure below is called by a main procedure (I have over 20 sheets to deal with). The parameters are the columns (2 or 3) that form the key (strCol1 = "1", strCol2 = "2" and strCol3 = "3"
Thanks in advance
Here is the code
' ******************************************************************
Sub DeleteDuplicateRows(ByVal strCol1, ByVal strCol2, ByVal strCol3
' Jac Tremblay 2004-05-2
Dim strFormula As Strin
Dim strRangeAddress As Strin
Dim strSearch() As Strin
Dim lngNbSearch As Lon
Const strConstDashes As String = "---
Const strConstNumbers As String = "###
Dim intColFilter As Intege
Dim intColFiltre As Strin
Dim lngI As Lon
Dim lngNbLines As Lon

' The number of data rows is found in column D
lngNbLinesBeginning = Range("D3").End(xlDown).Ro
If lngNbLinesBeginning <= 3 The
booSheetDone = Tru
Exit Su
End I

' The number of columns is found in row 2
intNbCol = Range("A2").End(xlToRight).Colum
Columns("D:E").Selec
Selection.Insert Shift:=xlToRigh
intNbCol = intNbCol +

' NbToAA is a fonction that returns the correspondin
' letter(s) of a column number
strNbCol = NbToAA(intNbCol
Range("D3").Selec
Selection.FormulaR1C1 = "1
Selection.AutoFill
Destination:=Range("D3:D" & lngNbLinesBeginning),
Type:=xlFillSerie
Range("E3:E" & lngNbLinesBeginning).Selec
Selection.NumberFormat = "General
strFormula = "=RC[" & strCol1 & "]&RC[" & strCol2 & "]
If strCol3 <> "" The
strFormula = strFormula & "&RC[" & strCol3 & "]
End I

' The formula concatenates the key columns
Selection.FormulaR1C1 = strFormul
Selection.Cop
Selection.PasteSpecial Paste:=xlPasteValues,
Operation:=xlNone, SkipBlanks:=False,
Transpose:=Fals
Application.CutCopyMode = Fals

' Sort the data on the key and the sequence number
Range("A3:" & strNbCol & lngNbLinesBeginning).Selec
Selection.Sort Key1:=Range("E3"), Key2:=Range("D3"),
Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1,
MatchCase:=False, Orientation:=xlTopToBottom,
DataOption1:=xlSortNormal, DataOption2:=xlSortTextAsNumber
intColFilter = Range("F2").End(xlToRight).Column +
intColFiltre = NbToAA(intColFilter

' Use advanced filter on column E
Range("E2").Selec
ActiveCell.Value = "Key
Range(Selection, Selection.End(xlDown)).Selec
strRangeAddress = ActiveWindow.RangeSelection.Addres
Range(strRangeAddress).AdvancedFilter
Action:=xlFilterCopy,
CopyToRange:=Columns(intColFiltre & ":" & intColFiltre),
Unique:=Tru

' Load unique values in a dynamic array
Range(intColFiltre & "1").Selec
lngNbSearch =
ActiveCell.Offset(1, 0).Selec
Do While ActiveCell.Value <> "
lngNbSearch = lngNbSearch +
ReDim Preserve strSearch(lngNbSearch
strSearch(lngNbSearch) = ActiveCell.Valu
ActiveCell.Offset(1, 0).Selec
Loo
Columns(intColFiltre).Delet

' For each code found, replace the first occurrence with
' the string "###" concatenated to the cell value.
' ***** HERE IS THE PROBLEM - BEGINNING *****
Columns("E:E").Select
For lngI = 1 To lngNbSearch
Selection.Find(What:=strSearch(lngI), After:=ActiveCell, _
LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByColumns, _
SearchDirection:=xlNext, MatchCase:=False, _
SearchFormat:=False).Activate
ActiveCell.Replace What:=strSearch(lngI), _
Replacement:=strConstNumbers & strSearch(lngI), _
LookAt:=xlWhole, SearchOrder:=xlByColumns, MatchCase:=False, _
SearchFormat:=False, ReplaceFormat:=False
Next lngI
' ***** HERE IS THE PROBLEM - END *****

' Replace each other values by "---".
Columns("E:E").Select
For lngI = 1 To lngNbSearch
Selection.Replace What:=strSearch(lngI), _
Replacement:=strConstDashes, _
LookAt:=xlWhole, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Next lngI

' Replace the "###" with nothing.
Columns("E:E").Select
Selection.Replace What:=strConstNumbers, _
Replacement:="", _
LookAt:=xlPart, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
ReDim strSearch(0)

' Sort the data on column E.
Range("A3:" & strNbCol & lngNbLinesBeginning).Select
Selection.Sort Key1:=Range("E3"), _
Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1, _
MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal

' Build a formula in cell D2 pour to get the number of "---".
Range("D2").Select
Selection.FormulaR1C1 = "=COUNTIF(R[1]C[1]:R[" & _
lngNbLinesBeginning & "]C[1],""---"")"
lngNbLines = ActiveCell.Value

' Select and delete the lines.
Rows("3:" & lngNbLines + 2).Select
Selection.Delete Shift:=xlUp

' Count the number of rows left.
lngNbLinesFin = Range("F3").End(xlDown).Row

' Sort the data on column D (sequence).
Range("A3:" & strNbCol & lngNbLinesFin).Select
Selection.Sort Key1:=Range("D3"), Order1:=xlAscending, _
Header:=xlGuess, OrderCustom:=1, MatchCase:=False, _
Orientation:=xlTopToBottom, DataOption1:=xlSortTextAsNumbers

' Delete temporary columns D et E.
Columns("D:E").Select
Selection.Delete Shift:=xlToLeft
End Sub
' *******************************************************************
Thanks again.
 
L

LarryP

I have something similar set up as a macro in my
Personal.xls because I use it all the time. On a few
thousand rows it runs fairly quickly; for 40,000 you might
have to go have a cup of coffee! What mine does is, you
select the first cell in any contiguous column of data,
run the macro, and it (1) inserts a new column, (2)
scrolls down through the entire data column putting
either "UNIQUE" or "DUPLICATE" in the new column, and
gives you a result count at the end of the process. After
that you can filter and delete all the "DUPLICATE" rows,
then get rid of the added column. Like yours, this macro
is populating a dynamic array, so the more data, the
slower it runs. May or may not have any advantages over
your approach, but for whatever it's worth, here's the
code:

***************************************************
Sub Duplicates()
Dim strValArray() As String
Dim lngCounter As Long

Selection.EntireColumn.Insert
ReDim strValArray(0)
strValArray(0) = ActiveCell.Offset(0, 1).Value
ActiveCell.Value = "Unique"
lngCounter = 1
Do While ActiveCell.Offset(1, 1).Value > ""
ActiveCell.Offset(1, 0).Select
'FindDups
Dim x As Long
x = 0
Do Until ActiveCell.Value = "Duplicate" Or x =
UBound(strValArray) + 1
If strValArray(x) = ActiveCell.Offset(0,
1).Value Then
ActiveCell.Value = "Duplicate"
Else
x = x + 1
End If
Loop


If ActiveCell.Value <> "Duplicate" Then
ReDim Preserve strValArray(UBound(strValArray)
+ 1)
strValArray(UBound(strValArray)) =
ActiveCell.Offset(0, 1).Value
ActiveCell.Value = "Unique"
End If
lngCounter = lngCounter + 1
Loop
MsgBox ("Items checked = " & lngCounter & ", Unique
Count = " &
UBound(strValArray) + 1)
End Sub

*******************************************************
-----Original Message-----
Hi everyone,
I have 40000 row of data and need to delete duplicate
rows. Of course, I have checked the Excel programming
workgroups and Chip Pearson's site. I found many
interesting ideas but I need a bit more. I found out that
when I use the advanced filter on the key row, it is a
significant gain on time processing. In that particular
case, what I do is load each unique value extracted in a
dynamic array. Then I select the data column and, for each
unique key, modify the first occurrence found by adding a
string of 3 "#" in front of it. After that, I replace all
the other occurrences by "---" (suggestion of CPearson).
Finally, I replace all "###" with an empty string. I can
then sort the rows, find out the number of rows
containing "---" and delete them. That way of doing works
faster when there are more than 1000 rows. But the one
part where I loop to replace the first occurrence of each
key is too slow. I would like to know if there is some
faster way to do that job.
The Sub procedure below is called by a main procedure (I
have over 20 sheets to deal with). The parameters are the
columns (2 or 3) that form the key (strCol1 = "1", strCol2
= "2" and strCol3 = "3".
Thanks in advance.
Here is the code:
' ***********************************************************
********
Sub DeleteDuplicateRows(ByVal strCol1, ByVal strCol2, ByVal strCol3)
' Jac Tremblay 2004-05-25
Dim strFormula As String
Dim strRangeAddress As String
Dim strSearch() As String
Dim lngNbSearch As Long
Const strConstDashes As String = "---"
Const strConstNumbers As String = "###"
Dim intColFilter As Integer
Dim intColFiltre As String
Dim lngI As Long
Dim lngNbLines As Long

' The number of data rows is found in column D.
lngNbLinesBeginning = Range("D3").End(xlDown).Row
If lngNbLinesBeginning <= 3 Then
booSheetDone = True
Exit Sub
End If

' The number of columns is found in row 2.
intNbCol = Range("A2").End(xlToRight).Column
Columns("D:E").Select
Selection.Insert Shift:=xlToRight
intNbCol = intNbCol + 2

' NbToAA is a fonction that returns the corresponding
' letter(s) of a column number.
strNbCol = NbToAA(intNbCol)
Range("D3").Select
Selection.FormulaR1C1 = "1"
Selection.AutoFill _
Destination:=Range("D3:D" & lngNbLinesBeginning), _
Type:=xlFillSeries
Range("E3:E" & lngNbLinesBeginning).Select
Selection.NumberFormat = "General"
strFormula = "=RC[" & strCol1 & "]&RC[" & strCol2 & "]"
If strCol3 <> "" Then
strFormula = strFormula & "&RC[" & strCol3 & "]"
End If

' The formula concatenates the key columns.
Selection.FormulaR1C1 = strFormula
Selection.Copy
Selection.PasteSpecial Paste:=xlPasteValues, _
Operation:=xlNone, SkipBlanks:=False, _
Transpose:=False
Application.CutCopyMode = False

' Sort the data on the key and the sequence number.
Range("A3:" & strNbCol & lngNbLinesBeginning).Select
Selection.Sort Key1:=Range("E3"), Key2:=Range("D3"), _
Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1, _
MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal, DataOption2:=xlSortTextAsNumbers
intColFilter = Range("F2").End(xlToRight).Column + 2
intColFiltre = NbToAA(intColFilter)

' Use advanced filter on column E.
Range("E2").Select
ActiveCell.Value = "Key"
Range(Selection, Selection.End(xlDown)).Select
strRangeAddress = ActiveWindow.RangeSelection.Address
Range(strRangeAddress).AdvancedFilter _
Action:=xlFilterCopy, _
CopyToRange:=Columns(intColFiltre & ":" & intColFiltre), _
Unique:=True

' Load unique values in a dynamic array.
Range(intColFiltre & "1").Select
lngNbSearch = 0
ActiveCell.Offset(1, 0).Select
Do While ActiveCell.Value <> ""
lngNbSearch = lngNbSearch + 1
ReDim Preserve strSearch(lngNbSearch)
strSearch(lngNbSearch) = ActiveCell.Value
ActiveCell.Offset(1, 0).Select
Loop
Columns(intColFiltre).Delete

' For each code found, replace the first occurrence with
' the string "###" concatenated to the cell value.
' ***** HERE IS THE PROBLEM - BEGINNING *****
Columns("E:E").Select
For lngI = 1 To lngNbSearch
Selection.Find(What:=strSearch(lngI), After:=ActiveCell, _
LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByColumns, _
SearchDirection:=xlNext, MatchCase:=False, _
SearchFormat:=False).Activate
ActiveCell.Replace What:=strSearch(lngI), _
Replacement:=strConstNumbers & strSearch (lngI), _
LookAt:=xlWhole, SearchOrder:=xlByColumns, MatchCase:=False, _
SearchFormat:=False, ReplaceFormat:=False
Next lngI
' ***** HERE IS THE PROBLEM - END *****

' Replace each other values by "---".
Columns("E:E").Select
For lngI = 1 To lngNbSearch
Selection.Replace What:=strSearch(lngI), _
Replacement:=strConstDashes, _
LookAt:=xlWhole, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Next lngI

' Replace the "###" with nothing.
Columns("E:E").Select
Selection.Replace What:=strConstNumbers, _
Replacement:="", _
LookAt:=xlPart, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
ReDim strSearch(0)

' Sort the data on column E.
Range("A3:" & strNbCol & lngNbLinesBeginning).Select
Selection.Sort Key1:=Range("E3"), _
Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1, _
MatchCase:=False, Orientation:=xlTopToBottom, _
DataOption1:=xlSortNormal

' Build a formula in cell D2 pour to get the number of "---".
Range("D2").Select
Selection.FormulaR1C1 = "=COUNTIF(R[1]C[1]:R[" & _
lngNbLinesBeginning & "]C[1],""---"")"
lngNbLines = ActiveCell.Value

' Select and delete the lines.
Rows("3:" & lngNbLines + 2).Select
Selection.Delete Shift:=xlUp

' Count the number of rows left.
lngNbLinesFin = Range("F3").End(xlDown).Row

' Sort the data on column D (sequence).
Range("A3:" & strNbCol & lngNbLinesFin).Select
Selection.Sort Key1:=Range("D3"), Order1:=xlAscending, _
Header:=xlGuess, OrderCustom:=1, MatchCase:=False, _
Orientation:=xlTopToBottom, DataOption1:=xlSortTextAsNumbers

' Delete temporary columns D et E.
Columns("D:E").Select
Selection.Delete Shift:=xlToLeft
End Sub
' ***********************************************************
********
Thanks again.
.
 
J

Jac Tremblay

Hi Larry,
I'm glad that someone answered me. My problem is a real one and I will find a way to do the job within an acceptable time.
Your code is good, but as you say, with 40000 rows, the process time may be too long. I also have built a similar macro (code is down below - NewDelDups) and found it too slow. Take a look at it, you may find some interesting tricks inspired by Dick Kusleika and some other MVPs.
I am presently studying Chip Pearson's macros but, as I said before, I need more processing power and technique because of the volume of the data. I include another macro, DeleteDuplicateRows by Chip Pearson and a bit by me, that I tested for time processing. With 20000 rows, I get 6 or 7 minutes. With 40000, I had enough time to watch the 3rd period... and canceled after...
I have a few ideas and will work on them and keep you informed.

' *******************************************************************
Sub NewDelDups()
' Jac Tremblay 2004-05-25
Dim lngRow As Long
Dim lngLine As Long
Dim intColumn As Integer
Dim strColumn As String
Dim strValue1 As String
Dim strValue2 As String

' Find the starting line number.
lngRow = ActiveCell.Row
lngLine = lngRow ' Variable de travail
intColumn = ActiveCell.Column
strColumn = NbEnAA.NbEnAA(intColumn)

' Find the value of the first cell.
lngLine = lngLine + 1
strValue1 = Range(strColumn & lngLine).Value
While Range(strColumn & lngLine).Value <> ""

' Get the next value.
lngLine = lngLine + 1
strValue2 = Range(strColumn & lngLine).Value

While strValue2 = strValue1

' If the 2nd is identical to the 1st, put a "z" in
' column C on the same line.
Range("C" & lngLine).Value = "z"

' Get the next value.
lngLine = lngLine + 1
strValue2 = Range(strColumn & lngLine).Value
Wend

' Values are now different, so change the 1st value and loop.
strValue1 = Range(strColumn & lngLine).Value
Wend
MsgBox "Job done. Number of lines to delete: " & lngLine
End Sub
' *******************************************************************
' *******************************************************************
Public Sub DeleteDuplicateRows()
' Chip Pearson
'
' This macro deletes duplicate rows in the selection. Duplicates are
' counted in the COLUMN of the active cell.

' Time processing variables
Dim sngStart As Single
Dim sngEnd As Single
Dim sngTime As Single
Dim intHour As Integer
Dim intMinute As Integer
Dim intSecond As Integer

' Chip Pearson's variables
Dim Col As Integer
Dim r As Long
Dim C As Range
Dim N As Long
Dim V As Variant
Dim Rng As Range

' Get the starting time.
sngStart = Timer

On Error GoTo EndMacro
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Col = ActiveCell.Column
If Selection.Rows.Count > 1 Then
Set Rng = Selection
Else
Set Rng = ActiveSheet.UsedRange.Rows
End If

N = 0
For r = Rng.Rows.Count To 1 Step -1
V = Rng.Cells(r, 1).Value
If Application.WorksheetFunction.CountIf(Rng.Columns(1), V) > 1 Then
Rng.Rows(r).EntireRow.Delete
N = N + 1
End If
Next r

' Get the final time and display the result.
sngEnd = Timer
sngTime = sngEnd - sngStart
intHour = sngTime \ 3600
intMinute = sngTime \ 60
intSecond = sngTime Mod 60

MsgBox "Process time : " & vbCrLf & vbCrLf & _
"hour(s) : " & intHour & vbCrLf & _
"minute(s) : " & intMinute & vbCrLf & _
"seconde(s) : " & intSecond & vbCrLf & vbCrLf & _
"Rows deleted: " & N
EndMacro:
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
End Sub

' **********************************************************************
Thanks again, LarryP. I will continue to look for an answer and will keep the workgroup informed.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top