UDF to merge duplicate rows

H

Henrietta Horne

Since you guys were so helpful, I hate to wear out my welcome, but...

Now that I have the high-index data from my previous question, I need
to process the list of words to eliminate duplicates. The table below
shows some sample data. The first column is the worksheet row numbers.
The Word column (B) contains the words. The Score column (C) contains
a count of how often that words occurs is a large collection of text
(400 million words). The Index column (D) contains the result of the
GetHighIndex UDF from my previous post.

I need another UDF to walk down the list looking for duplicate words
(tee, a, in). When it finds a duplicate, it will merge the two rows
into one by adding the Scores and deleting the second row.

In the example below, rows 5 & 6 would be combined to form one row
with a Score of 3301 (199 + 3102).

Rows 62-65 would be combined into one row with a Score of 7,140,219
(808 + 9,711 + 279,364 + 6,850,336).

Only exact matches are combined. The two "balloon" rows would be
combined as would the two "ballooning" rows, but would not include the
"balloonist" row.

Can I impose on someone to get this started?

The parts I am not sure how to do inside a UDF are:

1. How to step through the rows and address the cells (relatively).

2. How to delete a row.


B C D
4 Word Score Index
5 tee 199 2
6 tee 3,102 2
7 a 298 3
8 a 9,996,626 3
9 at 1,730,609 3
10 eat 69,484 3

62 in 808 6
63 in 9,711 6
64 in 279,364 6
65 in 6,850,336 6

2054 dear 97 11
2055 dear 2,015 11
2056 dear 3,364 11
2057 dear 8,417 11

32159 balloon 536 20
32160 balloon 4,887 20
32161 ballooning 28 20
32162 ballooning 82 20
32163 balloonist 51 20

48196 turquoise 435 26
48197 turquoise 718 26

49270 ad-lib 29 27
49271 ad-lib 46 27
49272 ad-libbed 40 27
49273 ad-libbing 19 27
 
C

Clif McIrvin

Henrietta Horne said:
Since you guys were so helpful, I hate to wear out my welcome, but...
Can I impose on someone to get this started?

The parts I am not sure how to do inside a UDF are:

1. How to step through the rows and address the cells (relatively).

The working procedure I'm posting below might give you some ideas to
work with, even though I'm making no attempt to apply it to your
question said:
2. How to delete a row.
one possibility:
Range("A5").Select
Selection.EntireRow.Delete

(code the result of recording a macro while manually selecting a cell
and using the UI to delete the row)
========== begin code

Sub Copy2BreakTable()
' 1/07/11 cm remove auto-filter; use range for find
' 1/25/10 cm add auto-filter on Plant ID ... partially tested
' 08 29 09 cm disable screen updates during startup

Const newRow = 2 ' New Data row
Const lookRows = 300 ' Only search newest rows

Dim ErrNum As Variant
Dim newDate, thisDate, currentDate
Dim newPlant, newLotID
Dim x2, x$, x1$
Dim SearchBegin, SearchEnd, SearchDirection
Dim firstRow, lastRow, currentRow ' cylinder data range
'Dim bPlantFilter As Boolean

'' turn off slow screen updating during searching
' Application.ScreenUpdating = False

Dim screenUpdateState As Boolean
Dim statusBarState As Boolean
Dim calcState As XlCalculation
Dim eventsState As Boolean
Dim displayPageBreakState As Boolean

With Application
screenUpdateState = .ScreenUpdating
statusBarState = .DisplayStatusBar
calcState = .Calculation
eventsState = .EnableEvents
End With

'turn off some Excel functionality so your code runs faster
With Application
.ScreenUpdating = False
.DisplayStatusBar = False
.Calculation = xlCalculationManual
.EnableEvents = False
End With

Set c = Nothing ' insure exit test is valid

If newSub() Then
' newSub returns False if activesheet not valid cylData
' copies new cylinder measurements to row newRow;
' returns with BreakTable!Data as ActiveSheet

' Look in last lookRows rows for match in LotID
' Begin search at current row; look up or down depending
' on date comparison; begin at last row if current row
' is outside of lookRows range

' gather data of interest...
newPlant = Cells(newRow, BreakTableColumns.Plant).Value
newLotID = Cells(newRow, BreakTableColumns.LotID).Value
newDate = Cells(newRow, BreakTableColumns.SampleDate).Value
Set r = Range("cylData") ' all cylinder data
firstRow = r.Row
lastRow = r.Rows.Count + firstRow - 1 'xx' marker row
SearchEnd = lastRow
SearchBegin = lastRow - lookRows
If SearchBegin < firstRow Then
SearchBegin = firstRow
End If
currentRow = Selection.Row
If currentRow < SearchBegin Or _
currentRow >= SearchEnd Then
' search from the bottom up -
' new break data tends to be recent
currentRow = SearchEnd
End If

Select Case newPlant
Case "MCSS"
Case "LCP" ' special processing for LCP worksheet lotID
If Mid(newLotID, 7, 1) = "+" Then
newLotID = Left(newLotID, 7)
ElseIf Right(newLotID, 2) = "++" Then
newLotID = Left(newLotID, 6) & "++"
Else
newLotID = Left(newLotID, 6)
End If
Cells(newRow, BreakTableColumns.LotID).Value = newLotID
newLotID = "xxxx" ' force search on date
Case Else
If Right(newLotID, 1) = "+" Then
newLotID = Left(newLotID, Len(newLotID) - 1)
End If
End Select

' Prepare to search for match on LotID
' search 'up' or 'down' depending on new date

currentDate = Cells(currentRow, BreakTableColumns.SampleDate).Value
Select Case DateDiff("d", currentDate, newDate)
Case Is < 0 ' newDate is older
SearchDirection = xlPrevious
Case Is >= 0 ' newDate is newer
SearchDirection = xlNext
End Select

' define search range

Set r = Range(Cells(SearchBegin, BreakTableColumns.LotID), _
Cells(SearchEnd, BreakTableColumns.LotID))
Set c = Cells(currentRow, BreakTableColumns.LotID)

Set c = r.Find(What:=newLotID, After:=c, _
LookIn:=xlValues, LookAt:=xlPart, _
SearchOrder:=xlByRows, _
SearchDirection:=SearchDirection, _
MatchCase:=False, SearchFormat:=False)
If c Is Nothing Then 'lot not found! -- look for date
Application.Speech.Speak "Lot not found", -1, 0, -1
For x2 = SearchBegin To SearchEnd
thisDate = Cells(x2, BreakTableColumns.SampleDate).Value
If IsEmpty(thisDate) Then Exit For
Select Case DateDiff("d", thisDate, newDate)
Case Is < 0 ' thisDate is future
Exit For
Case 0 ' same date
If DatePart("h", thisDate) = 0 Then
Exit For ' No Sample time (MCSS) so stop here
End If ' else ignore time; continue search
End Select
Next x2 ' SearchBegin To SearchEnd
Set c = Cells(x2, BreakTableColumns.SampleDate)
End If ' c Is Nothing Then lot not found! -- look for date

With c ' scroll to selected row
.Show
.EntireRow.Select
End With

Select Case newPlant
Case "MCSS" 'Set MCSS Summary not printed flag
flagMCSS = True
With Selection.Cells(1, 5).Interior
.ColorIndex = 6
.Pattern = xlSolid
End With
With Cells(4, 5)
.Value = "MCSS Summary not printed"
With .Interior
.ColorIndex = 6
.Pattern = xlSolid
End With
End With
End Select

End If

'' restore screen updates
'put this at the end of your code
With Application
.Calculation = calcState ' cancels cut/copy mode
.EnableEvents = eventsState
.DisplayStatusBar = statusBarState
.ScreenUpdating = screenUpdateState
End With

If Not c Is Nothing Then
Rows(newRow).Copy
End If

Set c = Nothing
Set r = Nothing

End Sub
 
D

Don Guillett Excel MVP

Since you guys were so helpful, I hate to wear out my welcome, but...

Now that I have the high-index data from my previous question, I need
to process the list of words to eliminate duplicates. The table below
shows some sample data. The first column is the worksheet row numbers.
The Word column (B) contains the words. The Score column (C) contains
a count of how often that words occurs is a large collection of text
(400 million words). The Index column (D) contains the result of the
GetHighIndex UDF from my previous post.

I need another UDF to walk down the list looking for duplicate words
(tee, a, in). When it finds a duplicate, it will merge the two rows
into one by adding the Scores and deleting the second row.

In the example below, rows 5 & 6 would be combined to form one row
with a Score of 3301 (199 + 3102).

Rows 62-65 would be combined into one row with a Score of 7,140,219
(808 + 9,711 + 279,364 + 6,850,336).

Only exact matches are combined. The two "balloon" rows would be
combined as would the two "ballooning" rows, but would not include the
"balloonist" row.

Can I impose on someone to get this started?

The parts I am not sure how to do inside a UDF are:

1. How to step through the rows and address the cells (relatively).

2. How to delete a row.

           B         C      D
    4    Word      Score  Index
    5  tee           199     2
    6  tee         3,102     2
    7  a             298     3
    8  a       9,996,626     3
    9  at      1,730,609     3
   10  eat        69,484     3

   62  in            808     6
   63  in          9,711     6
   64  in        279,364     6
   65  in      6,850,336     6

 2054  dear           97    11
 2055  dear        2,015    11
 2056  dear        3,364    11
 2057  dear        8,417    11

32159  balloon       536    20
32160  balloon     4,887    20
32161  ballooning     28    20
32162  ballooning     82    20
32163  balloonist     51    20

48196  turquoise     435    26
48197  turquoise     718    26

49270  ad-lib         29    27
49271  ad-lib         46    27
49272  ad-libbed      40    27
49273  ad-libbing     19    27

Trying to copy your data for testing didn't work so what you need is a
looping macro from the bottom up to do this.

"If desired, send your file to dguillett @gmail.com I will only look
if:
1. You send a copy of this message on an inserted sheet
2. You give me the newsgroup and the subject line
3. You send a clear explanation of what you want
4. You send before/after examples and expected results."
 
C

Charabeuh

Hello,

Another way without VBA:

I assumed your data are in columns A,B,C
(e.g. Word,Score,Index <==> A1 to C10000)

Select an empty cell on your sheet ( e.g. cell F1)
Select in your menu 'Data' the command 'Consolidate'
Into the windows:
- select in the list function "Sum"
- select in the textbox "Référence" the area of your data to sum
(i.e. A1:B10000)
- click 'Add'
- then select the option 'Top row' and 'Left column')
- clic 'OK'

You will get a table with the sum of score.

Into cell F2, put the following formula to get your index:
=MAX(FIND(MID(F2,ROW(INDIRECT("1:" &
LEN(F2))),1),"etaoinsrhldcumgfpwybvkjxzq-"))
(this is an array formula, you should validate this formula with the three
keys CTRL+SHIFT+Enter instead of the single key Enter)

Then drag down this formula to the end of your consolidated data.


-----------------------------------------------------------------------------------------------------------------
"Henrietta Horne" a écrit dans le message de groupe de discussion :
(e-mail address removed)...

Since you guys were so helpful, I hate to wear out my welcome, but...

Now that I have the high-index data from my previous question, I need
to process the list of words to eliminate duplicates. The table below
shows some sample data. The first column is the worksheet row numbers.
The Word column (B) contains the words. The Score column (C) contains
a count of how often that words occurs is a large collection of text
(400 million words). The Index column (D) contains the result of the
GetHighIndex UDF from my previous post.

I need another UDF to walk down the list looking for duplicate words
(tee, a, in). When it finds a duplicate, it will merge the two rows
into one by adding the Scores and deleting the second row.

In the example below, rows 5 & 6 would be combined to form one row
with a Score of 3301 (199 + 3102).

Rows 62-65 would be combined into one row with a Score of 7,140,219
(808 + 9,711 + 279,364 + 6,850,336).

Only exact matches are combined. The two "balloon" rows would be
combined as would the two "ballooning" rows, but would not include the
"balloonist" row.

Can I impose on someone to get this started?

The parts I am not sure how to do inside a UDF are:

1. How to step through the rows and address the cells (relatively).

2. How to delete a row.


B C D
4 Word Score Index
5 tee 199 2
6 tee 3,102 2
7 a 298 3
8 a 9,996,626 3
9 at 1,730,609 3
10 eat 69,484 3

62 in 808 6
63 in 9,711 6
64 in 279,364 6
65 in 6,850,336 6

2054 dear 97 11
2055 dear 2,015 11
2056 dear 3,364 11
2057 dear 8,417 11

32159 balloon 536 20
32160 balloon 4,887 20
32161 ballooning 28 20
32162 ballooning 82 20
32163 balloonist 51 20

48196 turquoise 435 26
48197 turquoise 718 26

49270 ad-lib 29 27
49271 ad-lib 46 27
49272 ad-libbed 40 27
49273 ad-libbing 19 27
 
H

Henrietta Horne

The working procedure I'm posting below might give you some ideas to
work with, even though I'm making no attempt to apply it to your

one possibility:
Range("A5").Select
Selection.EntireRow.Delete

(code the result of recording a macro while manually selecting a cell
and using the UI to delete the row)

========== begin code

Sub Copy2BreakTable()
' 1/07/11 cm remove auto-filter; use range for find
' 1/25/10 cm add auto-filter on Plant ID ... partially tested
' 08 29 09 cm disable screen updates during startup

Const newRow = 2 ' New Data row
Const lookRows = 300 ' Only search newest rows

Dim ErrNum As Variant
Dim newDate, thisDate, currentDate
Dim newPlant, newLotID
Dim x2, x$, x1$
Dim SearchBegin, SearchEnd, SearchDirection
Dim firstRow, lastRow, currentRow ' cylinder data range
'Dim bPlantFilter As Boolean

'' turn off slow screen updating during searching
' Application.ScreenUpdating = False

Dim screenUpdateState As Boolean
Dim statusBarState As Boolean
Dim calcState As XlCalculation
Dim eventsState As Boolean
Dim displayPageBreakState As Boolean

With Application
screenUpdateState = .ScreenUpdating
statusBarState = .DisplayStatusBar
calcState = .Calculation
eventsState = .EnableEvents
End With

'turn off some Excel functionality so your code runs faster
With Application
.ScreenUpdating = False
.DisplayStatusBar = False
.Calculation = xlCalculationManual
.EnableEvents = False
End With

Set c = Nothing ' insure exit test is valid

If newSub() Then
' newSub returns False if activesheet not valid cylData
' copies new cylinder measurements to row newRow;
' returns with BreakTable!Data as ActiveSheet

' Look in last lookRows rows for match in LotID
' Begin search at current row; look up or down depending
' on date comparison; begin at last row if current row
' is outside of lookRows range

' gather data of interest...
newPlant = Cells(newRow, BreakTableColumns.Plant).Value
newLotID = Cells(newRow, BreakTableColumns.LotID).Value
newDate = Cells(newRow, BreakTableColumns.SampleDate).Value
Set r = Range("cylData") ' all cylinder data
firstRow = r.Row
lastRow = r.Rows.Count + firstRow - 1 'xx' marker row
SearchEnd = lastRow
SearchBegin = lastRow - lookRows
If SearchBegin < firstRow Then
SearchBegin = firstRow
End If
currentRow = Selection.Row
If currentRow < SearchBegin Or _
currentRow >= SearchEnd Then
' search from the bottom up -
' new break data tends to be recent
currentRow = SearchEnd
End If

Select Case newPlant
Case "MCSS"
Case "LCP" ' special processing for LCP worksheet lotID
If Mid(newLotID, 7, 1) = "+" Then
newLotID = Left(newLotID, 7)
ElseIf Right(newLotID, 2) = "++" Then
newLotID = Left(newLotID, 6) & "++"
Else
newLotID = Left(newLotID, 6)
End If
Cells(newRow, BreakTableColumns.LotID).Value = newLotID
newLotID = "xxxx" ' force search on date
Case Else
If Right(newLotID, 1) = "+" Then
newLotID = Left(newLotID, Len(newLotID) - 1)
End If
End Select

' Prepare to search for match on LotID
' search 'up' or 'down' depending on new date

currentDate = Cells(currentRow, BreakTableColumns.SampleDate).Value
Select Case DateDiff("d", currentDate, newDate)
Case Is < 0 ' newDate is older
SearchDirection = xlPrevious
Case Is >= 0 ' newDate is newer
SearchDirection = xlNext
End Select

' define search range

Set r = Range(Cells(SearchBegin, BreakTableColumns.LotID), _
Cells(SearchEnd, BreakTableColumns.LotID))
Set c = Cells(currentRow, BreakTableColumns.LotID)

Set c = r.Find(What:=newLotID, After:=c, _
LookIn:=xlValues, LookAt:=xlPart, _
SearchOrder:=xlByRows, _
SearchDirection:=SearchDirection, _
MatchCase:=False, SearchFormat:=False)
If c Is Nothing Then 'lot not found! -- look for date
Application.Speech.Speak "Lot not found", -1, 0, -1
For x2 = SearchBegin To SearchEnd
thisDate = Cells(x2, BreakTableColumns.SampleDate).Value
If IsEmpty(thisDate) Then Exit For
Select Case DateDiff("d", thisDate, newDate)
Case Is < 0 ' thisDate is future
Exit For
Case 0 ' same date
If DatePart("h", thisDate) = 0 Then
Exit For ' No Sample time (MCSS) so stop here
End If ' else ignore time; continue search
End Select
Next x2 ' SearchBegin To SearchEnd
Set c = Cells(x2, BreakTableColumns.SampleDate)
End If ' c Is Nothing Then lot not found! -- look for date

With c ' scroll to selected row
.Show
.EntireRow.Select
End With

Select Case newPlant
Case "MCSS" 'Set MCSS Summary not printed flag
flagMCSS = True
With Selection.Cells(1, 5).Interior
.ColorIndex = 6
.Pattern = xlSolid
End With
With Cells(4, 5)
.Value = "MCSS Summary not printed"
With .Interior
.ColorIndex = 6
.Pattern = xlSolid
End With
End With
End Select

End If

'' restore screen updates
'put this at the end of your code
With Application
.Calculation = calcState ' cancels cut/copy mode
.EnableEvents = eventsState
.DisplayStatusBar = statusBarState
.ScreenUpdating = screenUpdateState
End With

If Not c Is Nothing Then
Rows(newRow).Copy
End If

Set c = Nothing
Set r = Nothing

End Sub

Wow. I'll need a few days to digest that. Did you just type that up on
the fly or did you already have some similar code?
 
C

Clif McIrvin

Wow. I'll need a few days to digest that. Did you just type that up on
the fly or did you already have some similar code?

It happens that I'd just done some fine-tuning on that routine to speed
it up, so I just opened the module and did a copy / paste for you <g>.

Feel free to come back with questions ... I didn't really offer any
explanation. That code is taking new data that has been placed into row
2 and finding the correct place in the table for that new data. When
originally written, it was based on a recorded macro and used lots of
..Select and .Activate ... and as the table grew it got slower and
slower. Now, it finishes in sub-second times.

Oh ... that routine was one of my very first ventures into VBA, and what
it is now reflects a lot of what I've learned in the three years since.

Have fun with it!
 
H

Henrietta Horne

Trying to copy your data for testing didn't work so what you need is a
looping macro from the bottom up to do this.

"If desired, send your file to dguillett @gmail.com I will only look
if:
1. You send a copy of this message on an inserted sheet
2. You give me the newsgroup and the subject line
3. You send a clear explanation of what you want
4. You send before/after examples and expected results."

I'll try it myself first. I (when) I fail, I may send it over. Thanks.
 
H

Henrietta Horne

Hello,

Another way without VBA:

I assumed your data are in columns A,B,C
(e.g. Word,Score,Index <==> A1 to C10000)

Select an empty cell on your sheet ( e.g. cell F1)
Select in your menu 'Data' the command 'Consolidate'
Into the windows:
- select in the list function "Sum"
- select in the textbox "Référence" the area of your data to sum
(i.e. A1:B10000)
- click 'Add'
- then select the option 'Top row' and 'Left column')
- clic 'OK'

You will get a table with the sum of score.

Into cell F2, put the following formula to get your index:
=MAX(FIND(MID(F2,ROW(INDIRECT("1:" &
LEN(F2))),1),"etaoinsrhldcumgfpwybvkjxzq-"))
(this is an array formula, you should validate this formula with the three
keys CTRL+SHIFT+Enter instead of the single key Enter)

Then drag down this formula to the end of your consolidated data.

Thanks. I'll look into this solution, too.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top