KillDuplicateParas

G

Greg Maxey

I came across an old post in Google groups for deleted duplicated
lines of text in a document.

It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.

It had two If ... End If blocks. The first check the para range
length. If = then the second performed a text comparison. I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.

The procedure worked as advertised, however with a longer document it
took a long time.

I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph. Down from 200 seconds to 75 seconds!

Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!

Anyway, I just wanted to share this with the group:

Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
With Application
SBar = .DisplayStatusBar
.DisplayStatusBar = True
.ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
'*** Stet out first if block to delete duplicated empty paragraphs.
If Len(oParRef.Range.Text) > 1 Then
Do
'An empty last paragraph may throw an error on the last loop.
On Error GoTo Err_Exit
If oParRef.Range = oParChk.Range Then
oParChk.Range.Delete
Else
Set oParChk = oParChk.Next
End If
Loop Until oParChk Is Nothing
End If '***
Set oParRef = oParRef.Next
'Skip errors.
On Error Resume Next
Set oParChk = oParRef.Next
On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
.StatusBar = False
.DisplayStatusBar = SBar
.ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub
 
G

Greg Maxey

Oops. I forgot to mention the method used one of the For loops to
update the status bar.

Add:
'Application.StatusBar = ActiveDocument.Paragraphs.Count & "
paragraphs to check. "

Just after the On Error GoTo 0 line.

Also all my test where done with only two or three different
paragraphs repeated many times. As the number of different paragraphs
increases (in any method I suppose) the time to process will increase.
 
H

Helmut Weber

Hi Submariner,

how about this one:

Sub Makro6x()
Dim t As Single
t = Timer
Dim prg1 As Paragraph
Dim prg2 As Paragraph
For Each prg1 In ActiveDocument.Range.Paragraphs
For Each prg2 In ActiveDocument.Range.Paragraphs
If prg1.Range.Text = prg2.Range.Text Then
If prg1.Range.start <> prg2.Range.start Then
prg2.Range.Delete
End If
End If
Next
Next
MsgBox Timer - t
End Sub

800 paragraphs of kind rand(1,10)
104 pages
3.4 seconds
10 paragraphs left over.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
G

Greg Maxey

Helmet,

That is clear code and fast, but not as fast as 2.714 seconds (maybe
my processors is faster) ;-)

Here is the real difference. Make each of those 800 paragraphs
slighgly different:

Sub ScratchMacro()
Dim i As Long
For i = 1 To ActiveDocument.Paragraphs.Count
ActiveDocument.Range.Paragraphs(i).Range.Characters.First = i
Next i
End Sub

Then delete the numbers from the last two. We now have 799 different
paragraphs and 2 duplicates. Run both codes again:

Your version: 123 seconds
My version: 69 seconds

Both are lightening fast when all the paragraphs are duplicates,
because that first loop only runs once.

I think the speed efficiency in the .Next method is due to the
processor doesn't have to a keep track of the paragraph count. ???

Cheers
 
H

Helmut Weber

Hi Greg,

I see now that you are talking about the case
that a paragraph is immediatly followed by a duplicate,
whereas my code was meant to remove duplicate paragraphs
wherever they appear.

For removing empty paragraphs from the doc's end,
I use this code:

While ActiveDocument.Characters.Last.Previous = Chr(13)
ActiveDocument.Characters.Last.Delete
Wend

Cheers

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
M

macropod

Hi Greg,

Looks like a derivative of something I developed & posted. My full version is:

Dim SBar As Boolean ' Status Bar flag
Dim TrkStatus As Boolean ' Track Changes flag

Sub KillDuplicateParas()
Call MacroEntry
Dim i As Long, j As Long
Dim eTime As Single
eTime = Timer
With ActiveDocument
If .Paragraphs.Count > 1 Then
' Loop backwards to preserve paragraph count & indexing.
' Start at 2nd-last paragraph.
For i = .Paragraphs.Count - 1 To 1 Step -1
' Ignore empty paragraphs
If Len(.Paragraphs(i).Range.Text) > 1 Then
' Loop backwards to preserve paragraph count & indexing.
' Stop at last preceding paragraph.
For j = .Paragraphs.Count To i + 1 Step -1
' Report progress on Status Bar.
Application.StatusBar = i & " paragraphs to check. "
' No point in checking paragraphs of unequal length.
If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then
' Test strings of paragraphs of equal length.
If .Paragraphs(i).Range = .Paragraphs(j).Range Then
' Delete duplicate paragraph.
.Paragraphs(j).Range.Delete
End If
End If
Next
End If
Next
End If
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Call MacroExit
End Sub

Private Sub MacroEntry()
' Store current Status Bar status, then switch on
SBar = Application.DisplayStatusBar
Application.DisplayStatusBar = True
' Store current Track Changes status, then switch off
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
' Turn Off Screen Updating
Application.ScreenUpdating = False
End Sub

Private Sub MacroExit()
' Clear the Status Bar
Application.StatusBar = False
' Restore original Status Bar status
Application.DisplayStatusBar = SBar
' Restore original Track Changes status
ActiveDocument.TrackRevisions = TrkStatus
' Restore Screen Updating
Application.ScreenUpdating = True
End Sub

I think they key difference is that my code checks all paras against each other, whereas yours only checks adjacent paras. I might
incorporate Helmut's revisions, though, since they seem to speed things up a bit.

Cheers
 
H

Helmut Weber

Hi,

hmm...
maybe it is the structure of the data,
but this takes minutes or maybe will run endlessly,
though I can't see a reason for that:

Sub KillDuplicateParas()
Dim i As Long, j As Long
Dim eTime As Single
eTime = Timer
With ActiveDocument
For i = .Paragraphs.Count - 1 To 1 Step -1
For j = .Paragraphs.Count To i + 1 Step -1
If Len(.Paragraphs(i).Range) = Len(.Paragraphs(j).Range) Then
If .Paragraphs(i).Range = .Paragraphs(j).Range Then
.Paragraphs(j).Range.Delete
End If
End If
Next
Next
End With
MsgBox Timer - eTime

End Sub

800 Paragraphs, 4 unique paragraphs.
paragraph length about 460 characters.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
H

Helmut Weber

Hi,

this one takes 1.15 seconds, here and now,
under the above mentioned conditions:

Sub Makro6xx()
Dim prg1 As Paragraph
Dim t As Single
t = Timer
For Each prg1 In ActiveDocument.Range.Paragraphs
If Not prg1.Next Is Nothing Then
If prg1.Range.Text = prg1.Next.Range.Text Then
prg1.Next.Range.Delete
End If
End If
Next
MsgBox Timer - t
End Sub

However, IMHO, like other alternative solutions,
except from comparing each paragraph to each other paragraph,
see above, too, it does not take care of the fact that deleting
one of two immediatly adjacent paragraphs may
result again in two adjacent equal paragraphs.

Happy thinking!

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
G

Greg Maxey

Hi macropod.

It is a derivative of your code and there is a point of checking paragraphs
of unequal length because it is faster than first checking all paragraphs
to see if they are of equal length ;-).

< I think they key difference is that my code checks all paras against each
other, whereas yours only checks adjacent paras. I might
<incorporate Helmut's revisions, though, since they seem to speed things up
a bit.

I don't know your test results but the code I posted has nothing to do with
adjacent pararapraphs. You can take those 799 unique paragraphs and add 10,
20 or a 100 duplicates anywhere in the mix, run the code and the duplicates
are removed.

All that said, there appears to be more to this that I don't understand. At
work yesterday with the The quick brown fox example Helmut gave, my code
was 69 seconds compared to Helmut's 123 seconds. Today at home with the
much longer Word2007 =(rand) text the processing 800 paragraphs takes my
method 272 seconds and Helmut's 236.

Perhaps there is no best way ;-)




--
Greg Maxey/Word MVP
See:
http://gregmaxey.mvps.org/word_tips.htm
For some helpful tips using Word.

macropod said:
Hi Greg,

Looks like a derivative of something I developed & posted. My full version
is:

Dim SBar As Boolean ' Status Bar flag
Dim TrkStatus As Boolean ' Track Changes flag

Sub KillDuplicateParas()
Call MacroEntry
Dim i As Long, j As Long
Dim eTime As Single
eTime = Timer
With ActiveDocument
If .Paragraphs.Count > 1 Then
' Loop backwards to preserve paragraph count & indexing.
' Start at 2nd-last paragraph.
For i = .Paragraphs.Count - 1 To 1 Step -1
' Ignore empty paragraphs
If Len(.Paragraphs(i).Range.Text) > 1 Then
' Loop backwards to preserve paragraph count & indexing.
' Stop at last preceding paragraph.
For j = .Paragraphs.Count To i + 1 Step -1
' Report progress on Status Bar.
Application.StatusBar = i & " paragraphs to check. "
' No point in checking paragraphs of unequal length.
If Len(.Paragraphs(i).Range) =
Len(.Paragraphs(j).Range) Then
' Test strings of paragraphs of equal length.
If .Paragraphs(i).Range = .Paragraphs(j).Range Then
' Delete duplicate paragraph.
.Paragraphs(j).Range.Delete
End If
End If
Next
End If
Next
End If
End With
' Report time taken. Elapsed time calculation allows for execution to
extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & "
seconds."
Call MacroExit
End Sub

Private Sub MacroEntry()
' Store current Status Bar status, then switch on
SBar = Application.DisplayStatusBar
Application.DisplayStatusBar = True
' Store current Track Changes status, then switch off
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
' Turn Off Screen Updating
Application.ScreenUpdating = False
End Sub

Private Sub MacroExit()
' Clear the Status Bar
Application.StatusBar = False
' Restore original Status Bar status
Application.DisplayStatusBar = SBar
' Restore original Track Changes status
ActiveDocument.TrackRevisions = TrkStatus
' Restore Screen Updating
Application.ScreenUpdating = True
End Sub

I think they key difference is that my code checks all paras against each
other, whereas yours only checks adjacent paras. I might incorporate
Helmut's revisions, though, since they seem to speed things up a bit.

Cheers

--
macropod
[MVP - Microsoft Word]
-------------------------

Greg Maxey said:
I came across an old post in Google groups for deleted duplicated
lines of text in a document.

It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.

It had two If ... End If blocks. The first check the para range
length. If = then the second performed a text comparison. I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.

The procedure worked as advertised, however with a longer document it
took a long time.

I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph. Down from 200 seconds to 75 seconds!

Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!

Anyway, I just wanted to share this with the group:

Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
With Application
SBar = .DisplayStatusBar
.DisplayStatusBar = True
.ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
'*** Stet out first if block to delete duplicated empty paragraphs.
If Len(oParRef.Range.Text) > 1 Then
Do
'An empty last paragraph may throw an error on the last loop.
On Error GoTo Err_Exit
If oParRef.Range = oParChk.Range Then
oParChk.Range.Delete
Else
Set oParChk = oParChk.Next
End If
Loop Until oParChk Is Nothing
End If '***
Set oParRef = oParRef.Next
'Skip errors.
On Error Resume Next
Set oParChk = oParRef.Next
On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
.StatusBar = False
.DisplayStatusBar = SBar
.ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub
 
K

Klaus Linke

I do that with a wildcard replacement...

Find what: ([!^13]@^13){2,}
Replace with: \1

It's not perfect -- Say

abcd
cd
cd

would be replaced with

abcd

But usually, I just risk that. And it's fast <g>

Klaus
 
G

Greg Maxey

Klaus,

No argument that that is the fastest method for eliminating "adjacent"
ducplicate paragraphs. However run that code with:

The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 the lazy dogs.
The quick and extremely agile brown fox jumped over 3 the lazy dogs.
The quick and extremely agile brown fox jumped over 4 the lazy dogs.
The quick and extremely agile brown fox jumped over 5 the lazy dogs.
The quick brown fox jumped over the lazy dog.

You are left with:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.
The quick brown fox jumped over the lazy dog.

Where my desired result is:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.








I do that with a wildcard replacement...

Find what: ([!^13]@^13){2,}
Replace with: \1

It's not perfect -- Say

abcd
cd
cd

would be replaced with

abcd

But usually, I just risk that. And it's fast <g>

Klaus



I came across an old post in Google groups for deleted duplicated
lines of text in a document.
It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.
It had two If ... End If blocks. The first check the para range
length. If = then the second performed a text comparison. I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.
The procedure worked as advertised, however with a longer document it
took a long time.
I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph. Down from 200 seconds to 75 seconds!
Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!
Anyway, I just wanted to share this with the group:
Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
With Application
SBar = .DisplayStatusBar
.DisplayStatusBar = True
.ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
'*** Stet out first if block to delete duplicated empty paragraphs.
If Len(oParRef.Range.Text) > 1 Then
Do
'An empty last paragraph may throw an error on the last loop.
On Error GoTo Err_Exit
If oParRef.Range = oParChk.Range Then
oParChk.Range.Delete
Else
Set oParChk = oParChk.Next
End If
Loop Until oParChk Is Nothing
End If '***
Set oParRef = oParRef.Next
'Skip errors.
On Error Resume Next
Set oParChk = oParRef.Next
On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
.StatusBar = False
.DisplayStatusBar = SBar
.ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub- Hide quoted text -

- Show quoted text -
 
K

Klaus Linke

Hi Greg,

Hadn't noticed that. What you could do is autonumber the paragraphs, turn
the numbering into hard text.
Selection.Range.ListFormat.ConvertNumbersToText

Now each paragraph has a number followed by a tab followed by the old text.

Then sort by fields (with the tab as separator, sorting by the second
field).
Then remove the duplicates with a wildcard search:
Find what: ([0-9]@^t)([!^13]@^13)([0-9]@^t)\2
Replace with: \1\2
(Repeat until nothing more is found)

Sort by the first field (numbers) to get back the old sequence.

Then remove the numbers and tabs with a wildcard search,
Find what: (^13)[0-9]@^t
Replace with: \1

Regards,
Klaus



Greg Maxey said:
Klaus,

No argument that that is the fastest method for eliminating "adjacent"
ducplicate paragraphs. However run that code with:

The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 the lazy dogs.
The quick and extremely agile brown fox jumped over 3 the lazy dogs.
The quick and extremely agile brown fox jumped over 4 the lazy dogs.
The quick and extremely agile brown fox jumped over 5 the lazy dogs.
The quick brown fox jumped over the lazy dog.

You are left with:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.
The quick brown fox jumped over the lazy dog.

Where my desired result is:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.








I do that with a wildcard replacement...

Find what: ([!^13]@^13){2,}
Replace with: \1

It's not perfect -- Say

abcd
cd
cd

would be replaced with

abcd

But usually, I just risk that. And it's fast <g>

Klaus

Newsbeitrag

I came across an old post in Google groups for deleted duplicated
lines of text in a document.
It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.
It had two If ... End If blocks. The first check the para range
length. If = then the second performed a text comparison. I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.
The procedure worked as advertised, however with a longer document it
took a long time.
I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph. Down from 200 seconds to 75 seconds!
Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!
Anyway, I just wanted to share this with the group:
Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
With Application
SBar = .DisplayStatusBar
.DisplayStatusBar = True
.ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
'*** Stet out first if block to delete duplicated empty paragraphs.
If Len(oParRef.Range.Text) > 1 Then
Do
'An empty last paragraph may throw an error on the last loop.
On Error GoTo Err_Exit
If oParRef.Range = oParChk.Range Then
oParChk.Range.Delete
Else
Set oParChk = oParChk.Next
End If
Loop Until oParChk Is Nothing
End If '***
Set oParRef = oParRef.Next
'Skip errors.
On Error Resume Next
Set oParChk = oParRef.Next
On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
.StatusBar = False
.DisplayStatusBar = SBar
.ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub- Hide quoted text -

- Show quoted text -
 
G

Greg Maxey

Klaus,

The method looks plausble, but I couldn't work out the code :-(

I used

oRng.ListFormat.DefualtNumberFormat

to apply the numbering. This adss a "period" between the number and
your the tab.

What did you have in mind when you said "autonumber?"



Hi Greg,

Hadn't noticed that. What you could do is autonumber the paragraphs, turn
the numbering into hard text.
Selection.Range.ListFormat.ConvertNumbersToText

Now each paragraph has a number followed by a tab followed by the old text.

Then sort by fields (with the tab as separator, sorting by the second
field).
Then remove the duplicates with a wildcard search:
Find what: ([0-9]@^t)([!^13]@^13)([0-9]@^t)\2
Replace with: \1\2
(Repeat until nothing more is found)

Sort by the first field (numbers) to get back the old sequence.

Then remove the numbers and tabs with a wildcard search,
Find what: (^13)[0-9]@^t
Replace with: \1

Regards,
Klaus



Greg Maxey said:
No argument that that is the fastest method for eliminating "adjacent"
ducplicate paragraphs. However run that code with:
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 the lazy dogs.
The quick and extremely agile brown fox jumped over 3 the lazy dogs.
The quick and extremely agile brown fox jumped over 4 the lazy dogs.
The quick and extremely agile brown fox jumped over 5 the lazy dogs.
The quick brown fox jumped over the lazy dog.
You are left with:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.
The quick brown fox jumped over the lazy dog.
Where my desired result is:
The quick brown fox jumped over the lazy dog.
The quick and extremely agile brown fox jumped over 2 lazy dogs.
The quick and extremely agile brown fox jumped over 3 lazy dogs.
The quick and extremely agile brown fox jumped over 4 lazy dogs.
The quick and extremely agile brown fox jumped over 5 lazy dogs.
I do that with a wildcard replacement...
Find what: ([!^13]@^13){2,}
Replace with: \1
It's not perfect -- Say
abcd
cd
cd
would be replaced with
abcd
But usually, I just risk that. And it's fast <g>
Klaus
Newsbeitrag
I came across an old post in Google groups for deleted duplicated
lines of text in a document.
It used a For x = Count method to go through and check the range of
one paragraph to the the range of every other paragraph and delete any
duplicates.
It had two If ... End If blocks. The first check the para range
length. If = then the second performed a text comparison. I assume
the author thought that it would save time by doing a text comparison
only on paras of equal length.
The procedure worked as advertised, however with a longer document it
took a long time.
I created about 800 paragraphs and determined that it was actually
much quicker to bypass the the first length check and just do a range
comparison on every paragraph. Down from 200 seconds to 75 seconds!
Next I remembered an method that Jezebel showed me for stepping
through items using the .Next (property or method I am never sure
which).
I adapted the code as follows and the time taken was down to 3
seconds!
Anyway, I just wanted to share this with the group:
Sub KillDuplicateParagraphs()
Dim SBar As Boolean
Dim TrkStatus As Boolean
Dim eTime As Single
Dim oParRef As Paragraph
Dim oParChk As Paragraph
eTime = Timer
With ActiveDocument
TrkStatus = .TrackRevisions
.TrackRevisions = False
End With
With Application
SBar = .DisplayStatusBar
.DisplayStatusBar = True
.ScreenUpdating = False
End With
Set oParRef = ActiveDocument.Range.Paragraphs(1)
Set oParChk = oParRef.Next
Do
'*** Stet out first if block to delete duplicated empty paragraphs.
If Len(oParRef.Range.Text) > 1 Then
Do
'An empty last paragraph may throw an error on the last loop.
On Error GoTo Err_Exit
If oParRef.Range = oParChk.Range Then
oParChk.Range.Delete
Else
Set oParChk = oParChk.Next
End If
Loop Until oParChk Is Nothing
End If '***
Set oParRef = oParRef.Next
'Skip errors.
On Error Resume Next
Set oParChk = oParRef.Next
On Error GoTo 0
Loop Until oParRef Is Nothing
Err_Exit:
With Application
.StatusBar = False
.DisplayStatusBar = SBar
.ScreenUpdating = True
End With
ActiveDocument.TrackRevisions = TrkStatus
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400
& " seconds."
End Sub- Hide quoted text -
- Show quoted text -- Hide quoted text -

- Show quoted text -
 
K

Klaus Linke

Never mind :-(

It's actually slower than code (Makro6x) that's already been posted.

If there are thousands of paragraphs, it might pay off to get more
sophisticated.

Say, to read the whole document into a string, operate on that, and then
delete the duplicate paragraphs found.
If there aren't many duplicate paragraphs, that should be faster:

Sub Makro6n()
Dim t As Single
t = Timer
Dim prg1 As Paragraph
Dim vText As Variant
vText = ActiveDocument.Content.Text
vText = Split(vText, vbCr)
Dim i As Long, j As Long
For i = LBound(vText) To UBound(vText)
For j = LBound(vText) To UBound(vText)
If vText(i) = vText(j) And i <> j Then
vText(j) = "<delete>" & STR(j)
End If
Next j
Next i
MsgBox Timer - t
For i = UBound(vText) To LBound(vText) Step -1
If vText(i) = "<delete>" & STR(i) Then
ActiveDocument.Paragraphs(i + 1).Range.Delete
End If
Next i
End Sub


Or to avoid the double loop (comparing each paragraph with every other,
which takes an amount of time proportional to the square of the number of
paragraphs), it might pay off to use an efficient sorting algorithm (which
takes a time proportional to N log N, or even to N), then remove doubles
(proportional to N), then sort back.

Klaus
 
K

Klaus Linke

What did you have in mind when you said "autonumber?"

Something like this:
Dim myLT As ListTemplate
Set myLT = ActiveDocument.ListTemplates.Add(OutlineNumbered:=False,
Name:="Test")
With myLT.ListLevels(1)
.NumberFormat = "%1"
.TrailingCharacter = wdTrailingTab
.NumberStyle = wdListNumberStyleArabic
.StartAt = 1
.LinkedStyle = ""
End With
Selection.WholeStory
Selection.Range.ListFormat.ApplyListTemplate ListTemplate:=myLT

But it isn't as fast as I thought... see last reply.

Klaus
 
K

Klaus Linke

A bit better optimized... (Can't believe I looped all paragraphs in the
inner loop <g>)

Klaus

Sub Makro6n2()
Dim t As Single, t1 As Single
t = Timer
Dim prg1 As Paragraph
Dim vText As Variant
Dim delFrom As Long, delTo As Long
vText = ActiveDocument.Content.Text
vText = Split(vText, vbCr)
Dim i As Long, j As Long
For i = LBound(vText) To UBound(vText)
For j = i + 1 To UBound(vText)
If vText(i) = vText(j) Then
vText(j) = STR(j)
End If
Next j
Next i
t1 = Timer - t
delFrom = 0
For i = UBound(vText) To LBound(vText) Step -1
If vText(i) = STR(i) Then
delFrom = i + 1
If delTo = 0 Then
delTo = i + 1
End If
Else
If delTo <> 0 Then
ActiveDocument.Range( _
ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
ActiveDocument.Paragraphs(delTo).Range.End - 1 _
).Delete
End If
delFrom = 0
delTo = 0
End If
Next i
MsgBox Timer - t, , t1
End Sub
 
G

Greg Maxey

Klaus,

Looks like you hit the home run. Your method is fast as lightening (less
than a second) for all tests using 800 paragraphs. All the same leaving 1,
All difference nothing deleted, and several tests with a mixed bag.

Excellent work!

Helmut take notice. A master is in our midst ;-)
 
K

Klaus Linke

Thanks!! An error I built in:

Replace

ActiveDocument.Range( _
ActiveDocument.Paragraphs(delFrom).Range.Start -1, _
ActiveDocument.Paragraphs(delTo).Range.End - 1 _
).Delete

with
ActiveDocument.Range( _
ActiveDocument.Paragraphs(delFrom).Range.Start, _
ActiveDocument.Paragraphs(delTo).Range.End _
).Delete

I thought deleting the last paragraph mark might cause problems (... it
doesn't), and my "work-around" could change the paragraph style.

Klaus
 
H

Helmut Weber

Hi all,

hmm...

Great job, Klaus!

Let me explain to my excuse,
I assumed there might be some not mentioned
bordering questions lurking behind, like:
"Excellent, but now I want to enclude some formatting conditions".

Which is often the case with posters, of course other than Greg.

Regardles of formatting, regardless of Word at all,
it might still be faster not to delete Word-ranges at all,
but to process an array of strings, delimited by chr(13),
representing the doc's content, and writing it back.

Don't let me be misunderstood,
I'm just enjoying a good discussion.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
K

Klaus Linke

"Excellent, but now I want to enclude some formatting conditions".
Which is often the case with posters, of course other than Greg.

Should be easy: Replace
ActiveDocument.Range( _
ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
ActiveDocument.Paragraphs(delTo).Range.End - 1 _
).Delete
with
ActiveDocument.Range( _
ActiveDocument.Paragraphs(delFrom).Range.Start - 1, _
ActiveDocument.Paragraphs(delTo).Range.End - 1 _
).Font.Color = wdColorRed

You could also leave the .Delete, and turn on "Track Changes" before you run
the macro, so you can review (accept/reject) the deletions later.
it might still be faster not to delete Word-ranges at all,
but to process an array of strings, delimited by chr(13),
representing the doc's content, and writing it back.

True... but I don't expect that to make a real big difference.
An improvement for very large files (1000, 10.000 or more paragraphs) can be
achieved by avoiding the nested loop, as I i mentioned earlier.
You'd need an efficient sorting algorithm (can be done in a single loop,
that is, in a time proportional to the number of paragraphs), and a way to
remember the old order (say by using a two-dimensional array where every
paragraph keeps its old index). After the paragraphs are sorted
alphabetically, duplicates can be deleted (or marked) in a single loop. Then
you restore the old order based on the index (...can also be done in a
single loop).

Regards,
Klaus
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top