Microsoft Office Forums

  #1  
12-26-2016, 10:33 AM
iamgator (Windows 7 64bit, Office 2007)
How to find duplicate phrases/paragraphs in a long document

Hi,



I am in the final stages of editing my very first manuscript and, over the past 6 months, may have repeated several phrases throughout my text. The manuscript is about 400 pages long. Now, some of these phrases are a little obvious to find and replace/re-purpose/delete, but I am having trouble going through the entire material.

I have tried several suggested solutions. With [1], I kept getting compile errors and error-handling errors, among others. [2] only generated a word frequency report that ran to 300+ pages. [3] was not particularly helpful either.

Would it be possible to automatically highlight these sections, leaving me the task of editing the material? I would appreciate any help in that regard.

PS: I am using 64-bit Windows 7 Ultimate (MS Word 2007).

PPS: Also, I am a total VBA noob. My technical chops are, frankly, non-existent. But I am willing to apply myself and diligently follow any proposed solution.

PPPS: [4] is somewhat related.

[1]- http://stackoverflow.com/questions/1...-word-document

[2]- http://gregmaxey.com/word_tip_pages/...cy_report.html

[3]- http://www.techandlife.com/2012/06/f...icrosoft-word/

[4]- http://www.msofficeforums.com/word-v...rds-texts.html
  #2  
12-26-2016, 02:48 PM
macropod (Administrator; Windows 7 64bit, Office 2010 32bit)

The code at stackoverflow (which runs fine for me on a 50-page document, though it does give many spurious 'hits', often consisting of just the last character & period of a sentence) probably comes closest to doing what you want, but it purports to work on whole sentences. The problem with that approach, though, is that VBA has no idea what a grammatical sentence is. For example, consider the following:
Mr. Smith spent $1,234.56 at Dr. John's Grocery Store, to buy: 10.25kg of potatoes; 10kg of avocados; and 15.1kg of Mrs. Green's Mt. Pleasant macadamia nuts.
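For you and me, that would count as one sentence; for VBA it counts as 5 sentences, because each of 'Mr.', 'Dr.', 'Mrs.' and 'Mt.' followed by a space is treated as the end of a sentence.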
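
To see how Word splits a particular passage, something like the following rough sketch can help. It assumes you have selected the text of interest first; the macro name is made up for illustration. It prints each VBA 'sentence' in the selection to the Immediate window of the VBA editor (Ctrl+G), so the count can be compared with the 5 quoted above.

```vba
Sub ListSentencesInSelection()
    ' Illustrative helper (hypothetical name): lists what Word's Sentences
    ' collection returns for the current selection.
    Dim i As Long
    With Selection.Range
        Debug.Print .Sentences.Count & " VBA 'sentence(s)' in the selection:"
        For i = 1 To .Sentences.Count
            ' Brackets make trailing spaces and paragraph marks visible.
            Debug.Print i & ": [" & .Sentences(i).Text & "]"
        Next i
    End With
End Sub
```

The exact split may differ between Word versions, so treat the output as a diagnostic rather than a rule.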

Likewise, VBA has no idea what a phrase or clause is - it doesn't even have a 'phrase' or 'clause' equivalent to the problematic 'sentence' property.
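
One rough way to approximate 'phrases' is to treat every run of N consecutive words as a candidate and count how often each run occurs. The sketch below is only that: the run length N, the macro name and the whitespace handling are assumptions made for illustration, punctuation stays attached to the words, and it relies on the Scripting.Dictionary object, which is available on Windows. It lists repeated runs in the Immediate window (which keeps only a limited number of lines) rather than highlighting anything.

```vba
Sub ListRepeatedWordRuns()
    ' Illustrative sketch (hypothetical name and settings): counts every run
    ' of N consecutive words in the document body and prints the repeated ones.
    Const N As Long = 6                  ' assumed run length, in words
    Dim Txt As String, W() As String
    Dim Dict As Object, k As Variant
    Dim i As Long, j As Long, Key As String

    ' Flatten the text: lower case, with paragraph marks and tabs as spaces.
    Txt = LCase(ActiveDocument.Range.Text)
    Txt = Replace(Txt, vbCr, " ")
    Txt = Replace(Txt, vbTab, " ")
    Do While InStr(Txt, "  ") > 0        ' collapse runs of spaces
        Txt = Replace(Txt, "  ", " ")
    Loop
    W = Split(Trim(Txt), " ")

    Set Dict = CreateObject("Scripting.Dictionary")
    For i = 0 To UBound(W) - N + 1
        ' Build the N-word run starting at word i.
        Key = W(i)
        For j = 1 To N - 1
            Key = Key & " " & W(i + j)
        Next j
        ' Reading a missing key yields Empty, so this starts the count at 1.
        Dict(Key) = Dict(Key) + 1
    Next i

    For Each k In Dict.Keys
        If Dict(k) > 1 Then Debug.Print Dict(k) & "x: " & k
    Next k
End Sub
```

A longer repeated passage shows up as several overlapping runs, so expect the list to need some manual sifting.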

Finding duplicate paragraphs, though, is easy enough:
Code:
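Sub FindDuplicateParas()
' Highlights paragraphs whose text is repeated later in the document.
Application.ScreenUpdating = False
Dim i As Long, RngSrc As Range, RngFnd As Range
Const Clr As Long = wdBrightGreen
Dim eTime As Single
eTime = Timer
' Replacement.Highlight applies the default highlight colour, so set it first.
Options.DefaultHighlightColorIndex = Clr
With ActiveDocument
  ' Reset the Find options up front (they can carry over from an earlier search).
  With .Range.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Execute
  End With
  For i = 1 To .Paragraphs.Count
    If i Mod 100 = 0 Then DoEvents
    On Error Resume Next
    Set RngSrc = .Paragraphs(i).Range
    ' Skip paragraphs already highlighted as duplicates of an earlier one.
    If RngSrc.HighlightColorIndex <> Clr Then
      ' Search only the text after the current paragraph.
      Set RngFnd = .Range(.Paragraphs(i).Range.End, .Range.End)
      If Len(RngSrc.Text) < 256 Then
        ' Find text is limited to 255 characters, so short paragraphs
        ' can be handled with a single Replace All.
        With RngFnd.Find
          .Text = RngSrc.Text
          .Replacement.Text = "^&"
          .Replacement.Highlight = True
          .Wrap = wdFindStop
          .Execute Replace:=wdReplaceAll
        End With
      Else
        ' Longer paragraphs: find the first 255 characters, then compare
        ' the whole paragraph at each hit with the source paragraph.
        With RngFnd
          With .Find
            .Text = Left(RngSrc.Text, 255)
            .Wrap = wdFindStop
            .Execute
          End With
          Do While .Find.Found
            ' The hit itself is only 255 characters long, so compare the
            ' full paragraph containing it rather than the hit.
            If RngSrc.Text = .Paragraphs(1).Range.Text Then
              RngSrc.HighlightColorIndex = Clr
              .Paragraphs(1).Range.HighlightColorIndex = Clr
            End If
            .Collapse wdCollapseEnd
            .Find.Execute
          Loop
        End With
      End If
    End If
  Next
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Application.ScreenUpdating = True
End Sub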
For PC macro installation & usage instructions, see: http://www.gmayor.com/installing_macro.htm
For Mac macro installation & usage instructions, see: http://word.mvps.org/Mac/InstallMacro.html
__________________
Cheers,
Paul Edstein
[MS MVP - Word]
  #3  
12-26-2016, 05:11 PM
iamgator (Windows 7 64bit, Office 2007)

This worked really well. Although it took a little longer than I had expected (c. 60 minutes) and reported several false positives, I am glad that I have something to work with.
I really appreciate your help on this matter. Immensely.
In case there are other macros that you can point me towards (super anxious to get my manuscript in the best shape before submission next week), I would be equally grateful.
Thanks again and have a Happy Gregorian New Year.
  #4  
12-26-2016, 05:33 PM
macropod (Administrator; Windows 7 64bit, Office 2010 32bit)

Quote:
Originally Posted by iamgator
This worked really well. Although it took a little longer than I had expected (c. 60 minutes)
Probably because I left in an unnecessary 'RngSrc.Select' from testing. Deleting that line (which I've now done) should make the code run faster.

You could use almost identical code for 'sentences' (note my previous caveat):
Code:
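Sub FindDuplicateSentences()
' Highlights VBA 'sentences' whose text is repeated later in the document.
Application.ScreenUpdating = False
Dim i As Long, RngSrc As Range, RngFnd As Range
Const Clr As Long = wdBrightGreen
Dim eTime As Single
eTime = Timer
' Replacement.Highlight applies the default highlight colour, so set it first.
Options.DefaultHighlightColorIndex = Clr
With ActiveDocument
  ' Reset the Find options up front (they can carry over from an earlier search).
  With .Range.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Execute
  End With
  For i = 1 To .Sentences.Count
    If i Mod 100 = 0 Then DoEvents
    On Error Resume Next
    Set RngSrc = .Sentences(i)
    ' Skip sentences already highlighted as duplicates of an earlier one.
    If RngSrc.HighlightColorIndex <> Clr Then
      ' Search only the text after the current sentence.
      Set RngFnd = .Range(.Sentences(i).End, .Range.End)
      If Len(RngSrc.Text) < 256 Then
        ' Find text is limited to 255 characters, so short sentences
        ' can be handled with a single Replace All.
        With RngFnd.Find
          .Text = RngSrc.Text
          .Replacement.Text = "^&"
          .Replacement.Highlight = True
          .Wrap = wdFindStop
          .Execute Replace:=wdReplaceAll
        End With
      Else
        ' Longer sentences: find the first 255 characters, then compare
        ' the whole sentence at each hit with the source sentence.
        With RngFnd
          With .Find
            .Text = Left(RngSrc.Text, 255)
            .Wrap = wdFindStop
            .Execute
          End With
          Do While .Find.Found
            ' The hit itself is only 255 characters long, so compare the
            ' full sentence containing it rather than the hit.
            If RngSrc.Text = .Sentences(1).Text Then
              RngSrc.HighlightColorIndex = Clr
              .Sentences(1).HighlightColorIndex = Clr
            End If
            .Collapse wdCollapseEnd
            .Find.Execute
          Loop
        End With
      End If
    End If
  Next
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Application.ScreenUpdating = True
End Sub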
I'd expect this to take somewhat longer, though. However, if you've already highlighted the duplicate paragraphs, execution should be a bit quicker, since text that is already highlighted is skipped.
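
Because both macros skip text that is already highlighted in bright green, highlighting left over from an earlier run changes what a later run checks. A minimal sketch for starting from a clean slate, assuming there is no other highlighting in the document that you want to keep (the macro name is made up for illustration):

```vba
Sub ClearAllHighlighting()
    ' Illustrative helper (hypothetical name): removes every highlight in the
    ' main body of the active document, including any applied by hand.
    ActiveDocument.Range.HighlightColorIndex = wdNoHighlight
End Sub
```

Run it on a copy of the manuscript first if you are unsure whether any existing highlighting matters.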
  #5  
12-26-2016, 09:28 PM
iamgator (Windows 7 64bit, Office 2007)

Once I deleted the line, it took about 400 seconds; the second task took a little over 4,500 seconds. But it was worth the wait: this task would have driven me nuts without your help. I genuinely appreciate it. However, I would also like to add that, with the "sentences" macro, I also received several false positives. Is this because the script also calculates familiarity between sentences such that sentences beyond a given threshold are automatically flagged as duplicates, even without word-for-word duplication? That would be interesting to know.

Last edited by iamgator; 12-26-2016 at 09:32 PM. Reason: needed to add extra information
  #6  
12-27-2016, 01:34 AM
macropod (Administrator; Windows 7 64bit, Office 2010 32bit)

Quote:
Originally Posted by iamgator
I also received several false positives. Is this because the script also calculates familiarity between sentences such that sentences beyond a given threshold are automatically flagged as duplicates, even without word-for-word duplication? That would be interesting to know.
There shouldn't be any false positives in terms of VBA 'sentences'. However, because of the limitations in what VBA counts as a sentence, parts of grammatical sentences may be highlighted even though the grammatical sentences differ. This could even lead to a situation where all of a grammatical sentence somewhere in the document is highlighted because the VBA 'sentence' parts of it are found elsewhere.
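
If the short fragments are the main nuisance, one option is to remove the highlighting from very short VBA 'sentences' afterwards. The following is a rough sketch only: the 25-character cut-off and the macro name are assumptions chosen for illustration, and it will also clear highlighting from any genuinely duplicated sentence that happens to be shorter than the cut-off.

```vba
Sub ClearShortHighlightedSentences()
    ' Illustrative sketch (hypothetical name and threshold): un-highlights
    ' VBA 'sentences' that are too short to be meaningful duplicates.
    Const MinLen As Long = 25            ' assumed cut-off, in characters
    Dim Sen As Range
    Application.ScreenUpdating = False
    For Each Sen In ActiveDocument.Sentences
        If Len(Trim(Sen.Text)) < MinLen Then
            ' Only touch fragments that carry some highlighting.
            If Sen.HighlightColorIndex <> wdNoHighlight Then
                Sen.HighlightColorIndex = wdNoHighlight
            End If
        End If
    Next Sen
    Application.ScreenUpdating = True
End Sub
```

Adjust MinLen to taste; a larger value removes more of the fragment highlights at the cost of hiding more short but real repeats.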