#1
How to find duplicate phrases/paragraphs in a long document
Hi,
I am in the final stages of editing my very first manuscript and, over the past 6 months, may have repeated several phrases throughout my text. The manuscript is about 400 pages long. Some of these phrases are fairly easy to find and replace, re-purpose, or delete, but I am having trouble going through the entire document. I have tried several suggested solutions. With [1], I kept receiving compile errors and error-handling errors, among others. [2] only generated a word frequency report that ran to 300+ pages. [3] is not particularly helpful either. Would it be possible to automatically highlight the duplicated sections, leaving me the task of editing the material? I would appreciate any help in that regard.
PS: I am using 64-bit Windows 7 Ultimate (MS Word 2007).
PPS: I am also a total VBA noob. My technical chops are, frankly, non-existent, but I am willing to apply myself and diligently follow any proposed solutions.
PPPS: [4] is somewhat related.
[1] http://stackoverflow.com/questions/1...-word-document
[2] http://gregmaxey.com/word_tip_pages/...cy_report.html
[3] http://www.techandlife.com/2012/06/f...icrosoft-word/
[4] https://www.msofficeforums.com/word-...rds-texts.html
#2
The code at stackoverflow (which runs fine for me on a 50-page document, though it does give many spurious 'hits', often consisting of just the last character & period of a sentence) probably comes closest to doing what you want, but it purports to work on whole sentences. The problem with that approach, though, is that VBA has no idea what a grammatical sentence is. For example, consider the following:
Mr. Smith spent $1,234.56 at Dr. John's Grocery Store, to buy: 10.25kg of potatoes; 10kg of avocados; and 15.1kg of Mrs. Green's Mt. Pleasant macadamia nuts.
For you and me, that would count as one sentence; for VBA it counts as 5 sentences. Likewise, VBA has no idea what a phrase or clause is - it doesn't even have a 'phrase' or 'clause' equivalent to the problematic 'sentence' property.
Finding duplicate paragraphs, though, is easy enough:
Code:
Sub FindDuplicateParas()
Application.ScreenUpdating = False
Dim i As Long, RngSrc As Range, RngFnd As Range
Const Clr As Long = wdBrightGreen
Dim eTime As Single
eTime = Timer
Options.DefaultHighlightColorIndex = Clr
With ActiveDocument
  With .Range.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Execute
  End With
  For i = 1 To .Paragraphs.Count
    If i Mod 100 = 0 Then DoEvents
    On Error Resume Next
    Set RngSrc = .Paragraphs(i).Range
    If RngSrc.HighlightColorIndex <> Clr Then
      Set RngFnd = .Range(.Paragraphs(i).Range.End, .Range.End)
      If Len(RngSrc.Text) < 256 Then
        With RngFnd.Find
          .Text = RngSrc.Text
          .Replacement.Text = "^&"
          .Replacement.Highlight = True
          .Wrap = wdFindStop
          .Execute Replace:=wdReplaceAll
        End With
      Else
        With RngFnd
          With .Find
            .Text = Left(RngSrc.Text, 255)
            .Wrap = wdFindStop
            .Execute
          End With
          Do While .Find.Found
            If RngSrc.Text = .Duplicate.Text Then
              RngSrc.HighlightColorIndex = Clr
              .Duplicate.HighlightColorIndex = Clr
            End If
            .Collapse wdCollapseEnd
            .Find.Execute
          Loop
        End With
      End If
    End If
  Next
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Application.ScreenUpdating = True
End Sub

For Mac macro installation & usage instructions, see: http://word.mvps.org/Mac/InstallMacro.html
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
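Note on post #2: to see how Word itself splits a passage such as the grocery example into 'sentences', a minimal sketch like the one below may help. It assumes the text has been selected in the document first; the macro name ListSentencesInSelection is made up for illustration and is not part of the macros in this thread.

```vba
Sub ListSentencesInSelection()
    ' Prints each range that Word treats as a "sentence" in the current
    ' selection to the Immediate window (Ctrl+G in the VBA editor).
    Dim i As Long
    With Selection.Range
        For i = 1 To .Sentences.Count
            Debug.Print i & ": " & .Sentences(i).Text
        Next i
    End With
End Sub
```

Running this on a few paragraphs of the manuscript before using a sentence-based macro should give a feel for where the splits fall, for example after abbreviations such as 'Mr.' or 'Dr.'.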
#3
This worked really well, taking about 400 seconds. Although it reported several false positives, I am glad that I have something to work with.
I really appreciate your help on this matter. Immensely. If there are other macros that you can point me towards (I am super anxious to get my manuscript into the best shape before submission next week), I would be equally grateful. Thanks again and have a Happy Gregorian New Year.
#4
You could use almost identical code for 'sentences' (note my previous caveat):
Code:
Sub FindDuplicateSentences()
Application.ScreenUpdating = False
Dim i As Long, RngSrc As Range, RngFnd As Range
Const Clr As Long = wdBrightGreen
Dim eTime As Single
eTime = Timer
Options.DefaultHighlightColorIndex = Clr
With ActiveDocument
  With .Range.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Execute
  End With
  For i = 1 To .Sentences.Count
    If i Mod 100 = 0 Then DoEvents
    On Error Resume Next
    Set RngSrc = .Sentences(i)
    If RngSrc.HighlightColorIndex <> Clr Then
      Set RngFnd = .Range(.Sentences(i).End, .Range.End)
      If Len(RngSrc.Text) < 256 Then
        With RngFnd.Find
          .Text = RngSrc.Text
          .Replacement.Text = "^&"
          .Replacement.Highlight = True
          .Wrap = wdFindStop
          .Execute Replace:=wdReplaceAll
        End With
      Else
        With RngFnd
          With .Find
            .Text = Left(RngSrc.Text, 255)
            .Wrap = wdFindStop
            .Execute
          End With
          Do While .Find.Found
            If RngSrc.Text = .Duplicate.Text Then
              RngSrc.HighlightColorIndex = Clr
              .Duplicate.HighlightColorIndex = Clr
            End If
            .Collapse wdCollapseEnd
            .Find.Execute
          Loop
        End With
      End If
    End If
  Next
End With
' Report time taken. Elapsed time calculation allows for execution to extend past midnight.
MsgBox "Finished. Elapsed time: " & (Timer - eTime + 86400) Mod 86400 & " seconds."
Application.ScreenUpdating = True
End Sub
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
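Note on re-running the macros: both of them mark hits with bright green highlighting, and both skip any source paragraph or sentence that is already bright green (the RngSrc.HighlightColorIndex <> Clr test). Highlights left over from an earlier run could therefore affect a later one. A minimal sketch for clearing them first is shown below; the macro name ClearAllHighlighting is made up for illustration.

```vba
Sub ClearAllHighlighting()
    ' Removes ALL highlighting from the main document body, including any
    ' applied by hand, so run it only if no other highlights need to be kept.
    ActiveDocument.Range.HighlightColorIndex = wdNoHighlight
End Sub
```

Working on a copy of the manuscript is probably the safer choice, since this cannot tell the macros' highlights apart from any others.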
#5
The second task took a little over 4,500 seconds, but it was worth the wait. This task would have driven me nuts without your help, and I genuinely appreciate it. However, I would also like to add that, with the "sentences" macro, I also received several false positives. Is this because the script also calculates similarity between sentences, such that sentences beyond a given threshold are automatically flagged as duplicates, even without word-for-word duplication? That would be interesting to know.
Last edited by iamgator; 12-26-2016 at 09:32 PM. Reason: needed to add extra information
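Note on the false positives: going by the code as posted, neither macro scores similarity. Each one hands the text of a paragraph or 'sentence' to Word's Find with .MatchCase = False, so any later run of the same characters gets highlighted regardless of capitalisation, even when it sits inside a longer sentence. Very short 'sentences', such as the fragments Word produces around abbreviations, are therefore a likely source of spurious hits. One possible way to cut the noise, offered only as a sketch, is to skip short source sentences. The macro name and the MIN_LEN cut-off below are assumptions for illustration, and for brevity this version simply skips anything over the 255-character limit that the original macro handles in its Else branch.

```vba
Sub FindDuplicateSentencesMinLen()
    ' Variation on FindDuplicateSentences that ignores short source "sentences".
    ' MIN_LEN is an assumed, arbitrary cut-off, not part of the original macro.
    Const MIN_LEN As Long = 25
    Const Clr As Long = wdBrightGreen
    Dim i As Long, RngSrc As Range, RngFnd As Range, StrTxt As String
    Application.ScreenUpdating = False
    Options.DefaultHighlightColorIndex = Clr
    With ActiveDocument
        For i = 1 To .Sentences.Count
            If i Mod 100 = 0 Then DoEvents
            Set RngSrc = .Sentences(i)
            StrTxt = RngSrc.Text
            ' Skip fragments shorter than MIN_LEN and, for simplicity, anything
            ' of 256 characters or more.
            If Len(Trim(StrTxt)) >= MIN_LEN And Len(StrTxt) < 256 Then
                If RngSrc.HighlightColorIndex <> Clr Then
                    ' Search only the text that follows the source sentence.
                    Set RngFnd = .Range(RngSrc.End, .Range.End)
                    With RngFnd.Find
                        .ClearFormatting
                        .Replacement.ClearFormatting
                        .MatchCase = False
                        .MatchWildcards = False
                        .Text = StrTxt
                        .Replacement.Text = "^&"
                        .Replacement.Highlight = True
                        .Wrap = wdFindStop
                        .Execute Replace:=wdReplaceAll
                    End With
                End If
            End If
        Next i
    End With
    Application.ScreenUpdating = True
End Sub
```

Raising MIN_LEN should trade fewer spurious hits for a greater chance of missing short repeated phrases, so it may be worth trying a couple of values on a copy of the document.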
#6
Quote:
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]