Microsoft Office Forums

  #1  
10-28-2014, 01:35 AM
PaulFitz (Novice) | Windows 7 64bit | Office 2010 32bit
Join Date: Oct 2014
Location: Singapore
Posts: 7
How can I tag and selectively extract text (multiple files)?

Hi,
I have about 50 files (MS Word 2010) comprising transcriptions of interviews and meetings. I want to determine the total number of words spoken by each individual participant in each file.
All the transcription files have been formatted as follows:
Each new paragraph indicates the start of the next 'turn' in the interview/meeting and is marked with a bracketed timestamp - [00:47:15] for example - followed by the speaker’s upper case initials - LW. for example - followed by the text of that speaker’s 'turn'.
Can you suggest how I might best go about grouping and/or extracting the transcribed text of each individual participant in each of the 50 files? Whether the individual's text is converted into a table or compiled into a separate file (or into a specific column of an Excel worksheet) is not important to me, just so long as I can easily count the total number of words spoken by each individual in a given file. Ideally, I would like to do this as an easily repeatable sequence of operations (filling in the appropriate string, of course, to identify each separate speaker).
Thanks in advance for any guidance or suggestions you can offer. I hope I have expressed the question/problem clearly enough. If not, please don’t hesitate to seek clarification.
Best regards,


Paul Fitz

Last edited by macropod; 10-28-2014 at 04:08 PM. Reason: email address removed for privacy
  #2  
10-29-2014, 05:23 AM
macropod (Administrator) | Windows 7 64bit | Office 2010 32bit
Join Date: Dec 2010
Location: Canberra, Australia
Posts: 22,374

Based on your description, the basic manual approach would be to
1. use a wildcard Find/Replace, where:
Find = (\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])^32([A-Z]{2})^32
Replace = \1^t\2^t
2. Select the whole document and use Insert|Table>Convert Text To Table, to convert the document into a 3-column table
3. Sort the table by the 2nd column.
4. In column 3, select all the rows for a set of initials in column 2, then check the word count stats on the Word status bar.
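
Steps 1 and 2 above can be sketched as a short macro. This is only an illustration under stated assumptions: the Find and Replace strings are copied from step 1, the macro name TabulateTranscript is made up for this example, and it acts on whichever document is active. Note that the pattern expects exactly two capital letters followed directly by a space, so initials written as LW. or LKW would not be matched.

```vba
Option Explicit

Sub TabulateTranscript()
    ' Sketch of steps 1 and 2: tab-delimit each turn, then build a table.
    ' Works on the active document; try it on a copy first.
    With ActiveDocument.Range
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Forward = True
            .Wrap = wdFindContinue
            .MatchWildcards = True
            ' Strings copied from step 1
            .Text = "(\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])^32([A-Z]{2})^32"
            .Replacement.Text = "\1^t\2^t"
            .Execute Replace:=wdReplaceAll
        End With
        ' Step 2: time stamp, initials and text become three columns
        .ConvertToTable Separator:=wdSeparateByTabs, NumColumns:=3
    End With
End Sub
```

Steps 3 and 4 (sorting by the second column and reading the word count from the status bar) are then done by hand.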

Does the above do the job? If so a macro can be used to automate the process, but what kind of output do you want - and where?
__________________
Cheers,
Paul Edstein
[Fmr MS MVP - Word]
  #3  
10-29-2014, 09:57 AM
PaulFitz

Thanks very much Paul. You evidently gave my problem careful thought and I greatly appreciate the guidance. The Find string you've suggested is a clinic in its own right! I will give the Find & Replace a shot tomorrow and let you know how it goes. The end goal is to produce charts for an academic paper that indicate team members' comparative levels of participation, both within individual group meetings and as a participation trajectory over 15 meetings spread over a 2-year collaborative development project. Best regards & thanks again. PF
  #4  
10-29-2014, 08:29 PM
PaulFitz

Hello again,
Updating as promised...
With the 'Use wildcards' box checked, I tried applying
Find = (\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])^32([A-Z]{2})^32
Replace = \1^t\2^t
to a representative file but, alas, the search returned "0 replacements"
However, I'm not discouraged.
Your first reply to my posted question reinforces the direction that my ongoing efforts/experiments have been taking. Indeed, I have constructed a multi-step procedure that is doing the job, albeit in a way that lacks the elegance of a carefully crafted, single search string. (Something which your model solution suggests can be achieved through deft design, not to mention experience.)
Moreover, your 'eye-opener' prompted me to discover a web article by Graham Mayor: "Finding and replacing characters using wildcards" http://word.mvps.org/FAQs/General/UsingWildcards.htm which provides a concise overview of Theory and Practice.
Thanks again. I'll be back!
PF
  #5  
10-29-2014, 09:42 PM
macropod

Can you attach a document to a post with some representative data (delete anything sensitive)? You do this via the paperclip symbol on the 'Go Advanced' tab at the bottom of this screen.
  #6  
10-29-2014, 10:29 PM
gmayor (Expert) | Windows 7 64bit | Office 2010 32bit
Join Date: Aug 2014
Posts: 4,138

I saw this question earlier, but couldn't find it when I returned to reply, and assumed it was in another forum that was down at the time. However I had worked out a simple macro solution that should do what you ask.

The following macro counts the words in paragraphs that start with a time in square brackets, as shown, followed by two or more initials. The result is displayed in a message box. The initials typed into the input box are not case sensitive.

Code:
Option Explicit
Sub WordsFromSpeaker()
'Counts the words spoken by one speaker in the active document.
'As written, the initials are expected to follow the closing "]" of the
'time stamp directly and to be followed by a single space.
Dim oRng As Range
Dim oPara As Paragraph
Dim strSpeaker As String
Dim lng_Count As Long
    lng_Count = 0
    strSpeaker = UCase(InputBox("Enter speaker's initials", "Count Words", "LW"))
    'Nothing entered (or Cancel pressed), so there is nothing to count
    If Trim(strSpeaker) = "" Then Exit Sub
    For Each oPara In ActiveDocument.Paragraphs
        'Isolate the text between "]" and the next space
        Set oRng = oPara.Range
        oRng.MoveStartUntil "]"
        oRng.Start = oRng.Start + 1
        oRng.Collapse 1
        oRng.MoveEndUntil Chr(32)
        If Trim(oRng.Text) = Trim(strSpeaker) Then
            'Skip "]", the initials and one space; drop the paragraph mark
            Set oRng = oPara.Range
            oRng.MoveStartUntil "]"
            oRng.Start = oRng.Start + Len(strSpeaker) + 2
            oRng.End = oRng.End - 1
            lng_Count = lng_Count + CountWords(oRng.Text)
        End If
    Next oPara
    MsgBox strSpeaker & Chr(32) & lng_Count & " words"
End Sub

Private Function CountWords(strText As String) As Long
'Counts the space-separated items that begin with a letter A-Z
Dim vWords As Variant
Dim strFirst As String
Dim i As Long
    vWords = Split(strText)
    For i = LBound(vWords) To UBound(vWords)
        strFirst = UCase$(Left$(vWords(i), 1))
        If strFirst Like "[A-Z]" Then
            CountWords = CountWords + 1
        End If
    Next i
End Function
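
Where the transcripts put a space after the time stamp or a full stop after the initials (the original question gives LW. as an example), the speaker can also be picked out with plain string functions. The following is a minimal sketch, not part of the macro above: SpeakerOf is a hypothetical helper name, and it assumes the default Option Compare Binary so that [A-Z] matches capital letters only.

```vba
Option Explicit

' Hypothetical helper: returns the initials that follow a leading
' "[hh:mm:ss]" stamp, or "" when the paragraph does not fit that layout.
Function SpeakerOf(ByVal strPara As String) As String
    Dim strRest As String
    Dim i As Long
    SpeakerOf = ""
    ' "[[]" matches a literal "[" and "#" matches one digit
    If Not strPara Like "[[]##:##:##]*" Then Exit Function
    ' The stamp is 10 characters long; skip it and any spaces after it
    strRest = LTrim$(Mid$(strPara, 11))
    ' Collect the leading run of capital letters
    i = 1
    Do While i <= Len(strRest)
        If Not Mid$(strRest, i, 1) Like "[A-Z]" Then Exit Do
        i = i + 1
    Loop
    SpeakerOf = Left$(strRest, i - 1)
End Function
```

For example, SpeakerOf("[00:47:15] LW. Yes, I agree") returns "LW", and a paragraph without a leading time stamp returns an empty string.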
__________________
Graham Mayor - MS MVP (Word) (2002-2019)
Visit my web site for more programming tips and ready made processes www.gmayor.com
  #7  
10-30-2014, 08:37 AM
PaulFitz

Thank you, gmayor, for lending your time, attention and expertise to my request for assistance. Frankly I am a bit overwhelmed, not just by the generosity of spirit I've met in this forum, but because using macros is totally new territory for me. I shall attempt to run the script you prepared over the coming weekend once I've familiarized myself with the 'how-to' rudiments of running a macro. (Novice indeed! – but there’s no better place to learn how to swim than in the water!)
Thanks also for elsewhere posting/sharing your very clear & helpful article on using Wildcards, etc.
Regards,
PF
  #8  
10-30-2014, 07:47 PM
PaulFitz

Quote:
Originally Posted by macropod
Can you attach a document to a post with some representative data (delete anything sensitive)? You do this via the paperclip symbol on the 'Go Advanced' tab at the bottom of this screen.
Hi Paul,
Thanks for offering to look at a sample file of data to see where the Find/Replace hitch is occurring - that would be helpful indeed.
Let me first get the 'OK' from the PI (Principal Investigator) regarding 'sensitivity', etc, and get back to you Monday?
Cheers,
PF
  #9  
10-30-2014, 08:14 PM
macropod

All I need is a file with representative content - it doesn't have to be the content from any particular file, which is why I suggested "delete anything sensitive". You could, for example, copy & paste some mixed content from a 'live' file, obfuscate/delete anything that might be sensitive, then attach that.
  #10  
10-30-2014, 10:44 PM
gmayor

Quote:
Originally Posted by PaulFitz
I shall attempt to run the script you prepared over the coming weekend once I've familiarized myself with the 'how-to' rudiments of running a macro.
http://www.gmayor.com/installing_macro.htm will point the way.
  #11  
10-31-2014, 10:00 AM
PaulFitz

I've attached for F&R (Find & Replace) testing purposes the following file:
Transcript_ELA_Meeting 06B_TDT_temp.docx
You will notice small bits of text frequently appearing, enclosed in square brackets*, within a speaker's 'turn'. These are just short particles of discourse uttered by other interlocutors. You'll also notice that pauses and other non-linguistic events are indicated by single or double parentheses. I want to expunge all instances of these three forms, [*], (*) and ((*)), from the text before measuring participants' word counts. What remains should be the SID (Speaker Identifier) and associated text.
*I wonder if these square brackets aren't a factor that prevented your F&R script from working as it should have.
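
As an illustration of the clean-up described above, here is a minimal sketch that strips the three kinds of aside from a single turn while leaving a leading time stamp in place. It is an assumption-laden example rather than a tested solution for these files: StripAsides is a hypothetical name, the asides are assumed not to be nested inside one another, and it relies on the VBScript regular expression library being available on the PC.

```vba
Option Explicit

' Sketch: remove (..), ((..)) and [..] asides from one turn of text,
' leaving a [hh:mm:ss] time stamp alone. Hypothetical helper name.
Function StripAsides(ByVal strTurn As String) As String
    Dim oRegEx As Object
    Set oRegEx = CreateObject("VBScript.RegExp")
    With oRegEx
        .Global = True
        ' 1) ((...))   2) (...)   3) [...] unless it is a time stamp
        .Pattern = "\(\([^)]*\)\)|\([^)]*\)|\[(?!\d{2}:\d{2}:\d{2}\])[^\]]*\]"
        strTurn = .Replace(strTurn, " ")
        ' Collapse the runs of spaces left behind
        .Pattern = " {2,}"
        strTurn = .Replace(strTurn, " ")
    End With
    StripAsides = Trim$(strTurn)
End Function
```

With the input "[00:47:15] LW. So I think [MB: mm] we should (.) start ((laughs)) now" the function returns "[00:47:15] LW. So I think we should start now".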
I've also attached a file with a draft version of the table I'll be using to place/compile the SID/ word count data:
ELA_Meeting 01_TDTable.docx
I plan to migrate this data into an Excel worksheet in order to convert the raw word count numbers into percentages of total word count, and, ultimately, to display the data relationships in four different tables/charts:
1) For each meeting, the WC (word count) of each participant expressed as a percentage of the total WC. (as a Table; alternatively, as a Stacked column chart)
2) For each meeting, the ratio of LW's WC to that of the combined total of the other 5 team members: MB + FN + LKW + CL + CSB. (as a Stacked column chart)
3) For each meeting, the ratio of MB's WC to that of the combined total of the other 4 team members: FN + LKW + CL + CSB. (also as a Stacked column chart)
4) The trajectory of each team member's WC as a percentage across all 15 meetings; i.e. LW, MB, FN, LKW, CL, CSB. (as a Bar graph)
I'm a complete novice with Excel, so there's plenty of learning coming up. (Any tips you can offer on converting the WCs to percentages, etc., - or helpful resources you can point me towards - will be most welcome.)
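
The percentage step is simple arithmetic: each speaker's WC divided by the meeting total, times 100. A minimal sketch in VBA follows, so it can sit alongside the macros in this thread. ToPercentages is a hypothetical helper name and the counts are assumed to arrive as a one-dimensional array.

```vba
Option Explicit

' Sketch: convert raw word counts into percentages of their total.
' vCounts is assumed to be a one-dimensional array of numbers.
Function ToPercentages(ByVal vCounts As Variant) As Variant
    Dim dblTotal As Double
    Dim i As Long
    Dim vPct() As Double
    ReDim vPct(LBound(vCounts) To UBound(vCounts))
    ' First pass: total words for the meeting
    For i = LBound(vCounts) To UBound(vCounts)
        dblTotal = dblTotal + vCounts(i)
    Next i
    ' Second pass: each share, rounded to one decimal place
    If dblTotal > 0 Then
        For i = LBound(vCounts) To UBound(vCounts)
            vPct(i) = Round(100 * vCounts(i) / dblTotal, 1)
        Next i
    End If
    ToPercentages = vPct
End Function
```

For instance, ToPercentages(Array(120, 60, 20)) gives 60, 30 and 10. In a worksheet the same calculation can be done without code: assuming the six counts sit in B2:B7, the formula =B2/SUM(B$2:B$7) filled down and formatted as Percentage gives each speaker's share.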
Finally, Paul, the TS (time stamps) are superfluous for this stage of my work. In my own efforts to convert the files I used F&R to insert tabs (between TS, Speaker ID/initials, & text), then converted the text to a table and deleted the TS column. I pasted the resulting two columns into Excel, sorted A > Z, selected each Speaker's total text, and pasted it into Word for a WC. Voilà!
Just for clarification, adding the TS to the transcriptions at an earlier stage was a useful feature of Transcribe Lite - a Chrome extension/application we've found to be a big time-saver. TS make it quick and painless to check specific audio file locations. Also, I had imagined that the TS might provide a means of calculating the TD (talk distribution) based on proportional time (as opposed to WC), once we had loaded the audio and text files into the NVivo application for coding & analysis, etc. Alas, I haven't yet found (or devised) a way to structure that sort of query or functional sorting of the data. I imagine Excel is the right tool to extrapolate the duration of each turn from the TS and then sum that value together with all the other values for a given Speaker ID, and then report/display the result. Ah but how, exactly? That is another question. Therein, I suppose, lies the software developer's art. In any event, WC will still reveal the patterns the PI needs to show.
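
One possible way to approach the time-based TD, sketched under stated assumptions: treat the duration of a turn as the gap between its time stamp and the next one, and add those gaps up per speaker. The names StampToSeconds and TalkSeconds are hypothetical, the stamps and initials are assumed to be supplied as two parallel arrays in transcript order, and the last turn is skipped because no later stamp marks where it ends.

```vba
Option Explicit

' Seconds since 00:00:00 for a stamp such as "[00:47:15]".
Function StampToSeconds(ByVal strStamp As String) As Long
    StampToSeconds = CLng(Mid$(strStamp, 2, 2)) * 3600 _
                   + CLng(Mid$(strStamp, 5, 2)) * 60 _
                   + CLng(Mid$(strStamp, 8, 2))
End Function

' Sum talk time per speaker. vStamps and vSpeakers are parallel arrays,
' one entry per turn. Returns a Scripting.Dictionary keyed by initials.
Function TalkSeconds(ByVal vStamps As Variant, ByVal vSpeakers As Variant) As Object
    Dim oTotals As Object
    Dim i As Long
    Set oTotals = CreateObject("Scripting.Dictionary")
    ' Stop one short of the end: the last turn has no following stamp
    For i = LBound(vStamps) To UBound(vStamps) - 1
        oTotals(vSpeakers(i)) = oTotals(vSpeakers(i)) + _
            StampToSeconds(vStamps(i + 1)) - StampToSeconds(vStamps(i))
    Next i
    Set TalkSeconds = oTotals
End Function
```

With stamps [00:00:00], [00:00:10], [00:00:25], [00:01:00] and speakers LW, MB, LW, MB, the totals are 45 seconds for LW and 15 seconds for MB. Pauses and overlapping speech are counted as part of the turn in which they occur, so this is a rough measure at best.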
Once again, Paul, I appreciate your willingness to share your knowledge and problem solving skill (not to mention time) in helping me with these research procedures. The 'models' that you and GMayor have offered are like beacons in the fog, and will guide me in my own skill development in the weeks and months ahead. Thank you.
Regards,
PF
Attached Files
File Type: docx Transcript_ELA_Meeting 06B_TDT_temp.docx (40.7 KB, 16 views)
File Type: docx ELA_Meeting 01_TDTable.docx (15.7 KB, 16 views)
  #12  
10-31-2014, 04:36 PM
macropod

Try the macro below. It has a folder browser, so all you need to do is point it to the folder with your transcript files and it will process all files in that folder. The results of the processing will be output to a table in the document you run the macro from. I suggest using an empty document. I don't know what you want for your 'what' column, since a speaker could have much to say and I don't know what you want to capture. The output also includes the filename for each set of records, presently in the speaker column.

In addition to the processing discussed earlier, I've added some code to do a bit of related clean-up work, such as removing tabs, double-spaces and interjections, so an accurate word count without those can be obtained.
Code:
Sub Get_Speaker_Stats()
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strDocNm As String, wdDoc As Document
Dim i As Long, j As Long, k As Long, RngSpkr As Range
Dim StrSpkr As String, StrTmp As String, StrStats As String
strDocNm = ActiveDocument.FullName
StrStats = "Speaker" & vbTab & "Turns" & vbTab & "Words"
'Get the folder to process
strFolder = GetFolder
If strFolder = "" Then Exit Sub
strFile = Dir(strFolder & "\*.doc", vbNormal)
'Loop through all documents in the folder
While strFile <> ""
  i = 0: j = 0: k = 1
  'If it's not this document, open it
  If strFolder & "\" & strFile <> strDocNm Then
    Set wdDoc = Documents.Open(FileName:=strFolder & "\" & strFile, _
      AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
    With wdDoc
      'Store the document's name
      StrStats = StrStats & vbCr & .Name
      With .Range
        With .Find
          .ClearFormatting
          .Replacement.ClearFormatting
          .Format = False
          .Forward = True
          .Wrap = wdFindContinue
          .MatchWildcards = True
          'Delete interjections
          .Text = "[\[\(]@[!0-9]@[\)\]]{1,}"
          .Replacement.Text = ""
          .Execute Replace:=wdReplaceAll
          'Do some basic clean-up work
          .Text = "[^t^0160]"
          .Replacement.Text = " "
          .Execute Replace:=wdReplaceAll
          .Text = "[ ]{2,}"
          .Replacement.Text = " "
          .Execute Replace:=wdReplaceAll
          .Text = " ^13"
          .Replacement.Text = "^p"
          .Execute Replace:=wdReplaceAll
          'Reformat the data for tabulation
          .Text = "(\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])[ ]@([A-Z]@)[. ]@([! ])"
          .Replacement.Text = "\1^t\2.^t\3"
          .Execute Replace:=wdReplaceAll
        End With
        'Convert the document to a 3-column table
        .ConvertToTable Separator:=vbTab, NumColumns:=3
        With .Tables(1)
          'Sort the table by the 2nd (speaker) column
          .Sort ExcludeHeader:=False, FieldNumber:=2, SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending
          'add a temporary row so we get the last record
          .Rows.Add
          'Get the first speaker's details & stats for their first words
          Set RngSpkr = .Cell(1, 2).Range
          With RngSpkr
            .End = .End - 1
            StrSpkr = .Text
            If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
          End With
          'Process the rest of the table
          For i = 2 To .Rows.Count
            'Check who the speaker is
            Set RngSpkr = .Cell(i, 2).Range
            With RngSpkr
              .End = .End - 1
              'If it's the same speaker, update the word count
              If .Text = StrSpkr Then
                If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
              'Otherwise, store the stats & get details of the new speaker
              Else
                If StrSpkr <> "" Then StrStats = StrStats & vbCr & StrSpkr & vbTab & i - k & vbTab & j
                StrSpkr = .Text: j = 0: k = i
                If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
              End If
            End With
          Next
        End With
      End With
      'We're done with this file, so close it
      .Close SaveChanges:=False
    End With
  End If
  'Get the next file
  strFile = Dir()
Wend
Set wdDoc = Nothing: Set RngSpkr = Nothing
'Output the stats for all files
With ActiveDocument.Range
  .InsertAfter StrStats
  .ConvertToTable Separator:=vbTab, NumColumns:=3
  With .Tables(1)
    .Rows(1).HeadingFormat = True
    .Borders.Enable = True
  End With
End With
Application.ScreenUpdating = True
End Sub
 
Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function
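
The word counts in the macro above come from splitting each cell's text on spaces and counting the pieces, which is why the tabs and double spaces are cleaned up first. The idea can be seen in isolation in this small sketch; SplitWordCount is a hypothetical name and is not part of the macro.

```vba
Option Explicit

' Sketch: count space-separated items, as UBound(Split(...)) + 1 does.
Function SplitWordCount(ByVal strText As String) As Long
    strText = Trim$(strText)
    ' An empty string has no words; the function then returns 0
    If Len(strText) = 0 Then Exit Function
    SplitWordCount = UBound(Split(strText, " ")) + 1
End Function
```

SplitWordCount("So I think we should start") returns 6, whereas SplitWordCount("one  two"), with two spaces between the words, returns 3 because the empty piece between the spaces is counted too.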
  #13  
10-31-2014, 08:40 PM
PaulFitz

Thanks very much Paul. With some minor adjustments the macro worked a treat!

Regards,
PF

Tags
format style, transcriptions, word count by speaker


