#1
Hi,
I have about 50 files (MS Word 2010) comprising transcriptions of interviews and meetings. I want to determine the total number of words spoken by each individual participant in each file.

All the transcription files have been formatted as follows: each new paragraph indicates the start of the next 'turn' in the interview/meeting and is marked with a bracketed timestamp ([00:47:15], for example), followed by the speaker’s upper case initials (LW., for example), followed by the text of that speaker’s 'turn'.

Can you suggest how I might best go about grouping and/or extracting the transcribed text of each individual participant in each of the 50 files? Whether the individual's text is converted into a table or compiled into a separate file (or into a specific column of an Excel worksheet) is not important to me, just so long as I can easily count the total number of words spoken by each individual in a given file. Ideally, I would like to do this as an easily repeatable sequence of operations (filling in the appropriate string, of course, to identify each separate speaker).

Thanks in advance for any guidance or suggestions you can offer. I hope I have expressed the question/problem clearly enough. If not, please don’t hesitate to seek clarification.

Best regards,
Paul Fitz

Last edited by macropod; 10-28-2014 at 04:08 PM. Reason: email address removed for privacy
#2
Based on your description, the basic manual approach would be to:
1. Use a wildcard Find/Replace, where:
   Find = (\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])^32([A-Z]{2})^32
   Replace = \1^t\2^t
2. Select the whole document and use Insert|Table>Convert Text To Table to convert the document into a 3-column table.
3. Sort the table by the 2nd column.
4. In column 3, select all the rows for a set of initials in column 2, then check the word count stats on the Word status bar.

Does the above do the job? If so, a macro can be used to automate the process, but what kind of output do you want - and where?
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
#3
Thanks very much, Paul. You evidently gave my problem careful thought and I greatly appreciate the guidance. The Find string you've suggested is a clinic in its own right! I will give the Find & Replace a shot tomorrow and let you know how it goes.

The end goal is to produce charts for an academic paper that give an indication of team members' comparative level of participation, both within individual group meetings and as a participation trajectory over 15 meetings spread over a 2-year collaborative development project.

Best regards & thanks again. PF
#4
Hello again,
Updating as promised...

With the 'Use wildcards' box checked, I tried applying
Find = (\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])^32([A-Z]{2})^32
Replace = \1^t\2^t
to a representative file but, alas, the search returned "0 replacements".

However, I'm not discouraged. Your first reply to my posted question reinforces the direction that my ongoing efforts/experiments have been taking. Indeed, I have constructed a multi-step procedure that is doing the job, albeit in a way that lacks the elegance of a carefully crafted, single search string. (Something which your model solution suggests can be achieved through deft design, not to mention experience.)

Moreover, your 'eye-opener' prompted me to discover a web article by Graham Mayor, "Finding and replacing characters using wildcards" (http://word.mvps.org/FAQs/General/UsingWildcards.htm), which provides a concise overview of theory and practice.

Thanks again. I'll be back! PF
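The thread does not establish at this point why the search found nothing. One way to investigate is to test a single turn against the pattern outside the Find dialog. The sketch below is only an illustration: it uses the VBScript regular-expression engine (Windows only) rather than Word wildcards, so the two pattern strings are approximate translations, and the function name `SpeakerFromTurn` and the sample turns are made up for the example. It assumes a turn looks like `[00:47:15] LW. text`, as in the first post, and that some initials have three letters.

```vba
Option Explicit

' Returns the speaker initials found at the start of a turn, or "" if the
' line does not match. sPattern is a VBScript regular expression, not a
' Word wildcard string; the two syntaxes are similar but not identical.
Function SpeakerFromTurn(ByVal sLine As String, ByVal sPattern As String) As String
    Dim oRe As Object, oMatches As Object
    Set oRe = CreateObject("VBScript.RegExp")
    oRe.Pattern = sPattern
    oRe.Global = False
    Set oMatches = oRe.Execute(sLine)
    If oMatches.Count > 0 Then SpeakerFromTurn = oMatches(0).SubMatches(0)
End Function

Sub DemoSpeakerPatterns()
    ' Strict: exactly two capitals with a single space on either side.
    Const STRICT_PAT As String = "^\[\d{2}:\d{2}:\d{2}\] ([A-Z]{2}) "
    ' Relaxed: any run of capitals, followed by periods and/or spaces.
    Const RELAXED_PAT As String = "^\[\d{2}:\d{2}:\d{2}\] *([A-Z]+)[. ]+"

    ' Hypothetical sample turns, not taken from the real transcripts.
    Debug.Print "[" & SpeakerFromTurn("[00:47:15] LW. So I think", STRICT_PAT) & "]"
    Debug.Print "[" & SpeakerFromTurn("[00:47:15] LW. So I think", RELAXED_PAT) & "]"
    Debug.Print "[" & SpeakerFromTurn("[00:47:20] LKW. Yes", RELAXED_PAT) & "]"
End Sub
```

If the strict pattern returns nothing for a turn that the relaxed one accepts, a period after the initials or a third initial is a likely candidate for the mismatch. A sample of the real file is still the only way to be sure.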
#5
Can you attach a document to a post with some representative data (delete anything sensitive)? You do this via the paperclip symbol on the 'Go Advanced' tab at the bottom of this screen.
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
#6
I saw this question earlier, but couldn't find it when I returned to reply, and assumed it was in another forum that was down at the time. However I had worked out a simple macro solution that should do what you ask.
The following will count the words in paragraphs that start with a time in square brackets as shown followed by two or more initials. The result is displayed in a message box. When the macro is run the input box is not case sensitive. Code:
Option Explicit

Sub WordsFromSpeaker()
Dim oRng As Range
Dim oPara As Paragraph
Dim strSpeaker As String
Dim strFirstWord As String
Dim lng_speaker As Long
Dim lng_Count As Long
    lng_Count = 0
    strSpeaker = UCase(InputBox("Enter speaker's initials", "Count Words", "LW"))
    For Each oPara In ActiveDocument.Paragraphs
        strFirstWord = ""
        'Locate the text between the closing bracket and the next space
        Set oRng = oPara.Range
        oRng.MoveStartUntil "]"
        oRng.Start = oRng.Start + 1
        oRng.Collapse 1
        oRng.MoveEndUntil Chr(32)
        If Trim(oRng.Text) = Trim(strSpeaker) Then
            'Count the words in the rest of the paragraph
            Set oRng = oPara.Range
            oRng.MoveStartUntil "]"
            oRng.Start = oRng.Start + Len(strSpeaker) + 2
            oRng.End = oRng.End - 1
            lng_Count = lng_Count + CountWords(oRng.Text)
        End If
    Next oPara
    MsgBox strSpeaker & Chr(32) & lng_Count & " words"
End Sub

Private Function CountWords(strText As String) As Long
Dim vWords As Variant
Dim strFirst As String
Dim i As Long
    vWords = Split(strText)
    For i = LBound(vWords) To UBound(vWords)
        'Only count items that start with a letter
        strFirst = UCase$(Left$(vWords(i), 1))
        If strFirst Like "[A-Z]" Then
            CountWords = CountWords + 1
        End If
    Next i
End Function
__________________
Graham Mayor - MS MVP (Word) (2002-2019) Visit my web site for more programming tips and ready made processes www.gmayor.com
#7
Thank you, gmayor, for lending your time, attention and expertise to my request for assistance. Frankly I am a bit overwhelmed, not just by the generosity of spirit I've met in this forum, but because using macros is totally new territory for me. I shall attempt to run the script you prepared over the coming weekend once I've familiarized myself with the 'how-to' rudiments of running a macro. (Novice indeed! – but there’s no better place to learn how to swim than in the water!)
Thanks also for posting and sharing elsewhere your very clear & helpful article on using wildcards.

Regards, PF
#8
Thanks for offering to look at a sample file of data to see where the Find/Replace hitch is occurring - that would be helpful indeed. Let me first get the 'OK' from the PI (Principal Investigator) regarding 'sensitivity', etc., and get back to you on Monday.

Cheers, PF
#9
All I need is a file with representative content - it doesn't have to be the content from any particular file, which is why I suggested "delete anything sensitive". You could, for example, copy & paste some mixed content from a 'live' file, obfuscate/delete anything that might be sensitive, then attach that.
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
#10
__________________
Graham Mayor - MS MVP (Word) (2002-2019) Visit my web site for more programming tips and ready made processes www.gmayor.com
#11
I've attached the following file for F&R (Find & Replace) testing purposes:
Transcript_ELA_Meeting 06B_TDT_temp.docx

You will notice small bits of text frequently appearing, enclosed in square brackets*, within a speaker's 'turn'. These are just short particles of discourse uttered by other interlocutors. You'll also notice that pauses and other non-linguistic events are indicated by single or double parentheses. I want to expunge all instances of these ([*], (*), ((*))) from the text before measuring participants' word counts. What remains should be the SID (Speaker Identifier) and associated text.

*I wonder if these square brackets aren't a factor that prevented your F&R script from working as it should have.

I've also attached a file with a draft version of the table I'll be using to place/compile the SID/word count data:
ELA_Meeting 01_TDTable.docx

I plan to migrate this data into an Excel worksheet in order to convert the raw word count numbers into percentages of total word count and, ultimately, to display the data relationships in four different tables/charts:
1) For each meeting, the WC (word count) of each participant expressed as a percentage of the total WC (as a table; alternatively, as a stacked column chart).
2) For each meeting, the ratio of LW's WC to that of the combined total of the other 5 team members: MB + FN + LKW + CL + CSB (as a stacked column chart).
3) For each meeting, the ratio of MB's WC to that of the combined total of the other 4 team members: FN + LKW + CL + CSB (also as a stacked column chart).
4) The trajectory of each team member's WC as a percentage across all 15 meetings, i.e. LW, MB, FN, LKW, CL, CSB (as a bar graph).

I'm a complete novice with Excel, so there's plenty of learning coming up. (Any tips you can offer on converting the WCs to percentages, etc., or helpful resources you can point me towards, will be most welcome.)

Finally, Paul, the TS (time stamps) are superfluous for this stage of my work. In my own efforts to convert the files I used F&R to insert tabs (between TS, Speaker ID/initials, & text), then Converted to Table, and deleted the TS column. I pasted the resulting two columns into Excel, sorted A > Z, selected each speaker’s total text, and pasted it into Word for a WC. Voilà!

Just for clarification, adding the TS to the transcriptions at an earlier stage was a useful feature of Transcribe Lite, a Chrome extension/application we've found to be a big time-saver. TS make it quick and painless to check specific audio file locations. Also, I had imagined that the TS might provide a means of calculating the TD (talk distribution) based on proportional time (as opposed to WC), once we had loaded the audio and text files into the NVivo application for coding & analysis, etc. Alas, I haven't yet found (or devised) a way to structure that sort of query or functional sorting of the data. I imagine Excel is the right tool to extrapolate the duration of each turn from the TS, sum that value together with all the other values for a given Speaker ID, and then report/display the result. Ah, but how, exactly? That is another question. Therein, I suppose, lies the software developer's art. In any event, WC will still reveal the patterns the PI needs to show.

Once again, Paul, I appreciate your willingness to share your knowledge and problem-solving skill (not to mention time) in helping me with these research procedures. The 'models' that you and GMayor have offered are like beacons in the fog, and will guide me in my own skill development in the weeks and months ahead.

Thank you.
Regards, PF
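On the question of talk distribution by time rather than by word count, a rough sketch of the idea might look like the following. It is not from the thread: the function names `StampToSeconds` and `TalkSeconds` and the sample data are invented for illustration, and it assumes each turn lasts until the next turn's timestamp, so the final turn (which has no end point) is left out. It uses `Scripting.Dictionary`, which is available on Windows.

```vba
Option Explicit

' Converts a "[hh:mm:ss]" timestamp to seconds from the start of the recording.
Function StampToSeconds(ByVal sStamp As String) As Long
    Dim vParts As Variant
    sStamp = Replace(Replace(Trim$(sStamp), "[", ""), "]", "")
    vParts = Split(sStamp, ":")
    StampToSeconds = CLng(vParts(0)) * 3600 + CLng(vParts(1)) * 60 + CLng(vParts(2))
End Function

' Sums the seconds attributed to each speaker. Assumption: a turn runs until
' the next turn's timestamp, so the last turn is skipped.
Function TalkSeconds(vStamps As Variant, vSpeakers As Variant) As Object
    Dim oTotals As Object, i As Long, lDur As Long
    Set oTotals = CreateObject("Scripting.Dictionary")
    For i = LBound(vStamps) To UBound(vStamps) - 1
        lDur = StampToSeconds(vStamps(i + 1)) - StampToSeconds(vStamps(i))
        oTotals(vSpeakers(i)) = oTotals(vSpeakers(i)) + lDur
    Next i
    Set TalkSeconds = oTotals
End Function

Sub DemoTalkShare()
    Dim vStamps As Variant, vSpeakers As Variant
    Dim oTotals As Object, vKey As Variant, lAll As Long
    ' Hypothetical data: four turns alternating between two speakers.
    vStamps = Array("[00:00:00]", "[00:00:30]", "[00:01:15]", "[00:02:00]")
    vSpeakers = Array("LW", "MB", "LW", "MB")
    Set oTotals = TalkSeconds(vStamps, vSpeakers)
    For Each vKey In oTotals.Keys
        lAll = lAll + oTotals(vKey)
    Next vKey
    ' Each speaker's share of the total, as a percentage.
    For Each vKey In oTotals.Keys
        Debug.Print vKey & ": " & oTotals(vKey) & " s, " & Format(oTotals(vKey) / lAll, "0.0%")
    Next vKey
End Sub
```

With the sample data, LW is credited with 75 seconds and MB with 45 of the 120 seconds covered. The same share calculation (one value divided by the total) applies to word counts. Pauses and overlapping speech are attributed to whoever holds the turn, so this is only an approximation of real talk time.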
#12
Try the macro below. It has a folder browser, so all you need to do is point it to the folder with your transcript files and it will process all files in that folder. The results of the processing will be output to a table in the document you run the macro from. I suggest using an empty document. I don't know what you want for your 'what' column, since a speaker could have much to say and I don't know what you want to capture. The output also includes the filename for each set of records - presently in the speaker column.
In addition to the processing discussed earlier, I've added some code to do a bit of related clean-up work, such as removing tabs, double-spaces and interjections, so an accurate word count without those can be obtained. Code:
Sub Get_Speaker_Stats()
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strDocNm As String, wdDoc As Document
Dim i As Long, j As Long, k As Long, RngSpkr As Range
Dim StrSpkr As String, StrTmp As String, StrStats As String
strDocNm = ActiveDocument.FullName
StrStats = "Speaker" & vbTab & "Turns" & vbTab & "Words"
'Get the folder to process
strFolder = GetFolder
If strFolder = "" Then Exit Sub
strFile = Dir(strFolder & "\*.doc", vbNormal)
'Loop through all documents in the folder
While strFile <> ""
  i = 0: j = 0: k = 1
  'If it's not this document, open it
  If strFolder & "\" & strFile <> strDocNm Then
    Set wdDoc = Documents.Open(FileName:=strFolder & "\" & strFile, _
      AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
    With wdDoc
      'Store the document's name
      StrStats = StrStats & vbCr & .Name
      With .Range
        With .Find
          .ClearFormatting
          .Replacement.ClearFormatting
          .Format = False
          .Forward = True
          .Wrap = wdFindContinue
          .MatchWildcards = True
          'Delete interjections
          .Text = "[\[\(]@[!0-9]@[\)\]]{1,}"
          .Replacement.Text = ""
          .Execute Replace:=wdReplaceAll
          'Do some basic clean-up work
          .Text = "[^t^0160]"
          .Replacement.Text = " "
          .Execute Replace:=wdReplaceAll
          .Text = "[ ]{2,}"
          .Replacement.Text = " "
          .Execute Replace:=wdReplaceAll
          .Text = " ^13"
          .Replacement.Text = "^p"
          .Execute Replace:=wdReplaceAll
          'Reformat the data for tabulation
          .Text = "(\[[0-9]{2}:[0-9]{2}:[0-9]{2}\])[ ]@([A-Z]@)[. ]@([! ])"
          .Replacement.Text = "\1^t\2.^t\3"
          .Execute Replace:=wdReplaceAll
        End With
        'Convert the document to a 3-column table
        .ConvertToTable Separator:=vbTab, NumColumns:=3
        With .Tables(1)
          'Sort the table by the 2nd (speaker) column
          .Sort ExcludeHeader:=False, FieldNumber:=2, SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending
          'add a temporary row so we get the last record
          .Rows.Add
          'Get the first speaker's details & stats for their first words
          Set RngSpkr = .Cell(1, 2).Range
          With RngSpkr
            .End = .End - 1
            StrSpkr = .Text
            If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
          End With
          'Process the rest of the table
          For i = 2 To .Rows.Count
            'Check who the speaker is
            Set RngSpkr = .Cell(i, 2).Range
            With RngSpkr
              .End = .End - 1
              'If it's the same speaker, update the word count
              If .Text = StrSpkr Then
                If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
              'Otherwise, store the stats & get details of the new speaker
              Else
                If StrSpkr <> "" Then StrStats = StrStats & vbCr & StrSpkr & vbTab & i - k & vbTab & j
                StrSpkr = .Text: j = 0: k = i
                If Len(.Cells(1).Next.Range.Text) > 2 Then j = j + UBound(Split(.Cells(1).Next.Range.Text, " ")) + 1
              End If
            End With
          Next
        End With
      End With
      'We're done with this file, so close it
      .Close SaveChanges:=False
    End With
  End If
  'Get the next file
  strFile = Dir()
Wend
Set wdDoc = Nothing: Set RngSpkr = Nothing
'Output the stats for all files
With ActiveDocument.Range
  .InsertAfter StrStats
  .ConvertToTable Separator:=vbTab, NumColumns:=3
  With .Tables(1)
    .Rows(1).HeadingFormat = True
    .Borders.Enable = True
  End With
End With
Application.ScreenUpdating = True
End Sub

Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function
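To see what the interjection clean-up and the space-based word count amount to on a single turn, here is a small stand-alone sketch. It is an approximation and not a translation of the wildcard expressions in the macro: it uses the VBScript regular-expression engine (Windows only), and the names `StripInterjections` and `CountWordsBySpaces` plus the sample sentence are invented for the example.

```vba
Option Explicit

' Removes bracketed or parenthesised asides such as [MB: yeah], (pause) or
' ((laughs)). Anything containing a digit is left alone, so timestamps
' like [00:47:15] survive, but so would an aside like (2 sec).
Function StripInterjections(ByVal sText As String) As String
    Dim oRe As Object
    Set oRe = CreateObject("VBScript.RegExp")
    oRe.Global = True
    ' Opening bracket(s), content with no digits or brackets, closing bracket(s).
    oRe.Pattern = "[\[(]+[^\[\]()0-9]*[\])]+"
    sText = oRe.Replace(sText, "")
    ' Collapse the runs of spaces left behind by the deletions.
    oRe.Pattern = " {2,}"
    StripInterjections = Trim$(oRe.Replace(sText, " "))
End Function

' Counts words as space-separated items, the same basis as the
' UBound(Split(text, " ")) + 1 expression used in the macro.
Function CountWordsBySpaces(ByVal sText As String) As Long
    If Len(sText) = 0 Then Exit Function
    CountWordsBySpaces = UBound(Split(sText, " ")) + 1
End Function

Sub DemoCleanAndCount()
    Dim sTurn As String
    ' Hypothetical turn text, not taken from the real transcripts.
    sTurn = "I think [MB: yeah] we should (pause) start ((laughs)) now"
    Debug.Print StripInterjections(sTurn)
    Debug.Print CountWordsBySpaces(StripInterjections(sTurn))
End Sub
```

For the sample turn this leaves "I think we should start now", which counts as 6 words. Counting by spaces treats numerals and stray punctuation as words, so totals can differ slightly from Word's own word count.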
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
#13
Thanks very much, Paul. With some minor adjustments the macro worked a treat!
Regards, PF
Tags: format style, transcriptions, word count by speaker