|
|
Thread Tools | Display Modes |
#1
|
|||
|
|||
Extract Line of Text w/ specific characters up to the paragraph character, send to Excel
Hi all,
Long time no see . I swear I've seen similar questions but I can't find them now... So, I receive the below data (I've "x"ed out the sensitive stuff) in a 380ish page report every day. There are about 1,000 instances of the text paragraph example. I need to extract the first line of each instance and send it to Excel so then I can import it into Access and build a query to search for businesses that have certain words in their names. However, getting the first line of text out of Word and into Excel so that I can then import it into Access is not as simple as I thought it would be. I found this, but it doesn't work at my end and I don't understand how to limit it by the paragraph character or get it to Excel. Example: 150710000XXX CORP NAME, LLC LAW : XXX LLC *FILER ADDRESS / PROCESS ADDRESS* COUNTY : XXXX XXXXXXX NAME EFF. DATE: 07/10/2015 123 XXXXXX ST. #XX DUR. DATE: NEW YORK, NY XXXXX I'd like to be able to extract the first line (bolded above) that holds that ID number, the corporation name and then ends at a paragraph return, and send those to Excel, two columns per record, one for the ID and one for the corporation name. The corp names don't always end in LLC so the paragraph return after the ID number should grab everything. -The first six digits of the number beginning with 150710 above indicate the year, month and day. I'd like to leave the option open to enter either 4 or 6 digits, if we have to specify this now, because we will probably combine the reports for each week in to a single document. -There is a paragraph character (^p) at the end of the line I want to extract because I was able to "find" them using Ctrl+F. There is not always a paragraph character preceding the 150710XXXXXXX number. -Additionally there are three spaces between the ID Number and the corporation name. Based on my limited knowledge of VBA I'd say a pop-up box that requests the prefix I want to search for would be best? Then, hypothetically, Word would use the entered prefix with a wild card to find all the instances that contain that prefix, extracting the line of text to the paragraph return. Ultimately those would all go into an Excel spreadsheet that I could then put in to Access so that I can build a query to search the keywords we're looking for as opposed to the poor soul that is currently Ctrl+Fing her way through almost 400 pages every day. I'm using MS Office 2013. Going to update my profile right now. Thank you in advance for your assistance. Donna |
#2
|
||||
|
||||
The lines you want can probably be found using a wildcard Find, something along the lines of:
Find = <[0-9]{4,6}[!^13]{4,} but we'd probably need more of a description of what the lines to be found contain than what you've provided so far. In your example: 150710000XXX CORP NAME, LLC I can understand what the first six digits (150710) represent, but then you have a further three 0s followed by what I suppose are just three random alpha characters followed by what appears to be just a single space before the corporation name. After the date, are there always three digits followed by three alpha characters, all as an unbroken string? Are there actually three spaces before the corporation name, or just those three alpha characters?
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word] |
#3
|
|||
|
|||
Hi Paul,
I was hoping you'd be the one to post. The last "x"s represent the ID number assigned by the state. The ID number that contains the 1507 prefix seems to be a standard length in all the reports I've seen so far but I don't know if the number of characters would change if all of a sudden there was a surge in corporations for that day. After that number there are three blank spaces before the corporation name. At the end of every line containing the above there is a return. I wasn't sure if it was possible to find a prefix and then somehow search to "the end of the line". Donna |
#4
|
||||
|
||||
In that case, the lines you want can probably be found using a wildcard Find, along the lines of:
Find = <[0-9]{7,9}[ ]{3}[!^13]{1,} If you do some testing with that, you should be able to confirm that it correctly matches all your extract records with no false matches. As for transferring the data to Access, I really don't see why you need to involve Excel - I'd have thought one just export from Word to a delimited text file and import that directly into Access?
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word] |
#5
|
||||
|
||||
While Paul is a whizz with wildcards, based on your comments I would do it differently. If the required paragraphs always begin with four or 6 numeric characters, the required paragraphs can be obtained by analysing those characters at the start of each paragraph e.g.
Code:
Dim oPara As Paragraph Dim oRng As Range For Each oPara In ActiveDocument.Paragraphs If IsNumeric(Left(oPara.Range, 4)) Or IsNumeric(Left(oPara.Range, 6)) Then Set oRng = oPara.Range oRng.End = oRng.End - 1 'do what you want here with orng.text MsgBox oRng.Text End If Next oPara
__________________
Graham Mayor - MS MVP (Word) (2002-2019) Visit my web site for more programming tips and ready made processes www.gmayor.com |
#6
|
||||
|
||||
I was thinking of something like:
Code:
Sub Demo() Application.ScreenUpdating = False Dim StrTmp As String, StrRec As String, StrDat As String Dim i As Long, j As Long, x As Long, StrFlNm As String StrFlNm = "C:\Users\" & Environ("UserName") & "\Documents\CorpData.txt" With ActiveDocument.Range With .Find .ClearFormatting .Replacement.ClearFormatting .Text = "<[0-9]{7,9}[ ]{3}[!^13]{1,}" .Replacement.Text = "" .Forward = True .Wrap = wdFindStop .Format = False .MatchWildcards = True .Execute End With Do While .Find.Found = True x = x + 1 StrRec = .Text StrTmp = Split(StrRec, " ")(0) i = Len(StrRec): j = Len(StrTmp) StrDat = StrDat & Chr(34) & StrTmp & Chr(34) & vbTab & _ Chr(34) & Trim(Right(StrRec, i - j)) & Chr(34) & vbCrLf .Collapse wdCollapseEnd .Find.Execute Loop If Len(StrDat) > 1 Then Close #1 StrDat = Left(StrDat, Len(StrDat) - 1) Open StrFlNm For Output As #1 Print #1, StrDat Close #1 End If End With StatusBar = "Done! The output for " & x & " records is now in: " & StrFlNm Application.ScreenUpdating = True End Sub To produce a csv file instead (e.g. for import into Excel), simply change CorpData.txt to CorpData.csv and change vbTab to ",".
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word] |
#7
|
|||
|
|||
I'm on my way in to the office now and will try both!!! THANK YOU THANK YOU THANK YOU.
|
#8
|
|||
|
|||
Hi Paul,
Thank you for your help!! Once again MS Office is singing because of the macro you wrote. Initially when I ran the macro it wasn't extracting anything but I played around with the .text line and figured it out. It was my description to you that was not accurate, I used "XXX" which was interpreted as characters, I should've said digits. I'm happy I figured it out . Now I have more questions, please let me know if I should start a new thread and mark this one solved... So, I've processed about 10 of the reports with the macro and then queried the imported text file (which is AWESOME) in Access. I have around 45 search terms in the Access query but I'm only getting hits on about 2% of the names so I'm wondering if it would be better/possible to search what the macro extracts? When you helped me with this: https://www.msofficeforums.com/word-...xtensions.html, the macro used a spreadsheet to search and replace which I think could be applied here? How would I build wildcards in to that? For example, if I want to search for "* MD*", so a space plus "MD", with wildcards at either end but outside the space that precedes the "M". There are only 5 parameters in Access that need a wildcard then a space then the word to be searched and then another wildcard. Maybe I could have Word use an Excel spreadsheet for concrete search words and then still use Access for the 5 different ones? Throwing ideas out, not sure what the best approach is... This is similar to the bulk find replace macro in that my string got to long, that what lead me to search and ultimately brought me to the world of VBA haha. If you think it's better to use the current macro and import to Access I trust your advice. This is what is working for me right now, I changed the second curly bracket to "12": Code:
Sub CorpData() Application.ScreenUpdating = False Dim StrTmp As String, StrRec As String, StrDat As String Dim i As Long, j As Long, x As Long, StrFlNm As String StrFlNm = "C:\Users\" & Environ("UserName") & "\Documents\CorpData.txt" With ActiveDocument.Range With .Find .ClearFormatting .Replacement.ClearFormatting .Text = "<[0-9]{12}[!^13]{1,}" .Replacement.Text = "" .Forward = True .Wrap = wdFindStop .Format = False .MatchWildcards = True .Execute End With Do While .Find.Found = True x = x + 1 StrRec = .Text StrTmp = Split(StrRec, " ")(0) i = Len(StrRec): j = Len(StrTmp) StrDat = StrDat & Chr(34) & StrTmp & Chr(34) & vbTab & _ Chr(34) & Trim(Right(StrRec, i - j)) & Chr(34) & vbCrLf .Collapse wdCollapseEnd .Find.Execute Loop If Len(StrDat) > 1 Then Close #1 StrDat = Left(StrDat, Len(StrDat) - 1) Open StrFlNm For Output As #1 Print #1, StrDat Close #1 End If End With StatusBar = "Done! The output for " & x & " records is now in: " & StrFlNm Application.ScreenUpdating = True End Sub |
#9
|
|||
|
|||
Hi Graham,
Thank you for this option. The pop-up window doesn't work for this specific issue because I get about 1500 results per report so that's a lot of times to hit the "enter" key . Initially I thought the pop-up was the best option but there aren't any other 12 digit numbers in the report, lucky for me. I was working all weekend so I didn't have much time to work with what you wrote for me but I did want to say thank you and also thank you for all the work you do on your blog. http://www.gmayor.com/replace_using_wildcards.htm is what helped me figure out why I couldn't get Paul's macro to extract what I wanted. You guys are great!! Every time I submit something I learn a ton. Once I have a chance to "play" with what you wrote I'll be back with more questions Thanks again! Donna |
#10
|
||||
|
||||
The message box was merely an indication that the macro had recovered the correct piece of text. In a real world situation it would not be required. You would use the value shown displayed in the text box i.e. oRng.Text, to add the required content to the CSV fille.
__________________
Graham Mayor - MS MVP (Word) (2002-2019) Visit my web site for more programming tips and ready made processes www.gmayor.com |
#11
|
||||
|
||||
Hi Donna,
Quote:
In that macro, you'd change: .MatchCase = True .MatchWholeWord = True to: .MatchWildcards = True Naturally, you'd need a workbook with valid wildcard Find/Replace expressions, too. Quote:
Quote:
.Text = "<[0-9]{7,9}[ ]{3}[!^13]{1,}" to: .Text = "<[0-9]{10,12} [!^13]{1,}"
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word] |
Tags |
excel 2013, extract word to excel, word 2013 |
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
How to insert paragraph character after every 500 characters? | aditya_bokade | Word VBA | 28 | 11-13-2021 10:48 PM |
Macro to Insert text into the beginning on specific paragraphs unless the paragraph is blank | caboy | Word VBA | 2 | 04-01-2015 07:00 AM |
How can select from a specific character to another character | mohsen.amiri | Word | 2 | 02-19-2015 11:38 PM |
Replace paragraph-marks (line-breaks) in tables with a character-string | Aztec | Word VBA | 2 | 04-02-2013 10:52 PM |
How to import a text file but skip the first line regardless of characters? | omahadivision | Excel Programming | 7 | 02-01-2013 08:30 PM |