#1
Can I extract web data to Word based on a pattern in a Microsoft Word document?
For example, I have a document like this: [image]

I want to find the patent numbers in the document and extract data from Google Patents based on their pattern.
If a cell has multiple numbers, I want only the first line to be searched for the pattern.
For WO numbers, I want to extract the title from Google Patents.
For others, such as US, EP and CN, I want only the claims to be extracted from Google Patents.
There will also be some numbers for which the Google Patents link might not work. I want the code to ignore them.

The pattern for turning a number into a Google Patents link is as follows. I learned from this forum how to convert numbers to links:
Code:
    Sub AddGPHLink()
        Dim oRng As Range
        Dim strLink As String
        Set oRng = ActiveDocument.Range
        With oRng.Find
            ' Wildcard search: country code, number, kind code (e.g. "EP 2431370 B1")
            Do While .Execute(FindText:="([USEPCNAWO]{2}) ([0-9]{4,}) ([A-Z0-9]{1,2})", MatchWildcards:=True)
                strLink = oRng.Text
                strLink = Replace(strLink, Chr(32), "")
                strLink = "https://www.google.co.in/patents/" & strLink & "?cl=en"
                ActiveDocument.Hyperlinks.Add Anchor:=oRng, _
                    Address:=strLink, _
                    TextToDisplay:=oRng.Text
                ' Move past the new hyperlink field before searching again
                oRng.End = oRng.Fields(1).Result.End
                oRng.Collapse 0
            Loop
        End With
    lbl_Exit:
        Set oRng = Nothing
        Exit Sub
    End Sub

This is an example Google Patents page: https://www.google.co.in/patents/EP2431370B1?cl=en [image]

I want to extract data from the web page into the Word document like this: [image]

On the web page there are two fields I want to target.

The first is the title. On line 3 of the page source it is written like this:
    <html style="height: 100%;"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents</title> ...
I want "Optically active diamine derivative and process for producing the same" to be extracted to Word as the title.

The second is the claims. There are two fields in a claim:

1. Claim number. Claim numbers come in two types.
Independent claims have a pattern like this:
    <li class="claim"> <div id="c-en-0001" num="0001" class="claim">
Dependent claims have a pattern like this:
    </li> <li class="claim-dependent"> <div id="c-en-0002" num="0002" class="claim">

2. Claim text:
    <div class="claim-text">A process for producing a compound represented by formula (II):
    <chemistry id="chem0047" num="0047">
    <div class="patent-image">
    <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png">
    <img id="ib0047" file="imgb0047.tif" wi="40" he="40" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png" class="patent-full-image" width="160" height="160" alt="Figure imgb0047">
    </a>
    </div>
    <attachments>
    <attachment idref="chem0047" attachment-type="cdx" file="CDX"> </attachment>
    <attachment idref="chem0047" attachment-type="mol" file="MOL"> </attachment>
    </attachments>
    </chemistry>
    (wherein Y represents -COR, wherein R represents a C1-C8 alkoxy group, a C6-C14 aryloxy group, a C2-C8 alkenyloxy group, a C7-C26 aralkyloxy group, or a di(C1-C6 alkyl)amino group; and R<sup>1</sup> represents a C2-C7 alkoxycarbonyl group), which comprises treating a compound represented by formula (I):
    <chemistry id="chem0048" num="0048">
    <div class="patent-image">
    <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png">
    <img id="ib0048" file="imgb0048.tif" wi="26" he="34" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png" class="patent-full-image" width="104" height="136" alt="Figure imgb0048">
    </a>
    </div>
    <attachments>
    <attachment idref="chem0048" attachment-type="cdx" file="CDX"> </attachment>
    <attachment idref="chem0048" attachment-type="mol" file="MOL"> </attachment>
    </attachments>
    </chemistry>
    (wherein Y has the same meaning as defined above) in a solvent with aqueous ammonia or a solution of ammonia in C1-C4 alcohol and, subsequently, with a di(C1-C6 alkyl) dicarbonate.</div>

I would like either two different macros, or one macro that lets me choose whether I want the dependent or the independent claims. The claim text follows the claim number element, which also determines whether the claim is dependent or independent. I want the claim numbering to be plain text.

Is this possible with a Word macro?

The Word files can be found here. Sorry for not attaching them; attachments are forbidden on my PC.
https://sites.google.com/site/rahula...edirects=0&d=1
https://sites.google.com/site/rahula...edirects=0&d=1
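The routing rules described in this post (search only the first line of a cell, take the title for WO numbers, take the claims for US, EP and CN, ignore everything else) can be sketched as three small helpers. This is a minimal sketch and not code from the thread: the names `PatentUrl`, `WantedField` and `FirstLine` are hypothetical, and it assumes the number is available as text such as "EP 2431370 B1".

```vba
' Hypothetical helper: turn "EP 2431370 B1" into the link format used in this thread.
Function PatentUrl(ByVal PatentNo As String) As String
    PatentUrl = "https://www.google.co.in/patents/" & _
                Replace(Trim$(PatentNo), " ", "") & "?cl=en"
End Function

' Hypothetical helper: decide what to extract from the country code.
' Returns "title" for WO, "claims" for US/EP/CN, and "" for anything else (ignored).
Function WantedField(ByVal PatentNo As String) As String
    Select Case UCase$(Left$(Trim$(PatentNo), 2))
        Case "WO": WantedField = "title"
        Case "US", "EP", "CN": WantedField = "claims"
        Case Else: WantedField = ""
    End Select
End Function

' Hypothetical helper: keep only the first line of a table cell's text.
Function FirstLine(ByVal CellText As String) As String
    Dim p As Long
    ' Word uses Chr(13) for paragraph ends and Chr(11) for manual line breaks
    CellText = Replace(CellText, Chr(11), Chr(13))
    p = InStr(CellText, Chr(13))
    If p > 0 Then CellText = Left$(CellText, p - 1)
    FirstLine = Trim$(CellText)
End Function
```

For example, `PatentUrl("EP 2431370 B1")` gives the example address quoted above, and a caller could skip any number for which `WantedField` returns an empty string.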
#2
I've never done much web data retrieval, but here's some code to get you started:
Code:
    Sub AddGPHLink()
        Dim Rng As Range, StrTxt As String, Tbl As Table, i As Long
        Const StrLnk As String = "https://www.google.co.in/patents/"
        With ActiveDocument
            With .Range
                'Pass 1: US, EP, CN and WO numbers such as "EP 2431370 B1"
                With .Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Format = False
                    .Forward = True
                    .Wrap = wdFindStop
                    .MatchWildcards = True
                    .Text = "[UECW][SPNO] [0-9]{4,} [A-Z0-9]{1,2}"
                    .Replacement.Text = ""
                    .Execute
                End With
                Do While .Find.Found
                    Set Rng = .Duplicate
                    With Rng
                        StrTxt = .Text
                        .Delete
                    End With
                    .Hyperlinks.Add Anchor:=Rng, TextToDisplay:=StrTxt, _
                        Address:=StrLnk & Replace(StrTxt, " ", "") & "?cl=en"
                    .Collapse wdCollapseEnd
                    .Find.Execute
                Loop
                'Pass 2: IN numbers, searching the whole document again
                .Start = ActiveDocument.Range.Start
                .End = ActiveDocument.Range.End
                With .Find
                    .Text = "IN [0-9A-Z]{7,}"
                    .Execute
                End With
                Do While .Find.Found
                    Set Rng = .Duplicate
                    With Rng
                        StrTxt = .Text
                        .Delete
                    End With
                    .Hyperlinks.Add Anchor:=Rng, TextToDisplay:=StrTxt, _
                        Address:=StrLnk & Replace(StrTxt, " ", "") & "?cl=en"
                    .Collapse wdCollapseEnd
                    .Find.Execute
                Loop
            End With
            'For each table row, fetch the data for the link in column 2 into column 5
            For Each Tbl In .Tables
                With Tbl
                    For i = 1 To .Rows.Count
                        StrTxt = ""
                        With .Cell(i, 2).Range
                            If .Hyperlinks.Count > 0 Then
                                StrTxt = Get_URL_Data(.Hyperlinks(1).Address)
                            End If
                        End With
                        .Cell(i, 5).Range.Text = StrTxt
                    Next
                End With
            Next
        End With
        Set Rng = Nothing
    End Sub

    Function Get_URL_Data(StrUrl As String) As String
        'References to Microsoft Internet Controls & Microsoft HTML Object Library required
        Dim Browser As SHDocVw.InternetExplorer
        Dim HTMLDoc As MSHTML.HTMLDocument
        Dim StrTmp As String
        Set Browser = New SHDocVw.InternetExplorer
        'Open the web page
        Browser.navigate StrUrl
        'Wait until the page has finished loading
        Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
            DoEvents
        Loop
        Set HTMLDoc = Browser.Document
        'Get the data: the page title reads "Patent <number> - <title> - Google Patents"
        On Error Resume Next
        StrTmp = Split(HTMLDoc.Title, " - ")(1)
        On Error GoTo 0
        Get_URL_Data = StrTmp
        'Close the browser
        Browser.Quit
        Set HTMLDoc = Nothing: Set Browser = Nothing
        Application.ScreenUpdating = True
    End Function
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
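One thing to watch with `Split(HTMLDoc.Title, " - ")(1)` is that a patent title which itself contains " - " would be cut at that point. The sketch below shows one possible alternative that takes everything between the first and the last " - " in the page title. It is not code from the thread, the name `TitleFromPageTitle` is hypothetical, and it assumes the page title keeps the "Patent <number> - <title> - Google Patents" shape shown in post #1.

```vba
' Hypothetical helper: extract the patent title from the browser page title.
' Returns "" when the page title does not have the expected shape,
' which is one way to spot a link that did not lead to a patent page.
Function TitleFromPageTitle(ByVal PageTitle As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(PageTitle, " - ")      ' first separator, after "Patent <number>"
    p2 = InStrRev(PageTitle, " - ")   ' last separator, before "Google Patents"
    If p1 > 0 And p2 > p1 Then
        TitleFromPageTitle = Mid$(PageTitle, p1 + 3, p2 - p1 - 3)
    End If
End Function
```

In `Get_URL_Data`, the `Split` line could then be replaced by `StrTmp = TitleFromPageTitle(HTMLDoc.Title)`, which also makes the `On Error Resume Next` guard unnecessary for that line.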
#3
Thanks, Paul. I didn't even know whether this was possible. Thanks for the help.
#4
Suppose I want more than one string. In the case of EP1925611A1, for example, I also want the claim text to be pasted after the title. The claim text is:
<div class="claim-text">A process for producing a compound represented by formula (II): <chemistry id="chem0047" num="0047"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png"> <img id="ib0047" file="imgb0047.tif" wi="40" he="40" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png" class="patent-full-image" width="160" height="160" alt="Figure imgb0047"> </a> </div> <attachments> <attachment idref="chem0047" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0047" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry> (wherein Y represents -COR, wherein R represents a C1-C8 alkoxy group, a C6-C14 aryloxy group, a C2-C8 alkenyloxy group, a C7-C26 aralkyloxy group, or a di(C1-C6 alkyl)amino group; and R<sup>1</sup> represents a C2-C7 alkoxycarbonyl group), which comprises treating a compound represented by formula (I): <chemistry id="chem0048" num="0048"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png"> <img id="ib0048" file="imgb0048.tif" wi="26" he="34" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png" class="patent-full-image" width="104" height="136" alt="Figure imgb0048"> </a> </div> <attachments> <attachment idref="chem0048" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0048" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry> (wherein Y has the same meaning as defined above) in a solvent with aqueous ammonia or a solution of ammonia in C1-C4 alcohol and, subsequently, with a di(C1-C6 alkyl) 
dicarbonate.</div> How can I find this string and get the data together with the images? And how can I paste the claims only for patents other than WO patents? Sorry for the extended questions, but I don't have basic knowledge of HTML or macros.
#5
As I said in my previous reply, I've never done much web data retrieval, so I'm not familiar with how to get all the data you're after. I suggest looking on the web for code samples using the Browser.Document object (which is what the code I posted uses to define the HTMLDoc variable) and related methods, etc.
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
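One way to follow that suggestion is to walk the elements of the loaded document by class name. The sketch below is not from the thread. It assumes the page still uses the `claim` and `claim-dependent` classes and the `num` attribute quoted in post #1, that `HTMLDoc` is the `MSHTML.HTMLDocument` taken from `Browser.Document`, and that the installed Internet Explorer supports `getElementsByClassName`. The function name `Get_Claims` is hypothetical.

```vba
' Hypothetical helper: collect claim text from a loaded Google Patents page.
' IndependentOnly = True skips claims nested in <li class="claim-dependent">.
' Requires a reference to the Microsoft HTML Object Library.
Function Get_Claims(HTMLDoc As MSHTML.HTMLDocument, _
                    Optional ByVal IndependentOnly As Boolean = False) As String
    Dim Claims As Object, Claim As Object
    Dim i As Long, StrOut As String, IsDependent As Boolean
    Set Claims = HTMLDoc.getElementsByClassName("claim")
    For i = 0 To Claims.Length - 1
        Set Claim = Claims.Item(i)
        ' Both <li class="claim"> and <div class="claim"> match; keep only the <div>
        If UCase$(Claim.tagName) = "DIV" Then
            IsDependent = False
            If Not Claim.parentElement Is Nothing Then
                IsDependent = (Claim.parentElement.className = "claim-dependent")
            End If
            If Not (IndependentOnly And IsDependent) Then
                ' num="0001" becomes the plain-text number 1
                StrOut = StrOut & Val(Claim.getAttribute("num") & "") & ". " & _
                         Trim$(Claim.innerText) & vbCr
            End If
        End If
    Next i
    Get_Claims = StrOut
End Function
```

Because `innerText` returns text only, the chemistry images inside a claim are not carried over by this approach. If the page text already starts each claim with its number, the numeric prefix added here can be dropped.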
#6
Cross-posted at: http://answers.microsoft.com/en-us/o...3-5bd69975ea31
For cross-posting etiquette, please read: http://www.excelguru.ca/content.php?184
__________________
Cheers, Paul Edstein [Fmr MS MVP - Word]
#7
Apologies for not mentioning it. I generally do mention it when cross-posting.
This time I posted from home, which might be the reason. I also cross-posted this message here, and did mention it: http://stackoverflow.com/questions/3...microsoft-word I did not get an answer. I asked it a different way and still did not get an answer. I got a suggestion from Cindy Meister [Microsoft MVP since 1996] which was similar to the suggestion you gave. Quote:
I have started learning, will work out the answer on my own, and will surely post it here.
#8
This is one step towards answering my question.
Code:
    Sub USPTOAbstHTML1()
        'References to Microsoft Internet Controls & Microsoft HTML Object Library required
        Application.ScreenUpdating = False
        Dim Tbl As Table, StrTxt As String, HttpReq As Object, i As Long
        Dim oHtml As MSHTML.HTMLDocument, IE As SHDocVw.InternetExplorer
        Set HttpReq = CreateObject("Microsoft.XMLHTTP")
        Set oHtml = New HTMLDocument
        Set IE = CreateObject("InternetExplorer.Application")
        With ActiveDocument.Range
            For Each Tbl In .Tables
                With Tbl
                    For i = 1 To .Rows.Count
                        With .Cell(i, 2).Range
                            If .Hyperlinks.Count > 0 Then
                                MsgBox .Hyperlinks(1).Address 'debugging aid
                                'Download the page source without opening a browser window
                                HttpReq.Open "GET", .Hyperlinks(1).Address, False
                                HttpReq.send
                                oHtml.body.innerHTML = HttpReq.responseText
                                MsgBox HttpReq.responseText 'debugging aid
                                'Take the HTML of the first element with class "claim"
                                StrTxt = oHtml.getElementsByClassName("claim").Item(0).innerHTML
                                'Render that HTML in a hidden browser and copy it to the clipboard
                                With IE
                                    .Visible = False
                                    .navigate "about:blank"
                                    Do While .Busy Or .readyState <> READYSTATE_COMPLETE
                                        DoEvents
                                    Loop
                                    .Document.body.innerHTML = StrTxt
                                    .Document.execCommand "SelectAll"
                                    .Document.execCommand "Copy"
                                End With
                                With Tbl.Cell(i, 5).Range
                                    'Pastes at the current Selection
                                    Selection.PasteAndFormat (wdPasteDefault)
                                End With
                            End If
                        End With
                    Next
                End With
            Next
        End With
        IE.Quit
        Set IE = Nothing: Set oHtml = Nothing: Set HttpReq = Nothing
        Application.ScreenUpdating = True
    End Sub

Can we allow access to Internet Explorer by default, so that the prompt doesn't pop up every time?
Last edited by PRA007; 12-02-2015 at 09:18 AM.
#9
In the code above, one problem is with Selection.PasteAndFormat (wdPasteDefault).
The paste doesn't go to the specified location but to whichever column is currently selected. How can I paste to a specific range in Word?
#10
This works perfectly:
Code:
    Set rng = Tbl.Cell(i, 5).Range
    rng.Collapse wdCollapseStart
    rng.PasteAndFormat wdPasteDefault
Tags: macro, website, word 2010