Thread: [Solved] Extract Webdata from word
Posted 10-12-2015, 11:31 PM by PRA007 (Windows 7 32bit, Office 2010 32bit)

Can I extract web data into Word based on a pattern in a Microsoft Word document?

For example:

I have a document like this:



I want to find the patent numbers in the document and extract data from Google Patents based on their pattern.

In the case of a cell having multiple numbers, I want only the first line to be searched for the pattern.
In the case of WO numbers, I want to extract the title from Google Patents.
In the case of others, like US, EP and CN, I want only the claims to be extracted from Google Patents.

There will also be some numbers for which the Google Patents link might not work. I want the code to ignore them.
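
The rules above can be sketched as a small helper that looks only at the first line of a cell and decides what to fetch. This is a minimal sketch, not a tested solution: the function name `ExtractionMode` and the return values "TITLE", "CLAIMS" and "SKIP" are made up for illustration, and it assumes that any prefix other than WO, US, EP or CN should be skipped.

```vba
' Hypothetical helper: decide what to pull from Google Patents for one cell.
' Only the first line of the cell is inspected.
Function ExtractionMode(ByVal cellText As String) As String
    Dim firstLine As String
    Dim pos As Long

    ' Word ends a table cell with Chr(13) & Chr(7); paragraphs end with Chr(13)
    ' and manual line breaks are Chr(11). Treat all of them as line ends.
    firstLine = Replace(cellText, Chr(7), "")
    firstLine = Replace(firstLine, Chr(11), Chr(13))
    pos = InStr(firstLine, Chr(13))
    If pos > 0 Then firstLine = Left$(firstLine, pos - 1)
    firstLine = UCase$(Trim$(firstLine))

    Select Case Left$(firstLine, 2)
        Case "WO"
            ExtractionMode = "TITLE"
        Case "US", "EP", "CN"
            ExtractionMode = "CLAIMS"
        Case Else
            ExtractionMode = "SKIP"   ' assumption: unknown prefixes are ignored
    End Select
End Function
```

For a table cell, the text could be passed in as `ExtractionMode(oCell.Range.Text)`, where `oCell` is whatever Cell object the calling loop uses.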

The pattern for converting a number to a Google Patents link is as follows.

I learned from this forum how to convert the numbers to links:


Code:
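Sub AddGPHLink()
' Turn every patent number such as "EP 2431370 B1" into a Google Patents hyperlink.
Dim oRng As Range
Dim strLink As String
    Set oRng = ActiveDocument.Range
    With oRng.Find
        ' Wildcard search: two-letter country code, 4+ digits, 1-2 character kind code
        Do While .Execute(FindText:="([USEPCNAWO]{2}) ([0-9]{4,}) ([A-Z0-9]{1,2})", MatchWildcards:=True)
            strLink = oRng.Text
            ' Remove the spaces to get e.g. "EP2431370B1"
            strLink = Replace(strLink, Chr(32), "")
            strLink = "https://www.google.co.in/patents/" & strLink & "?cl=en"
            ActiveDocument.Hyperlinks.Add Anchor:=oRng, _
                                          Address:=strLink, _
                                          TextToDisplay:=oRng.Text
            ' Move past the new hyperlink field so the same number is not found again
            oRng.End = oRng.Fields(1).Result.End
            oRng.Collapse 0
        Loop
    End With
lbl_Exit:
    Set oRng = Nothing
    Exit Sub
End Sub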
This macro can be used to build the target link.

This is an example Google Patents page:

https://www.google.co.in/patents/EP2431370B1?cl=en
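
A page like the one above could be downloaded from a macro roughly as follows. This is only a sketch: `FetchPage` is a hypothetical name, it uses the late-bound `MSXML2.XMLHTTP` object, and it assumes the site returns the same HTML to a plain GET request as it shows in a browser, which is not guaranteed. Any error or non-200 status gives an empty string, so numbers whose link does not work can simply be skipped.

```vba
' Sketch: download a page and return its HTML, or "" if the link does not work.
Function FetchPage(ByVal url As String) As String
    Dim http As Object

    On Error GoTo Failed
    Set http = CreateObject("MSXML2.XMLHTTP")   ' late binding, no reference needed
    http.Open "GET", url, False                 ' False = synchronous request
    http.send
    If http.Status = 200 Then FetchPage = http.responseText
    Exit Function
Failed:
    FetchPage = ""      ' network error or bad URL: caller skips this number
End Function
```

The caller can then test `If Len(FetchPage(strLink)) = 0 Then` and move on to the next number.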



I want to extract data from the web into the Word document like this:



On the web page there are two fields I want to target.


The first goes like this:


On line 3 of the page source it is written like this:
<html style="height: 100%;"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents</title> ... (the rest of the line is inline script code)

I want to extract
"Optically active diamine derivative and process for producing the same"
into Word as the title.
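
One possible way to get that title out of the downloaded HTML is plain string handling on the `<title>` element. This is a sketch that assumes the "Patent <number> - <title> - Google Patents" layout seen in the source line above; `PatentTitle` is a hypothetical name.

```vba
' Sketch: pull the patent title out of the page's <title> element.
' Assumes the "Patent <number> - <title> - Google Patents" layout.
Function PatentTitle(ByVal html As String) As String
    Const SUFFIX As String = " - Google Patents"
    Dim startPos As Long, endPos As Long, dashPos As Long
    Dim t As String

    startPos = InStr(1, html, "<title>", vbTextCompare)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len("<title>")
    endPos = InStr(startPos, html, "</title>", vbTextCompare)
    If endPos = 0 Then Exit Function
    t = Mid$(html, startPos, endPos - startPos)

    ' Drop the trailing " - Google Patents", if present.
    If Right$(t, Len(SUFFIX)) = SUFFIX Then t = Left$(t, Len(t) - Len(SUFFIX))
    ' Drop the leading "Patent EP1925611A1 - " part, if present.
    dashPos = InStr(t, " - ")
    If Left$(t, 7) = "Patent " And dashPos > 0 Then t = Mid$(t, dashPos + 3)

    PatentTitle = Trim$(t)
End Function
```

If the page has no `<title>` element, the function returns an empty string, which the caller can treat the same way as a link that does not work.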

The second is the claims.
There are two fields in a claim:

1. claim number
Claim numbers come in two types.

Independent claims have a pattern like this:
<li class="claim"> <div id="c-en-0001" num="0001" class="claim">

Dependent claims have a pattern like this:
</li> <li class="claim-dependent"> <div id="c-en-0002" num="0002" class="claim">


2. claim text
<div class="claim-text">A process for producing a compound represented by formula (II):
<chemistry id="chem0047" num="0047"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png"> <img id="ib0047" file="imgb0047.tif" wi="40" he="40" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png" class="patent-full-image" width="160" height="160" alt="Figure imgb0047"> </a> </div> <attachments> <attachment idref="chem0047" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0047" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry>
(wherein Y represents -COR, wherein R represents a C1-C8 alkoxy group, a C6-C14 aryloxy group, a C2-C8 alkenyloxy group, a C7-C26 aralkyloxy group, or a di(C1-C6 alkyl)amino group; and R<sup>1</sup> represents a C2-C7 alkoxycarbonyl group), which comprises treating a compound represented by formula (I):
<chemistry id="chem0048" num="0048"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png"> <img id="ib0048" file="imgb0048.tif" wi="26" he="34" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png" class="patent-full-image" width="104" height="136" alt="Figure imgb0048"> </a> </div> <attachments> <attachment idref="chem0048" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0048" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry>
(wherein Y has the same meaning as defined above) in a solvent with aqueous ammonia or a solution of ammonia in C1-C4 alcohol and, subsequently, with a di(C1-C6 alkyl) dicarbonate.</div>
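
A claim-text fragment like the one above could be reduced to plain text by stripping every tag. This is a sketch using the late-bound `VBScript.RegExp` object; `PlainText` is a hypothetical name. The chemistry images are dropped, and HTML entities such as `&amp;` are not decoded.

```vba
' Sketch: reduce one claim-text fragment to plain text.
Function PlainText(ByVal html As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True

    re.Pattern = "<[^>]*>"          ' any HTML tag, including <img> and <sup>
    html = re.Replace(html, "")
    re.Pattern = "\s+"              ' collapse runs of spaces and line breaks
    html = re.Replace(html, " ")
    PlainText = Trim$(html)
End Function
```

With this approach `R<sup>1</sup>` comes out as `R1`, so superscript formatting is lost.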

I would like to have two different macros, or one macro that lets me choose whether I want the dependent or the independent claims.

The claim text follows the claim number markup, which also indicates whether a claim is dependent or independent.

I want the claim numbering to be plain text.
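
Choosing between dependent and independent claims, with plain-text numbering, might look like the sketch below. It relies only on the two `<li class="claim">` / `<li class="claim-dependent">` patterns quoted earlier and on the `num` attribute; `ClaimNumbers` is a hypothetical name, and real pages may differ from these samples.

```vba
' Sketch: list claim numbers of one kind as plain text, e.g. "1, 5, 9".
' wantDependent = False -> independent claims; True -> dependent claims.
Function ClaimNumbers(ByVal html As String, ByVal wantDependent As Boolean) As String
    Dim re As Object, m As Object
    Dim isDependent As Boolean
    Dim result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Group 1: "-dependent" or empty. Group 2: the digits of num="0001".
    re.Pattern = "<li class=""claim(-dependent)?"">\s*<div[^>]*num=""(\d+)"""

    For Each m In re.Execute(html)
        isDependent = (Len(m.SubMatches(0)) > 0)
        If isDependent = wantDependent Then
            If Len(result) > 0 Then result = result & ", "
            result = result & CStr(CLng(m.SubMatches(1)))   ' "0002" -> "2"
        End If
    Next m
    ClaimNumbers = result
End Function
```

A single macro could ask the user which kind is wanted and pass the answer as `wantDependent`, instead of keeping two separate macros.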

Is this possible with a Word macro?

The Word files can be found here.
Sorry for not attaching them as attachments; that is blocked on my PC.

https://sites.google.com/site/rahula...edirects=0&d=1

https://sites.google.com/site/rahula...edirects=0&d=1