Microsoft Office Forums

  #1  
10-12-2015, 11:31 PM
PRA007 (Windows 7 32bit, Office 2010 32bit)
Extract Webdata from word

Can I extract web data into Word based on a pattern found in a Microsoft Word document?

For example:

I have a document like this:



I want to find the patent numbers in the document and extract data from Google Patents based on the number pattern.

If a cell contains multiple numbers, I want only its first line to be searched for the pattern.
For WO numbers, I want to extract the title from Google Patents.
For others, such as US, EP and CN, I want only the claims to be extracted from Google Patents.

There will also be some numbers for which the Google Patents link might not work. I want the code to ignore them.
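
The rules above could be sketched roughly as follows. This is only an illustration, not code from the thread: FirstLine and ExtractionKind are hypothetical helper names, and returning "skip" for any prefix other than WO, US, EP or CN is an assumption. Links that fail to load would still have to be handled when the page is requested.

```vba
' Hypothetical helpers illustrating the rules described above.

' Returns the first line of a table cell's text.
' A Word cell's Range.Text ends with Chr(13) & Chr(7).
Function FirstLine(ByVal CellText As String) As String
    Dim s As String, p As Long
    s = Replace(CellText, Chr(7), "")      ' drop the end-of-cell marker
    s = Replace(s, Chr(11), vbCr)          ' treat manual line breaks as line ends
    p = InStr(s, vbCr)
    If p > 0 Then s = Left$(s, p - 1)
    FirstLine = Trim$(s)
End Function

' Decides what to fetch for a patent number, based on its two-letter prefix.
Function ExtractionKind(ByVal PatentNumber As String) As String
    Select Case UCase$(Left$(Trim$(PatentNumber), 2))
        Case "WO"
            ExtractionKind = "title"
        Case "US", "EP", "CN"
            ExtractionKind = "claims"
        Case Else
            ExtractionKind = "skip"        ' assumption: ignore anything else
    End Select
End Function
```

For a cell, ExtractionKind(FirstLine(Tbl.Cell(i, 2).Range.Text)) would then say whether to fetch the title, the claims, or nothing.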

The pattern for converting a number into a Google Patents link is as follows.

I learned from this forum how to convert the numbers into links:


Code:
Sub AddGPHLink()
Dim oRng As Range
Dim strLink As String
    Set oRng = ActiveDocument.Range
    With oRng.Find
        'Wildcard search: two-letter prefix, number, kind code (e.g. "EP 2431370 B1")
        Do While .Execute(FindText:="([USEPCNAWO]{2}) ([0-9]{4,}) ([A-Z0-9]{1,2})", MatchWildcards:=True)
            'Build the link from the matched text with its spaces removed
            strLink = oRng.Text
            strLink = Replace(strLink, Chr(32), "")
            strLink = "https://www.google.co.in/patents/" & strLink & "?cl=en"
            ActiveDocument.Hyperlinks.Add Anchor:=oRng, _
                                          Address:=strLink, _
                                          TextToDisplay:=oRng.Text
            'Continue searching after the new hyperlink field
            oRng.End = oRng.Fields(1).Result.End
            oRng.Collapse 0
        Loop
    End With
lbl_Exit:
    Set oRng = Nothing
    Exit Sub
End Sub
This can be used to build the target link.

This is an example Google Patents page:

https://www.google.co.in/patents/EP2431370B1?cl=en





I want to extract data from the web into the Word document like this:



On the web page there are two fields I want to target.


The first looks like this:


On line 3 of the page source it is written like this:
<html style="height: 100%;"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents</title>

I want
"Optically active diamine derivative and process for producing the same"
to be extracted into Word as the title.
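
A minimal sketch of pulling that title out of the page title text, assuming the title always has the form "Patent <number> - <title> - Google Patents" as in the sample above. TitleFromPageTitle is a hypothetical name. Searching for the suffix from the right means a patent title that itself contains " - " is not cut short.

```vba
' Hypothetical helper: returns the middle part of a page title of the form
' "Patent <number> - <title> - Google Patents" (format assumed from the sample).
Function TitleFromPageTitle(ByVal PageTitle As String) As String
    Const Sep As String = " - "
    Const Suffix As String = " - Google Patents"
    Dim StartPos As Long, EndPos As Long
    StartPos = InStr(1, PageTitle, Sep)
    If StartPos = 0 Then Exit Function          ' no separator: return ""
    StartPos = StartPos + Len(Sep)              ' first character of the title
    EndPos = InStrRev(PageTitle, Suffix)        ' start of the trailing suffix
    ' Suffix missing, or not after the first separator: keep the rest of the string
    If EndPos < StartPos Then EndPos = Len(PageTitle) + 1
    TitleFromPageTitle = Mid$(PageTitle, StartPos, EndPos - StartPos)
End Function
```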

The second is the claims.
There are two fields in each claim:

1. Claim number
There are two types of claim.

Independent claims have a pattern like this:
<li class="claim"> <div id="c-en-0001" num="0001" class="claim">

Dependent claims have a pattern like this:
</li> <li class="claim-dependent"> <div id="c-en-0002" num="0002" class="claim">


2. Claim text
<div class="claim-text">A process for producing a compound represented by formula (II):
<chemistry id="chem0047" num="0047"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png"> <img id="ib0047" file="imgb0047.tif" wi="40" he="40" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0047.png" class="patent-full-image" width="160" height="160" alt="Figure imgb0047"> </a> </div> <attachments> <attachment idref="chem0047" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0047" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry>
(wherein Y represents -COR, wherein R represents a C1-C8 alkoxy group, a C6-C14 aryloxy group, a C2-C8 alkenyloxy group, a C7-C26 aralkyloxy group, or a di(C1-C6 alkyl)amino group; and R<sup>1</sup> represents a C2-C7 alkoxycarbonyl group), which comprises treating a compound represented by formula (I):
<chemistry id="chem0048" num="0048"> <div class="patent-image"> <a href="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png"> <img id="ib0048" file="imgb0048.tif" wi="26" he="34" img-content="chem" img-format="tif" src="./Patent EP1925611A1 - Optically active diamine derivative and process for producing the same - Google Patents_files/imgb0048.png" class="patent-full-image" width="104" height="136" alt="Figure imgb0048"> </a> </div> <attachments> <attachment idref="chem0048" attachment-type="cdx" file="CDX"> </attachment> <attachment idref="chem0048" attachment-type="mol" file="MOL"> </attachment> </attachments> </chemistry>
(wherein Y has the same meaning as defined above) in a solvent with aqueous ammonia or a solution of ammonia in C1-C4 alcohol and, subsequently, with a di(C1-C6 alkyl) dicarbonate.</div>

I would like to have either two different macros, or one macro that lets me choose which claims I want, dependent or independent.

Each claim's text is accompanied by its claim number, and the claim-number markup also determines whether the claim is dependent or independent.

I want the claim numbering to be plain text.

Is this possible with a Word macro?
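
One way the claim selection might be approached is sketched below. It assumes the markup shown above: sibling <li class="claim"> and <li class="claim-dependent"> items, each containing a <div num="..."> that wraps the claim text. ClaimsFromHtml and ClaimLine are hypothetical names, the late-bound "htmlfile" object stands in for a reference to the Microsoft HTML Object Library, and only plain text is returned, so the embedded images are not handled.

```vba
' Sketch only: returns the claims as plain text, one paragraph per claim.
' Html is the page source (or just the claims list) as a string.
Function ClaimsFromHtml(ByVal Html As String, ByVal IndependentOnly As Boolean) As String
    Dim Doc As Object, Items As Object, Li As Object
    Dim i As Long, Result As String
    Set Doc = CreateObject("htmlfile")      ' late-bound MSHTML document
    Doc.body.innerHTML = Html
    Set Items = Doc.getElementsByTagName("li")
    For i = 0 To Items.Length - 1
        Set Li = Items.Item(i)
        Select Case LCase$(Li.className)
            Case "claim"                    ' independent claim
                Result = Result & ClaimLine(Li)
            Case "claim-dependent"          ' dependent claim
                If Not IndependentOnly Then Result = Result & ClaimLine(Li)
        End Select
    Next i
    ClaimsFromHtml = Result
End Function

' Builds "<number>. <text>" with the number as plain text ("0001" -> "1").
Private Function ClaimLine(ByVal Li As Object) As String
    Dim Divs As Object, Num As Variant
    Num = ""
    Set Divs = Li.getElementsByTagName("div")
    If Divs.Length > 0 Then Num = Divs.Item(0).getAttribute("num")
    If IsNull(Num) Then Num = ""
    If IsNumeric(Num) Then Num = CStr(CLng(Num))
    ClaimLine = Num & ". " & Trim$(Li.innerText) & vbCr
End Function
```

Calling ClaimsFromHtml(PageSource, True) would keep only the independent claims, and ClaimsFromHtml(PageSource, False) would keep both kinds, so a single macro could offer the choice.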

The Word files can be found here.
Sorry for not attaching them as attachments; that is forbidden on my PC.

https://sites.google.com/site/rahula...edirects=0&d=1

https://sites.google.com/site/rahula...edirects=0&d=1
  #2  
10-15-2015, 12:55 AM
macropod (Windows 7 64bit, Office 2010 32bit)
Administrator

I've never done much web data retrieval, but here's some code to get you started:
Code:
Sub AddGPHLink()
Dim Rng As Range, StrTxt As String, Tbl As Table, i As Long
Const StrLnk As String = "https://www.google.co.in/patents/"
Dim strLink As String
With ActiveDocument
  With .Range
    With .Find
      .ClearFormatting
      .Replacement.ClearFormatting
      .Format = False
      .Forward = True
      .Wrap = wdFindStop
      .MatchWildcards = True
      .Text = "[UECW][SPNO] [0-9]{4,} [A-Z0-9]{1,2}"
      .Replacement.Text = ""
      .Execute
    End With
    Do While .Find.Found
      Set Rng = .Duplicate
      With Rng
        StrTxt = .Text
        .Delete
      End With
      .Hyperlinks.Add Anchor:=Rng, TextToDisplay:=StrTxt, _
        Address:=StrLnk & Replace(StrTxt, " ", "") & "?cl=en"
      .Collapse wdCollapseEnd
      .Find.Execute
    Loop
    .Start = ActiveDocument.Range.Start
    With .Find
      .Text = "IN [0-9A-Z]{7,}"
      .Execute
    End With
    Do While .Find.Found
      Set Rng = .Duplicate
      With Rng
        StrTxt = .Text
        .Delete
      End With
      .Hyperlinks.Add Anchor:=Rng, TextToDisplay:=StrTxt, _
        Address:=StrLnk & Replace(StrTxt, " ", "") & "?cl=en"
      .Collapse wdCollapseEnd
      .Find.Execute
    Loop
  End With
  For Each Tbl In .Tables
    With Tbl
      For i = 1 To .Rows.Count
        StrTxt = ""
        With .Cell(i, 2).Range
          If .Hyperlinks.Count > 0 Then
            StrTxt = Get_URL_Data(.Hyperlinks(1).Address)
          End If
        End With
        .Cell(i, 5).Range.Text = StrTxt
      Next
    End With
  Next
End With
Set Rng = Nothing
End Sub
 
Function Get_URL_Data(StrUrl As String) As String
'References to Microsoft Internet Controls & Microsoft HTML Object Library required
Dim Browser As SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim StrTmp As String
Set Browser = New SHDocVw.InternetExplorer
'Open the web page and wait until it has finished loading
Browser.navigate StrUrl
Do While Browser.Busy Or Browser.readyState <> 4 '4 = READYSTATE_COMPLETE
  DoEvents
Loop
Set HTMLDoc = Browser.Document
'Get the data: the page title has the form "Patent <number> - <title> - Google Patents"
On Error Resume Next
StrTmp = Split(HTMLDoc.Title, " - ")(1)
On Error GoTo 0
Get_URL_Data = StrTmp
'Close the browser
Browser.Quit
Set HTMLDoc = Nothing: Set Browser = Nothing
End Function
__________________
Cheers,
Paul Edstein
[Fmr MS MVP - Word]
  #3  
10-15-2015, 04:11 AM
PRA007 (Windows 7 32bit, Office 2010 32bit)
Thanks Paul

Thanks, Paul. I didn't even know whether this was possible. Thanks for the help.
  #4  
10-15-2015, 05:34 AM
PRA007 (Windows 7 32bit, Office 2010 32bit)
In case of text with embedded images

What if I want multiple strings? For EP1925611A1, for example, I also want the claim text to be pasted after the title. That is the <div class="claim-text"> block quoted in my first post, which contains embedded <chemistry> images.

How do I find this string and get the data together with its images?
How do I paste the claims only for patents other than WO patents?

Sorry for the extended questions, but I don't have basic knowledge of HTML and macros.
  #5  
10-15-2015, 05:58 AM
macropod (Windows 7 64bit, Office 2010 32bit)
Administrator

As I said in my previous reply, I've never done much web data retrieval, so I'm not familiar with how to get all the data you're after. I suggest looking on the web for code samples using the Browser.Document object (which is what the code I posted uses to define the HTMLDoc variable) and related methods, etc.
  #6  
11-01-2015, 03:26 PM
macropod (Windows 7 64bit, Office 2010 32bit)
Administrator

Cross-posted at: http://answers.microsoft.com/en-us/o...3-5bd69975ea31
For cross-posting etiquette, please read: http://www.excelguru.ca/content.php?184
  #7  
11-01-2015, 09:57 PM
PRA007 (Windows 7 32bit, Office 2010 32bit)

Apologies for not mentioning it. I generally do mention it when cross-posting.
This time I posted from home, which might be why I forgot.

I also cross-posted this question here, and did mention it there:

http://stackoverflow.com/questions/3...microsoft-word

I did not get an answer. I asked it a different way and still did not get an answer.

I got a suggestion from Cindy Meister [Microsoft MVP since 1996], which was similar to the suggestion you gave.
Quote:
This is the third time you've posted this question. It's probably not getting any answers because the core of your question has nothing to do with VBA. Word VBA doesn't know how to parse web pages - it's specialized for working with the Word application. In one of your previous attempts you had some code from a library that works with the HTML DOM. You need to pursue that route - figure out how to extract the data you need using that library. Once you get that, writing it to Word won't be a problem. Target tags for that technology, not Word, and stop spamming the word tags.
I will stop posting this question anywhere else now.
I have started learning and will work out the answer on my own, and will be sure to post it here.
  #8  
12-02-2015, 05:30 AM
PRA007 (Windows 7 64bit, Office 2010 32bit)

This is one step towards answering my question.

Code:
Sub USPTOAbstHTML1()
Application.ScreenUpdating = False
Dim Rng As Range, Tbl As table, StrTxt As String, HttpReq As Object, i As Long, oHtml As MSHTML.HTMLDocument, IE As SHDocVw.InternetExplorer
Set HttpReq = CreateObject("Microsoft.XMLHTTP")
Set oHtml = New HTMLDocument
Set IE = CreateObject("InternetExplorer.Application")
With ActiveDocument.Range
    For Each Tbl In .Tables
        With Tbl
            For i = 1 To .Rows.Count
                With .Cell(i, 2).Range
                            If .Hyperlinks.Count > 0 Then
                            MsgBox .Hyperlinks(1).Address
                                HttpReq.Open "GET", .Hyperlinks(1).Address, False
                                HttpReq.send
                                oHtml.body.innerHTML = HttpReq.responseText
                                MsgBox HttpReq.responseText
                                StrTxt = oHtml.getElementsByClassName("claim").Item.innerHTML
                                With IE
                                    .Visible = False
                                    .navigate "about:blank"
                                    .Document.body.innerHTML = StrTxt
                                    .Document.execCommand "SelectAll"
                                    .Document.execCommand "Copy"
                                End With
                                With Tbl.Cell(i, 5).Range
                                   Selection.PasteAndFormat (wdPasteDefault)
                                End With
                            End If
                            .Collapse wdCollapseEnd
                            .Find.Execute
                End With
            Next
        End With
    Next
End With
Set HttpReq = Nothing
Application.ScreenUpdating = True
End Sub
The problem with the above code is that it asks for permission to allow IE to navigate the page.
Can we allow Internet Explorer access by default so that the prompt doesn't pop up every time?

Last edited by PRA007; 12-02-2015 at 09:18 AM.
  #9  
12-02-2015, 09:30 AM
PRA007 (Windows 7 64bit, Office 2010 32bit)

In the above code, one problem is with Selection.PasteAndFormat (wdPasteDefault).

The pasted content doesn't go to the specified location but to whichever columns are selected.

How do I paste into a specific range in Word?
  #10  
12-02-2015, 10:49 PM
PRA007 (Windows 7 64bit, Office 2010 32bit)

This works perfectly:
Code:
Set rng = Tbl.Cell(i, 5).Range
rng.Collapse wdCollapseStart
rng.PasteAndFormat wdPasteDefault
As this has solved the question, the thread can be marked as solved.