#1
Converting the headers & footers into body
Hello,
I am converting PDFs into docx format via Adobe Acrobat in an automated way with Python. I then read the data from the Word file to do some data analysis. The conversion is almost perfect; my problem is that some parts get converted into headers and footers (whether they should or should not be is not important). Headers and footers are not searchable with the methods I am using. My question is: is there a way in Word to convert all current headers and footers into body text? Manual Word settings, VBA code, or any other kind of solution would be appreciated.
#2
There is nothing built into Word.
Although simple in concept, headers and footers can be very complex. Each section in Word has three headers and three footers, all independent of one another, and each may or may not be activated by section or document settings. A single page can contain multiple sections. Here is my Header/Footer Settings Recap.
A simple suggestion for manual work: open the document in Word, edit the header/footer, copy its content, and paste it into the body of your document on one page.
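To see which of those header/footer slots a given section actually uses, one option is to look at the section properties inside the .docx package. The following is only a sketch: it assumes you have already pulled the text of `word/document.xml` out of the file, and `list_header_footer_refs` and `SAMPLE` are made-up names for illustration. In WordprocessingML each `w:sectPr` can carry `w:headerReference` / `w:footerReference` elements whose `w:type` is `default`, `first`, or `even`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def list_header_footer_refs(document_xml):
    """Return (section number, 'header'/'footer', type, relationship id) tuples."""
    root = ET.fromstring(document_xml)
    wanted = (f"{{{W}}}headerReference", f"{{{W}}}footerReference")
    refs = []
    # sectPr elements appear in document order, one per section.
    for number, sect in enumerate(root.iter(f"{{{W}}}sectPr"), start=1):
        for child in sect:
            if child.tag in wanted:
                kind = child.tag.split("}")[1].replace("Reference", "")
                refs.append(
                    (number, kind, child.get(f"{{{W}}}type"), child.get(f"{{{R}}}id"))
                )
    return refs


# Hypothetical two-section document: section 1 has a default and a first-page
# header, section 2 has only a default footer.
SAMPLE = (
    f'<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body>'
    '<w:p><w:pPr><w:sectPr>'
    '<w:headerReference w:type="default" r:id="rId1"/>'
    '<w:headerReference w:type="first" r:id="rId2"/>'
    '</w:sectPr></w:pPr></w:p>'
    '<w:sectPr><w:footerReference w:type="default" r:id="rId3"/></w:sectPr>'
    '</w:body></w:document>'
)

for ref in list_header_footer_refs(SAMPLE):
    print(ref)
```

A section with no reference of a given type normally inherits it from the previous section, so an empty result for a section does not necessarily mean that nothing is displayed there.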
#3
|
||||
|
||||
In my experience, Acrobat conversions to Word typically put 'header' and 'footer' information into the body of the document anyway. This is because the structure of an Acrobat document doesn't usually contain a physical header or footer; it is just more unstructured content on the page.
Perhaps you could explore the options of your PDF-to-DOCX conversion to see why content is being sent to a header/footer. I'm also unsure why you think it is a good idea to convert to Word when your data analysis could also be done by processing the PDF file. Why change formats when Acrobat is also programmable?
__________________
Andrew Lockton Chrysalis Design, Melbourne Australia
#4
I need to collect the main headers from the document, convert it to Word, and add bookmarks to those headers automatically. For this purpose it is better to do the mining in the Word format instead of the PDF, as the data is much more structured and the additional information about the text that I need (color, font, size, etc.) is easier to get at. Unfortunately, there are no settings in Adobe Acrobat that change how headers are handled. Only some very basic options are available: Include Comments, include Images, Recognize text where needed and Retain Flowing Text / Retain Page Layout.
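As an illustration of the kind of per-run information meant here, the sketch below reads font, size, and color from a WordprocessingML fragment using only the Python standard library. It is an assumption-laden example rather than the actual pipeline: `run_formatting` and `SAMPLE` are made-up names, and it only sees direct formatting on the run (`w:rFonts`, `w:sz` in half-points, `w:color`), not values inherited from styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Qualify a local tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def run_formatting(xml_text):
    """Return one dict per run with its text and directly applied font, size, color."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.iter(q("r")):
        text = "".join(t.text or "" for t in run.iter(q("t")))
        font = size_pt = color = None
        props = run.find(q("rPr"))
        if props is not None:
            fonts = props.find(q("rFonts"))
            size = props.find(q("sz"))
            colour = props.find(q("color"))
            if fonts is not None:
                font = fonts.get(q("ascii"))
            if size is not None:
                size_pt = int(size.get(q("val"))) / 2  # w:sz is stored in half-points
            if colour is not None:
                color = colour.get(q("val"))
        runs.append({"text": text, "font": font, "size_pt": size_pt, "color": color})
    return runs


# Hypothetical paragraph: one formatted run followed by one unformatted run.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:rPr><w:rFonts w:ascii="Arial"/><w:color w:val="FF0000"/>'
    '<w:sz w:val="32"/></w:rPr><w:t>Summary</w:t></w:r>'
    '<w:r><w:t> of results</w:t></w:r>'
    '</w:p>'
)

for info in run_formatting(SAMPLE):
    print(info)
```

The same function works on header and footer parts, since they use the same run markup as the body.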
#5
Are you opening the PDFs in Word? Or are you using some other conversion method?
#6
I am converting PDFs into docx files with Adobe Acrobat.
#7
Try opening directly in Word and using Word to convert one.
#8
|
||||
|
||||
I don't know how to convert VBA code to Python, but this should give you an idea of how you could copy the headers and footers into the body of the document.
Code:
Sub HFExtractor()
  Dim aSect As Section, aHF As HeaderFooter, aRng As Range
  For Each aSect In ActiveDocument.Sections
    ' Copy each header of this section to the end of the section's body
    For Each aHF In aSect.Headers
      Set aRng = aSect.Range
      aRng.Collapse Direction:=wdCollapseEnd
      aRng.FormattedText = aHF.Range.FormattedText
    Next aHF
    ' Then do the same for each footer
    For Each aHF In aSect.Footers
      Set aRng = aSect.Range
      aRng.Collapse Direction:=wdCollapseEnd
      aRng.FormattedText = aHF.Range.FormattedText
    Next aHF
  Next aSect
End Sub
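On the Python side, an alternative that leaves the document untouched is to read the header and footer parts straight out of the .docx package, which makes their text searchable without moving it into the body. This is not a translation of the macro above, just a standard-library sketch. It assumes the parts are named `word/headerN.xml` and `word/footerN.xml`, which is the usual convention but not guaranteed for every producer, and `header_footer_text` and `build_demo_docx` are made-up names (the latter is only a stand-in for a real converted file).

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumed naming convention for header/footer parts inside the package.
PART_PATTERN = re.compile(r"word/(header|footer)\d*\.xml")


def header_footer_text(docx_file):
    """Map each header/footer part name to a list of its paragraph texts."""
    result = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in sorted(archive.namelist()):
            if not PART_PATTERN.fullmatch(name):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(f"{{{W}}}p"):
                # A paragraph's text is spread over the w:t elements of its runs.
                paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
            result[name] = paragraphs
    return result


def build_demo_docx():
    """Hypothetical stand-in for a converted file: a ZIP with a single header part."""
    header = (
        f'<w:hdr xmlns:w="{W}"><w:p><w:r><w:t>Chapter 1</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> Overview</w:t></w:r></w:p></w:hdr>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/header1.xml", header)
        archive.writestr("word/document.xml", f'<w:document xmlns:w="{W}"/>')
    buffer.seek(0)
    return buffer


print(header_footer_text(build_demo_docx()))
# prints {'word/header1.xml': ['Chapter 1 Overview']}
```

With a real file you would pass its path instead, for example `header_footer_text("converted.docx")`, where the file name is a placeholder.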
__________________
Andrew Lockton Chrysalis Design, Melbourne Australia
#9
This might be a stupid idea, as I am unfamiliar with how Word works, but could we, for example, get the x,y coordinates of these paragraphs and insert the text "above them"? Or do something similar, so that we keep the position as close as possible?
#10
|
||||
|
||||
The way Word works does not lend itself to extracting a header and placing it in the page content. Word is made up of containers for content, and in a well-designed Word document the SAME header can appear on hundreds of pages with only minor differences (such as the page number incrementing or StyleRef fields updating). I can't see what requirement would make you need to extract this information and repeat it over and over.
The 'page' in Word is not a fixed structure. The formatting and content in the body of the document, constrained by the page size and setup, determine what appears on each page. Adding content earlier pushes subsequent content down the page and potentially onto the next page. The headers/footers are associated with a section, which might contain one paragraph, lots of paragraphs, or the entire document, but they don't directly relate to 'pages' at all.
Perhaps you need to post a sample document and describe why you think this is necessary. If we could understand what it is you need to achieve, I would think we could suggest a better way of achieving it.
I'm still thinking that Word is not the best tool for this. Its concept of flowing content doesn't fit with the static pages that headers align with. I just watched a video on using LibreOffice Draw to open and edit a PDF. I think this might be a better fit for your requirements, as it appears to be scriptable with Python and should retain the page content more accurately.
__________________
Andrew Lockton Chrysalis Design, Melbourne Australia
Last edited by Guessed; 06-09-2021 at 04:42 AM.