Saturday, January 12, 2008

PowerShell Script to Get Text From an OpenDocument ODT File

I've been putting together some general help snippets for a collection of PowerShell scripts that I've been writing. I wanted the information to be available from the command line with "command -h", and I wanted the same information available in a standard document format: Word, OpenDocument (ODT), or PDF. I've messed around a little bit with ODT files and thought that might be the way to go. ODT files are essentially zip files with XML files (and pictures) inside that define the contents of the document. So I thought, why not see if I can define the help information in an ODT document and have a script that actually gets the info from the document? It seems a bit complicated at first glance, but it's really not bad at all.

So in this post I'll just show how to get the contents of an ODT file into PowerShell...

Step One - Get the contents of the "content.xml" file
The content.xml file has all of the text for the document; it's in the root of the ODT archive. I chose to use the 7-Zip program (www.7-zip.org) to extract the file. This is done like so:

#get the contents of the odt file
$res = ."c:\program files\7-zip\7z.exe" e $ODTfile content.xml #extracts only the content.xml from the archive to the current directory
$content = Get-Content content.xml
remove-item content.xml
#modified content: concat is my own script (not a built-in); it joins the lines with a space
$mc = concat $content " "

The above snippet extracts the content.xml file and loads its contents into the variable $content. I then use a concatenation script I wrote to join all of the lines into a single string, $mc. This makes the searching we need to do a little easier and cleaner.
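If you don't have a concat script handy, the built-in -join operator should do the same job. This is only a sketch: it assumes PowerShell 2.0 or later (where -join exists), and the sample lines are made up to stand in for what Get-Content returns.

```powershell
# Stand-in for the array of lines that Get-Content returns for content.xml
$content = @('<office:text>', '<text:p>Hello</text:p>', '</office:text>')

# -join glues the array elements together with the given separator
$mc = $content -join ' '

# [string]::Join is the .NET equivalent and gives the same result
$mc2 = [string]::Join(' ', $content)
```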

Step Two - Define some regular expressions so we can identify xml tags
We now have a whole lot of XML in $mc and want to process it (I use that term loosely) a little bit. There are only a few elements that we are really interested in to get some base functionality. So let's define our regular expressions...

#regular expressions for identifying relevant xml tags
$rpar = New-Object -typename System.Text.RegularExpressions.Regex("<text:[ph][^<>]*>") #a paragraph or heading line
$rtab = New-Object -typename System.Text.RegularExpressions.Regex("<text:tab[^<>]*>") #a tab character
$rtag = New-Object -typename System.Text.RegularExpressions.Regex("<[^<>]+>") #any other xml tag
$rspace = New-Object -typename System.Text.RegularExpressions.Regex("<text:s text:c[^<>]*>") #a number of spaces in a row
$rint = New-Object -typename System.Text.RegularExpressions.Regex("\d+") #an integer
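To get a feel for what these patterns pick up, here is a small sketch that runs them against a made-up fragment of content.xml. The sample string is hypothetical, the patterns are built with the [regex] type accelerator instead of New-Object, and the paragraph pattern's character class is written as [ph].

```powershell
# Hypothetical fragment: one heading, one paragraph with a tab and a run of 3 spaces
$sample = '<text:h text:outline-level="1">Title</text:h>' +
          '<text:p text:style-name="P1">a<text:tab/>b<text:s text:c="3"/>c</text:p>'

$rpar   = [regex]'<text:[ph][^<>]*>'      # opening paragraph or heading tag
$rtab   = [regex]'<text:tab[^<>]*>'       # tab tag
$rspace = [regex]'<text:s text:c[^<>]*>'  # run-of-spaces tag
$rint   = [regex]'\d+'                    # an integer

$parCount = $rpar.Matches($sample).Count    # opening <text:h> and <text:p> only
$tabCount = $rtab.Matches($sample).Count
$spaceTag = $rspace.Match($sample).Value    # the whole space tag
$spaceNum = $rint.Match($spaceTag).Value    # the count inside it, as a string
```

Closing tags such as </text:p> begin with "</", so none of the first four patterns touch them; they are left for the catch-all $rtag pattern.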

Process the tags
#process paragraphs
$rpar.matches($mc) | foreach{$mc = $mc.replace($_.value,"`r`n")}
#process tabs
$rtab.matches($mc) | foreach{$mc = $mc.replace($_.value,"`t")}
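The same substitutions can also be done in one pass each with the Regex object's own Replace method, instead of looping over the matches. A minimal sketch on a made-up string (the sample and the [regex] shorthand are my assumptions, not part of the script above):

```powershell
# Hypothetical fragment with one paragraph tag and one tab tag
$mc = 'x<text:p text:style-name="P1">one<text:tab/>two</text:p>'

$rpar = [regex]'<text:[ph][^<>]*>'
$rtab = [regex]'<text:tab[^<>]*>'

# Regex.Replace swaps every match of the pattern in a single call
$mc = $rpar.Replace($mc, "`r`n")   # paragraph/heading start becomes a new line
$mc = $rtab.Replace($mc, "`t")     # tab tag becomes a tab character
```

The loop version calls String.Replace once per match, and each call already replaces every identical tag in the string, so the one-call form should give the same result with less work.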

Spaces are a little trickier to handle. Multiple spaces in a row are stored as a tag that looks like <text:s text:c="4"/>. So we need to search for the tags, find out how many spaces each one stands for, and create a string with that many spaces. Then we need to replace the XML tags with the strings of spaces...

#process spaces
$spaceCount = New-Object System.Collections.ArrayList
$spaces = New-Object System.Collections.ArrayList
#match the xml for the space tags
$m_spaces = $rspace.matches($mc)

if ($m_spaces.Count -gt 0) {
    #get the number of spaces for each match
    $m_spaces | foreach{
        $result = $spaceCount.add(($rint.match($_.value)).value)
    }
    #create strings with the correct number of spaces
    for ($i = 0;$i -lt $m_spaces.Count;$i++) {
        $result = $spaces.add(("").padleft([int]$spaceCount[$i]))
    }
    #replace the xml space tag with the string of spaces
    for ($i = 0;$i -lt $m_spaces.Count;$i++) {
        $mc = $mc.Replace($m_spaces[$i].value,$spaces[$i])
    }
}
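The three loops above could probably be collapsed into a single Regex.Replace call that computes each replacement on the fly. This is a sketch under a couple of assumptions: the sample string is made up, the pattern is my own variation, and I'm assuming a bare <text:s/> with no text:c attribute stands for a single space (the $rspace pattern above does not match that form, so it would get stripped with the other tags).

```powershell
# Hypothetical fragment: a run of 4 spaces and a single-space tag
$sample = 'a<text:s text:c="4"/>b<text:s/>c'

# Group 1 captures the count when the text:c attribute is present
$pattern = '<text:s(?:\s+text:c="(\d+)")?\s*/?>'

# A script block cast to MatchEvaluator is called once per match
$expand = [System.Text.RegularExpressions.MatchEvaluator]{
    param($m)
    $count = 1   # assumed default when text:c is missing
    if ($m.Groups[1].Success) { $count = [int]$m.Groups[1].Value }
    ' ' * $count # string of $count spaces
}

$result = [regex]::Replace($sample, $pattern, $expand)
```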

Clean up a little more and return the modified string

#strip remaining xml tags
$rtag.Matches($mc) | foreach{$mc = $mc.replace($_.value,"")}

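#decode xml entities (&amp; goes last so text that was already escaped is not decoded twice)
$mc = $mc.Replace("&gt;",">")
$mc = $mc.Replace("&lt;","<")
$mc = $mc.Replace("&apos;","'")
$mc = $mc.Replace("&quot;",'"')
$mc = $mc.Replace("&amp;","&")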

return $mc
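One detail worth spelling out about the entity clean-up: XML has five predefined entities (&lt;, &gt;, &apos;, &quot; and &amp;), and the order of decoding matters. A small sketch with a made-up string, decoding &amp; last:

```powershell
# Hypothetical text as it would appear inside content.xml.
# The last part, &amp;lt;, represents the literal characters "&lt;" in the document.
$raw = 'Tom &amp; Jerry say &quot;x &lt; y&quot; and type &amp;lt; literally'

# Decode &amp; last; decoding it first would turn &amp;lt; into &lt; and then into <
$text = $raw.Replace('&lt;', '<').Replace('&gt;', '>').Replace('&apos;', "'").Replace('&quot;', '"').Replace('&amp;', '&')
```

With &amp; handled last, the escaped "&lt;" that the document author actually typed survives as "&lt;" instead of collapsing into "<".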

Left to do
A lot. Some things that would be nice to add...
  • Ability to handle numbered and bulleted lists - currently you get the text next to the number or bullet, but not the number or bullet
  • Tables
  • Make headings a different color?
  • A write-OdtText script would be nice, and an interesting little challenge
If you have any comments, suggestions, or improvements, please let me know. I'm still a relative novice with regular expressions. I was amazed at how little code it took to do this.

-bc
