Mark Gilbert's Blog

Science and technology, served light and fluffy.

Shell-shocked, but in a good way – Backing up blogs with Powershell

One of Scott Hanselman’s more recent blog posts was titled "The Computer Backup Rule of Three".  I was pleased to see that the backup solution I have in place at Casa del Gilbert meets his "Backup Rule of Three": I back my workstations up to my server nightly, and the server then pushes those files up to Carbonite in the cloud.  I also have a separate path for my media (copies of purchased digital music, LOTS of photos, etc.), which gets backed up twice a year to data DVDs and stored in my safe deposit box.  Despite all that, one of Scott’s bullets caught my eye.  Under "What should I do?" he had this:

"Don’t trust the cloud. I backup my gmail, too."

Doesn’t Google (or whatever cloud-based application providers you use) have this covered already?  The Scott-approved, paranoid answer would be, "Don’t assume that 1) they are doing that, 2) they know how to do it every time without fail, and 3) they won’t one day shutter their cloud-based doors suddenly, leaving you out in the cold.”  Why so paranoid?  "Simple. Because I care about my work, photos and data and I would be sad if I lost it."  (Hanselman)

That bullet point got me thinking.  I currently do not back up my Gmail, or any of my other email, but honestly, would I really be upset if I suddenly lost it?  Probably not.  My contact list is stored with the Google as well.  I could recreate that from my wife’s contact list, but it would be painful.  I’ll have to start backing that up regularly.

My blogs.

Dang.  I don’t have local copies of any of my blogs.  I could probably recreate all of the photos I’ve cropped and illustrations I’ve done, but that would be REALLY time-consuming.  I have to rectify that.

All of my blogs are hosted with WordPress, so I started poking around with the Export options to see if I could just get a raw dump of the text and images for every post.  When I log into the Dashboard and go to Tools/Export, I can generate an XML file that has all of the text of the posts, but it contains only links to the images, which are still hosted with WordPress.  If WordPress were to go completely offline, I wouldn’t be able to get them, so my backup solution needs to pull those images down out of the cloud.
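For reference, here is a heavily trimmed sketch of what one post looks like in that export file (the element names come from the WordPress WXR format; the values here are invented):

```xml
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Sample Post</title>
      <wp:post_date>2012-12-29 08:00:00</wp:post_date>
      <!-- The post body lives in content:encoded; images appear only as src links -->
      <content:encoded><![CDATA[Some text <img src="http://example.files.wordpress.com/sample.png" /> more text]]></content:encoded>
    </item>
  </channel>
</rss>
```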

Ok, so I need something that can parse the exported XML and pull down all of the images.  Ideally, that "something" could also split the posts up into separate folders, and dump each post’s images into the same folder.  At that point I’d have everything in one place, and could recreate every post from scratch if I had to.

Enter the Powershell.

A little aside

I think it was another Hanselman post years ago where I read that Powershell was a powerful tool for more than just system admins – programmers would be able to reap a lot of the same benefits from it, automating the tedious, error-prone tasks that we face.  Ever since then, I’ve had this on my list of things to look at, but it never got prioritized high enough until a couple of months ago.  That’s when we set up a new TeamCity build server at my company, and we made the decision to use Powershell to script out the deployment step.  That proved to be an immense win – Powershell was cleaner, easier, and far more powerful than NAnt (which we were using with our older CruiseControl.NET build server) for this task.

Then later, I used Powershell to automate a development task that I was getting tired of doing by hand.  Wiring that script into a Slickrun command made for more automated-awesomeness.

At that point, I was a true believer.  So, what tool did I turn to for parsing WordPress XML, automating the process of creating folders for each post, and pulling the images out of the cloud?

Powershell, of course.
After dropping the XML file into a folder of its own, I open a Powershell prompt, switch into the folder with my ParsePosts.ps1 script, and then execute it like so:

.\ParsePosts.ps1 {full_path_to_xml_file}

Let’s take a look at the script, piece by piece.  First, there is a small helper function that takes the blog post’s date and title, and constructs a unique folder name from it:

function CreateFolderName($DatePosted, $BlogTitle)
{
    $FormattedDate = Get-Date $DatePosted -format "yyyy-MM-dd"
    $CleanedUpTitle = $BlogTitle -replace '\W', ''
    return "$FormattedDate-$CleanedUpTitle"
}

The date is formatted as yyyy-MM-dd, and then the title, stripped of all non-alphanumeric characters, is appended.  Next, I set a flag that allows me to control how verbose the script’s messaging is during its run.  I can turn the messages off by setting this variable to $False:
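For example, with an invented date and title, the helper would behave roughly like this (the -replace strips everything except letters, digits, and underscores):

```powershell
# Hypothetical inputs, just to illustrate the output format
CreateFolderName "2012-12-29 08:00:00" "Shell-shocked, but in a good way"
# -> 2012-12-29-Shellshockedbutinagoodway
```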

$IsVerbose = $True

After that, I make sure that I got exactly 1 command line parameter, and if I didn’t, I write out an error message with the script’s usage:

If ($args.Length -ne 1)
{
    Write-Error "Usage: ParsePosts.ps1 {full_path_to_file_to_parse}"
    Exit 1
}

Next, I set two local variables – one with the file to be parsed, and the other with the directory that file resides in:

$FileToParse = Get-ChildItem $args[0]
If ($IsVerbose) { Write-Host $FileToParse.Name }

$BaseDirectory = $FileToParse.DirectoryName
If ($IsVerbose) { Write-Host $BaseDirectory }

As the comment says, this next block first verifies that the "atom" namespace has been added to the XML document:

# First, make sure the "atom" namespace has been included.  For some reason, the WordPress export utility
# doesn't do this, and without it the export file is not valid XML.
# (The namespace URIs below are the standard Atom and WXR values; adjust the wp export
# version number to match what your file declares.)
$RawFileContents = Get-Content $FileToParse.FullName | Out-String
If (-Not $RawFileContents.Contains("xmlns:atom=""http://www.w3.org/2005/Atom"""))
{
    $RawFileContents = $RawFileContents.ToString().Replace("xmlns:wp=""http://wordpress.org/export/1.2/""", "xmlns:wp=""http://wordpress.org/export/1.2/"" xmlns:atom=""http://www.w3.org/2005/Atom""")
}

$XmlContents = [xml]$RawFileContents

Powershell wouldn’t parse the XML document properly until that namespace was present, and for some reason WordPress wasn’t adding it.  So, I open the file raw, and cheat a little – if the file doesn’t contain the "atom" namespace declaration, I look for the "wp" namespace declaration, and replace it with both the "wp" and "atom" declarations.  After that, I open the file as an XML Document.

I found that the blog content was contained in a "content:encoded" tag, so I need to declare a "content" namespace for later use:

$ContentNamespace = New-Object Xml.XmlNamespaceManager $XmlContents.NameTable
# "content" maps to the standard RSS content-module namespace
$ContentNamespace.AddNamespace("content", "http://purl.org/rss/1.0/modules/content/")

Then, I set up a WebClient object, and a couple of regular expressions that I will need to parse through each blog post:

$ImageRequester = New-Object System.Net.WebClient
$ImageRegex = [regex]'src="(?<ImageUrl>.+?)"'
$FileRegex = [regex]'\/(?<FileName>[^\/]*\.png)|\/(?<FileName>[^\/]*\.jpg)'

All of this is basically setup needed for the real meat of the script:

$BlogPostNodeList = $XmlContents.SelectNodes("//channel/item")
Foreach ($CurrentBlogPostNode in $BlogPostNodeList)
{
    $BlogPostFolderName = CreateFolderName $CurrentBlogPostNode.post_date $CurrentBlogPostNode.Title
    $BlogPostFullFolderPath = Join-Path $BaseDirectory $BlogPostFolderName

    If (-Not (Test-Path $BlogPostFullFolderPath))
    {
        New-Item $BlogPostFullFolderPath -type directory
        If ($IsVerbose) { Write-Host "Created: $BlogPostFullFolderPath" }
    }

    # Write out the blog file contents to the file
    $BlogPostContent = ($CurrentBlogPostNode.SelectSingleNode("content:encoded", $ContentNamespace).get_InnerText())
    Set-Content (Join-Path $BlogPostFullFolderPath "Post.txt") $BlogPostContent
    If ($IsVerbose) { Write-Host "Content written to $BlogPostFullFolderPath" }

    # Pull down this blog post's images, if any
    $ImageRegex.Matches($BlogPostContent) | Foreach {
        $CurrentUrl = [string]$_.Groups["ImageUrl"]

        $LocalFileName = ""
        $FileRegex.Matches($CurrentUrl) | Foreach { $LocalFileName = $_.Groups["FileName"] }

        If ($LocalFileName -ne "")
        {
            $LocalImagePath = Join-Path $BlogPostFullFolderPath $LocalFileName
            $ImageRequester.DownloadFile($CurrentUrl, $LocalImagePath)

            If ($IsVerbose) { Write-Host "$CurrentUrl downloaded" }
        }
    }
}

I first find all of the channel/item blocks, each of which represents a single blog post.  The loop then breaks down as follows:

    1. Check to see if the folder for that blog post has already been created, and if not, create it.
    2. Break out the blog text itself, stored in the "content:encoded" tag, and save it to a Post.txt file in the blog post’s folder.
    3. Get a list of the src="…" attributes found in the blog post.  Those represent the images I’ve embedded in the post.  For each of those images, extract the full URL, use the $ImageRequester WebClient object to pull it down, and save it to the blog post’s folder.

Finally, let the user know the script has finished:

Write-Host "Done"

There is definitely room for improvement here.  One thing I’ve found that trips the script up is the occasional blog post with a "post_date" of "0000-00-00 00:00:00".  When that happens, I have to dig up the original blog post, find its publication date, and manually update the XML file to include it.  This is one of many "hardening" steps that could be added to make the script better at handling sloppy data.  And of course, the script could use another couple of rounds of refactoring as well.
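One possible guard (just a sketch, using an arbitrary fallback date I picked for illustration) would be to catch the zeroed-out date inside the loop, before calling the helper:

```powershell
# Sketch: substitute a recognizable fallback date when WordPress exports "0000-00-00 00:00:00"
$RawDate = $CurrentBlogPostNode.post_date
If ($RawDate -eq "0000-00-00 00:00:00") { $RawDate = "1900-01-01 00:00:00" }
$BlogPostFolderName = CreateFolderName $RawDate $CurrentBlogPostNode.Title
```

Any folders that come out prefixed with 1900-01-01 would then stand out as posts whose dates still need to be fixed by hand.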

My first official runs of the script seem to work well.  I pulled down all posts for all of my blogs, and even the one with the most posts took only a couple of minutes to parse.  I’ve set up a monthly task to pull down the last month’s worth of posts.

Many, many places on the internets contributed to this script, and I would be remiss if I failed to mention them – posts on parsing XML, selecting XML nodes that include namespaces, writing functions, formatting dates, writing to files, and extracting the images from the content.

Thanks to all!

December 29, 2012 | Powershell