It’s January, time to bundle up! TeamCity, PowerShell, and SharpZipLib

Late last year we deployed a new TeamCity build server at Biggs|Gilmore, and part of that process involves building a ZIP file that contains the files and folders to be deployed in that release.  We decided to use PowerShell to script out this piece of the process, and it has worked out tremendously well for us.

When it came time to actually create the ZIP file, we found that the PowerShell Community Extensions had a cmdlet called "Write-Zip", as well as the companion cmdlet "Expand-Archive".  We got that installed, and found its usage to be very easy – just pipe in an arbitrary list of files to be zipped up.  Things seemed to be going well.

However, shortly after we started using the new TeamCity server for production releases, we noticed that about half the time we did a build, a small number of files (usually 1-3) would just not get deployed.  Those files were being included in the build server's ZIPs, but they simply weren't being extracted.  When I tried to manually extract the contents to my desktop using 7-Zip, it would report that the problem files were actually corrupt (usually with a cyclic redundancy check error).  The source files in Subversion were fine, but once they were run through the build server, they became corrupt.  Initially I thought it might have been caused by the overall size of the ZIP files being built, or the folder depth that the files sat in, but there were other ZIPs that were larger or had a deeper folder structure, and were just fine.

I found precious little in the way of error reports about corrupt files coming out of Write-Zip.  Last week I decided to drop Write-Zip, and find a different way to build ZIPs.

I started by going back to our older build server – CruiseControl.NET.  We were using NAnt there to script out the package-and-deployment step, specifically the <zip> task.  Under the covers, I found that <zip> uses SharpZipLib, a compression library written in C#.  I looked into how I could use that library within PowerShell to create the archives.

I came across a blog post by John Marquez titled "PowerShell SharpZipLib Script".  In it, John shows how to wrap SharpZipLib in a custom function that can take as input a list of files to be compressed.  This seemed perfect.  When I tried it locally, though, I was plagued with errors like these:

Exception calling "CommitUpdate" with "0" argument(s): "Access to the path ‘C:\Users\mgilbert\Desktop\MyProject\bin\en-CA’ is denied."

The errors were coming from the .commitUpdate() method on the ICSharpCode.SharpZipLib.Zip.ZipFile object.  John said his example was built using SharpZipLib 0.85.5.  I had downloaded a slightly updated version of the library (v0.86.0), so I tried grabbing the exact version he had used.  That one didn’t build an archive at all – no errors, but no ZIP files either.

I opened SharpZipLib up in Telerik's JustDecompile, and found that there were a few other ways to create ZIP files, but none that looked like they could update an existing archive with new files – something that I would need to do if I wanted to pipe in files one by one.  My other option (which is basically how we were using NAnt's <zip> task) was to bundle up an entire folder and its files and subfolders.  Since most of the deployments I do are incremental builds, this would mean I would shuffle all of the files I needed to include into a separate "corral" folder, bundle that up, and then clear it out (readying it for the next build).  Not a horrible solution, especially if it worked.

So, I reworked Marquez’s "zip" function to take an arbitrary list of files, copy them to a separate folder, bundle that folder up into a ZIP, and then clear the corral out again.  The actual compression of the corral folder’s contents would be done using the FastZip class in the SharpZipLib library.  Here is what I ended up with:

    [System.Reflection.Assembly]::LoadWithPartialName("ICSharpCode.SharpZipLib")

    function BiggsZip($zipFile, $CorralFolder, $LeadingFolderPathToRemove)
    {
        begin
        {
            # Start with a clean corral folder
            if (Test-Path $CorralFolder) {
                Remove-Item $CorralFolder -Recurse -Force
            }
        }
        process
        {
            # Preserve each item's path relative to $LeadingFolderPathToRemove
            $RelativeFile = $_.FullName.Remove(0, $LeadingFolderPathToRemove.Length)
            if(-not $_.PSIsContainer) {
                # "Touch" the destination file so the folder structure exists before the copy
                New-Item -ItemType File -Path (Join-Path $CorralFolder $RelativeFile) -Force
            }
            Copy-Item $_.FullName (Join-Path $CorralFolder $RelativeFile) -Force
        }
        end
        {
            # Compress the corral folder (recursively) into the ZIP, then clean it up
            $zip = New-Object ICSharpCode.SharpZipLib.Zip.FastZip
            $zip.CreateZip($zipFile, $CorralFolder, $true, "")
            Remove-Item $CorralFolder -Recurse -Force
        }
    }

This is designed to have a list of files piped in using something like Get-ChildItem.  For example:

Get-ChildItem "C:\MyProject\trunk\wwwroot" -recurse | BiggsZip "MyArchive.zip" "c:\Users\mgilbert\Desktop\Corral" "C:\MyProject\trunk\wwwroot"

For this function (which I eventually installed as a PowerShell module), the SharpZipLib library was registered in the GAC on the TeamCity server.  The function has three components, denoted by the keywords "begin", "process", and "end".  I’ll describe these in turn.

 

begin
This is executed once at the beginning, before it starts processing the file list.  In this case, if it finds that $CorralFolder currently exists, it deletes it.

process
This is executed once for each file in the set that is piped into the function.  The archive should preserve the files’ paths relative to the web root.  The web root is passed into the function as the $LeadingFolderPathToRemove variable.  So, the "process" step copies the current file from its present location to the same relative path below CorralFolder.

The "New-Item" command is designed to get around a rather annoying issue with Copy-Item.  I found that Copy-Item by itself will happily copy an entire folder with structure below it to a new destination, and creates that folder and structure if it doesn’t exist.  It is also fine copying individual files to a folder that already exists.  However it will not copy a file into a folder that doesn’t already exist, even if I use the -Force switch.

I found this post on StackOverflow that confirmed this limitation of Copy-Item, and offered two possible ways to get around it.  I opted for option 2, which was to "touch" the destination file using New-Item (in the Unix sense of the word "touch").  New-Item does not have the same limitation that Copy-Item does, so it is perfectly content to create a new, 0-byte file, and will create the folder structure above it if it doesn't exist.  After that, I can just copy the real file over top of that one.
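
To make the workaround concrete, here is a minimal, stand-alone sketch of the same pattern; the file and folder paths are hypothetical:

    $Source      = "C:\MyProject\trunk\wwwroot\bin\MyAssembly.dll"   # hypothetical source file
    $Destination = "C:\Corral\bin\MyAssembly.dll"                    # parent folder doesn't exist yet

    # This alone fails when C:\Corral\bin doesn't exist, even with -Force:
    # Copy-Item $Source $Destination -Force

    # "Touch" the destination first; New-Item -Force creates the missing folders and a 0-byte file...
    New-Item -ItemType File -Path $Destination -Force | Out-Null

    # ...and now the real file can be copied over top of it.
    Copy-Item $Source $Destination -Force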

The check that surrounds New-Item, "if(-not $_.PSIsContainer)", translates to "if the current item is not a folder".  I found that trying to "touch" a folder resulted in an error when the function later went to do the copy:

Copy-Item : Container cannot be copied onto existing leaf item.

I realized later what was probably happening: when New-Item came across a directory name, it would create it as a file – that is, a "leaf item".  Then, when Copy-Item tried to copy the directory over it, it would fail.  Since I could already use Copy-Item to create new folders, I limit the use of New-Item to just creating files.

end
This is executed once at the end, after the function has processed the entire file list.  At this point, all of the files to be included in the release are now in the CorralFolder, so the only thing left to do is create the actual ZIP file.  It does this, and then removes the CorralFolder.

 

That builds the ZIP file beautifully.  I then used John’s "unzip" function as is (other than renaming it) to extract it:

    function BiggsUnzip($zipFile, $unzipToFolder)
    {
        # Extract the entire archive; the third argument is a file filter (none here)
        $zip = New-Object ICSharpCode.SharpZipLib.Zip.FastZip
        $zip.ExtractZip($zipFile, $unzipToFolder, $null)
    }
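
Usage mirrors the zip side.  For example (the paths here are hypothetical):

    BiggsUnzip "C:\Releases\MyArchive.zip" "C:\inetpub\wwwroot\MySite"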

I wrapped these two functions up into a BiggsZip.psm1 module file, making both functions publicly accessible (note the "Export-ModuleMember" call at the very end):

    [System.Reflection.Assembly]::LoadWithPartialName("ICSharpCode.SharpZipLib")

    function BiggsZip($zipFile, $CorralFolder, $LeadingFolderPathToRemove)
    {
    …
    }

    function BiggsUnzip($zipFile, $unzipToFolder)
    {
      …
    }

    Export-ModuleMember -Function *

I then dropped it into a new BiggsZip folder under %windir%\System32\WindowsPowerShell\v1.0\Modules on the TeamCity server.  When I started a new PowerShell session, all I needed to do to use this library was run this first:

Import-Module BiggsZip

And then use the BiggsZip and BiggsUnzip functions like I was using Write-Zip and Expand-Archive previously.  For the process of creating the PowerShell module file, and getting it installed on the server, I consulted several posts and articles online.

The new solution has only been in place for a few days, but every one of my test archives has come through with flying colors – not a single corrupted file.  This took me three days of off-and-on troubleshooting and tinkering, but I think the result (trustworthy builds, and further experience with PowerShell) was absolutely worth it.

How I got my groove back – Music Files, Playlists, and the Sansa Clip

Until a couple of months ago, I had only really been using my MP3 player, a Sansa Clip, to listen to music while I was at work, but then I started finding other uses for it.  For example, I can connect it as an input to my guitar amp, and then play along with whatever song I cue up.  I also found myself plugging it in at home, finding it far easier to use than Windows Media Player (WMP).

WMP works fine for playing music, but managing my collection is another matter.  I'd drop a new MP3 into a folder, and then fight for 15 minutes with WMP to get it to actually recognize it.  Sometimes it would appear under "Songs" but not "Albums".  Sometimes I'd drag it into a playlist, only to have it get duplicated.  Sometimes the file wouldn't sync to my player at all: no errors, but no transferring bits either.  These are probably just cases of me not doing it the "WMP way", but whatever that is, it isn't intuitive.

The more I thought about it, the more I realized that the three most common things I was still using WMP for were:

  1. Ripping CDs.
  2. Syncing music to the Sansa Clip.
  3. Burning podcasts onto CD so I can listen to them in my car.

I haven’t ripped a CD in months because I’ve been buying all my recent music online.  Burning podcasts onto CD is actually very painless in WMP, so I will probably continue using it for that.

But syncing?  Could I manage the music on the player directly?  Plugging the player into a USB port registers it as another storage device, available in Windows Explorer.  Could I just drag music onto it?  The short answer is "yes", but to really make this useful, I’d need to do a few more things:

  1. Reorganize the media files to clean up where Windows Media Player originally dropped them.
  2. Edit the media tags on the files so that Artist and Song Titles are accurate and simple.
  3. Maximize the number of songs I could fit onto the player by converting everything to MP3 format.
  4. Organize them into separate playlists to suit my mood at any given moment.
  5. Sit back and enjoy the sweet sounds of victory.

Reorganize the media files

Most of my digital collection was actually ripped from my CD collection using Windows Media Player, which organizes it into a folder structure that looks like this:

Folder Structure

I’m really only interested in Artist and Song Title.  If I’m in the mood for John Williams, for example, I want to hear all of his work – I don’t care if it came from the "The Spielberg/Williams Collaboration", "Harry Potter and The Sorcerer’s Stone: Soundtrack", or one of the Star Wars albums I own.  I just want to hear the music of John Williams.  So, I decided to flatten the music by removing the Album level:

Folder Structure Flattened

Next, the track numbers that prefaced the song titles were making me twitch, so I removed them:

Folder Structure No Track Numbers
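
For illustration, here is a minimal sketch of how this flattening and track-number stripping could be scripted, assuming the WMP-style Artist\Album layout shown above (the music root path is hypothetical):

# Illustration only: assumes every music file sits exactly two levels down
# (Artist\Album\file) under a hypothetical music root.
$MusicRoot = "C:\Users\Mark\Desktop\Music"

Get-ChildItem $MusicRoot -Recurse -Include *.mp3, *.wma | ForEach-Object {
    $ArtistFolder = $_.Directory.Parent.FullName          # one level up from the album folder
    $NewName      = $_.Name -replace '^\d+[\s.-]*', ''     # strip a leading track number, if any
    Move-Item $_.FullName (Join-Path $ArtistFolder $NewName)
}

# Remove the now-empty album folders
Get-ChildItem $MusicRoot -Recurse |
    Where-Object { $_.PSIsContainer -and -not (Get-ChildItem $_.FullName) } |
    Remove-Item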

The next step was to resolve all of the "Unknown Artist", "Various Artists", and other folders that had been created over time, and move those music files into folders with a real artist name.  Some of these became obvious just from the name of the song – "Takin’ Care of Business" by Bachman-Turner Overdrive, for example.  Some of these, especially the classical pieces like "Violin Concerto No. 1", took a little more work to track down.  A lot of these required me to look at the media tags attached to the file, which we’ll address next.

Edit the media tags

Each audio file has a series of tags such as Artist, Album, Song Title, Track #, etc.  I originally used these to help reorganize the music into their proper artist folders, but many of these needed to be cleaned up themselves.  Why?  Because my Sansa Clip organizes the music by these tags.  Putting the files in a folder in Windows Explorer called "Hans Zimmer" wouldn't be enough – the song's Artist media tag would need to reflect that name.

Originally I thought I needed an application to allow me to modify these, but I discovered that Windows Explorer can do it.  When you select a music file in Windows Explorer, the window shows a series of controls at the bottom:

Media Tag Controls

All you have to do to change these is click the tag you want to edit, type over it, and hit Enter:

Media Tags - Editing

So, my first task was going through and cleaning up the "Contributing Artists", "Album artist", and "Title" for each of my music files.  After updating a few, I realized how tedious this was going to be.  I don't have an enormous digital music collection, but it's large enough that I figured I could write something to automate the process faster than I could finish doing it manually.

So I did.

I had already organized each music file into a folder named after the artist responsible, and had renamed the files themselves to clean up the song titles (several songs were named things like "Satisfied* [bonus tracks].mp3", so I cleaned them up to just be "Satisfied.mp3").  What if I could write a PowerShell script (the shiny new tool in my development toolbox) to rework the media tags for each file based on this information?

After consulting my good friend, Google, I found people here and here were already managing media tags from PowerShell.  Using TagLib# (available from GitHub: https://github.com/mono/taglib-sharp), it was very easy to walk through my entire music collection, updating media tags as I went:

[Reflection.Assembly]::LoadFrom( (Resolve-Path ".\taglib-sharp.dll") )

$BaseMusicPath = "C:\Users\Mark\Desktop\Music"

Get-ChildItem -Path $BaseMusicPath -Filter "*.mp3" -Recurse | ForEach-Object {
    Write-Host "Processing:" $_.FullName
    $CurrentMediaFile = [TagLib.File]::Create($_.FullName)

    # Set the song title to the file name
    $CurrentMediaFile.Tag.Title = $_.Name

    # Make the AlbumArtists match the Artists (contributing artists)
    $CurrentMediaFile.Tag.AlbumArtists = $CurrentMediaFile.Tag.Artists

    # Save the updated tags back into the file
    $CurrentMediaFile.Save()
}

The script looks through my music folders recursively for every MP3, opens it, sets the "Title" media tag to the file name and the "AlbumArtists" media tag to the "Artists" tag.  The latter corresponds to the "Contributing Artists" tag that appears in Windows Explorer.

The script worked like a charm.  It ran through my entire collection in a matter of seconds, and took me about half an hour to piece it together.  Overall, I estimate it saved me at least an hour of drudgery, and gave me a great excuse to do something in Powershell.

 

Maximize the number of songs

I still had a mix of WMA and MP3 files at this point.  In the course of updating the media tags, I noticed there was a pretty large gap between the average file size of a WMA file and the average file size of an MP3 – WMAs were much larger than the MP3s.  I found a free converter from KoyoteSoft that could process my entire music collection in batch – converting all WMA files to MP3 in place.  I didn’t think to capture before and after totals, but the size savings was tremendous: 30% smaller files were very common.
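
For illustration, here is one quick way those before-and-after totals could have been captured, assuming the same hypothetical music root used in the scripts above:

# Sketch only: summarize file counts and total size per extension
Get-ChildItem "C:\Users\Mark\Desktop\Music" -Recurse -Include *.mp3, *.wma |
    Group-Object Extension |
    ForEach-Object {
        New-Object PSObject -Property @{
            Extension = $_.Name
            Files     = $_.Count
            TotalMB   = [math]::Round(($_.Group | Measure-Object Length -Sum).Sum / 1MB, 1)
        }
    }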

I actually put the media tag editing on pause to convert everything over to MP3s.  That is why the PowerShell script above only handles MP3s.  By the time I got around to writing it, EVERYTHING was an MP3.

 

Organize them into Playlists

The next, and what ended up being the biggest challenge, was figuring out how I could create my own playlists.  To be fair, I had not tried this with the Sansa Clip before.  What got me thinking about it was that there was a "Playlists" option on the Clip, hinting that it was supported and that I only had to figure out how to do it.

My good friend, Google, turned out to be a good start down this path.  I found this post on the Sansa Clip forums that pointed to a couple of possible paths:

  1. If I browsed to the folder on the Clip in Windows Explorer, and right clicked on a folder or music file, I had an option for "Create Playlist".  I tried selecting multiple folders and created a playlist from them.  That dropped a .PLA file in the folder, and the player seemed to like it.  The weird thing was that this file was 0 bytes long.  Examining the file properties (again through Windows Explorer) revealed a tab called "References" that listed out all of the songs I just dropped in.  That tab would allow me to remove songs, or reorder them, but there did not appear to be any way to add new ones to an existing playlist.  If I added a new song, I’d have to reselect all of the other songs AND the new one to effectively update the playlist.  That would become unwieldy fast.
  2. The other option I found in this forum post talked about the M3U playlist file format.  This was billed as a simple text file format, which seemed much more likely to be manageable going forward.

I ended up consulting several other internet destinations to figure out what this file needed to look like, and how to get it to work on the Clip.

In addition to these posts, I did a fair amount of my own experimentation to figure out the following procedure:

  1. Create a Windows 1252 (ANSI) text file and name it with a ".m3u" file extension.
  2. Add this as the very first line of the file: #EXTM3U
  3. Add one or more relative paths to the music files to be included in the playlist.  These would be relative to the "Music" folder on the Clip where the Artist folders would be housed:

        #EXTM3U
        Antonio Vivaldi\12 Violin Concerto, for violin, strings & continuo in E major (‘La Primave.mp3
        Antonio Vivaldi\Concerto For 2 Violins In A Minor, Op. 3 No. 8 – Allegro (Mouvement 1).mp3
        Antonio Vivaldi\Four Seasons- Spring Allegro.mp3
        Émile Waldteufel\Skaters Waltz.mp3
        Franz Liszt\Hungarian Rhapsody No 2.mp3
        Franz Schubert\Moment Musical.mp3
        Frédéric Chopin\Minute Waltz.mp3
        Georges Bizet\Carmen Suite 1 Les Toreadors.mp3

    This seemed to be the minimum contents needed to get the playlist to be recognized.  (A small PowerShell sketch for generating a file like this follows this list.)

    For the most part, if I kept the files in a subfolder below the Artist name, the player would not recognize them.  My decision to flatten the music files to just one level down proved to be beneficial here.  I say "for the most part" because I did have one instance where a file was 2 levels down, in an "album" folder below the Artist folder, and the player found it.  I couldn’t explain why this worked, or why moving the other files up to the Artist folder caused them to suddenly be recognized by the player.  I thought it might have something to do with the length of the overall path, but as you can see from the above samples, some of the songs I have are quite long, and the player found those just fine.

  4. Switch the player to "MTP" mode.  For the Clip, this is found under Settings\USB Mode.  My player had been set to "Auto Detect".  At least two of the posts I found mentioned the other mode, "MSC", as being completely unusable for transferring playlist files to the player.  I have not tried changing this back to "Auto Detect" or trying "MSC", and then copying the playlist files over and seeing if they still worked.  I also didn’t dig into what these two modes are.  I had been working on the playlist issue for the better part of the week, and honestly, was just interested to see it resolved rather than exploring every nook and cranny.  Perhaps another day.
  5. Place this file in the root of the "Music" folder.  I tried a few different other locations for the playlist files on the player, including the "Playlists" folder, but this was the only one where it worked.
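
Since the playlist is just a text file, it could also be generated with a few lines of PowerShell instead of by hand.  Here is a minimal sketch, assuming the flattened Artist\Song layout described earlier (the music root and playlist name are hypothetical):

# Sketch only: the music root and playlist name are hypothetical.
$MusicRoot    = "C:\Users\Mark\Desktop\Music"
$PlaylistPath = Join-Path $MusicRoot "Everything.m3u"

# Paths relative to the Music folder, e.g. "Franz Liszt\Hungarian Rhapsody No 2.mp3"
$RelativePaths = Get-ChildItem $MusicRoot -Recurse -Include *.mp3, *.wma |
    ForEach-Object { $_.FullName.Substring($MusicRoot.Length + 1) }

# Step 1 above calls for a Windows-1252 (ANSI) file, starting with #EXTM3U
$Lines = @("#EXTM3U") + $RelativePaths
[IO.File]::WriteAllLines($PlaylistPath, $Lines, [System.Text.Encoding]::GetEncoding(1252))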

At this point, assuming that the music files were already on the player, the "Playlists" option on the player will show the new playlist, and let you play from it.  Not wanting to manage the playlist files by hand, I decided to go one step further and created a small WinForms application called "Playlist Forge" that lets me drag and drop individual music files, or entire folders, and have the playlist file constructed for me.

Playlist Forge

If you drag an M3U file onto Playlist Forge, it opens it.

Dragging a single music file (MP3 or WMA) onto it adds it to the playlist, including the name of the file and the parent folder.  (Playlist Forge assumes the folder structure I mentioned previously, where the actual music files are in a folder named after the Artist.)

Dragging a folder onto Playlist Forge will recursively find all MP3 or WMA files, and include them in the playlist, regardless of their depth.  It would still only include the file name and the folder it was actually in, but it would dig down as deeply as needed in the folder structure to pull out all of the music files.

Once you have the right files in there, you hit "CTRL-S" to save it.  If you had opened an M3U file originally, it overwrites that file.  If you had just started dragging music files onto it, it creates a new file called "NewPlaylist.m3u" on your desktop.

Finally, you can hit “CTRL-N” to clear the utility out and start a new playlist from scratch.

While this is definitely rough, it proved to be much faster to write this utility and use it than trying to pull all of the paths and files out manually.  It will also allow me to easily edit the files later, as I add music to my collection.

The utility – both the source and the compiled application – can be found in the PlaylistForge.zip archive at http://tinyurl.com/MarkGilbertSource if you are interested.  (And yes, I did see that other people had built apps like this already, but this seemed like a fun little app to write.)

 

Sit back and enjoy the sweet sounds of victory

A lot of research and work for this, but after all of it I am much happier about the state of my music collection and the prospects for managing it going forward.

Shell-shocked, but in a good way – Backing up blogs with PowerShell

One of Scott Hanselman’s more recent blog posts was titled "The Computer Backup Rule of Three”.  I was pleased to see that the backup solution that I have in place at Casa del Gilbert met his “Backup Rule of Three”:  I back my workstations up to my server nightly, which then contacts Carbonite and moves the files to the cloud nightly.  I also have a separate path for my media (copies of purchased digital music, LOTS of photos, etc.) being backed up twice a year to Data DVDs, and stored in my safe deposit box.  Despite all that, one of Scott’s bullets caught my eye.  Under "What should I do?" he had this:

"Don’t trust the cloud. I backup my gmail, too."

Doesn’t Google (or whatever cloud-based application providers you use) have this covered already?  The Scott-approved, paranoid answer would be, "Don’t assume that 1) they are doing that, 2) they know how to do it every time without fail, and 3) they won’t one day shutter their cloud-based doors suddenly, leaving you out in the cold.”  Why so paranoid?  "Simple. Because I care about my work, photos and data and I would be sad if I lost it."  (Hanselman)

That bullet point got me thinking.  I currently do not back up my Gmail, or any of my other email, but honestly would I really be upset if I suddenly lost that?  Probably not.  My contact list is stored with the Google as well.  I can recreate that based on my wife’s contact list, but that would be painful.  I’ll have to start backing that up regularly.

My blogs.

Dang.  I don’t have local copies of any of my blogs.  I could probably recreate all of the photos I’ve cropped and illustrations I’ve done, but that would be REALLY time-consuming.  I have to rectify that.

All of my blogs are hosted with WordPress, so I started poking around with the Export options, to see if I could just get a raw dump of the text and images for every post.  When I log into the Dashboard, and go to Tools/Export, I can generate an XML file that has all of the text of the posts, but they only have links to the images, which are still hosted with WordPress.  If WordPress were to go completely offline, I wouldn’t be able to get them, so my backup solution for this needs to pull those images down out of the cloud.

Ok, so I need something that can parse through the exported XML and pull down all of the images.  Ideally, that “something” could also split the posts up into separate folders, and dump the images for each post into the same folder.  At that point I’d have everything in one place, and could recreate every post from scratch if I had to.

Enter PowerShell.

***
A little aside

I think it was another Hanselman post years ago where I read that PowerShell was a powerful tool for more than just system admins – programmers would be able to reap a lot of the same benefits from it, automating the tedious, error-prone tasks that we face.  Ever since then, I’ve had it on my list of things to look at, but it never got prioritized high enough to actually look at until a couple of months ago.  That’s when we set up a new TeamCity build server at my company, and we made the decision to use PowerShell to script out the deployment step.  That proved to be an immense win – PowerShell was cleaner, easier, and far more powerful than NAnt (which we were using with our older CruiseControl.NET build server) for this task.

Then later, I used PowerShell to automate a development task that I was getting tired of doing by hand.  Wiring that script into a SlickRun command made for more automated awesomeness.

At that point, I was a true believer.  So, what tool did I turn to for parsing WordPress XML, automating the process of creating folders for each post, and pulling the images out of the cloud?

Exactly.
***

After dropping the XML file into a folder of its own, I open a PowerShell prompt, switch into the folder with my ParsePosts.ps1 script, and then execute it like so:

.\ParsePosts.ps1 {full_path_to_xml_file}

Let’s take a look at the script, piece by piece.  First, there is a small helper function that takes the blog post’s date and title, and constructs a unique folder name from it:

function CreateFolderName($DatePosted, $BlogTitle)
{
    $FormattedDate = Get-Date $DatePosted -format "yyyy-MM-dd"
    $CleanedUpTitle = $BlogTitle -replace '\W', ''
    return "$FormattedDate-$CleanedUpTitle"
}
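
For illustration, a hypothetical call (this date and title are made up, not from an actual export) returns a folder name like this:

CreateFolderName "2013-01-15 08:00:00" "Shell-shocked, but in a good way!"
# Returns: 2013-01-15-Shellshockedbutinagoodway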

The date is formatted into yyyy-mm-dd, and then the title, stripped of all non-alphanumeric characters, is appended.  Next, I set a flag that allows me to control how verbose the script messaging is during its run. I can turn this off by setting this variable to $False:

$IsVerbose = $True

After that, I make sure that I got exactly 1 command line parameter, and if I didn't, I write out an error message with the script's usage:

If ($args.Length -ne 1)
{
    Write-Error "Usage: ParsePosts.ps1 {full_path_to_file_to_parse}"
    Exit 1
}

Next, I set two local variables – one with the file to be parsed, and the other with the directory that that file resides in:

$FileToParse = Get-ChildItem $args[0]
If ($IsVerbose) { Write-Host $FileToParse.Name }

$BaseDirectory = $FileToParse.DirectoryName
If ($IsVerbose) { Write-Host $BaseDirectory }

As the comment says, this next block makes sure that the "atom" namespace has been added to the XML document:

# First, make sure the "atom" namespace has been included.  For some reason, the WordPress export utility
# doesn't do this, and without it the export file is not valid XML.
$RawFileContents = Get-Content $FileToParse.FullName | Out-String
If(-Not $RawFileContents.Contains("xmlns:atom=""http://www.w3.org/2005/Atom"""))
{
    $RawFileContents = $RawFileContents.ToString().Replace("xmlns:wp=""http://wordpress.org/export/1.2/""", "xmlns:wp=""http://wordpress.org/export/1.2/"" xmlns:atom=""http://www.w3.org/2005/Atom""")
}

$XmlContents = [xml]$RawFileContents

PowerShell wouldn’t parse the XML document properly until that namespace was present, and for some reason WordPress wasn’t adding it.  So, I open the file raw, and cheat a little – if the file doesn’t contain the "atom" namespace declaration, I look for the "wp" namespace declaration, and replace it with both the "wp" and "atom" declarations.  After that, I open the file as an XML Document.

I found that the blog content was contained in a "content:encoded" tag, so I need to declare a "content" namespace for later use:

$ContentNamespace = New-Object Xml.XmlNamespaceManager $XmlContents.NameTable
$ContentNamespace.AddNamespace("content", "http://purl.org/rss/1.0/modules/content/")

Then, I set up a WebClient object, and a couple of regular expressions that I will need to parse through each blog post:

$ImageRequester = New-Object System.Net.WebClient
$ImageRegex = [regex]'src="(?<ImageUrl>.+?)"'
$FileRegex = [regex]'\/(?<FileName>[^\/]*\.png)|\/(?<FileName>[^\/]*\.jpg)'

All of this is basically setup needed for the real meat of the script:

$BlogPostNodeList = $XmlContents.SelectNodes("//channel/item")
Foreach ($CurrentBlogPostNode in $BlogPostNodeList)
{
    $BlogPostFolderName = CreateFolderName $CurrentBlogPostNode.post_date $CurrentBlogPostNode.Title
    $BlogPostFullFolderPath = Join-Path $BaseDirectory $BlogPostFolderName

    if (-Not (Test-Path $BlogPostFullFolderPath))
    {
        New-Item $BlogPostFullFolderPath -type directory
        If ($IsVerbose) { Write-Host "Created: $BlogPostFullFolderPath" }
    }

    # Write out the blog file contents to the file
    $BlogPostContent = ($CurrentBlogPostNode.selectSingleNode("content:encoded", $ContentNamespace).get_InnerText())
    Set-Content (Join-Path $BlogPostFullFolderPath "Post.txt") $BlogPostContent
    If ($IsVerbose) { Write-Host "Content written to $BlogPostFullFolderPath" }

    # Pull down this blog post’s images, if any
    $ImageRegex.Matches($BlogPostContent) | Foreach {
        $CurrentUrl = [string]$_.Groups["ImageUrl"]

        $LocalFileName = ""
        $FileRegex.Matches($CurrentUrl) | Foreach { $LocalFileName = $_.Groups["FileName"] }

        if($LocalFileName -ne "")
        {
            $LocalImagePath = Join-Path $BlogPostFullFolderPath $LocalFileName
            $ImageRequester.DownloadFile($CurrentUrl,$LocalImagePath)

            If ($IsVerbose) { Write-Host "$CurrentUrl downloaded" }
        }
    }
}

I first find all of the channel/item blocks, each of which represents a single blog post.  The loop then breaks down as follows:

    1. Check to see if the folder for that blog post has already been created, and if not create it.
    2. Break out the blog text itself, stored in the "content:encoded" tag, and save it to a Post.txt file in the blog post’s folder.
    3. Get a list of the src="…" tags found in the blog post.  Those represent the images I’ve embedded in the post.  For each of those images, extract the full URL, use the $ImageRequester WebClient object to pull it down, and save it to the blog post’s folder.

Finally, let the user know the script has finished:

Write-Host "Done"

There is definitely room for improvement here.  One of the things that I’ve found that trips the script up is the occasional blog post with a "post_date" of "0000-00-00 00:00:00".  When that happens, I have to dig up the original blog post, find its publication date, and manually update the XML file to include it.  This is one of many "hardening" steps that could be added to make it more capable of handling sloppy data.  And of course, the script could use another couple of rounds of refactoring as well.
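
As a sketch of one such hardening step (this is an assumption on my part, not something the script currently does), a guard at the top of the foreach loop could skip those posts instead of failing on them:

# Hypothetical hardening step – skip posts with the bogus "0000-00-00" date
if ($CurrentBlogPostNode.post_date -like "0000-00-00*")
{
    Write-Warning ("Skipping post with no usable date: " + $CurrentBlogPostNode.Title)
    continue
}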

My first official runs of the script seem to have worked well.  I pulled down all posts for all of my blogs, and even the one with the most posts only took a couple of minutes to parse.  I’ve set a monthly task to pull down the last month’s worth of posts.

Many, many places on the internets contributed to this script – posts on pathing, parsing XML, selecting XML nodes that include namespaces, writing functions, formatting dates, writing to files, and extracting the images from the content – and I would be remiss if I failed to mention them.

Thanks to all!