GMail Backup with getmail

My primary laptop died on me the other day. That is unfortunate given how much a new mainboard is going to cost, but not catastrophic. Most of the data on it is backed up in other locations, and my mail is hosted by Google Apps for Your Domain (GAFYD). I sat down at another machine in my office, went to log into my mail...and nothing. It was offline for about an hour due to server problems on the other end. During that hour I had time to come up with thoughts such as "what if Google loses my mail at the same time my laptop dies?". That wasn't likely, but I decided the 'back up your mailbox' list item had just moved up to the top.

I figured that I could just pull my mail via IMAP to a backup machine, but thought that I should investigate what other options folks were using. As it turns out, the common method is to set up a mail client (Outlook, Outlook Express, Thunderbird, etc.) and just have it sync the mailbox. Some folks were using POP to retrieve the mail, but the protocol doesn't really lend itself to this, and Google's implementation is problematic. So I went with my initial thought of IMAP. I didn't want a full-blown client doing this, but rather something I could run from a script. The obvious answer was fetchmail, the classic command-line mail retriever. In my reading, though, I came across another option that I hadn't used: getmail. From the getmail website:

"getmail is a mail retriever designed to allow you to get your mail from one or more mail accounts on various mail servers to your local machine for reading with a minimum of fuss. getmail is designed to be secure, flexible, reliable, and easy-to-use. getmail is designed to replace other mail retrievers such as fetchmail."

The author, Charles Cazabon, and others are critical of the security and overall design of fetchmail, and getmail is designed to address these issues. Plus getmail is written in Python, which makes it more interesting to me. It is available as a package on most Unix-like systems. The system I had in mind for the job was a Windows machine; luckily, Cygwin includes it as a standard package as well.

After installing getmail you will need to create a ~/.getmail directory with a getmailrc file spelling out what mail you would like retrieved and how. The syntax of the file is simple and straightforward. You need a 'retriever' section defining the mailbox/mailserver, a 'destination' section denoting where you would like the mail stored, and an 'options' section with general parameters. Sample getmailrc file:

[retriever]
type = SimpleIMAPSSLRetriever
server = imap.gmail.com
username = jdoe@datalinkcontrol.com
password = p@ssw0rd
mailboxes = ("[Gmail]/All Mail",)
port = 993

[destination]
type = Maildir
path = ~/mailbackup/datalinkcontrol/

[options]
received = false
delivered_to = false
read_all = false
verbose = 0

The username & password fields are pretty obvious, but the mailboxes option perhaps needs a little clarification. When using IMAP you need to define which folders you would like retrieved. You can list multiple folders, but I just wanted a bulk backup of all of my mail, and handily GMail has a folder called 'All Mail'. Note that this folder shows up to IMAP clients as '[Gmail]/All Mail'.

In the destination section getmail has several options, including passing the mail along to another MTA or application for further processing. The two options for storage are the maildir & mbox formats. The primary visible difference is that maildir stores each message as a separate file in a directory, while mbox stores all of the messages in a single file. The maildir format is very straightforward, while the mbox format has several variants as well as file-locking issues. Given the volume of mail I was going to back up, I chose maildir. The path needs to exist already, and it must have three subdirectories (cur, new & tmp); they will not be created for you.

Finally, the options I defined. Setting 'received' and 'delivered_to' to false stops getmail from adding its own Received: and Delivered-To: headers to the messages as they are downloaded; I wanted the mail as is. The 'read_all = false' setting tells getmail not to re-read messages it has already pulled down, but only to fetch new ones, and 'verbose = 0' suppresses the status line for each message retrieved. I saved this file as 'datalinkcontrol' in the .getmail directory, and I created one such file for each mail account I wanted to back up at Google. Now to use getmail...
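One bit of prep before the first run: since the maildir directories are not created automatically, they have to be made by hand or by a script. Here is a minimal sketch of the latter, assuming Python 3 is available on the backup machine. The helper name make_maildir is my own invention for illustration and is not part of getmail; the path in the comment is the one from the sample rc file.

```python
import os


def make_maildir(path):
    """Create `path` with the cur/new/tmp subdirectories a maildir needs.

    Hypothetical helper for illustration; not part of getmail.
    Returns the expanded path so the caller can see where it landed.
    """
    root = os.path.expanduser(path)
    for sub in ("cur", "new", "tmp"):
        # exist_ok=True makes the call safe to repeat on an existing maildir
        os.makedirs(os.path.join(root, sub), mode=0o700, exist_ok=True)
    return root


# Example call, using the destination path from the sample rc file:
# make_maildir("~/mailbackup/datalinkcontrol/")
```

Once those directories exist, the run itself is a single command: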

getmail -r datalinkcontrol

That's it. It diligently went to work, and a bit later it had retrieved a little over 24,000 messages from the mailbox. Each subsequent run retrieves only new messages and is quite fast. Add an entry to cron or scheduled tasks for each mailbox and you are done. If you ever need to upload the messages or transfer them to another account, many email clients and scripts can easily handle the maildir format. A variation you might consider is creating a filter that labels the messages by date range (e.g. Mail 2007, Mail 2008, etc.) and then specifying those labels on the mailboxes line in the rc file.
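As an example of a script handling the maildir format, here is a rough sketch that opens the backup with the mailbox module from Python's standard library and reports how many messages it holds, plus a few subjects as a sanity check. The function name summarize_maildir and the limit parameter are my own choices for illustration, and the path in the comment is the one from the sample rc file. The order in which messages come back is not guaranteed.

```python
import itertools
import mailbox
import os


def summarize_maildir(path, limit=5):
    """Return (message_count, subjects) for the maildir at `path`.

    Hypothetical helper for illustration. `subjects` holds the Subject
    header of up to `limit` messages, in whatever order they are found.
    """
    # create=False: fail loudly if the backup is missing instead of
    # silently creating an empty maildir
    box = mailbox.Maildir(os.path.expanduser(path), create=False)
    subjects = [
        str(msg.get("Subject", "(no subject)"))
        for msg in itertools.islice(box, limit)
    ]
    return len(box), subjects


# Example call against the backup made above:
# count, subjects = summarize_maildir("~/mailbackup/datalinkcontrol/")
```

A message count that keeps pace with the 'All Mail' folder is a cheap way to confirm the scheduled runs are still working.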