Reading a Numbered Email from an Mbox in PHP

We’ve settled on the name Email Activity Assistant for the project I’m doing this summer. Inspired by the name, I’m going to try to find time to work a helpful paper-clip character into the interface. Like Clippit, only more dynamic and proactive.

Today I had to write a Web application that sends a series of pre-stored emails. We’ll use it to test our application’s ability to classify emails as they arrive. This will save us the embarrassment of ordering five hundred Lindsay Lohan Original CD Clocks, one for each time we have to test the mail grouping and activity-identifying algorithms.

We have a few sets of emails in a series of mbox files. Mbox is the mail storage format used by many mail applications, including Thunderbird. Its exact specification can be found in the mbox man page, but the gist is easy: emails are stored one after another, each one preceded by a line beginning "From " (note the space; quotes not included, obviously), called a From_ line, and followed by a single blank line.

My application had to extract each email in turn, given only the path to the mbox file and the number of the required email. Here follows my solution. It’s somewhat broken, as I’ll explain below.

The basic idea is that we really don’t want to have to read the entire mbox file, split it into separate emails, and then count up to the one we want. That’s just bad. Imagine trying to get the first email, a one-liner asking if you’re free on Thursday, from an mbox representing the ensuing multi-week life-altering intellectual and spiritual debate about the nature of freedom and, for that matter, the nature of Thursday. So we only want to read up to the end of our email. PHP can read a file a line at a time with fgets, but I’ve gone for a different approach.

We read in the file a chunk at a time; I’ve hardcoded the number 4096 bytes because it’s nice and round. If you don’t think that 4096 is a round number you probably shouldn’t be reading this. I chose 4096 since it’s not too large (I don’t want to read much more of the file than I have to) and not too small (being too small would be catastrophic for this code, for reasons noted below). Anyway, once we’ve read the first chunk we split it on any occurrence of /\nFrom [^\n]*\n/. I would have preferred /^From .*$/, but by default PHP’s preg_split treats ^ and $ as the beginning and end of the whole string rather than of each line, which is useless to us here. (Adding the m modifier would switch them to line boundaries; the code below keeps the explicit newlines.)
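To make the splitting step concrete, here is a small sketch run against a made-up chunk. The addresses, dates and bodies are placeholders I invented for illustration, not data from our mboxes. It compares the explicit-newline pattern with the multiline variant, assuming only the standard preg_split function.

```php
<?php
// A made-up chunk holding three short messages (placeholder content).
$chunk = "From alice@example.com Mon Jun  5 10:00:00 2006\n"
       . "Subject: Thursday?\n\nAre you free?\n\n"
       . "From bob@example.com Mon Jun  5 10:05:00 2006\n"
       . "Subject: Re: Thursday?\n\nDefine free.\n\n"
       . "From alice@example.com Mon Jun  5 10:09:00 2006\n"
       . "Subject: Re: Re: Thursday?\n\nDefine Thursday.\n";

// The pattern from this post: a newline, a From_ line, and its newline.
// The very first From_ line has no newline before it, so it is not matched
// and stays attached to the first piece.
$parts = preg_split("/\nFrom [^\n]*\n/", $chunk);

// The same idea with the m (multiline) modifier, so ^ and $ match at line
// boundaries. This also matches the first From_ line, which leaves an empty
// string as the first piece.
$partsMultiline = preg_split('/^From .*$/m', $chunk);

echo count($parts), "\n";           // 3
echo count($partsMultiline), "\n";  // 4
```

The two patterns are not drop-in replacements for each other: the multiline version shifts every index up by one and leaves the surrounding newlines on each piece, so the counting logic would need adjusting if you swapped it in.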

If our required email hasn’t been reached yet (say we’re looking for the fourth and only three emails, or parts of emails, form our chunk) then we read the next chunk and continue as before. If our email, or part of it, is in the chunk we just read then we extract it. In this case we will have to read the next chunk only if we didn’t get the complete email from this one, i.e. if it spans the boundary. On the next chunk we’ll just take the first part, up to the first From_ line if there is one, and append it to what we have. We keep doing this until we get a chunk with the beginning of another email or an end-of-file.

We now have our email. We also have some additional information we could use. If we hit an end-of-file we might want to do something to indicate there are no emails remaining. If we hit the EOF before finding our email we could throw a little hissy fit; alternatively we could show a helpful error.

Here’s an edited version of the code I used. $msgnum is the number of the email we want to retrieve, counting from 0. $mbox is the path to the mailbox.

$fHandle = fopen($mbox, "r"); // the mode must be a quoted string
$currentMail = 0;
$mailContent = "";
while (!feof($fHandle) && $currentMail <= $msgnum)
{
	$chunk = fread($fHandle, 4096);
	$mails = preg_split("/\nFrom [^\n]*\n/", $chunk);
	if ($currentMail == $msgnum)
	{
		// we're in the middle of reading our mail so append the first part of the chunk:
		$mailContent .= $mails[0];
		$currentMail += count($mails) - 1;
	}
	else if ($currentMail + count($mails) <= $msgnum)
	{
		// we haven't reached the mail we want yet so move on
		// (<= rather than <, otherwise $mails[$offset] below can be out of range):
		$currentMail += count($mails) - 1;
	}
	else
	{
		// our mail starts (and might end) in this chunk:
		$offset = $msgnum - $currentMail;
		$mailContent = $mails[$offset];
		$currentMail += count($mails) - 1;
	}
}
if (feof($fHandle))
{
	// you might want to add some code here to be run if your email was the last one in the mbox
}
fclose($fHandle);
// $mailContent contains the contents of your email, if it was found

I promised I’d tell you why this method still sucks. You might have spotted it. The most important thing, the show-stopper if the bug is ever met, is that if a chunk contains some but not all of a From_ line, that line will be lost and the two emails it separates will be considered to be one. The chance of a 20-ish character line being broken when we split into 4000+ character chunks is small but significant. This bug wouldn’t be acceptable in a live application; the only reason I let it slide is that it would be easier for me to just pad the offending mbox if we ever see this bug. It’ll show for, on average, one in 200 emails, so we might well see it at some point in our testing.
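For what it’s worth, one way to sidestep the broken From_ line problem entirely is to read whole lines with fgets instead of fixed-size chunks, since a line can then never be cut in half. The following is only a sketch of that alternative, not the code I actually used: readMailByLine is a hypothetical name, it assumes PHP 7.1 or later for the type declarations, and like the regex above it assumes that any body line starting with "From " has been escaped by whatever wrote the mbox.

```php
<?php
// Hypothetical helper (not the code from this post): returns message number
// $msgnum, counting from 0, from the mbox at $path, or null if it isn't there.
// Unlike the chunked version, the From_ line is dropped for every message.
function readMailByLine(string $path, int $msgnum): ?string
{
    $handle = fopen($path, "r");
    if ($handle === false) {
        return null;
    }
    $current = -1;   // index of the message we are currently inside
    $content = null; // stays null until our message starts
    while (($line = fgets($handle)) !== false) {
        if (strncmp($line, "From ", 5) === 0) {
            $current++;
            if ($current > $msgnum) {
                break;          // the next message has started, so ours is done
            }
            if ($current === $msgnum) {
                $content = "";  // our message starts after this From_ line
            }
            continue;
        }
        if ($current === $msgnum) {
            $content .= $line;
        }
    }
    fclose($handle);
    return $content;
}
```

It still reads every email before the one we want, so it does nothing for the second problem.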

The second bug, not so catastrophic, is that this method leads to a Shlemiel the painter’s algorithm. If we want the nth email we read the first n emails. So when we put this function into an application which goes through each email in turn, this is what happens: we read the first email, we read the first two emails, we read the first three emails, etc. It’s like a road painter whose work rate decreases dramatically as he gets further and further from the paint bucket. In this case the function calls are all made on separate page views, so there wasn’t much I could do short of saving a pointer to the last read chunk in a session variable. That would have been overkill, but it would be an important improvement if our mboxes got quite large. I tested on a 30+ message mbox, which looks like an extreme case for our purposes, and saw no noticeable slowdown on the later emails, so I don’t anticipate this being a practical issue in this case.
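If the mboxes ever did get large, the session-variable idea might look something like the sketch below. Again this is an illustration under assumptions rather than anything I built: readMailAt and the session key are made-up names, it remembers a byte offset rather than a chunk, it reads by line with fgets, and it needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: reads one message starting at byte $offset and returns
// [message text, offset of the next From_ line]. The message is null once
// there is nothing left to read.
function readMailAt(string $path, int $offset): array
{
    $handle = fopen($path, "r");
    if ($handle === false) {
        return [null, $offset];
    }
    fseek($handle, $offset);
    $content = null;
    $next = $offset;
    while (($line = fgets($handle)) !== false) {
        if (strncmp($line, "From ", 5) === 0) {
            if ($content !== null) {
                break;          // start of the following message: stop here
            }
            $content = "";      // our own From_ line: skip it, start collecting
        } elseif ($content !== null) {
            $content .= $line;
        }
        $next = ftell($handle); // everything up to this point is consumed
    }
    fclose($handle);
    return [$content, $next];
}

// Usage sketch; the session key name is made up for illustration:
// session_start();
// [$mail, $_SESSION["mboxOffset"]] = readMailAt($mbox, $_SESSION["mboxOffset"] ?? 0);
```

Each page view then starts reading where the previous one stopped, so the painter gets to carry the bucket with him.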