Using R, {msgxtractor}, and R Markdown to wrangle a set of msg files
A friend recently asked for assistance with data she received from an open records request.
emotionally I was not prepared for WISDOT to send me a Box folder filled with documents that I can’t figure out how to download in response to my obnoxious open records request from ~6 months ago
— marybeth (unfortunately) (@marbeff) August 16, 2022
The data she received consisted of a zip file with 70-something *.msg
files. This is a proprietary Microsoft file format used by Outlook for storing emails. I quickly checked if there was an R package for handling the data, and indeed there is: msgextractor
. With this package and some R and RMarkdown I figured converting the files to something more convenient would not be too difficult.
In the end it took me much longer than I had hoped, but it works. The process has two pieces: An R
script with a function to read in a single email, extract some information, and save the email’s attachments in a separate folder. The second piece: A RMarkdown .Rmd
template that uses variables generated in the function to create an HTML
file with email metadata, the body, and links to the attachments. The R
then script then iterates the function over the list of *.msg
files with the walk
function from the {purr} package. This creates one HTML
file for each of the emails.
Here’s the R script:
library(msgxtractr) #install with devtools::install_github("hrbrmstr/msgxtractr")
library(tidyverse)
library(here)
# read list of msg files in folders
files <- list.files("data/All Files/")
# extract and save attachments and return paths
write_msg_files <- function(msg_file){
msg <- read_msg(paste0("data/All Files/", msg_file))
dir_name <- paste0("output/", msg_file)
dir.create(dir_name)
save_attachments(msg_obj = msg, path = dir_name)
#return path and file names for attachments
attachment_names <- list.files(dir_name)
attachment_paths <- paste0(msg_file, "/", attachment_names)
# contents for output documents
email <- data_frame(email = msg_file,
from = unlist(msg$headers$From),
to = unlist(msg$headers$To),
cc = unlist(msg$headers$CC),
subject = unlist(msg$headers$Subject),
date = unlist(msg$headers$Date),
body = paste(msg$body$text, collapse = "\n"),
attachments = list(attachment_paths))
#render output document
rmarkdown::render(
input = "output_template.Rmd",
output_file = here("output", paste0(msg_file, ".html"))
)
}
# iterate over list of files
walk(files, write_attachments)
And this is what the output template looks like:
---
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F)
```
```{r}
library(msgxtractr)
library(tidyverse)
```
`r email[1,]$email`
#
**From:** `r email[1,]$from`
**To:** `r email$to`
**CC:** `r email$cc`
**Date:** `r email$date`
**Subject:** `r email$subject`
```{r results='asis'}
cat(email[1,]$body)
```
## Attachments
```{r results='asis'}
#creates markdown links to attachments
map_chr(email$attachments[[1]], ~paste0("[", ., "](", ., ")"))
```
This works nicely, with one exception: There were two msg
files whose content was another msg
file as an attachment (which in turn had attachments). msgxtractor
currently can’t deal with these and so I had to remove them from the dataset.
Here’s what the output:
Maybe this can be useful for others. The repository is on Github
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kliems (2022, Aug. 26). Harald Kliems: Wrangling unwieldy open records data. Retrieved from https://haraldkliems.netlify.app/posts/2022-08-26-using-r-markdown-and-msgxtractr-to-process-an-open-records-request/
BibTeX citation
@misc{kliems2022wrangling, author = {Kliems, Harald}, title = {Harald Kliems: Wrangling unwieldy open records data}, url = {https://haraldkliems.netlify.app/posts/2022-08-26-using-r-markdown-and-msgxtractr-to-process-an-open-records-request/}, year = {2022} }