Makefiles and RMarkdown

Quite some time ago (October 2013, according to Amazon), I bought a copy of “Reproducible Research with R and RStudio” by Christopher Gandrud. And it was awesome. Since then, I’ve been using knitr and RMarkdown quite a lot. However, until recently, I never bothered with a makefile. At the time, I had assumed that it was something only available to people on *nix systems and back then I was developing exclusively on PC. I even wrote some R scripts that were more or less makefiles; reading the contents of a directory, checking for output and running render or knit or whatever. My workflow continued to evolve and get standardised, I moved to Linux and so I picked up Gandrud’s book again to review the bits about makefiles. I’m not sure when it was that I realized that RTools includes a make program for Windows, but I wish someone had told me that a couple years ago.

So, enough preamble. What’s the benefit and how does it work?

The benefit

A makefile ensures that all of your work gets done, but only when it needs to and that each step has the raw material it needs to work. Identify what output you expect to see and how to generate that output and the make utility will go to work. If the output is already there, it will skip to the next thing which needs to get done. Like a 12-bar blues, very simple in concept, but easy to extend to all sorts of complex derivations. Here’s my approach:

The assumptions

  1. Use RMarkdown files as your default. This will allow you to comment on everything that you’re doing and construct high quality output that you can share with folks down the line. For most steps, I render output to Word. Yes, yes, but my audience likes Word and there’s no drama about different browsers and I can easily edit the content, if I need to.
  2. The workflow breaks down into four discrete steps: Gather data, Process data, Analyze data, Present data. This is pretty close to what Gandrud proposes.
  3. Save output in .rda files at each step of the process.

The steps

  1. Gather data. Fetch it from the internet, from your data warehouse, or from wherever. This step makes a copy of that information and records where it came from, how you got it and how it’s structured. Save everything in a folder called ‘raw’. At this stage, I try to make no adjustments at all.
  2. Process data. Take the raw information and alter it. This step typically involves ensuring that data types are righteous- factors are characters if necessary, dates are dates, etc. Calculated and convenience columns (storing the year as well as the date, for example) are created. I might merge data frames into a single table, spread and/or gather as appropriate. Often, though, not a lot happens.
  3. Analyze data. This is usually exploratory, or even just descriptive. I’ll produce (hopefully) tons of plots and summary tables. At some point, I’ll come to a conclusion about models that I think make sense.
  4. Present. At the moment, my preference is to use slidy for presentation output. This keeps things fairly clean and simple. More complex explication should use something like LaTeX or Word. I can’t stand technical writing and I’m awful at it, so I usually stick to pictures and bullet points.

How does it work?

I don’t really know. Sorry. I’ve had a go with the GNU documentation and it’s pretty overwhelming. I took Christopher’s basic example and modified it for my purposes. Boiling it down to basic principles, I know several things:

  1. make operates by building “targets”. Once the full set of targets is built, make is finished.
  2. Each target has (probably) a “prerequisite”, which it needs in order to get built. The prerequisite may also be (and often is) a target itself.
  3. The rule for building the target is called a “recipe” and is typically a shell command.
  4. make makes liberal use of variables and wildcards. A variable is often written in all caps. To refer to it, enclose it within parentheses and precede it with a dollar sign, e.g. $(MY_VARIABLE).
  5. You can also include a “clean” step, which will wipe out all the targets. This will ensure that everything gets rebuilt.
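The decision make takes for each target can be sketched in R. This only illustrates the timestamp comparison and is not how make itself is implemented; the helper name `needs_rebuild` and the file names are made up for the example.

```r
# Hypothetical helper: TRUE when the target must be (re)built.
needs_rebuild <- function(target, prerequisites) {
  # A target that does not exist always has to be built.
  if (!file.exists(target)) return(TRUE)
  # Otherwise rebuild only if some prerequisite is newer than the target.
  any(file.info(prerequisites)$mtime > file.info(target)$mtime)
}

# Demonstration with throwaway files in a temporary directory.
work_dir <- tempfile("make_demo")
dir.create(work_dir)
source_file <- file.path(work_dir, "gather.Rmd")
output_file <- file.path(work_dir, "gather.docx")

writeLines("raw material", source_file)
missing_target <- needs_rebuild(output_file, source_file)  # no output yet

writeLines("rendered output", output_file)
Sys.setFileTime(source_file, Sys.time() - 60)  # source older than output
up_to_date <- needs_rebuild(output_file, source_file)

Sys.setFileTime(source_file, Sys.time() + 60)  # source edited after output
stale <- needs_rebuild(output_file, source_file)
```

Only the missing and the stale cases trigger work, which is the "only when it needs to" behaviour described earlier.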

I took a few minutes to straighten out my just-for-fun Baseball repository to align it with my current preferred workflow. Here’s what I did: first, ensure that the directory structure stuck to the gather -> process -> analyze -> present flow. In this case, that just meant a bit of tidying in my data directory. Second, copy the boilerplate makefile from my gist. Finally, alter the “Build Tools” section of “Tools/Project Options” in RStudio to ensure that the build tool moves from “none” to “Makefile”. That’s it. Let me say that again. That’s it.

NOTE: Please observe the copyright and limited use license at the Lahman site.

Let’s walk through the makefile. The first thing we do is establish where the root directories are. Note the use of variable substitution in defining the data directory.

    RDIR = .
    DATA_DIR = $(RDIR)/data

Next, we’ll use wildcards to establish all of the .Rmd files in each of our four steps as prerequisites. I’ll just show this for the “gather” step. In the second line, the wildcard function will pull every .Rmd file. The third line will perform a substitution to construct a list of targets.

    GATHER_DIR = $(DATA_DIR)/gather
    GATHER_SOURCE = $(wildcard $(GATHER_DIR)/*.Rmd)
    GATHER_OUT = $(GATHER_SOURCE:.Rmd=.docx)
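For anyone more at home in R than in make, the two variable definitions above behave roughly like the following sketch. The directory and file names are invented for the illustration; only `list.files()` and `sub()` are doing the work.

```r
# Hypothetical layout: a 'gather' directory with two .Rmd files and a stray file.
gather_dir <- file.path(tempfile("data"), "gather")
dir.create(gather_dir, recursive = TRUE)
invisible(file.create(file.path(gather_dir,
                                c("batting.Rmd", "pitching.Rmd", "notes.txt"))))

# Like $(wildcard $(GATHER_DIR)/*.Rmd): every .Rmd file in the directory.
gather_source <- list.files(gather_dir, pattern = "\\.Rmd$", full.names = TRUE)

# Like $(GATHER_SOURCE:.Rmd=.docx): swap the suffix to name the targets.
gather_out <- sub("\\.Rmd$", ".docx", gather_source)

basename(gather_out)
```

The stray text file is ignored, and each source file gets a target of the same name with a .docx suffix.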

The step with the target of “all” is the key. “all” has prerequisites that are targets of each of the four steps listed above. Just before defining our ultimate target, we define a variable which will act as the “recipe” for each of the steps. The $< will be substituted with the name of our various .Rmd files.

    KNIT = Rscript -e "require(rmarkdown); render('$<')"

We’re ready to roll. Again, we’ll just show the “gather” step. This will use another form of wildcard which will associate a .Rmd file with its target. That prerequisite name will be fed into the KNIT variable we defined earlier.


Within RStudio, you can just hit CTRL-SHIFT-B to execute make. You’ll see the markdown engine zip through its files and eventually, you’ll see a pile of documentation produced. If everything went well, the next time you execute make it will tell you that nothing needs to be done. Change a .Rmd file in the processing step, though, and it will recreate that file and reperform all of the analysis. Of course, it’s possible to define things in such a way that not all of the analysis, or all of the processing, or whatever gets done if something changes upstream. I tend to customize this basic makefile to be a bit more fine tuned. However, I’ll always keep these steps in. This will give me an extra margin of safety to make sure that I’ve not ignored a critical dependency.


Below is a list of things that are awesome:

  • Makefiles. GNU/Linux is like the Age of Enlightenment and the invention of moveable type combined.
  • Christopher Gandrud’s book. Seriously, buy it.
  • RStudio. I think everyone has gotten the memo on this, but it’s always nice to repeat. Syntax highlighting for R, SQL and shell? (And HTML and Python and C++ and CSS and JavaScript?) Single keystroke or GUI to clean or make my project? Yes, please!
  • Yihui Xie. I’ve not read his book yet, but his stuff is amazing and knitr is tremendous.

Session info:

## R version 3.1.3 (2015-03-09)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.6        RWordPress_0.2-3
## loaded via a namespace (and not attached):
##  [1] digest_0.6.4    evaluate_0.5.5  formatR_0.10    htmltools_0.2.6
##  [5] RCurl_1.95-4.1  rmarkdown_0.5.1 stringr_0.6.2   tools_3.1.3    
##  [9] XML_3.98-1.1    XMLRPC_0.3-0    yaml_2.1.13

Visualizing the History of Epidemics

I really like National Geographic. Their magazine is great, their television documentaries are done well and they helped give me a lifelong love of maps. They generate very good information and help shed light on the world we all share. So why is this graphic so awful?

Let's have a look:
National Geographic image

We'll start off by saying that no one will mistake me for Edward Tufte or Stephen Few or Nathan Yau, though I love their stuff, have read it and have tried to adopt as many of their more sensible recommendations as I can. That understood, I think I'm on solid footing when I say that at a minimum, all graphical elements should fit within the display surface. The first three quantities are so massive, that they can't be contained. How big are they? Well, we have the numbers within the circles, but beyond that, who knows? The plague of Justinian looks like it could be Jupiter to the Black Plague's Saturn, with modern epidemics having more of an Earthly size.

Speaking of circles, I try to avoid them. If those three aforementioned experts have taught me anything it's that the human brain cannot easily compare the areas of round objects. Quick: without looking at the numbers, tell me how much bigger HIV is than Ebola.

Did you have to scroll to look at both objects? I did. Not only do the largest epidemics spill over the display area, they make it difficult to view a large number of data points at the same time. As we scroll down, we eventually land on a display which has Asian flu at the top and the great plague of London at the bottom. Justinian, the black death and medieval history are erased from our thoughts.

And what's with the x-axis? The circles move from one side to the other, but this dimension conveys no meaning whatsoever.

As an aside, although I love having the years shown, it would have been good to use that to augment the graphic with something that conveys how epidemics have changed over time. Population has changed, medicine has changed and the character of human disease has changed. As I look at the graphic, what I tend to extrapolate from this is that surely the plague of Justinian wiped out most of southern Europe, Anatolia and Mesopotamia. In contrast, SARS likely appeared during a slow news cycle.

It would be disingenuous of me to criticize a display without proposing one of my own. So, here goes.

library(ggplot2)  # ggplot(), geom_bar(), coord_flip()
library(scales)   # comma() for the axis labels

dfEpidemic = data.frame(Outbreak = c("Plague of Justinian", "Black Plague"
                                     , "HIV/AIDS", "1918 Flu", "Modern Plague"
                                     , "Asian Flu", "6th Cholera Pandemic"
                                     , "Russian Flu", "Hong Kong Flu"
                                     , "5th Cholera Pandemic", "4th Cholera Pandemic"
                                     , "7th Cholera Pandemic", "Swine Flu"
                                     , "2nd Cholera Pandemic", "First Cholera Pandemic"
                                     , "Great Plague of London", "Typhus Epidemic of 1847"
                                     , "Haiti Cholera Epidemic", "Ebola"
                                     , "Congo Measles Epidemic", "West African Meningitis"
                                     , "SARS")
                        , Count = c(100000000, 50000000, 39000000, 20000000
                                    , 10000000, 2000000, 1500000, 1000000
                                    , 1000000, 981899, 704596, 570000, 284000
                                    , 200000, 110000, 100000, 20000, 6631
                                    , 4877, 4555, 1210, 774)
                        , FirstYear = c(541, 1346, 1960, 1918, 1894, 1957, 1899, 1889
                                        , 1968, 1881, 1863, 1961, 2009, 1829, 1817
                                        , 1665, 1847, 2011, 2014, 2011, 2009, 2002))
dfEpidemic$Outbreak = factor(dfEpidemic$Outbreak
                             , levels=dfEpidemic$Outbreak[order(dfEpidemic$FirstYear
                                                                , decreasing=TRUE)])
plt = ggplot(data = dfEpidemic, aes(x=Outbreak, y=Count)) + geom_bar(stat="identity") + coord_flip()
plt = plt + scale_y_continuous(labels=comma)

plot of chunk GetDataFrame

I'm showing that data as a bar chart, so everything fits within the display and the relative size is easy to recognize. I also order the bars by starting year so that we can convey an additional item of information. Are diseases getting more extreme? Nope. Quite the reverse. 1918 flu and HIV have been significant health issues, but they pale in comparison to the plague of Justinian or the Black Death. HIV is significant, but we've been living with that disease for longer than I've been alive. If we want to convey a fourth dimension, we could shade the bars by the rate of deaths per year, which takes the length of each outbreak into account.

dfEpidemic$LastYear = c(542, 1350, 2014, 1920, 1903, 1958, 1923, 1890, 1969, 1896, 1879
                        , 2014, 2009, 1849, 1823, 1666, 1847, 2014, 2014, 2014, 2010, 2003)
dfEpidemic$Duration = with(dfEpidemic, LastYear - FirstYear + 1)
dfEpidemic$Rate = with(dfEpidemic, Count / Duration)

plt = ggplot(data = dfEpidemic, aes(x=Outbreak, y=Count, fill=Rate)) + geom_bar(stat="identity")
plt = plt + coord_flip() + scale_y_continuous(labels=comma)

plot of chunk AddDuration

The plague of Justinian dwarfs everything. We'll have one last look with this observation removed. I'll also take out the Black Death so that we're a bit more focused on modern epidemics.

dfEpidemic2 = dfEpidemic[-(1:2), ]
plt = ggplot(data = dfEpidemic2, aes(x=Outbreak, y=Count, fill=Rate)) + geom_bar(stat="identity")
plt = plt + coord_flip() + scale_y_continuous(labels=comma)

plot of chunk SansJustinian

HIV/AIDS now stands out as having the most victims, though the 1918 flu pandemic caused people to succumb more quickly.

These bar charts are hardly the last word in data visualization. Still, I think they convey more information, more objectively than the National Geographic's exhibit. I'd love to see further comments and refinements.

Session info:

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.6        RWordPress_0.2-3 scales_0.2.4     ggplot2_1.0.0   
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] grid_3.1.1       gtable_0.1.2     htmltools_0.2.4  labeling_0.2    
##  [9] MASS_7.3-34      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
## [13] Rcpp_0.11.2      RCurl_1.95-4.1   reshape2_1.4     rmarkdown_0.2.50
## [17] stringr_0.6.2    tools_3.1.1      XML_3.98-1.1     XMLRPC_0.3-0    
## [21] yaml_2.1.13

An Idiot Learns Bayesian Analysis: Part 3

A week or so ago, the grand Magus over at published a great, quick thought exercise taken from Daniel Kahneman’s book Thinking, Fast and Slow. Here are the particulars of the problem: you’re in a community with two different color vehicles; 85% are green and 15% are blue. A vehicle was involved in a hit and run accident. A witness says the car was blue. We can establish that the witness may correctly identify the color of a car 80% of the time. Given all that, what is the probability that the car is blue?

In an uncharacteristic fit of industriousness, I didn’t just read through the explanation, but tried to work it out myself. My natural inclination was to assume a table as follows:

                        Car is green    Car is blue    Marginal
Witness is correct      68%             12%            80%
Witness is incorrect    17%             3%             20%
Marginal                85%             15%            100%

The interior probabilities can be worked out by multiplying the marginals, which is a great thing for lazy people like me. However, structuring things in this way can make the problem a bit harder to work out. This configuration doesn’t directly address our question. If we want to know the chance that the car was blue- given that the witness says that it was blue- we have to pluck out the scenarios wherein the witness will state that she saw a blue car. These are: 1) the witness is correct and the car is blue and 2) the witness is incorrect and the car is green. The double negative does my head in. Notwithstanding that, if we add those two probabilities together, we get the 29% chance that the witness says blue and we can then normalize the 12% (car is blue) to get the 41% chance that the car is actually blue.

The 2×2 table suggested by the Mage’s approach is as follows. Note that he had to work out the marginals, as explained in his post.

                        Car is green    Car is blue    Marginal
Witness says green      68%             3%             71%
Witness says blue       17%             12%            29%
Marginal                85%             15%            100%

With this setup, the probability of a blue car is easy to isolate as the scenario now takes up a single row. Just divide the 12% by 29% to normalize the row and we arrive at the posterior probability of 41%.
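The same table can be rebuilt in a few lines of R. This is just a sketch using the three numbers from the problem statement; the variable names are my own.

```r
# Inputs taken from the problem statement.
p_green  <- 0.85   # share of green cars
p_blue   <- 0.15   # share of blue cars
accuracy <- 0.80   # chance the witness names the right colour

# Joint probabilities: rows are what the witness says, columns the true colour.
joint <- matrix(
  c(accuracy * p_green,       (1 - accuracy) * p_blue,
    (1 - accuracy) * p_green, accuracy * p_blue),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("says green", "says blue"), c("is green", "is blue"))
)

# Probability the witness says blue, then the posterior that the car is blue.
p_says_blue    <- sum(joint["says blue", ])
posterior_blue <- joint["says blue", "is blue"] / p_says_blue
round(posterior_blue, 2)
```

Normalizing the "says blue" row is all that Bayes' rule asks for here: 12% out of 29%.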

It’s a subtle thing, but meaningful. The other nice thing about this approach is that it’s consistent with how we might look at a binary classification problem. It’s worth taking a quick moment to identify some of the salient attributes of the table. Positive predictive value is the ratio of true positives divided by total number of positive forecasts. This is the likelihood that the witness will be correct, when she says that a car is green. (“Positive” in this context may be regarded as a green car.) In this case, that number is 96%. This is very high. The negative predictive value, which is the likelihood the witness will be correct when she says blue is 41%.

This is what the Mage’s table looks like with the probabilities replaced by the terms we use to describe them. (Again, “positive” in this case is arbitrarily defined to be a green car.)

                        Car is green                       Car is blue
Witness says green      Positive Predictive Value (PPV)    False Discovery Rate (FDR)
Witness says blue       False Omission Rate (FOR)          Negative Predictive Value (NPV)
Marginal                Prevalence                         1-Prevalence

There’s a very important point that isn’t emphasized in the example. The witness’ Accuracy is less than the Prevalence. The witness could achieve an accuracy of 85% by simply guessing “green” for every car that she sees. This is what gives us the counterintuitive result that a witness who is accurate 80% of the time has less than a 50/50 chance of having seen a blue car. How accurate would they need to be to get at least 50%?

The quantity we’re interested in is the negative predictive value divided by the sum of NPV and FOR. I don’t have a letter for this, so I’m calling it q. (I’d love to hear that this thing has a name.)


Solving for A, we get:


For the case where q=0.5, this simplifies to the Prevalence. If the witness’ accuracy is equal to the percentage of green cars, we can be 50% confident that she really saw a blue car. (Note that we can’t get to accuracy= Prevalence by always guessing a green car. If this were the case, we’d never have a situation where the witness said she saw a blue car.)
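A sketch of the algebra, taking A as the witness’ accuracy and writing P for the Prevalence of green cars (P is a symbol introduced here for convenience). The witness says blue either correctly, with probability A(1-P), or incorrectly, with probability (1-A)P, so:

```latex
q = \frac{A(1-P)}{A(1-P) + (1-A)P}
\quad\Longrightarrow\quad
A = \frac{qP}{qP + (1-q)(1-P)}
```

Setting q = 1/2 makes the denominator equal to 1/2, which leaves A = P. As a check in the other direction, A = 0.8 and P = 0.85 give q = 0.12 / 0.29, or about 41%.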

How does our table look now?

                        Car is green    Car is blue    Marginal
Witness says green      72.25%          2.25%          74.5%
Witness says blue       12.75%          12.75%         25.5%
Marginal                85%             15%            100%

We can see that when the witness says the car is blue, the odds are even that the car is blue. And if she says green? That’s 97%.
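The break-even accuracy can also be checked numerically. The two helper functions below are hypothetical names of my own; `required_accuracy` is simply the posterior rearranged for accuracy, under the same assumptions as the example (two colours, and a witness who is equally accurate for both).

```r
# Posterior that the car is blue when the witness says blue.
# `prevalence` is the share of green cars.
posterior_blue <- function(accuracy, prevalence) {
  hit  <- accuracy * (1 - prevalence)   # says blue, car is blue
  miss <- (1 - accuracy) * prevalence   # says blue, car is green
  hit / (hit + miss)
}

# Accuracy needed to reach a posterior of q (the expression above, inverted).
required_accuracy <- function(q, prevalence) {
  q * prevalence / (q * prevalence + (1 - q) * (1 - prevalence))
}

even_odds_accuracy <- required_accuracy(0.5, 0.85)
check_posterior    <- posterior_blue(even_odds_accuracy, 0.85)
original_posterior <- posterior_blue(0.80, 0.85)
```

With 85% green cars, the witness needs to be right 85% of the time before "blue" is even a coin flip.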

There is a bit of contrivance at work here in that we know the population of vehicles with certainty. We also presume that the population has no other defining characteristics that could improve our accuracy. Perhaps blue cars are driven at particular hours of the day or more in certain locations? However, there is a bit of a lesson. If you’re trying to identify the rare case, you must be able to generate an accuracy that’s higher than the prevalence of the baseline case. A rare disease needs a very precise diagnostic test. Or, in my field, if you’re trying to identify the one liability claim in 1,000 that will produce a massive jury award, your predictive model must be very, very good.

Session info:

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.6        RWordPress_0.2-3
## loaded via a namespace (and not attached):
##  [1] digest_0.6.4     evaluate_0.5.5   formatR_0.10     htmltools_0.2.4 
##  [5] markdown_0.7.2   RCurl_1.95-4.1   rmarkdown_0.2.50 stringr_0.6.2   
##  [9] tools_3.1.1      XML_3.98-1.1     XMLRPC_0.3-0     yaml_2.1.13

Stuff I’ve gotten horribly wrong

I'm the first (I hope) to admit when I've gotten something wrong. I like to think I'm humble enough to realize that there are limits to my knowledge. Actually, humility doesn't enter into it. Every day I'm confronted with things that I don't know or understand. Those same limits can often blind me to being sage enough to recognize when I've gone off the rails. With time, however, knowledge begins to seep in. So, here it is, stuff I've gotten wrong:

  1. Using a list to store complicated data types in S4 objects is absurd and unnecessary.
    There's a lengthy explanation here, but suffice it to say that it's absolutely possible to vectorize individual elements of your S4 object. I've done it and it's a gas. Don't get me wrong, it's not a walk in the park, but it allows you to build up very complicated objects. So long as accessor functions are coded cleanly, things will work out. Using a list to store complicated elements is a bad idea on a number of levels.

  2. It's totally possible to extract the contents of a data frame without fear of R returning a vector.
    This is really embarrassing. All you need to do is set the parameter drop=FALSE.

  3. Computed columns might be a good idea. My thoughts on how to implement them and my response to alternate suggestions were moronic. I use reshape2 and plyr all the time. I'm still not happy that I can't simply define a computed column like I can in SQL, but I've not developed a better alternative.
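For the second item, a minimal illustration of what drop=FALSE changes. The data frame is a small made-up example.

```r
people <- data.frame(
  name = c("Ambrose", "Victor", "Jules"),
  born = c(1842, 1802, 1828),
  stringsAsFactors = FALSE
)

# Selecting a single column with [ , ] silently simplifies to a vector.
as_vector <- people[, "name"]

# drop = FALSE keeps the result a one-column data frame.
as_frame <- people[, "name", drop = FALSE]

class(as_vector)
class(as_frame)
```

The same argument works when the column is picked by position or when a row subset happens to leave only one column behind.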

I'm sure there are others. My initial epiphany about mapply and its relation to nested loops has faded. This is mostly the result of my having gained deeper experience with the vectorization of the language. I still use mapply in this way, so I'm not yet ready to concede that this approach is “wrong”, per se.
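The earlier mapply post isn't reproduced here, so the following is only a guess at the kind of pattern meant: a pair of nested loops replaced by one mapply call over every combination of the two indices. The toy calculation is arbitrary.

```r
rows <- 1:3
cols <- 1:2

# Nested loops building up a result one element at a time.
looped <- numeric(0)
for (i in rows) {
  for (j in cols) {
    looped <- c(looped, i * 10 + j)
  }
}

# The same index pairs, generated up front and handed to mapply.
# expand.grid varies its first column fastest, so j is listed first
# to match the inner loop.
pairs  <- expand.grid(j = cols, i = rows)
mapped <- mapply(function(i, j) i * 10 + j, pairs$i, pairs$j)
```

Both versions visit the pairs in the same order and produce the same six values.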

A few weeks ago, I was in Africa as part of a team of instructors demonstrating how to use R. I sat with one of the students for two hours going over some basic coding. At one point, I could tell that he was reluctant to execute a command after he'd typed it. I told him, “Learning R means making many, many mistakes. Go ahead and get started and don't worry.” His code ran fine.

Recursive assignment

Here’s yet another example where I just need to read the help files. Before I go on, I should add my own notion as to why that’s not always easy to do. On loads of message boards, you’ll see people say- correctly- that the documentation is very clear on XYZ. True. But that’s only relevant if you read the bit of the documentation that actually matters to you and you have all of the context you need to understand the terse (though accurate!) descriptions there. It’s a bit like a bus schedule in Samarkand. Absolutely clear and useful if you’re in central Asia and know where you are and where you need to go and when you need to get there. If you’ve been walking the Silk Road for weeks and can’t tell Samarkand from Tashkent, that bus schedule may not do you as much good. So it is with R documentation. Sometimes you’ll have to dust off your shoes, get patient and ask a stranger for help.

So what I had wanted to do was to understand something fairly basic. How is the following statement processed:

myObject$MyColumn[2] = "New value"

This is a typical method to manipulate individual cells in a data frame and a very natural way to structure custom R objects. So, when creating my own objects, how do I implement it? If there is customization, where does it take place? Do I access the element in the $ or the [] first? What assignment operator is being used?

To investigate, I created a very simple object with easy properties that I could assign.

setClass("Person", representation(FirstName = "character", LastName = "character", 
    Birthday = "Date"))

I then created two easy access and set methods. For reasons that will become clear in a moment, I also added a statement to indicate when the methods had been called.

setMethod("$", signature(x = "Person"), function(x, name) {
    # Announce the call, then return the requested slot
    print("Just called $ accessor")
    slot(x, name)
})
setMethod("$<-", signature(x = "Person"), function(x, name, value) {
    # Announce the call, replace the slot and return the modified object
    print("Just called $ assignment")
    slot(x, name) = value
    x
})

And I created a new object.

objPeople = new("Person", FirstName = c("Ambrose", "Victor", "Jules"), LastName = c("Bierce", 
    "Hugo", "Verne"), Birthday = seq(as.Date("2001/01/01"), as.Date("2003/12/31"), 
    by = "1 year"))

So, I can access the properties and my methods will tell me when they've been accessed. I can also assign to the member and I’ll be told when that happens as well.

objPeople$FirstName
## [1] "Just called $ accessor"
## [1] "Ambrose" "Victor"  "Jules"
objPeople$FirstName = "Joe"
## [1] "Just called $ assignment"

Now here’s the interesting bit. (Interesting if you’ve just gotten to the train station in Samarkand and are trying to find your hotel. Not so interesting if you’ve been in Uzbekistan for a few weeks.)

objPeople$FirstName[2] = "Joe"
## [1] "Just called $ accessor"
## [1] "Just called $ assignment"

The assignment produced a call to the accessor function? Why? The answer may be found in one of two places. One is the very clear, concise and speedy answer that I got to a question I posed on StackOverflow, which may be read here. Two is the R documentation, which may be found here.

This will tell us that the following two sets of statements are equivalent. (For the rest of the post, I’m suppressing output, so the messages about when the ‘$’ operators are called will not appear.)

objPeople$FirstName[2] = "Joe"

`*tmp*` <- objPeople
objPeople <- `$<-`(`*tmp*`, name = "FirstName", value = `[<-`(`*tmp*`$FirstName, 
    2, value = "Joe"))

So what’s happening? When I want to assign to a subset, three things take place. First, I use my accessor to sort out precisely which value I’m extracting from. Next, I use bracket assignment to alter the elements of a subset of that vector. Finally, I assign the whole vector back to the component of my object. This is a bit easier to see, if we take the steps one at a time.

gonzo = objPeople$FirstName
mojo = `[<-`(gonzo, 2, value = "Joe")
objPeople = `$<-`(objPeople, "FirstName", mojo)

This is why the accessor is not called if there is no subset in the assignment. In that case, the equivalent expression is simply the following:

objPeople = `$<-`(objPeople, "FirstName", "Joe")
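The equivalence is easy to confirm on something simpler than an S4 object. In this sketch a plain list stands in for the Person object, which is my own simplification; the expansion is the same.

```r
# A plain list stands in for the S4 object here.
person <- list(FirstName = c("Ambrose", "Victor", "Jules"))

# The ordinary nested assignment.
direct <- person
direct$FirstName[2] <- "Joe"

# The same thing spelled out: extract, modify the subset, assign back.
extracted <- person$FirstName
modified  <- `[<-`(extracted, 2, value = "Joe")
expanded  <- `$<-`(person, "FirstName", modified)

identical(direct, expanded)
```

Whatever class the object has, the outer `$<-` only ever sees the complete, already-modified vector.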

Welcome to Uzbekistan. Please enjoy our fine network of buses.

Watching Africa from a plane

I wrote this 8 or 9 days ago, while on a plane and am just now getting around to posting it.

It's either 7:12 PM Friday or 2:12 AM Saturday. I'm somewhere over the Mediterranean, having just passed over Tunisia. Sunrise will happen too late for me to see the Sahara. It's a mass of beige on the tiny map; a green dotted line treks doggedly forward over a baked wasteland about the size of the entirety of the US east of the Mississippi. It's wrong to talk about Africa without talking about the enormity of it. Once that's cleared, we'll be in Addis Ababa, Ethiopia. I doubt there will be any on offer- and it won't be at all an appropriate time to drink it- but I'd love to try some Tej. This is likely moot as I have no Ethiopian currency. I'll be glad for a coffee and the chance to buy my kids some postcards.

I'm not sure who else is on this plane. Quite a few Africans, of course, but more white Americans than I was expecting. At least one is wearing a purple t-shirt with letters arranged in the shape of a cross. Are they all missionaries? Well, not all, obviously. I'm not and neither is the woman sitting next to me. She works with the university in Addis. And me? I've got this laptop and my brain and I'm going to try to share the contents of both with some students in Rwanda. I'll have help, of course. I remain intellectually embarrassed to be involved at all. It was only my eagerness to travel, to experience and to learn that got me here. That and, I expect, a surfeit of volunteers.

Still, the question remains: just what are we all doing in Africa? I can only answer the question for myself and there are two bits of it. The first is the easy bit. I love to travel. This trip has enough altruism that I don't feel too guilty leaving my family for 10 days on another of my crazy whims. The second is different. If the trip had been to Peru, Slovakia, or Sri Lanka it would not have caught me the same way. Africa. The continent which is too big to fail, but for which everyone has such bleak hopes. Africa. Origin of humanity. Eden. Africa, source of cheap natural resources, from oil to uranium to diamonds to its most devalued commodity: free human labor. Africa the home of failed states, dictatorships, foreign-drawn borders, heart of darkness, punishing sun, steamy jungles and parched sand. Africa. The place I'd chosen to ignore for the first 40 years of my life. The place that draws me in the same way other places have, with the whispered voice telling me, “There must be more than this. Everyone else surely has it wrong. The only way you'll find out is to go there.”

This won't be an exhaustive experience, mind. It's really just 9 days. Nowhere near enough for insight, answers or truth. Yet more than I had when I woke up this morning. Before I dragged myself from my home, bleary-eyed, drove through the darkness to fly against the sun and compressed one day and half a world while sitting on a plane. Tomorrow, I’ll rise again and dust my eyes to greet the African dawn.

Triangle Open Data Day 2014

A rare live blog post today. I'm writing this from Triangle Open Data Day 2014. This will basically be a page of links that I'll try to get around to later.

GIS resources:

Open data resources:

Cloud development resources:

MongoDB presentation is about to start. Will likely update this post.