Querying, parsimony and golden hammers
December 20, 2012 Leave a comment
I love it when things are easy. I love it so much that I’ll spend a great deal of time and effort to keep things simple. At the same time, though, I think there’s some value in expending effort in pursuit of something. If you want to understand a thing, you have to spend time with it and accept it on its own terms.
Which brings me to sqldf and this post describing same. Make no mistake. I have loved relational databases since I first became aware of their existence close to 20 years ago. They’re lovely. Yeah, I get it, there are new options now and I look forward to having enough minutes in the day to get my hands dirty with MongoDB and Riak or whatever. But I love me some SQL in the same way that I love the Grateful Dead. It’s like a mental soundtrack to me, regardless of what the kids these days are listening to.
So why am I skeptical about sqldf? I think because it feels like cheating. Actually, there are two reasons and I’ll get to the second one in a minute. But the first is definitely cheating. When you apply SQL grammar to a dataframe, you can get quick results, sure, but it means that you don’t learn how to manipulate the data using native R technology. You don’t have to work your way through the merge, order, by or aggregate functions. You won’t learn Hadley Wickham‘s plyr package (which, just so we’re clear, I haven’t had time to sort out yet). I think you miss something. A dataframe is an incredibly simple construct. The operations that one would naturally want to perform on it are the sort of things that make you think, “Surely someone has already worked out how to do this”. And they have. Heck, when it comes to aggregation there are all manner of options as this post points out. But they’ve gone farther than that. There are dataframe operations that go beyond what SQL would contemplate.
SQL is a lovely, familiar idiom and it’s nice to fall back on, but I would recommend against it. I could listen to nothing but the Grateful Dead for rest of my life. (No, really. I could.) But that would mean that I never check out Travis, Wilco, Disco Biscuits or whatever. If you lapse into SQL, you’re missing out on R.
And while we’re at it, shouldn’t all that great SQL reside in your database? Are you only looking at your data in R? R is pretty fantastic, but it was never designed to plow through piles and piles of numbers the way your database server was. If you understand the problem you want to study, then your data ought to be in some presentable form by the time it gets to R. Heaven knows, that’s how I work it. And once the data set is prepped in my database, it means that I can distribute it in more than one way. Sure, that’s almost exclusively Excel, but I have users who wouldn’t be happy about having to learn R just to get their hands on one piece of information.
Don’t get me wrong, I may use sqldf now and again, just for kicks. But it’s limited. Like hearing the Dead play “Big River” and calling it Johnny Cash. It’s pretty awesome, but it’s not the genuine article.