How I learned to stop worrying and really love lists
December 14, 2012
One of the first weird things to get used to in R is unlearning some of the things that you think you know. As often happens, this reminds me of a quote I once read about Zen, which went about like this (I’m paraphrasing), “When I knew nothing of Zen, mountains were mountains, rivers were rivers and the sky was the sky. When I knew a little of Zen, the mountains were not mountains, the rivers were not rivers and the sky was not the sky. When I fully understood Zen, mountains were mountains, rivers were rivers and the sky was the sky.” When I knew a little bit of R, a list was not a list. Actually, I wasn’t sure what to make of it. Is it a structure? Is it a linked list? Is it an object array?
I’m slowly reaching the point where I begin to understand that a list is a list. I’m not fully Zen on lists yet, but I do know this. I think they might be awesome. For me, the first circle of enlightenment for R comes when I realize how much more powerful and flexible it is than any of the other tools I’ve used (yes, even Matlab). The second circle of enlightenment comes with an appreciation of the apply functions and that means understanding lists. Here’s a very simple construct that I’ve started applying (ha!) often:
df = GetTriangleData()
lCompanyDFs = split(df, df$GRCODE)
lProjections = lapply(lCompanyDFs, SomeFunction)
dfResults = do.call("rbind", lProjections)
Here’s how the process works in a nutshell:
1) Get a pile of data, which contains at least one categorical variable. In the NFL data set, that’s a team; in the NAIC insurance data set (to be discussed in a forthcoming post), that’s an insurance company.
2) Split the data. This will return a list whose elements are all dataframes. (Or at least in this case it will.)
3) Apply some function across the entire list.
4) Stitch the results back together with a call to rbind.
Lather. Rinse. Repeat.
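To make the four steps above concrete, here’s a minimal, self-contained sketch. The data frame and the SummarizeGroup function are invented for illustration (GetTriangleData and SomeFunction from the snippet above aren’t shown in this post); GRCODE stands in for whatever categorical variable you happen to have.

```r
# Toy data: two "companies" identified by GRCODE
df <- data.frame(
  GRCODE = rep(c("A", "B"), each = 3),
  loss   = c(10, 20, 30, 5, 15, 25)
)

# A stand-in for "SomeFunction": returns one row of results per group
SummarizeGroup <- function(d) {
  data.frame(GRCODE = d$GRCODE[1], total = sum(d$loss))
}

lCompanyDFs  <- split(df, df$GRCODE)          # 1-2) list of data frames, one per GRCODE
lProjections <- lapply(lCompanyDFs, SummarizeGroup)  # 3) apply across the list
dfResults    <- do.call("rbind", lProjections)       # 4) stitch back together
```

After this, dfResults is a single data frame with one row per company, ready for the next round of lather-rinse-repeat.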
Once you’re in the second circle of enlightenment, you’ll never again write a “for” loop. This has been a lifesaver to me when I’m trying to crunch through a giant set of data. I can pull data from our warehouse and carry out routine actions for each of our 500 accounts, for each of our lines of business, for each accident/policy year, etc. I split the data along a different axis and the rest of the analysis pretty much takes care of itself.
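For splitting along several axes at once, interaction() will combine multiple columns into a single grouping factor before the split. A sketch, with hypothetical account/line/year column names:

```r
# Toy data with three candidate grouping "axes"
df <- data.frame(
  account = rep(c("Acme", "Zenith"), each = 4),
  line    = rep(c("Auto", "Home"), times = 4),
  year    = rep(c(2011, 2012), each = 2, times = 2),
  premium = 1:8
)

# interaction() builds one grouping factor from several columns;
# drop = TRUE discards combinations that never occur in the data
lGroups <- split(df, interaction(df$account, df$line, df$year, drop = TRUE))
lTotals <- lapply(lGroups, function(d) sum(d$premium))
```

Swap the columns passed to interaction() and the same downstream lapply/rbind machinery runs unchanged.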