24 Days of R: Day 13

In this post, I'll look at a very quick way to explore the relationships in a database. I had a bit of fun with my first network graph, and plotting the connections between tables and views seems like a natural extension. Here, I'm going to create a bare-bones insurance database. We have a defined business segment, which contains one or more accounts. Each account purchases one or more policies and each policy has one or more claims. Finally, each claim is evaluated over the course of its existence and paid and outstanding losses are recorded. Very basic. I'm not going to describe the schema here (though you can see it on my GitHub at the RxR project). The point is to explore the structure of a database without knowing anything at all about it.

I'm using (free, but closed-source) SQL Server Compact, as I'm having difficulty getting the PostgreSQL ODBC driver to play nicely with RODBC. I've just started diagnosing this (it's almost certainly a 32- vs. 64-bit issue), but if anyone has suggestions, I'm all ears. I'm trying to make Postgres my default database, but it has to work with R. That's a long-winded way of explaining the line below where I remove the sysdiagrams table from my output.
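If the ODBC route keeps misbehaving, a DBI-based connection is one way around it. Below is a minimal sketch using the RPostgreSQL package; the dbname, host, and user values are placeholders for whatever your setup actually uses.

# A sketch of a DBI-based alternative to ODBC; the connection
# parameters here are placeholders, not a working configuration
library(RPostgreSQL)
drv = dbDriver("PostgreSQL")
conn = dbConnect(drv, dbname = "RxR", host = "localhost", user = "postgres")
dbListTables(conn)  # roughly analogous to RODBC's sqlTables()
dbDisconnect(conn)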

RODBC has a number of functions to report metadata. I've used sqlColumns as an aid for more robust ETL, but haven't played with some of the others. Let's see what they can do.

library(RODBC)
myChannel = odbcConnect(dsn = "RxR")

dfTables = sqlTables(myChannel, schema = "dbo")
dfTables = dfTables[dfTables$TABLE_NAME != "sysdiagrams", ]
tableNames = dfTables$TABLE_NAME[dfTables$TABLE_TYPE == "TABLE"]
queryNames = dfTables$TABLE_NAME[dfTables$TABLE_TYPE == "VIEW"]

dfColumns = lapply(c(tableNames, queryNames), sqlColumns, channel = myChannel)
dfColumns = do.call("rbind", dfColumns)

dfKeys = lapply(tableNames, sqlPrimaryKeys, channel = myChannel)
dfKeys = do.call("rbind", dfKeys)
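Before building anything from these, it's worth a quick peek at what came back (output not shown here). The result sets follow the ODBC catalog conventions: sqlColumns returns one row per column, and sqlPrimaryKeys one row per key column.

# Peek at the metadata; column names follow the ODBC catalog conventions
head(dfColumns[, c("TABLE_NAME", "COLUMN_NAME", "TYPE_NAME")])
head(dfKeys[, c("TABLE_NAME", "COLUMN_NAME", "PK_NAME")])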

I'm an amateur programmer, so all of my code is evolutionary: it starts specific and moves toward something more general. I've taken yesterday's function to create a relationship table and generalized it for the case of database columns. In the two examples I've worked with, I'm examining groups whose members may participate in other groups. This could be a musician with a side project, or a column whose value is used as a foreign key or returned by a query.

CreateRelation = function(dfTable, IntraColumn, GroupName) {

    # All of the values belonging to a single group; here, the column
    # names of one table
    myVector = dfTable[, IntraColumn]

    # Every pairwise combination of group members. Note that this
    # assumes at least two members; combn() will fail on a one-column table.
    indices = combn(length(myVector), 2)

    dfRelate = data.frame(from = myVector[indices[1, ]], to = myVector[indices[2, ]])

    # Tag each edge with the name of the group it belongs to
    dfRelate$GroupName = dfTable[, GroupName][1]

    dfRelate

}
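To see what this produces, here's a toy illustration with made-up data: a table with three columns yields all three pairwise edges, each tagged with the table's name.

dfToy = data.frame(COLUMN_NAME = c("A", "B", "C"), TABLE_NAME = "tblToy")
CreateRelation(dfToy, "COLUMN_NAME", "TABLE_NAME")
##   from to GroupName
## 1    A  B    tblToy
## 2    A  C    tblToy
## 3    B  C    tblToy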

lstColumns = split(dfColumns, dfColumns$TABLE_NAME)
dfRelations = lapply(lstColumns, CreateRelation, "COLUMN_NAME", "TABLE_NAME")
dfRelations = do.call("rbind", dfRelations)

With that done, we can display a network graph of all of the columns and queries in the database.

library(igraph)

# g = graph.data.frame(dfRelations, directed = FALSE, vertices = dfColumns[, c('COLUMN_NAME', 'TABLE_NAME')])
g = graph.data.frame(dfRelations, directed = FALSE)

set.seed(1234)
plot(g)

[Figure: network graph of the database's columns and views]
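If you want the vertices shaded by the table they belong to, here's one way to do it (a sketch, not what I did above): map each column back to the first table it appears in and let igraph's default palette do the rest. Columns shared across tables will simply take the color of their first match.

# A sketch: color each vertex by the first table its column appears in
vertexTable = dfColumns$TABLE_NAME[match(V(g)$name, dfColumns$COLUMN_NAME)]
V(g)$color = as.integer(factor(vertexTable))
plot(g)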

I love this. Somehow the account name, but not the account ID, is related to claim count. Has the query been properly designed? Policy information sits by itself, walled off from the marketing segment and the claims information, as it should be. There are just two steps from paid loss to business segment. This also makes a bit of sense.

There are several dozen things that I'd like to do with this, but they'll have to wait until tomorrow or later. I'm going to keep playing with this, as I think it's a fantastic way to get an initial read on the complexity of a new database. There are several things in this world that I absolutely love. One of them is a map. Another? Metadata!

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 igraph_0.6.6     RODBC_1.3-8     
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    RCurl_1.95-4.1
## [5] stringr_0.6.2  tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0

3 thoughts on “24 Days of R: Day 13”

  1. This looks really handy. I get sent some pretty complicated MS Access databases and would like to use such a toolkit to get a quick idea of how things are related (the inbuilt database documenter does create a graph, but I think yours looks useful, especially for accessing the metadata for use in scripted queries and subsequent data-wrangling steps). Do you know if this will work with .mdb files?

    WRT your comment: “difficulty getting the PostgreSQL ODBC driver to play nicely with RODBC”

    I also prefer postgres and have been working on a couple of helper functions to connect using the RPostgreSQL package (with inbuilt driver):

    * see https://github.com/swish-climate-impact-assessment/swishdbtools/blob/master/R/connect2postgres.r for a function that will ask for your credentials
    * or if you use the pgpass file to store your password you can just connect using the name of the database using https://github.com/swish-climate-impact-assessment/swishdbtools/blob/master/R/connect2postgres2.r
