# 24 Days of R: Day 10

How often is someone nominated for an Academy Award? Who has been nominated most often? Is there a difference between leading and supporting roles? Important questions. To answer them, I'm using a list of Academy Award nominees and winners. I obtained the data from aggdata.com, which offers a few free data sets. We'll open the file, do some basic cleanup, and then have a look at the results for Michael Caine. Note that these results only run through 2010.

```r
# Read the raw file and keep only the first five columns
dfAwards = read.csv("./Data/academy_awards.csv", stringsAsFactors = FALSE)
dfAwards = dfAwards[, 1:5]
# Keep the first four characters of Year as a number
dfAwards$Year = as.numeric(substr(dfAwards$Year, 1, 4))
# Strip the dots that read.csv puts into column names
colnames(dfAwards) = gsub(".", "", colnames(dfAwards), fixed = TRUE)
# Convert the YES/NO flag to a logical
dfAwards$Won = dfAwards$Won == "YES"

dfCaine = subset(dfAwards, Nominee == "Michael Caine")
row.names(dfCaine) = NULL

FirstNominated = min(dfCaine$Year)
FirstWon = min(dfCaine$Year[dfCaine$Won == TRUE])
```
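
Counting nominations and wins, and measuring the gap between first nomination and first win, takes only a few more lines. The sketch below is a minimal illustration on a made-up data frame: `dfToy`, the nominee name, the years, and the outcomes are all hypothetical and not taken from the awards file. With the real data, `dfCaine` would take the place of `dfToy`.

```r
# Hypothetical toy data standing in for one nominee's rows; not real award records
dfToy = data.frame(
  Nominee = "Nominee A",
  Year = c(1970, 1975, 1980, 1990),
  Won = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Each row is one nomination, so the row count is the number of nominations
timesNominated = nrow(dfToy)
# Summing a logical vector counts the TRUE values, i.e. the wins
timesWon = sum(dfToy$Won)

# Gap in years between the first nomination and the first win
firstNominated = min(dfToy$Year)
firstWon = min(dfToy$Year[dfToy$Won])
yearsToFirstWin = firstWon - firstNominated
```

One caveat: for a nominee who never won, `min()` is called on an empty vector and returns `Inf` with a warning, so a real analysis would probably want to handle that case explicitly.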

Michael Caine has been nominated 6 times and has won 2 times. It took 20 years for him to win his first award. That's a long time. My guess is that actors are more likely than actresses to receive multiple nominations, and that their nominations stretch over a longer period of time. I'll split the data into actor and actress categories to test this.

```r
# Label each row by the kind of acting category it belongs to
dfAwards$Gender = "Other"
dfAwards$Gender[grepl("Actor", dfAwards$Category)] = "Actor"
dfAwards$Gender[grepl("Actress", dfAwards$Category)] = "Actress"
dfActors = subset(dfAwards, Gender != "Other")
row.names(dfActors) = NULL

# One row per nominee: first year, last year, and number of nominations
library(plyr)
plyActor = ddply(dfActors, .(Nominee, Gender), summarize, FirstNominated = min(Year),
                 NumberNominated = length(Year), LastNominated = max(Year))

# Span is the number of years between first and last nomination
plyActor$Span = plyActor$LastNominated - plyActor$FirstNominated
row.names(plyActor) = NULL
meanActor = mean(plyActor$Span[plyActor$Gender == "Actor"])
meanActress = mean(plyActor$Span[plyActor$Gender == "Actress"])
```

We see that the mean length of time between first and last nomination is fairly comparable. Men have a slightly longer span, but only just. A box plot of the span looks like this:

```r
library(ggplot2)
ggplot(plyActor, aes(factor(Gender), Span)) + geom_boxplot()
```
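
For anyone who would rather not load plyr, the same kind of per-nominee summary can be sketched in base R. This is an alternative to the `ddply` call, not what the post actually runs, and the data frame below is invented for illustration: the nominee names, genders, and years are hypothetical.

```r
# Hypothetical toy nominations; names, genders, and years are invented
dfToy = data.frame(
  Nominee = c("A", "A", "A", "B", "C", "C"),
  Gender = c("Actor", "Actor", "Actor", "Actress", "Actress", "Actress"),
  Year = c(1960, 1970, 1985, 1990, 1950, 1952),
  stringsAsFactors = FALSE
)

# aggregate() returns one row per Nominee/Gender pair, in the same order
# for all three calls because the formula and data are identical
first = aggregate(Year ~ Nominee + Gender, data = dfToy, FUN = min)
last = aggregate(Year ~ Nominee + Gender, data = dfToy, FUN = max)
count = aggregate(Year ~ Nominee + Gender, data = dfToy, FUN = length)

# Assemble the summary and compute the span between first and last nomination
toySummary = first
names(toySummary)[3] = "FirstNominated"
toySummary$LastNominated = last$Year
toySummary$NumberNominated = count$Year
toySummary$Span = toySummary$LastNominated - toySummary$FirstNominated

# Mean span per gender, as a named numeric vector
meanSpan = sapply(split(toySummary$Span, toySummary$Gender), mean)
```

In this toy set, a nominee with a single nomination gets a span of 0. The same is true of the real summary, and those zero spans pull the mean down.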

We'll do the same for number of nominations. It's a similar window into the potential longevity of someone's career, or the degree to which someone commands attention.

```r
actorNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actor"])
actressNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actress"])
ggplot(plyActor, aes(factor(Gender), NumberNominated)) + geom_boxplot()
```
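
A mean and a box plot can hide how the counts are actually distributed, so a frequency table is one complementary view. The sketch below is illustrative only: `toySummary` is a hypothetical stand-in for `plyActor`, and its values are made up.

```r
# Hypothetical stand-in for the per-nominee summary; values are invented
toySummary = data.frame(
  Gender = c("Actor", "Actor", "Actor", "Actress", "Actress"),
  NumberNominated = c(1, 1, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are genders, columns are nomination counts
countTable = table(toySummary$Gender, toySummary$NumberNominated)

# Share of nominees in each group who were nominated exactly once
shareOnce = sapply(split(toySummary$NumberNominated == 1, toySummary$Gender), mean)
```

Running the same two calls against `plyActor` would show how many nominees in each group have exactly one nomination, which the mean alone does not reveal.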

Out of curiosity, just who are the individuals whose nominations span more than 40 years? And who has been nominated 10 or more times?

```r
plyActor[plyActor$Span > 40, ]
```
```
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 321       Henry Fonda   Actor           1940               2          1981
## 455    Julie Christie Actress           1965               4          2007
## 466 Katharine Hepburn Actress           1932              12          1981
## 655       Paul Newman   Actor           1958               9          2002
## 671     Peter O'Toole   Actor           1962               8          2006
##     Span
## 321   41
## 455   42
## 466   49
## 655   44
## 671   44
```
```r
plyActor[plyActor$NumberNominated >= 10, ]
```
```
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 77        Bette Davis Actress           1934              11          1962
## 345    Jack Nicholson   Actor           1969              12          2002
## 466 Katharine Hepburn Actress           1932              12          1981
## 594      Meryl Streep Actress           1978              16          2009
##     Span
## 77    28
## 345   33
## 466   49
## 594   31
```

OK, I could see that. Katharine Hepburn, Paul Newman, Julie Christie, Bette Davis. A superficial look suggests that neither gender is obviously disadvantaged by age. Mind, I'd love to have more data to explore this further. In the meantime, I think I'm going to go watch "On Golden Pond". I saw it when it first came out and it was clearly one hell of a movie for older performers.

Tomorrow: Unsure what will be covered. I'm going to a PostgreSQL meetup, so possibly that.

```r
sessionInfo()
```
```
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  plyr_1.8
##
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] RCurl_1.95-4.1     reshape2_1.2.2     scales_0.2.3
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1
## [19] XMLRPC_0.3-0
```