Packages

Updated at 2012-03-05 13:39

This note covers R packages in general: an introduction to some useful R packages and how to use them.

install.packages("ggplot2")
help(package = "ggplot2")
library("ggplot2")
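Running install.packages() every time a script starts is slow and needs network access. A minimal sketch of an install-only-when-missing pattern follows; the helper names is_installed and install_if_missing are hypothetical, not part of base R, while requireNamespace() and install.packages() are standard.

```r
# Hypothetical helper: TRUE when the package namespace can be loaded.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install from the configured CRAN mirror only if missing.
install_if_missing <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  invisible(is_installed(pkg))
}

# "stats" ships with every R installation, so no download is triggered here.
install_if_missing("stats")
```

Calling install_if_missing("ggplot2") followed by library("ggplot2") would then behave like the three lines above, minus the repeated download.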

lattice

Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data that is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. See ?Lattice for an introduction.
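As a small illustration of the multivariate emphasis, the sketch below conditions a scatterplot on a third variable, using the built-in iris data. The choice of columns is only an example.

```r
library(lattice)

# One scatterplot panel per species, all drawn from a single formula.
# The part after "|" is the conditioning variable.
p <- xyplot(Sepal.Length ~ Petal.Length | Species, data = iris)

# lattice functions return a "trellis" object; printing it draws the plot.
print(p)
```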

ggbio

Static visualization for genomic data.

ggplot2

A plotting system for R based on the grammar of graphics, with better-looking plots by default.
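In grammar-of-graphics terms, a plot is data plus aesthetic mappings plus one or more layers. A minimal sketch, assuming ggplot2 is installed and using the built-in mtcars data as an arbitrary example:

```r
library(ggplot2)

# Data and aesthetic mappings first, then layers added with "+".
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

# Printing the object renders the plot.
print(p)
```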

sqldf

Query data frames with SQL.

library(sqldf)

# Stack the two built-in beaver temperature data sets.
beavers <- sqldf(
    "select * from beaver1
     union all
     select * from beaver2;"
)

#head(beavers)
#  day time  temp activ
#1 346  840 36.33     0
#2 346  850 36.34     0
#3 346  900 36.35     0
#4 346  910 36.42     0
#5 346  920 36.55     0
#6 346  930 36.69     0

movies <- data.frame(
    title=c(
        "The Great Outdoors",
        "Caddyshack",
        "Fletch",
        "Days of Thunder",
        "Crazy Heart"
    ),
    year=c(1988, 1980, 1985, 1990, 2009)
)
boxoffice <- data.frame(
    title=c(
        "The Great Outdoors",
        "Caddyshack",
        "Fletch",
        "Days of Thunder",
        "Top Gun"
    ),
    revenue=c(43455230, 39846344, 59600000, 157920733, 353816701)
)

sqldf("SELECT
        m.*
        , b.revenue
    FROM movies m
    INNER JOIN boxoffice b
        ON m.title = b.title;"
)

#               title year   revenue
#1 The Great Outdoors 1988  43455230
#2         Caddyshack 1980  39846344
#3             Fletch 1985  59600000
#4    Days of Thunder 1990 157920733

forecast

Easier time series analysis.

library(forecast)

# mdeaths: Monthly Deaths from Lung Diseases in the UK
fit <- auto.arima(mdeaths)

# Customize the prediction interval levels
forecast(fit, level=c(80, 95), h=3)

#         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
#Jan 1980       1822.863 1564.192 2081.534 1427.259 2218.467
#Feb 1980       1923.190 1635.530 2210.851 1483.251 2363.130
#Mar 1980       1789.153 1495.048 2083.258 1339.359 2238.947

plot( forecast(fit), shadecols="oldstyle" )

plyr

Apply functions to pieces of your data instead of writing loops and conditional logic. Replaces the manual split, apply and combine steps of base R.

library(plyr)

# Split a data frame by Species,
# summarize it
# and convert the results into a data frame.
ddply(
    iris,
    .(Species),
    summarise,
    mean_petal_length=mean(Petal.Length)
)

#     Species mean_petal_length
#1     setosa             1.462
#2 versicolor             4.260
#3  virginica             5.552

# Split a data frame by Species,
# summarize it,
# then convert the results into an array.
unlist(daply(iris[,4:5], .(Species), colwise(mean)))

#    setosa.Petal.Width versicolor.Petal.Width  virginica.Petal.Width
#                 0.246                  1.326                  2.026
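For reference, here is a sketch of the base R split, apply and combine steps that the first ddply() call stands in for. It uses only the built-in iris data and should give the same three means.

```r
# Split: one vector of petal lengths per species.
pieces <- split(iris$Petal.Length, iris$Species)

# Apply + combine: take the mean of each piece and collect into a named vector.
means <- sapply(pieces, mean)
means
# setosa 1.462, versicolor 4.260, virginica 5.552
```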

stringr

If you need to operate on strings.
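A few common operations, as a minimal sketch assuming stringr is installed. The fruits vector is made up for the example.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

lens     <- str_length(fruits)                # 5 6 6
has_an   <- str_detect(fruits, "an")          # FALSE TRUE FALSE
swapped  <- str_replace_all(fruits, "a", "A") # "Apple" "bAnAnA" "cherry"
prefixes <- str_sub(fruits, 1, 3)             # "app" "ban" "che"
```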

Database Driver

No need to use extra files if your data is already in a database.

install.packages("RPostgreSQL")
install.packages("RMySQL")
install.packages("RMongo")
install.packages("RODBC")
install.packages("RSQLite")
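Once a driver is installed, querying looks much the same across databases. The sketch below assumes the DBI and RSQLite packages are available and uses an in-memory SQLite database, loaded with the built-in beaver1 data, so that it runs without a server. The other drivers have their own connection arguments.

```r
library(DBI)

# Open a throwaway in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into a table so there is something to query.
dbWriteTable(con, "beaver1", beaver1)

# Results come back as an ordinary data frame.
warm <- dbGetQuery(con, "SELECT day, time, temp FROM beaver1 WHERE temp > 37")
head(warm)

dbDisconnect(con)
```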

lubridate

Easier dates. http://www.jstatsoft.org/v40/i03/paper

library(lubridate)

year("2012-12-12")
# [1] 2012

day("2012-12-12")
# [1] 12

ymd("2012-12-12")
#1 parsed with %Y-%m-%d
# [1] "2012-12-12 UTC"

halloween <- ymd("2010-10-31")
christmas <- ymd("2010-12-25")
# Older lubridate releases call this function new_interval()
holiday_span <- interval(halloween, christmas)
# [1] 2010-10-31 -- 2010-12-25
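Date arithmetic on the same two dates, as a short sketch assuming a current lubridate release. difftime() is base R; ymd() and days() come from lubridate.

```r
library(lubridate)

halloween <- ymd("2010-10-31")
christmas <- ymd("2010-12-25")

# Number of days from Halloween to Christmas.
days_between <- as.numeric(difftime(christmas, halloween, units = "days"))
days_between                                  # 55

# Adding a period of 55 days to Halloween lands on Christmas.
lands_on_christmas <- halloween + days(55) == christmas
lands_on_christmas                            # TRUE
```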

qcc

Statistical quality control: use a process's history to judge whether new observations are out of line.

A machine makes 2.5-inch nuts. The previous nuts measured 2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58 and 2.51 inches. Is the machine broken?

The same approach works for website visitor counts or database operations.

library(qcc)

# Series of values with a mean of 10 and a little random noise added.
x <- rep(10, 100) + rnorm(100)

# Test series w/ a mean of 11.
new.x <- rep(11, 15) + rnorm(15)

# qcc will flag the new points.
qcc(x, newdata=new.x, type="xbar.one")
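To make the nut question concrete, here is a hand-rolled sketch of the limits an individuals chart would draw, in base R only. It assumes the usual textbook recipe (sigma estimated from the average moving range divided by the constant 1.128, limits at three sigma); qcc's own defaults may differ in detail.

```r
nuts <- c(2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51)

# Center line: the mean of the observed sizes.
center <- mean(nuts)

# Estimate sigma from the average moving range of consecutive nuts.
# 1.128 is the standard d2 constant for ranges of two observations.
moving_range <- abs(diff(nuts))
sigma_hat <- mean(moving_range) / 1.128

# Three-sigma control limits.
lcl <- center - 3 * sigma_hat
ucl <- center + 3 * sigma_hat

# Nuts outside the limits would be flagged as out of control.
out_of_control <- nuts[nuts < lcl | nuts > ucl]
length(out_of_control)
```

With these nine measurements the limits come out at roughly 2.36 and 2.65 inches, so no nut is flagged: under this recipe the history alone gives no sign that the machine is broken.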

reshape2

Reshape data between wide and long formats.

library(reshape2)

# Generate a unique id for each row; this lets us go back to wide
# format later.
iris$id <- 1:nrow(iris)

iris.lng <- melt(iris, id=c("id", "Species"))
head(iris.lng)
#  id Species     variable value
#1  1  setosa Sepal.Length   5.1
#2  2  setosa Sepal.Length   4.9
#3  3  setosa Sepal.Length   4.7
#4  4  setosa Sepal.Length   4.6
#5  5  setosa Sepal.Length   5.0
#6  6  setosa Sepal.Length   5.4

iris.wide <- dcast(iris.lng, id + Species ~ variable)
head(iris.wide)
#  id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1  1  setosa          5.1         3.5          1.4         0.2
#2  2  setosa          4.9         3.0          1.4         0.2
#3  3  setosa          4.7         3.2          1.3         0.2
#4  4  setosa          4.6         3.1          1.5         0.2
#5  5  setosa          5.0         3.6          1.4         0.2
#6  6  setosa          5.4         3.9          1.7         0.4

library(ggplot2)

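# Plot a histogram for each numeric column in the dataset.
p <- ggplot(iris.lng, aes(x = value, fill = Species))
# The source breaks off after the trailing "+"; facet_wrap() is an assumed
# completion that gives each measured variable its own panel.
p + geom_histogram() +
    facet_wrap(~ variable, scales = "free")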

Sources