R - Packages
This note covers R packages in general: a short introduction to some useful packages and how to use them.
install.packages("ggplot2")
help(package = "ggplot2")
library("ggplot2")
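When a script depends on several packages, it helps to check which ones are missing before calling install.packages(). A minimal sketch using requireNamespace() from base R; the helper name missing_packages is made up for this note and is not part of any package.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
# Note that requireNamespace() loads the namespace of each package it finds.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "sqldf", "plyr"))
print(to_install)

# To install whatever is missing, run:
# install.packages(to_install)
```

After installation, load each package with library() as usual.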
lattice
Lattice is a powerful and elegant high-level data visualization system with an emphasis on multivariate data. It is sufficient for typical graphics needs and flexible enough to handle most nonstandard requirements. See ?Lattice for an introduction.
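A small sketch of the multivariate emphasis, assuming the lattice package is available (it normally ships with R). The choice of the iris columns is only for illustration.

```r
library(lattice)

# One scatter-plot panel per species; the `|` operator conditions on Species.
p <- xyplot(Sepal.Length ~ Petal.Length | Species, data = iris,
            layout = c(3, 1))

# Lattice plots are objects; they are drawn when printed.
print(p)
```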
ggbio
Static visualization for genomic data.
ggplot2
A plotting system for R based on the grammar of graphics. Its defaults produce better-looking plots than base graphics.
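A minimal sketch of the grammar-of-graphics idea, assuming ggplot2 is installed: data and aesthetic mappings come first, then layers are added with `+`. The columns chosen from iris are only for illustration.

```r
library(ggplot2)

# Map columns to aesthetics, then add a point layer and axis labels.
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(x = "Petal length (cm)", y = "Petal width (cm)")

print(p)
```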
sqldf
Query your data frames with SQL statements.
library(sqldf)
# beaver1 and beaver2 are data frames from R's built-in datasets package.
beavers <- sqldf(
"select * from beaver1
union all
select * from beaver2;"
)
#head(beavers)
# day time temp activ
#1 346 840 36.33 0
#2 346 850 36.34 0
#3 346 900 36.35 0
#4 346 910 36.42 0
#5 346 920 36.55 0
#6 346 930 36.69 0
movies <- data.frame(
title=c(
"The Great Outdoors",
"Caddyshack",
"Fletch",
"Days of Thunder",
"Crazy Heart"
),
year=c(1988, 1980, 1985, 1990, 2009)
)
boxoffice <- data.frame(
title=c(
"The Great Outdoors",
"Caddyshack",
"Fletch",
"Days of Thunder",
"Top Gun"
),
revenue=c(43455230, 39846344, 59600000, 157920733, 353816701)
)
sqldf("SELECT
m.*
, b.revenue
FROM movies m
INNER JOIN boxoffice b
ON m.title = b.title;"
)
# title year revenue
#1 The Great Outdoors 1988 43455230
#2 Caddyshack 1980 39846344
#3 Fletch 1985 59600000
#4 Days of Thunder 1990 157920733
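For comparison, roughly the same inner join can be written in base R with merge(), without sqldf. This is a sketch using the same two data frames; unlike the SQL result, merge() sorts the rows by the join key by default.

```r
movies <- data.frame(
  title = c("The Great Outdoors", "Caddyshack", "Fletch",
            "Days of Thunder", "Crazy Heart"),
  year = c(1988, 1980, 1985, 1990, 2009)
)
boxoffice <- data.frame(
  title = c("The Great Outdoors", "Caddyshack", "Fletch",
            "Days of Thunder", "Top Gun"),
  revenue = c(43455230, 39846344, 59600000, 157920733, 353816701)
)

# Inner join: keeps only titles present in both data frames.
joined <- merge(movies, boxoffice, by = "title")
print(joined)

# all = TRUE gives a full outer join; unmatched cells are filled with NA.
full <- merge(movies, boxoffice, by = "title", all = TRUE)
```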
forecast
Easier time series analysis.
library(forecast)
# mdeaths: Monthly Deaths from Lung Diseases in the UK
fit <- auto.arima(mdeaths)
# Customize your confidence intervals
forecast(fit, level=c(80, 95), h=3)
# Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#Jan 1980 1822.863 1564.192 2081.534 1427.259 2218.467
#Feb 1980 1923.190 1635.530 2210.851 1483.251 2363.130
#Mar 1980 1789.153 1495.048 2083.258 1339.359 2238.947
plot( forecast(fit), shadecols="oldstyle" )
plyr
Avoid hand-written conditional logic by applying functions to pieces of your data. plyr wraps the split, apply and combine steps of base R in a single call.
library(plyr)
# Split a data frame by Species,
# summarize it
# and convert the results into a data frame.
ddply(
iris,
.(Species),
summarise,
mean_petal_length=mean(Petal.Length)
)
# Species mean_petal_length
#1 setosa 1.462
#2 versicolor 4.260
#3 virginica 5.552
# Split a data frame by Species,
# summarize it,
# then convert the results into an array.
unlist(daply(iris[,4:5], .(Species), colwise(mean)))
# setosa.Petal.Width versicolor.Petal.Width virginica.Petal.Width
# 0.246 1.326 2.026
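To see what plyr replaces, here is a sketch of the first example written as explicit split, apply and combine steps in base R. The variable names are made up for illustration.

```r
# Split: one vector of petal lengths per species.
pieces <- split(iris$Petal.Length, iris$Species)

# Apply: compute the mean of each piece.
means <- sapply(pieces, mean)

# Combine: put the results back into a data frame.
result <- data.frame(
  Species = names(means),
  mean_petal_length = unname(means)
)
print(result)
```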
stringr
If you need to operate on strings.
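A few common operations, as a sketch assuming stringr is installed. The example strings are made up; all functions share the str_ prefix and take the string as their first argument.

```r
library(stringr)

titles <- c("The Great Outdoors", "Caddyshack", "Days of Thunder")

has_days   <- str_detect(titles, "Days")        # pattern matching
underscore <- str_replace_all(titles, " ", "_") # replace every space
prefix     <- str_sub(titles, 1, 3)             # first three characters
n_chars    <- str_length(titles)                # number of characters
upper      <- str_to_upper(titles)              # upper-case copy

print(underscore)
```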
Database Drivers
No need to export to intermediate files if your data is already in a database.
install.packages("RPostgreSQL")
install.packages("RMySQL")
install.packages("RMongo")
install.packages("RODBC")
install.packages("RSQLite")
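A sketch of querying a database directly, assuming the DBI and RSQLite packages are installed. It uses an in-memory SQLite database so that it runs without a server; other DBI-compatible drivers use the same dbConnect() / dbGetQuery() pattern with their own connection arguments.

```r
library(DBI)

# In-memory SQLite database, so nothing is written to disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into the database to have something to query.
dbWriteTable(con, "beaver1", beaver1)

# Let the database do the filtering and return a data frame.
warm <- dbGetQuery(con, "SELECT day, time, temp FROM beaver1 WHERE temp > 37")

dbDisconnect(con)
head(warm)
```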
lubridate
Easier dates. http://www.jstatsoft.org/v40/i03/paper
library(lubridate)
year("2012-12-12")
# [1] 2012
day("2012-12-12")
# [1] 12
ymd("2012-12-12")
#1 parsed with %Y-%m-%d
# [1] "2012-12-12 UTC"
halloween <- ymd("2010-10-31")
christmas <- ymd("2010-12-25")
# new_interval() was deprecated in later lubridate releases; interval() replaces it.
span <- interval(halloween, christmas)
# [1] 2010-10-31 -- 2010-12-25
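Once you have an interval you can measure it or test whether a date falls inside it. A short sketch assuming a lubridate version that provides interval() and time_length(); the variable names are made up for illustration.

```r
library(lubridate)

halloween <- ymd("2010-10-31")
christmas <- ymd("2010-12-25")
span <- interval(halloween, christmas)

# Length of the interval in days.
days_between <- time_length(span, unit = "day")

# Does a given date fall inside the interval?
thanksgiving <- ymd("2010-11-25")
inside <- thanksgiving %within% span

print(days_between)
print(inside)
```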
qcc
Statistical quality control: use a process's history to judge whether new measurements are out of line.
Example: a machine is supposed to produce 2.5-inch nuts.
Previous nuts: 2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51
Is the machine broken?
The same approach works for monitoring website visitor counts or database operations.
library(qcc)
# Baseline series with a mean of 10 plus a little random noise.
x <- rep(10, 100) + rnorm(100)
# New series with a mean of 11.
new.x <- rep(11, 15) + rnorm(15)
# qcc charts both series and flags new points that violate its control rules.
qcc(x, newdata=new.x, type="xbar.one")
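The nut measurements listed earlier in this section can be checked by hand with the usual individuals-chart rule: center line at the mean, limits at three estimated standard deviations. This is a base R sketch of that rule, not qcc's own code; it assumes sigma is estimated from the average moving range divided by the constant 1.128, and qcc's defaults may differ in detail.

```r
nuts <- c(2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51)

center <- mean(nuts)

# Estimate sigma from the average moving range of consecutive points.
moving_range <- abs(diff(nuts))
sigma_hat <- mean(moving_range) / 1.128  # d2 constant for ranges of size 2

lcl <- center - 3 * sigma_hat  # lower control limit
ucl <- center + 3 * sigma_hat  # upper control limit

out_of_control <- nuts < lcl | nuts > ucl
print(c(lcl = lcl, center = center, ucl = ucl))
print(any(out_of_control))
```

Under this rule none of the nine measurements falls outside the limits, so the history alone gives no sign that the machine is broken.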
reshape2
Formatting your data.
library(reshape2)
# Generate a unique id for each row; this lets us go back to wide
# format later.
iris$id <- 1:nrow(iris)
iris.lng <- melt(iris, id=c("id", "Species"))
head(iris.lng)
# id Species variable value
#1 1 setosa Sepal.Length 5.1
#2 2 setosa Sepal.Length 4.9
#3 3 setosa Sepal.Length 4.7
#4 4 setosa Sepal.Length 4.6
#5 5 setosa Sepal.Length 5.0
#6 6 setosa Sepal.Length 5.4
iris.wide <- dcast(iris.lng, id + Species ~ variable)
head(iris.wide)
# id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 1 setosa 5.1 3.5 1.4 0.2
#2 2 setosa 4.9 3.0 1.4 0.2
#3 3 setosa 4.7 3.2 1.3 0.2
#4 4 setosa 4.6 3.1 1.5 0.2
#5 5 setosa 5.0 3.6 1.4 0.2
#6 6 setosa 5.4 3.9 1.7 0.4
library(ggplot2)
# Plots a histogram for each numeric column in the dataset
p <- ggplot(aes(x=value, fill=Species), data=iris.lng)
p + geom_histogram() +