📊 R
Basics

Updated at 2017-10-13 17:40

This note covers R syntax and basic usage. R is a programming language and environment for statistical computing.

Use an IDE, for example RStudio. It will make your life a lot easier.

How to get started on a Mac: save the following as test.R.

write('{"epoch" : 1, "loss" : 0.9}', stdout())
write('{"epoch" : 2, "loss" : 0.5}', stdout())
write('{"epoch" : 3, "loss" : 0.1}', stdout())

Then install R with Homebrew and run the script from a terminal.

brew install r
Rscript test.R

Every R installation comes with the datasets package, which contains around 100 helpful example datasets.

# Speed (mph) and stopping distances (ft) of cars in the 1920s.
head(cars)
#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
# 4     7   22
# 5     8   16
# 6     9   10

# Edgar Anderson's Iris Data
# 150 rows of 5 variables each
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

# Air quality in New York, May to September in 1973.
head(airquality)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

# A time series of 468 monthly CO2 observations at Mauna Loa, from 1959 to 1997.
head(co2)
# [1] 315.42 316.31 316.50 317.56 318.13 318.00

# Topographic information on Auckland's Maunga Whau volcano
# A matrix with 87 rows and 61 columns
head(volcano)
# ...
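
To get a feel for the bundled datasets, a minimal sketch like the one below checks their shapes. It only uses base R functions (dim, class, length, str) on the datasets already shown above, so nothing here is an assumption beyond a standard R installation.

```r
# Dimensions of the data frames: rows first, then columns.
dim(cars)        # 50 2
dim(iris)        # 150 5
dim(airquality)  # 153 6

# co2 is a time series, not a data frame.
class(co2)       # "ts"
length(co2)      # 468

# volcano is a numeric matrix.
dim(volcano)     # 87 61

# str() gives a compact overview of any object.
str(cars)
```

Calling data() without arguments lists every dataset that is available.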

Summary Statistics

You can get summaries for most variable types.

targets <- read.csv("targets.csv")
str(targets) # Structure: column types and the first values.
summary(targets) # Min, max, median, mean, quartiles and NA count per column.
table(targets) # Counts of each existing value (or combination of values across columns).
unique(targets) # Single instance of each existing value (each distinct row for a data frame).
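
Since targets.csv is not included here, the same functions can be tried on a bundled dataset. This is only a sketch: airquality stands in for the CSV file, and the variable names month_counts and months are made up for the example.

```r
# airquality stands in for targets.csv (an assumption for this sketch).
str(airquality)               # column types and the first values
summary(airquality$Ozone)     # min, quartiles, median, mean and NA count

# Counts of each existing value in one column.
month_counts <- table(airquality$Month)
month_counts

# Single instance of each existing value.
months <- unique(airquality$Month)
months                        # 5 6 7 8 9
```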

Mean: add the values together and divide by their count.

# Number of limbs each crew member has.
limbs <- c(4, 3, 4, 3, 2, 4)
names(limbs) <- c('One-Eye', 'Peg-Leg', 'Smitty', 'Hook', 'Scooter', 'Dan')

# Average limb count a.k.a. mean.
mean(limbs)

# Generating a bar plot with average line.
barplot(limbs)
abline(h = mean(limbs))
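
As a quick check of the definition, the mean can also be computed by hand. The name manual_mean is only used for this sketch.

```r
limbs <- c(4, 3, 4, 3, 2, 4)

# Add the values together and divide by their count: 20 / 6.
manual_mean <- sum(limbs) / length(limbs)

# The built-in function gives the same result.
all.equal(manual_mean, mean(limbs))   # TRUE
```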

Median: sort the values and choose the middle one.

abline(h = median(limbs))
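
The median can be worked out by hand in the same way. A minimal sketch, assuming an even number of values as in limbs, where the two middle values are averaged; sorted and manual_median are names made up for the example.

```r
limbs <- c(4, 3, 4, 3, 2, 4)

sorted <- sort(limbs)   # 2 3 3 4 4 4
n <- length(sorted)

# Even count: average the two middle values, (3 + 4) / 2.
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

manual_median           # 3.5
median(limbs)           # 3.5
```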

Standard deviation: describes how far values typically spread around the mean of a data set.

# Loot amounts from raids.
pounds <- c(45000, 50000, 35000, 40000, 35000, 45000, 10000, 15000)
barplot(pounds)
meanValue <- mean(pounds)
abline(h = meanValue)

# What is a normal loot amount?
# Use standard deviation to see the normal range.
deviation <- sd(pounds)
abline(h = meanValue + deviation)
abline(h = meanValue - deviation)
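
What sd() computes can be spelled out with the sample standard deviation formula. This is a sketch for illustration; manual_sd and typical are made-up names.

```r
pounds <- c(45000, 50000, 35000, 40000, 35000, 45000, 10000, 15000)
n <- length(pounds)

# Sample standard deviation: squared distances from the mean,
# summed, divided by n - 1, then square-rooted.
manual_sd <- sqrt(sum((pounds - mean(pounds))^2) / (n - 1))
all.equal(manual_sd, sd(pounds))   # TRUE

# Loot amounts within one standard deviation of the mean.
typical <- pounds[abs(pounds - mean(pounds)) <= manual_sd]
typical   # 45000 35000 40000 35000 45000
```

Note that sd() divides by n - 1, not n, because it estimates the deviation from a sample.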

Apply

Apply runs a function over the rows or columns of a matrix or data frame. The second argument picks the direction: 1 for rows, 2 for columns.

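targets <- read.csv("targets.csv")
apply(targets, 2, sum) # Sum of each column.
apply(targets, 1, sum, na.rm=TRUE) # Sum of each row; extra arguments are passed on to sum.

# Standard error of the mean for each column.
apply(targets, 2, function(x) {
    sd(x) / sqrt(length(x))
})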
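
To see the row and column directions without targets.csv, here is a small sketch on a made-up matrix. The matrix m and its column names are hypothetical.

```r
# A made-up matrix: 3 rows, 2 columns, one missing value.
# Column a holds 1, 2, NA and column b holds 4, 5, 6.
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

# 2 = one result per column.
col_sums <- apply(m, 2, sum, na.rm = TRUE)
col_sums   # a = 3, b = 15

# 1 = one result per row.
row_sums <- apply(m, 1, sum, na.rm = TRUE)
row_sums   # 5 7 6

# Standard error of column b: sd of 4, 5, 6 is 1, so 1 / sqrt(3).
se_b <- sd(m[, "b"]) / sqrt(nrow(m))
```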
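
Tapply runs a function on the values of one variable, grouped by the values of another.

products <- read.csv("products.csv")
# Mean total price for each condition.
tapply(products$totalPrice, products$condition, mean)
# ==>
# Mean total price for each value of wheels.
tapply(products$totalPrice, products$wheels, mean)
# ==>
# Number of products for each combination of condition and wheels.
tapply(products$totalPrice, products[ ,c("condition","wheels")], length)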
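
The same grouping can be tried on a bundled dataset. A sketch that uses iris in place of products.csv, which is an assumption made only so the example runs anywhere; species_means and species_counts are made-up names.

```r
# Mean sepal length for each species.
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
#     setosa versicolor  virginica
#      5.006      5.936      6.588

# Number of rows for each species.
species_counts <- tapply(iris$Sepal.Length, iris$Species, length)
species_counts   # 50 for each species
```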

With

With is useful for one-off calculations on a dataset. Column names can be used directly inside the expression.

products <- read.csv("products.csv")

# With products, calculate total price minus shipping cost.
priceProfit <- with(products, totalPrice - shippingCost)
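
A runnable sketch of the same idea, using the bundled cars dataset instead of products.csv. The names ratio and same are made up for the example.

```r
# Stopping distance per mph for every row of cars.
ratio <- with(cars, dist / speed)

# The same calculation written without with().
same <- cars$dist / cars$speed

identical(ratio, same)   # TRUE
head(ratio, 3)           # 0.5 2.5 0.5714286
```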

Within is useful for adding new variables to a dataset. It returns a modified copy and leaves the original untouched.

products <- read.csv("products.csv")
pr <- within(products, {
    priceProfit <- totalPrice - shippingCost
})
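
The copy behaviour is easy to confirm on a bundled dataset. A sketch using cars in place of products.csv; cars2 and ratio are made-up names.

```r
# Add a new column to a copy of cars.
cars2 <- within(cars, {
    ratio <- dist / speed
})

names(cars2)   # "speed" "dist" "ratio"
names(cars)    # "speed" "dist", the original is unchanged
```
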

Aggregate computes a summary for each group described by a formula.

products <- read.csv("products.csv")

# Only the mean total price for each wheels and condition group is returned.
aggregate(totalPrice ~ wheels + condition, products, mean)

# Dot . means all the other columns.
aggregate(. ~ wheels + condition, products, mean)
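
For a version that runs without products.csv, here is a sketch on the bundled iris dataset, grouping by Species. The names one_mean and all_means are made up for the example.

```r
# Mean sepal length for each species; only the named columns are returned.
one_mean <- aggregate(Sepal.Length ~ Species, iris, mean)
names(one_mean)    # "Species" "Sepal.Length"

# The dot stands for every column not used on the right-hand side.
all_means <- aggregate(. ~ Species, iris, mean)
names(all_means)
# "Species" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```

The dot form only makes sense when the summary function works on every remaining column, as mean does for the numeric columns of iris.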

Sources

  • Google Developers R Programming Videos
  • Try R