ruk·si

📊 R
Distribution

Updated at 2012-12-24 02:10

This note covers how to explore the distribution of a dataset using R. A distribution describes how the values of a dataset are spread out.

Picking out single data points from a large dataset can make it more relatable, but individual numbers mean little without something to compare them to.

Looking at the distribution helps answer questions such as:

  • What is the median of a dataset?
  • What happens in between the maximum value and median?
  • Do the values cluster around the median and then increase quickly?
  • Are there a lot of values clustered towards the maximum and minimum with nothing in between?
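
Several of these questions can be answered numerically before drawing anything. A minimal sketch, assuming a small made-up vector `x` (hypothetical values, not taken from the crime dataset used later):

```r
# Hypothetical sample values, used only for illustration.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

# Median: the middle of the sorted values.
x.median <- median(x)

# Minimum and maximum.
x.range <- range(x)

# Quartiles: the 25%, 50% and 75% cut points.
x.quartiles <- quantile(x, probs=c(0.25, 0.5, 0.75))

# Minimum, quartiles, mean and maximum in one call.
print(summary(x))
```

In this made-up sample the median (8) sits much closer to the minimum (2) than to the maximum (40), which hints at a long tail on the high side.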

Box-and-Whisker Plot

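A box-and-whisker plot shows the median in the middle, the upper and lower quartiles as the edges of the box, and the upper and lower fences at the ends of the whiskers. Values that lie more than 1.5 times the interquartile range above the upper quartile or below the lower quartile are treated as outliers and are drawn as dots.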

# Load crime data
crime <- read.csv(
    "http://datasets.flowingdata.com/crimeRatesByState-formatted.csv"
)

# Remove Washington, D.C.
crime.new <- crime[crime$state != "District of Columbia",]

# Remove national averages. trimws() guards against stray
# whitespace around the label in the data file.
crime.new <- crime.new[trimws(crime.new$state) != "United States",]

# Box plot.
boxplot(
    crime.new$robbery,
    horizontal=TRUE,
    main="Robbery Rates in US"
)

# Box plots for all crime rates.
boxplot(
    crime.new[,-1],
    horizontal=TRUE,
    main="Crime Rates in US"
)
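
The dots in a box plot come from a simple rule: a value is flagged as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond a quartile. A rough sketch of that rule by hand, assuming a made-up vector `x` (hypothetical values, not the crime data). `boxplot.stats()` is the base R helper that reports the numbers `boxplot()` draws.

```r
# Hypothetical sample values, used only for illustration.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

# Quartiles and interquartile range.
q1 <- quantile(x, 0.25, names=FALSE)
q3 <- quantile(x, 0.75, names=FALSE)
iqr <- q3 - q1

# Fences sit 1.5 * IQR beyond the quartiles.
lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr

# Values outside the fences are the points drawn as dots.
outliers <- x[x < lower.fence | x > upper.fence]

# boxplot.stats() uses hinges, which can differ slightly from
# quantile() on small samples, but the idea is the same.
stats <- boxplot.stats(x)
print(outliers)
print(stats$out)
```

Both approaches flag the same value in this sample. On other data the hinge-based and quantile-based fences may disagree slightly at the margins.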

Histogram

A histogram shows more detail than a box plot. Histograms look like bar charts, but they are not the same: the horizontal axis of a histogram is continuous, and each bar shows how many values fall within that range.

# Histogram.
hist(crime.new$robbery, breaks=10)

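# Multiple histograms, one per crime type, in a 3x3 grid.
par( mfrow=c(3, 3) )
# Named col.names to avoid masking the base colnames() function.
col.names <- colnames(crime.new)
for (i in 2:8) {
    # Use the filtered data so D.C. and the national averages stay out.
    hist(
        crime.new[,i],
        xlim=c(0, 3500),
        breaks=seq(0, 3500, 100),
        main=col.names[i],
        probability=TRUE,
        col="gray",
        border="white"
    )
}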
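
To see what a histogram actually computes, the bins can be inspected without drawing a chart. A small sketch, assuming a made-up vector `x` (hypothetical values, not the crime data). Passing `plot=FALSE` makes `hist()` return the bin data instead of plotting it.

```r
# Hypothetical sample values, used only for illustration.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

# Bin edges every 10 units; each bin covers a continuous interval.
h <- hist(x, breaks=seq(0, 40, 10), plot=FALSE)

print(h$breaks)   # bin edges
print(h$counts)   # number of values in each bin
print(h$density)  # counts rescaled so the bar areas sum to 1
```

The `probability=TRUE` argument used in the histograms above plots the `density` values rather than the raw counts, which keeps panels comparable and lets a density curve be drawn on the same scale.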

Density Plot

For a smoother view of a distribution, you can use a density plot. It needs a reasonably large amount of data; with too few values, the estimated curve can show a lot of unwanted noise.

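# Density plots, one per crime type, in a 3x3 grid.
par( mfrow=c(3, 3) )
# Named col.names to avoid masking the base colnames() function.
col.names <- colnames(crime.new)
for (i in 2:8) {
    # Use the filtered data so D.C. and the national averages stay out.
    d <- density( crime.new[,i] )
    # Set up empty axes, then fill the area under the curve.
    plot( d, type="n", main=col.names[i] )
    polygon( d, col="red", border="gray" )
}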

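# Histograms with density lines on top.
par(mfrow=c(3, 3))
# Named col.names to avoid masking the base colnames() function.
col.names <- colnames(crime.new)
for (i in 2:8) {
    # Use the filtered data so D.C. and the national averages stay out.
    hist(
        crime.new[,i],
        xlim=c(0, 3500),
        breaks=seq(0, 3500, 100),
        main=col.names[i],
        probability=TRUE,
        col="gray",
        border="white"
    )
    d <- density(crime.new[,i])
    lines(d, col="red")
}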
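
How smooth or noisy a density curve looks depends largely on its bandwidth. A short sketch, assuming simulated data from `rnorm()` rather than the crime data. The `adjust` argument of `density()` scales the default bandwidth, and the specific values 0.25 and 2 are arbitrary choices for illustration.

```r
# Simulated sample, used only for illustration.
set.seed(1)
x <- rnorm(50)

# Default bandwidth, a narrower one (noisier) and a wider one (smoother).
d.default <- density(x)
d.narrow <- density(x, adjust=0.25)
d.wide <- density(x, adjust=2)

# Overlay the three curves to compare them.
plot(d.narrow, main="Effect of bandwidth", col="red")
lines(d.default, col="black")
lines(d.wide, col="blue")
```

The narrow-bandwidth curve follows individual values closely and looks bumpy, while the wide one smooths over real features. With more data, the default choice tends to be less sensitive to either problem.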

Rug

The rug, which simply draws ticks for each value, is another way to show distributions. It usually accompanies another plot.

# Density and rug, using the filtered data.
d <- density(crime.new$robbery)
plot(d, type="n", main="robbery")
polygon(d, col="lightgray", border="gray")
# One tick per state along the horizontal axis.
rug(crime.new$robbery, col="red")

Violin Plot

A violin plot has a box-and-whisker in the center, surrounded by a mirrored density curve, which lets you see some of the variation that a box plot alone hides.

# Requires the vioplot package: install.packages("vioplot")
library( vioplot )
vioplot( crime.new$robbery, horizontal=TRUE, col="gray" )
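
The shape of a violin can also be approximated in base R by mirroring a density curve around a center line. This is only a rough sketch on simulated data from `rnorm()`, not how `vioplot` is implemented, and it marks the median with a single point instead of drawing a full box.

```r
# Simulated sample, used only for illustration.
set.seed(1)
x <- rnorm(100)
d <- density(x)

# Empty axes wide enough for the curve and its mirror image.
plot(
    NA,
    xlim=range(d$x),
    ylim=c(-max(d$y), max(d$y)),
    xlab="x", ylab="", yaxt="n",
    main="Violin outline by hand"
)

# Trace the density left to right, then its mirror image right to left.
polygon(c(d$x, rev(d$x)), c(d$y, -rev(d$y)), col="gray", border="gray")

# Mark the median on the center line.
points(median(x), 0, pch=19)
```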