ruk·si

📊 Data Visualization

Updated at 2014-11-10 11:48

Data visualization is communicating data-based ideas in a human-friendly format.

Data -> Visualization -> Perception -> Information -> Memory

Data visualization is a key tool in exploring any data. It exposes patterns, highlights cause-and-effect relationships and supports decision making. Using visualization as a research tool has four major steps:

  • Record Data: Record or collect the data you wish to analyze. Frequently, data is already provided, but then you should check alternative sources to back up your research. Prefer keeping different sources separate for more in-depth analysis.
  • Prepare Data: The data must be transformed into a format that can be utilized. This usually means having the data in a computer-readable format like a spreadsheet.
  • Analyze Data: Create a lot of visualizations to find interesting patterns in the data. Choose the aspects you want to bring to the front. Every visualization should have a point it wishes to make.
  • Finalize Visualization: Remove the unnecessary and tidy up the visualization. Focus on communicating the insight you got from analyzing the data.

Data visualization is for large datasets. The best way to explore and represent 20 or fewer values is usually a table.

Graphical Excellence: When you publish your data visualizations, follow the principles of graphical excellence:

  • Communicate with clarity, precision and efficiency.
  • The target is to give the greatest number of ideas in the shortest time with the least ink in the smallest space available.
  • You must be telling the truth about the data. Hiding details on purpose fights against graphical excellence.

How data visualization can lie:

  • Irregular tracking segments.
  • Cutting the tracking segment at a specific point to hide unwanted detail, which hides the context.
  • Placing the baseline of a chart in negative values.
  • Comparing money values across eras without adjusting for inflation.
  • Considering only the order of numbers, not the magnitude of the difference between them.
  • Introducing design variation, e.g. zooms or perspective.

Lie Factor = % change in visual / % change in data

A lie factor of less than 0.95 or more than 1.05 is unprofessional and deceiving.
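
The lie factor formula above can be checked with a few lines of Python. This is only a sketch: the function names and the sample bar heights are made up for illustration, and "% change" is assumed to mean the relative change from the first value to the second.

```python
def percent_change(first: float, second: float) -> float:
    """Relative change from first to second, as a percentage of first."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return (second - first) / abs(first) * 100.0


def lie_factor(visual_first, visual_second, data_first, data_second) -> float:
    """Lie Factor = % change in visual / % change in data."""
    data_change = percent_change(data_first, data_second)
    if data_change == 0:
        raise ValueError("data values are identical; lie factor is undefined")
    return percent_change(visual_first, visual_second) / data_change


def is_honest(factor: float, low: float = 0.95, high: float = 1.05) -> bool:
    """True when the factor stays inside the 0.95 to 1.05 band from the notes."""
    return low <= factor <= high


# Hypothetical chart: the data grows from 100 to 150 (+50%),
# but the bars are drawn 20 px and 60 px tall (+200%).
factor = lie_factor(20, 60, 100, 150)  # 200% / 50% = 4.0
print(factor, is_honest(factor))
```

A factor of 4.0 means the graphic exaggerates the change fourfold, far outside the acceptable band.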

Sophisticated Visualization: A data visualization which has more than one variable but isn't a time-series or a map.

Dull, Dull, line with marked data points.

If your visualization is dull, then your numbers are probably dull too. The fact is that finding a good story to tell with numbers requires some statistical skill. When you have dull numbers, you are bound to have dull visualizations. This makes you miss the real news in the actual data.

Avoid using two or three dimensions to show one-dimensional data. Be extra careful when the number of information-carrying dimensions exceeds the number of dimensions in the data. These designs have so many potential pitfalls that they should be avoided.

E.g. country populations shown as varying 2-dimensional shapes. => You must calculate the area of each shape and check that the areas correspond to the population values.
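
If you do end up drawing values as circles, one common way to keep the area honest is to scale the radius by the square root of the value. The sketch below assumes circles and uses made-up populations; `radius_for_value` and `area_per_unit` are hypothetical names.

```python
import math


def radius_for_value(value: float, area_per_unit: float = 1.0) -> float:
    """Radius of a circle whose AREA (not radius) is proportional to value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return math.sqrt(value * area_per_unit / math.pi)


def circle_area(radius: float) -> float:
    return math.pi * radius ** 2


# Made-up populations: B is four times as large as A.
populations = {"A": 1_000_000, "B": 4_000_000}
radii = {name: radius_for_value(p, area_per_unit=1e-4)
         for name, p in populations.items()}

# B's radius is only twice A's, but its area is four times A's,
# which matches the ratio in the data.
print(radii["B"] / radii["A"])
print(circle_area(radii["B"]) / circle_area(radii["A"]))
```

Scaling the radius directly by the value would make B look sixteen times larger than A instead of four.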

Mark significant events on your time-series charts. Consider even marking only significant events instead of regular intervals.

The larger the share of a graphic's ink devoted to data, the better. Target a data-ink ratio of over 0.8.

Data-ink Ratio = ink arranged in response to variation in the numbers represented / total ink used to print the graphic

Bar chart 5%, 10%, 15%

Use range-frames over boxes.

Prefer horizontal rectangles. They are naturally pleasing to the eye. The nature of the data may suggest another kind of shape though.

Aim for around 1.0 height x 1.618 width. The width should always be between 1.2 and 2.2 times the height.
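
As a small worked example of the proportions above, the following sketch derives a chart height from a width. It assumes the 1.2 to 2.2 range refers to width divided by height; the function names are hypothetical.

```python
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # about 1.618


def chart_height(width: float, ratio: float = GOLDEN_RATIO) -> float:
    """Height for a given width so that width / height equals ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return width / ratio


def is_pleasing(width: float, height: float,
                low: float = 1.2, high: float = 2.2) -> bool:
    """True when width / height falls in the range given in the notes."""
    return low <= width / height <= high


print(round(chart_height(800)))  # height for an 800 px wide chart
print(is_pleasing(16, 9), is_pleasing(4, 4))
```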

Embedding data visualization into other media: sparklines.

Glucose 6.6 (previous measurements, normal range)
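
A text-only sparkline is easy to sketch with Unicode block characters. This is one possible approach, not a standard API; the glucose readings below are invented for illustration.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values) -> str:
    """Map each value to one of eight block heights, scaled to min..max."""
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values equal: draw a flat line at the lowest block.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in values)


# Invented previous measurements, ending at the current value.
readings = [5.1, 5.4, 6.0, 5.8, 6.3, 7.1, 6.6]
print(f"Glucose {readings[-1]} {sparkline(readings)}")
```

The sparkline sits inline next to the number, so the reader sees the current value and its recent history in the space of a word.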

Chartjunk:

  • Unintentional Optical Art: Grouped lines form moiré vibrations when viewed.
  • Dreaded Grid: Dark and heavy grid lines carry no information and should be suppressed.
  • Self-promoting Graphical Duck: The graphic does not contain ornament, the graphic is the ornament. "That is some interesting data" becomes "that looks cool". This happens when the visuals have no connection to the data they represent.

Data visualization research is a different thing. It's about understanding how visualizations convey information. It helps to develop principles and techniques for creating effective visualizations.

Data Variable Types

Taxonomy:

  • 1 dimension = sets
  • 2 dimensions = maps
  • 3 dimensions = shapes
  • n dimensions = relational
  • tree = hierarchy
  • graph = network

Nominal:

  • Main operations: =
  • E.g. Apple, Orange, Banana; Burned, Not burned. Apple is not Orange.

Ordered:

  • Main operations: =, >, <
  • E.g. Hot, Warm, Cold. Warm is hotter than Cold.

Interval:

  • Main operations: =, <, >, -, distance
  • Has no true zero point.
  • E.g. 19.1.2006, 7.2.2013, 0C, -32C. Between 10.12.2012 and 20.12.2012 is 10 days.

Ratio:

  • Main operations: =, <, >, -, distance, proportions
  • Has a fixed zero point.
  • E.g. 10cm, 50cm, 1kg, 100g. 20kg is 20kg from no weight.
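
The difference between the variable types shows up in which operations make sense. The sketch below uses Python's own types as a rough illustration: an ordered list for ordered data, dates for interval data and plain numbers for ratio data. The helper `ratio_is_defined` is a hypothetical name.

```python
from datetime import date

# Ordered: only equality and order comparisons are meaningful.
ORDER = ["Cold", "Warm", "Hot"]
warm_is_hotter = ORDER.index("Warm") > ORDER.index("Cold")

# Interval: differences (distances) are meaningful.
start = date(2012, 12, 10)
end = date(2012, 12, 20)
days_between = (end - start).days


def ratio_is_defined(a, b) -> bool:
    """True when a / b is a supported operation for these (non-zero) values."""
    try:
        a / b
    except TypeError:
        return False
    return True


# Dates have no zero point, so Python refuses to divide one by another.
print(days_between, ratio_is_defined(end, start))
# Weights have a fixed zero point, so proportions work: 20 kg is twice 10 kg.
print(ratio_is_defined(20.0, 10.0), 20.0 / 10.0)
```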

Relational data model:

  • Contains tables with rows, columns and schema.
  • Collection of tables is a database.
  • Relational algebra applies.

A statistical data model contains variables or observations.

Variables are either measures or dimensions. Measures are data values that can be aggregated. Dimensions are discrete variables describing the data.

  • Population - Ratio - Measure
  • Current Year - Interval - Dimension
  • Age - Ratio - depends on how it is used
  • Sex - Nominal - Dimension
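
In practice, the split means that you group by dimensions and aggregate measures. A minimal sketch with invented rows (the field names and numbers are assumptions):

```python
from collections import defaultdict

# Invented observations: year and sex are dimensions, population is a measure.
rows = [
    {"year": 2013, "sex": "F", "population": 120},
    {"year": 2013, "sex": "M", "population": 110},
    {"year": 2014, "sex": "F", "population": 125},
    {"year": 2014, "sex": "M", "population": 115},
]


def total_by(rows, dimension: str, measure: str) -> dict:
    """Sum a measure for each distinct value of a dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


print(total_by(rows, "year", "population"))
print(total_by(rows, "sex", "population"))
```

Age is the ambiguous case from the list: averaged, it acts as a measure; bucketed into age groups, it acts as a dimension.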

Preparing Data

Data wrangling is manipulation of data prior to analysis.

  • Manual formatting of data.
  • Most of the time it comes down to writing custom scripts and parsers.
  • Data Wrangler: http://vis.stanford.edu/wrangler
  • Google Refine: http://code.google.com/p/google-refine

Problems in preparing the data:

  • Missing data, i.e. no measurement. Remove all related data or do more measurements.
  • Seemingly erroneous values. Is it an error in measurement or an outlier?
  • Erroneous types or units, e.g. Celsius values in a Fahrenheit column. You need a conversion.
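
A custom cleaning script for the first and last problem might look like the sketch below. It assumes each reading is a (value, unit) pair with `None` marking a missing measurement, and it simply drops missing readings; the function names are hypothetical.

```python
def fahrenheit_to_celsius(value: float) -> float:
    return (value - 32) * 5 / 9


def clean_temperatures(readings):
    """Return all readings in Celsius, skipping missing measurements."""
    cleaned = []
    for value, unit in readings:
        if value is None:
            continue  # missing data: drop the reading
        if unit == "F":
            value = fahrenheit_to_celsius(value)
        elif unit != "C":
            raise ValueError(f"unknown unit: {unit!r}")
        cleaned.append(round(value, 1))
    return cleaned


# Invented raw data with a missing value and mixed units.
raw = [(20.0, "C"), (None, "C"), (68.0, "F"), (212.0, "F")]
print(clean_temperatures(raw))
```

Seemingly erroneous values are harder to automate, because deciding between a measurement error and a genuine outlier usually needs knowledge of the data.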

Exploratory analysis:

  • Construct graphics to answer questions.
  • Inspect answer.
  • Create new questions.
  • Repeat until you have no more questions.
  • Take most important parts and finalize visualization.

Visual Encoding

Visual encoding is how you present the data as visualization.

Differentiating elements:

  • Textual labels to explain the visuals.
  • Numeric values to support the visuals.
  • Color of elements.
  • Transparency of elements.
  • Position of elements.
  • Space between elements.
  • Length of elements.
  • Size, scale, area or volume of elements.
  • Shape of elements.
  • Orientation of elements.
  • Blurring of elements.
  • Texture of elements.
  • Angles in and between elements.

Grouping elements:

  • Enclosing elements in a group.
  • Connecting elements in a group.
  • Separating groups with space.
  • Elements in a group are the same shape.

Effectiveness of encoding: Mackinlay’s Ranking.

Visualization encoding should match the properties of the represented data. E.g. encode a size variable as size, or show gold medals won in yellow.

Visualizations shouldn't hide facts or express facts that are not in the data. But the most important data should be presented in the most effective way. Data is your story, you must choose the hero and act as the narrator. E.g. presenting one-to-many relationships as a one-dimensional plot hides data.

You should aim to maximize the data-ink ratio. Draw as little as you can while showing as much as you can.

Using Colors

Colors are used to identify, group, layer and highlight information. Brightness can be used as an ordered indicator, from light to dark. Hue can be used as an unordered indicator, with different colors.

Use:

  • Different color hues for nominal data. Colors should be distinctive and easily nameable, e.g. green, blue, red.
  • Different hues and saturation for ratio data, e.g. dark green, light yellow, dark red.
  • Different saturation for interval and ordered data, e.g. from light blue to dark blue.
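
A light-to-dark ramp for ordered data can be generated with the standard `colorsys` module. This sketch keeps the hue fixed at blue and varies lightness, which is one way to get from light blue to dark blue; the hue, saturation and lightness defaults are arbitrary assumptions, not recommended values.

```python
import colorsys


def sequential_ramp(steps: int, hue: float = 0.6, saturation: float = 0.8,
                    lightest: float = 0.85, darkest: float = 0.30):
    """Return hex colors of a single hue, ordered from light to dark."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    colors = []
    for i in range(steps):
        t = i / (steps - 1) if steps > 1 else 0.0
        lightness = lightest + t * (darkest - lightest)
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


# Five steps, e.g. for an ordered variable with five levels.
ramp = sequential_ramp(5)
print(ramp)
```

For nominal data you would instead vary the hue and keep lightness roughly constant, so that no category looks "larger" than another.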

Avoid using too many colors in one map or visualization. Six colors can be considered the maximum, at least on maps. Check out ColorBrewer.

Make sure the visualization is readable by color-blind readers. Check out VisCheck.

There is a lot more about colors themselves.

Important to consider:

  • Bezold Effect: color appearance depends on adjacent colors.
  • Crispening: color appearance depends on background.
  • Spreading: color appearance depends on size of stripes.
  • Cultural Conventions: color appearance depends on the reader's cultural background.

Evaluating Visualizations

  • What is the purpose of the visualization?
  • Does the visualization serve the purpose?
  • Who is the target audience?
  • Can the target audience fully understand the visualization?
  • Does the visualization show the appropriate amount of data?
  • Does the visualization convey the data honestly and without much bias?
  • Does the visualization state where the data comes from?
  • Are effective visual encodings being used?
  • Is interaction used to enable effective exploration of the data?
  • Is the visualization innovative?
  • Does the visualization address an important, interesting or inspiring topic?
  • Is the visualization aesthetically pleasing?
  • Is the visual design appropriate to the topic?

Sources

  • Information Visualization, University of Turku
  • Data Visualization, Stanford University
  • The Visual Display of Quantitative Information, Edward R. Tufte