Data Warehouses

Updated at 2018-06-24 21:43

A data warehouse is a system used for reporting and data analysis, and it is considered a core component of business intelligence.

Data warehouse performance tricks:

  • Columnar storage: rather than organizing data on disk by rows, data is stored by column and rows are reconstructed from the column storage at runtime. Every column is almost its own index (toy sketch after this list).
  • Cost-based query planning
  • Pipelined execution
  • Just-in-time compilation
  • Data compression
  • Careful data distribution across nodes
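
A toy sketch of the columnar storage idea in plain Python (names and data are made up): rows are pivoted into per-column lists, so a scan over one column never reads the others, and full rows are reconstructed only when needed.

# Toy illustration of columnar storage: rows live as per-column lists.
rows = [
    {"user_id": 1, "country": "FI", "amount": 10.0},
    {"user_id": 2, "country": "SE", "amount": 25.5},
    {"user_id": 3, "country": "FI", "amount": 7.25},
]

# "Write path": pivot rows into columns.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Aggregating a single column only touches columns["amount"].
total = sum(columns["amount"])

# "Read path": reconstruct a full row at runtime by picking the same
# index from every column.
second_row = {key: values[1] for key, values in columns.items()}

print(total)       # 42.75
print(second_row)  # {'user_id': 2, 'country': 'SE', 'amount': 25.5}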

Warehouse Types

Data Landfill

You save everything in AWS S3 with a semi-logical key structure.
You might need it later and S3 is cheap as hell.

Pro: simple as hell, cheap as hell
Con: not usable as-is, so the data needs to be moved elsewhere for processing
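
A minimal sketch of the ingest side using boto3; the bucket name and key layout are made up, but prefixing keys by type and date gives the "semi-logic" that later batch jobs can rely on.

import datetime
import json

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

def dump_event(event, bucket="my-data-landfill"):
    """Write a raw event to S3 under a hypothetical type/date key layout."""
    now = datetime.datetime.utcnow()
    key = "raw/{type}/{date}/{ts}.json".format(
        type=event.get("type", "unknown"),
        date=now.strftime("%Y-%m-%d"),
        ts=now.strftime("%H%M%S%f"),
    )
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))

dump_event({"type": "signup", "user_id": 123})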

Traditional Data Warehouse

Application data goes to one RDBMS.
Analytics data goes to another RDBMS.

Pro: simple
Con: not scalable

Managed Analytics Service

You use AWS Redshift or Google BigQuery for your analytics data.
Works for basic analytics use cases, but doesn't offer a lot of customization.
Costs like hell, but worth it if you use the data.

Pro: simple
Con: can be expensive with a lot of queries

Hadoop-style Data Lake

Events go to HDFS.
Different applications run their computations on top of the HDFS data.

Pro: you add more storage by adding nodes to the HDFS cluster.
Con: requires you to co-locate computation and data, so avoid heavy computation on the cluster.
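
A minimal sketch of the ingest side, assuming the standard hdfs command-line tool is available and using a made-up date-partitioned directory layout.

import subprocess

# Ship a local batch of events into HDFS with the stock `hdfs dfs` CLI.
local_file = "events-2018-06-24.json.gz"
hdfs_dir = "/data/events/2018/06/24/"  # made-up layout, partitioned by date

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)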

Lambda Architecture

stream of events
    > stream processing layer for low-latency views
    > storage layer + batch jobs for higher-latency views
        > both views are aggregated and served

Pro: flexible
Con: complex, can be expensive if done wrong
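
A minimal sketch of the serving side in Python, with made-up numbers: the batch view covers the full history up to the last batch run, the speed-layer view covers everything after it, and a query merges the two.

from collections import Counter

# Precomputed by the batch jobs over the full event history.
batch_view = Counter({"signup": 12000, "purchase": 3400})

# Maintained by the stream processing layer since the last batch run.
speed_view = Counter({"signup": 55, "purchase": 9})

def event_count(event_type):
    """Serving layer: aggregate the batch and speed views at query time."""
    return batch_view[event_type] + speed_view[event_type]

print(event_count("signup"))  # 12055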

Kappa Architecture

events
	> message bus with "raw data" topic e.g. Kafka
		> transform
			> message bus with "processed data" topic e.g. Kafka
				> transform
					> message bus with "analysis results" topic e.g. Kafka
						> consume

Pro: really flexible
Con: really complex
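
A minimal sketch of one transform step using the kafka-python client; the topic names, broker address and the transformation itself are assumptions.

import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "raw-data",  # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

# Consume raw events, apply one transformation, publish to the next topic.
for message in consumer:
    event = message.value
    event["amount_eur"] = round(event.get("amount_cents", 0) / 100, 2)
    producer.send("processed-data", value=event)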

Hadoop + Spark

Unifies data sources and doubles as a compute layer.

Pro: flexible, a lot of open source support, easy to add more storage
Con: either you co-host computation and storage or you move a lot of data around; hard to optimize
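
A minimal PySpark sketch, assuming the events already sit in HDFS as Parquet under a made-up path: Spark reads straight from HDFS and acts as the compute layer.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-aggregation").getOrCreate()

# Read events stored on HDFS (the path is an assumption).
events = spark.read.parquet("hdfs:///data/events/2018/06/")

# Daily event counts per type; the work runs where the data lives.
daily_counts = (
    events
    .groupBy("event_type", F.to_date("created_at").alias("day"))
    .count()
)

daily_counts.write.mode("overwrite").parquet("hdfs:///data/aggregates/daily_counts/")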

Managed Data Warehouses

The main two that I have personally used are Google BigQuery and Amazon Redshift.

Data volume:

  • Redshift: terabyte-scale works fine, will slow down when going to petabytes.
  • BigQuery: unlimited in practice; around 10 petabytes of incoming data per day.

Query speed: if you need to compare data warehouse query speeds, compare complex queries; that is where the real latency differences show. Overall, query speeds are usually pretty similar.

  • Redshift: expect queries up to a couple of minutes.
  • BigQuery: expect queries to take up to a minute.

Cost:

  • Redshift: cheaper, especially for steady query workloads, let's say more than 40 queries per hour.
  • BigQuery: more expensive, but can be cheaper for spiky workloads; the cost is hard to predict.

Usage:

  • Redshift: you write and run SQL based on PostgreSQL.
  • BigQuery: you write and run the BigQuery SQL dialect (a sketch of both follows this list).
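
A minimal sketch of both from Python; the connection details, table names and the choice of client libraries (psycopg2 for Redshift, google-cloud-bigquery for BigQuery) are assumptions.

# Redshift speaks the PostgreSQL wire protocol, so a PostgreSQL driver works.
import psycopg2  # pip install psycopg2-binary

redshift = psycopg2.connect(
    host="example-cluster.redshift.example.com",  # made up
    port=5439,
    dbname="analytics",
    user="analyst",
    password="secret",
)
with redshift.cursor() as cursor:
    cursor.execute("SELECT country, count(*) FROM events GROUP BY country;")
    print(cursor.fetchall())

# BigQuery has its own client and SQL dialect.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses application default credentials
sql = "SELECT country, count(*) AS n FROM `my_project.analytics.events` GROUP BY country"
for row in client.query(sql).result():
    print(row.country, row.n)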

Analytic tools to use with data warehouses:

  • Periscope
  • Looker
  • Chartio

Redshift

Redshift storage stack:
Cluster > Node > Disk > Slice > Column > Block
  • Cluster = a group of computers that form a Redshift instance.
  • Node = a single computer in the cluster.
  • Disk = a locally attached storage device; each node has 1, 3, 8 or 24 disks.
  • Slice = a virtual compute node used to parallelize query processing; each node has 2, 16 or 32 slices. Table rows are distributed to slices and each slice processes only its own data.
  • Column = a logical column that grows in 1-block increments.
  • Block = column data is saved in 1 MB immutable blocks.

Each table has a distribution style; a DDL sketch follows the list.

  • KEY: rows with the same hashed value go to the same slice. Use when one of the columns produces an even distribution, and when you want to optimize join and group operations.
  • ALL: copies the whole table to the first slice of each node. Use for small tables that are needed frequently.
  • EVEN: distributes data evenly across the available slices. Use when no column produces an even distribution.
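
A sketch of what the choice looks like in Redshift DDL (table and column names are made up), executed here through a psycopg2 connection like the one in the earlier sketch.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")

with conn.cursor() as cursor:
    # KEY: rows sharing a user_id land on the same slice, which helps joins.
    cursor.execute("""
        CREATE TABLE events (
            event_id   bigint,
            user_id    bigint,
            created_at timestamp,
            event_type varchar(64)
        )
        DISTSTYLE KEY
        DISTKEY (user_id)
        SORTKEY (created_at);
    """)
    # ALL: a small, frequently joined lookup table copied to every node.
    cursor.execute("""
        CREATE TABLE event_types (
            event_type varchar(64),
            category   varchar(64)
        )
        DISTSTYLE ALL;
    """)
conn.commit()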

Block properties cause:

  • a 10-row write has a similar cost to a 100 000-row write.
  • UPDATE and DELETE only logically delete rows; VACUUM is required for actual removal (sketch after this list).
  • varchar columns of over 150 characters will degrade performance.
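
A sketch of working with those properties (connection details as above): batch the deletes, then VACUUM and ANALYZE; note that VACUUM cannot run inside a transaction block, hence autocommit.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cursor:
    # One big logical delete instead of many tiny ones; blocks are immutable,
    # so small writes and deletes cost nearly as much as large ones.
    cursor.execute("DELETE FROM events WHERE created_at < '2017-01-01';")
    # Reclaim the logically deleted rows and refresh planner statistics.
    cursor.execute("VACUUM events;")
    cursor.execute("ANALYZE events;")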

Column properties cause:

  • Incorrect column encoding (compression) will degrade performance.
  • ANALYZE COMPRESSION can be run on existing tables to propose optimal compression; changing column compression requires a table rebuild (sketch after this list).
  • DISTKEY and SORTKEY significantly influence performance, as a query is only as fast as the slowest slice.
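
A sketch of asking Redshift for encoding proposals on an existing table (connection details as above; the table name is made up). Applying a proposed encoding still means rebuilding the table.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")
conn.autocommit = True  # run outside an explicit transaction

with conn.cursor() as cursor:
    # Ask Redshift to propose column encodings for an existing table.
    cursor.execute("ANALYZE COMPRESSION events;")
    for row in cursor.fetchall():
        # Each result row describes one column and its suggested encoding.
        print(row)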

Slice properties cause:

  • You need even data distribution between slices to gain the full benefits of parallelism.
  • You can optimize for your own use case by placing frequently joined data on the same slice.
  • Use at least as many input files as there are slices in the cluster to maximize throughput.

Data preparation (a load-file sketch follows the list):

  • Use CSV files for simplicity.
  • Pick a simple delimiter like a comma, pipe or tab.
  • Pick a simple NULL character like \N.
  • Use double quotes and escape characters for varchars.
  • Split files into 1 MB - 1 GB pieces, with the file count a multiple of the number of slices, for faster ingestion.
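
A minimal sketch of producing load files in Python, assuming a 16-slice cluster and made-up fields: pipe-delimited, \N for NULLs, gzipped, and split so the file count is a multiple of the slice count.

import csv
import gzip

SLICES = 16          # slices in the target cluster (assumption)
FILES = SLICES * 2   # keep the file count a multiple of the slice count

def write_load_files(rows):
    """Split rows round-robin into pipe-delimited, gzipped load files."""
    handles = [
        gzip.open("events_part_{:02d}.csv.gz".format(i), "wt", newline="")
        for i in range(FILES)
    ]
    writers = [
        csv.writer(handle, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for handle in handles
    ]
    for n, row in enumerate(rows):
        # Represent missing values as \N so COPY can load them as NULL.
        cleaned = [r"\N" if value is None else value for value in row]
        writers[n % FILES].writerow(cleaned)
    for handle in handles:
        handle.close()

write_load_files([
    (1, "FI", 10.0),
    (2, None, 25.5),
])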
