Data Warehouses

Updated at 2018-06-24 21:43

A data warehouse is a system used for reporting and data analysis, and it is considered a core component of business intelligence.

Data warehouse performance tricks:

  • Columnar storage: rather than organizing data on disk by rows, data is stored by column and rows are reconstructed from the column storage at runtime. Every column is almost its own index (toy sketch after this list).
  • Cost-based query planning
  • Pipelined execution
  • Just-in-time compilation
  • Data compression
  • Careful data distribution across nodes
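
A toy sketch of the columnar storage idea in plain Python (names and data are made up): rows are pivoted into per-column lists, so a scan over one column never reads the others, and full rows are reconstructed only when needed.

# Toy illustration of columnar storage: rows live as per-column lists.
rows = [
    {"user_id": 1, "country": "FI", "amount": 10.0},
    {"user_id": 2, "country": "SE", "amount": 25.5},
    {"user_id": 3, "country": "FI", "amount": 7.25},
]

# "Write path": pivot rows into columns.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Aggregating a single column only touches columns["amount"].
total = sum(columns["amount"])

# "Read path": reconstruct a full row at runtime by picking the same
# index from every column.
second_row = {key: values[1] for key, values in columns.items()}

print(total)       # 42.75
print(second_row)  # {'user_id': 2, 'country': 'SE', 'amount': 25.5}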

Warehouse Types

Data Landfill

You save everything in AWS S3 with a semi-logical key structure.
You might need it later and S3 is cheap as hell.

Pro: simple as hell, cheap as hell
Con: not usable as-is, so the data needs to be moved elsewhere for processing
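
A minimal sketch of the ingest side using boto3; the bucket name and key layout are made up, but prefixing keys by type and date gives the "semi-logic" that later batch jobs can rely on.

import datetime
import json

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

def dump_event(event, bucket="my-data-landfill"):
    """Write a raw event to S3 under a hypothetical type/date key layout."""
    now = datetime.datetime.utcnow()
    key = "raw/{type}/{date}/{ts}.json".format(
        type=event.get("type", "unknown"),
        date=now.strftime("%Y-%m-%d"),
        ts=now.strftime("%H%M%S%f"),
    )
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))

dump_event({"type": "signup", "user_id": 123})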

Traditional Data Warehouse

Application data goes to one RDBMS.
Analytics data goes to another RDBMS.

Pro: simple
Con: not scalable

Managed Analytics Service

You use AWS Redshift or Google BigQuery for your analytics data.
Works for basic analytics use cases, but doesn't offer a lot of customization.
Costs like hell, but worth it if you use the data.

Pro: simple
Con: can be expensive with a lot of queries

Hadoop-style Data Lake

Events go to HDFS.
Different applications run their computations on top of the HDFS data.

Pro: you add more storage by adding nodes to the HDFS cluster.
Con: requires you to co-locate computation and data, so avoid heavy computation on the cluster.
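
A minimal sketch of the ingest side, assuming the standard hdfs command-line tool is available and using a made-up date-partitioned directory layout.

import subprocess

# Ship a local batch of events into HDFS with the stock `hdfs dfs` CLI.
local_file = "events-2018-06-24.json.gz"
hdfs_dir = "/data/events/2018/06/24/"  # made-up layout, partitioned by date

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)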

Lambda Architecture

stream of events
    > stream processing layer for low-latency views
    > storage layer + batch jobs for higher-latency views
        > both views are aggregated and served

Pro: flexible
Con: complex, can be expensive if done wrong
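
A minimal sketch of the serving side in Python, with made-up numbers: the batch view covers the full history up to the last batch run, the speed-layer view covers everything after it, and a query merges the two.

from collections import Counter

# Precomputed by the batch jobs over the full event history.
batch_view = Counter({"signup": 12000, "purchase": 3400})

# Maintained by the stream processing layer since the last batch run.
speed_view = Counter({"signup": 55, "purchase": 9})

def event_count(event_type):
    """Serving layer: aggregate the batch and speed views at query time."""
    return batch_view[event_type] + speed_view[event_type]

print(event_count("signup"))  # 12055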

Kappa Architecture

events
	> message bus with "raw data" topic e.g. Kafka
		> transform
			> message bus with "processed data" topic e.g. Kafka
				> transform
					> message bus with "analysis results" topic e.g. Kafka
						> consume

Pro: really flexible
Con: really complex
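
A minimal sketch of one transform step using the kafka-python client; the topic names, broker address and the transformation itself are assumptions.

import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "raw-data",  # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

# Consume raw events, apply one transformation, publish to the next topic.
for message in consumer:
    event = message.value
    event["amount_eur"] = round(event.get("amount_cents", 0) / 100, 2)
    producer.send("processed-data", value=event)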

Hadoop + Spark

Unifies data sources and doubles as a compute layer.

Pro: flexible, a lot of open source support, easy to add more storage
Con: either you co-host computation and storage or you move a lot of data around; hard to optimize
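
A minimal PySpark sketch, assuming the events already sit in HDFS as Parquet under a made-up path: Spark reads straight from HDFS and acts as the compute layer.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-aggregation").getOrCreate()

# Read events stored on HDFS (the path is an assumption).
events = spark.read.parquet("hdfs:///data/events/2018/06/")

# Daily event counts per type; the work runs where the data lives.
daily_counts = (
    events
    .groupBy("event_type", F.to_date("created_at").alias("day"))
    .count()
)

daily_counts.write.mode("overwrite").parquet("hdfs:///data/aggregates/daily_counts/")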

Managed Data Warehouses

The main two that I have personally used are Google BigQuery and Amazon Redshift.

Data volume:

  • Redshift: terabyte-scale works fine, will slow down when going to petabytes.
  • BigQuery: unlimited in practice; around 10 petabytes of incoming data per day.

Query speed: if you need to compare data warehouse query speeds, compare complex queries; that is where the real latency differences show. Overall, query speeds are usually pretty similar.

  • Redshift: expect queries up to a couple of minutes.
  • BigQuery: expect queries to take up to a minute.

Cost:

  • Redshift: cheaper, especially for steady query workloads, let's say more than 40 queries per hour.
  • BigQuery: more expensive, but can be cheaper for spiky workloads; the cost is hard to predict.

Usage:

  • Redshift: you write and run SQL based on PostgreSQL.
  • BigQuery: you write and run the BigQuery SQL dialect (a sketch of both follows this list).
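
A minimal sketch of both from Python; the connection details, table names and the choice of client libraries (psycopg2 for Redshift, google-cloud-bigquery for BigQuery) are assumptions.

# Redshift speaks the PostgreSQL wire protocol, so a PostgreSQL driver works.
import psycopg2  # pip install psycopg2-binary

redshift = psycopg2.connect(
    host="example-cluster.redshift.example.com",  # made up
    port=5439,
    dbname="analytics",
    user="analyst",
    password="secret",
)
with redshift.cursor() as cursor:
    cursor.execute("SELECT country, count(*) FROM events GROUP BY country;")
    print(cursor.fetchall())

# BigQuery has its own client and SQL dialect.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses application default credentials
sql = "SELECT country, count(*) AS n FROM `my_project.analytics.events` GROUP BY country"
for row in client.query(sql).result():
    print(row.country, row.n)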

Analytic tools to use with data warehouses:

  • Periscope
  • Looker
  • Chartio

Redshift

Redshift storage stack:
Cluster > Node > Disk > Slice > Column > Block
  • Cluster = a group of computers that form a Redshift instance.
  • Node = a single computer in the cluster.
  • Disk = a locally attached storage device; each node has 1, 3, 8 or 24 disks.
  • Slice = a virtual compute node used to parallelize query processing; each node has 2, 16 or 32 slices. Table rows are distributed to slices and each slice processes only its own data.
  • Column = a logical column that grows in 1-block increments.
  • Block = column data is saved in 1 MB immutable blocks.

Each table has a distribution style; a DDL sketch follows the list.

  • KEY: rows with the same hashed value go to the same slice. Use when one of the columns produces an even distribution, and when you want to optimize join and group operations.
  • ALL: copies the whole table to the first slice of each node. Use for small tables that are needed frequently.
  • EVEN: distributes data evenly across the available slices. Use when no column produces an even distribution.
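
A sketch of what the choice looks like in Redshift DDL (table and column names are made up), executed here through a psycopg2 connection like the one in the earlier sketch.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")

with conn.cursor() as cursor:
    # KEY: rows sharing a user_id land on the same slice, which helps joins.
    cursor.execute("""
        CREATE TABLE events (
            event_id   bigint,
            user_id    bigint,
            created_at timestamp,
            event_type varchar(64)
        )
        DISTSTYLE KEY
        DISTKEY (user_id)
        SORTKEY (created_at);
    """)
    # ALL: a small, frequently joined lookup table copied to every node.
    cursor.execute("""
        CREATE TABLE event_types (
            event_type varchar(64),
            category   varchar(64)
        )
        DISTSTYLE ALL;
    """)
conn.commit()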

Block properties cause:

  • a 10-row write has a similar cost to a 100 000-row write.
  • UPDATE and DELETE only logically delete rows; VACUUM is required for actual removal (sketch after this list).
  • varchar columns of over 150 characters will degrade performance.
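
A sketch of working with those properties (connection details as above): batch the deletes, then VACUUM and ANALYZE; note that VACUUM cannot run inside a transaction block, hence autocommit.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cursor:
    # One big logical delete instead of many tiny ones; blocks are immutable,
    # so small writes and deletes cost nearly as much as large ones.
    cursor.execute("DELETE FROM events WHERE created_at < '2017-01-01';")
    # Reclaim the logically deleted rows and refresh planner statistics.
    cursor.execute("VACUUM events;")
    cursor.execute("ANALYZE events;")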

Column properties cause:

  • Incorrect column encoding (compression) will degrade performance.
  • ANALYZE COMPRESSION can be run on existing tables to propose optimal compression; changing column compression requires a table rebuild (sketch after this list).
  • DISTKEY and SORTKEY significantly influence performance, as a query is only as fast as the slowest slice.
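
A sketch of asking Redshift for encoding proposals on an existing table (connection details as above; the table name is made up). Applying a proposed encoding still means rebuilding the table.

import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.example.com", port=5439,
                        dbname="analytics", user="analyst", password="secret")
conn.autocommit = True  # run outside an explicit transaction

with conn.cursor() as cursor:
    # Ask Redshift to propose column encodings for an existing table.
    cursor.execute("ANALYZE COMPRESSION events;")
    for row in cursor.fetchall():
        # Each result row describes one column and its suggested encoding.
        print(row)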

Slice properties cause:

  • You need even data distribution between slices to gain the full benefits of parallelism.
  • You can optimize for your own use case by placing frequently joined data on the same slice.
  • Use at least as many input files as there are slices in the cluster to maximize throughput.

Data preparation (a load-file sketch follows the list):

  • Use CSV files for simplicity.
  • Pick a simple delimiter like a comma, pipe or tab.
  • Pick a simple NULL character like \N.
  • Use double quotes and escape characters for varchars.
  • Split files into 1 MB - 1 GB pieces, with the file count a multiple of the number of slices, for faster ingestion.
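
A minimal sketch of producing load files in Python, assuming a 16-slice cluster and made-up fields: pipe-delimited, \N for NULLs, gzipped, and split so the file count is a multiple of the slice count.

import csv
import gzip

SLICES = 16          # slices in the target cluster (assumption)
FILES = SLICES * 2   # keep the file count a multiple of the slice count

def write_load_files(rows):
    """Split rows round-robin into pipe-delimited, gzipped load files."""
    handles = [
        gzip.open("events_part_{:02d}.csv.gz".format(i), "wt", newline="")
        for i in range(FILES)
    ]
    writers = [
        csv.writer(handle, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for handle in handles
    ]
    for n, row in enumerate(rows):
        # Represent missing values as \N so COPY can load them as NULL.
        cleaned = [r"\N" if value is None else value for value in row]
        writers[n % FILES].writerow(cleaned)
    for handle in handles:
        handle.close()

write_load_files([
    (1, "FI", 10.0),
    (2, None, 25.5),
])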
