Data Warehouses
Updated at 2018-06-24 21:43
A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence.
Data warehouse performance tricks:
- Columnar storage: rather than organizing data on disk by rows, data is stored by column, and rows are reassembled from column storage at runtime. Each column effectively acts as its own index (see the query sketch after this list).
- Cost-based query planning
- Pipelined execution
- Just-in-time compilation
- Data compression
- How the data is distributed
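A minimal sketch, assuming a hypothetical wide events table, of why columnar storage pays off: the query below only reads the two columns it references, never the rest of the row.

```sql
-- Only the event_type and event_date column files are scanned;
-- all other columns of the (hypothetically wide) table stay on disk.
SELECT event_type, COUNT(*) AS events
FROM events
WHERE event_date >= '2018-01-01'
GROUP BY event_type;
```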
Warehouse Types
Data Landfill
You save everything in AWS S3 with some semblance of organization.
You might need it later and S3 is cheap as hell.
Pro: simple as hell, cheap as hell
Con: not usable in this form, so data needs to be transferred elsewhere for processing
Traditional Data Warehouse
Application data goes to one RDBMS.
Analytics data goes to another RDBMS.
Pro: simple
Con: not scalable
Managed Analytics Service
You use AWS Redshift or Google BigQuery for your analytics data.
Works for basic analytics use cases; there is not a lot of room for customization.
Costs like hell, but worth it if you use the data.
Pro: simple
Con: can be expensive with a lot of queries
Hadoop-style Data Lake
Events go to HDFS.
Different applications run their computations directly on HDFS.
Pro: you add more space by adding nodes to the HDFS cluster.
Con: computation and data are co-located, so heavy computation competes with storage and should be avoided.
Lambda Architecture
stream of events feeds two paths:
> stream processing layer for low-latency results
> storage layer + batch jobs for higher-latency processing
the results of both layers are merged and served
Pro: flexible
Con: complex, can be expensive if done wrong
Kappa Architecture
events
> message bus with "raw data" topic e.g. Kafka
> transform
> message bus with "processed data" topic e.g. Kafka
> transform
> message bus with "analysis results" topic e.g. Kafka
> consume
Pro: really flexible
Con: really complex
Hadoop + Spark
Unifies data sources and doubles as a compute layer.
Pro: flexible, a lot of open source support, easy to add more storage
Con: you either co-host computation and storage or pay for a lot of data transfer; hard to optimize
Managed Data Warehouses
The main two that I have personally used are Google BigQuery and Amazon Redshift.
Data volume:
- Redshift: terabyte scale works fine; it slows down when you move into petabytes.
- BigQuery: effectively unlimited in practice, handling around 10 petabytes of incoming data per day.
Query speed: If you need to compare data warehouse query speeds, compare complex queries; that is where the real latency differences show. Overall, query speeds are usually pretty similar.
- Redshift: expect queries up to a couple of minutes.
- BigQuery: expect queries to take up to a minute.
Cost:
- Redshift: cheaper, especially for steady query workloads, let's say more than 40 queries per hour.
- BigQuery: more expensive overall, though it can be cheaper for spiky workloads; the cost is hard to predict.
Usage:
- Redshift: You write and run SQL based on PostgreSQL.
- BigQuery: You write and run the BigQuery SQL dialect.
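As a small illustration of the dialect difference (the events table and BigQuery dataset path are hypothetical), the same daily count looks slightly different on each engine:

```sql
-- Redshift (PostgreSQL-based SQL)
SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*) AS events
FROM analytics.events
GROUP BY 1
ORDER BY 1;

-- BigQuery (standard SQL)
SELECT TIMESTAMP_TRUNC(created_at, DAY) AS day, COUNT(*) AS events
FROM `my_project.analytics.events`
GROUP BY day
ORDER BY day;
```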
Analytic tools to use with data warehouses:
- Periscope
- Looker
- Chartio
Redshift
Redshift storage stack:
Cluster > Node > Disks > Slices > Columns > Blocks
- Cluster = a group of computers that form a Redshift instance
- Node = a computer in the cluster
- Disk = locally attached storage devices; each node has 1, 3, 8 or 24 disks.
- Slice = virtual compute node used to parallelize query processing; each node has 2, 16 or 32 slices. Table rows and their column data are distributed to slices, and each slice processes only its own data (see the query after this list).
- Column = logical columns grow in one-block increments.
- Block = column data is stored in 1 MB immutable blocks.
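As a hedged way to see the node and slice layout on a live cluster, Redshift exposes the STV_SLICES system table (one row per slice; verify column names against your cluster's documentation):

```sql
-- Count how many slices each node in the cluster has.
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;
```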
Each table has a distribution style.
- KEY: rows with the same hashed value go to the same slice. Use when one of the columns produces an even distribution of values and you want to optimize join and group operations.
- ALL: copies the whole table to the first slice of each node. Use for small tables that are needed frequently.
- EVEN: distributes data evenly across the available slices. Use when no column can produce an even distribution.
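A minimal DDL sketch, with hypothetical orders and countries tables, of how these styles are declared in Redshift:

```sql
-- KEY distribution: rows are hashed on customer_id, so rows with the same
-- customer land on the same slice (good for joins on customer_id).
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    total_cents BIGINT
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);

-- ALL distribution: the small lookup table is copied to every node.
CREATE TABLE countries (
    country_code CHAR(2),
    country_name VARCHAR(64)
)
DISTSTYLE ALL;
```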
Block properties cause:
- 10-row writes have a similar cost to 100 000-row writes.
- UPDATE/DELETE only logically delete rows; VACUUM is required for actual removal (sketch after this list).
- Having varchar columns of over 150 characters will degrade performance.
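A sketch of the delete-then-reclaim pattern this implies, on a hypothetical events table:

```sql
-- Rows are only marked as deleted; their blocks are not reclaimed yet.
DELETE FROM events WHERE event_date < '2017-01-01';

-- Reclaim the space and re-sort the remaining rows.
VACUUM events;

-- Refresh planner statistics after a large change.
ANALYZE events;
```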
Column properties cause:
- Incorrect column encoding (compression) will degrade performance. ANALYZE COMPRESSION can be used on existing tables to propose optimal compression; changing a column's compression requires a table rebuild (sketch after this list).
- DISTKEY and SORTKEY significantly influence performance, as a query is only as fast as its slowest slice.
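A sketch of checking and applying encodings on a hypothetical events table (the encodings shown are illustrative, not recommendations):

```sql
-- Sample the table and let Redshift propose a compression encoding per column.
ANALYZE COMPRESSION events;

-- Encodings cannot be changed in place, so apply them by rebuilding the table.
CREATE TABLE events_new (
    event_id   BIGINT      ENCODE zstd,
    event_type VARCHAR(64) ENCODE bytedict,
    created_at TIMESTAMP   ENCODE delta
)
SORTKEY (created_at);

INSERT INTO events_new SELECT event_id, event_type, created_at FROM events;
```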
Slice properties cause:
- You need even data distribution between slices to gain the full benefits of parallelism (see the skew check after this list).
- You can optimize speed for your own use case by placing frequently joined rows on the same slice via the distribution key.
- Use at least as many input files as there are slices in the cluster to maximize throughput.
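A hedged way to check how even the distribution actually is, using the SVV_TABLE_INFO system view (column names as documented by AWS; verify on your cluster):

```sql
-- skew_rows is the ratio of rows on the fullest slice to rows on the emptiest
-- slice; values close to 1 mean an even distribution.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;
```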
Data preparation:
- Use CSV files for simplicity.
- Pick a simple delimiter like comma (,), pipe (|) or tab.
- Pick a simple NULL character like \N.
- Use double quotes and escape characters for varchars.
- Split files into chunks of 1 MB - 1 GB, with the file count a multiple of the number of slices, for faster ingestion.
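A COPY sketch matching the points above; the bucket, key prefix and IAM role ARN are hypothetical placeholders:

```sql
-- Loads every file under the prefix in parallel, spreading them across slices.
COPY events
FROM 's3://example-bucket/events/2018-06-24/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
CSV
DELIMITER '|'
NULL AS '\N'
GZIP;
```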
Sources
- Redshift versus Snowflake versus BigQuery / Part 1: Performance
- Deep Dive on Amazon Redshift, AWS Loft Meetup, September 2017
- Wikipedia