A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence.
Data warehouse performance tricks:
- Columnar Storage: rather than organizing data on disk by rows, data is stored by column, and rows are reassembled from column storage at runtime. Every column effectively acts as its own index.
- Cost-based query planning
- Pipelined execution
- Just-in-time compilation
- Data compression
- Data distribution
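A minimal Python sketch (made-up data, illustrative only) of the first and the compression tricks: storing values column-by-column, then run-length encoding a column, which works well because a single column's values are often repetitive:

```python
# Illustrative only: row layout vs. column layout, plus run-length
# encoding (RLE) of a column. Data is invented.
rows = [
    {"country": "FI", "amount": 10},
    {"country": "FI", "amount": 20},
    {"country": "SE", "amount": 30},
]

# Columnar layout: one list per column, so a scan of a single
# column reads only that column's data.
columns = {key: [row[key] for row in rows] for key in rows[0]}

def rle(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

print(columns["country"])        # ['FI', 'FI', 'SE']
print(rle(columns["country"]))   # [('FI', 2), ('SE', 1)]
```

Repeated values collapse into (value, count) pairs, which is why low-cardinality columns compress so well in columnar stores.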
Data Dump
You save everything in AWS S3 with some loose structure, because you might need it later and S3 is cheap as hell. Pro: simple as hell, cheap as hell. Con: not usable in this form, so data needs to be transferred elsewhere for processing.
Traditional Data Warehouse
Application data goes to one RDBMS; analytics data goes to another RDBMS. Pro: simple. Con: not scalable.
Managed Analytics Service
You use AWS Redshift or Google BigQuery for your analytics data. Works for basic analytics use cases without much customization. Costs like hell, but worth it if you actually use the data. Pro: simple. Con: can be expensive with a lot of queries.
Hadoop-style Data Lake
Events go to HDFS, and different applications compute on the HDFS data. Pro: you add more space by adding nodes to the HDFS cluster. Con: requires you to co-locate your computation and data, so avoid heavy computation.
Lambda Architecture
Stream of events > stream processing layer > storage layer, plus batch jobs for higher-latency processing; both results are aggregated and served. Pro: flexible. Con: complex, can be expensive if done wrong.
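The split can be sketched in plain Python; the event names and counting logic here are invented for illustration. A low-latency streaming layer keeps running counts, a batch layer periodically recomputes exact results from the full event log, and the serving layer prefers the batch view where it exists:

```python
# Hypothetical lambda-style split: speed layer + batch layer,
# merged at serving time. All data and logic are invented.
from collections import Counter

event_log = ["click", "view", "click", "view", "view"]  # durable storage

streaming_counts = Counter()
def on_event(event):
    streaming_counts[event] += 1   # updated as each event arrives

for e in event_log:
    on_event(e)

def batch_job(log):
    return Counter(log)            # periodic full recompute over the log

def serve(batch_view, speed_view):
    merged = dict(speed_view)
    merged.update(batch_view)      # batch results win where present
    return merged

print(serve(batch_job(event_log), streaming_counts))
```

The complexity cost is visible even in the toy version: the same aggregation logic lives in two places and has to be kept consistent.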
Kappa Architecture
Events > message bus with "raw data" topic (e.g. Kafka) > transform > message bus with "processed data" topic > transform > message bus with "analysis results" topic > consume. Pro: really flexible. Con: really complex.
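A hypothetical in-memory stand-in for that topic chain, with Python deques playing the role of Kafka topics and simple functions playing the role of stream transforms:

```python
# Illustrative only: three "topics" chained by two transforms.
# Real deployments would use Kafka consumers/producers instead.
from collections import deque

raw = deque()          # "raw data" topic
processed = deque()    # "processed data" topic
results = deque()      # "analysis results" topic

for value in [1, 2, 3, 4]:
    raw.append({"value": value})

while raw:                         # first transform: enrich each event
    event = raw.popleft()
    processed.append({"value": event["value"],
                      "squared": event["value"] ** 2})

total = sum(e["squared"] for e in processed)   # second transform: aggregate
results.append({"sum_of_squares": total})

print(results[0])   # {'sum_of_squares': 30}
```

Each stage only reads from the topic before it and writes to the topic after it, which is what makes the pipeline flexible to rewire and replay.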
Hadoop + Spark
Unifies data sources and doubles as a compute layer. Pro: flexible, a lot of open source support, easy to add more storage. Con: either co-hosting computation and storage or a lot of data transfer; hard to optimize.
Managed Data Warehouses
The main two that I have personally used are Google BigQuery and Amazon Redshift.
- Redshift: works fine at terabyte scale, slows down when you get toward petabytes.
- BigQuery: practically unlimited; handles around 10 petabytes of incoming data per day.
Query speed: if you need to compare data warehouse query speeds, compare complex queries; that is where the real latency differences show. Simple query speeds are usually pretty similar.
- Redshift: expect queries up to a couple of minutes.
- BigQuery: expect queries to take up to a minute.
- Redshift: cheaper, especially for steady query workloads, say more than 40 queries per hour.
- BigQuery: more expensive overall, but can be cheaper for spiky workloads; the cost is hard to predict.
- Redshift: You write and run SQL based on PostgreSQL.
- BigQuery: You write and run BigQuery SQL dialect.
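For illustration, the same hypothetical query in both dialects (table and column names are invented). Redshift inherits PostgreSQL-style functions such as GETDATE(), while BigQuery standard SQL uses CURRENT_TIMESTAMP() and backtick-quoted project.dataset.table paths:

```python
# Not executed against a warehouse; the two strings just contrast
# the dialects. Schema names are made up.
redshift_sql = """
SELECT user_id, COUNT(*) AS events
FROM analytics.events
WHERE created_at > GETDATE() - INTERVAL '1 day'
GROUP BY user_id;
"""

bigquery_sql = """
SELECT user_id, COUNT(*) AS events
FROM `my_project.analytics.events`
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY user_id;
"""
```

Most of the query body carries over unchanged; the differences cluster around date/time functions and table references.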
Redshift storage stack: Cluster > Node > Disks > Slices > Columns > Blocks
- Cluster = a group of computers that form a Redshift instance
- Node = a computer in the cluster
- Disk = locally attached storage devices; each node has 1, 3, 8, or 24 disks.
- Slice = virtual compute node used to parallelize query processing; each node has 2, 16, or 32 slices. Table rows are distributed to slices, and each slice processes only its own data.
- Column = logical columns grow in one-block increments.
- Block = column data is stored in 1 MB immutable blocks.
Each table has a distribution style.
- KEY: rows with the same hashed value go to the same slice. Use when one of the columns produces an even distribution and you want to optimize join and group operations.
- ALL: copies the whole table to the first slice of each node. Use for small tables that are needed frequently.
- EVEN: distributes data evenly across the available slices. Use when no column produces an even distribution.
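A sketch of why the key choice matters, assuming a simple hash-mod-slices placement (the real hash function is internal to Redshift). A high-cardinality key spreads rows evenly; a skewed key piles most rows onto a couple of slices:

```python
# Illustrative KEY-distribution simulation with invented data.
from collections import Counter
import hashlib

def slice_for(key, num_slices):
    # Stand-in for Redshift's internal row placement hash.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_slices

NUM_SLICES = 4
good_key_rows = range(1000)                   # e.g. user_id: many distinct values
bad_key_rows = ["FI"] * 900 + ["SE"] * 100    # e.g. country: heavy skew

good = Counter(slice_for(k, NUM_SLICES) for k in good_key_rows)
bad = Counter(slice_for(k, NUM_SLICES) for k in bad_key_rows)

print(sorted(good.values()))   # roughly even, about 250 rows per slice
print(sorted(bad.values()))    # at most two slices hold all the rows
```

Since a query finishes only when its slowest slice finishes, the skewed case loses most of the parallelism.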
Block properties cause:
- 10-row writes have a similar cost to 100 000-row writes.
- DELETE only logically deletes rows; VACUUM is required for actual removal.
- VARCHAR columns of over 150 characters will degrade performance.
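A toy Python model of the DELETE/VACUUM behavior (not Redshift's actual implementation): because blocks are immutable, DELETE can only mark rows as dead, and a VACUUM-like pass has to rewrite the data to reclaim space:

```python
# Illustrative model: logical delete flags vs. physical compaction.
class Table:
    def __init__(self, values):
        self.rows = [{"value": v, "deleted": False} for v in values]

    def delete(self, predicate):
        # Logical delete: flip a flag, data still occupies space.
        for row in self.rows:
            if predicate(row["value"]):
                row["deleted"] = True

    def vacuum(self):
        # Physical removal: rewrite rows without the dead ones.
        self.rows = [r for r in self.rows if not r["deleted"]]

t = Table([1, 2, 3, 4])
t.delete(lambda v: v % 2 == 0)
print(len(t.rows))   # 4 -> deleted rows still stored
t.vacuum()
print(len(t.rows))   # 2 -> space actually reclaimed
```

This is also why delete-heavy tables need periodic VACUUM runs: until then, scans keep paying for rows that are logically gone.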
Column properties cause:
- Incorrect column encoding (compression) will degrade performance.
- ANALYZE COMPRESSION can be run on existing tables to propose optimal compression; changing column compression requires a table rebuild.
- SORTKEY significantly influences performance, as a query is only as fast as its slowest slice.
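A sketch of the mechanism behind SORTKEY: Redshift keeps min/max metadata per block, so a range predicate can skip blocks whose value range cannot match. Sorted data gives tight, non-overlapping block ranges; unsorted data forces most blocks to be scanned. Data and block size here are invented:

```python
# Illustrative block-skipping ("zone map") simulation.
def make_blocks(values, block_size=4):
    # Each block records (min, max, rows), like per-block metadata.
    chunks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(c), max(c), c) for c in chunks]

def blocks_scanned(blocks, lo, hi):
    # A block must be read only if its [min, max] overlaps [lo, hi].
    return sum(1 for bmin, bmax, _ in blocks if bmax >= lo and bmin <= hi)

data = [7, 1, 14, 3, 9, 12, 2, 16, 5, 10, 15, 4]

unsorted_blocks = make_blocks(data)
sorted_blocks = make_blocks(sorted(data))

print(blocks_scanned(unsorted_blocks, 1, 4))   # 3: every block overlaps 1..4
print(blocks_scanned(sorted_blocks, 1, 4))     # 1: only the first block matches
```

The same predicate touches a third of the blocks once the column is sorted, which is where the sort key's speedup comes from.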
Slice properties cause:
- You need even data distribution between slices to gain the full benefits of parallelism.
- You can optimize for your own use case by placing frequently joined columns on the same slice.
- Use at least as many input files as there are slices in the cluster to maximize throughput.
- Use CSV files for simplicity.
- Pick a simple delimiter (Redshift's COPY default is the pipe character).
- Pick a simple NULL marker (COPY's default null string is \N).
- Use double quotes and escape characters for values that contain the delimiter.
- Split files into 1 MB to 1 GB chunks, with the file count a multiple of the number of slices, for faster ingestion.
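A hypothetical helper that applies this rule of thumb; the 128 MB target file size is an assumption for illustration, not an official recommendation:

```python
# Illustrative only: choose a file count that is a multiple of the
# slice count while keeping each file between 1 MB and 1 GB.
MB = 1024 * 1024
GB = 1024 * MB

def file_count(total_bytes, num_slices, target_file_size=128 * MB):
    count = max(1, round(total_bytes / target_file_size))
    # Round up to the next multiple of the slice count so every
    # slice gets the same number of files to ingest in parallel.
    count = ((count + num_slices - 1) // num_slices) * num_slices
    # Keep individual files within the 1 MB .. 1 GB window.
    while total_bytes / count > GB:
        count += num_slices
    return count

print(file_count(10 * GB, num_slices=8))   # 80 files of ~128 MB each
```

With 8 slices and 10 GB of data, every slice ingests 10 files in parallel instead of some slices sitting idle.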