🐘 Hadoop

Updated at 2017-07-10 20:31

Hadoop is software for distributed storage and processing. It uses the MapReduce programming model: Map filters and sorts data, while Reduce performs a summary operation such as a count. Hadoop is meant for large quantities of data.
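
To make the Map and Reduce roles a bit more concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the explicit shuffle step stands in for the grouping and sorting that a real cluster would do between the two phases.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: split each line into words and emit a (word, 1) pair per word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group emitted values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: summarize each group, here by summing the counts."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "big elephant"]))
```

On a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data. This sketch only shows how the data flows.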

Don't use Hadoop if your total data will stay under 2TB for the foreseeable future. Lease or buy a machine with 2TB of disk space, add some extra RAM, import the data into a PostgreSQL server and use that to do the analytics. Hadoop is good for big data, but big data starts with petabytes, not terabytes.

AWS Elastic MapReduce (EMR) creates hosted Hadoop clusters. If you want to play with Hadoop, it's a great place to get started. I would even say it's more cost-efficient to use AWS EMR than to manage Hadoop clusters yourself if you don't already have the servers.