Data Science Process Models
Generalized data science workflow:
Gathering Data
▲   └► Understand Data
│       └► Create Models
│           └► Understand the World ►┐
├───────────────┬────────────────────┘
│               └► Create Products: Applications, Services, Reports
│                   └► Generates New Data ►┐
├───────────────────────┬──────────────────┘
│                       └► Make Decision
│                           └► Do Actions
│                               └► Record Results ►┐
└──────────────────────────────────────────────────┘
Cross-Industry Standard Process for Data Mining (CRISP-DM) breaks the process of data mining into six major phases:
1. Business Understanding: Understand the objectives from a business perspective.
2. Data Understanding: Get familiar with the data, identify quality problems, and detect interesting subsets.
3. Data Preparation: Construct the final dataset used for modeling.
4. Modeling: Apply various modeling techniques with a wide range of parameters.
5. Evaluation: Verify that the model achieves the business objectives.
6. Deployment: Anything from generating a report to building a repeatable data scoring process.
+---+     +---+     +---+     +---+     +---+     +---+
| 1 +-----> 2 +-----> 3 +-----> 4 +-----> 5 +-----> 6 |
|   <-----+   |     |   <-----+   |     |   |     |   |
+-^-+     +---+     +---+     +---+     +-+-+     +---+
  |                                       |
  +---------------------------------------+
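The arrows in the diagram above can be read as a small transition table: each phase leads to the next, with back edges from phase 2 to 1, from 4 to 3, and from 5 to 1. The following Python sketch encodes that reading. The names `Phase`, `TRANSITIONS`, and `is_valid_path` are hypothetical and chosen only for illustration; they are not part of CRISP-DM itself.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases, numbered as in the diagram."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves as read off the diagram: the forward arrows plus three back edges.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases follows an arrow."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A project that revisits business understanding once before moving on.
project = [Phase(1), Phase(2), Phase(1), Phase(2), Phase(3),
           Phase(4), Phase(5), Phase(6)]
print(is_valid_path(project))
```

A table like this makes the iterative nature of the process explicit: evaluation can send a project back to the start, while deployment is terminal in this reading of the diagram.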
Analytics Solutions Unified Method for Data Mining (ASUM-DM) refines and extends CRISP-DM, but the overall process remains largely the same.
Sample, Explore, Modify, Model, and Assess (SEMMA) is often treated as a general data mining methodology, although it was designed to guide users of the SAS Enterprise Miner software and leaves out the business aspects of data science.
- Sample: Select a suitably sized dataset for modeling.
- Explore: Look for relationships between variables and identify anomalies.
- Modify: Select, create and transform variables for data modeling.
- Model: Apply various modeling techniques on the prepared dataset to find the ones that provide the desired outcome.
- Assess: Evaluate the results for reliability and usefulness.
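To make the five SEMMA steps concrete, here is a minimal Python sketch that maps each step to one function, using only the standard library. The function names, the synthetic `(x, y)` data, the 3-standard-deviation outlier rule, and the choice of a least-squares line as the model are all assumptions made for illustration; SEMMA does not prescribe any of them.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: draw a reproducible random subset of at most n rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def explore(rows):
    """Explore: summarize x and flag rows whose x is over 3 stdevs from the mean."""
    xs = [x for x, _ in rows]
    mean, stdev = statistics.mean(xs), statistics.pstdev(xs)
    outliers = [r for r in rows if stdev and abs(r[0] - mean) > 3 * stdev]
    return {"mean": mean, "stdev": stdev, "outliers": outliers}


def modify(rows, summary):
    """Modify: drop the flagged rows and center x on the sample mean."""
    flagged = set(summary["outliers"])
    return [(x - summary["mean"], y) for x, y in rows if (x, y) not in flagged]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def assess(rows, fitted):
    """Assess: mean absolute error of the fitted line on the given rows."""
    slope, intercept = fitted
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in rows)


# Synthetic, noise-free data on the line y = 2x + 1 (an assumption for the demo).
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
subset = sample(data, 50)
summary = explore(subset)
prepared = modify(subset, summary)
fitted = model(prepared)
error = assess(prepared, fitted)
```

In practice the Assess step would use data held out from the Model step; the sketch reuses the same rows only to stay short.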
Generalized data flow in a machine learning system:
Gathering Data                                    PHYSICAL LEVEL
│
└► Pre-processing Data <┐                         DATA LEVEL
    ├► Cleaning Data ───┘
    │
    └► Feature Extraction
        └► Pattern Discovery <───────┐            MODEL LEVEL
            ├► Low-level Correction ─┘
            │
            └► Situation Assessment
                └► Decision Making <──────────┐   DECISION LEVEL
                    ├► High-level Correction ─┘
                    │
                    └► Presentation               PRESENTATION LEVEL
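Each level in the flow above pairs a stage with a correction step that feeds back into it. The Python sketch below shows one way to express that loop, applied only to the data level (pre-processing and cleaning) and followed by simplified feature extraction and decision steps. The helper `run_with_correction`, the toy sensor readings, and the fixed threshold are hypothetical stand-ins, not part of any particular system.

```python
def run_with_correction(stage, correct, needs_correction, value, max_rounds=3):
    """Run a stage, then pass its output through the correction step and back
    into the stage while the check still fails (at most max_rounds times)."""
    result = stage(value)
    for _ in range(max_rounds):
        if not needs_correction(result):
            break
        result = stage(correct(result))
    return result


def preprocess(raw):
    """Pre-processing: parse each reading to float; failures become None."""
    parsed = []
    for item in raw:
        try:
            parsed.append(float(item))
        except (TypeError, ValueError):
            parsed.append(None)
    return parsed


def clean(parsed):
    """Cleaning: drop the readings that pre-processing could not parse."""
    return [v for v in parsed if v is not None]


def has_gaps(parsed):
    """Check that triggers the cleaning loop: any unparsed reading left?"""
    return any(v is None for v in parsed)


def extract_features(values):
    """Feature extraction: reduce the readings to a few summary numbers."""
    return {"mean": sum(values) / len(values), "peak": max(values)}


def decide(features, limit=30.0):
    """Decision making: a fixed threshold stands in for a real model."""
    return "alert" if features["peak"] > limit else "ok"


raw = ["21.5", "bad", None, 35, "22.0"]  # toy input with two unusable readings
values = run_with_correction(preprocess, clean, has_gaps, raw)
features = extract_features(values)
decision = decide(features)
print(f"{decision}: peak={features['peak']}, mean={features['mean']:.2f}")
```

The same helper could wrap pattern discovery with low-level correction, or decision making with high-level correction, by passing different functions; the `max_rounds` cap is a design choice that keeps a faulty check from looping forever.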
Sources
- Wikipedia: Cross-industry standard process for data mining
- Travis Oliphant