⛏️ Data Science Process Models
Updated at 2018-03-29 04:51
Generalized data science workflow:
Gathering Data
▲  └► Understand Data
│    └► Create Models
│      └► Understand the World ►┐
├────────────┬──────────────────┘
│            └► Create Products: Applications, Services, Reports
│              └► Generates New Data ►┐
├───────────────────┬─────────────────┘
│                   └► Make Decision
│                     └► Do Actions
│                       └► Record Results ►┐
└──────────────────────────────────────────┘
Cross-Industry Standard Process for Data Mining (CRISP-DM) breaks the process of data mining into six major phases:
- Business Understanding: Understand goals from a business perspective.
- Data Understanding: Get familiar with the data, identify quality problems and detect interesting subsets.
- Data Preparation: Construct the final dataset used in the modeling.
- Modeling: Apply various modeling techniques with a wide range of parameters.
- Evaluation: Verify that the model properly achieves all the business goals.
- Deployment: Anything from generating a report to building a repeatable data scoring process.
+---+     +---+     +---+     +---+     +---+     +---+
| 1 +-----> 2 +-----> 3 +-----> 4 +-----> 5 +-----> 6 |
|   <-----+   |     |   <-----+   |     |   |     |   |
+-^-+     +---+     +---+     +---+     +-+-+     +---+
  |                                       |
  +---------------------------------------+
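The arrows in the diagram can be read as a small transition graph. The following is a minimal sketch of that reading, not part of CRISP-DM itself: the names `Phase`, `TRANSITIONS`, and `is_valid_path` are made up for illustration, and the backward arrows (2 to 1, 4 to 3, 5 to 1) are taken from the diagram as drawn.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves, read off the diagram: the forward arrows plus the
# three backward arrows (2 -> 1, 4 -> 3, 5 -> 1).
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.DATA_PREPARATION, Phase.BUSINESS_UNDERSTANDING},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.EVALUATION, Phase.DATA_PREPARATION},
    Phase.EVALUATION: {Phase.DEPLOYMENT, Phase.BUSINESS_UNDERSTANDING},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases follows an arrow."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


# A project that revisits business understanding once and data
# preparation once before deploying.
project = [Phase(n) for n in (1, 2, 1, 2, 3, 4, 3, 4, 5, 6)]
print(is_valid_path(project))
```

A path that skips a phase, such as going from phase 1 straight to phase 3, is rejected because the diagram has no such arrow.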
Analytics Solutions Unified Method for Data Mining (ASUM-DM) refines and extends CRISP-DM, but the overall process is largely the same.
Sample, Explore, Modify, Model, and Assess (SEMMA) is often considered a general data mining methodology. However, it was designed to help users of the SAS Enterprise Miner software, and it leaves out the business aspects of data science.
- Sample: Select a right-sized dataset for modeling.
- Explore: Look for relationships between variables and find abnormalities.
- Modify: Select, create and transform variables for data modeling.
- Model: Apply various modeling techniques on the prepared dataset to find the ones that provide the desired outcome.
- Assess: Evaluate the results for reliability and usefulness.
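The five steps can be sketched as plain functions chained together. This is only an illustration under stated assumptions: it uses the Python standard library rather than SAS Enterprise Miner, the function names and the synthetic data are made up, and the model is a simple least-squares line assessed on the same rows it was fitted on for brevity.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: draw a manageable random subset of the rows."""
    return random.Random(seed).sample(rows, n)


def explore(rows):
    """Explore: summarize x and flag values far from the mean."""
    xs = [x for x, _ in rows]
    mean, stdev = statistics.mean(xs), statistics.pstdev(xs)
    outliers = [r for r in rows if abs(r[0] - mean) > 3 * stdev]
    return {"mean_x": mean, "stdev_x": stdev, "outliers": outliers}


def modify(rows, summary):
    """Modify: drop the flagged rows and standardize x."""
    kept = [r for r in rows if r not in summary["outliers"]]
    return [((x - summary["mean_x"]) / summary["stdev_x"], y) for x, y in kept]


def model(rows):
    """Model: fit y = a * x + b by ordinary least squares."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def assess(rows, params):
    """Assess: mean absolute error of the fitted line."""
    a, b = params
    return statistics.mean(abs(y - (a * x + b)) for x, y in rows)


# Synthetic, made-up data: y = 2x + 1 exactly, so the error should be near 0.
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
subset = sample(data, 30)
prepared = modify(subset, explore(subset))
params = model(prepared)
error = assess(prepared, params)
```

In practice the Assess step would use rows held out from the Model step; the sketch skips that split to stay short.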
Generalized data flow in a machine learning system:
Gathering Data                                      PHYSICAL LEVEL
│
└► Pre-processing Data <┐                           DATA LEVEL
    ├► Cleaning Data ───┘
    │
    └► Feature Extraction
        └► Pattern Discovery <───────┐              MODEL LEVEL
            ├► Low-level Correction ─┘
            │
            └► Situation Assessment
                └► Decision Making <──────────┐     DECISION LEVEL
                    ├► High-level Correction ─┘
                    │
                    └► Presentation                 PRESENTATION LEVEL
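The levels above can be sketched as a chain of functions, one per stage. This is a toy illustration under stated assumptions: the readings, thresholds, and function names are all made up, only the cleaning feedback loop is modeled, and the low-level and high-level correction loops and the situation assessment stage are collapsed into single steps.

```python
from statistics import mean


def gather():
    """Physical level: raw readings (made-up values; None marks a failed read)."""
    return [21.0, None, 22.5, 85.0, 23.0, None, 22.0]


def clean(readings):
    """One cleaning pass: drop missing values first, then implausible ones."""
    if None in readings:
        return [r for r in readings if r is not None]
    return [r for r in readings if -40.0 <= r <= 60.0]


def preprocess(readings):
    """Data level: feed cleaned data back in until a pass changes nothing."""
    while True:
        cleaned = clean(readings)
        if cleaned == readings:
            return cleaned
        readings = cleaned


def extract_features(readings):
    """Data level: reduce the cleaned readings to a few features."""
    return {"mean": mean(readings), "spread": max(readings) - min(readings)}


def discover_pattern(features):
    """Model level: label the features with a simple threshold rule."""
    return "warm" if features["mean"] > 25.0 else "normal"


def decide(pattern):
    """Decision level: map the detected pattern to an action."""
    return {"warm": "start cooling", "normal": "do nothing"}[pattern]


def present(action):
    """Presentation level: format the decision for a reader."""
    return f"Recommended action: {action}"


report = present(decide(discover_pattern(extract_features(preprocess(gather())))))
print(report)
```

Each stage only sees the output of the stage before it, which mirrors how the diagram passes data down from one level to the next.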
Sources
- Wikipedia: Cross-industry standard process for data mining
- Travis Oliphant