
⛏️ Data Science Process Models

Updated at 2018-03-29 04:51

Generalized data science workflow:

Gathering Data
▲   └► Understand Data
│       └► Create Models
│           └► Understand the World ►┐
├────────────┬──────────────────┘
│               └► Create Products: Applications, Services, Reports
│                   └► Generates New Data ►┐
├───────────────────┬───────────────┘
│                       └► Make Decision
│                           └► Do Actions
│                               └► Record Results ►┐
└──────────────────────────────────────────┘

Cross-Industry Standard Process for Data Mining (CRISP-DM) breaks the process of data mining into six major phases:

  1. Business Understanding: Understand goals from a business perspective.
  2. Data Understanding: Get familiar with the data, identify quality problems and detect interesting subsets.
  3. Data Preparation: Construct the final dataset used in the modeling.
  4. Modeling: Apply various modeling techniques with a wide range of parameters.
  5. Evaluation: Verify that the model properly achieves the business goals.
  6. Deployment: Anything from generating a report to building a repeatable data scoring process.
+---+     +---+     +---+     +---+     +---+     +---+
| 1 +-----> 2 +-----> 3 +-----> 4 +-----> 5 +-----> 6 |
|   <-----+   |     |   <-----+   |     |   |     |   |
+-^-+     +---+     +---+     +---+     +-+-+     +---+
  |                                       |
  +---------------------------------------+
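
The arrows in the diagram can be read as a small transition table. The sketch below is only an illustration of that reading: the names Phase, TRANSITIONS and is_valid_path are made up for this note and are not part of CRISP-DM itself.

```python
from enum import IntEnum


class Phase(IntEnum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves, read off the arrows in the diagram (an interpretation,
# not an official specification).
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases follows an arrow."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))
```

For example, going back from Modeling to Data Preparation is a valid step, while jumping from Business Understanding straight to Data Preparation is not.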

Analytics Solutions Unified Method for Data Mining (ASUM-DM) refines and extends CRISP-DM, but the process is more or less the same.

Sample, Explore, Modify, Model, and Assess (SEMMA) is often considered a general data mining methodology. However, it was designed to guide users of the SAS Enterprise Miner software, so it leaves out the business aspects of data science.

  1. Sample: Select a right-sized dataset for modeling.
  2. Explore: Look for relationships between variables and find abnormalities.
  3. Modify: Select, create and transform variables for data modeling.
  4. Model: Apply various modeling techniques on the prepared dataset to find the ones that provide the desired outcome.
  5. Assess: Evaluate the results for reliability and usefulness.
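
The five SEMMA steps map naturally onto a chain of functions. The following is a minimal sketch using only the Python standard library and made-up, noise-free data. The function names, the 3-sigma outlier rule and the straight-line model are assumptions chosen for illustration, not anything prescribed by SEMMA or SAS Enterprise Miner.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: draw a manageable subset of the rows."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def explore(rows):
    """Explore: summarise x and flag values far from the mean."""
    xs = [x for x, _ in rows]
    mean, stdev = statistics.mean(xs), statistics.pstdev(xs)
    outliers = [x for x in xs if stdev and abs(x - mean) > 3 * stdev]
    return {"mean": mean, "stdev": stdev, "outliers": outliers}


def modify(rows, summary):
    """Modify: standardise x using the statistics found while exploring."""
    scale = summary["stdev"] or 1.0
    return [((x - summary["mean"]) / scale, y) for x, y in rows]


def model(rows):
    """Model: fit y = a*x + b with ordinary least squares."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in rows) / sxx if sxx else 0.0
    return a, my - a * mx


def assess(rows, params):
    """Assess: mean squared error of the fitted line on the given rows."""
    a, b = params
    return statistics.mean((y - (a * x + b)) ** 2 for x, y in rows)


data = [(float(x), 2.0 * x + 1.0) for x in range(100)]  # made-up data
subset = sample(data, 20)
summary = explore(subset)
prepared = modify(subset, summary)
params = model(prepared)
error = assess(prepared, params)
```

Here the model is assessed on the same rows it was fitted on, which keeps the sketch short. In practice the Assess step would normally use data held back from the Model step.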

Generalized data flow in a machine learning system:

Gathering Data                                            PHYSICAL LEVEL
	│
	└► Pre-processing Data <┐                             DATA LEVEL
		├► Cleaning Data ───┘
		│
		└► Feature Extraction
			└► Pattern Discovery <───────┐                MODEL LEVEL
				├► Low-level Correction ─┘
				│
				└► Situation Assessment
					└► Decision Making <──────────┐       DECISION LEVEL
						├► High-level Correction ─┘
						│
						└► Presentation                   PRESENTATION LEVEL
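
The levels above can be sketched as a chain of small functions, one per stage. This is a toy illustration only: the sensor readings, thresholds and function names are all made up, and only the data-level feedback loop (Cleaning Data back to Pre-processing Data) is shown. The low-level and high-level correction loops are left out to keep it short.

```python
def preprocess(readings, low=0.0, high=100.0, max_passes=5):
    """Data level: clean, then feed the result back until nothing changes."""
    for _ in range(max_passes):
        cleaned = [r for r in readings if r is not None and low <= r <= high]
        if cleaned == readings:  # no correction needed, leave the loop
            break
        readings = cleaned  # Cleaning Data -> Pre-processing Data
    return readings


def extract_features(readings):
    """Data level: reduce the readings to a small feature dictionary."""
    return {"mean": sum(readings) / len(readings), "peak": max(readings)}


def discover_pattern(features, threshold=50.0):
    """Model level: label the situation from the features."""
    return "high" if features["mean"] > threshold else "normal"


def decide(situation):
    """Decision level: map the assessed situation to an action."""
    return {"high": "raise alert", "normal": "do nothing"}[situation]


def present(action):
    """Presentation level: format the decision for a person."""
    return f"Recommended action: {action}"


raw = [42.0, None, 61.5, 250.0, 58.0]  # made-up sensor readings
report = present(decide(discover_pattern(extract_features(preprocess(raw)))))
```

Note that extract_features assumes at least one reading survives cleaning. A real system would need to handle the empty case.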
