
Data Science Process Models

Updated at 2018-03-29 01:51

Generalized data science workflow:

Gathering Data
▲   └► Understand Data
│       └► Create Models
│           └► Understand the World ►┐
├───────────────┬────────────────────┘
│               └► Create Products: Applications, Services, Reports
│                   └► Generates New Data ►┐
├───────────────────────┬──────────────────┘
│                       └► Make Decision
│                           └► Do Actions
│                               └► Record Results ►┐
└──────────────────────────────────────────────────┘

Cross-Industry Standard Process for Data Mining (CRISP-DM) breaks the process of data mining into six major phases:

  1. Business Understanding: Understand objectives from a business perspective.

  2. Data Understanding: Get familiar with the data, identify quality problems and detect interesting subsets.

  3. Data Preparation: Construct the final dataset used in the modeling.

  4. Modeling: Apply various modeling techniques with a wide range of parameters.

  5. Evaluation: Verify that the model achieves the business objectives.

  6. Deployment: Anything from generating a report to building a repeatable data scoring process.

    +---+     +---+     +---+     +---+     +---+     +---+
    | 1 +-----> 2 +-----> 3 +-----> 4 +-----> 5 +-----> 6 |
    |   <-----+   |     |   <-----+   |     |   |     |   |
    +-^-+     +---+     +---+     +---+     +-+-+     +---+
      |                                       |
      +---------------------------------------+
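The arrows in the diagram can be written down as a small transition table. The following is only a minimal sketch: the `Phase` enum, the `TRANSITIONS` table and the `is_valid_path` helper are illustrative names and not part of CRISP-DM itself, and the table encodes only the arrows drawn above.

```python
from enum import IntEnum


class Phase(IntEnum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Arrows from the diagram: the forward steps plus the three feedback
# paths (2 -> 1, 4 -> 3 and 5 -> 1).
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases follows an arrow."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))
```

For example, `is_valid_path([Phase.MODELING, Phase.DATA_PREPARATION, Phase.MODELING])` holds because modeling may send the project back to data preparation, while a jump from `Phase.DATA_PREPARATION` straight to `Phase.DEPLOYMENT` does not.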

Analytics Solutions Unified Method for Data Mining (ASUM-DM) refines and extends CRISP-DM, but the overall process is more or less the same.

Sample, Explore, Modify, Model, and Assess (SEMMA) is often considered a general data mining methodology. However, SEMMA was designed to guide users of the SAS Enterprise Miner software, so it leaves out the business aspects of data science.

  1. Sample: Select an appropriately sized dataset for modeling.
  2. Explore: Look for relationships between variables and find abnormalities.
  3. Modify: Select, create and transform variables for data modeling.
  4. Model: Apply various modeling techniques to the prepared dataset to find the ones that provide the desired outcome.
  5. Assess: Evaluate the results for reliability and usefulness.
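The five steps above can be sketched as one function per step. This is a toy illustration using only the Python standard library, not how SAS Enterprise Miner implements them. The synthetic data, the 3-standard-deviation rule for abnormalities, the least-squares line and all function names are assumptions made for the example.

```python
import random
import statistics


def sample(rows, n, rng):
    """Sample: draw n rows for modeling; the rest are held out for assessment."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n], shuffled[n:]


def explore(rows):
    """Explore: summarize x and list rows more than 3 standard deviations out."""
    xs = [x for x, _ in rows]
    mean_x = statistics.mean(xs)
    std_x = statistics.pstdev(xs)
    abnormal = [row for row in rows if abs(row[0] - mean_x) > 3 * std_x]
    return mean_x, std_x, abnormal


def modify(rows, mean_x, std_x):
    """Modify: transform x into a standardized variable z."""
    return [((x - mean_x) / std_x, y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * z + intercept by ordinary least squares."""
    zs = [z for z, _ in rows]
    ys = [y for _, y in rows]
    mean_z, mean_y = statistics.mean(zs), statistics.mean(ys)
    covariance = sum((z - mean_z) * (y - mean_y) for z, y in rows)
    variance = sum((z - mean_z) ** 2 for z in zs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_z


def assess(rows, slope, intercept):
    """Assess: mean absolute error on rows the model has not seen."""
    return statistics.mean([abs(y - (slope * z + intercept)) for z, y in rows])


def run_semma(seed=0):
    rng = random.Random(seed)
    # Hypothetical data: y depends linearly on x, plus a little noise.
    xs = [rng.uniform(0, 10) for _ in range(500)]
    raw = [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in xs]
    train, holdout = sample(raw, 200, rng)
    mean_x, std_x, abnormal = explore(train)
    slope, intercept = model(modify(train, mean_x, std_x))
    error = assess(modify(holdout, mean_x, std_x), slope, intercept)
    return error, len(abnormal)
```

Calling `run_semma()` returns the held-out error and the number of rows flagged as abnormal. The held-out rows are transformed with the mean and standard deviation learned from the sample, so the assessment does not leak information from the data it is evaluated on.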

Generalized data flow in a machine learning system:

Gathering Data                                            PHYSICAL LEVEL
    │
    └► Pre-processing Data <┐                             DATA LEVEL
        ├► Cleaning Data ───┘
        │
        └► Feature Extraction
            └► Pattern Discovery <───────┐                MODEL LEVEL
                ├► Low-level Correction ─┘
                │
                └► Situation Assessment
                    └► Decision Making <──────────┐       DECISION LEVEL
                        ├► High-level Correction ─┘
                        │
                        └► Presentation                   PRESENTATION LEVEL
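As a rough sketch of the levels above, each stage can be a plain function whose output feeds the next. Everything here is hypothetical: the temperature readings, the plausibility range, the `25.0` threshold and the function names are made up for illustration. Only the cleaning feedback loop is implemented; the low-level and high-level correction loops are left out, and situation assessment is folded into the pattern step.

```python
def gather():
    # PHYSICAL LEVEL: hypothetical sensor readings; None marks a failed read.
    return [21.0, 21.5, None, 22.0, 480.0, 21.8]


def clean(readings):
    # One cleaning pass: drop failed reads and physically implausible values.
    return [r for r in readings if r is not None and -50.0 <= r <= 60.0]


def preprocess(readings):
    # DATA LEVEL: repeat cleaning until a pass changes nothing, mirroring the
    # arrow from "Cleaning Data" back to "Pre-processing Data".
    cleaned = clean(readings)
    while cleaned != readings:
        readings, cleaned = cleaned, clean(cleaned)
    return cleaned


def extract_features(readings):
    # DATA LEVEL: reduce the cleaned readings to a few summary features.
    return {"mean": sum(readings) / len(readings), "max": max(readings)}


def discover_pattern(features):
    # MODEL LEVEL: a stand-in rule where a real system would apply a model.
    return "hot" if features["mean"] > 25.0 else "normal"


def decide(situation):
    # DECISION LEVEL: map the assessed situation to an action.
    return "start cooling" if situation == "hot" else "do nothing"


def present(situation, action):
    # PRESENTATION LEVEL: format the outcome for a person or another system.
    return f"situation={situation}, action={action}"


def run_pipeline():
    readings = preprocess(gather())
    situation = discover_pattern(extract_features(readings))
    return present(situation, decide(situation))
```

Keeping each level in its own function makes it clear where a correction loop would attach: a low-level correction would re-run `discover_pattern` with adjusted features, and a high-level correction would re-run `decide`.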
