ruk·si

⛏️ Data Science Process Models

Updated at 2018-03-29 04:51

Generalized data science workflow:

          Gathering Data
    ▲     └► Understand Data
    │        └► Create Models
    │           └► Understand the World ►┐
    ├───────────────┬────────────────────┘
    │               └► Create Products: Applications, Services, Reports
    │                  └► Generates New Data ►┐
    ├───────────────────────┬─────────────────┘
    │                       └► Make Decision
    │                          └► Do Actions
    │                             └► Record Results ►┐
    └────────────────────────────────────────────────┘

Cross-Industry Standard Process for Data Mining (CRISP-DM) breaks the process of data mining into six major phases:

  1. Business Understanding: Understand objectives from a business perspective.

  2. Data Understanding: Get familiar with the data, identify quality problems and detect interesting subsets.

  3. Data Preparation: Construct the final dataset used in the modeling.

  4. Modeling: Apply various modeling techniques with a wide range of parameters.

  5. Evaluation: Verify that the model properly achieves the business objectives.

  6. Deployment: Anything from generating a report to building a repeatable data scoring process.

    +---+     +---+     +---+     +---+     +---+     +---+
    | 1 +-----> 2 +-----> 3 +-----> 4 +-----> 5 +-----> 6 |
    |   <-----+   |     |   <-----+   |     |   |     |   |
    +-^-+     +---+     +---+     +---+     +-+-+     +---+
      |                                       |
      +---------------------------------------+
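The arrows in the diagram above can be written down as a small transition table. The following is a minimal Python sketch, not part of CRISP-DM itself: the names `Phase`, `TRANSITIONS` and `is_valid_path` are made up for illustration, and the allowed moves are read directly off the diagram (forward steps, 2 back to 1, 4 back to 3, and 5 back to 1).

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases, numbered as in the list above."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases, as drawn in the diagram (hypothetical encoding).
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases follows an arrow."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A project that returns to data preparation once before evaluating the model.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]
print(is_valid_path(project))
```

Encoding the arrows as data makes it easy to see that the process is iterative: evaluation can send a project all the way back to business understanding, while deployment has no outgoing arrow in this diagram.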

Analytics Solutions Unified Method for Data Mining (ASUM-DM) refines and extends CRISP-DM, but the overall process is more or less the same.

Sample, Explore, Modify, Model, and Assess (SEMMA) is often considered to be a general data mining methodology. It was designed to help users of the SAS Enterprise Miner software and leaves out the business aspects of data science.

  1. Sample: Select a right-sized dataset for modeling.
  2. Explore: Look for relationships between variables and find abnormalities.
  3. Modify: Select, create and transform variables for data modeling.
  4. Model: Apply various modeling techniques on the prepared dataset to find the ones that provide the desired outcome.
  5. Assess: Evaluate the results for reliability and usefulness.
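
The five SEMMA steps can be sketched as one function per step. This is only an illustration using the Python standard library, not how SAS Enterprise Miner works: the function names, the toy dataset `y = 2x + 1`, the choice of a least-squares line as the model, and mean squared error as the assessment are all assumptions.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: pick a manageable subset of (x, y) rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def explore(rows):
    """Explore: summarize the input variable before touching it."""
    xs = [x for x, _ in rows]
    return {"mean_x": statistics.mean(xs), "stdev_x": statistics.pstdev(xs)}


def modify(rows, stats):
    """Modify: standardize x using the statistics found while exploring."""
    return [((x - stats["mean_x"]) / stats["stdev_x"], y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def assess(rows, params):
    """Assess: mean squared error of the fitted line on the given rows."""
    slope, intercept = params
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return sum(errors) / len(errors)


# Toy data (made up for this sketch): an exact linear relationship.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

subset = sample(data, 10)
stats = explore(subset)
prepared = modify(subset, stats)
params = model(prepared)
mse = assess(prepared, params)
print(params, mse)
```

For brevity the sketch assesses the model on the same rows it was fitted on; a real assessment would use data held out from the modeling step.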

Generalized data flow in a machine learning system:

    Gathering Data                                PHYSICAL LEVEL
    │
    └► Pre-processing Data <┐                     DATA LEVEL
       ├► Cleaning Data ────┘
       │
       └► Feature Extraction
          └► Pattern Discovery <───────┐          MODEL LEVEL
             ├► Low-level Correction ──┘
             │
             └► Situation Assessment
                └► Decision Making <──────────┐   DECISION LEVEL
                   ├► High-level Correction ──┘
                   │
                   └► Presentation                PRESENTATION LEVEL
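
The feedback arrows in the flow above mean that a later step can send data back to an earlier one. As a rough sketch of the data-level loop only (cleaning feeding back into pre-processing), the following Python example repeats both steps until cleaning no longer changes anything. The function names, the `None` marker for missing readings, and the two-standard-deviation outlier rule are assumptions made for the example.

```python
import statistics


def preprocess(values):
    """Pre-processing: drop missing readings (represented here as None)."""
    return [v for v in values if v is not None]


def clean(values, z_max=2.0):
    """Cleaning: drop values further than z_max standard deviations from the mean."""
    if len(values) < 2:
        return values
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return values
    return [v for v in values if abs(v - mean) <= z_max * stdev]


def data_level(raw, max_rounds=10):
    """Run pre-processing and cleaning until cleaning stops changing the data."""
    values = preprocess(raw)
    for _ in range(max_rounds):
        cleaned = clean(values)
        if cleaned == values:
            break  # nothing left to correct, move on to feature extraction
        values = preprocess(cleaned)  # the feedback arrow in the diagram
    return values


# Made-up sensor readings with one missing value and one obvious outlier.
raw = [10, 11, None, 9, 10, 1000, 10, 11, 9]
print(data_level(raw))
```

The low-level and high-level correction loops at the model and decision levels follow the same shape: run a step, check its result, and feed the correction back into the step before it.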
