
Data Science Team Roles

Updated at 2018-09-24 21:37

Data science teams need six different roles covered. A single person can cover several roles, and an external service can cover some, but every role must be handled somehow.

  1. Business Owner
  2. Problem Domain Expert
  3. Data Engineer
  4. Data Scientist
  5. Machine Learning Engineer
  6. Product Engineer

1. Business Owner

Understands the business and the benefits the team's work delivers outside the team.

Drives the discussion about what the goals of the data science work are.

Has some understanding of machine learning.

Handles all data-access-related issues or delegates that work.

2. Problem Domain Expert

MLOps is most relevant for subject-matter experts as a feedback mechanism and a platform for communication with data scientists about the models they are building.

When there are unexpected shifts in performance, subject-matter experts need a scalable way, through MLOps processes, to flag model results that don't align with business expectations.

That is, they should be able to use MLOps processes as a jumping-off point for exploring:

  • The data pipelines behind the models
  • What data is being used
  • How it's being transformed and enhanced
  • What kind of machine learning techniques are being applied
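The flagging step described above could look roughly like the following. This is a minimal sketch, not a description of any particular MLOps tool: the `Expectation` structure, the `flag_unexpected` helper, and the `daily_demand` range are all hypothetical, and it assumes a business expectation can be stated as a simple numeric range.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """A business expectation for one model output (hypothetical structure)."""
    name: str
    low: float
    high: float


def flag_unexpected(predictions, expectation):
    """Return (index, value) pairs for predictions outside the expected range."""
    return [
        (i, value)
        for i, value in enumerate(predictions)
        if not (expectation.low <= value <= expectation.high)
    ]


# Assumed example: the domain expert says daily demand should stay within 0..500.
daily_demand = Expectation(name="daily_demand", low=0.0, high=500.0)
flags = flag_unexpected([120.0, 480.5, -3.0, 910.0], daily_demand)
print(flags)
```

The flagged indices give the domain expert a concrete starting point for asking which data and transformations produced those results.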

Understands what we are predicting and why.

Knows what data is available and what the data communicates in fine detail.

Has some understanding of machine learning.

Explains to the data scientist (4) what different data points mean and provides detailed domain-specific knowledge.

3. Data Engineer

Data scientists should step back in when it comes time to test, package, robustify, and then deploy the model.

The role of data engineers in the life cycle is to optimize the retrieval and use of data to eventually power machine learning models.

Software engineers are responsible for the maintenance of the website as a whole, and a large part of that includes the functioning of the machine learning models in production.

Gathers data from given sources.

Helps the data scientist (4) process large quantities of data. Explains to the machine learning engineer (5) and the data scientist (4) how the whole pipeline works.

Knows how to plan maintainable data pipelines and manages how the data is tested.
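One simple way data can be tested inside a pipeline is a validation step that separates usable rows from rejected ones. The sketch below is only illustrative: `validate_rows`, the field names, and the sample rows are hypothetical, and it assumes rows arrive as plain dictionaries.

```python
def validate_rows(rows, required_fields):
    """Split rows into valid rows and (row_index, reason) rejections."""
    valid, rejected = [], []
    for i, row in enumerate(rows):
        # A field counts as missing if it is absent or explicitly None.
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            rejected.append((i, "missing: " + ", ".join(missing)))
        else:
            valid.append(row)
    return valid, rejected


# Assumed sample data for illustration only.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": None, "amount": 3.0},
    {"user_id": 3},
]
valid, rejected = validate_rows(rows, ["user_id", "amount"])
print(valid)
print(rejected)
```

Keeping the rejection reasons, rather than silently dropping rows, makes it easier to explain to the data scientist (4) what the pipeline did to the data.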

Able to automate infrastructure.

4. Data Scientist

Explores and tries to understand correlations in the gathered data.

Defines, usually in Jupyter Notebooks or a similar environment, what kind of predictive models the machine learning engineers (5) will build.

Understands well how statistics and machine learning work.

Knows some programming.

5. Machine Learning Engineer

Wraps models created by the data scientist (4) to be used in production.

Writes unit tests for the code created and manages how the models are tested.

Has a good understanding of how machine learning works.

Is excellent at writing maintainable code, as it is pure software development at this point.
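As a rough illustration of wrapping and testing a model, the sketch below puts a thin production-facing class around a model object and covers it with unit tests. Everything here is hypothetical: `MeanModel` stands in for whatever the data scientist (4) hands over, and `ModelService` is just one possible shape for a wrapper, assuming the model exposes a `predict` method.

```python
import unittest


class MeanModel:
    """Stand-in for a model handed over by the data scientist (hypothetical)."""

    def predict(self, features):
        return sum(features) / len(features)


class ModelService:
    """Production-facing wrapper: validates input before calling the model."""

    def __init__(self, model, n_features):
        self.model = model
        self.n_features = n_features

    def predict(self, features):
        if len(features) != self.n_features:
            raise ValueError(
                f"expected {self.n_features} features, got {len(features)}"
            )
        # Coerce to float so callers may pass ints or numeric strings.
        return float(self.model.predict([float(x) for x in features]))


class ModelServiceTest(unittest.TestCase):
    def setUp(self):
        self.service = ModelService(MeanModel(), n_features=3)

    def test_predicts_for_valid_input(self):
        self.assertAlmostEqual(self.service.predict([1, 2, 3]), 2.0)

    def test_rejects_wrong_feature_count(self):
        with self.assertRaises(ValueError):
            self.service.predict([1, 2])


# Run the tests programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ModelServiceTest)
result = unittest.TextTestRunner().run(suite)
```

The wrapper owns input validation and type handling, so the model code itself can stay close to what was developed in the notebook.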

6. Product Engineer

Takes a machine learning model and runs it in production.

"Production" can mean various things: deploying to servers, apps, phones, ships, cars, etc.

DevOps teams have two primary roles in the ML model life cycle:

  1. They manage operational systems and run the tests that ensure:
    • Security
    • Performance
    • Availability
  2. They are responsible for CI/CD pipeline management.

Coordinates with the machine learning engineer (5) on how to deploy the production models.

Coordinates with the data engineer (3) on how to gather the data the machine learning system requires.