SageMaker
AWS SageMaker is a managed machine learning platform.
SageMaker has four main components:
- Notebooks
- Jobs
- Models
- Endpoints
The main suggested control interface is Jupyter Notebook, but there are also APIs available through the AWS SDKs and command-line tools.
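The same operations are also available programmatically; a minimal sketch using boto3 (assuming credentials and a region are already configured, and only listing existing resources for illustration):

import boto3

# Inspect SageMaker resources without going through the console.
sagemaker_client = boto3.client('sagemaker')

notebooks = sagemaker_client.list_notebook_instances()
jobs = sagemaker_client.list_training_jobs(MaxResults=10)

for job in jobs['TrainingJobSummaries']:
    print(job['TrainingJobName'], job['TrainingJobStatus'])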
Notebooks
Simple way to start an EC2 instance with basic data science tools installed.
There is a limited number of instance types to choose from.
- 3 types at the time of writing.
- 5GB EBS storage.
Launching a notebook instance takes about 5 minutes.
The instance won't show up in your EC2 console; it is managed by SageMaker.
There are some pre-packaged environments (kernels) under the Jupyter Notebook "New" tab, e.g. MXNet, Spark, and TensorFlow.
Version control of the notebooks is a hassle. You need to manually commit and push them to e.g. GitHub.
S3 acts as a natural place to download and upload data.
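For example, with boto3 (a rough sketch; the bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# Pull training data from S3 to the notebook instance...
s3.download_file('<bucket_name>', 'data/training.csv', 'training.csv')

# ...and push processed data or other outputs back.
s3.upload_file('processed.csv', '<bucket_name>', 'data/processed.csv')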
The notebook should not be used to prepare the data; data preparation should be a separate step in an automated pipeline.
You can run terminal commands with !, as you normally can in notebooks.
!git clone https://github.com/awslabs/amazon-sagemaker-examples.git
!conda install -y -c conda-forge xgboost
Jobs
- Select ECR Docker image to use.
- Select instance size.
- Point to S3 where the training data is.
- Set hyperparameters.
- Point to where in S3 the result artifacts should be saved (these map to the boto3 call sketched below).
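A hedged boto3 sketch of such a job definition using create_training_job; the job name, image URI, role ARN, bucket, and hyperparameter values are placeholders:

import boto3

sagemaker_client = boto3.client('sagemaker')

sagemaker_client.create_training_job(
    TrainingJobName='my-training-job',
    # The ECR Docker image that runs the training code.
    AlgorithmSpecification={
        'TrainingImage': '<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest',
        'TrainingInputMode': 'File',
    },
    RoleArn='<sagemaker_role_arn>',
    # Instance size for the job.
    ResourceConfig={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 10,
    },
    # Where the training data is in S3.
    InputDataConfig=[{
        'ChannelName': 'training',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://<bucket_name>/my/training/',
            },
        },
    }],
    # Hyperparameters are passed as strings.
    HyperParameters={'learning_rate': '0.001'},
    # Where in S3 to save the result artifacts.
    OutputDataConfig={'S3OutputPath': 's3://<bucket_name>/my/output/'},
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
)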
SageMaker and the S3 bucket it uses need to be in the same region.
Distributed training seems to be just running the job on separate machines, without any merging of the results?
Job instance logs will be available in CloudWatch under /aws/sagemaker/TrainingJobs.
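If you want to read those logs programmatically rather than in the console, a sketch using the CloudWatch Logs API (the training job name is a placeholder):

import boto3

logs = boto3.client('logs')

# Each training job writes one or more log streams under this log group.
streams = logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix='my-training-job',
)

for stream in streams['logStreams']:
    events = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=stream['logStreamName'],
    )
    for event in events['events']:
        print(event['message'])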
The hyperparameters used can be seen on the job's entry on the Jobs page.
Models
You define a model through the API once you have a model artifact in S3, providing some metadata related to it. You can do this with e.g. boto.
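A minimal boto3 sketch; the model name, image URI, artifact path, and role ARN are placeholders:

import boto3

sagemaker_client = boto3.client('sagemaker')

# Register a model from an artifact that a training job left in S3.
sagemaker_client.create_model(
    ModelName='my-model',
    PrimaryContainer={
        'Image': '<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest',
        'ModelDataUrl': 's3://<bucket_name>/my/output/model.tar.gz',
    },
    ExecutionRoleArn='<sagemaker_role_arn>',
)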
You can download the model artifact to the notebook instance with boto.
Endpoints
How to route traffic to one or more models.
Allows A/B testing between two models and canary deployments.
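The traffic split is configured through production variants in the endpoint configuration; a hedged boto3 sketch that sends 90% of requests to the current model and 10% to a candidate (all names are placeholders):

import boto3

sagemaker_client = boto3.client('sagemaker')

# Two variants behind one endpoint; the weights control the traffic split.
sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',
    ProductionVariants=[
        {
            'VariantName': 'current',
            'ModelName': 'my-model-v1',
            'InstanceType': 'ml.m4.xlarge',
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 0.9,
        },
        {
            'VariantName': 'candidate',
            'ModelName': 'my-model-v2',
            'InstanceType': 'ml.m4.xlarge',
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 0.1,
        },
    ],
)

sagemaker_client.create_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-endpoint-config',
)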
Inference requires the model to be packaged as a single gzipped tar file.
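If your artifacts are not packaged that way yet, a small illustrative sketch (the directory name is a placeholder):

import tarfile

# Bundle local model artifacts into a single gzipped tar for SageMaker to serve.
with tarfile.open('model.tar.gz', 'w:gz') as archive:
    archive.add('my_model_directory', arcname='.')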
Endpoint logs include CPU/memory utilization, invocation count, error count and latency times.
General Concepts
Accept hyperparameters at your code entry point.
def model_fn(features, labels, mode, hyperparameters=None):
    if hyperparameters is None:
        hyperparameters = dict()
    learning_rate = hyperparameters.get('learning_rate', 0.001)
Report metrics back from your training. Here we are using AUC. Area under the curve (AUC) is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. AUC is frequently used to compare predictive models, although it can be quite noisy. SageMaker uses AWS CloudWatch to visualize metrics.
metric_ops = {
    'roc_auc': tf.metrics.auc(
        labels,
        predictions,
        summation_method='careful_interpolation'
    ),
    # ...
}
return tf.estimator.EstimatorSpec(
    # ...
    eval_metric_ops=metric_ops,
)
Build a SageMaker Estimator. The Estimator describes how to run your code.
from sagemaker.tensorflow import TensorFlow
# The parameters that are constant and will not be tuned
shared_hyperparameters = {
    'number_layers': 5,
}
tf_estimator = TensorFlow(
    entry_point='my/tensorflow/model.py',
    role='<sagemaker_role_arn>',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    training_steps=10000,
    hyperparameters=shared_hyperparameters,
)
Select your performance metrics. Tell SageMaker how to extract metric values from the logs with a regular expression. The last match in the logs will be taken as the final performance value.
objective_metric_name = 'ROC-AUC'
objective_type = 'Maximize'
metric_definitions = [
    {'Name': 'ROC-AUC', 'Regex': 'roc_auc = ([0-9\\.]+)'},
]
Define hyperparameter search space.
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-1),
    "number_nodes": IntegerParameter(32, 512),
    "optimizer": CategoricalParameter(['Adam', 'SGD']),
}
Configure the hyperparameter optimization: where to read the data from, how many training jobs to run in total, how many to run in parallel, etc.
tuner = HyperparameterTuner(
    tf_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=100,
    max_parallel_jobs=5,
    objective_type=objective_type,
)
channel = {
    'training': 's3://<bucket_name>/my/training_file.csv',
    'test': 's3://<bucket_name>/my/test_file.csv',
}
tuner.fit(inputs=channel)
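After the tuning run finishes, you can look up and deploy the winner; a sketch assuming the HyperparameterTuner helpers of the SageMaker Python SDK:

# Name of the training job that achieved the best objective value.
print(tuner.best_training_job())

# Deploy the best model behind an endpoint (see Endpoints above).
predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')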