How to Integrate MindsDB with Amazon SageMaker

Zoran Pandovski

Learn how you can use MindsDB and SageMaker to train and deploy models within SageMaker, create endpoints and take advantage of automated machine learning with MindsDB.

Bring your own MindsDB container


Amazon SageMaker, Amazon’s cloud-based machine learning platform, lets you package and use your own algorithm. If no pre-built Amazon SageMaker container image fits your use case, you can package your own script or algorithm in a container and use that instead.

In this post, you’ll learn how you can use MindsDB and SageMaker to train and deploy models within SageMaker, create endpoints and take advantage of automated machine learning with MindsDB. The code for this tutorial can be found inside the mindsdb-sagemaker-container GitHub repository.

The mindsdb_impl Code Structure

All of the components we need to package MindsDB for Amazon SageMaker are located inside the mindsdb_impl directory:


This directory contains the following files:

  • nginx.conf is the configuration file for the nginx server.
  • predictor.py is the program that actually implements the Flask web server and the MindsDB predictions for this app. We have modified it to use the MindsDB Predictor and to accept different types of tabular data for predictions.
  • serve is the program that is started when the container is started for hosting.
  • train is the program that is invoked when the container is run for training. We have modified this program to use the MindsDB Predictor interface.
  • wsgi.py is a small wrapper used to invoke the Flask app.
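
To make the hosting side more concrete: SageMaker expects the container to answer GET /ping with a 200 and to accept POST /invocations. The sketch below shows that contract with a dependency-free WSGI app. It is not the repository's Flask code; `fake_predict` is a hypothetical stand-in for the MindsDB Predictor call, and the JSON response shape is an assumption.

```python
import csv
import io
import json


def fake_predict(rows):
    """Hypothetical stand-in for the MindsDB Predictor: one dummy result per input row."""
    return [{"row": index, "prediction": None} for index, _ in enumerate(rows)]


def app(environ, start_response):
    """Minimal WSGI app exposing the two routes that SageMaker hosting calls."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "")

    if method == "GET" and path == "/ping":
        # Health check: a 200 response tells SageMaker the container is up.
        start_response("200 OK", [("Content-Type", "application/json")])
        return [b"{}"]

    if method == "POST" and path == "/invocations":
        if environ.get("CONTENT_TYPE", "") != "text/csv":
            start_response("415 Unsupported Media Type", [("Content-Type", "text/plain")])
            return [b"this sketch only accepts text/csv"]
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))
        payload = json.dumps(fake_predict(rows)).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [payload]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

To try it locally, `wsgiref.simple_server.make_server("", 8080, app).serve_forever()` from the standard library serves the app on port 8080.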

Create a Docker Image 

First, we need to create a Docker image. If you don’t have any experience with Docker, please see the official Docker website for more information. In short, Docker gives you the flexibility to easily create, copy, and deploy your application environment.

As a containerization platform, Docker packages your application together with all of its dependencies into a container that should run the same way in any environment. So, with Docker there is no more “It doesn’t work on my machine” excuse.

The Dockerfile describes the image that we want to build. MindsDB works with Python versions greater than 3.6.x. In this example, we will use Python 3.7 on the standard Ubuntu 18.10 image.

The next step is to define all of the packages that our environment requires, such as wget, python3.7, nginx, and distutils:


Next, we install MindsDB and the other server dependencies from PyPI:


Note that `--default-timeout` is increased to 1000 seconds. pip’s default is 15 seconds, but some of the MindsDB dependencies, such as torch, take longer to download and install, so we raise the limit to avoid timeouts.

The last part of the Dockerfile copies the code from the mindsdb_impl directory to /opt/program, which will be the working directory, and makes the train and serve scripts executable.

Build Docker Image

Let’s build an image from the Dockerfile that we have created. The Docker command for building an image is `build`, and its `-t` parameter sets the name of the image:


After getting the `Successfully built` message, we should be able to list the image by running the list command:


Dataset

For our example, we will use the diabetes patient records available for download from the Pima Indians Diabetes Database. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Our model will predict whether or not the patients in the dataset have diabetes.

Test the Container Locally

Before we deploy the image to the Amazon Elastic Container Registry, let's test the whole setup locally. The local_test directory contains all of the scripts and data samples for testing the built container on the local machine:

  •  train_local.sh: Instantiates the container configured for training.
  •  serve_local.sh: Instantiates the container configured for serving.
  •  predict.sh: Runs predictions against a locally instantiated server.
  •  test_dir: This directory is mounted in the container.
  •  test_data: This directory contains a few tabular format datasets used for getting the predictions.
  •  input/data/training/file.csv: The training data.
  •  model: The directory where MindsDB writes the model files.
  •  call.py: This CLI script can be used for testing the deployed model on the SageMaker endpoint.

First, we train the model by executing the train_local.sh script, passing the image name as a parameter:

This script starts the container and passes it the train argument so that it knows to invoke the train program. Training time depends on the number of rows and columns in the dataset. After executing the script, you should see some statistics from the MindsDB logger, such as the data distribution per column and the testing accuracy. If the training finishes successfully, it displays the execution time in seconds and a Training complete message. The newly created models are saved inside the local_test/test_dir/model directory.
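
For orientation, here is a minimal sketch of how a train program can locate its inputs. It assumes the standard SageMaker layout under /opt/ml (hyperparameters in input/config/hyperparameters.json, one directory per channel under input/data, and model output in model). The function name and the returned keys are hypothetical, and the repository's train program may be organized differently.

```python
import json
from pathlib import Path


def load_training_inputs(prefix="/opt/ml", channel="training"):
    """Collect what a train program needs from the SageMaker directory layout."""
    prefix = Path(prefix)

    # SageMaker writes hyperparameters as a JSON object whose values are strings.
    with open(prefix / "input" / "config" / "hyperparameters.json") as f:
        hyperparameters = json.load(f)

    data_dir = prefix / "input" / "data" / channel
    csv_files = sorted(data_dir.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"no .csv files found in {data_dir}")

    return {
        "to_predict": hyperparameters["to_predict"],  # the column MindsDB should learn
        "csv_files": csv_files,
        "model_dir": prefix / "model",  # files written here become the model artifact
    }
```

Passing a different `prefix` makes it possible to point the same function at a local directory that mirrors this layout.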

Next, let’s start the inference server that provides an endpoint for getting predictions. Inside the local_test directory, execute the serve_local.sh script:

 

A “starting the inference server with n workers” message will be displayed. The server then starts, and we can access the invocations endpoint at http://localhost:8080/invocations.

To run predictions against the invocations endpoint, we can use the predict.sh script:

 

The arguments sent to the script are the test data and the content type of the data. In this example, we use text/csv, but MindsDB also works with data in Excel, JSON, and TSV formats. We should now see a prediction message returned from MindsDB.
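
For reference, the same kind of request can be sent from Python with only the standard library. This is a sketch, not a port of predict.sh: the function names are hypothetical, and it only assumes what is stated above, namely that the server listens on http://localhost:8080/invocations and accepts a body with a matching Content-Type header.

```python
import urllib.request


def build_invocation_request(payload_path, content_type="text/csv",
                             url="http://localhost:8080/invocations"):
    """Build a POST request carrying the file's bytes to the invocations endpoint."""
    with open(payload_path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )


def predict(payload_path, content_type="text/csv"):
    """Send the request to the local server and return the raw response text."""
    request = build_invocation_request(payload_path, content_type)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the container from serve_local.sh running, calling `predict` with the path to a CSV file from test_data returns whatever the server responds with.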


Deploy the Image on Amazon ECR (Elastic Container Registry)

Now that we have successfully received predictions, we know that the whole setup works, and the next step is to push the image to ECR.

Amazon ECR is a container registry that makes it easy to store and deploy container images. Inside the mindsdb-sagemaker-container repository there is a build-and-push.sh script. We will use it to push the image to ECR. The only argument that the script accepts is the name of the image:

The script looks for an Amazon ECR repository in the default region that you are using, and creates a new one if it doesn’t exist.

Use MindsDB in Amazon SageMaker

Training and Inference flow on SageMaker


Now that we have the mindsdb-impl image deployed to Amazon ECR, we are ready to use it inside SageMaker. The only thing that we need is the Image URI from Amazon ECR. Copy the URI, which should look like 846763053924.dkr.ecr.us-east-1.amazonaws.com/mindsdb_impl:latest, with your own account ID and region.
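
If you prefer to assemble the URI rather than copy it, private ECR image URIs follow a fixed pattern in the standard AWS regions. The helper below is a small illustrative sketch; its name is hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return the URI of an image in a private Amazon ECR repository.

    Covers the standard AWS partition only; other partitions use a different domain.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Account ID, region, and repository as they appear in the URI used in this post.
print(ecr_image_uri("846763053924", "us-east-1", "mindsdb_impl"))
```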

Create Train Job

The first thing we need to do is to start a model training job from the Amazon SageMaker console. Follow the steps below to start a training job and use MindsDB to create the models.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

2. From the left panel, choose Create Training Job and provide the following information:

  • Job name
  • IAM role - ideally one with the AmazonSageMakerFullAccess IAM policy attached
  • Algorithm source - Your own algorithm container in ECR

3. Provide container ECR path

  • Container - the ECR registry Image URI that we have pushed
  • Input mode - File

4. Resource configuration - leave the default instance type and count

5. Hyperparameters - MindsDB requires the to_predict column name so that it knows which column we want to predict, e.g.,

  • Key - to_predict
  • Value - Class (the column in the diabetes dataset)

6. Input data configuration

  • Channel name - training
  • Data source - S3
  • S3 location - path to the S3 bucket where the dataset is located

7. Output data configuration - path to the S3 location where the models will be saved

AWS console display for SageMaker training job 


After filling out all of the required configuration options, click on Create training job. In the training jobs table, you should see the name of the newly created job with the status InProgress.
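
The same settings can also be expressed as a request for the boto3 SageMaker client. The sketch below only builds the request dictionary, mirroring the console fields above. The helper name, instance type, volume size, and runtime limit are assumptions chosen for illustration, not values taken from this tutorial.

```python
def training_job_request(job_name, image_uri, role_arn, s3_input, s3_output,
                         to_predict, instance_type="ml.m4.xlarge"):
    """Build the keyword arguments for the SageMaker create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # the ECR image URI we pushed
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "HyperParameters": {"to_predict": to_predict},  # values must be strings
        "InputDataConfig": [{
            "ChannelName": "training",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_input,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},
        # Assumed resource settings; adjust to your dataset and budget.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

With credentials configured, the dictionary would be passed as `boto3.client("sagemaker").create_training_job(**request)`.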

When the training job finishes successfully, the Completed status will be displayed. To make sure that the models were created, go to the S3 bucket provided as output.

In S3, you should see a new folder named after the training job, e.g., mindsdb_impl/output/, that contains the compressed model.tar.gz file with the model artifacts.

Model Creation

Before we deploy the model to SageMaker, we need to provide the locations of the model artifacts and the container. Let’s go to Create model and add the required settings:

1. Model name - must be unique

2. IAM role - ideally one with the AmazonSageMakerFullAccess IAM policy attached

3. Container input options

  • Provide model artifacts and inference image location
  • Use a single model
  • Location of the inference code - 846763053924.dkr.ecr.us-east-1.amazonaws.com/mindsdb_impl:latest
  • Location of model artifacts - path to model.tar.gz inside the S3 bucket.

AWS console display for SageMaker models 


After the required settings are added, click on Create model.
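
As a sketch of the equivalent API call, the request for the boto3 SageMaker client's `create_model` needs the same pieces of information. The helper name is hypothetical.

```python
def model_request(model_name, image_uri, model_data_url, role_arn):
    """Build the keyword arguments for the SageMaker create_model call."""
    return {
        "ModelName": model_name,          # must be unique within the account and region
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,               # location of the inference code
            "ModelDataUrl": model_data_url,   # S3 path to model.tar.gz
        },
    }
```

It would be used as `boto3.client("sagemaker").create_model(**model_request(...))`.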

Endpoint Configuration

In the endpoint configuration, we specify which models to deploy and their hardware requirements:

  1. Endpoint configuration name
  2. Add model - select the previously created model
  3. Choose Create endpoint configuration.


AWS console Endpoint configuration screen 
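
Expressed as a boto3 request, an endpoint configuration is a list of production variants, each tying a model to an instance type and count. In this sketch the helper name, the variant name, and the default instance type are assumptions.

```python
def endpoint_config_request(config_name, model_name,
                            instance_type="ml.t2.medium", instance_count=1):
    """Build the keyword arguments for the SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",      # assumed name for the single variant
            "ModelName": model_name,          # the model created in the previous step
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,    # assumed; pick one that fits your model
        }],
    }
```

It would be passed as `boto3.client("sagemaker").create_endpoint_config(**request)`.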

Create Endpoint

Lastly, we create the endpoint and attach the endpoint configuration that specifies which models to deploy and their hardware requirements:

  1. Endpoint name
  2. Attach endpoint configuration - select the previously created endpoint configuration
  3. Choose Create endpoint.

AWS console Endpoint screen

After some time, SageMaker will create a new instance and start the inference code.

If there are errors, we can check the CloudWatch logs for additional details. Before putting the endpoint in service, SageMaker calls the ping API inside the container and, if the response code is 200, starts the endpoint.

After that, the endpoint status changes to InService and the endpoint is ready for use.
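
When creating the endpoint from code instead of the console, the status change can be polled. The sketch below takes the describe call as a parameter so that it can be exercised without AWS access. The function name, polling interval, and poll limit are assumptions.

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name,
                          poll_seconds=30, max_polls=60):
    """Poll until the endpoint reports InService; raise if it fails or never settles."""
    for _ in range(max_polls):
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to start")
        time.sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_name} not InService after {max_polls} polls")
```

With boto3, this would follow `sm.create_endpoint(EndpointName=..., EndpointConfigName=...)` and be called as `wait_until_in_service(sm.describe_endpoint, ...)`, where `sm = boto3.client("sagemaker")`. boto3 also ships a built-in `endpoint_in_service` waiter that does the same job.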

Call Endpoint from call.py Script

This script is located in the local_test directory and can be used as a CLI for testing the endpoint, e.g.:

The required arguments are:

  • endpoint - the name of the SageMaker endpoint
  • dataset - the location of test dataset
  • content type - MIME type of the data
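
A CLI with these three arguments could look roughly like the sketch below. It is not the repository's call.py: the flag spellings are assumptions, and boto3 is imported inside `main` so that the argument parsing can be tried without AWS libraries installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the three required arguments described above (flag names are assumed)."""
    parser = argparse.ArgumentParser(description="Send a dataset to a SageMaker endpoint.")
    parser.add_argument("--endpoint", required=True, help="name of the SageMaker endpoint")
    parser.add_argument("--dataset", required=True, help="location of the test dataset")
    parser.add_argument("--content-type", required=True, help="MIME type of the data")
    return parser.parse_args(argv)


def main(argv=None):
    """Read the dataset, invoke the endpoint, and print the raw response."""
    import boto3  # lazy import: only needed when the endpoint is actually called

    args = parse_args(argv)
    with open(args.dataset, "rb") as f:
        body = f.read()
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=args.endpoint, ContentType=args.content_type, Body=body
    )
    print(response["Body"].read().decode("utf-8"))
```

Calling `main()` at the bottom of a script turns this into a command-line tool that reads its arguments from the shell.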

Call Endpoint from Jupyter Notebook

The endpoint is InService, so we can create a Jupyter notebook to call the endpoint. Amazon SageMaker provides pre-built notebook instances that run Jupyter notebooks. First, create a new notebook instance choosing the default configuration and start the instance. 

Once the status is InService, click Open Jupyter. Inside the Jupyter menu, create a new notebook and choose conda_pytorch_p36, which contains most of the required dependencies. The following code uses sagemaker-runtime to invoke the endpoint and display the response from it.
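
A minimal sketch of such a call, assuming a CSV payload: the `invoke` helper is hypothetical, and because the response format depends on the container, it tries JSON first and falls back to plain text.

```python
import json


def invoke(runtime_client, endpoint_name, payload, content_type="text/csv"):
    """Send the payload to the endpoint and decode the response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=payload
    )
    text = response["Body"].read().decode("utf-8")
    try:
        return json.loads(text)
    except ValueError:
        # The response format is an assumption; return the raw text if it is not JSON.
        return text


# In the notebook (endpoint and file names are placeholders):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# with open("test.csv", "rb") as f:
#     print(invoke(runtime, "my-endpoint", f.read()))
```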


Run the code and you should see the prediction response from the endpoint:


Shutdown Instances

Don’t forget to delete the endpoint and stop the notebook instances after you are done.
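
The cleanup can also be scripted. This sketch takes a boto3 SageMaker client as a parameter. The helper name is hypothetical, and deleting the endpoint configuration and model is optional tidying beyond what the step above requires.

```python
def clean_up(sagemaker_client, endpoint_name, notebook_name=None,
             endpoint_config_name=None, model_name=None):
    """Delete the endpoint and, optionally, related resources; stop the notebook."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    if endpoint_config_name:
        sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    if model_name:
        sagemaker_client.delete_model(ModelName=model_name)
    if notebook_name:
        # Stopping keeps the notebook's storage but ends the instance charges.
        sagemaker_client.stop_notebook_instance(NotebookInstanceName=notebook_name)
```

It would be called with `boto3.client("sagemaker")` as the first argument.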

Author Bio

Zoran is a full stack developer based in Macedonia. He works as MindsDB's senior full stack developer and works on everything from building and managing the website to supporting the open source product to working with users on their support questions.
