Develop and deploy ML models using Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot

Data generates new value for businesses through insights and predictive models. However, although data is plentiful, available data scientists are few and far between. Despite attempts in recent years to train more data scientists in academia and elsewhere, a significant shortage remains and is likely to continue into the near future.

To accelerate model building, data scientists and ML practitioners often take advantage of AutoML (automated machine learning) tools that can augment their work. These tools take over the tedious, iterative work of data preparation, model training, and tuning, which helps data scientists be more productive when developing ML models.

In this post, we discuss how data scientists and other advanced analytics users can use Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot to analyze their data sets and build highly predictive ML models. To demonstrate these capabilities, we use the Pima Indian Diabetes public data set from UCI.

Solution overview

The Pima Indian Diabetes data set contains information on 768 women from a population near Phoenix, Arizona. The outcome tested was diabetes. It contains 268 positive and 500 negative observations, with one target and eight attributes: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI (body mass index), age, and diabetes pedigree function. We use this data set to demonstrate how to use Autopilot and Data Wrangler to build highly predictive ML models without having to write any code.
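Although the walkthrough itself needs no code, it can be useful to sanity-check the positive/negative split described above before importing the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row and that the target column is named `Outcome`; both are assumptions about your copy of the file, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, target="Outcome"):
    """Count how many rows fall into each value of the target column.

    The column name "Outcome" is an assumption; pass the name your file uses.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[target] for row in reader)


# Made-up sample standing in for the real file (values are illustrative only).
sample = io.StringIO(
    "Pregnancies,Glucose,BMI,Age,Outcome\n"
    "2,120,30.0,40,1\n"
    "0,90,25.0,28,0\n"
    "5,160,35.0,52,1\n"
)
counts = class_balance(sample)
print(counts)
```

To run the same check on the real data, open your local copy of the CSV with `open(path, newline="")` and pass the file object to `class_balance`.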

The high-level steps for building an ML model are as follows:

  1. Perform exploratory data analysis.
  2. Perform feature engineering.
  3. Train the model.
  4. Validate the model.
  5. Deploy the model.
  6. Make predictions.

We walk through these steps as we build a binary classification model using the Pima Indian Diabetes data set.

Import your data set with Data Wrangler

Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.
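The import steps that follow read the CSV from Amazon S3, so the file must already be in a bucket. If it is still on your local machine, one way to stage it is the boto3 `upload_file` call. The sketch below is an illustration, not part of the original walkthrough: the helper name `stage_csv_in_s3`, the bucket name, and the key prefix are hypothetical, and the function accepts any client object that exposes `upload_file(Filename, Bucket, Key)`.

```python
from pathlib import PurePosixPath


def stage_csv_in_s3(s3_client, local_path, bucket, prefix="data-wrangler/input"):
    """Upload a local CSV and return the s3:// URI to browse to in Data Wrangler.

    s3_client is expected to behave like boto3.client("s3"); only its
    upload_file(Filename, Bucket, Key) method is used. The default prefix
    is a hypothetical choice, not something Data Wrangler requires.
    """
    file_name = PurePosixPath(local_path).name
    key = f"{prefix.strip('/')}/{file_name}"
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (requires boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   uri = stage_csv_in_s3(boto3.client("s3"), "pima-indian-diabates.csv", "my-example-bucket")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real bucket.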

  1. On the Studio console, under File, choose New.
  2. Choose Flow.

If this is your first time opening Data Wrangler, you may have to wait a few minutes for it to be ready.

  3. Rename your flow as needed.
  4. For Import data, choose your data source.

  5. Upload the pima-indian-diabates.csv file from Amazon S3.

You can now preview your data set.

  6. In the Details pane, deselect Enable sampling (this is a small data set, so we don’t need it).

  7. Choose Import dataset.

You now have a flow diagram.

  8. Choose the + icon next to Data types and choose Edit data types.

  9. Make sure that Data Wran

