Data preparation remains a major challenge in machine learning (ML). Data scientists and engineers must write queries and code to pull data from source data stores, then write more queries to transform that data into features for model development and training. This pipeline work doesn’t build ML models; it only makes the data available to them. Amazon SageMaker Data Wrangler gives data scientists and engineers a visual interface for preparing data in the early phase of developing ML applications.
Data Wrangler simplifies data preparation and feature engineering through a single visual interface. It comes with over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can now use Snowflake as a data source in Data Wrangler to prepare data stored in Snowflake for ML.
In this post, we use a simulated dataset, provided by Snowflake, that represents loans from a financial services provider. The dataset contains lender data about loans granted to individuals. We first walk through setting up Snowflake as the data source, then explore and transform the data by building a data flow in Data Wrangler, and finally export that flow to Amazon SageMaker Pipelines so the prepared data can be used later in ML models.
This post assumes you have the following:
- A Snowflake account with permissions to create storage integrations
- Data in a table in Snowflake
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles
- An Amazon Simple Storage Service (Amazon S3) bucket that Data Wrangler can use for outputting transformed data
Set up permissions for Data Wrangler
In this section, we cover the permissions required to set up Snowflake as a data source for Data Wrangler. You perform steps in both the AWS Management Console and Snowflake. The AWS user needs permission to create policies, roles, and secrets, and the Snowflake user needs the ability to create storage integrations.
All permissions for AWS resources are managed through the IAM role attached to your Amazon SageMaker Studio instance. Snowflake-specific permissions are managed by the Snowflake admin, who can grant granular permissions and privileges to each Snowflake user. These cover databases, schemas, tables, warehouses, and storage integration objects. Make sure that the correct permissions are set up outside of Data Wrangler.
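To make the Snowflake side of this setup more concrete, the following is a minimal sketch of how the statements a Snowflake admin might run could be generated. It is not the exact procedure from this post: the integration name, role ARN, bucket, prefix, and Snowflake role are all hypothetical placeholders, and the helper functions are our own illustration. The SQL follows Snowflake's general `CREATE STORAGE INTEGRATION` and `GRANT USAGE ON INTEGRATION` syntax.

```python
import re

# Simple check for unquoted Snowflake identifiers, used to avoid
# building SQL from unexpected input.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def _check_identifier(name: str) -> str:
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a simple Snowflake identifier: {name!r}")
    return name


def storage_integration_sql(name: str, role_arn: str, bucket: str, prefix: str = "") -> str:
    """Render a CREATE STORAGE INTEGRATION statement for one S3 location.

    All argument values are placeholders supplied by the caller (assumptions,
    not values taken from this post).
    """
    name = _check_identifier(name)
    clean_prefix = prefix.strip("/")
    location = f"s3://{bucket}/{clean_prefix}/" if clean_prefix else f"s3://{bucket}/"
    return (
        f"CREATE STORAGE INTEGRATION {name}\n"
        "  TYPE = EXTERNAL_STAGE\n"
        "  STORAGE_PROVIDER = 'S3'\n"
        "  ENABLED = TRUE\n"
        f"  STORAGE_AWS_ROLE_ARN = '{role_arn}'\n"
        f"  STORAGE_ALLOWED_LOCATIONS = ('{location}');"
    )


def grant_integration_sql(name: str, snowflake_role: str) -> str:
    """Render the grant that lets a Snowflake role use the integration."""
    return (
        f"GRANT USAGE ON INTEGRATION {_check_identifier(name)} "
        f"TO ROLE {_check_identifier(snowflake_role)};"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(storage_integration_sql(
        "dw_s3_int",
        "arn:aws:iam::123456789012:role/example-snowflake-role",
        "example-bucket",
        "dw-output",
    ))
    print(grant_integration_sql("dw_s3_int", "example_ds_role"))
```

Generating the statements from a small function keeps the bucket path and role ARN in one place, which can help when the same integration has to be recreated in several environments. The printed SQL would still need to be run by a Snowflake user with the ability to create storage integrations.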
AWS access requirements
Snowflake requires the following permissions on your
Source - Continue Reading: https://aws.amazon.com/blogs/machine-learning/prepare-data-from-snowflake-for-machine-learning-with-amazon-sagemaker-data-wrangler/