Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.
AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.
In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to enable cross-account permissions for Data Wrangler in Account B to access the data tables located in Account A’s data lake via Lake Formation permissions.
With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.
Before you dive into the setup process, ensure the data to be shared across accounts are crawled and cataloged as detailed in this post. Let us presume this process has been completed and the databases and tables already exist in Lake Formation.
The following are the high-level steps to implement this solution:
- In Account A, register your S3 bucket using Lake Formation and create the necessary databases and tables for the data if doesn’t exist.
- The Lake Formation administrator can now share datasets from Account A to other accounts. Lake Formation shares these resources using AWS Resource Access Manager (AWS RAM).
- In Account B, accept the resource share request using AWS RAM. Create a local resource link for the shared table via Lake Formation and create a local database.
- Next, you need to grant permissions for the SageMaker Studio execution role in Account B to access the shared table and the resource link you created in the previous step.
- In Data Wrangler, use the local database and the resource link you created in Account B to query the dataset using the Athena connector and per
Source - Continue Reading: https://aws.amazon.com/blogs/machine-learning/enable-cross-account-access-for-amazon-sagemaker-data-wrangler-using-aws-lake-formation/