For data scientists and machine learning (ML) developers, data preparation is one of the most challenging and time-consuming tasks of building ML solutions. In an often iterative and highly manual process, data must be sourced, analyzed, cleaned, and enriched before it can be used to train an ML model.
Typical tasks associated with data preparation include:
- Locating data – Finding where raw data is stored and getting access to it
- Data visualization – Examining statistical properties for each column in the dataset, building histograms, studying outliers
- Data cleaning – Removing duplicates, dropping or filling entries with missing values, removing outliers
- Data enrichment and feature engineering – Processing columns to build more expressive features, selecting a subset of features for training
Data scientists and developers typically iterate through these tasks until a model reaches the desired level of accuracy. This iterative process can be tedious, error-prone, and difficult to replicate as a deployable data pipeline. Fortunately, with Amazon SageMaker Data Wrangler, you can reduce the time it takes to prepare data for ML from weeks to minutes by accelerating the process of data preparation and feature engineering. With Data Wrangler, you can complete each step of the ML data preparation workflow, including data selection, cleansing, exploration, and visualization, with little to no code, which simplifies the data preparation process.
In a previous post introducing Data Wrangler, we highlighted its main features and walked through a basic example using the well-known Titanic dataset. For this post, we dive deeper into Data Wrangler and its integration with other Amazon SageMaker features to help you get started quickly.
Now, let’s get started with Data Wrangler.
In this post, we use Data Wrangler to prepare data for creating ML models to predict credit risk and help financial institutions more easily approve loans. The result is an exportable data flow capturing the data preparation steps required to prepare the data for modeling. We use a sample dataset containing information on 1,000 potential loan applications, built from the German Credit Risk dataset. This dataset contains categorical and numeric features covering the demographic, employment, and financial attributes of loan applicants, as well as a label indicating whether the individual is high or low credit risk. The features require cleaning and manipulation before we can use them as training data for an ML model. A modified version of the dataset, which we use in this post, has been saved in a sample data Amazon Simple Storage Service (Amazon S3) bucket. In the next section, we walk through how to download the sample data and upload it to your own S3 bucket.
Source - Continue Reading: https://aws.amazon.com/blogs/machine-learning/prepare-data-for-predicting-credit-risk-using-amazon-sagemaker-data-wrangler-and-amazon-sagemaker-clarify/