# Using machine learning to predict vessel time of arrival with Amazon SageMaker

According to the International Chamber of Shipping, 90% of world commerce happens at sea. Vessels are transporting every possible kind of commodity, including raw materials and semi-finished and finished goods, making ocean transportation a key component of the global supply chain. Manufacturers, retailers, and the end consumer are reliant on hundreds of thousands of ships carrying freight across the globe, delivering their precious cargo at the port of discharge after navigating for days or weeks.

As soon as a vessel arrives at its port of call, off-loading operations begin. Bulk cargo, containers, and vehicles are discharged, depending on the kind of vessel. Cargo off-loading triggers complex landside operations that involve multiple actors. Terminal operators, trucking companies, railways, customs, and logistics service providers work together to make sure that goods are delivered to the consignee according to a specific SLA and in the most efficient way.

Shipping companies publicly advertise their vessels’ estimated time of arrival (ETA) in port, and downstream supply chain activities are planned accordingly. However, delays often occur, and the ETA might differ from the vessel’s actual time of arrival (ATA), for instance due to technical or weather-related issues. This impacts the entire supply chain, in many instances reducing productivity and increasing waste and inefficiencies.

Predicting the exact time a vessel arrives in a port and starts off-loading operations poses remarkable challenges. Today, most companies rely on experience to guess the ATA and on improvisation to cope with its fluctuations. Very few providers use machine learning (ML) techniques to predict arrival times scientifically and help companies plan their supply chains better. In this post, we show how to use Amazon SageMaker, a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly, to predict the arrival time of vessels.

## Study

Vessel ETA prediction is a very complex problem. It involves a huge number of variables and a lot of uncertainty. When you apply a technique like ML to a problem like this, it’s crucial to have a baseline (such as an expert user or a rule-based engine) to compare against, so you can judge whether your model is good enough.

This work is a study of the challenge of accurately predicting the vessel ETA. It’s not a complete solution, but it can be seen as a reference for you to implement your own sound and complete model, based on your data and expertise. The solution includes the following high-level steps:

1. Reduce the problem to a single vessel voyage (when the vessel departs from one given port and gets to another).
2. Explore a temporal dataset.
3. Identify the spatiotemporal aspects in the checkpoints sent by each vessel.
4. From a given checkpoint, predict the ETA in days for the vessel to reach the destination port (inside a given vessel voyage).

The following image shows multiple voyages of the same vessel in different colors. A shipment is composed of multiple voyages.

## Methodology

Vesseltracker, an AWS customer focused on maritime transportation intelligence, shared with us a sample of the historical data they collect from vessels (checkpoints) and ports (port calls) every day. The checkpoints contain the main characteristics of each vessel, plus their current geoposition, speed, direction, draught, and more. The port calls are the dates and times of each vessel’s arrival or departure.

Because we had to train an ML model to predict continuous values, we experimented with regression algorithms such as XGBoost, Random Forest, and MLP. At the end of the experiments (including hyperparameter optimization), we opted for the Random Forest Regressor because it gave us better performance.

To train a regressor, we had to transform the data and prepare one feature to be the label, in this case, the number of days (float) that the vessel takes to get to the destination port.
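As a minimal sketch of how such a label could be derived, the following snippet converts the time between a checkpoint and the arrival port call into a float number of days. The function name `remaining_days` and the sample timestamps are hypothetical and used only for illustration; they are not taken from the study's notebooks.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def remaining_days(checkpoint_ts, arrival_ts):
    """Return the time left until arrival as a float number of days."""
    return (arrival_ts - checkpoint_ts).total_seconds() / SECONDS_PER_DAY


# Hypothetical voyage: three checkpoints and the arrival port call.
arrival = datetime(2019, 1, 7, 0, 0, tzinfo=timezone.utc)
checkpoints = [
    datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2019, 1, 4, 0, 0, tzinfo=timezone.utc),
    datetime(2019, 1, 7, 0, 0, tzinfo=timezone.utc),
]

labels = [remaining_days(ts, arrival) for ts in checkpoints]
print(labels)  # [5.5, 3.0, 0.0]
```

The checkpoint that coincides with the arrival gets a label of 0.0, and earlier checkpoints get progressively larger values.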

For feature engineering, it’s important to highlight the following steps:

1. Identify each vessel voyage in the temporal dataset. Join with the port calls to mark the departure and the arrival checkpoints.
2. Working backward from the destination port to the departure port, compute the accumulated time per checkpoint.
3. Apply any geo-hashing mechanism to encode the GPS coordinates (latitude, longitude) and transform them into a useful feature.
4. Compute the great-circle distance between each sequential pair of geopositions from checkpoints.
5. Because the vessel’s speed changes over time, compute a new feature (called efficiency) that helps the model weigh the vessel displacement (speed and performance) before computing the remaining time.
6. Use the historical data of all voyages and the great-circle distance between each checkpoint to create an in-memory graph that shows us all the paths and distances between each segment (checkpoints).
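Steps 3 and 4 can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the code used in the study: it uses the haversine formula for the great-circle distance, assumes kilometers and a mean Earth radius of 6,371 km, and uses the standard geohash encoding with a precision of four characters (chosen because the sample geohash values in this post have four characters). The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the unit (km) is an assumption
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geohash_encode(lat, lon, precision=4):
    """Encode a position as a standard geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# Two consecutive positions taken from the checkpoint sample in this post.
p1 = (18.5195, -167.327833)
p2 = (18.5295, -167.785333)
print(great_circle_km(*p1, *p2))
print(geohash_encode(*p1), geohash_encode(*p2))
```

A shorter geohash groups nearby positions into the same cell, which can make the feature less sparse; a longer one preserves more spatial detail.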

With the graph from step 6, you can compute the distance between the current position of the vessel and the destination port. This process resulted in a new feature called accum_dist, or accumulated distance. As the feature importance analysis shows, this feature has a high linear correlation with the target, which makes it highly important to the model.
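One way to picture the in-memory graph is as a dictionary of weighted edges between checkpoints (for example, geohash cells), with the accumulated distance computed as a shortest path. The sketch below uses Dijkstra's algorithm; treating the graph as undirected and using the shortest path are assumptions made for illustration, and the segment names and distances are hypothetical.

```python
import heapq
from collections import defaultdict


def build_graph(segments):
    """Build an undirected weighted graph from (node_a, node_b, distance) tuples."""
    graph = defaultdict(dict)
    for a, b, dist in segments:
        # Keep the shortest observed distance if a segment appears more than once.
        if dist < graph[a].get(b, float("inf")):
            graph[a][b] = dist
            graph[b][a] = dist
    return graph


def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm; returns the path length, or None if unreachable."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None


# Hypothetical segments between geohash cells, with distances between them.
segments = [
    ("wtx4", "wtty", 300.0),
    ("wtty", "wy9g", 460.0),
    ("wtx4", "wy9g", 900.0),
    ("wy9g", "wydh", 52.0),
]
graph = build_graph(segments)
accum_dist = shortest_distance(graph, "wtx4", "wydh")
print(accum_dist)  # 812.0
```

Here the route through `wtty` (300 + 460 + 52) is shorter than the direct segment to `wy9g` (900 + 52), so the accumulated distance is 812.0.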

## Amazon SageMaker

We chose Amazon SageMaker to manage the entire pipeline of our study. SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Traditional ML development is a complex, expensive, and iterative process made even harder because there are no integrated tools for the entire ML workflow. You need to stitch together tools and workflows, which is time consuming and error prone. SageMaker solves this challenge by providing all the components used for ML in a single toolset so models get to production faster with much less effort and at lower cost.

### Dataset

The following tables show samples of the data we used to create the dataset. The port calls (shown in the first table) are expressed by a few attributes from the vessel, the port identification, and timestamps of the arrival and departure events of a given vessel.

| | imo | arrival_date | departure_date | port |
|---|---|---|---|---|
| 0 | 0 | 2019-01-07 | 2019-01-08 | GUAM |
| 1 | 0 | 2019-01-11 | 2019-01-12 | NAHA |
| 2 | 0 | 2019-01-12 | 2019-01-17 | NAHA |

Then we have the vessel checkpoints. A checkpoint is a message sent by each vessel at a regular interval (in this case, approximately once per day) that contains information about the vessel itself and its current status. By joining both tables, we enrich the checkpoints with information about the vessel departure and arrival, which is crucial for enclosing all the other checkpoints sent between these two events. The following table is an example of vessel checkpoint data.

| | imo | timestamp_position | position_date | shiptype | dimA | dimB | draught | destination | lat | lon | heading | course | aisshiptype | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 22:44:13+00:00 | 2019-01-01 | 1 | 143 | 74 | 10.0 | GUAM | 18.5195 | -167.327833 | 220.0 | 219.0 | 0 | POINT (-167.32783 18.51950) |
| 1 | 0 | 2019-01-02 00:03:01+00:00 | 2019-01-02 | 1 | 143 | 74 | 10.0 | GUAM | 18.5295 | -167.785333 | 270.0 | 337.0 | 0 | POINT (-167.78533 18.52950) |
| 2 | 0 | 2019-01-03 22:53:39+00:00 | 2019-01-03 | 1 | 143 | 74 | 10.0 | GUAM | 18.1515 | 175.703667 | 219.0 | 28.0 | 0 | POINT (175.70367 18.15150) |
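The join between checkpoints and port calls can be sketched as a lookup of the next arrival for each checkpoint. The snippet below is a simplified, dependency-free illustration for a single vessel; the function name `attach_next_arrival` is hypothetical, the first three checkpoint dates and the port calls come from the samples above, and the fourth checkpoint date is invented for the example. With pandas, `merge_asof` with `direction='forward'` and `by` set to the vessel identifier could serve a similar purpose.

```python
from bisect import bisect_left
from datetime import date


def attach_next_arrival(checkpoint_dates, port_calls):
    """For each checkpoint date, find the first port call arriving on or after it.

    port_calls: list of (arrival_date, port) tuples sorted by arrival_date.
    Returns (checkpoint_date, arrival_date, port) tuples; the last two items
    are None when no later port call is known.
    """
    arrivals = [arrival for arrival, _ in port_calls]
    enriched = []
    for cp in checkpoint_dates:
        i = bisect_left(arrivals, cp)
        if i < len(port_calls):
            arrival, port = port_calls[i]
            enriched.append((cp, arrival, port))
        else:
            enriched.append((cp, None, None))
    return enriched


# Port calls of one vessel, sorted by arrival date.
port_calls = [
    (date(2019, 1, 7), "GUAM"),
    (date(2019, 1, 11), "NAHA"),
    (date(2019, 1, 12), "NAHA"),
]
# Three checkpoint dates from the sample, plus one hypothetical date.
checkpoint_dates = [
    date(2019, 1, 1),
    date(2019, 1, 2),
    date(2019, 1, 3),
    date(2019, 1, 9),  # hypothetical checkpoint after leaving GUAM
]

enriched = attach_next_arrival(checkpoint_dates, port_calls)
for row in enriched:
    print(row)
```

The first three checkpoints are matched to the GUAM arrival on 2019-01-07, which agrees with the destination reported in the checkpoint messages.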

In the next table, both tables are joined and cleaned, with the geolocation encoded and the accumulated time and distance calculated. This view, for one particular vessel, shows all the checkpoints that belong to a given vessel voyage.

| | accum_time | accum_dist | efficiency | geohash | shiptype | dimB | draught | destination_port | dimA | origin_port |
|---|---|---|---|---|---|---|---|---|---|---|
| 13322 | 0 days 00:00:00 | 0.000000 | 0 | wtx4 | 0 | 29 | 10.6 | SHANGHAI | 150 | SHANGHAI |
| 13323 | 0 days 00:00:00 | 0.000000 | 0.00277462 | wtwh | 0 | 29 | 10.6 | ZHANGJIAGANG | 150 | ZHANGJIAGANG |
| 13324 | 7 days 12:15:41 | 760.610952 | 0.00036225 | wtty | 0 | 29 | 7.5 | INCHEON | 150 | ZHANGJIAGANG |
| 13325 | 0 days 00:00:00 | 0.000000 | 0.00117208 | wy9g | 0 | 29 | 7.5 | INCHEON | 150 | INCHEON |
| 13326 | 2 days 18:33:06 | 52.331437 | 6.81763e-05 | wydh | 0 | 29 | 7.7 | DANGJIN | 150 | INCHEON |
| 13327 | 0 days 00:00:00 | 0.000000 | 0.000218424 | wy9f | 0 | 29 | 7.7 | DANGJIN | 150 | DANGJIN |
| 13328 | 1 days 16:33:17 | 311.514360 | 3.50274e-08 | wyd4 | 0 | 29 | 7.7 | GWANGYANG | 150 | DANGJIN |
| 13329 | 0 days 00:00:00 | 0.000000 | 0.0021337 | wy4g | 0 | 29 | 9.6 | GWANGYANG | 150 | GWANGYANG |
| 13330 | 26 days 00:56:16 | 10966.168593 | 0.00243405 | wy4y | 0 | 29 | 9.6 | LOS ANGELES | 150 | GWANGYANG |
| 13331 | 24 days 06:09:44 | 10461.353900 | 0.00327819 | wvs8 | 0 | 29 | 10.3 | LOS ANGELES | 150 | GWANGYANG |

Finally, we have the dataset used to train our model. The first column is the label, the target value that our model learns to predict through regression. The remaining columns are the features the random forest uses during training.

| | accum_time | accum_dist | efficiency | geohash | shiptype | dimB | draught | destination_port | dimA | origin_port |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.009117 | 5217 | 1 | 74 | 10.0 | 51 | 143 | 51 |
| 1 | 3.454988 | 2264.432050 | 0.000000 | 5217 | 1 | 74 | 10.0 | 115 | 143 | 51 |
| 2 | 2.582269 | 1959.171581 | 0.004048 | 5225 | 1 | 74 | 9.1 | 115 | 143 | 51 |
| 3 | 1.373183 | 1100.031040 | 0.008224 | 5232 | 1 | 74 | 9.1 | 115 | 143 | 51 |
| 4 | 0.000000 | 0.000000 | 0.009272 | 4660 | 1 | 74 | 9.1 | 115 | 143 | 115 |
| 5 | 0.000000 | 0.000000 | 0.002147 | 4663 | 1 | 74 | 9.1 | 115 | 143 | 115 |
| 6 | 0.000000 | 0.000000 | 0.000406 | 4564 | 1 | 74 | 9.1 | 211 | 143 | 211 |
| 7 | 0.000000 | 0.000000 | 0.000221 | 4563 | 1 | 74 | 9.1 | 164 | 143 | 164 |
| 8 | 10.998912 | 10238.146891 | 0.000950 | 4606 | 1 | 74 | 10.9 | 99 | 143 | 164 |
| 9 | 8.874850 | 8215.609556 | 0.011021 | 5663 | 1 | 74 | 11.0 | 99 | 143 | 164 |

## Model and results

After preparing the data, it’s time to train our model. We used a technique called k-fold cross validation to create six different combinations of training and validation data (approximately 80% and 20%) to explore the variation of the data as much as possible. With the native support for Scikit-learn on SageMaker, we only had to create a Python script with the training code and pass it to a SageMaker Estimator, prepared with the SageMaker Python SDK. See the following code:

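    from sklearn.ensemble import RandomForestRegressor

    # est and depth hold hyperparameter values defined elsewhere (not shown here)
    model = RandomForestRegressor(
        n_estimators=est, verbose=0, n_jobs=4,
        criterion='squared_error',  # named 'mse' before scikit-learn 1.0
        max_leaf_nodes=1500, max_depth=depth, random_state=0
    )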
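To make the six-fold split concrete, here is a dependency-free sketch of how k-fold cross validation partitions the sample indices. It illustrates the technique only and is not the code from the study; the function name `k_fold_indices` is hypothetical. In practice, scikit-learn's `KFold` class provides the same behavior.

```python
import random


def k_fold_indices(n_samples, k=6, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=60, k=6))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))  # 50 10 for every fold
```

With six folds, each validation set holds one sixth of the samples (about 17%) and the training set holds the rest (about 83%), and every sample is used for validation exactly once.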

After training our model, we used a metric called R2 to evaluate the model performance. R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A poor model has a low R2 and a useful model has an R2 as close as possible to 1.0 (or 100%).
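As an illustration of the metric, R2 can be computed as 1 minus the ratio between the residual sum of squares and the total sum of squares. The following sketch implements that definition directly; the sample values are hypothetical, and scikit-learn's `r2_score` function computes the same quantity.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical remaining-time labels (in days) and model predictions.
actual = [10.0, 8.0, 6.0, 4.0, 2.0]
predicted = [9.5, 8.5, 6.0, 3.5, 2.5]

print(round(r2_score(actual, predicted), 3))  # 0.975
```

A model that always predicts the mean of the labels scores 0.0, and a model that performs worse than that baseline can even score below zero.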

In this case, we expected a model whose predictions correlate strongly with the testing data. With this combination of data preparation, algorithm selection, hyperparameter optimization, and cross validation, our model achieved an R2 score of 0.9473.

This result isn’t bad, but there is still room to improve the solution. We can reduce the model’s accumulated error by adding important features to the dataset. These features can help the model better understand all the low-level nuances and conditions from each checkpoint that can cause a delay. Some examples include weather conditions from the geolocation of the vessel, port conditions, accidents, extraordinary events, seasonality, and holidays in the port countries.

Then we have the feature importance (shown in the following graph). It’s a measurement of how strong or important a given feature from the dataset is for the prediction itself. Each feature has a different importance, and we want to keep only those features that are impactful for the model in the dataset.

The graph shows that accumulated distance is the most important feature (which is expected, given the high correlation with the target), followed by efficiency (an artificial feature we created to weigh the impact of the vessel displacement over time). In third place, we have destination port, close to the encoded geoposition.

You can download the notebooks created for this experiment to see all the details of the implementation. Click on the links below to get them:

You need Amazon SageMaker to run these notebooks, so create a SageMaker Studio Domain in your AWS account, upload the notebooks to your new environment, and run your own experiments!

## Summary

The use of ML in predicting vessel time of arrival can substantially increase the accuracy of landside operations planning and implementation, in comparison to traditional, manual estimation methodologies that are used widely across the industry. We’re working with shipping companies to improve the accuracy of our model and to add other relevant features. If your company is interested in learning more about our model and how it can be consumed, please reach out to our Head of World Wide Technology for Transportation and Logistics, Michele Sancricca, at [email protected]

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using the AWS platform. He has worked on several AI/ML projects related to computer vision, natural language processing, forecasting, and ML at the edge. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Michele Sancricca is the AWS Worldwide Head of Technology for Transportation and Logistics. Previously, he worked as Head of Supply Chain Products for Amazon Global Mile and led the Digital Transformation Division of Mediterranean Shipping Company. A retired Lieutenant Commander, Michele spent 12 years in the Italian Navy as Telecommunication Officer and Commanding Officer.