Fish Catch Weight Prediction Model

Project Overview

In this group project, we developed machine learning models to predict the total weight of fish catches from Norwegian fishing vessels. Using a dataset containing over 300,000 fishing operation reports from 2018, we implemented and compared various regression models including Linear Regression, LASSO, Ridge, Random Forest, and Neural Networks (MLP).

The Challenge

Norwegian fishing vessels are required to report detailed information about their fishing operations. This creates a rich dataset including vessel specifications, location data, and catch details. Our goal was to use this data to create accurate predictions of catch weights, which could help in fleet management and resource planning.

Data Preprocessing

The raw data required significant preprocessing:

Model Development

We implemented several machine learning approaches, progressively improving our predictions:

Linear Models

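# Example of our Ridge Regression implementation
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

ridge_reg = make_pipeline(MinMaxScaler(),  # scale features to [0, 1] first
                          Ridge(alpha=0.001, random_state=42))
ridge_reg.fit(X_train, y_train)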
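
A minimal sketch of how Linear Regression, LASSO, and Ridge could be compared side by side with the same scaling pipeline. The data here is synthetic, and the alpha values and variable names are illustrative assumptions rather than the settings used in the project:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): the real vessel features are not used here.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative, untuned regularization strengths.
linear_models = {
    "Linear Regression": LinearRegression(),
    "LASSO": Lasso(alpha=0.001),
    "Ridge": Ridge(alpha=0.001),
}

linear_scores = {}
for name, model in linear_models.items():
    # Each model gets its own scaler so nothing leaks between pipelines.
    pipeline = make_pipeline(MinMaxScaler(), model)
    pipeline.fit(X_train, y_train)
    linear_scores[name] = r2_score(y_val, pipeline.predict(X_val))
    print(f"{name}: validation R^2 = {linear_scores[name]:.3f}")
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training split only, which keeps the validation score honest.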

Advanced Models

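# Our best performing model: Random Forest
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(random_state=42)  # default hyperparameters
forest.fit(X_train, y_train)
forest_predicted_values = forest.predict(X_val)  # predictions on the validation set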

Results

Our model comparison showed clear differences in prediction accuracy:

The Random Forest model clearly outperformed the other approaches, explaining 74% of the variance in catch weights (an R² of 0.74). This suggests that non-linear relationships between the features and catch weight matter for accurate predictions.
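
To make the "variance explained" figure concrete, here is a small sketch of how R² can be computed and why a tree ensemble may beat a linear model when the target is non-linear. The data is synthetic and the helper `r_squared` is a hypothetical name introduced for illustration, so the scores it prints say nothing about the fishing dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately non-linear data (assumption, not the fishing dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of target variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

linear_r2 = r_squared(y_val, linear.predict(X_val))
forest_r2 = r_squared(y_val, forest.predict(X_val))
print(f"Linear R^2: {linear_r2:.2f}, Random Forest R^2: {forest_r2:.2f}")
```

The hand-written `r_squared` matches scikit-learn's `r2_score`, so either can be used to report validation performance.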

Clustering Analysis

We also performed unsupervised learning using K-means clustering to identify patterns in fishing operations.

The elbow method estimates a suitable number of clusters by plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the point where adding more clusters stops reducing the WCSS by much.

Reading the elbow is subjective, but in our plot the bend appeared to be at around two clusters.
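
A minimal sketch of the elbow computation, assuming scikit-learn's `KMeans` and synthetic two-blob data in place of the fishing reports. The WCSS values are printed rather than plotted to keep the example dependency-free; in practice they would be plotted against k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with two well-separated groups (assumption for illustration).
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

cluster_counts = list(range(1, 9))
wcss = []
for k in cluster_counts:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

# The elbow is where the drop in WCSS from one k to the next shrinks sharply.
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
for k, value in zip(cluster_counts, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

On this toy data the largest drop is between one and two clusters, which is the pattern an elbow at k = 2 corresponds to.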

Using PCA for dimensionality reduction, we identified two distinct clusters in the data:
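
One way such a two-cluster view could be produced is sketched below. It assumes the features are standardized, clustered with K-means (k = 2), and then projected onto two principal components for inspection. Both the synthetic data and the order of the steps are assumptions, not a record of the project's pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 6-feature data standing in for the fishing reports (assumption).
X, _ = make_blobs(n_samples=300, n_features=6, centers=2, random_state=42)

# Standardize so no single feature dominates distances or components.
X_scaled = StandardScaler().fit_transform(X)

# Assign each sample to one of two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# Reduce to two components so the clusters can be inspected in 2D.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cluster sizes:", [int((labels == c).sum()) for c in (0, 1)])
```

The columns of `X_2d` can then be drawn as a scatter plot colored by `labels` to check visually whether the two clusters separate.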

Key Learnings

Technologies Used

Future Improvements