Recommender systems are the invisible engines behind personalized internet experiences. Whether you’re searching for your next TV show on Netflix, products on Amazon, or music on Spotify, recommender systems are working behind the scenes. They analyze your preferences, compare them with millions of other users, and deliver tailored recommendations that keep you engaged and happy.
In this article, we’ll build a recommender system using the Surprise library on an Ubuntu GPU server. We’ll use the MovieLens 100K dataset, a popular dataset for collaborative filtering, and train a model using SVD.
Prerequisites
Before starting, ensure you have the following:
- An Ubuntu 24.04 Cloud GPU Server.
- CUDA Toolkit and cuDNN installed (Surprise itself runs on the CPU, so the code in this article does not use them).
- Root or sudo privileges.
Step 1: Install Required Dependencies
First, we need to set up the environment and install the necessary dependencies.
1. Update your system packages.
apt update -y
2. Install Python and the venv module to create an isolated environment:
apt install python3-pip python3-venv -y
3. Create and activate a virtual environment to manage dependencies:
python3 -m venv surprise_env
source surprise_env/bin/activate
4. Install the necessary Python libraries:
pip install "numpy<2" scikit-surprise pandas matplotlib
Explanation:
- NumPy: For numerical computations.
- scikit-surprise: A Python library for building and analyzing recommender systems.
- Pandas: For data manipulation.
- Matplotlib: For visualization (optional).
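Before moving on, it can help to confirm that the packages are visible inside the virtual environment. The following is a minimal sketch that uses only the standard library; the helper name check_packages is made up for this example, and it assumes the usual import names (scikit-surprise is imported as surprise).

```python
import importlib.util

# Import names of the packages installed above (scikit-surprise imports as "surprise")
REQUIRED = ["numpy", "surprise", "pandas", "matplotlib"]

def check_packages(names):
    """Map each import name to True if it can be found in the active environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any line reports MISSING, make sure the virtual environment is activated and rerun the pip install command.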
Step 2: Download the MovieLens 100K Dataset
The MovieLens 100K dataset is a widely used dataset for building recommender systems. It contains 100,000 ratings from 943 users on 1,682 movies.
1. Create a directory for the dataset and download it:
mkdir -p ~/datasets && cd ~/datasets
wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
This will create a directory named ml-100k containing the dataset files.
2. The file we need is ml-100k/u.data. Let’s inspect the first few lines:
head ml-100k/u.data
Output:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
The columns represent:
- user_id: ID of the user.
- item_id: ID of the movie.
- rating: Rating given by the user (1-5).
- timestamp: Timestamp of the rating.
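To make the column layout concrete, here is a small sketch that parses rows in the same tab-separated format using only the standard library. The three sample rows are copied from the output above; the function name parse_ratings is an assumption made for this illustration and is not part of Surprise.

```python
import csv
import io

# Three sample rows in the same tab-separated layout as u.data
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

def parse_ratings(handle):
    """Yield (user_id, item_id, rating, timestamp) tuples from a u.data-style file."""
    for user_id, item_id, rating, timestamp in csv.reader(handle, delimiter="\t"):
        yield int(user_id), int(item_id), float(rating), int(timestamp)

rows = list(parse_ratings(io.StringIO(sample)))
print(rows[0])  # (196, 242, 3.0, 881250949)
```

To read the real file, pass an open file handle for ml-100k/u.data instead of the in-memory sample.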
Step 3: Load the Dataset
We’ll use the surprise library to load the dataset into a format suitable for training.
1. Create a file named load_data.py:
nano load_data.py
Add the following code:
import pandas as pd
from surprise import Dataset, Reader
# Load MovieLens 100K dataset
file_path = "ml-100k/u.data"
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep='\t', names=columns)
# Define a reader; when loading from a DataFrame only the rating scale is used
reader = Reader(rating_scale=(1, 5))
# Load the dataset into Surprise format
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
print("Data loaded successfully.")
2. Execute the script:
python3 load_data.py
Output:
Data loaded successfully.
Step 4: Build a Recommender System Using Surprise
We’ll use the SVD (Singular Value Decomposition) algorithm, a popular collaborative filtering technique, to build our recommender system.
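To give a feel for what this kind of model learns, the sketch below fits a tiny biased matrix factorization with stochastic gradient descent. Each rating is predicted as global mean + user bias + item bias + dot product of a user factor vector and an item factor vector. This is a simplified illustration, not Surprise's implementation: the function names, the toy ratings, and the hyperparameter values are all assumptions chosen for a six-rating example.

```python
import math
import random

def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu + b_u + b_i + p_u . q_i to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # Small random starting factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter against the gradient of the regularized squared error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse(predict, ratings):
    """Root mean square error of predict over (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Made-up ratings: (user_id, item_id, rating)
toy = [(1, 10, 5), (1, 20, 3), (2, 10, 4), (2, 30, 1), (3, 20, 2), (3, 30, 2)]
baseline = train_biased_mf(toy, n_epochs=0)  # no SGD steps: close to the global mean
fitted = train_biased_mf(toy)
print(f"RMSE before training: {rmse(baseline, toy):.3f}")
print(f"RMSE after training:  {rmse(fitted, toy):.3f}")
```

The error here is measured on the same ratings used for fitting, so it only shows that training reduces the loss; it says nothing about accuracy on unseen ratings.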
1. Create a Python file to train the model:
nano train_model.py
Add the following code:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate
# Load dataset
file_path = "ml-100k/u.data"
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep='\t', names=columns)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
# Evaluate SVD with 5-fold cross-validation (one model is trained per fold)
model = SVD()
cross_validate(model, data, cv=5, verbose=True)
print("Model trained successfully.")
2. Run the script.
python3 train_model.py
This will perform 5-fold cross-validation and output RMSE (Root Mean Square Error) and MAE (Mean Absolute Error).
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE (testset)    0.9271  0.9318  0.9398  0.9403  0.9423  0.9363  0.0058
MAE (testset)     0.7311  0.7359  0.7374  0.7412  0.7434  0.7378  0.0043
Fit time          0.79    0.83    0.85    0.88    0.96    0.86    0.06
Test time         0.09    0.09    0.09    0.10    0.08    0.09    0.01
Model trained successfully.
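For reference, both metrics can be computed by hand. The sketch below uses four made-up ratings and predictions (not taken from the dataset) to show the arithmetic: MAE averages the absolute errors, while RMSE averages the squared errors and takes the square root, so it penalizes large misses more heavily.

```python
import math

# Hypothetical true ratings and model predictions
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

errors = [a - p for a, p in zip(actual, predicted)]  # 0.5, 0.0, 1.0, -1.0
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")   # MAE:  0.6250
print(f"RMSE: {rmse:.4f}")  # RMSE: 0.7500
```

On a 1-5 rating scale, an RMSE of about 0.94 means that predictions are typically off by a little under one star.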
Step 5: Make Predictions Using the Model
Now that the model is trained, let’s make predictions for a specific user.
1. Create a Python file to make predictions.
nano predict.py
Add the following code:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
# Load dataset
file_path = "ml-100k/u.data"
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep='\t', names=columns)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
# Split data into train and test set
trainset, testset = train_test_split(data, test_size=0.2)
# Train SVD model
model = SVD()
model.fit(trainset)
# Make predictions
predictions = model.test(testset)
# Print sample predictions
for prediction in predictions[:10]:
    print(f"User {prediction.uid} - Item {prediction.iid}: Predicted Rating {prediction.est:.2f}")
2. Run the script.
python3 predict.py
Output:
User 282 - Item 325: Predicted Rating 2.90
User 619 - Item 188: Predicted Rating 3.74
User 336 - Item 1118: Predicted Rating 3.04
User 763 - Item 505: Predicted Rating 4.18
User 735 - Item 286: Predicted Rating 3.45
User 276 - Item 214: Predicted Rating 3.53
User 305 - Item 557: Predicted Rating 2.91
User 682 - Item 71: Predicted Rating 3.20
User 718 - Item 471: Predicted Rating 4.00
User 145 - Item 460: Predicted Rating 3.16
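Individual predictions become recommendations once they are grouped by user and sorted by estimated rating. The following sketch shows one way to do that. It relies only on the uid, iid, and est attributes used in the script above; the Pred namedtuple is a stand-in for Surprise's prediction objects, and the sample values are invented for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for Surprise's prediction objects; only uid, iid and est are used
Pred = namedtuple("Pred", ["uid", "iid", "est"])

def top_n(predictions, n=2):
    """Return {uid: [(iid, est), ...]} with the n highest estimates per user."""
    per_user = defaultdict(list)
    for pred in predictions:
        per_user[pred.uid].append((pred.iid, pred.est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }

# Hypothetical predictions, made up for illustration
sample = [
    Pred(1, 50, 4.6), Pred(1, 181, 3.9), Pred(1, 258, 4.2),
    Pred(2, 50, 3.1), Pred(2, 100, 4.4),
]
print(top_n(sample))
# {1: [(50, 4.6), (258, 4.2)], 2: [(100, 4.4), (50, 3.1)]}
```

The same function should work on the list returned by model.test(). For real recommendations, the predictions would normally cover items the user has not rated yet; Surprise's trainset.build_anti_testset() produces such user-item pairs.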
Step 6: Save and Load the Model
To reuse the trained model, we’ll save it to a file and load it later.
1. Create a file named save_model.py:
nano save_model.py
Add the following code:
import pandas as pd
import pickle
from surprise import Dataset, Reader, SVD
# Load dataset
file_path = "/root/datasets/ml-100k/u.data"  # Adjust if the dataset is stored elsewhere
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep='\t', names=columns)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
# Train model
trainset = data.build_full_trainset()
model = SVD()
model.fit(trainset)
# Save the trained model
with open("recommender_model.pkl", "wb") as f:
    pickle.dump(model, f)
print("Model saved successfully.")
2. Run the script.
python3 save_model.py
Output:
Model saved successfully.
3. Create a file named load_model.py to load and use the saved model.
nano load_model.py
Add the following code:
import pickle
# Load the saved model (unpickling requires scikit-surprise to be installed)
with open("recommender_model.pkl", "rb") as f:
    model = pickle.load(f)
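# Make a prediction for user 1 and item 50.
# The IDs are integers because the ratings were loaded through pandas.
user_id = 1
item_id = 50
predicted_rating = model.predict(user_id, item_id).est
print(f"Predicted rating for User {user_id} and Item {item_id}: {predicted_rating:.2f}")
4. Run the script.
python3 load_model.py
The script prints one line of the form "Predicted rating for User 1 and Item 50: <rating>". The exact rating changes each time the model is retrained because SVD starts from random factors.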
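One detail to watch when calling predict is the type of the IDs. Surprise matches raw IDs by exact value, and because this dataset was loaded through pandas the IDs are integers. A string such as "1" does not match user 1, so the model treats it as unknown and falls back to the global mean rating, which is about 3.53 for MovieLens 100K. The toy lookup below illustrates the effect; it is not Surprise's internal code, and the bias values are invented.

```python
# Toy stand-in for a trained model: biases keyed by raw integer IDs.
# This is NOT Surprise's internal code, only an illustration of the lookup.
GLOBAL_MEAN = 3.53
user_bias = {1: 0.30}
item_bias = {50: 0.55}

def predict(user_id, item_id):
    """Global mean plus whichever biases are known for the given raw IDs."""
    return GLOBAL_MEAN + user_bias.get(user_id, 0.0) + item_bias.get(item_id, 0.0)

print(f"{predict(1, 50):.2f}")      # 4.38: both IDs are found
print(f"{predict('1', '50'):.2f}")  # 3.53: strings match nothing, global mean
```

A prediction that equals the global mean for every user and item is therefore a hint that the IDs are being passed with the wrong type.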
Step 7: Evaluate the Model Performance
Finally, let’s evaluate the model’s performance using RMSE (Root Mean Square Error).
1. Create a file for evaluation.
nano evaluate_model.py
Add the following code:
import pandas as pd
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
import pickle
# Load dataset
file_path = "/root/datasets/ml-100k/u.data"  # Adjust if the dataset is stored elsewhere
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep='\t', names=columns)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
# Split data
trainset, testset = train_test_split(data, test_size=0.2)
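# Load saved model
with open("recommender_model.pkl", "rb") as f:
    model = pickle.load(f)
# Predict
predictions = model.test(testset)
# Compute RMSE (verbose=False stops accuracy.rmse from printing the value itself)
rmse = accuracy.rmse(predictions, verbose=False)
print(f"RMSE: {rmse:.4f}")
2. Run the script.
python3 evaluate_model.py
Output:
RMSE: 0.6721
Note that this RMSE is much lower than the cross-validated 0.9363 from Step 4. The saved model was trained on the full dataset in Step 6, so it has already seen every rating in this test set. The figure therefore measures fit to known data; the cross-validation result is the better estimate of accuracy on unseen ratings.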
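To measure accuracy on unseen ratings, the test ratings have to be held out before the model is trained. The sketch below shows the idea behind an 80/20 holdout split in plain Python. The function name holdout_split and the generated ratings are assumptions for illustration; in the scripts above, Surprise's train_test_split plays this role.

```python
import random

def holdout_split(ratings, test_size=0.2, seed=42):
    """Shuffle the ratings and return (train, test) lists with no overlap."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user_id, item_id, rating) triples for illustration
ratings = [(u, i, (u + i) % 5 + 1) for u in range(1, 6) for i in range(1, 5)]
train, test = holdout_split(ratings)

print(len(train), len(test))        # 16 4
print(set(train).isdisjoint(test))  # True
```

Fitting the model on the training part only and scoring it on the test part gives an error estimate that is comparable to the cross-validation figures.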
Conclusion
In this article, we built a recommender system using the Surprise library on an Atlantic.Net GPU server. Now, you have a working recommender system that can be extended to real-world applications like e-commerce product recommendations, movie suggestions, or content curation.