Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. OpenAI’s Gym is a popular toolkit for developing and comparing reinforcement learning algorithms.

In this article, we will walk through building a simple RL model using Gym on an Ubuntu GPU server.

Prerequisites

Before starting, ensure you have the following:

  • An Ubuntu 24.04 Cloud GPU Server.
  • The CUDA Toolkit and cuDNN installed.
  • Root or sudo privileges.

Step 1: Set Up the Environment

1. First, update the package index:

apt update -y

2. The default Python version in Ubuntu 24.04 (Python 3.12) is not compatible with Gym's dependencies, so you'll need to install Python 3.10 alongside it.

Add the deadsnakes PPA, which provides additional Python versions for Ubuntu.

add-apt-repository ppa:deadsnakes/ppa

3. Update the package index.

apt update -y

4. Install Python 3.10 and essential libraries.

apt install python3.10 python3.10-venv python3.10-dev -y

5. Create a Python virtual environment.

python3.10 -m venv rl_env
source rl_env/bin/activate

Step 2: Install Required Libraries

1. Upgrade pip to the latest version.

pip install --upgrade pip

2. Install required libraries.

pip install wheel "numpy<2" matplotlib scipy

3. Install TensorFlow with GPU support. The separate tensorflow-gpu package is deprecated; recent TensorFlow releases include GPU support in the standard tensorflow package.

pip install tensorflow

4. Install OpenAI Gym.

pip install gym
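
Optionally, confirm that both libraries import and that TensorFlow can see the GPU before moving on. This quick check is not part of the original setup, and its output will vary with your driver and TensorFlow version.

python3 -c "import tensorflow as tf, gym; print(tf.__version__, gym.__version__); print(tf.config.list_physical_devices('GPU'))"

If the last line prints an empty list, TensorFlow is not detecting the GPU and will fall back to the CPU.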

Step 3: Build a Simple Reinforcement Learning Model

OpenAI Gym provides various environments. For this tutorial, we’ll use the CartPole-v1 environment, where the goal is to balance a pole on a cart.

1. Create a file named environment_setup.py.

nano environment_setup.py

Add the following code.

import gym

env = gym.make('CartPole-v1')
print("Environment created successfully!")
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

2. Run the above script.

python3 environment_setup.py

Output.

Environment created successfully!
Observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Action space: Discrete(2)
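
Before adding any learning, you can verify the full interaction loop by driving the environment with random actions. The short script below is an illustrative sanity check only; it is not required for the rest of the tutorial.

import gym

env = gym.make('CartPole-v1')
state, _ = env.reset()
total_reward = 0
done = False
while not done:
    action = env.action_space.sample()  # Random action: 0 = push cart left, 1 = push cart right
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated  # Episode ends when the pole falls or the time limit is reached
print("Random policy episode reward:", total_reward)
env.close()

A random policy usually keeps the pole up for only a few dozen steps, which gives you a baseline to compare the trained agent against.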

Step 4: Define the Model

We’ll use a simple neural network to approximate the Q-function. The network will take the state as input and output the Q-values for each action.

1. Create a file to define the model.

nano build_model.py

2. Add the following code.

import tensorflow as tf
from tensorflow.keras import layers

def build_model(state_size, action_size):
    model = tf.keras.Sequential([
        layers.Dense(24, input_dim=state_size, activation='relu'),
        layers.Dense(24, activation='relu'),
        layers.Dense(action_size, activation='linear')
    ])
    model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    return model

# Test the model
state_size = 4
action_size = 2
model = build_model(state_size, action_size)
model.summary()

3. Run the script.

python3 build_model.py

Output.

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 24)                120       
                                                                 
 dense_1 (Dense)             (None, 24)                600       
                                                                 
 dense_2 (Dense)             (None, 2)                 50        
                                                                 
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
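
The parameter counts follow from (inputs + 1) × units for each Dense layer, where the extra 1 is the bias: (4 + 1) × 24 = 120, (24 + 1) × 24 = 600, and (24 + 1) × 2 = 50, giving 770 trainable parameters in total.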

Step 5: Implement the Q-Learning Algorithm

Q-Learning is a popular RL algorithm. Here, we’ll implement a simple version of Q-Learning with experience replay.
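
The key quantity the agent learns from is the Bellman target: for a stored transition (state, action, reward, next_state, done), the target Q-value is the immediate reward plus the discounted best Q-value of the next state, or just the reward if the episode ended. The agent's replay() method below computes exactly this inline; the standalone sketch here is only for illustration.

import numpy as np

def bellman_target(model, reward, next_state, done, gamma=0.95):
    # No future value to bootstrap from once the episode has ended
    if done:
        return reward
    # Immediate reward plus the discounted value of the best next action
    return reward + gamma * np.amax(model.predict(next_state)[0])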

1. Create a file for the Q-learning agent.

nano q_learning_agent.py

Add the following code.

import numpy as np
from build_model import build_model

class QLearningAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = []
        self.gamma = 0.95  # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = build_model(state_size, action_size)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.action_size)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def replay(self, batch_size):
        minibatch = np.random.choice(len(self.memory), batch_size, replace=False)
        for index in minibatch:
            state, action, reward, next_state, done = self.memory[index]
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Test the agent
state_size = 4
action_size = 2
agent = QLearningAgent(state_size, action_size)
print("Q-Learning agent created successfully!")

2. Run the above script.

python3 q_learning_agent.py

Output.

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 24)                120       
                                                                 
 dense_1 (Dense)             (None, 24)                600       
                                                                 
 dense_2 (Dense)             (None, 2)                 50        
                                                                 
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
Q-Learning agent created successfully!
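
As a quick check of the epsilon-greedy policy, you can feed the new agent a dummy state. This snippet is illustrative only and assumes build_model.py and q_learning_agent.py sit in the same directory.

import numpy as np
from q_learning_agent import QLearningAgent

agent = QLearningAgent(4, 2)
dummy_state = np.zeros((1, 4), dtype=np.float32)  # Shape (1, state_size), as the model expects
print("Chosen action:", agent.act(dummy_state))  # Always random at this point, since epsilon starts at 1.0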

Step 6: Train the Model

1. Now, let’s train the model using the Q-Learning algorithm.

nano train_model.py

Add the following code.

import gym
import numpy as np
from q_learning_agent import QLearningAgent

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = QLearningAgent(state_size, action_size)

batch_size = 32
episodes = 1000

for e in range(episodes):
    state, _ = env.reset()  # Unpack the tuple returned by env.reset()
    state = np.expand_dims(state, axis=0)  # Reshape state to (1, state_size)
    for time in range(500):
        action = agent.act(state)
        next_state, reward, done, _, _ = env.step(action)  # Unpack the tuple returned by env.step()
        next_state = np.expand_dims(next_state, axis=0)  # Reshape next_state to (1, state_size)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print(f"episode: {e}/{episodes}, score: {time}, e: {agent.epsilon:.2}")
            break
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)

2. Run the code.

python3 train_model.py

Output.

episode: 0/1000, score: 12, e: 0.99
episode: 1/1000, score: 15, e: 0.98
...
episode: 999/1000, score: 200, e: 0.01

Explanation:

  • The agent is trained for 1000 episodes.
  • The score (time steps before the pole falls) increases over time, indicating that the agent is learning.
  • The exploration rate (epsilon) decreases over time as the agent relies more on learned knowledge.
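
The test script in the next step contains a commented-out line for loading saved weights. To use it, save the weights at the end of train_model.py, for example with the Keras call below (the filename matches the one referenced in test_model.py; newer Keras releases require weight files to end in .weights.h5, so adjust both scripts consistently if yours does).

# Append to the end of train_model.py to persist the trained network
agent.model.save_weights('model_weights.h5')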

Step 7: Test the Model

1. After training, you can test the model to see how well it performs.

nano test_model.py

Add the following code.

import gym
import numpy as np
from q_learning_agent import QLearningAgent

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = QLearningAgent(state_size, action_size)

# Load the trained model (if saved)
# agent.model.load_weights('model_weights.h5')

state, _ = env.reset()  # Unpack the tuple returned by env.reset()
state = np.expand_dims(state, axis=0)  # Reshape state to (1, state_size)

for time in range(500):
    # env.render()  # Rendering requires a display and render_mode="human" in gym.make(); skip it on a headless server
    action = np.argmax(agent.model.predict(state)[0])
    next_state, reward, done, _, _ = env.step(action)  # Unpack the tuple returned by env.step()
    next_state = np.expand_dims(next_state, axis=0)  # Reshape next_state to (1, state_size)
    state = next_state
    if done:
        print(f"Test finished with score: {time}")
        break

env.close()

2. Run the above test.

python3 test_model.py

Output.

Test finished with score: 8

Explanation:

  • The score is the number of time steps the pole stays balanced before falling; a well-trained agent keeps it up for close to the maximum of 500 steps.
  • Because this script builds a fresh agent and the weight-loading line is commented out, the score will be low (as in the sample output above) unless you save the trained weights in train_model.py and load them here.

Conclusion

In this article, we built a simple RL model using OpenAI Gym on an Ubuntu GPU server. You can now develop your own RL projects and try out more complex environments. OpenAI Gym has many environments to test and refine different RL strategies.