Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. OpenAI’s Gym is a popular toolkit for developing and comparing reinforcement learning algorithms.
In this article, we will go through building a simple RL model using Gym on an Ubuntu GPU server.
Prerequisites
Before starting, ensure you have the following:
- An Ubuntu 24.04 Cloud GPU Server.
- The CUDA Toolkit and cuDNN installed.
- Root or sudo privileges.
Step 1: Set Up the Environment
1. First, ensure your system is up to date:
apt update && apt upgrade -y
2. The default Python version in Ubuntu 24.04 (Python 3.12) is not compatible with Gym and its pinned dependencies, so you'll need to install Python 3.10.
Install the PPA tooling and add the deadsnakes Python repository.
apt install software-properties-common -y
add-apt-repository ppa:deadsnakes/ppa
3. Update the package index.
apt update -y
4. Install Python 3.10 and essential libraries.
apt install python3.10 python3.10-venv python3.10-dev -y
5. Create a Python virtual environment.
python3.10 -m venv rl_env
source rl_env/bin/activate
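You can confirm that the virtual environment is active and uses the expected interpreter; this should print Python 3.10.x:
python --version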
Step 2: Install Required Libraries
1. Upgrade pip to the latest version.
pip install --upgrade pip
2. Install required libraries.
pip install wheel "numpy<2" matplotlib scipy
3. Install TensorFlow with GPU support. The separate tensorflow-gpu package is deprecated; current releases ship GPU support in the main tensorflow package, and the and-cuda extra pulls in matching CUDA libraries via pip.
pip install "tensorflow[and-cuda]"
4. Install OpenAI Gym.
pip install gym
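Before moving on, you can optionally confirm that TensorFlow detects the GPU and that Gym imports cleanly:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
python3 -c "import gym; print(gym.__version__)"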
Step 3: Build a Simple Reinforcement Learning Model
OpenAI Gym provides various environments. For this tutorial, we’ll use the CartPole-v1 environment, where the goal is to balance a pole on a cart.
1. Create a file named environment_setup.py.
nano environment_setup.py
Add the following code.
import gym
env = gym.make('CartPole-v1')
print("Environment created successfully!")
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
2. Run the above script.
python3 environment_setup.py
Output.
Environment created successfully!
Observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Action space: Discrete(2)
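If you'd like to see the environment API in action before adding any learning, the optional sketch below (not one of the tutorial files) runs one episode with random actions and prints the total reward:
import gym

env = gym.make('CartPole-v1')
state, _ = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                          # pick a random action
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
env.close()
print("Random policy reward:", total_reward)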
Step 4: Define the Model
We’ll use a simple neural network to approximate the Q-function. The network will take the state as input and output the Q-values for each action.
1. Create a file to define the model.
nano build_model.py
2. Add the following code.
import tensorflow as tf
from tensorflow.keras import layers
def build_model(state_size, action_size):
    model = tf.keras.Sequential([
        layers.Dense(24, input_dim=state_size, activation='relu'),
        layers.Dense(24, activation='relu'),
        layers.Dense(action_size, activation='linear')
    ])
    model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    return model
# Test the model
state_size = 4
action_size = 2
model = build_model(state_size, action_size)
model.summary()
3. Run the script.
python3 build_model.py
Output.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 24) 120
dense_1 (Dense) (None, 24) 600
dense_2 (Dense) (None, 2) 50
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
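As a quick check of how this network is used, you can feed it a batch containing a single 4-dimensional state and get back one Q-value per action. This is an optional illustration, not part of the tutorial files:
import numpy as np
from build_model import build_model  # note: importing also runs the module-level test in build_model.py

model = build_model(4, 2)
state = np.zeros((1, 4), dtype=np.float32)  # one dummy CartPole state, batch size 1
q_values = model.predict(state, verbose=0)
print(q_values.shape)  # (1, 2): one Q-value per action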
Step 5: Implement the Q-Learning Algorithm
Q-Learning is a popular RL algorithm. Here, we’ll implement a simple version of Q-Learning with experience replay.
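The core of the algorithm is the update target reward + gamma * max Q(next_state, a): the immediate reward plus the discounted value of the best action available in the next state. A minimal sketch of that computation with made-up numbers, for illustration only:
import numpy as np

gamma = 0.95                              # discount factor
reward = 1.0                              # reward received for the current step
q_next = np.array([0.2, 0.7])             # Q-values of the next state, one per action
target = reward + gamma * np.max(q_next)  # Q-Learning target for the action taken
print(target)                             # 1.665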
1. Create a file named q_learning_agent.py.
nano q_learning_agent.py
Add the following code.
import numpy as np
from build_model import build_model
class QLearningAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = []
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0   # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = build_model(state_size, action_size)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.action_size)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def replay(self, batch_size):
        minibatch = np.random.choice(len(self.memory), batch_size, replace=False)
        for index in minibatch:
            state, action, reward, next_state, done = self.memory[index]
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
# Test the agent
state_size = 4
action_size = 2
agent = QLearningAgent(state_size, action_size)
print("Q-Learning agent created successfully!")
2. Run the above script.
python3 q_learning_agent.py
Output.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 24) 120
dense_1 (Dense) (None, 24) 600
dense_2 (Dense) (None, 2) 50
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
Q-Learning agent created successfully!
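To see how the agent is used before wiring up the full training loop, the short sketch below (an illustration, not one of the tutorial files) asks the agent for an action on a dummy state and stores one transition in its replay memory:
import numpy as np
from q_learning_agent import QLearningAgent  # note: importing also runs its module-level test code

agent = QLearningAgent(4, 2)
state = np.zeros((1, 4), dtype=np.float32)   # dummy state shaped (1, state_size)
action = agent.act(state)                    # epsilon-greedy: random at first, since epsilon starts at 1.0
agent.remember(state, action, 1.0, state, False)
print("Chosen action:", action, "memory size:", len(agent.memory))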
Step 6: Train the Model
1. Now, let’s train the model using the Q-Learning algorithm.
nano train_model.py
Add the following code.
import gym
import numpy as np
from q_learning_agent import QLearningAgent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = QLearningAgent(state_size, action_size)
batch_size = 32
episodes = 1000
for e in range(episodes):
    state, _ = env.reset()  # Unpack the tuple returned by env.reset()
    state = np.expand_dims(state, axis=0)  # Reshape state to (1, state_size)
    for time in range(500):
        action = agent.act(state)
        next_state, reward, done, _, _ = env.step(action)  # Unpack the tuple returned by env.step()
        next_state = np.expand_dims(next_state, axis=0)  # Reshape next_state to (1, state_size)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print(f"episode: {e}/{episodes}, score: {time}, e: {agent.epsilon:.2}")
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
2. Run the code.
python3 train_model.py
Output.
episode: 0/1000, score: 12, e: 0.99
episode: 1/1000, score: 15, e: 0.98
...
episode: 999/1000, score: 200, e: 0.01
Explanation:
- The agent is trained for 1000 episodes.
- The score (time steps before the pole falls) increases over time, indicating that the agent is learning.
- The exploration rate (epsilon) decreases over time as the agent relies more on learned knowledge.
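If you want to reuse the trained network in the next step, you can save its weights once training finishes. This is an optional addition to the end of train_model.py (the filename matches the one referenced in test_model.py below; note that Keras 3 requires weight filenames to end in .weights.h5):
# Optional: add at the end of train_model.py, after the training loop
agent.model.save_weights('model_weights.h5')  # with Keras 3, use 'model_weights.weights.h5'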
Step 7: Test the Model
1. After training, you can test the model to see how well it performs.
nano test_model.py
Add the following code.
import gym
import numpy as np
from q_learning_agent import QLearningAgent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = QLearningAgent(state_size, action_size)
# Load the trained model (if saved)
# agent.model.load_weights('model_weights.h5')
state, _ = env.reset() # Unpack the tuple returned by env.reset()
state = np.expand_dims(state, axis=0) # Reshape state to (1, state_size)
for time in range(500):
    env.render()  # rendering needs a display; you can remove this line on a headless server
    action = np.argmax(agent.model.predict(state)[0])
    next_state, reward, done, _, _ = env.step(action)  # Unpack the tuple returned by env.step()
    next_state = np.expand_dims(next_state, axis=0)  # Reshape next_state to (1, state_size)
    state = next_state
    if done:
        print(f"Test finished with score: {time}")
        break
env.close()
2. Run the above test.
python3 test_model.py
Output.
Test finished with score: 8
Explanation:
- The test script creates a fresh QLearningAgent, so unless you uncomment the load_weights line and point it at weights saved after training, the network is untrained and the pole falls after only a few steps, which is why the score here is low.
- With the trained weights loaded, the agent should keep the pole balanced for far longer, up to the 500-step limit of CartPole-v1.
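Because the starting state is random, a single episode can be misleading. The sketch below (illustrative, assuming you saved and loaded trained weights as described above) averages the greedy policy's score over several episodes:
import gym
import numpy as np
from q_learning_agent import QLearningAgent

env = gym.make('CartPole-v1')
agent = QLearningAgent(env.observation_space.shape[0], env.action_space.n)
agent.model.load_weights('model_weights.h5')  # path is an example; use the file you saved after training

scores = []
for _ in range(10):
    state, _ = env.reset()
    state = np.expand_dims(state, axis=0)
    score = 0
    done = False
    while not done:
        action = np.argmax(agent.model.predict(state, verbose=0)[0])
        state, reward, terminated, truncated, _ = env.step(action)
        state = np.expand_dims(state, axis=0)
        done = terminated or truncated
        score += reward
    scores.append(score)
env.close()
print("Average score over 10 episodes:", np.mean(scores))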
Conclusion
In this article, we built a simple reinforcement learning model using OpenAI Gym on an Ubuntu GPU server. You can now develop your own RL projects and experiment with more complex environments; OpenAI Gym offers many of them for testing and refining different RL strategies.