In today’s data-driven world, vector embeddings have become fundamental to machine learning, especially in natural language processing (NLP). They power tasks such as semantic search, sentiment analysis, and machine translation by capturing the semantic meaning of text, enabling machines to perform tasks that require a deep understanding of human language. Atlantic.Net Cloud GPU provides a robust environment for deploying machine learning models that require high computational power.

This article will guide you on how to generate and use vector embeddings using an Atlantic.Net Cloud GPU.

Prerequisites

Before proceeding, ensure you have the following:

  • An Atlantic.Net Cloud GPU server running Ubuntu 22.04, equipped with an NVIDIA A100 GPU with at least 10 GB of GPU RAM.
  • The CUDA Toolkit and cuDNN installed.
  • Root or sudo privileges.

Step 1: Setting Up Your Environment

First, connect to your Atlantic.Net Cloud GPU instance using SSH. Once logged in, you should set up a Python environment. Here’s how you can install the necessary packages:

apt update
apt install -y python3 python3-pip

Next, install PyTorch with CUDA 11.8 support.

pip3 install torch --index-url https://download.pytorch.org/whl/cu118
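
Before continuing, you can optionally verify that PyTorch detects the GPU. This is a quick sanity check, assuming the NVIDIA driver and CUDA Toolkit from the prerequisites are in place:

python3 -c "import torch; print(torch.cuda.is_available())"

If it prints True, PyTorch can use the GPU.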

Install the sentence-transformers Python library.

pip3 install sentence-transformers

Step 2: Creating the Embedding Generation Script

Create a Python script named create-embeddings.py that will encode sentences into vector embeddings using the SentenceTransformer model.

nano create-embeddings.py

Add the following code:

from sentence_transformers import SentenceTransformer

# Load the pre-trained model; it is placed on the GPU automatically when CUDA is available
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'This is sentence 1',
    'This is sentence 2',
    'This is sentence 3'
]

# Encode all sentences into fixed-size vectors in a single batch
embeddings = model.encode(sentences)

# Print each sentence alongside its embedding
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

In this script, we load the pre-trained all-MiniLM-L6-v2 model, a lightweight transformer that is well suited to generating sentence embeddings. The script encodes a list of sentences and outputs their corresponding embeddings.
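
SentenceTransformer places the model on the GPU automatically when CUDA is available. If you prefer to be explicit, the constructor also accepts a device argument:

model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')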

Now, run the above script.

python3 create-embeddings.py

The output displays the embedding for each sentence, represented as a vector. These embeddings can be used for further analysis or model training.

Output (truncated):

 -1.81968231e-02  1.34411370e-02  3.34244817e-02  2.00667456e-02
 -1.49116814e-02  2.83299759e-02  3.46540324e-02  5.40056638e-02
  3.22964322e-03 -3.13563198e-02 -2.74929032e-02 -2.80718114e-02
  4.45681438e-03 -2.83749006e-03  3.57951708e-02  4.00369102e-03
  3.07004545e-02  3.27301696e-02  9.36542265e-03  3.44577916e-02
  1.49957128e-02  8.31826255e-02  3.61972526e-02  9.59376097e-02
 -2.22833566e-02  5.62446602e-02 -5.56956753e-02 -7.11603910e-02
  7.02162553e-03 -6.63649812e-02  4.74848412e-02 -5.78645468e-02
  6.50438573e-03  4.54177819e-02 -4.24348488e-02  6.53610080e-02
 -1.88463349e-02 -3.33171338e-02  2.32637711e-02 -3.17560397e-02
 -4.93795201e-02 -3.29444744e-02 -5.85382022e-02 -3.99537198e-02
  7.91056156e-02  7.39838034e-02  5.14839850e-02 -2.18956899e-02
 -4.34886962e-02  1.52404765e-02 -4.13710400e-02 -1.73745619e-03
  3.03356256e-02 -1.15166663e-03  2.02252921e-02 -9.41368788e-02]
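
Each embedding produced by all-MiniLM-L6-v2 is a 384-dimensional vector, so only a portion of each one is shown above. To confirm the dimensions, you can append a line like the following to create-embeddings.py (encode returns a NumPy array by default):

# One row per sentence, one column per embedding dimension
print("Shape:", embeddings.shape)   # expected: (3, 384)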

Step 3: Calculating Cosine Similarity

To see how similar two sentences are, you can calculate the cosine similarity between their embeddings. Create a script named cosine-similarity.py to accomplish this:

nano cosine-similarity.py

Add the following code:

from sentence_transformers import SentenceTransformer, util

# Load the same pre-trained model used in the previous step
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode each sentence into its embedding vector
emb1 = model.encode("This is sentence 1")
emb2 = model.encode("This is sentence 2")

# Compute the cosine similarity between the two embeddings (returns a 1x1 tensor)
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Run the script to see the cosine similarity score, which quantifies the similarity between the two sentences on a scale from -1 to 1, with higher values indicating greater similarity.

python3 cosine-similarity.py

Output:

Cosine-Similarity: tensor([[0.9145]])
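
For intuition, util.cos_sim implements the standard cosine similarity formula: the dot product of the two vectors divided by the product of their norms. Here is a minimal sketch that reproduces the score manually with PyTorch:

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# encode returns NumPy arrays by default; convert them to tensors
a = torch.tensor(model.encode("This is sentence 1"))
b = torch.tensor(model.encode("This is sentence 2"))

# cosine similarity = dot(a, b) / (||a|| * ||b||)
manual = torch.dot(a, b) / (a.norm() * b.norm())
print("Manual cosine similarity:", manual.item())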

Step 4: Finding Top Similar Sentence Pairs

Finding the most similar pairs of sentences in a dataset can be crucial for applications like clustering or recommendation systems. Use the following script to find and display the top similar sentence pairs:

nano top-similar-pairs.py

Add the following code:

from sentence_transformers import SentenceTransformer, util

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'Someone in a gorilla costume is playing a set of drums.'
]

# Encode all sentences and compute the full pairwise cosine similarity matrix
embeddings = model.encode(sentences)
cos_sim = util.cos_sim(embeddings, embeddings)

# Collect every unique pair (i, j) with i < j along with its similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the pairs by similarity score, highest first
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], score))

This script identifies and lists the five most similar sentence pairs from a predefined list, demonstrating a practical application of vector embeddings in measuring text similarity.

Now, run the above script.

python3 top-similar-pairs.py

Output:

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474

The above outputs illustrate how vector embeddings and cosine similarity can be used to analyze and quantify the similarity between different texts, a capability that underpins applications such as search engines, recommendation systems, and more sophisticated NLP tasks.
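
To connect this to the search use case, sentence-transformers also provides util.semantic_search, which retrieves the corpus entries closest to a query embedding. Here is a minimal sketch reusing a few sentences from the list above; the query string is an arbitrary example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = [
    'A man is eating food.',
    'A man is riding a horse.',
    'A monkey is playing drums.'
]

# Encode the corpus once; convert_to_tensor keeps the embeddings on the GPU
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Encode the query and retrieve the two closest corpus entries by cosine similarity
query_embedding = model.encode('Someone is having a meal.', convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)

for hit in hits[0]:
    print(corpus[hit['corpus_id']], "->", round(hit['score'], 4))

Each hit is a dictionary containing the corpus index and its cosine similarity to the query.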

Conclusion

Using an Atlantic.Net Cloud GPU server for vector embeddings lets you leverage powerful computational resources to run intensive machine learning workloads efficiently. The examples here showcase basic operations you can perform with embeddings, from generation to similarity comparison, which are foundational for more advanced NLP tasks.