Embeddings in the OpenAI API

What are embeddings in the OpenAI API?

An embedding is an information-dense numerical representation of a text's semantic meaning, designed so that machine learning models and algorithms can consume it easily. In the OpenAI API, each embedding is a vector of floating-point values, and the distance between two embeddings in the vector space corresponds to the semantic similarity between the two original inputs. For instance, if two sentences mean roughly the same thing, their vector representations will lie close to each other.
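As a rough sketch of that idea, the snippet below embeds two pairs of sentences and compares them with cosine similarity. It assumes the legacy openai Python package (consistent with the examples later in this article) plus numpy, and the model name is only an example.

import numpy as np
import openai

def embed(text, model="text-similarity-babbage-001"):
    # Request a single embedding vector for the given text
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sentences with similar meaning should score higher than unrelated ones
print(cosine_similarity(embed("The cat sat on the mat"), embed("A kitten rested on the rug")))
print(cosine_similarity(embed("The cat sat on the mat"), embed("Quarterly revenue grew by 12%")))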

 

 

Types of embedding models

OpenAI currently provides three families of embedding models for different functionality: text similarity, text search, and code search. Each family includes up to four models, ranging in capability:

 

Ada (1024 dimensions), Babbage (2048 dimensions), Curie (4096 dimensions), and Davinci (12288 dimensions).

The most capable model is Davinci, but it is also the slowest and most expensive. The least capable model is Ada, but it is also the fastest and least costly.

 

Each of these embedding model families excels at a particular task.

 

Similarity Embeddings 

These models do an excellent job of capturing the semantic similarity between two or more texts.

 

Text Search Embeddings 

These models help determine whether long documents are relevant to a short search query. There are two kinds:

    • one for embedding the search query
    • one for embedding the documents to be returned to the user.

Code Search Embeddings 

There are two types, much like with search embeddings: one for encoding retrievable code snippets and one for encoding natural language search queries.
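As a rough sketch of how the two code-search models fit together, the snippet below embeds a couple of code snippets with the code model and a natural-language query with the text model, then ranks the snippets by cosine similarity. The code-search-babbage-*-001 model names follow this family's naming pattern but should be checked against the current model list; the helper functions are illustrative only.

import numpy as np
import openai

def embed(text, model):
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

snippets = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]

# Code snippets are embedded with the code model, the query with the text model
snippet_vectors = [embed(s, model="code-search-babbage-code-001") for s in snippets]
query_vector = embed("function that adds two numbers", model="code-search-babbage-text-001")

# Rank snippets by similarity to the query and print the best match
scores = [cosine_similarity(query_vector, v) for v in snippet_vectors]
print(snippets[int(np.argmax(scores))])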

How to get embeddings in the OpenAI API

The following snippet demonstrates how to request an embedding vector for a piece of text from the Embeddings endpoint:

 

import openai

# Request an embedding for a single piece of text
response = openai.Embedding.create(
    input="Sample document text goes here",
    model="text-similarity-babbage-001"
)
embeddings = response['data'][0]['embedding']

 

Keep in mind that the maximum input length for these embedding models is 2048 tokens (roughly equivalent to two to three pages of text). Make sure your inputs do not exceed this limit before submitting a request.
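One way to enforce that limit is to count tokens before calling the endpoint. The sketch below uses the GPT-2 encoding from the tiktoken package as an approximation of the tokenizer used by these first-generation models; this is an assumption, and any equivalent tokenizer would do.

import tiktoken

MAX_TOKENS = 2048  # maximum input length for the first-generation embedding models
encoding = tiktoken.get_encoding("gpt2")  # assumed approximation of the models' tokenizer

def check_input_length(text):
    n_tokens = len(encoding.encode(text))
    if n_tokens > MAX_TOKENS:
        raise ValueError(f"Input is {n_tokens} tokens; the limit is {MAX_TOKENS}.")
    return n_tokens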

 

There are two model variants for the search use case. Longer texts (the documents to be searched over) are handled by the -doc model, while shorter texts, typically queries or class labels in zero-shot classification, are handled by the -query model.
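A minimal sketch of that split, assuming the first-generation Babbage search models: the documents go through the -doc model, the query goes through the -query model, and results are ranked by cosine similarity. The documents and query here are purely illustrative.

import numpy as np
import openai

def embed(text, model):
    return np.array(openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"])

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "All orders are shipped from the warehouse within two business days.",
]

# Longer texts to be searched over use the -doc model
doc_vectors = [embed(d, model="text-search-babbage-doc-001") for d in documents]

# The short query uses the -query model
query_vector = embed("how do I return an item?", model="text-search-babbage-query-001")

# Rank documents by cosine similarity to the query and print the best match
scores = [
    float(query_vector @ d / (np.linalg.norm(query_vector) * np.linalg.norm(d)))
    for d in doc_vectors
]
print(documents[int(np.argmax(scores))])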

 

 

 

Use Cases 

Below are some representative use cases. 

Obtaining the embeddings 

The dataset consists of 568,454 food reviews left by Amazon customers up to October 2012. As an example, we will use a subset of the 1,000 most recent reviews. The reviews are written in English and tend to be either positive or negative. Each review has a ProductId, a UserId, a Score, a review title (Summary), and a review body (Text).

The review summary and review body are combined into a single text. The model encodes this combined text and returns a single vector embedding.
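A minimal sketch of that concatenation, assuming the raw reviews live in a local Reviews.csv (a hypothetical path) with the Summary, Text, and Time columns of the public Amazon fine-food reviews dataset; the "Title: ...; Content: ..." format is just one reasonable choice.

import pandas as pd

# Load the raw reviews, drop incomplete rows, and keep the 1,000 most recent ones
df = pd.read_csv("Reviews.csv")          # hypothetical path to the raw Amazon reviews
df = df.dropna(subset=["Summary", "Text"])
df = df.sort_values("Time").tail(1_000)

# Combine the short summary (title) and the full review body into one field
df["combined"] = "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()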

 

import openai
import numpy as np
import pandas as pd

def get_embedding(text, model="text-similarity-davinci-001"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

# Embed the combined review text with the similarity and search (doc) models
df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, model='text-similarity-babbage-001'))
df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

# Reload the saved embeddings and convert the stored strings back into numpy arrays
df = pd.read_csv('output/embedded_1k_reviews.csv')
df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)
df['babbage_search'] = df.babbage_search.apply(eval).apply(np.array)

 

 

Data Visualization in 2D

The size of an embedding depends on the capability of the underlying model. To visualize the data, we use the t-SNE technique to reduce the high-dimensional embeddings to two dimensions.

 

The visualization shows roughly three clusters, one of which is dominated by negative reviews.

Embedding as a text feature encoder for ML algorithms

Text feature encoder. If any of a machine learning model's relevant inputs are free text, embeddings will improve the model's performance. An embedding can also serve as a categorical feature encoder within an ML model; this adds the most value when the categorical values are numerous and descriptive, such as job titles. Similarity embeddings typically outperform search embeddings for this purpose.
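For example, a descriptive categorical column such as job titles could be encoded with a similarity model instead of one-hot encoding. The sketch below reuses the get_embedding helper defined in the earlier snippet; the job titles are purely illustrative.

import numpy as np

# Hypothetical categorical feature: encode each distinct job title once
job_titles = ["data scientist", "machine learning engineer", "accountant"]
title_features = {
    title: np.array(get_embedding(title, model="text-similarity-babbage-001"))
    for title in job_titles
}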

 

The embedding representation is typically quite detailed and information-rich. For instance, reducing the dimensionality of the inputs with SVD or PCA, even by as little as 10%, generally results in worse downstream performance on specific tasks.
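For illustration, a PCA reduction of the embedding matrix might look like the sketch below (using the DataFrame from the earlier snippets; the target dimensionality is arbitrary). The point above is that even modest reductions like this tend to cost accuracy downstream.

import numpy as np
from sklearn.decomposition import PCA

X = np.vstack(df.babbage_similarity.values)   # shape: (n_reviews, embedding_dim)

pca = PCA(n_components=256)                    # example target dimensionality
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)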

 

The regression and classification use cases that follow divide the data into a training set and a testing set, as sketched below.
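A minimal sketch of that split, assuming the embeddings DataFrame from the earlier snippets and scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.vstack(df.babbage_similarity.values)   # embedding features
y = df.Score.values                           # target: the 1-5 review score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

For reference, the code below is the full listing for the t-SNE plot described in the Data Visualization section above.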

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib

df = pd.read_csv('output/embedded_1k_reviews.csv')
matrix = np.array(df.babbage_similarity.apply(eval).to_list())

# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)

# Color each point by its review score (1-5)
colors = ["red", "darkorange", "gold", "turquoise", "darkgreen"]
x = [x for x, y in vis_dims]
y = [y for x, y in vis_dims]
color_indices = df.Score.values - 1

colormap = matplotlib.colors.ListedColormap(colors)
plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)
plt.title("Amazon ratings visualized in language using t-SNE")
plt.show()

 

 


 

 
