How to Create a Recommendation System for a Website with YDB and Vector Search

A step-by-step guide to building a recommendation system for websites, online stores, and applications using YDB, embeddings, and semantic search for similar data.

If you have information that you actively use on your website, in an online store, or in mobile applications, you can enhance it with a recommendation system.

A recommendation system is a mechanism that suggests the most semantically relevant information based on the source data. Such a system is useful for a wide range of purposes:

Stores: A recommendation system helps display similar products, which can increase sales.
Service websites: It helps offer customers relevant suggestions, improving service quality.
Search websites: It helps users find precise information for their query, speeding up decision-making.
Blogs and informational websites: Recommendations help connect closely related content, making it more engaging and useful.
Product and information search: The system makes it possible to quickly find products and information in a database, significantly simplifying navigation.

The main difference from regular search is that a recommendation system does not search for exact matches. Instead, it converts data into vector representations (embeddings) and looks for the vectors closest to a given one. To create an effective recommendation system, you will need:

Data: Collect all the information that will be used in your system.
Vector database: It stores the vector representations of your data and finds the ones closest to a given vector, which is the core operation of a recommendation system.

Introduction

In this article, we will look at Yandex Database (YDB), which will be our tool for:

Storing information
Saving vectors
Searching for similar data

As an example, we will use the following data structure:

Record ID
Title
Data
Vector
Creation date

This is a typical table structure. Once the table is created, we can fill it with data. With this approach, a search for semantically similar data returns not only the matching records but also their details, such as the title and description, and the results can be sorted by creation date.

Data Migration

Creating a Table

First, create a table, for example articles, using the following YQL query:

CREATE TABLE articles (
  `id` Uuid,
  `title` Utf8,
  `summary` Utf8,
  `body` Utf8,
  `isPublished` Bool,
  `createdAt` Timestamp,

  `embedding` String,

  PRIMARY KEY (`id`)
);

This query creates a table with the required fields: record ID, title, summary, body text, a publication flag, creation date, and a field for storing the vector.

Pay attention to the PRIMARY KEY (id) clause. It defines the primary key that identifies each record in the table. You can learn more about this in the YDB documentation.

Loading Data

Now let us fill the table. To do this, you can use the INSERT statement, which adds a new record to the table:

INSERT INTO articles (
    id,
    title,
    summary,
    body,
    isPublished,
    createdAt
)
VALUES (
    RandomUuid(CurrentUtcTimestamp()),
    'My title',
    'My summary text',
    'My body information',
    true,
    CurrentUtcTimestamp()
)
RETURNING *;

This statement adds a new record to the table. The embedding value is still NULL at this point, because we will create the vector for the data later and add it to the table separately.
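
Because the vector is attached later with an UPDATE that filters by id, it helps to know each record's id in your application. The sketch below shows one possible way to do that, assuming you generate the UUID in application code instead of calling RandomUuid() in the query. The helper name new_article_row and the dict layout are hypothetical and simply mirror the table columns; sending the row to YDB is left out.

```python
import uuid
from datetime import datetime, timezone


def new_article_row(title: str, summary: str, body: str, is_published: bool = True) -> dict:
    """Build one row for the articles table (hypothetical helper).

    The keys mirror the table columns. The embedding is left as None
    because the vector is computed and saved in a later step.
    """
    return {
        "id": str(uuid.uuid4()),                   # generated client-side, not by RandomUuid()
        "title": title,
        "summary": summary,
        "body": body,
        "isPublished": is_published,
        "createdAt": datetime.now(timezone.utc),  # UTC, like CurrentUtcTimestamp()
        "embedding": None,
    }


row = new_article_row("My title", "My summary text", "My body information")
print(row["id"], row["title"])
```

With the id kept on the application side, the later UPDATE can target the record without an extra lookup.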

Creating a Vector Representation for the Data

For a recommendation system, vector search is one of the key elements. To implement this kind of search, ordinary data must be converted into a vector that will be used for searching.

This can be done with an embedding function, which converts data into a vector.

You can do this yourself, but I recommend using ready-made solutions. For example, Yandex Cloud provides ready-made models that convert your data into vector representations.

More detailed information is available on this page: Text embedding models.

For vectorization, I will use the following Python code:

from __future__ import annotations
from yandex_ai_studio_sdk import AIStudio

# Specify your own values
YANDEX_FOLDER_ID = "*****"
YANDEX_API_KEY = "*****"

doc_text = """The main text of the article or any other data that will be searched."""

def main():
    import numpy as np

    sdk = AIStudio(
        folder_id=YANDEX_FOLDER_ID,
        auth=YANDEX_API_KEY,
    )

    # Document embeddings
    doc_model = sdk.models.text_embeddings(f"emb://{YANDEX_FOLDER_ID}/text-embeddings-v2-doc/latest")
    doc_embedding = doc_model.run(doc_text, dimensions=256)

    vector = doc_embedding.embedding

    print("Dimension:", len(vector))
    
    # Convert to float32 (IMPORTANT for YDB)
    vector = np.asarray(doc_embedding.embedding, dtype=np.float32)

    # Create a string that can be inserted into YQL
    vector_sql = "[" + ", ".join(str(float(x)) + "f" for x in vector) + "]"

    print(vector_sql)

if __name__ == "__main__":
    main()

Pay attention to the following key points:

text-embeddings-v2-doc is the model name, which includes its version and type. The type can be doc (for the texts you store) or query (for search queries).
dimensions=256 sets the vector dimensionality. You can choose 256, 512, or 768; larger vectors can capture more detail, but they take more storage and more time to compare.
vector = np.asarray(doc_embedding.embedding, dtype=np.float32) converts the vector elements to 32-bit floating-point numbers, which is required for storing the vector in YDB.
vector_sql = "[" + ", ".join(str(float(x)) + "f" for x in vector) + "]" creates a string that can be inserted into a YQL query.

Items 3 and 4 are just the conversions needed to save the vector in YDB. As a result, you will get a vector similar to this: [-0.039563942700624466f, 0.04024477303028107f, -0.00539764016866684f, -0.05466749519109726f, 0.06055353209376335f, ...]
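
The two conversions can also be tried in isolation. The following is a minimal sketch that uses only the Python standard library instead of NumPy; the helper names to_float32 and to_yql_float_list are hypothetical and not part of any SDK. It rounds each value to 32-bit precision and then formats the list with the f suffix used in the queries in this article.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_yql_float_list(vector) -> str:
    """Format numbers as a YQL list of Float literals, e.g. [0.5f, -0.25f]."""
    return "[" + ", ".join(f"{to_float32(x)!r}f" for x in vector) + "]"


print(to_yql_float_list([0.5, -0.25, 0.125]))  # [0.5f, -0.25f, 0.125f]
```

Values such as 0.5 survive the rounding unchanged, while a value like 0.1 becomes 0.10000000149011612, which explains the long digit tails in the vector shown above.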

Saving the Vector in YDB

To add the vector to our data, we need to update the corresponding record in the table. This can be done with the following YQL expression:

UPDATE articles
SET embedding =
  Untag(
    Knn::ToBinaryStringFloat([-0.039563942700624466f, 0.04024477303028107f, -0.00539764016866684f, -0.05466749519109726f, 0.06055353209376335f, ...]),
    "FloatVector"
  )
WHERE id = CAST("5772ab53-78b2-4f05-94b5-2985d98c2c79" AS Uuid);

Note that I update the record with a specific id by passing its value in the WHERE clause. You can find more detailed information about storing and searching vectors without an index in the YDB documentation: Search in a table without using a vector index.
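
To get a feel for what ends up in the embedding column, here is a rough sketch of the binary layout. It assumes, based on my reading of the YDB Knn documentation, that a Float vector is serialized as consecutive 32-bit floats followed by one type byte with the value 1. The little-endian byte order, the constant FLOAT_VECTOR_TAG, and the function name are my assumptions, so verify them against the documentation before relying on this in real code.

```python
import struct

# Assumed type marker for Float vectors; check the YDB Knn documentation.
FLOAT_VECTOR_TAG = 1


def to_binary_string_float(vector) -> bytes:
    """Pack values as little-endian float32 followed by a 1-byte type tag."""
    return struct.pack(f"<{len(vector)}f", *vector) + bytes([FLOAT_VECTOR_TAG])


blob = to_binary_string_float([0.5, -0.25, 0.125])
print(len(blob))  # 3 floats * 4 bytes + 1 tag byte = 13
```

Under this assumption, a 256-dimensional embedding takes 1025 bytes per record, which is handy for estimating how much the table will grow.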

Searching for Similar Data

To search for similar data, we will use exact search without a vector index. This approach compares the query vector with every record, so search time grows with the number of records and can become noticeable on large tables. As an example, I used the following query:

$K = 3;
$TargetEmbedding = Knn::ToBinaryStringFloat([-0.16292296f, -0.01153724f, -0.00408009f, -0.02625849f, -0.07566188f, ...]);

SELECT id, title, Knn::CosineDistance(embedding, $TargetEmbedding) AS CosineDistance
FROM articles
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;

Pay attention to the following key points:

$K defines the number of similar records to return.
SELECT id, title specifies the selected fields.
Knn::ToBinaryStringFloat() receives the query vector that the stored embeddings are compared with.

Pass the vector that will be used for searching to the Knn::ToBinaryStringFloat() function. It can be the embedding of an existing record (to find similar records) or the embedding of a user's search query. After running the query, we get the following result:

1. 61a11aab | My title | 0.6590741872787476
2. cb7e144f | My title | 0.6948220729827881
3. 3e24011b | My title | 0.7233924865722656

This shows the three records that best match the query vector according to the similarity criterion. In addition to the selected fields, the query returns the computed CosineDistance value.

This field measures how far apart two vectors are. It equals 1 minus the cosine similarity, so 0 means the vectors point in the same direction, and the higher the value, the less similar the data is.
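
To see why a smaller distance means a closer match, the minimal sketch below reproduces the idea of the query in plain Python. It assumes cosine distance is defined as 1 minus cosine similarity; the function names and the two-dimensional vectors are invented for illustration, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(rows, target, k=3):
    """Mimic ORDER BY distance LIMIT k over (title, embedding) pairs."""
    scored = [(title, cosine_distance(emb, target)) for title, emb in rows]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy 2-dimensional vectors, invented for illustration only.
rows = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
for title, dist in top_k(rows, target=[1.0, 0.0], k=2):
    print(title, dist)
```

Like the YQL query, this compares the target with every stored vector, which is why exact search gets slower as the table grows.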

Conclusion

This was a simple example of how you can start using a recommendation system in your web applications, websites, online stores, and mobile applications.