Embed data in Python and store the vector in Oracle 23ai
I wrote about vector search in Oracle 23ai in Oracle Database 23ai and Vector Search. Part of the process was embedding chunks of data in the database and adding the embeddings as a VECTOR data type in a new table. In this part I use Python to do the embedding outside the database and store the result as a VECTOR in the database using the oracledb module. The Python code does not call any external API, but performs the embedding on my computer using models downloaded from Hugging Face.
Doing the embedding outside the database has some advantages:
- Loading these models with DBMS_VECTOR.LOAD_ONNX_MODEL is not currently supported in the FREE version on a Mac with an Apple chip (it was not supported in 23.5, and now it fails with an internal error in 23.6).
- Embedding is compute-intensive; running it outside the database reduces CPU load on the database server.
- Newer hardware or access to GPUs makes a big difference if your software supports it (like PyTorch with CUDA support).
- The FREE version of Oracle 23ai database will only use two CPUs.
- You can implement parallel processing according to your needs.
- I don’t have to load the ONNX model into the database for embedding.
Regarding the last point, one disadvantage of not having the model in the database is that I cannot easily embed query text directly from SQL with the VECTOR_EMBEDDING function, and I would have to use some service external to the database for that. For a limited amount of text the cost of using an external API should not be much. In my lab I have the same model loaded into the database as I use from Python.
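To make the workaround concrete, here is a minimal sketch using the Sentence Transformers package described later in this post: embed the query text in Python and pass the result as a bind variable, instead of calling VECTOR_EMBEDDING in SQL (the query text here is just an example):

import array
from sentence_transformers import SentenceTransformer

# Same model that produced the stored embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns a numpy.ndarray; convert it to an array of floats
query_vector = array.array('f', model.encode("ripe cherry and tobacco"))
# query_vector can now be bound into VECTOR_DISTANCE() in a SQL statement;
# a complete query is sketched at the end of this post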
What follows are two ways of doing the embedding in Python:
- Using LLM from https://llm.datasette.io/
- Sentence Transformers from https://www.sbert.net/
Wine Reviews
For this test I am using a Wine Tasting dataset available at Kaggle. It has about 130,000 reviews of wine and is useful for testing vector search on the review texts combined with a relational query on the other columns (country, region, price, points, etc.). The dataset is a bit old; the youngest wine is from 2017.¹
Loading the CSV file
In case you want to do the embedding in the database, or just have a look at the dataset the way it is, you can load the data from SQLcl first. For the external embedding in Python you can load the data and embed in one go, see further down.
I started with the file winemag-data-130k-v2.csv from the zip file and loaded it with SQLcl, the command-line version of SQL Developer (download from oracle.com).
I wrote about loading CSV files with SQLcl in Practical Guide to Load Large CSV Files to Oracle.
Being practical means looking at the file and finding an easy way to get the job done (loading the file). In this case some of the review texts contain line breaks stored as carriage return plus newline (\r\n in Unix lingo), so I decided to add a | character at the end of each line that marked the end of a row, with this one-liner in Perl:
perl -pe 's/([^\r])\n/$1|\n/g' winemag-data-130k-v2.csv > winereviews.csv
Since the default for NLS_NUMERIC_CHARACTERS in my region is ,. I changed it for the session with:
alter session set nls_numeric_characters='.,';
You probably won’t have to do that.
These commands change the end-of-line terminator to | and bump up the number of rows to be scanned before creating the table:
set loadformat csv term |
set load scan 5000
Then I loaded it with:
load winereviews winereviews.csv new
Two rows failed (wrong format in file), which is acceptable in this test case.
The review for each wine is included in the description field. The longest has 820 characters. I plan to map each of the descriptions to one point in the 384-dimensional space. In other words, I am not going to slice the text into chunks as I did in the previous blog post on the subject. According to the description of the model at Hugging Face, input text longer than 256 word pieces will be truncated, but all reviews are within this limit.² That is why I add one column to the table for the embedding.
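If you want to verify the limit yourself, a quick sketch (assuming the transformers package is installed, and that data holds the JSON reviews loaded further down) counts word pieces with the tokenizer used by all-MiniLM-L6-v2:

from transformers import AutoTokenizer

# Tokenizer used by all-MiniLM-L6-v2
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Count word pieces for the longest review text
longest = max(data, key=lambda r: len(r["description"]))
print(len(tokenizer(longest["description"])["input_ids"]))  # stays below 256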
With the table in place I can add a new column for the embedding vector for each review³:
alter table winereviews add review_embedding vector(384,FLOAT32);
Why did I use 384 and FLOAT32? Have a look at the model card at Hugging Face.
For instructions on how to do the embedding in the database see my previous post about that.
If you are a DBA you may worry about row chaining; well, you can move the table after the load.
Embedding from Python
There are a few ways to do this; I have tried out a couple in Python.
Install LLM
I have played with the CLI utility and Python library LLM from Simon Willison. This utility lets you interact with LLM models on your own machine, or, if you prefer, use external APIs from OpenAI, Mistral, Anthropic, and Google. See the page on GitHub and his LLM page for more information.
I have Python 3.10.12 installed. Since I will insert the embeddings into Oracle, I will install the oracledb module as well. Set up a virtual environment and install the modules:
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install llm oracledb
The utility comes with a lot of features worth checking out, but here I only want to write simple code to do the embedding with locally available models. You can check the models available out of the box with:
llm models
Only models from OpenAI are listed, so we’ll install the gpt4all plugin:
llm install llm-gpt4all
(For more information on GPT4All, see https://www.nomic.ai/gpt4all.) The models command now shows a much longer list; here is an excerpt of the first models from gpt4all:
gpt4all: all-MiniLM-L6-v2-f16 - SBert, 43.76MB download, needs 1GB RAM
gpt4all: all-MiniLM-L6-v2 - SBert, 43.82MB download, needs 1GB RAM
gpt4all: nomic-embed-text-v1 - Nomic Embed Text v1, 261.58MB download, needs 1GB RAM
gpt4all: nomic-embed-text-v1.5 - Nomic Embed Text v1.5, 261.58MB download, needs 1GB RAM
gpt4all: Llama-3 - Llama 3.2 1B Instruct, 737.21MB download, needs 2GB RAM
gpt4all: qwen2-1_5b-instruct-q4_0 - Qwen2-1.5B-Instruct, 894.10MB download, needs 4GB RAM
gpt4all: Llama-3 - Llama 3.2 3B Instruct, 1.79GB download, needs 4GB RAM
We need another plugin as well in order to embed using local models:
llm install llm-sentence-transformers
Finally we need to register an embedding model:
llm sentence-transformers register all-MiniLM-L6-v2
Load the wine reviews as JSON
The zip file from Kaggle also has a file with the same dataset in JSON format. As I worked with the Python code I figured I could just insert the JSON from Python directly into the database, since Oracle has good support for JSON. Loading the file into Python is easy:
import json

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)
This is how I connect to the database and test the connection. Username, password, and connection string (e.g. localhost/freepdb1) come from the environment:
import oracledb
import os

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')

with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
    with connection.cursor() as cursor:
        sql = "select sysdate from dual"
        for r in cursor.execute(sql):
            print(r)
Code to load the JSON into the table (simple, but not the fastest):
import time

connection = oracledb.connect(user=un, password=pw, dsn=cs)
cursor = connection.cursor()

started = time.time()
for i in range(len(data)):
    cursor.execute("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                   [i, json.dumps(data[i])])
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Elapsed: 36.504868030548096
I cannot let the opportunity pass to show the speed-up with bulk loading (see Chapter 8, Executing Batch Statements and Bulk Loading, in the oracledb documentation). With bulk loading:
started = time.time()
bulk_data = [(i, json.dumps(data[i])) for i in range(len(data))]
cursor.executemany("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                   bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Elapsed: 2.8074018955230713
The bulk loading takes less than 10% of the time (this is with the 23ai FREE version and Python, both running locally on a Mac mini M1).
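One caveat with executemany() is that the full bulk_data list is built in memory first. If that becomes a problem with bigger datasets, a simple variant (a sketch, not timed here) is to insert in batches:

BATCH = 10000

started = time.time()
for start in range(0, len(data), BATCH):
    # Build and insert one batch at a time to limit memory use
    batch = [(i, json.dumps(data[i]))
             for i in range(start, min(start + BATCH, len(data)))]
    cursor.executemany("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                       batch)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))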
OK, back to the main subject.
The staging table
Here is the SQL to create the staging table with a column for the embedding vector:
create table winereviews_stage(
id integer constraint winereviews_stage_pk primary key,
wine_data json,
description_embedding vector(384,float32));
Embed with LLM from datasette.io
This code is all it takes to store the JSON together with an embedding of the description field in the database, in one go:
import json
import oracledb
import os
import llm
import time
import array

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')  # dbhost.example.com/orclpdb

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)

# Load the model
embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
# Create connection
connection = oracledb.connect(user=un, password=pw, dsn=cs)
# Create cursor
cursor = connection.cursor()

started = time.time()
# Use embed() from the model to embed each description.
# embed() returns a list; convert it to an array of floats.
# List comprehension to create a list of tuples.
bulk_data = [(i, json.dumps(data[i]),
              array.array('f', embedding_model.embed(data[i]["description"])))
             for i in range(len(data))]
cursor.executemany("""
    insert into winereviews_stage(id,wine_data,DESCRIPTION_EMBEDDING)
    values(:1,:2,:3)
    """, bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Only two methods are used from the LLM module: llm.get_embedding_model() loads the model into an object called embedding_model, and embed() embeds a string.
Embedding with Sentence Transformers from sbert.net
As an alternative, and perhaps slightly easier to install, we can use Sentence Transformers, a.k.a. SBERT. They have a large collection of pre-trained models and their own Python package for embedding.
Create a new directory somewhere and a new virtual environment with the required packages:
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install oracledb sentence-transformers
The code does not change much:
import json
import oracledb
import os
import time
import array
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')  # dbhost.example.com/orclpdb

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)

connection = oracledb.connect(user=un, password=pw, dsn=cs)
cursor = connection.cursor()

print("Truncating table")
cursor.execute("truncate table winereviews_stage")

print("Bulk insert")
started = time.time()
bulk_data = [(i, json.dumps(data[i]),
              array.array('f', model.encode(data[i]["description"])))
             for i in range(len(data))]
cursor.executemany("insert into winereviews_stage(id,wine_data,DESCRIPTION_EMBEDDING) values(:1,:2,:3)",
                   bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
In short, one line to load the model, and the method encode() to embed the string. encode() returns a numpy.ndarray, but the conversion to an array of floats uses the same constructor from the array module.
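A possible speed-up I have not timed here: encode() also accepts a list of strings and embeds them in batches, which is typically faster than calling it once per description. A sketch:

# Encode all descriptions in one call; the result is a 2D numpy.ndarray
texts = [row["description"] for row in data]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Same conversion as before, row by row
bulk_data = [(i, json.dumps(data[i]), array.array('f', embeddings[i]))
             for i in range(len(data))]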
About inserting vectors with oracledb
The documentation for the oracledb package is quite good and explains how to store VECTOR values from Python; see Chapter 14 in the python-oracledb documentation.
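Going the other way, python-oracledb fetches a VECTOR column back as a Python array by default. A small sketch, reusing the cursor and staging table from above:

cursor.execute("select description_embedding from winereviews_stage where id = 0")
(vec,) = cursor.fetchone()
print(type(vec), len(vec))  # array('f', ...), 384 elements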
Relational data with JSON_TABLE
We can convert the JSON in the WINE_DATA column with the JSON_TABLE function in SQL, and create a new table that includes the vector column that was populated from Python:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w, json_table(w.wine_data columns title,
country,region_1,region_2,designation,price,points,variety,
taster_name,description) jt;
but this creates all columns as VARCHAR2(4000). This version adds the types I want:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w, json_table(
w.wine_data columns (
title varchar(200),
country varchar2(50),
region_1 varchar2(50),
region_2 varchar2(50),
designation varchar2(50),
price number,
points number,
variety varchar2(50),
taster_name varchar2(100),
description varchar(2000))) jt;
NESTED is a shortcut that works in this case:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w NESTED wine_data columns (
title varchar(200),
country varchar2(50),
region_1 varchar2(50),
region_2 varchar2(50),
designation varchar2(50),
price number,
points number,
variety varchar2(50),
taster_name varchar2(100),
description varchar(2000)) jt;
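With the typed table in place, the embedding vector can be combined with ordinary predicates. Here is a sketch of such a query from Python, reusing a cursor from the sessions above (the query text and filters are made up for illustration):

import array
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Embed the query text client-side, as discussed earlier
qvec = array.array('f', model.encode("blackcurrant, cedar and firm tannins"))

cursor.execute("""
    select title, points, price
    from winereviews_embedded
    where country = 'France' and price < 50
    order by vector_distance(description_embedding, :qv, COSINE)
    fetch first 10 rows only""", qv=qvec)
for row in cursor:
    print(row)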
More research needed
I have a few ideas on what I can do with these data. After all, 2017 is not that old for a good Bordeaux; perhaps I can find one of my favorites with a normal relational search, and then use vector search to identify more wines similar to my taste. I should probably try out different models and compare the results on familiar wines before I start drinking based on the first vector search.
Comparing embedding done in the database with external embedding is also interesting. I have seen bugs earlier regarding vector distance (like cosine distance above 1).
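One way to check for such discrepancies (a sketch, reusing the cursor from above and picking two arbitrary rows from the staging table): compute the cosine distance in Python with numpy and compare it with VECTOR_DISTANCE in the database:

import numpy as np

# Distance computed by the database
cursor.execute("""
    select vector_distance(a.description_embedding, b.description_embedding, COSINE)
    from winereviews_stage a, winereviews_stage b
    where a.id = 0 and b.id = 1""")
(db_dist,) = cursor.fetchone()

# The same distance computed in Python
cursor.execute("""
    select description_embedding from winereviews_stage
    where id in (0, 1) order by id""")
(v0,), (v1,) = cursor.fetchall()
a, b = np.array(v0), np.array(v1)
py_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(db_dist, py_dist)  # should agree to within float precision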
There is a lot to explore here with these embedding vectors ;-)
Similar posts
- PyTorch to vectorize images, REST APIs to POST to Oracle: PyTorch and the transformers package used to embed images in Python and store them in Oracle through a REST API offered by ORDS.
- Using Cohere to Generate Vector Embeddings: the same wine reviews dataset, embedded using an API at Cohere. Using OpenAI to Generate Vector Embeddings does the same with an API at OpenAI.
1. The dataset does not contain a separate column for the production year, but since the title usually includes the vintage you can find the youngest with this SQL: select max(regexp_substr(title,'[12][0-9]{3}')) year from winereviews where regexp_like(title,'[12][0-9]{3}'); ↩︎
2. According to https://platform.openai.com/tokenizer the average for English text is about 100 tokens for 75 words. The maximum word count in a review (description) is 138: select max(regexp_count(description,'(\w)\W')) word_count from winereviews; or about 185 tokens. ↩︎
3. See Embedding Vectors vs. Vector Embeddings for Tanel Poder's take on the correct writing. I think he is right and will try to use "embedding vector" from now on. ↩︎