Embed data in Python and store the vector in Oracle 23ai
I wrote about vector search in Oracle 23ai in Oracle Database 23ai and Vector Search. Part of the process was embedding chunks of data in the database and adding the embeddings as a VECTOR data type in a new table. In this part I use Python to do the embedding outside the database and store the result as a VECTOR in the database using the oracledb module. The Python code does not call any external API, but performs the embedding on my computer using models downloaded from Hugging Face.
Doing the embedding outside the database has some advantages:
- Loading these models with DBMS_VECTOR.LOAD_ONNX_MODEL is not currently supported in the FREE version on a Mac with an Apple chip (it was not supported in 23.5, and now it fails with an internal error in 23.6).
- Embedding is compute-intensive; running it outside the database reduces CPU load on the database server.
- Newer hardware or access to GPUs makes a big difference if your software supports it (like PyTorch with CUDA support).
- The FREE version of Oracle 23ai database will only use two CPUs.
- You can implement parallel processing according to your needs.
- I don’t have to load the ONNX model into the database for embedding.
Regarding the last point, one disadvantage of not having the model in the database is that I cannot easily embed query text directly from SQL with the VECTOR_EMBEDDING function, and I would have to use some service external to the database for that. For a limited amount of text the cost of using an external API should not be much. In my lab I have the same model loaded into the database as I use from Python.
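To make the workaround concrete, here is a minimal sketch using the Sentence Transformers package described later in this post: embed the query text in Python and pass the result as a bind variable, instead of calling VECTOR_EMBEDDING in SQL (the query text here is just an example):

import array
from sentence_transformers import SentenceTransformer

# Same model that produced the stored embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns a numpy.ndarray; convert it to an array of floats
query_vector = array.array('f', model.encode("ripe cherry and tobacco"))
# query_vector can now be bound into VECTOR_DISTANCE() in a SQL statement;
# a complete query is sketched at the end of this post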
What follows are two ways of doing the embedding in Python:
- Using LLM from https://llm.datasette.io/
- Sentence Transformers from https://www.sbert.net/
Wine Reviews
For this test I am using a Wine Tasting dataset available at Kaggle. It has about 130,000 reviews of wine and is useful for testing vector search on the review texts combined with a relational query on the other columns (country, region, price, points, etc.). The dataset is a bit old; the youngest wine is from 2017.¹
Loading the CSV file
In case you want to do the embedding in the database, or just have a look at the dataset the way it is, you can load the data from SQLcl first. For the external embedding in Python you can load the data and embed in one go, see further down.
I started with the file winemag-data-130k-v2.csv from the zip file and loaded it with SQLcl, the command-line version of SQL Developer (download from oracle.com).
I wrote about loading CSV files with SQLcl in Practical Guide to Load Large CSV Files to Oracle.
Being practical means looking at the file and finding an easy way to get the job done (loading the file). In this case some of the review texts contain line breaks stored as carriage return plus newline (\r\n in Unix lingo), so I decided to add a | character at the end of each line that marked the end of a row, with this one-liner in Perl:
perl -pe 's/([^\r])\n/$1|\n/g' winemag-data-130k-v2.csv > winereviews.csv
Since the default for NLS_NUMERIC_CHARACTERS in my region is ,. I changed it for the session with:
alter session set nls_numeric_characters='.,';
You probably won’t have to do that.
These commands change the end-of-line terminator to | and bump up the number of rows to be scanned before creating the table:
set loadformat csv term |
set load scan 5000
Then I loaded it with:
load winereviews winereviews.csv new
Two rows failed (wrong format in file), which is acceptable in this test case.
The review for each wine is included in the description field. The longest has 820 characters. I plan to map each of the descriptions to one point in the 384-dimensional space. In other words, I am not going to slice the text into chunks as I did in the previous blog post on the subject. According to the description of the model at Hugging Face, input text longer than 256 word pieces will be truncated, but all reviews are within this limit.² That is why I add one column to the table for the embedding.
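If you want to verify the limit yourself, a quick sketch (assuming the transformers package is installed, and that data holds the JSON reviews loaded further down) counts word pieces with the tokenizer used by all-MiniLM-L6-v2:

from transformers import AutoTokenizer

# Tokenizer used by all-MiniLM-L6-v2
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Count word pieces for the longest review text
longest = max(data, key=lambda r: len(r["description"]))
print(len(tokenizer(longest["description"])["input_ids"]))  # stays below 256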
With the table in place I can add a new column for the embedding vector for each review³:
alter table winereviews add review_embedding vector(384,FLOAT32);
Why did I use 384 and FLOAT32? Have a look at the model card at Hugging Face.
For instructions on how to do the embedding in the database see my previous post about that.
If you are a DBA you may worry about row chaining; well, you can move the table after the load.
Embedding from Python
There are a few ways to do this; I have tried out a couple in Python.
Install LLM
I have played with the CLI utility and Python library LLM from Simon Willison. This utility lets you interact with LLM models on your own machine, or, if you prefer, use external APIs from OpenAI, Mistral, Anthropic, and Google. See the page on GitHub and his LLM page for more information.
I have Python 3.10.12 installed. Since I will insert the embeddings into Oracle, I will install the oracledb module as well. Set up a virtual environment and install the modules:
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install llm oracledb
The utility comes with a lot of features worth checking out, but here I only want to write simple code to do the embedding with locally available models. You can check the models available out of the box with:
llm models
Only models from OpenAI are listed, so we’ll install the gpt4all plugin:
llm install llm-gpt4all
(For more information on GPT4All, see https://www.nomic.ai/gpt4all.) The models command now shows a much longer list; here is an excerpt of the first models from gpt4all:
gpt4all: all-MiniLM-L6-v2-f16 - SBert, 43.76MB download, needs 1GB RAM
gpt4all: all-MiniLM-L6-v2 - SBert, 43.82MB download, needs 1GB RAM
gpt4all: nomic-embed-text-v1 - Nomic Embed Text v1, 261.58MB download, needs 1GB RAM
gpt4all: nomic-embed-text-v1.5 - Nomic Embed Text v1.5, 261.58MB download, needs 1GB RAM
gpt4all: Llama-3 - Llama 3.2 1B Instruct, 737.21MB download, needs 2GB RAM
gpt4all: qwen2-1_5b-instruct-q4_0 - Qwen2-1.5B-Instruct, 894.10MB download, needs 4GB RAM
gpt4all: Llama-3 - Llama 3.2 3B Instruct, 1.79GB download, needs 4GB RAM
We need another plugin as well in order to embed using local models:
llm install llm-sentence-transformers
Finally we need to register an embedding model:
llm sentence-transformers register all-MiniLM-L6-v2
Load the wine reviews as JSON
The zip file from Kaggle also has a file with the same dataset in JSON format. As I worked with the Python code I figured I could just insert the JSON from Python directly into the database, since Oracle has good support for JSON. Loading the file into Python is easy:
import json

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)
This is how I connect to the database and test the connection. Username, password, and connection string (e.g. localhost/freepdb1) come from the environment:
import oracledb
import os

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')

with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
    with connection.cursor() as cursor:
        sql = "select sysdate from dual"
        for r in cursor.execute(sql):
            print(r)
Code to load the JSON into the table (simple, but not the fastest):
import time

connection = oracledb.connect(user=un, password=pw, dsn=cs)
cursor = connection.cursor()

started = time.time()
for i in range(len(data)):
    cursor.execute("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                   [i, json.dumps(data[i])])
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Elapsed: 36.504868030548096
I cannot let the opportunity pass to show the speed-up with bulk loading (see Chapter 8, Executing Batch Statements and Bulk Loading, in the oracledb documentation). With bulk loading:
started = time.time()
bulk_data = [(i, json.dumps(data[i])) for i in range(len(data))]
cursor.executemany("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                   bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Elapsed: 2.8074018955230713
The bulk loading takes less than 10% of the time (this is with the 23ai FREE version and Python, both running locally on a Mac mini M1).
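One caveat with executemany() is that the full bulk_data list is built in memory first. If that becomes a problem with bigger datasets, a simple variant (a sketch, not timed here) is to insert in batches:

BATCH = 10000

started = time.time()
for start in range(0, len(data), BATCH):
    # Build and insert one batch at a time to limit memory use
    batch = [(i, json.dumps(data[i]))
             for i in range(start, min(start + BATCH, len(data)))]
    cursor.executemany("insert into winereviews_stage(id,wine_data) values(:1,:2)",
                       batch)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))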
OK, back to the main subject.
The staging table
Here is the SQL to create the staging table with a column for the embedding vector:
create table winereviews_stage(
id integer constraint winereviews_stage_pk primary key,
wine_data json,
description_embedding vector(384,float32));
Embed with LLM from datasette.io
This code is all it takes to store the JSON together with an embedding of the description field in the database, in one go:
import json
import oracledb
import os
import llm
import time
import array

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')  # dbhost.example.com/orclpdb

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)

# Load the model
embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
# Create connection
connection = oracledb.connect(user=un, password=pw, dsn=cs)
# Create cursor
cursor = connection.cursor()

started = time.time()
# Use embed() from the model to embed each description.
# embed() returns a list; convert it to an array of floats.
# List comprehension to create a list of tuples.
bulk_data = [(i, json.dumps(data[i]),
              array.array('f', embedding_model.embed(data[i]["description"])))
             for i in range(len(data))]
cursor.executemany("""
    insert into winereviews_stage(id,wine_data,DESCRIPTION_EMBEDDING)
    values(:1,:2,:3)
    """, bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
Only two methods are used from the LLM module: llm.get_embedding_model() loads the model into an object called embedding_model, and embed() embeds a string.
Embedding with Sentence Transformers from sbert.net
As an alternative, and perhaps slightly easier to install, we can use Sentence Transformers, a.k.a. SBERT. They have a large collection of pre-trained models and their own Python package for embedding.
Create a new directory somewhere and a new virtual environment with the required packages:
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install oracledb sentence-transformers
The code does not change much:
import json
import oracledb
import os
import time
import array
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')  # dbhost.example.com/orclpdb

with open("winemag-data-130k-v2.json") as f:
    data = json.load(f)

connection = oracledb.connect(user=un, password=pw, dsn=cs)
cursor = connection.cursor()

print("Truncating table")
cursor.execute("truncate table winereviews_stage")

print("Bulk insert")
started = time.time()
bulk_data = [(i, json.dumps(data[i]),
              array.array('f', model.encode(data[i]["description"])))
             for i in range(len(data))]
cursor.executemany("insert into winereviews_stage(id,wine_data,DESCRIPTION_EMBEDDING) values(:1,:2,:3)",
                   bulk_data)
connection.commit()
print("Elapsed: {}".format(str(time.time() - started)))
In short, one line to load the model, and the method encode() to embed the string. encode() returns a numpy.ndarray, but the conversion to an array of floats uses the same constructor from the array module.
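A possible speed-up I have not timed here: encode() also accepts a list of strings and embeds them in batches, which is typically faster than calling it once per description. A sketch:

# Encode all descriptions in one call; the result is a 2D numpy.ndarray
texts = [row["description"] for row in data]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Same conversion as before, row by row
bulk_data = [(i, json.dumps(data[i]), array.array('f', embeddings[i]))
             for i in range(len(data))]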
About inserting vectors with oracledb
The documentation for the oracledb package is quite good and explains how to store VECTOR values from Python; see Chapter 14 in the python-oracledb documentation.
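Going the other way, python-oracledb fetches a VECTOR column back as a Python array by default. A small sketch, reusing the cursor and staging table from above:

cursor.execute("select description_embedding from winereviews_stage where id = 0")
(vec,) = cursor.fetchone()
print(type(vec), len(vec))  # array('f', ...), 384 elements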
Relational data with JSON_TABLE
We can convert the JSON in the WINE_DATA column with the JSON_TABLE function in SQL, and create a new table that includes the vector column that was populated from Python:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w, json_table(w.wine_data columns title,
country,region_1,region_2,designation,price,points,variety,
taster_name,description) jt;
but this creates all columns as VARCHAR2(4000). This version adds the types I want:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w, json_table(
w.wine_data columns (
title varchar(200),
country varchar2(50),
region_1 varchar2(50),
region_2 varchar2(50),
designation varchar2(50),
price number,
points number,
variety varchar2(50),
taster_name varchar2(100),
description varchar(2000))) jt;
NESTED is a shortcut that works in this case:
create table winereviews_embedded
as select jt.*, w.description_embedding
from winereviews_stage w NESTED wine_data columns (
title varchar(200),
country varchar2(50),
region_1 varchar2(50),
region_2 varchar2(50),
designation varchar2(50),
price number,
points number,
variety varchar2(50),
taster_name varchar2(100),
description varchar(2000)) jt;
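With the typed table in place, the embedding vector can be combined with ordinary predicates. Here is a sketch of such a query from Python, reusing a cursor from the sessions above (the query text and filters are made up for illustration):

import array
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Embed the query text client-side, as discussed earlier
qvec = array.array('f', model.encode("blackcurrant, cedar and firm tannins"))

cursor.execute("""
    select title, points, price
    from winereviews_embedded
    where country = 'France' and price < 50
    order by vector_distance(description_embedding, :qv, COSINE)
    fetch first 10 rows only""", qv=qvec)
for row in cursor:
    print(row)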
More research needed
I have a few ideas on what I can do with these data. After all, 2017 is not that old for a good Bordeaux; perhaps I can find one of my favorites with a normal relational search, and then use vector search to identify more wines similar to my taste. I should probably try out different models and compare the results on familiar wines before I start drinking based on the first vector search.
Comparing embedding done in the database with external embedding is also interesting. I have seen bugs earlier regarding vector distance (like cosine distance above 1).
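One way to check for such discrepancies (a sketch, reusing the cursor from above and picking two arbitrary rows from the staging table): compute the cosine distance in Python with numpy and compare it with VECTOR_DISTANCE in the database:

import numpy as np

# Distance computed by the database
cursor.execute("""
    select vector_distance(a.description_embedding, b.description_embedding, COSINE)
    from winereviews_stage a, winereviews_stage b
    where a.id = 0 and b.id = 1""")
(db_dist,) = cursor.fetchone()

# The same distance computed in Python
cursor.execute("""
    select description_embedding from winereviews_stage
    where id in (0, 1) order by id""")
(v0,), (v1,) = cursor.fetchall()
a, b = np.array(v0), np.array(v1)
py_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(db_dist, py_dist)  # should agree to within float precision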
There is a lot to explore here with these embedding vectors ;-)
Similar posts
- PyTorch to vectorize images, REST APIs to POST to Oracle: PyTorch and the transformers package used to embed images in Python and store them in Oracle through a REST API offered by ORDS.
- Using Cohere to Generate Vector Embeddings: the same wine reviews dataset, embedded using an API at Cohere. Using OpenAI to Generate Vector Embeddings does the same with an API at OpenAI.
1. The dataset does not contain a separate column for the production year, but since the title usually includes the vintage you can find the youngest with this SQL: select max(regexp_substr(title,'[12][0-9]{3}')) year from winereviews where regexp_like(title,'[12][0-9]{3}'); ↩︎
2. According to https://platform.openai.com/tokenizer the average for English text is about 100 tokens for 75 words. The maximum word count in a review (description) is 138: select max(regexp_count(description,'(\w)\W')) word_count from winereviews; or about 185 tokens. ↩︎
3. See Embedding Vectors vs. Vector Embeddings for Tanel Poder's take on the correct writing. I think he is right and will try to use "embedding vector" from now on. ↩︎