Oracle Database 23ai and Vector Search

May 15, 2024

23ai, database, oracle, podman

Oracle database 23ai logo

Oracle launched version 23ai of the database on May 2. Immediately after the event we could pull the Free edition from the Oracle Container Registry and start playing with it.

I blogged about how you can set up 23c Free in Your Own Oracle Database Lab in 1-2-3 218 days ago 1.

This command pulls the latest version:

podman pull container-registry.oracle.com/database/free:latest

Have a look in Your Own Oracle Database Lab in 1-2-3 for more details on how to create a new container with it. I created mine with:

podman run -p 21521:1521 -v /tmp/models:/mnt --name ora23ai_free \
-e ENABLE_ARCHIVELOG -e ORACLE_PWD=somepassword \
container-registry.oracle.com/database/free:latest

I wanted to explore the new Vector Search and started with the AI Vector Search User’s Guide. It has a fairly good introduction to what this is all about and explains:

  • The new VECTOR data type
  • Vector Embeddings
  • Similarity Search
  • Vector Embedding Models
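
These concepts are easier to grasp with a toy example. Below is a minimal Python sketch of my own (not from the guide), where short lists of floats stand in for real vector embeddings, and similarity is measured with cosine distance, the same metric used in the queries later in this post:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; 0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Toy "embeddings" with made-up values; a real sentence transformer
# such as all-MiniLM-L12-v2 produces e.g. 384 dimensions.
ipa = [0.9, 0.1, 0.2]
stout = [0.1, 0.8, 0.3]
session_ipa = [0.85, 0.15, 0.25]

# A vector pointing in nearly the same direction as "ipa" gets a
# smaller distance than one pointing elsewhere.
print(cosine_distance(ipa, session_ipa) < cosine_distance(ipa, stout))  # True
```

Similarity search is then just a matter of ranking stored vectors by this distance to a query vector.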

Obviously, I am not an expert on this, but this field is something I want to work more with, and I believe this possibility to apply AI on data in the database without using external APIs has a lot of potential.

So I started with the User’s Guide and tried out some of the new stuff.

I decided to look at textual data first. From the manual:

For textual data, sentence transformers transform words, sentences, or paragraphs into vector embeddings.

You can generate these embeddings outside the database and import them into the database later, but if you have a model in the ONNX format, then you can import it and generate those embeddings on your data right inside the database. That might not seem like a big difference, but for me as a DBA it means:

  • I don’t have to offload a lot of data
  • The data are kept safe (think database security, access, leakage, etc)
  • I expect excellent performance on engineered systems with code running really close to the data

Importing a pretrained model

The database does not come with any embedding models, but we can load models in ONNX format; see https://onnx.ai for more information about the Open Neural Network Exchange (ONNX) format. That site links to models that can be downloaded from https://github.com/onnx.

Models that are in other formats (from Huggingface.co and others) can be converted with the Oracle Machine Learning for Python (OML4Py) Client. I tried some pretrained models in ONNX format, but struggled to load them, so I decided to look at the OML4Py Client. That resulted in a separate post, How to install OML4Py Client on Ubuntu (and delayed this post…). That post also has links to documentation on how to install it on Oracle Linux.

OML4Py Client comes with some preconfigured models that can be exported to a file:

from oml.utils import EmbeddingModel, EmbeddingModelConfig
EmbeddingModelConfig.show_preconfigured()

The last function currently lists:

['sentence-transformers/all-mpnet-base-v2', 'sentence-transformers/all-MiniLM-L6-v2', 
'sentence-transformers/multi-qa-MiniLM-L6-cos-v1', 'ProsusAI/finbert', 
'medicalai/ClinicalBERT', 'sentence-transformers/distiluse-base-multilingual-cased-v2', 
'sentence-transformers/all-MiniLM-L12-v2', 'BAAI/bge-small-en-v1.5', 
'BAAI/bge-base-en-v1.5', 'taylorAI/bge-micro-v2', 'intfloat/e5-small-v2', 
'intfloat/e5-base-v2', 'prajjwal1/bert-tiny', 'thenlper/gte-base', 
'thenlper/gte-small', 'TaylorAI/gte-tiny', 'infgrad/stella-base-en-v2', 
'sentence-transformers/paraphrase-multilingual-mpnet-base-v2', 
'intfloat/multilingual-e5-base', 'intfloat/multilingual-e5-small', 
'sentence-transformers/stsb-xlm-r-multilingual']

You can pick a model from the list above and look it up at https://huggingface.co/, for example: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2.

You can export one in two steps:

em = EmbeddingModel(model_name="sentence-transformers/all-MiniLM-L12-v2")
em.export2file("all_MiniLM_L12_v2", output_dir=".")

The last function downloads a few files before the export completes. The resulting file name is the given string with the extension .onnx. Move that file to a directory on the database server and create a directory object in the database for it:

create directory dm_import as '/tmp/models';
create user oisene identified by "oisene" 
default tablespace users quota unlimited on users;
grant db_developer_role to oisene;
grant create mining model to oisene;
grant read, write on directory dm_import to oisene;

Connect to the new user (oisene) and start the import:

begin
   DBMS_VECTOR.LOAD_ONNX_MODEL(
    'DM_IMPORT',
    'all_MiniLM_L12_v2.onnx',
    'minilm_l12_v2');
end;
/

The last parameter is the name of the loaded model and will be used later. After the load we can check USER_MINING_MODELS:

select model_name, mining_function, algorithm,algorithm_type,creation_date
from user_mining_models;
MODEL_NAME       MINING_FUNCTION    ALGORITHM    ALGORITHM_TYPE    CREATION_DATE    
________________ __________________ ____________ _________________ ________________ 
MINILM_L12_V2    EMBEDDING          ONNX         NATIVE            15-MAY-24        

More info about the model is available in USER_MINING_MODEL_ATTRIBUTES:

SELECT model_name, attribute_name, attribute_type, data_type, vector_info
FROM user_mining_model_attributes
ORDER BY ATTRIBUTE_NAME;
MODEL_NAME       ATTRIBUTE_NAME    ATTRIBUTE_TYPE    DATA_TYPE    VECTOR_INFO            
________________ _________________ _________________ ____________ ______________________ 
MINILM_L12_V2    DATA              TEXT              VARCHAR2                            
MINILM_L12_V2    ORA$ONNXTARGET    VECTOR            VECTOR       VECTOR(384,FLOAT32)    

Loading a model creates two views, in this case DM$VMMINILM_L12_V2 and DM$VJMINILM_L12_V2, where the last part of the view name is the model name.

Test Data

There are plenty of data sets available, but for this case a mix of structured and unstructured data is ideal. I downloaded beer review data from Kaggle, licensed under CC BY 4.0 DEED.

The largest dataset includes a file beer_profile_and_ratings.csv with 3197 rows in it; not exactly big data, but sufficient to demo the functionality. This being beer, some analysis before tasting may be worthwhile, instead of starting from the top or just picking all the IPAs at random.

I’m using SQLcl to load the data:

load beer_reviews beer_profile_and_ratings.csv new

This simple command scans the CSV file, creates a new table based on the scan, sets the LOADFORMAT and LOAD properties, and finally loads the data. It even prints out the DDL statement for the table when it is done. (One quick tip: check your NLS settings with show parameter nls. If nls_territory is NORWAY, or any other country with a different number format than the US, change it with alter session set nls_territory='AMERICA'; this ensures that the correct NUMBER datatype will be used.)
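
The reason for the NLS tip is the decimal separator: with a Norwegian territory setting, the number format the session expects does not match the values in the file, so the scan can fall back to VARCHAR2 instead of NUMBER. A small Python illustration of the same locale pitfall (not SQLcl itself, just the principle):

```python
# In Norway the decimal separator is a comma, so a value such as "3,14"
# does not parse under the US/AMERICA convention, and vice versa.
norwegian_value = "3,14"

try:
    number = float(norwegian_value)      # float() expects a dot separator
except ValueError:
    # The plain parse fails; normalizing the separator first works.
    number = float(norwegian_value.replace(",", "."))

print(number)  # 3.14
```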

I did rename a couple of columns afterwards to simplify the SQL later.

One note about the dataset: some reviews have only the word “Notes:” in the description. This SQL counts how many have real notes per style:

select style,count(*),
    count(case when length(description)>10 then 1 else null end) cnt_descr
from BEER_REVIEWS 
group by style
order by 3 desc;

I decided to remove those dummy values and make some more changes:

delete from BEER_REVIEWS where description='Notes:';
alter table beer_reviews add id number;
update beer_reviews set id=rownum;
alter table beer_reviews add constraint beer_reviews_pk primary key(id);
alter table beer_reviews rename column X#NAME to NAME;
alter table beer_reviews rename column BEER_NAME_#FULL# to BEER_NAME_FULL;

Workflow

At the end of the Overview chapter in the User’s Guide there is an Oracle AI Vector Search Workflow that is useful to follow, though I didn’t follow it entirely in this case. The five steps there map to chapters 3 to 8.

Embeddings

I decided to perform this inside the database.

New table to store the embeddings:

create table beer_review_chunks(
    beer_review_id number, 
    chunk_id number,
    chunk_data varchar2(4000),
    chunk_embedding vector);

One (!) insert statement to embed all the descriptions:

insert into beer_review_chunks
select br.id beer_review_id, et.embed_id chunk_id, et.embed_data chunk_data, 
    to_vector(et.embed_vector) chunk_embedding
from beer_reviews br,
dbms_vector_chain.utl_to_embeddings(
       dbms_vector_chain.utl_to_chunks(
            dbms_vector_chain.utl_to_text(br.description), 
            json('{"normalize":"all"}')),
       json('{"provider":"database", "model":"MINILM_L12_V2"}')) t,
    JSON_TABLE(t.column_value, '$[*]' 
        COLUMNS (embed_id NUMBER PATH '$.embed_id', 
            embed_data VARCHAR2(4000) PATH '$.embed_data', 
            embed_vector CLOB PATH '$.embed_vector')) et;
commit;

This actually took a few minutes on my old lab server.
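
To unpack what the statement does: utl_to_embeddings returns a collection where each element is a JSON document with the fields embed_id, embed_data, and embed_vector, and JSON_TABLE turns those fields into relational columns. A rough Python equivalent of that last unpacking step; the document below is hand-made, only the field names come from the query above, and the vector arrives in textual form, which is why the query wraps it in to_vector():

```python
import json

# One element of the collection returned by utl_to_embeddings looks roughly
# like this (values invented; only the field names matter here):
doc = json.dumps({
    "embed_id": 1,
    "embed_data": "Notes:Bold, spicy and menacingly delicious ...",
    "embed_vector": "[0.12, -0.03, 0.56]",
})

# JSON_TABLE in the insert statement does the equivalent of:
row = json.loads(doc)
chunk_id = row["embed_id"]                # NUMBER PATH '$.embed_id'
chunk_data = row["embed_data"]            # VARCHAR2 PATH '$.embed_data'
vector = json.loads(row["embed_vector"])  # textual vector -> list of floats

print(chunk_id, len(vector))  # prints: 1 3
```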

Querying with Similarity Searches

Yes, I skipped creating the vector indexes first.

Generating a vector for a search text:

VARIABLE query_string VARCHAR2(1000)
VARIABLE query_vector CLOB
BEGIN
  :query_string := 'IPA without fancy stuff';
  SELECT vector_embedding(MINILM_L12_V2 using :query_string as data) into :query_vector;
END;
/
 
PRINT query_vector

Query for similar beers

SELECT brc.beer_review_id, brc.chunk_id, br.name beer_name, brc.chunk_data
FROM beer_review_chunks brc join beer_reviews br on brc.beer_review_id=br.id
ORDER BY vector_distance(chunk_embedding , :query_vector, COSINE)
FETCH FIRST 4 ROWS ONLY;

You will have to scroll to the right to see all the text:

   BEER_REVIEW_ID    CHUNK_ID BEER_NAME                     CHUNK_DATA                                                                                                                                                                                                                                                                                                                                                                                
_________________ ___________ _____________________________ _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________ 
              497           2 East India Pale Ale           stone fruits and firm bitterness from IPA's ancestral British home. Give our East IPA a try with some rich crab cakes or salmon, strike up a conversation with farmhouse cheddars, and find harmony alongside spicy dishes. East IPA's blend of tradition and exuberance sets the standard for hop-driven deliciousness.\t                                                                
              504           1 All Day IPA                   Notes:The beer you've been waiting for. Keeps your taste satisfied while keeping your senses sharp. An all-day IPA naturally brewed with a complex array of malts, grains and hops. Balanced for optimal aromatics and a clean finish. The perfect reward for an honest day's work and the ultimate companion to celebrate life's simple pleasures.Formerly Endurance - All Day IPA.\t    
              505           1 HopDevil                      Notes:Bold, spicy and menacingly delicious, this American-hopped IPA offers an aromatic punch and follows through with a lasting, full-bodied finish\t                                                                                                                                                                                                                                    
              610           1 Lost Sailor India Pale Ale    Notes:You don't need a compass to find your way to our flagship India Pale Ale. This classic British-inspired IPA boasts a well-rounded malt profile, complex floral and citrus aromas, and generous Goldings dry hopping. Well balanced between malt sweetness and hop bitterness.\t                   

Let’s create a new query vector and repeat the query:

BEGIN
    :query_string := 'Pepper mint funky beer';
    SELECT vector_embedding(MINILM_L12_V2 using :query_string as data) into :query_vector;
END;
/

SELECT brc.beer_review_id, brc.chunk_id, br.name beer_name, brc.chunk_data
FROM beer_review_chunks brc join beer_reviews br on brc.beer_review_id=br.id
ORDER BY vector_distance(chunk_embedding , :query_vector, COSINE)
FETCH FIRST 4 ROWS ONLY;
   BEER_REVIEW_ID    CHUNK_ID BEER_NAME                                   CHUNK_DATA                                                                                                                                                                                                                                                                                                                                                                                                 
_________________ ___________ ___________________________________________ __________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________ 
              310           1 Billy's Chilies Beer (Timberline Series)    Notes:Our chili beer is made with five different kinds of fresh chili peppers. Rich pepper aroma greets you when the bottle is opened, followed by its satisfying beer taste and then subtle spiciness.\t                                                                                                                                                                                                  
              455           1 Mexican Chocolate Stout                     Notes:Gold Medal Winner at Great American Beer Fest - Herb and Spice Category. A rich black ale with roasted malt flavors blended with bittersweet chocolate, our own blend of three Mexican chili peppers and cinnamon.\t                                                                                                                                                                                 
              312           1 Hot-Jala-Heim                               Notes:This style of beer is classified in the "Fruit/Vegetable" category as a "Chili beer". The focus is obviously on the chili peppers used during the brewing process. Our version uses a mix of hot peppers, including jalapenos and anaheims (hence the name...Hot-Jala-Heim). It has a delightful 'peppery' aroma as well as taste with a 'mild' sensation of heat experienced after swallowing.\t    
              335           1 Green Chili Beer                            Notes:This is the chili-heads favorite brew. A mild base beer is spiced up with Anaheim and Serrano peppers. This beer has a wonderful green chili aroma and a touch of heat in the finish. Have it with tomato juice as a "liquid enchilada".\t                                                                                                                                                           

These two queries return really fast without a vector index, even on rather slow hardware. But then again, it is not much data. Here is the explain plan for the SQL above:

------------------------------------------------------------------------------------------------------    
| Id  | Operation               | Name               | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    
------------------------------------------------------------------------------------------------------    
|   0 | SELECT STATEMENT        |                    |     4 |  8376 |       |  1176   (1)| 00:00:01 |    
|*  1 |  COUNT STOPKEY          |                    |       |       |       |            |          |    
|   2 |   VIEW                  |                    |  2043 |  4177K|       |  1176   (1)| 00:00:01 |    
|*  3 |    SORT ORDER BY STOPKEY|                    |  2043 |  4215K|  5456K|  1176   (1)| 00:00:01 |    
|*  4 |     HASH JOIN           |                    |  2043 |  4215K|       |   272   (0)| 00:00:01 |    
|   5 |      TABLE ACCESS FULL  | BEER_REVIEWS       |  1840 |   141K|       |    68   (0)| 00:00:01 |    
|   6 |      TABLE ACCESS FULL  | BEER_REVIEW_CHUNKS |  2042 |  4056K|       |   204   (0)| 00:00:01 |    
------------------------------------------------------------------------------------------------------   

Adding a Vector Index

create vector index beer_review_chunks_v1 on beer_review_chunks(chunk_embedding)
organization inmemory neighbor graph
distance COSINE
with target accuracy 95;

That aborted with:

ORA-51961: The vector memory area is out of space

I suspect this is due to a limitation of the Free edition, where the maximum size for the SGA is 2 GB. The sizing of the Vector Pool is documented in chapter 5 of the User’s Guide; that pool is part of the SGA.

The parameter vector_memory_size has the value 0 in my database. I tried to change it to a value suggested in the manual, but another error was thrown:

ORA-51955: The Oracle Database Vector Memory size cannot be increased more than 0 MB for PDB ID 3.

My last attempt was to increase it from cdb$root; no luck:

ORA-51950: The Oracle Database Vector Memory size cannot be increased.

The parameter is not yet documented in the Database Reference. I’ll have to dig into this at some point; I don’t expect to have this kind of problem when 23ai lands on the nearest Exadata.

Conclusion

I haven’t tried these beers and can’t evaluate the results, but the tests so far look promising. It sure helps to understand more about the models and what is going on here. In the queries above I only did similarity searches, but we can combine them with relational predicates to make the results even more precise. I expect a larger and more realistic data set would show this better.

There is a lot to explore and learn. I’m not bored yet.


  1. select sysdate - date '2023-10-11'; ↩︎