
UUIDs in Oracle Database #JoelKallmanDay
UUIDs are useful, especially when you expose data in REST APIs, but there are cases where you may want to stick to the good old sequence-based primary key.
UUID, short for universally unique identifier, is also known as globally unique identifier (GUID). You can use the function sys_guid() in the Oracle database to generate one:
select sys_guid() guid
from dual
connect by level < 10;
The result, however, is not what we are used to seeing these days:
GUID
___________________________________
40536A604DFD9FF2E0637700000A85F6
40536A604DFE9FF2E0637700000A85F6
40536A604DFF9FF2E0637700000A85F6
40536A604E009FF2E0637700000A85F6
40536A604E019FF2E0637700000A85F6
40536A604E029FF2E0637700000A85F6
40536A604E039FF2E0637700000A85F6
40536A604E049FF2E0637700000A85F6
40536A604E059FF2E0637700000A85F6
It generates them in a monotonically increasing order, and it is difficult to see at a glance that they are unique.
In 23.9 the function uuid() was introduced:
select uuid() uidV4
from dual
connect by level < 10;
UIDV4
___________________________________
7211D5EB2A8D4F2ABF97350CA430E426
E2E0F762E7E84F7BBF4D4F5275109EBA
4428304718E54F4CBFB3ECA83EA35993
FE5190FE9B584F5BBF601BBF08B08C57
5303087A00764FCDBF927F3E5AE95986
CD4202C7BD664FDABF223336A5BC2F02
51A814ABB52D4F19BF0BD938EBF05D1F
2D2571341D704F01BFF9B080F933AACC
6459A47343694F69BFA155C620E91FD7
It looks more random, but not exactly how we are used to seeing them displayed. Oracle added another function to format them nicely:
select raw_to_uuid(uuid()) uidV4_formatted
from dual
connect by level < 10;
UIDV4_FORMATTED
_______________________________________
b2471554-f9e1-4f26-bf2a-1d7b04ced765
d69694dd-b695-4fbf-bf8e-c4e553485b98
63c579bd-e310-4f92-bf39-4d85e0fee379
001fc2fa-02ec-4fc1-bfe2-6786606d3ea1
28bee96b-c21c-4f4d-bf01-eee5724d69af
e6b2473c-9dbc-4f70-bf40-4ac033bc4af7
7f289ebd-5c21-4fbd-bf40-dfd59a0f2f8b
1375aa43-c862-4f24-bff7-8516e2567299
227c3b6c-4c5a-4f53-bf56-225483e04a9c
That’s better. There is also a reverse function:
select uuid_to_raw('28bee96b-c21c-4f4d-bf01-eee5724d69af') uid_raw;
UID_RAW
___________________________________
28BEE96BC21C4F4DBF01EEE5724D69AF
This function returns a RAW(16) if the input passes validation by is_uuid():
select '1375aa43-c862-4f24-bff7-8516e2567299' foo
, is_uuid('1375aa43-c862-4f24-bff7-8516e2567299') OK
union
select 'fubar', is_uuid('fubar');
FOO OK
_______________________________________ ________
1375aa43-c862-4f24-bff7-8516e2567299 true
fubar false
(If your client returns 1 and 0 instead of true/false, upgrade your client. SQLcl 24.2.0 returns 1; the latest version, 25.2, works fine.)
So we now have:
- uuid() returns a RAW(16).
- raw_to_uuid() formats a UUID stored in RAW(16) and returns a VARCHAR2.
- uuid_to_raw() converts a UUID from the familiar text format to RAW(16), provided that the input is valid.
- is_uuid() returns a boolean value after validating the input.
For is_uuid() and uuid_to_raw() the input may be enclosed in {} and contain hyphens; see the four different text formats in the documentation.
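For example, a quick probe of two of those variants (assuming both are among the four documented formats, both should return true):
select is_uuid('{1375aa43-c862-4f24-bff7-8516e2567299}') braces
     , is_uuid('1375aa43c8624f24bff78516e2567299') no_hyphens
from dual;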
I think the naming of these functions is slightly confusing; uuid_raw() instead of just uuid() would have been clearer. Try select is_uuid(uuid()); to get an illustration1.
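A variant that does run wraps the RAW value in raw_to_uuid() first, so the outer function gets the text it expects:
select is_uuid(raw_to_uuid(uuid())) ok;
-- returns true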
More details regarding UUIDs can be found in RFC 9562 and on Wikipedia.
UUIDs in your API
Say you have data about your large wine collection exposed through a REST API. Then you may have an endpoint that returns details for one specific wine: /wines/{wine_id}
If that wine_id is your sequence-based primary key, that is, an integer, people will soon figure out that they can download your entire dataset just by increasing that number for each fetch; after a simple loop they have all your data. With a UUID they will need extreme luck to get a hit (and your monitoring system may block a client causing massive amounts of 404s).
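Behind such an endpoint the lookup stays simple. A minimal sketch, assuming the table has a RAW(16) column named wine_guid and the path parameter arrives as the bind variable :wine_id:
select winename, country, regionname
from wines
where wine_guid = uuid_to_raw(:wine_id);
An invalid identifier can be rejected up front with is_uuid() before the query ever runs.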
You can easily add UUIDs to existing tables, or perhaps a mapping table is better. Then you can use UUIDs for publication and the normal keys for internal SQL. I have seen newer systems where the primary and foreign keys are all based on UUIDs, but for large data sets that may have a downside. Connor McDonald looked at this in his blog post Gooey GUIDs from 2018 (recently updated). His concern is mostly the CPU time it takes to generate the GUIDs when data is loaded in bulk (e.g. in data warehouses), but I wonder if storage space may become an issue as well.
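Retrofitting a publication key onto an existing table could look like this (a sketch; the column and constraint names are made up, and it assumes uuid() is accepted as a column default the way sys_guid() is):
alter table wines add (wine_guid raw(16) default uuid() not null);
alter table wines add constraint wines_guid_uk unique (wine_guid);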
In a case a few weeks ago I came across tables with many million rows where all the keys were based on UUID v4. The keys were stored as formatted UUID v4 strings, not in RAW(16) as output by the old sys_guid() or the new uuid(). For simple lookups that worked great, but for reporting and other bulk operations it was not optimal. My first thought was that these keys would add a lot of overhead on disk space, so I decided to do some testing.
Some tests on a 100 million rows dataset
I am using the X-Wines dataset2. You can download the full dataset from their Google Drive folder. I used the FREE edition, version 23.9, for these tests.
After load3 I have two tables, wines and ratings, where the latter references the former, but the constraints must be added after the load:
alter table wines add constraint wines_pk primary key(wineid);
alter table ratings add constraint ratings_wines_fk
foreign key(wineid) references wines;
alter table ratings add constraint ratings_pk primary key(ratingid);
One index is added to the ratings table to see what Oracle uses, and stats are collected in a lazy way:
create index ratings_wineid on ratings(wineid);
exec dbms_stats.gather_schema_stats(user);
This query finds the top 20 most rated wines (the having clause is not necessary for the result; it crept in during earlier iterations of the test, and it makes the CBO pick up the index):
select winename,count(*) ratings_cnt
from wines join ratings using (wineid)
group by winename
having count(*) > 100
order by 2 desc
fetch first 20 rows only;
Here is the explain plan for the query:
Plan hash value: 1429720065
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 31986 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 3063 | 427K| | 31986 (3)| 00:00:02 |
|* 3 | SORT ORDER BY STOPKEY | | 3063 | 122K| | 31986 (3)| 00:00:02 |
|* 4 | HASH GROUP BY | | 3063 | 122K| | 31986 (3)| 00:00:02 |
|* 5 | HASH JOIN | | 20M| 818M| 4720K| 31446 (1)| 00:00:02 |
| 6 | VIEW | VW_GBF_8 | 100K| 3538K| | 1365 (1)| 00:00:01 |
| 7 | TABLE ACCESS FULL | WINES | 100K| 2948K| | 1365 (1)| 00:00:01 |
| 8 | INDEX FAST FULL SCAN| RATINGS_WINEID | 21M| 100M| | 12867 (1)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
4 - filter(SUM("ITEM_2")>100)
5 - access("ITEM_1"="RATINGS"."WINEID")
In order to analyze the case where the UUIDs are stored as formatted strings, I decided to create two new tables (and a temporary one) with UUID v4 values stored as varchar2(36):
-- drop table wines_guid_tmp purge ;
create table wines_guid_tmp
as SELECT raw_to_uuid(uuid()) wine_guid,
wineid,
WINENAME,
TYPE,
ELABORATE,
GRAPES,
HARMONIZE,
ABV,
BODY,
ACIDITY,
CODE,
COUNTRY,
REGIONID,
REGIONNAME,
WINERYID,
WINERYNAME,
WEBSITE,
VINTAGES FROM WINES;
-- drop table ratings_guid purge;
create table ratings_guid
as SELECT raw_to_uuid(uuid()) rating_guid,
USERID,
w.wine_guid,
VINTAGE,
RATING,
DATE$
FROM RATINGS r join wines_guid_tmp w using (wineid);
-- drop table wines_guid purge ;
create table wines_guid
as SELECT wine_guid,
WINENAME,
TYPE,
ELABORATE,
GRAPES,
HARMONIZE,
ABV,
BODY,
ACIDITY,
CODE,
COUNTRY,
REGIONID,
REGIONNAME,
WINERYID,
WINERYNAME,
WEBSITE,
VINTAGES FROM WINES_GUID_TMP;
drop table wines_guid_tmp purge;
alter table wines_guid add constraint wines_guid_pk primary key (wine_guid);
alter table ratings_guid add constraint ratings_guid_pk primary key (rating_guid);
alter table ratings_guid add constraint ratings_guid_fk
foreign key (wine_guid) references wines_guid;
create index ratings_guid_wine_guid on ratings_guid(wine_guid);
exec dbms_stats.gather_schema_stats(user);
Repeating the query on the new tables:
select winename,count(*) ratings_cnt
from wines_guid join ratings_guid using (wine_guid)
group by winename
having count(*) > 100
order by 2 desc
fetch first 20 rows only;
The plan looks familiar:
Plan hash value: 2314587929
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 89878 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 3063 | 427K| | 89878 (1)| 00:00:04 |
|* 3 | SORT ORDER BY STOPKEY | | 3063 | 314K| | 89878 (1)| 00:00:04 |
|* 4 | HASH GROUP BY | | 3063 | 314K| | 89878 (1)| 00:00:04 |
|* 5 | HASH JOIN | | 21M| 2104M| 7864K| 89335 (1)| 00:00:04 |
| 6 | VIEW | VW_GBF_8 | 100K| 6683K| | 1510 (1)| 00:00:01 |
| 7 | TABLE ACCESS FULL | WINES_GUID | 100K| 6093K| | 1510 (1)| 00:00:01 |
| 8 | INDEX FAST FULL SCAN| RATINGS_GUID_WINE_GUID | 21M| 741M| | 38678 (1)| 00:00:02 |
------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
4 - filter(SUM("ITEM_2")>100)
5 - access("ITEM_1"="RATINGS_GUID"."WINE_GUID")
The response doesn’t look bad, 3.69 seconds compared to 3.02 seconds for the first query, but that is a 22% increase in response time (though perhaps good enough?). Anyway, have a look at line 8 in the execution plans: the amount of data read by the INDEX FAST FULL SCAN operation was many times higher for the last query, 741M vs 100M. You can get an idea why when you look at the size of the indexes on disk:
select segment_name,sum(bytes)/1e6 mb
from user_segments
where segment_name in ('RATINGS_WINEID','RATINGS_GUID_WINE_GUID')
group by segment_name;
SEGMENT_NAME MB
_________________________ _____________
RATINGS_WINEID 392.167424
RATINGS_GUID_WINE_GUID 1179.648
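That ratio is roughly what the per-key byte counts predict: a small NUMBER key occupies only a couple of bytes, a RAW(16) UUID occupies 16, and the formatted string occupies 36 (in a single-byte character set), before the rowid and block overhead every index entry carries. A quick check with vsize():
select vsize(42) number_key_bytes
     , vsize(uuid()) raw16_bytes
     , vsize(raw_to_uuid(uuid())) formatted_bytes
from dual;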
With the HAVING clause removed, the plans change to:
Plan hash value: 3266154295
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 56823 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 61248 | 8553K| | 56823 (2)| 00:00:03 |
|* 3 | SORT ORDER BY STOPKEY | | 61248 | 2871K| 5568K| 56823 (2)| 00:00:03 |
| 4 | HASH GROUP BY | | 61248 | 2871K| 5568K| 56823 (2)| 00:00:03 |
|* 5 | HASH JOIN | | 100K| 4717K| 2968K| 54129 (2)| 00:00:03 |
| 6 | VIEW | VW_GBF_8 | 101K| 1776K| | 52418 (2)| 00:00:03 |
| 7 | HASH GROUP BY | | 101K| 493K| 241M| 52418 (2)| 00:00:03 |
| 8 | TABLE ACCESS FULL| RATINGS | 21M| 100M| | 29368 (1)| 00:00:02 |
| 9 | TABLE ACCESS FULL | WINES | 100K| 2948K| | 1365 (1)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
5 - access("WINES"."WINEID"="ITEM_1")
and
Plan hash value: 2283342047
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 159K(100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 61248 | 8553K| | 159K (1)| 00:00:07 |
|* 3 | SORT ORDER BY STOPKEY | | 61248 | 6699K| 11M| 159K (1)| 00:00:07 |
| 4 | HASH GROUP BY | | 61248 | 6699K| 11M| 159K (1)| 00:00:07 |
|* 5 | HASH JOIN | | 100K| 10M| 6080K| 153K (1)| 00:00:07 |
| 6 | VIEW | VW_GBF_8 | 100K| 4896K| | 151K (1)| 00:00:06 |
| 7 | HASH GROUP BY | | 100K| 3623K| 887M| 151K (1)| 00:00:06 |
| 8 | TABLE ACCESS FULL| RATINGS_GUID | 21M| 741M| | 79377 (1)| 00:00:04 |
| 9 | TABLE ACCESS FULL | WINES_GUID | 100K| 6093K| | 1510 (1)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
5 - access("WINES_GUID"."WINE_GUID"="ITEM_1")
The response time was around 50% worse for the second plan. (The Time column in the plan is not reliable; the timings used here come from set timing on in SQLcl.)
If you need to use GUIDs as keys, storing them as RAW(16) will give you better numbers (repeat the test with uuid() only). In a new project I would ask if we can use normal system-generated IDs and store the UUID v4 as RAW(16) in an extra column. That adds to the total space allocated on disk, but it is certainly worth it if performance becomes an issue as the data volume increases.
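A sketch of what that could look like for the wines table, with an identity column for internal keys and a RAW(16) UUID for publication (names are illustrative, and the default again assumes uuid() from 23.9):
create table wines (
  wineid    number generated always as identity
            constraint wines_pk primary key,
  wine_guid raw(16) default uuid() not null
            constraint wines_guid_uk unique,
  winename  varchar2(300)
  -- remaining columns as before
);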
-
It won’t even run: the outer function expects text, but the inner function returns RAW/BINARY, and the database complains with ORA-22849: Type BINARY is not supported for this function or operator ↩︎
-
de Azambuja, R.X.; Morais, A.J.; Filipe, V. X-Wines: A Wine Dataset for Recommender Systems and Machine Learning. Big Data Cogn. Comput. 2023, 7, 20. https://doi.org/10.3390/bdcc7010020 ↩︎
-
A few tips for loading this dataset. Use SQLcl, but create the table without loading the data first:
load wines XWines_Full_100K_wines.csv create
Then modify the columns that are too narrow, since SQLcl doesn’t scan more than 5000 rows (enhancement request: allow us to scan much more than the upper limit of 5000 rows in set load scan_rows 5000):
alter table wines modify (wineid number(10), winename varchar2(300));
Then load the data with load wines XWines_Full_100K_wines.csv;. Likewise for the ratings:
load ratings XWines_Full_21M_ratings.csv create
alter table ratings modify ratingid number(10);
load ratings XWines_Full_21M_ratings.csv
↩︎