
UUIDs in Oracle Database #JoelKallmanDay
UUIDs are useful, especially when you expose data in REST APIs, but there are cases where you may want to stick to the good old sequence-based primary key.
UUID, short for universally unique identifier, is also known as globally unique identifier (GUID). You can use the function sys_guid() in the Oracle database to generate one:
select sys_guid() guid
from dual
connect by level < 10;
The result, however, is not what we are used to seeing these days:
GUID
___________________________________
40536A604DFD9FF2E0637700000A85F6
40536A604DFE9FF2E0637700000A85F6
40536A604DFF9FF2E0637700000A85F6
40536A604E009FF2E0637700000A85F6
40536A604E019FF2E0637700000A85F6
40536A604E029FF2E0637700000A85F6
40536A604E039FF2E0637700000A85F6
40536A604E049FF2E0637700000A85F6
40536A604E059FF2E0637700000A85F6
It generates them in a monotonically increasing order, and it is difficult to see at a glance that they are unique.
In 23.9 the function uuid() was introduced:
select uuid() uidV4
from dual
connect by level < 10;
UIDV4
___________________________________
7211D5EB2A8D4F2ABF97350CA430E426
E2E0F762E7E84F7BBF4D4F5275109EBA
4428304718E54F4CBFB3ECA83EA35993
FE5190FE9B584F5BBF601BBF08B08C57
5303087A00764FCDBF927F3E5AE95986
CD4202C7BD664FDABF223336A5BC2F02
51A814ABB52D4F19BF0BD938EBF05D1F
2D2571341D704F01BFF9B080F933AACC
6459A47343694F69BFA155C620E91FD7
It looks more random, but not exactly how we are used to seeing them displayed. Oracle added another function to format them nicely:
select raw_to_uuid(uuid()) uidV4_formatted
from dual
connect by level < 10;
UIDV4_FORMATTED
_______________________________________
b2471554-f9e1-4f26-bf2a-1d7b04ced765
d69694dd-b695-4fbf-bf8e-c4e553485b98
63c579bd-e310-4f92-bf39-4d85e0fee379
001fc2fa-02ec-4fc1-bfe2-6786606d3ea1
28bee96b-c21c-4f4d-bf01-eee5724d69af
e6b2473c-9dbc-4f70-bf40-4ac033bc4af7
7f289ebd-5c21-4fbd-bf40-dfd59a0f2f8b
1375aa43-c862-4f24-bff7-8516e2567299
227c3b6c-4c5a-4f53-bf56-225483e04a9c
That’s better. There is also a reverse function:
select uuid_to_raw('28bee96b-c21c-4f4d-bf01-eee5724d69af') uid_raw;
UID_RAW
___________________________________
28BEE96BC21C4F4DBF01EEE5724D69AF
This function returns a RAW(16) if the input passes validation by is_uuid():
select '1375aa43-c862-4f24-bff7-8516e2567299' foo
, is_uuid('1375aa43-c862-4f24-bff7-8516e2567299') OK
union
select 'fubar', is_uuid('fubar');
FOO OK
_______________________________________ ________
1375aa43-c862-4f24-bff7-8516e2567299 true
fubar false
(If your client returns 1 and 0 instead of true/false, upgrade your client. SQLcl 24.2.0 returns 1; the latest version, 25.2, works fine.)
So we now have:
- uuid() returns a RAW(16).
- raw_to_uuid() formats a UUID stored in RAW(16) and returns a VARCHAR2.
- uuid_to_raw() converts a UUID from the familiar text format to RAW(16), provided that the input is valid.
- is_uuid() returns a boolean value after validating the input.
For is_uuid() and uuid_to_raw() the input may be enclosed in {} and contain hyphens; see the four different text formats in the documentation.
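For example, a quick probe of two of those variants (assuming both are among the four documented formats, both should return true):
select is_uuid('{1375aa43-c862-4f24-bff7-8516e2567299}') braces
     , is_uuid('1375aa43c8624f24bff78516e2567299') no_hyphens
from dual;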
I think the naming of these functions is slightly confusing; uuid_raw() instead of just uuid() would have been clearer. Try select is_uuid(uuid()); to get an illustration1.
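A variant that does run wraps the RAW value in raw_to_uuid() first, so the outer function gets the text it expects:
select is_uuid(raw_to_uuid(uuid())) ok;
-- returns true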
More details regarding UUIDs can be found in RFC 9562 and on Wikipedia.
UUIDs in your API
Say you have data about your large wine collection exposed through a REST API. Then you may have an endpoint that returns details for one specific wine: /wines/{wine_id}
If that wine_id is your sequence-based primary key, that is, an integer, people will soon figure out that they can download your entire dataset just by increasing that number for each fetch; after a simple loop they have all your data. With a UUID they will need extreme luck to get a hit (and your monitoring system may block a client causing massive amounts of 404s).
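Behind such an endpoint the lookup stays simple. A minimal sketch, assuming the table has a RAW(16) column named wine_guid and the path parameter arrives as the bind variable :wine_id:
select winename, country, regionname
from wines
where wine_guid = uuid_to_raw(:wine_id);
An invalid identifier can be rejected up front with is_uuid() before the query ever runs.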
You can easily add UUIDs to existing tables, or perhaps a mapping table is better. Then you can use UUIDs for publication and the normal keys for internal SQL. I have seen newer systems where the primary and foreign keys are all based on UUIDs, but for large data sets that may have a downside. Connor McDonald looked at this in his blog post Gooey GUIDs from 2018 (recently updated). His concern is mostly the CPU time it takes to generate the GUIDs when data is loaded in bulk (e.g. in data warehouses), but I wonder if storage space may become an issue as well.
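Retrofitting a publication key onto an existing table could look like this (a sketch; the column and constraint names are made up, and it assumes uuid() is accepted as a column default the way sys_guid() is):
alter table wines add (wine_guid raw(16) default uuid() not null);
alter table wines add constraint wines_guid_uk unique (wine_guid);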
In a case a few weeks ago I came across tables with many million rows where all the keys were based on UUID v4. The keys were stored as formatted UUID v4 strings, not in RAW(16) as output by the old sys_guid() or the new uuid(). For simple lookups that worked great, but for reporting and other bulk operations it was not optimal. My first thought was that these keys would add a lot of overhead on disk space, so I decided to do some testing.
Some tests on a 100 million rows dataset
I am using the X-Wines dataset2. You can download the full dataset from their Google Drive folder. I used the FREE edition, version 23.9, for these tests.
After load3 I have two tables, wines and ratings, where the latter references the former, but the constraints must be added after the load:
alter table wines add constraint wines_pk primary key(wineid);
alter table ratings add constraint ratings_wines_fk
foreign key(wineid) references wines;
alter table ratings add constraint ratings_pk primary key(ratingid);
One index is added to the ratings table to see what Oracle uses, and stats are collected in a lazy way:
create index ratings_wineid on ratings(wineid);
exec dbms_stats.gather_schema_stats(user);
This query finds the top 20 most rated wines (the having clause is not necessary for the result; it crept in during earlier iterations of the test, and it makes the CBO pick up the index):
select winename,count(*) ratings_cnt
from wines join ratings using (wineid)
group by winename
having count(*) > 100
order by 2 desc
fetch first 20 rows only;
Here is the explain plan for the query:
Plan hash value: 1429720065
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 31986 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 3063 | 427K| | 31986 (3)| 00:00:02 |
|* 3 | SORT ORDER BY STOPKEY | | 3063 | 122K| | 31986 (3)| 00:00:02 |
|* 4 | HASH GROUP BY | | 3063 | 122K| | 31986 (3)| 00:00:02 |
|* 5 | HASH JOIN | | 20M| 818M| 4720K| 31446 (1)| 00:00:02 |
| 6 | VIEW | VW_GBF_8 | 100K| 3538K| | 1365 (1)| 00:00:01 |
| 7 | TABLE ACCESS FULL | WINES | 100K| 2948K| | 1365 (1)| 00:00:01 |
| 8 | INDEX FAST FULL SCAN| RATINGS_WINEID | 21M| 100M| | 12867 (1)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
4 - filter(SUM("ITEM_2")>100)
5 - access("ITEM_1"="RATINGS"."WINEID")
In order to analyze the case where the UUIDs are stored as formatted strings, I decided to create two new tables (and a temporary one) with UUID v4 values stored as varchar2(36):
-- drop table wines_guid_tmp purge ;
create table wines_guid_tmp
as SELECT raw_to_uuid(uuid()) wine_guid,
wineid,
WINENAME,
TYPE,
ELABORATE,
GRAPES,
HARMONIZE,
ABV,
BODY,
ACIDITY,
CODE,
COUNTRY,
REGIONID,
REGIONNAME,
WINERYID,
WINERYNAME,
WEBSITE,
VINTAGES FROM WINES;
-- drop table ratings_guid purge;
create table ratings_guid
as SELECT raw_to_uuid(uuid()) rating_guid,
USERID,
w.wine_guid,
VINTAGE,
RATING,
DATE$
FROM RATINGS r join wines_guid_tmp w using (wineid);
-- drop table wines_guid purge ;
create table wines_guid
as SELECT wine_guid,
WINENAME,
TYPE,
ELABORATE,
GRAPES,
HARMONIZE,
ABV,
BODY,
ACIDITY,
CODE,
COUNTRY,
REGIONID,
REGIONNAME,
WINERYID,
WINERYNAME,
WEBSITE,
VINTAGES FROM WINES_GUID_TMP;
drop table wines_guid_tmp purge;
alter table wines_guid add constraint wines_guid_pk primary key (wine_guid);
alter table ratings_guid add constraint ratings_guid_pk primary key (rating_guid);
alter table ratings_guid add constraint ratings_guid_fk
foreign key (wine_guid) references wines_guid;
create index ratings_guid_wine_guid on ratings_guid(wine_guid);
exec dbms_stats.gather_schema_stats(user);
Repeating the query on the new tables:
select winename,count(*) ratings_cnt
from wines_guid join ratings_guid using (wine_guid)
group by winename
having count(*) > 100
order by 2 desc
fetch first 20 rows only;
The plan looks familiar:
Plan hash value: 2314587929
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 89878 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 3063 | 427K| | 89878 (1)| 00:00:04 |
|* 3 | SORT ORDER BY STOPKEY | | 3063 | 314K| | 89878 (1)| 00:00:04 |
|* 4 | HASH GROUP BY | | 3063 | 314K| | 89878 (1)| 00:00:04 |
|* 5 | HASH JOIN | | 21M| 2104M| 7864K| 89335 (1)| 00:00:04 |
| 6 | VIEW | VW_GBF_8 | 100K| 6683K| | 1510 (1)| 00:00:01 |
| 7 | TABLE ACCESS FULL | WINES_GUID | 100K| 6093K| | 1510 (1)| 00:00:01 |
| 8 | INDEX FAST FULL SCAN| RATINGS_GUID_WINE_GUID | 21M| 741M| | 38678 (1)| 00:00:02 |
------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
4 - filter(SUM("ITEM_2")>100)
5 - access("ITEM_1"="RATINGS_GUID"."WINE_GUID")
The response doesn’t look bad, 3.69 seconds compared to 3.02 seconds for the first query, but that is a 22% increase in response time (though perhaps good enough?). Anyway, have a look at line 8 in the execution plans: the amount of data read by the INDEX FAST FULL SCAN operation was many times higher for the last query, 741M vs 100M. You can get an idea why when you look at the size of the indexes on disk:
select segment_name,sum(bytes)/1e6 mb
from user_segments
where segment_name in ('RATINGS_WINEID','RATINGS_GUID_WINE_GUID')
group by segment_name;
SEGMENT_NAME MB
_________________________ _____________
RATINGS_WINEID 392.167424
RATINGS_GUID_WINE_GUID 1179.648
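That ratio is roughly what the per-key byte counts predict: a small NUMBER key occupies only a couple of bytes, a RAW(16) UUID occupies 16, and the formatted string occupies 36 (in a single-byte character set), before the rowid and block overhead every index entry carries. A quick check with vsize():
select vsize(42) number_key_bytes
     , vsize(uuid()) raw16_bytes
     , vsize(raw_to_uuid(uuid())) formatted_bytes
from dual;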
With the HAVING clause removed, the plans change to:
Plan hash value: 3266154295
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 56823 (100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 61248 | 8553K| | 56823 (2)| 00:00:03 |
|* 3 | SORT ORDER BY STOPKEY | | 61248 | 2871K| 5568K| 56823 (2)| 00:00:03 |
| 4 | HASH GROUP BY | | 61248 | 2871K| 5568K| 56823 (2)| 00:00:03 |
|* 5 | HASH JOIN | | 100K| 4717K| 2968K| 54129 (2)| 00:00:03 |
| 6 | VIEW | VW_GBF_8 | 101K| 1776K| | 52418 (2)| 00:00:03 |
| 7 | HASH GROUP BY | | 101K| 493K| 241M| 52418 (2)| 00:00:03 |
| 8 | TABLE ACCESS FULL| RATINGS | 21M| 100M| | 29368 (1)| 00:00:02 |
| 9 | TABLE ACCESS FULL | WINES | 100K| 2948K| | 1365 (1)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
5 - access("WINES"."WINEID"="ITEM_1")
and
Plan hash value: 2283342047
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 159K(100)| |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | VIEW | | 61248 | 8553K| | 159K (1)| 00:00:07 |
|* 3 | SORT ORDER BY STOPKEY | | 61248 | 6699K| 11M| 159K (1)| 00:00:07 |
| 4 | HASH GROUP BY | | 61248 | 6699K| 11M| 159K (1)| 00:00:07 |
|* 5 | HASH JOIN | | 100K| 10M| 6080K| 153K (1)| 00:00:07 |
| 6 | VIEW | VW_GBF_8 | 100K| 4896K| | 151K (1)| 00:00:06 |
| 7 | HASH GROUP BY | | 100K| 3623K| 887M| 151K (1)| 00:00:06 |
| 8 | TABLE ACCESS FULL| RATINGS_GUID | 21M| 741M| | 79377 (1)| 00:00:04 |
| 9 | TABLE ACCESS FULL | WINES_GUID | 100K| 6093K| | 1510 (1)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=20)
3 - filter(ROWNUM<=20)
5 - access("WINES_GUID"."WINE_GUID"="ITEM_1")
The response time was around 50% worse for the second plan. (The Time column in the plan is not reliable; the timings used here come from set timing on in SQLcl.)
If you need to use GUIDs as keys, storing them as RAW(16) will give you better numbers (repeat the test with uuid() only). In a new project I would ask if we can use normal system-generated IDs and store the UUID v4 as RAW(16) in an extra column. That adds to the total space allocated on disk, but it is certainly worth it if performance becomes an issue as the data volume increases.
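A sketch of what that could look like for the wines table, with an identity column for internal keys and a RAW(16) UUID for publication (names are illustrative, and the default again assumes uuid() from 23.9):
create table wines (
  wineid    number generated always as identity
            constraint wines_pk primary key,
  wine_guid raw(16) default uuid() not null
            constraint wines_guid_uk unique,
  winename  varchar2(300)
  -- remaining columns as before
);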
-
It won’t even run: the outer function expects text, but the inner function returns RAW/BINARY, and the database complains with ORA-22849: Type BINARY is not supported for this function or operator ↩︎
-
de Azambuja, R.X.; Morais, A.J.; Filipe, V. X-Wines: A Wine Dataset for Recommender Systems and Machine Learning. Big Data Cogn. Comput. 2023, 7, 20. https://doi.org/10.3390/bdcc7010020 ↩︎
-
A few tips for loading this dataset. Use SQLcl, but create the table without loading the data first:
load wines XWines_Full_100K_wines.csv create
Then modify the columns that are too narrow, since SQLcl doesn’t scan more than 5000 rows (enhancement request: allow us to scan much more than the upper limit of 5000 rows in set load scan_rows 5000):
alter table wines modify (wineid number(10), winename varchar2(300));
Then load the data with load wines XWines_Full_100K_wines.csv;. Likewise for the ratings:
load ratings XWines_Full_21M_ratings.csv create
alter table ratings modify ratingid number(10);
load ratings XWines_Full_21M_ratings.csv
↩︎