A Practical Guide to Loading Large CSV Files into Oracle

August 21, 2024

CSV, oracle, sqlcl

Sometimes you need to be practical and apply a few tricks to get the job done instead of waiting for the perfect tool. With SQL Developer Command Line (SQLcl) you can load a CSV file with millions of lines without too much hassle.

Loading large CSV files into an Oracle database

After a long summer break it is time to start blogging again. Today I loaded a CSV file containing all companies in Norway registered at The Brønnøysund Register Centre. It has over 1.1 million lines, more than you want to open in a GUI like SQL Developer. But SQL Developer has a command-line version that is almost perfect for the job. You’ll find the download link and more information at https://www.oracle.com/database/sqldeveloper/. This URL always gives you the latest version, and from a Mac or Linux box you can just do:

curl -O \
https://download.oracle.com/otn_software/java/sqldeveloper/sqlcl-latest.zip

Unzip the file and start it with:

sqlcl/bin/sql -L someuser@somehost/someservice

I forgot to mention: you need Java version 11 or higher.

Removing newlines SQLcl doesn’t like

The file has, of course, comma as the delimiter and double quotes (") as the enclosure. But in this file " is also used inside some of the fields. When that happens it is escaped by doubling it. One example is the line for the company with organization id 911760347, where one field contains """Knausen"" Vikan". SQLcl handles this well thanks to the default setting for double in loadformat (see below).
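
To make that concrete, the raw line looks something like this (the organization id and the quoted field are taken from the actual file; the tail of the line is cut for illustration):

911760347,"""Knausen"" Vikan",...

which parses to the single field value "Knausen" Vikan.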

A more annoying problem is that some descriptive fields contain the newline character (line breaks), which SQLcl does not like (and from some quick research I got the impression it is not the only tool with this problem). If you have an Oracle database installed you probably have Perl installed as well. I’ve done this on Linux and on my Mac; I’m sure you can do something similar in another language. This one-liner takes the file and replaces occurrences of newline (\n) with a single space when it is inside a column value. I decided to create a new file, just in case:

perl -pe 's/([^"])\n/$1 /gs ; s/([^,"]"")\n/$1 /igs' \
  enheter_2024-08-21T04-20-40.551841944.csv > prep_enheter.csv

If a line ends with something other than ", or it ends with "" without a comma or another " in front, the newline character \n will be replaced with a space.

Update 2024-08-26: Connor McDonald replied to my post on Twitter with a link to a question on AskTom. It gives you an idea of how to preprocess the file before loading, using a different end-of-line character/sequence.

Success after three attempts is not bad

With this out of the way you can start loading the new file, prep_enheter.csv in this case.

SQLcl can scan the file and create the table for you before it loads the data into it. But you can also have it just show you the DDL for the table without creating it or loading the data:

load enheter prep_enheter.csv show_ddl

Look at the SQL and you will see that illegal characters in the header have been replaced: . with #, etc. This is super useful since I don’t have to edit a 579MB file by hand (or mess around with vi, head, and sed).
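
For illustration, a shortened sketch of what the generated DDL can look like. The exact datatypes and sizes depend on the rows scanned, and the columns below are just a plausible subset of the header (note how organisasjonsform.kode turns into ORGANISASJONSFORM#KODE):

CREATE TABLE ENHETER (
  ORGANISASJONSNUMMER               NUMBER(9),     -- digits only in the scanned rows, so NUMBER
  NAVN                              VARCHAR2(200), -- width guessed from the rows scanned
  ORGANISASJONSFORM#KODE            VARCHAR2(26),  -- header organisasjonsform.kode; . became #
  REGISTRERINGSDATOENHETSREGISTERET VARCHAR2(26)
  -- ... and so on for the rest of the columns
);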

By default SQLcl scans 100 rows before deciding on the DDL. In this case I want to set it to the maximum, which is 5000:

set load scan_rows 5000

I wish I could set it much higher; some problems appeared after 22,000 lines.

By looking at the file we can see that the date format is yyyy-mm-dd, so we ask SQLcl to use the same:

set load date yyyy-mm-dd

The number of lines per batch is 50 in this version. I did not investigate the optimal number, but figured 1000 is a good one:

set load batch_rows 1000

By default it commits every 10 batches; I did not change that number. These settings can all be combined in one command, of course:

set load scan_rows 5000 date yyyy-mm-dd batch_rows 1000

You can check the current settings with:

show load
show loadformat

Paste the CREATE TABLE statement generated with the load enheter prep_enheter.csv show_ddl command into an editor and prepare yourself for a few rounds of changes. I’m calling the file cr_table.sql.

Then do the first attempt at a load with:

load enheter prep_enheter.csv new

This will create the table based on scanning 5000 rows and load data into it. It will probably fail after a little while with errors like ORA-12899: value too large for column. If SQLcl had scanned more rows it would probably have used wider columns. Anyway, it is easy to fix in the SQL: just find the column and crank up the column length. For the last column I replaced the VARCHAR2 type with CLOB¹.

A couple of other things: some date columns got the wrong datatype, so change them to DATE. You may also consider changing to VARCHAR2 for columns that contain only digits, and therefore get the NUMBER datatype, when they hold values you are never going to do calculations on, like ORGANISASJONSNUMMER; I changed it to VARCHAR2(11).
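
After the edits, the relevant parts of cr_table.sql might look something like this (again a sketch; names and sizes are illustrative):

CREATE TABLE ENHETER (
  ORGANISASJONSNUMMER               VARCHAR2(11),  -- was NUMBER; it is an identifier, not a quantity
  NAVN                              VARCHAR2(400), -- widened after ORA-12899
  REGISTRERINGSDATOENHETSREGISTERET DATE           -- was scanned as VARCHAR2
  -- ... remaining columns, with the last one changed to CLOB
);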

After these changes, drop the table and create it again:

drop table enheter;
@cr_table.sql

Then load the file without creating the table first:

load enheter prep_enheter.csv

That is, no option after the file name. After I had resolved the problems with the newline character inside the field values, it took only 2-3 attempts before I managed to load the whole file. It takes a few minutes, or many, depending on the network, your database hardware, etc. Having an ExaCC helps.

Since it commits regularly you can see the progress with a simple select count(*) from enheter; in another SQLcl session.
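
For example, in a second session, with a timestamp added so you can eyeball the rate between runs:

select to_char(sysdate, 'hh24:mi:ss') as at_time,
       count(*) as rows_loaded
from enheter;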

In case you want to redo the load without any changes to the table structure, use this option to have SQLcl truncate the table before load:

set load truncate on
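
With truncate on, a full reload is then a single command; SQLcl empties the table before inserting the rows:

load enheter prep_enheter.csv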

At the source for this dataset there is a REST API to search for updates after a specific date, so I don’t have to download and ingest the whole file regularly.

It took almost five minutes to load on a 23ai FREE version, which is not bad, I think.

select count(*) 
from enheter 
where KONKURSDATO > sysdate - 30;
   COUNT(*) 
___________ 
         228 

228 bankruptcies in the last thirty days. This is an interesting dataset. I can find the number of bankruptcies in the last month, see active businesses by county, what dominates in each region, etc. I will use this dataset further to explore new features in Oracle 23ai, such as similarity search, and see how vector indexes behave on larger datasets; I’ve heard a rumour about performance problems I need to check out.
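
As a taste of what the dataset can do, here is a count of businesses per municipality. This assumes the header field forretningsadresse.kommune ended up as the column FORRETNINGSADRESSE#KOMMUNE after the #-replacement; check the generated DDL for the actual name:

select forretningsadresse#kommune, count(*) as businesses
from enheter
group by forretningsadresse#kommune
order by count(*) desc
fetch first 10 rows only;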


  1. In an Autonomous Database in OCI, extended character size is enabled by default and the limit for VARCHAR2 is 32767 bytes. You can check the parameter MAX_STRING_SIZE: STANDARD means the limit is 4000, EXTENDED means 32767. See the article at oracle-base.com on how to enable it on your database.