Normalizing Data Sets

May 27, 2017

oracle, sql

The other day I was helping my son with his homework. His task was to design a simple data model, implement it in MySQL, and make a simple web page to access the data. I figured that after he had written a few INSERT statements on his own, I could help him with demo data. Since he had decided to create a movie database, I googled for movie data sets and decided to use the IMDB 5000. You can download it from Kaggle.com (free registration probably required).

These data sets are often delivered in one file, which means that the data are denormalized. For his simple data model I only needed a few tricks to get the data into the tables.

The first step is to import the data using SQL Developer. I have one piece of advice on this: keep it as simple as possible; you are going to copy the data further anyway, so there is no reason to make it perfect from the start. Just use VARCHAR2 with plenty of space unless you know that all rows have the exact same format. In other situations, where you will import the same kind of data repeatedly (like your bank statement), it may be worth the effort to set the correct data types from the start. If importing data into SQL Developer is difficult for you, leave me a note in the comment section and I will blog about it.
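
For reference, the import table might look something like the sketch below. The column names are the subset of the CSV columns actually used in this post, not the full file, and everything is a generous VARCHAR2 per the advice above, so treat this as an illustration rather than the exact DDL the import wizard generates:

create table movies_import (
  movie_title            varchar2(500),
  director_name          varchar2(500),
  actor_1_name           varchar2(500),
  actor_2_name           varchar2(500),
  actor_3_name           varchar2(500),
  genres                 varchar2(500),
  movie_imdb_link        varchar2(500),
  language               varchar2(500),
  country                varchar2(500),
  content_rating         varchar2(500),
  budget                 varchar2(500),
  gross                  varchar2(500),
  title_year             varchar2(500),
  imdb_score             varchar2(500),
  movie_facebook_likes   varchar2(500),
  num_user_for_reviews   varchar2(500),
  duration               varchar2(500),
  num_critic_for_reviews varchar2(500)
);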

In my case I named the imported table MOVIES_IMPORT. Here is a simple example to start with; create a table with all the directors:

create table directors as 
select rownum id, name
from (
  select distinct director_name name 
  from movies_import where director_name is not null
);

The point here is that in this data set there is only one director for each film, so a simple column in the MOVIES table with a foreign key to this table will work (we will add that foreign key after creating the MOVIES table below). Adding a couple of key constraints:

alter table directors add constraint director_pk primary key(id);
alter table directors add constraint director_uk unique(name);

The actors are a bit different. The data set contains three columns for the actors, with the headings ACTOR_1_NAME, ACTOR_2_NAME, and ACTOR_3_NAME. No problem; UNION (as opposed to UNION ALL) removes duplicates across the three columns, so this SQL will do (note the filter on NULL names, as for the directors):

create table actors as
select rn id, name
from (
  select rownum rn, name
  from (
    select actor_1_name name
    from movies_import
    union
    select actor_2_name name
    from movies_import
    union
    select actor_3_name name
    from movies_import
  )
  where name is not null
);

Adding a couple of constraints again:

alter table actors add constraint actor_pk primary key(id);
alter table actors add constraint actor_uk unique(name);

In the data set several genres are stored in one column, with the genre names separated by |. This means that there is a many-to-many relationship between the movies and the genres. First I created an empty table with constraints:

create table genres (
  id number,
  name varchar2(50),
  constraint genre_pk primary key(id),
  constraint genre_uk unique(name));

The following anonymous block uses a utility from APEX to split a column value into rows and store them in the table. If the genre is already there, the exception handler catches the ORA-00001 error and continues:

declare
  l_elements_arr apex_application_global.vc_arr2;
  l_sep varchar2(10) := '|';
  e_dupl exception;
  pragma exception_init(e_dupl, -1); -- ORA-00001
  j number := 0;
begin
  for g in (select genres from movies_import) loop
    l_elements_arr := apex_util.string_to_table(g.genres, l_sep);
    for i in 1..l_elements_arr.count loop
      begin
        j := j + 1;
        insert into genres values (j, l_elements_arr(i));
      exception when e_dupl then
        null; -- genre already inserted, skip it
      end;
    end loop;
  end loop;
end;
/
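
A quick sanity check that the block did what I wanted:

select id, name
from genres
order by name;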

A note about constraints here: some will suggest adding the constraints later, but I have so many bugs in my code that I prefer the database to catch them early on. The logic above would not have worked without the unique constraint in place from the start.
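
For example, with the unique constraint in place, trying to insert a genre that is already there fails immediately (assuming 'Comedy' was loaded by the block above):

insert into genres values (999, 'Comedy');
-- ORA-00001: unique constraint (GENRE_UK) violated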

I like to keep it simple, so I made a few simple helper functions to look up the primary keys in the DIRECTORS, GENRES, and ACTORS tables:

create or replace function director_id(p_name in varchar2) return number is
  l_id number;
begin
  select id into l_id
  from directors
  where name=p_name;
  return l_id;
exception when no_data_found then
  return null;
end;
/

create or replace function genre_id(p_name in varchar2) return number is
  l_id number;
begin
  select id into l_id
  from genres
  where name=p_name;
  return l_id;
exception when no_data_found then 
  return null;
end;
/
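
The block that populates the mapping tables below also calls an ACTOR_ID function; it follows the exact same pattern:

create or replace function actor_id(p_name in varchar2) return number is
  l_id number;
begin
  select id into l_id
  from actors
  where name=p_name;
  return l_id;
exception when no_data_found then
  return null;
end;
/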

Here is the SQL for the MOVIES table:

create table movies as
  select rownum id
    , movie_title
    , director_id(director_name) director_id
    , movie_imdb_link
    , language
    , country
    , content_rating
    , budget
    , gross
    , title_year
    , imdb_score
    , movie_facebook_likes
    , num_user_for_reviews
    , duration
    , num_critic_for_reviews
  from movies_import;
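
Since the data set has one director per movie, the foreign key mentioned earlier can be added right away (the constraint name here is my own choice):

alter table movies add constraint movies_director_fk
  foreign key(director_id) references directors(id);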

Almost done, just a couple of tables left. The first one maps the movies to the genres:

create table movie_genres (
  movie_id number,
  genre_id number,
  constraint movie_genres_pk primary key(movie_id,genre_id),
  constraint movie_genres_fk1 foreign key(movie_id) references movies(id),
  constraint movie_genres_fk2 foreign key(genre_id) references genres(id));

We could probably have used the genre name as the primary key instead of the surrogate key. Here is the mapping table for movies and actors:

create table movie_actors (
  movie_id number,
  actor_id number,
  constraint movie_actors_pk primary key(movie_id,actor_id),
  constraint movie_actors_fk1 foreign key(movie_id) references movies(id),
  constraint movie_actors_fk2 foreign key(actor_id) references actors(id));

At this point, while working on the next step, I discovered that there were some duplicates in the data set. Just to teach myself a lesson: I hadn't created a primary key on the MOVIES table, so Oracle happily inserted all the rows. If you check, there are two rows for The Avengers. They occur at lines 19 and 796 in the downloaded file and look very similar to me.

Well, sure, there are several movies in the world with the same name, but at this point I was getting tired of this data set and just wanted to produce some insert statements for my son, so I decided to ignore that possibility and remove the duplicates. Some day I may blog about how to find true duplicates in such a data set with SQL.

Here is some SQL to list movies with more than one row for each movie name:

select *
from (
  select id, movie_title,
    row_number() over (partition by movie_title order by id) rn,
    count(*) over (partition by movie_title) cnt
  from movies
  order by movie_title
)
where cnt > 1;

This SQL removes the duplicates:

delete from movies
where id in (select id
  from (
    select id,
      row_number() over (partition by movie_title order by id) rn,
      count(*) over (partition by movie_title) cnt
    from movies)
  where cnt > 1 and rn > 1);

With the duplicates gone, I could finally add the keys:

alter table movies add constraint movies_pk primary key(id);
alter table movies add constraint movies_uk unique (movie_title);

After some trial and error I ended up with this code to update the mapping tables:

declare
  l_elements_arr apex_application_global.vc_arr2;
  l_sep varchar2(10) := '|';
begin
  -- rn = 1 picks a single import row per movie, skipping the duplicates
  for m in (select *
    from (select mov.id movie_id, i.movie_title, i.genres,
      i.actor_1_name, i.actor_2_name, i.actor_3_name,
      row_number() over (partition by i.movie_title order by i.movie_title) rn
      from movies_import i, movies mov
      where i.movie_title = mov.movie_title
        and mov.director_id = director_id(i.director_name))
    where rn = 1) loop
    l_elements_arr := apex_util.string_to_table(m.genres, l_sep);
    for i in 1..l_elements_arr.count loop
      exit when l_elements_arr(i) is null;
      dbms_output.put_line(m.movie_id || ':' || l_elements_arr(i));
      insert into movie_genres
      values (m.movie_id, genre_id(l_elements_arr(i)));
    end loop;
    continue when m.actor_1_name is null;
    insert into movie_actors values (m.movie_id, actor_id(m.actor_1_name));
    continue when m.actor_2_name is null;
    insert into movie_actors values (m.movie_id, actor_id(m.actor_2_name));
    continue when m.actor_3_name is null;
    insert into movie_actors values (m.movie_id, actor_id(m.actor_3_name));
  end loop;
end;
/
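
Finally, a quick query to verify the mappings. The title is just an example, and the LIKE is there to be forgiving about stray whitespace in the imported titles:

select m.movie_title, g.name genre
from movies m, movie_genres mg, genres g
where mg.movie_id = m.id
  and g.id = mg.genre_id
  and m.movie_title like 'Avatar%';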

I have to be honest and say the process was not all that simple; duplicates and other errors meant I had to rewrite code and go back a few steps. But it was useful to see that adding constraints from the start stops you from inserting wrong data and helps you detect problems in the data set.

In other words, trying to normalize a data set used for mining lets you find issues and learn about the data set before you write a lot of code. This was an iterative process that I may try again. What if the data set is huge? Maybe a sample will work, but an Oracle database will handle data sets far larger than what is practical to download anyway.