Mike Ault's Blog: Random Data

Wednesday, April 27, 2005

Random Data

Had an interesting bit of work come my way yesterday. For testing we needed a 600 row table with four fields. The first field had X possible entries, the second had Y and the third had Z with the last entry being a simulated count, one third of which had to be zero.

My thought was to use varray types and populate them with the various possible values, then use dbms_random.value to generate the index values for the various varrays. The count column was just a truncated call to dbms_random in the range of 1 to 600. All of this was of course placed into a procedure with the ability to give it the number of required values.

Essentially:

create or replace procedure load_random_data(cnt in number) as
define varray types;
define varrays using types;
declare loop interator;
begin
initialize varrays with allowed values;
start loop 1 to cnt times
set dbms_rabdom seed to loop interator;
insert using calls to dbms_random.value(1-n) as the indice for the various varrays;
end loop;
commit;
end;

This generated the rows and then using a primary key constraint with the exceptions clause I would then be able to delete the duplicates leaving my good rows.

create exceptions table using $ORACLE_HOME/rdbms/admin/utlexcpt.sql

alter table tab_z add constraint primary key (x,y,z) exceptions into exceptions;

delete tab_z where rowid in (select row_id from exceptions);

Well, the procedure worked as planned. However, when I attempted to use the exceptions clause on a concatenated key it did not work after the first run where it left 99 out of 600 rows it ended up rejecting all of the rows in the table on all subsequent runs! I tried it several times, each time it rejected every row in the table, even though I could clearly see by examining the first hundred or so rows that they weren't all duplicates.

The left me the old fallback of using the selection of the rowid values and then deleteing the min(rowid) values:

delete from tab_z Z
where z.rowid > (
select min(z2.rowid) from tab_z Z2
where Z.X = Z2.X and
Z.Y=Z2.Y and
Z.Z=Z2.Z);

This worked as expected. Of course you have to generate X(n1)*Z(n2)*Y(n3)*tab(n4) rows it seems to get the n4 number of rows where n1-3 are the count of the number of unique values for the given column and n4 is the number of rows in the table. For example, for my 600 rows I ended up needing to generate 30000 rows to get 600 non-dupes on the three index values. In the 30000 rows I ended up with 600 sets of duplicates in the range of 9-80 duplicates per entry. I re-seeded the random function using the loop interator with each pass through the loop.

Of course after all the work with dbms_random I figured out that it would have been easier just to cycle through the possible values for the varrays in a nested looping structure. Of course that would have led to a non-random ordered table. For the work we were doing it would have been ok to do it that way, but I can see other situations where the random technique would be beneficial.

13 comments:

David Aldridge said...: Ah, interesting problem. So the first three columns have to have a unique key on them then, and there had to be 600 rows. I infer that x*y*z must be greater than 600, would that be right?

I wonder if this code would work for most cases ...

create table mike(col1,col2,col3,col4)
as
select
mod(rownum,15),
mod(rownum,7),
mod(rownum,9),
decode(mod(rownum,3),0,0,rownum)
from
dual
connect by 1=1 and level <= 600

I've taken X=15, Y=7, Z=9 here.; 6:03 AM
David Aldridge said...: Ah, interesting problem. So the first three columns have to have a unique key on them then, and there had to be 600 rows. I infer that x*y*z must be greater than 600, would that be right?

I wonder if this code would work for most cases ...

create table mike(col1,col2,col3,col4)
as
select
mod(rownum,15),
mod(rownum,7),
mod(rownum,9),
decode(mod(rownum,3),0,0,rownum)
from
dual
connect by 1=1 and level <= 600

I've taken X=15, Y=7, Z=9 here.; 6:03 AM
Mike said...: Ah, but the client specified the range of values and the text strings to pop in. For example, col1 might have 13 values such as 'TEST1'...'TEST13', col2 could have 'MAR-95' and 'APR-95' and col3 would have 20 possible values of 'SCHL1' to 'SCHL20'. You then need to mix and match them randomly and uniquely.; 7:35 AM
David Aldridge said...: I would have thought in that case that expressions like ...

TEST'||TO_CHAR(mod(rownum,15)+1,'fm99')

and

TO_CHAR(ADD_MONTHS(TO_DATE('01-jan-1995'),mod(rownum,15)),'MON-YY')

would have worked .. or if the strings aren't deterministic in that way then put them into a table with another column to index from 0 .. 15, and join to it.

Feasible?; 8:03 AM
Bob B said...: How about this...

CREATE TABLE FIRST_COL
AS
SELECT OBJECT_ID
FROM ALL_OBJECTS
WHERE ROWNUM <= 300;

CREATE TABLE SECOND_COL
AS
SELECT OBJECT_ID
FROM ALL_OBJECTS
WHERE ROWNUM <= 2;

CREATE TABLE THIRD_COL
AS
SELECT OBJECT_ID
FROM ALL_OBJECTS
WHERE ROWNUM <= 400;

SELECT COL1, COL2, COL3, DECODE( MOD( ROWNUM, 3 ), 0, 0, MOD( ABS( DBMS_RANDOM.RANDOM ), 600 ) + 1 ) COL4
FROM (
SELECT fc.OBJECT_ID COL1, sc.OBJECT_ID COL2, tc.OBJECT_ID COL3
FROM FIRST_COL fc, SECOND_COL sc, THIRD_COL tc
ORDER BY DBMS_RANDOM.RANDOM
)
WHERE ROWNUM <= 600

Ran in 15 seconds on a single CPU development box running Oracle 10g. The only setup issue is putting the correct values in the FIRST_COL, SECOND_COL, THIRD_COL tables. This code should work just as quick in 8i and 9i (I used to do this on 8i box); 2:15 PM
Pete Scott said...: The 'problem' (not really a problem - it's what is supposed to do) with "exceptions into" is that it puts the duplicate(s) AND the original rows into the exceptions table; after all how would a database "know" which is the correct row to reject. This is very useful when cleaning data in an ETL process, but is not much good in your repeated random insert process. Eventually, and quite soon really, your code will pick values that are already in the table and cause those rows to be deleted.

As David and Bob B both say, a loop (perhaps ordered by DBMS_RANDOM.RANDOM) is the way to go; 5:01 AM
David Aldridge said...: I like Bob's method -- you can fill those three tables with completely random entries that couldn't be automatically generated, and the cartesian product will give you unique combinations.

You could even address the number of relative occurances for each entry in the final result set by having another column with an integer to exxpress this, then join each indivudual table to it's own list of integers. Did that make any sense?; 5:59 AM
Thomas Kyte said...: Well, the procedure worked as planned. However, when I attempted to use the exceptions clause on a concatenated key it did not work after the first run where it left 99 out of 600 rows it ended up rejecting all of the rows in the table on all subsequent runs! I tried it several times, each time it rejected every row in the table, even though I could clearly see by examining the first hundred or so rows that they weren't all duplicates.

Mike, if true that would be a bug. did you file one since you had a nice test case? I've been trying to reproduce from your psuedo code but I cannot. Could it have been that since you kept seeding with the same values, you kept generating the same data over and over ?

I'lll gladly file a bug if you can modify my test case to reproduce your error, but I've not heard of exceptions into working as you describe.

create table t (x int, y int, z int);

create or replace procedure load_random_data( p_cnt in number )
as
type array is table of number;

l_data1 array := array(1,2,3,4,5,6,7,8,9);
l_data2 array := array(40, 50, 60, 70, 80, 90 );
l_data3 array := array(1,2,3,4,5,6,7,8,9,10,11,12);
l_idx1 number;
l_idx2 number;
l_idx3 number;
begin
for i in 1 .. p_cnt
loop
dbms_random.seed(i);
l_idx1 := trunc(dbms_random.value(1,l_data1.count+1));
l_idx2 := trunc(dbms_random.value(1,l_data2.count+1));
l_idx3 := trunc(dbms_random.value(1,l_data3.count+1));
insert into t values ( l_data1(l_idx1), l_data2(l_idx2), l_data3(l_idx3) );
end loop;
commit;
end;
/

and my test procedure was:

create or replace procedure test_it
as
begin
execute immediate 'truncate table t';
load_random_data(600);
execute immediate 'truncate table exceptions';
begin
execute immediate 'alter table t add constraint
t_pk primary key(x,y,z)
exceptions into exceptions';
exception
when others then
dbms_output.put_line( sqlerrm );
end;

for x in ( select count(*) cnt from exceptions )
loop
dbms_output.put_line( x.cnt || ' dups' );
end loop;

for x in (
select x,y,z
from t
where rowid in ( select row_id from exceptions )
group by x,y,z
having count(*) = 1 )
loop
dbms_output.put_line( x.x || ', ' || x.y || ', ' || x.z );
end loop;
end;
/

I could not get a single false positive, ever. I took out the seed (never necessary, it self seeds) and ran it, cannot get it to false postive:

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
354 dups

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
370 dups

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
358 dups

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
373 dups

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
366 dups

PL/SQL procedure successfully completed.

ops$tkyte@ORA9IR2> exec test_it
ORA-02437: cannot validate (OPS$TKYTE.T_PK) - primary key violated
368 dups

PL/SQL procedure successfully completed.

(my approach would have looked like this -- no dups possible:

create or replace type array as table of number
/

create or replace procedure load_random_data( p_cnt in number )
as
l_data1 array := array(1,2,3,4,5,6,7,8,9);
l_data2 array := array(40, 50, 60, 70, 80, 90 );
l_data3 array := array(1,2,3,4,5,6,7,8,9,10,11,12);
begin
insert into t
select c1, c2 ,c3
from ( select *
from (select a.column_value c1,
b.column_value c2,
c.column_value c3
from (select * from table(cast(l_data1 as array))) a,
(select * from table(cast(l_data2 as array))) b,
(select * from table(cast(l_data2 as array))) c
)
order by dbms_random.random)
where rownum <= p_cnt;
end;
/

ops$tkyte@ORA9IR2> exec test_it
0 dups

PL/SQL procedure successfully completed.; 8:57 AM
Pete Scott said...: Mike, could this be what happens?
In your pseudo code you had:
delete tab_z where rowid in (select row_id from exceptions);
As I posted above, this will delete the first occurance of the key and all off its duplicates. If you then ran the row generate code again with the SAME seed value you would generate exactly the same set of 'random' numbers. However, this time all of the rows you create will duplicate rows already in the table from the first pass and hence will be removed when you run the delete after the exceptions into.; 12:05 PM
Mike said...: It is not that important how the data is generated. However, what is important is that the exceptions into clause didn't seem to catch the bad data, but instead marked all as bad data. Once the dups were eliminated with the delete command shown, the primary key created with no problems. If it was a problem with the generation of the data, when I reviewed it, I would have caught it.

I have sent the actual code in to Oracle for review, perhaps they can shine a light on where the problem is occuring.

Note that I don't seed with the same value, I seed with the loop counter each time through the loop, thus the seed is always changing. Between each run I truncated both the target and exceptions tables (after the first two where the anomolous behaviour was noted.

This is verison 10.1.0.4.

Mike; 3:15 PM
Thomas Kyte said...: What Pete above said is 100% correct.

Lets say you ran your code once and it generated:

1 2 3
4 5 6
4 5 6
.....

You would then add the constraint with exceptions into -- flagging the 4,5,6 rows.

You deleted both. Leaving 1,2,3.

Now you run the load data again. it would put the same rows into the table it did the first time. You would now have

1 2 3
1 2 3
4 5 6
4 5 6

when you put the constraint on, it would flag ALL rows and you would delete them all.

I'm afraid it looks like the code only did what you asked it to do.

EXCEPTIONS into flags *all* duplicated rows, not just the 2nd on.

And by seeding with a constant, repeated set of seeds every time you ran it, it would generate *the same data*. So the second run would cause everything to be a dup.

Sorry -- but this was not a problem with exceptions into as far as I can see.; 3:55 PM
Thomas Kyte said...: Note that I don't seed with the same value, I seed with the loop counter each time through the loop, thus the seed is always changing

code you sent was:

for i in 1..cnt loop
sys.dbms_random.seed(i);
insert into clinic_rollup values (
clinics(sys.dbms_random.value(1,15)),
icd9s(sys.dbms_random.value(1,20)),
months(sys.dbms_random.value(1,2)),
trunc(sys.dbms_random.value(1,600),0));
end loop;

first time through loop, seed=1

Next "first time" through loop, seed=1

90th "first time" through the loop, seed=1

everytime you run this procedure, no matter how many times you run it, it would generate the same data over and over and over again....

ops$tkyte@ORA10GR1> exec load_random_data(600);
PL/SQL procedure successfully completed.

ops$tkyte@ORA10GR1> rename clinic_rollup to x;
Table renamed.

ops$tkyte@ORA10GR1> create table clinic_rollup(a varchar2(10), b varchar2(6), c varchar2(9), d number );
Table created.

ops$tkyte@ORA10GR1> exec load_random_data(600);

PL/SQL procedure successfully completed.

ops$tkyte@ORA10GR1>
ops$tkyte@ORA10GR1> select * from x minus select * from clinic_rollup
2 union all
3 select * from clinic_rollup minus select * from x
4 /

no rows selected

Always generated exactly the same 600 rows every time, like clockwork.; 4:04 PM
Mike said...: Tom,

Yep exactly as you state and exactly as I expected. You failed to read where I truncated the table and exceptions table between each run after the first two. The idea was to generate 600 rows with unique keys. Not to test dbms_random, not to test the exceptions into clause and not to validate or disprove the documentation.; 8:22 AM

Mike Ault's Blog

Wednesday, April 27, 2005

Random Data

13 comments:

Blog Archive

Links

About Me