Mike Ault's Blog: Using a HASH Signature in Data Warehouses

On my current assignment I am dealing with slowly changing dimensions, I found an interesting technique I would like to share.

In data warehouses one item that can be difficult to deal with are slowly changing dimensions. Slowly changing dimensions are dimensions whose values may alter over time such as addresses, deployments, etc. Often it is a desire of the data warehouse manager to maintain a running history of these changes over time, this makes the dimension a type 2 dimension. The reason these are difficult to deal with is because determining what has changed becomes difficult if there are more than a few fields in the dimension. The data warehouse manager has the choice of a difficult, field by field verification of each field in the record or using some other form of field verification to determine if a record has changed in a dimension thus requiring a new record insertion. When the new record is inserted into a slowly changing dimension the old record is marked as invalid either by a flag or via a date field or set of date fields allowing for a bi-temporal dimension.

One method for finding out if a record has a changed field is to generate a CRC or a hash for the entire pertinent sections of the record, store this signature with the record and then on each record that is to be inserted, a signature is calculated for the important sections of the record. This new signature is compared to the stored signature and if the signatures match, the record is tossed, if the signature is changed, the new record is inserted, the old is marked as invalid and the new marked as the current record, if the new record matches none of the existing signatures, the record is inserted but no existing records are altered.

However, calculating a CRC or HASH can be difficult, unless you know something about what Oracle provides in the various packages contained in Oracle. For example the OUTLN_EDIT_PKG contains the GENERATE_SIGNATURE package. The GENERATE_SIGNATURE package creates a 32 byte hash signature for up to a 32k-1 sized string value passed to it either internally or via a procedure call. If the pass-in is from a procedure call from SQL then the length would be limited to 4000 bytes. Providing your critical columns are less than 4000 bytes concatenated length you can generate a signature that can be used to compare records, in this scenario, each date or number field should be converted using the appropriate function to a character equivalent.

To use the OUTLN_EDIT_PKG.GENERATE_SIGNATURE package the first thing to do is make sure your user that owns the schema for the data warehouse (DWH) has EXECUTE privilege on the OUTLN_EDIT_PKG:

From the owner of the OUTLN_EDIT_PKG, the user SYS, grant the proper privilege:

GRANT EXECUTE ON outln_edit_pkg TO ;

In order to use the GENERATE_SIGNATURE package most effectively, I suggest wrapping it into a deterministic function, for example:

From the DWH user you could create a function similar to:

CREATE OR REPLACE FUNCTION get_sign (text VARCHAR2)
RETURN RAW
DETERMINISTIC
AS
signature RAW(32);
text2 VARCHAR2(32767);
BEGIN
text2:=text;
IF text2 IS NULL THEN
text2:='1';
END IF;
sys.outln_edit_pkg.generate_signature(text2,signature);
RETURN signature;
END;
/

Note that this is just a bare-bones function, exception checking should be added.

Notice I use a default value of '1' if the input value is null, because the function generates a hash signature you cannot pass it a null value, so in the case of a null value we change it to a 1, you of course could use any string value you desire, just be sure to document it!

Once you have a deterministic function you can either use it as an INSERT into a table column, or, use it to generate a function based index. By using a function based index and query re-write you can eliminate the need to have a column in the table. This allows you to regenerate the matching columns without having to update the complete dimension each time there may be columns added to the dimension. Also, there are performance benefits to using a function based index or an internal column over doing either a column by column match. Let's look at some examples:

First, we have our test table TEST, in this case I created it from the DBA_OBJECTS view using a simple CTAS.

SQL> DESC test
Name Null? Type
----------------------------------------- -------- -------------------
OWNER VARCHAR2(30)
OBJECT_NAME VARCHAR2(128)
SUBOBJECT_NAME VARCHAR2(30)
OBJECT_ID NUMBER
DATA_OBJECT_ID NUMBER
OBJECT_TYPE VARCHAR2(18)
CREATED DATE
LAST_DDL_TIME DATE
TIMESTAMP VARCHAR2(19)
STATUS VARCHAR2(7)
TEMPORARY VARCHAR2(1)
GENERATED VARCHAR2(1)
SECONDARY VARCHAR2(1)

SQL> SELECT COUNT(*) FROM test;

COUNT(*)
----------
32359

Once we have our test table, we create our function to call the GENERATE_SIGNATURE and then use it to create a function based index:

SQL> CREATE INDEX test_fbi ON test(get_sign(ownerobject_namesubobject_name));

Index created.

Before we can use query rewrite we must analyze the table and index as well as set the proper initialization parameters, (query_rewrite_enabled = true and query_rewrite_integrity = trusted).

SQL> ANALYZE TABLE test COMPUTE STATISTICS;

Table analyzed.

Let's look at a query that will use this index and see what the performance is like.

SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = get_sign(a.ownera.object_namea.subobject_name);

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=324
Bytes=11016)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=324
Bytes=11016)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=129)

Statistics
----------------------------------------------------------
74 recursive calls
0 db block gets
14 consistent gets
1 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=34)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=34)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=1)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=34)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=34)
2 1 INDEX (RANGE SCAN) OF 'TEST_FBI' (NON-UNIQUE) (Cost=2
Card=1)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Now let’s drop the index and re-run our query to check performance without it:

SQL> DROP INDEX test_fbi;

Index dropped.

SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = get_sign(a.ownera.object_namea.subobject_name);

OBJECT_ID
----------
35567

Elapsed: 00:00:01.04

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=138 Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=138 Card=324
Bytes=11016)

Statistics
----------------------------------------------------------
243 recursive calls
0 db block gets
526 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:01.03

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=138 Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=138 Card=324
Bytes=11016)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = get_sign(a.ownera.object_namea.subobject_name);

OBJECT_ID
----------
35567

Elapsed: 00:00:01.03

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=132 Card=324
Bytes=11016)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=132 Card=324
Bytes=11016)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
446 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

As you can see the simple SELECT statement using the function based index performs in too small a time interval for Oracle's timing command to measure (the test was run several times, with identical results), while the SELECT not able to use the index takes a consistent time just over 1 second (the tests were run several times, and the times averaged).

As you can see even when the table is read from memory (physical reads = 0) the consistent gets (446) are much higher than those from the index based selects ( consistent gets 4) and the index based query performs better even though it has 74 recursive calls. The recursive calls were only present on the first run, after that the recursive calls were 0.

In situations where the call is repeated many times, such as for loading dimension tables, then the performance improvement should be dramatic.

In other tests, using an added column SIGNATURE, a RAW(32), and an index on the column we can get identical performance to a function based index and table combination:

First, using a modified SELECT to go against the SIGNATURE column and no index:

SQL> alter table test add signature raw(32);

Table altered.

SQL> update test a set a.signature=get_sign(a.ownera.object_namea.subobject_name);

32359 rows updated.

SQL> commit;

Commit complete.

SQL> set autotrace on
SQL> set timing on
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = a.signature;

OBJECT_ID
----------
35567

Elapsed: 00:00:00.08

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=116 Card=324
Bytes=7128)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=116 Card=324
Bytes=7128)

Statistics
----------------------------------------------------------
277 recursive calls
0 db block gets
534 consistent gets
3 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:00.07

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=116 Card=324
Bytes=7128)
1 0 TABLE ACCESS (FULL) OF 'TEST' (Cost=116 Card=324
Bytes=7128)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

So without an index the use of a special SIGNATURE column performs worse than the table and function based index combination. However, when we add an index, the performance is the same from the timing and the statistics.

SQL> create index test_ui on test(signature);

Index created.

Elapsed: 00:00:00.07
SQL> analyze table test compute statistics;

Table analyzed.

Elapsed: 00:00:02.05
SQL> SELECT a.object_id FROM test a
2 WHERE
3 get_sign('SYSTEM''TEST''') = a.signature;

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)

Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
8 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

OBJECT_ID
----------
35567

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=20)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST' (Cost=2 Card=1
Bytes=20)
2 1 INDEX (RANGE SCAN) OF 'TEST_UI' (NON-UNIQUE) (Cost=2
Card=1)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
382 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

However, if there is a chance that columns will be added or removed using a column inside the dimension will require the entire dimension to be updated a row at a time. In low row-count dimensions this may be acceptable, in large dimensions with thousands of rows it may be better to use the table and function based index solution.

5 comments:

Pete Scott said...: An interesting approach - never thought about doing it that way before.

One method I have used in the past for reference or dimension tables is to use the SQL minus function to detect changes between the incoming staging 'tables' and data stored in 'warehouse' tables - the resulting data set is then applied through what ever business rules you have. In the case of type two dimensions you would take stage minus warehouse where current flag is true (or something like that)
I have always found these 'set based' select to whizz along - but of course your circumstances may preclude this.; 5:13 AM
Mike said...: Ajay,

I was looking for that function, it may be better, I will give it a try and see.

Mike; 1:18 PM
Mike said...: Unfortunately ora_hash() is only in 10g, for this project the system may be deployed anywhere from 8i-10g depending on the county (I know, I have to be kidding, but unfortunately that is the reality). So using the outln_edit_pkg is the best option for this. However, for 10g and above the ora_hash looks promising!

Mike; 1:33 PM
Mike said...: Howard,

I hav always said that the proofs provided were excellent for what they proved and withing the limits of their system. Same with this one. As I do more with it, I will also provide any expansions as needed, such as if it falls over at a certain size.

Mike; 6:20 PM
Unknown said...: Perhaps an awnser to your comparison questions can be found in release 11g.

Please review the article on:

http://download.oracle.com/docs/
cd/B28359_01/server.111/b28322/
man_comp.htm; 7:49 AM

Mike Ault's Blog

Wednesday, May 25, 2005

Using a HASH Signature in Data Warehouses

5 comments:

Blog Archive

Links

About Me