Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Saturday, April 30, 2005

More Random Data

I have been catching a load of grief over my random data post. Folks didn't like my procedure for loading the data, and folks didn't like that I said the EXCEPTIONS INTO seemed to be exhibiting anomalous behavior.

As to the first point, was I attempting to create a picture perfect, bullet proof general purpose data loader that was elegant, fully self documenting and blessed by all and sundry Oracle deities? No. Was I attempting to solve an immediate problem as quickly as possible? Yes. Did the procedure work for that? Yes.

As to the EXCEPTIONS clause, my understanding of an exception is that you get a ordinal value, a "zero" case, then, when duplicates (in the case of a primary key) are found, they are exceptions however, as has been shown to me, Oracle treats all occurrences, ordinal or not, as exceptions. So, Oracle is working as programmed. Do I agree with that? No. However, it has been that way for years. And darned of they will change it for me, imagine that!

In the SQL Manual they have a terse:

"exceptions_clause - Specify a table into which Oracle places the rowids of all rows violating the constraint. If you omit schema, then Oracle assumes the exceptions table is in your own schema. If you omit this clause altogether, then Oracle assumes that the table is named EXCEPTIONS. The exceptions table must be on your local database."

If you don't have an ordinal row, how can subsequent ones violate it?

In the Administrator's guide they provide a quick overview of the EXCEPTIONS clause with a small example. The small example (a single violation) does show two rows being mapped in the exceptions table. Taking some snippets from there:

“All rows that violate a constraint must be either updated or deleted from the table containing the constraint.” Which is incorrect. What if one of the rows is the proper row? Do I delete or update it? In the situation where there are multiple identical rows, except for a bogus artificial key, this statement would lead someone to believe all the rows needed to be deleted. The section only confuses the matter more if someone comes in with a misunderstanding about how the exception table is populated. A second line: “When managing exceptions, the goal is to eliminate all exceptions in your exception report table.”
Also seems to indicate that every line in the exceptions table is an exception, when there are clear cases where the line will be a valid row, not an exception. It all ties back to the definition of exception.

Here is the complete reference for those that doubt my veracity:

http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96521/general.htm#13282

If I ask, in the following series of numbers how many unique, single digit values are there?
1,1,1,2,2,2,3,3,3,4
The answer is 4 (1,2,3,4) and 6 duplications of the numbers, not 1 (4) and 7 exceptions as Oracle’s exception processing would answer.

If you go into it with my initial understanding of what an exception is, then the documentation seems to confirm that, if you go into it with Oracle's understanding of an exception it proves that. I have asked that a simple definition of what Oracle defines an exception to be to be added if possible to the manuals (SQL and Aministrators guide) however I am meeting with extreme resistance.

So, the behavior noted in my previous blog is not a bug, just the way Oracle works. However, you need to take this into account when dealing with Oracle's EXCEPTIONS INTO clause. I would hate for some one working on critical data to use the same definition of exception I did and delete some very important data. When using the EXCEPTIONS INTO clause, be sure to test on a test platform with non-critical data any UPDATE or DELETE activities planned and completely understand the outcome before applying the usage to production data.

Oh, and for those that want to de-dup tables, accoring to Tom Kyte, here is the fastest method known to man:

http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:1224636375004#29270521597750

25 comments:

Thomas Kyte said...

Mike, in the set of tuples:

(x,y)
-----
(1,2)
(1,3)
(1,4)

given you want a constraint on X (unique), which are the duplicates? Which two are your "duplicates" and why is the other one "more special".

Whatever you pick, I'll pick a different two, because I'm like that.

Now, you really want the software to pick which two? How does it pick, say it picks:

(1,2)
(1,4)

for you, and

(1,3)
(1,4)

for me, because the data is in a different order on disk. These are rows -- not a sequence of numbers here. You had N rows with the "same key". As they exist simultaneously -- they are ALL EXCEPTIONS. Until you remove two of them in this example. Then none of them are.

however I am meeting with extreme resistance.

if you are talking about emails you sent to me -- well Mike, I don't write the docs. I am not "Oracle" (you said you sent the scripts to Oracle, you sent them to me, a person who happens to work there). If you believe the documentation to be faulty, please by all means file a documentation bug via metalink support. I contacted you after reading your post because what you described and the manner you described it said "bug with exceptions into". I was relieved to find that it was not a bug with exceptions into, but rather a misunderstanding on your part about how it worked. I wanted your test case to file a bug if there was one.

That is what I myself do when I find something I believe to be incorrect or described in insufficient detail in the docs -- metalink, you can file doc bugs as well. The resistance you met from me was, I am of the opinion that the documentation is accurate, correct and more than detailed enough (but then again, I'm a SQL guy -- I work in sets all of the time, to me if it worked any other way it would be utterly useless and non-deterministic).

Sorry if pointing you to a method I believe to be a really fast way to de-dup data was insulting to you. I just said it was the fastest way to de-dup, based on my experience, didn't mean anything by it. I didn't say "fastest way known to man".

Mike said...

Again Tom,

Just as a point of reference the first occurrence should be marked in the exceptions table as the ordinal or first occurrence, then the others listed with it's rowid as well as their own. In a table with millions of rows this gives you a starting point. I am not saying one is better or more right, it is just the first occurrence. It would allow me to quicky find the first occurance and its duplicates in the source table to allow me to correct the problem, either through manual or automated data cleanup.

With 3 rows, it is a snap, with 3 million or more it is difficult especially if there are many exceptions, no matter whose definition you use. And in the special case that started this, all the rows are marked as exceptions which would be no help at all.

Tom, whether you like it or not, you are viewed as Oracle's mouth piece, you are the voice of Oracle for the DBA community.

I realize you don't write the docs, but I was surprized by your vehement objections that the docs were just fine as is. In polls of experienced DBAs (now with total years of experience close to 70) only 1 knew the exceptions into clause worked this way instead of the way I also thought it did. This is what concerns me, the docs don't seem to have made it clear what is going on.

David Aldridge said...

Mike,

I think that the functionality of the clause is not only the most appropriate, but it is also the only desirable functionality. My thoughts on the subject were lengthy enough for me to bump it over to my own blog.
http://oraclesponge.blogspot.com/#111488202711439434
... or ...
http://oraclesponge.blogspot.com and look for EXCEPTIONS INTO.

Comments here or there are welcome.

Ajay said...

The way the "exceptions into" clause works makes sense. You are validating all the data already in the table against the constraint. Thats very different from DUP_VAL_ON_INDEX (which is the functionality I think Mike expected to see).

Bob B said...

An ordinal row makes sense only in a serial environment. If each transaction must commit or rollback before the prior transaction, then we can probably determine which was the ordinal row and its sequential duplicates.

Throw 2 (or more) users in the mix and that certainty in the ordinal goes down the drain.

CREATE TABLE MY_TABLE (
x INT PRIMARY KEY,
y INT
);

T1: User 1 issues INSERT INTO MY_TABLE VALUES ( 1, 0 );
T2: User 2 issues INSERT INTO MY_TABLE VALUES ( 1, 100 );
T3: User 2 commits;
T4: User 1 commits;

Now which user has the ordinal row? User 1 thinks its User 1 because he inserted the row first. User 2 thinks its User 2 because he committed first.

Also, as Tom Kyte pointed out, there are times where the ordinal row is meaningless. In your case, you wanted to keep the first row and ditch the rest. What if you had a table of addresses? In that case, you'd probably want the most recent row because its the most likely to be correct.

I've read stories about how immigrants coming into America via Ellis Island were given bananas. Many, never having eaten a banana before, ate the peel and threw the actual banana away.

So, if you saw me eat a banana peel and then I told you that bananas taste horrible, what would you do? (Going for an analogy here, hope its sensible)

Peter K said...

Mike,
I think you are over-reacting. How Oracle handles exceptions would be how I would handle it. Given two or more rows with duplicate keys, there is no way that you can tell which is the correct one so you will have to mark all of them as exceptions. As you very well know there is no order to data being retrieved by Oracle unless you explicitly asked for it to be ordered so throwing away the rows after the first row could be different depending on when you execute the query.

Lighten up, folks pointing out that your basic assumption might be incorrect doesn't mean that they are deliberately trying to trip you up. It's always beneficial to get another viewpoint. It's like trying to debug a bug with a syntax error and when you are so close to the code you ended up not seeing the obvious and another pair of eyes definitely will help.

Mike said...

Every one seems to be saying that I proposing the ordinal row is the best row, that is not what I am saying. I am saying it would indicate which row to start your search on. Even if every row is placed into the exceptions table, just a mark in the sand (the ordinal row) would help in resolution of the situation.

There is also a prevalent view in the Oracle community that the exceptions into clause works the way I initially assumed, this can lead to much mayhem in the users databases.

Thomas Kyte said...

How would an ordinal row help at all??

As rows in table have no order.

so, you have

(1,2)
(1,3)
(1,4)

and are told "row with values (1,3) was the first to violate this constraint"

so what? Now what? that doesn't mean a thing. My rule is the row with the highest second value 'stays', your rule might be the row this lowest second value 'stays', their rule might be all three go off to data assurance to be sorted out.

You tried to use exceptions into for de-dupping. Exceptions into is a constraint check that works for check constraints, nulls, foreign keys, and just happens to do primary and unique constraints as well (eg: it is not a primary key thing, it is a constraint thing). It is not the right tool to de-dup data by some rule, DELETE is. You could use the exceptions into information to make the delete run faster -- but that would involve two passes - so just doing the delete would be the way to go.

There is also a prevalent view in the Oracle community

Not according
to my poll (two is the right answer, people that picked 2 got the right answer) nor according to the people I've communicated with about this. Given an example, over 2/3rds of the people are getting the right answer.

What leads to mass mahem in databases is someone using something they haven't tested and understood.

(70 years of combined experience - is that 4 really tenured people or 35 people)...

David Aldridge said...

>> There is also a prevalent view in the Oracle community that the exceptions into clause works the way I initially assumed, this can lead to much mayhem in the users databases. <<

It only leads to mayhem at the unfortunate intersection of two sets of people -- those who believe that they don't need to read the documentation to check their assumptions, and those who don't test functionality prior to implementation.

Bob B said...

Well said David

David Aldridge said...

Thanks Bob!

I thought it was rather a trenchant point -- and as soon as I thought that I had to post this followup in order to use the word "trenchant".

Aside from that though it's a truism that bad practices cause user-database mayhem.

Bill S. said...

Mike,
Read your blog posts and those of the others (incl. the Sponge, of course ;-)). My understanding of the way the EXCEPTIONS INTO worked was that:
a) If you are inserting into a table with the constraint enabled, it works as YOU describe it (i.e. - your "ordinal" row as it were is dominant, and the others are "excepted")

b) HOWEVER, if you disable the constraint, load up your data, and then try to de-dupe by enabling the constraint and validating with the EXCEPTIONS INTO, then ALL your duplicated rows are excepted because THERE IS NO WAY TO DETERMINE ORDINALITY.

Given your example, your assumption was not correct. Always remember, TRUST BUT VERIFY, even (especially!) the documentation.

Regards,

Bill S.

Mike said...

Well as I hae said, hindsite is always 20/20 once someone calls attention to this, the incident rate goes down because people think about it. That is all I really wanted was for folks to think about it and make sure they understood what I happening before they use the clause. Everyone who has responded is in the "been there, done that" category and therefore knows better t is the others I worry about.

Mike said...

Just as an aside, in my University session at IOUG I asked how many had used it, out of a room of about 30 people, DBA's all, maybe 5 raised their hands. When I asked how many had found it useful, none raised their hands.

Bob B said...

30 people in one class isn't a large enough sample (in my opinion) to give any veracity to your poll. While you affirm that everyone in the class is a DBA, I predict that none of them were heavily involved with ETL projects.

An analogy:
30 professionals attend a management class. I ask them how many have heard of a DBA, 5 raise their hands. Then I ask how many have used a DBA, none raise their hand. Does that mean DBAs are useless to professionals?

David Aldridge said...

Going back to the original problem of identifying and eliminating the duplicates from that big pile of data, what technique did you use in the end Mike?

Howard J. Rogers said...

I recall doing data loading work in the early 1990s. Data would report two flower beds in front of the park gates -but the Park Manager would be able to sigh, "Ah, yes. We re-developed that bed to make it smaller/bigger/different. Must have forgotten to remove the original row. Get rid of that one there". No way on Earth could I have made a call on which of the duplicates was right or wrong. There was a physical reality that only the data owner could assess the data by.

And that must be the general case. The database cannot -and MUST not- make calls on the significance of rows. All it can do is flag ALL the potential issues and leave you to liaise with the data owner to decide how to resolve matters. Sometimes you will delete rows. Sometimes you will update them. Sometimes you will realise that what you thought was a business rule -and hence a candidate for constraining- isn't.

No simple answers to this sort of problem.

Mike said...

As I have said, I agree, the database should not try to tell us which is the best or proper row, I just would like to see it tell us "Here is the first row, now here are all of the other rows that duplicate its key value" again just to mark a starting point. In a table of a couple of hundred or even a couple of thousand, manual comparison is viable, but make that a couple of million it it takes on an epic scale.

Howard, as you see I do allow comments on some blogs, on others I don't. Isn't free choice a wonderful thing? On the blogs where I didn't allow postings, as I said, I wanted them to stand alone. On ones where I have closed postings I closed them because I couldn't see were more arguing over the point would bring out more useful data to the readers.

David, in the original blog "Random Data" I show the method I used for data deletion.

Thomas Kyte said...

Here is the first row,

The point is, we would not be in a relational database if that happened. We would be using some other product.

There is no first, last, middle, 3rd, nth row. The concept is foreign to this entire situation. Without some order by, there is no such thing as row order. It is one of the most fundemental concepts of this thing called a relational database.


If you are going to automagically de-dup, you wouldn't use the exceptions table. DELETE is the proper command for that. EXCEPTIONS is for, well, expection reporting.


What is the "first" row to violate a CHECK constraint?

What is the "first" row to violate a foreign key constraint?

There is no such thing as a "first" row to violate a primary key constraint -- they all do when simultaneously present. The concept of "first" doesn't even begin to make sense in the context of a relational database.


And even if there were -- so? You have an exceptions table with:


ROWID TABLE CONS_NAME FIRST
fadafda T X null
fadad T X 1
afdafdad T X 1

now what? for your de-dup, a very unique implementation, it was safe to take the random 'first' row you hit and keep that one.

But in general -- in real life, in a data warehouse -- so what? What could you do with that information? remember, millions of rows. What can do you do with the fact that these *random* (they truly are, load data into ASSM and you'll get different 'first' rows than in manual, parallel direct path load and you'll get different orders than a conventional parallel load and so on and so on -- they are random first rows) first rows are tagged?

Nothing.

David Aldridge said...

>> David, in the original blog "Random Data" I show the method I used for data deletion. <<

Yup, my bad.

You didn't have an opportunity to compare this with the analytic function method, I suppose?

Howard J. Rogers said...

"Howard, as you see I do allow comments on some blogs, on others I don't. Isn't free choice a wonderful thing?"

But free speech is even more wonderful, Mike.

"On the blogs where I didn't allow postings, as I said, I wanted them to stand alone."

You (like all of us) have the right to choose to blog or not to blog. I believe that freedom to choose comes with a responsibility too: to take whatever rejoinders others may deem your thoughts deserve.

You are free to duck that responsibility, of course. But your credibility takes a nose-dive when you do.

Mike said...

I realize that rows are in random order in a relational database. However, when being seached for indexing they will be searched in some order, whether it is by rowid, or phyical block order. Therefore, the first occurance is relavent within the context of creating an index or enabling a primary key which is the context for which this issue is being discussed.

Therefore within the situation I described, 600 possible unique key combinations, with multiple rows with duplicate key combinations marking the first row found (it would have no entry for its ordinal row) and then reporting the other rows with this marker row as the ordinal row would make finding and correcting the duplicates much easier.

Is the method Oracle is using wrong? No, not within their definition of exception. Would giving an ordinal row, first occurance row, or somehow indicating "Start checking duplicates from here", whatever you want to call it, make duplicate resolution easier? Yes, in my opinion.

Just because this is how it's always been done is not a reason to continue to do so. Just because a rather narrow definition of exception was used, which was correct for theory but of less use in practice, doesn't mean we can't change things.

Again, in talking with over 50 (as of last count) practicing DBAs, few have ever used this feature, and even fewer found it useful when they attempted to use it. If it was as wonderful and easy to use as many of you seem to believe, I should be able to find one DBA out there that has used it, likes it and would recommend it to his fellow DBAs (other than you).

I agree, now that I have come to understnad Oracles definition of exception, I can find some small uses for this. However, I have other, easier means of finding duplicate rows and correcting them than adding a table, enabling a constraint, having it fail, review the contents of the second table in light of the first, then resolving the problems in the first, repeat as needed.

Mike said...

Howard,

Yes, freedom of speech is great. Newspapers use it all the time. However, they don't publish every letter to the editor that comes through. Comments on blogs are like letters to the editor. As editor I can pick and choose when I want to publish letters to the editor, which letters I publish and in general anything I please.

Your rights stop at the end of your nose. I allow posting on most of my blogs and am fairly light on the editing, in fact, other than delete a post or two, because the poster asked me to, I haven't deleted any.

Again, it is my choice, not yours. Sorry if that upsets you.

David Aldridge said...

>> However, I have other, easier means of finding duplicate rows and correcting them than adding a table, enabling a constraint, having it fail, review the contents of the second table in light of the first, then resolving the problems in the first, repeat as needed. <<

Sure -- single SQL statement will do this faster. But even your proposed amendment of omitting or identifying a random row likely wouldn't be as fast as (and wouldn't be as deterministic as) a single SQL statement either. You'd still have to go through all of those stages.

I'm not surprised that few DBA's you spoke to found it useful, but enhancing the functionality in the way you descibe is not IMHO likely to change that.

Howard J. Rogers said...

Oh Mike.

"Yes, freedom of speech is great. Newspapers use it all the time. However, they don't publish every letter to the editor that comes through."

And that will be because newsprint is a finite resource, and there is limited space in a newspaper. Unlike in Blog, where space is practically infinite. Which means that....

"Comments on blogs are like letters to the editor."

...is entirely wrong.

"As editor I can pick and choose when I want to publish letters to the editor, which letters I publish and in general anything I please.

Your rights stop at the end of your nose."

My rights stop where you permit them to stop, true enough. It was ever thus in dictatorships. But your responsibilities don't stop just because you don't like criticism, which is more to the point I was raising.

"I allow posting on most of my blogs "

How very gracious of you!

"and am fairly light on the editing, in fact, other than delete a post or two, because the poster asked me to, I haven't deleted any."

Well, that's good to know, given what happens in that other place with which you are associated.

"Again, it is my choice, not yours. Sorry if that upsets you."

It doesn't upset me Mike. It makes you look ridiculous, and trashes your name and reputation at the same time. If you're happy with that, fair enough. I'm not, because oddly enough I think you're better than that. Or at least ought to be.

But please... be honest about it. If you don't like the criticism and wish to keep it to a minimum, fair enough and say so. But stop this silly comparisons with newspapers, which deal with physical newsprint and transport costs, when the medium you have chosen to publish in has no such associated externalities.

Any censorship or "selectivity" that takes place here is *your choice*. Certainly, and you won't find me arguing about that! Whether or not it *should* be what you choose to do is what I'm interested in.