Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Wednesday, October 26, 2005

Through the Looking Glass: 10gR2 RAC Installation

On my most recent assignment I had the dubious pleasure of installing Oracle10g Release 2 RAC on an OCFS2 file system on RedHat 4.0 on a 64-bit Xeon system. Of course the first issue is that, as of this writing, the OCFS2 file system is not certified for use on 64-bit RedHat 4.0 with 10gR2 RAC. Just because 10gR2 is the latest RAC and OCFS2 is the latest cluster file system from Oracle, don’t let that confuse you…far be it from me to suggest they should work together.

The client had the boxes up and running just as asked; of course we had to add a few packages, notably the config libraries that don’t get installed with the base “total” install of RedHat, the vim editor and a couple of other convenience packages. We were also using iSCSI, which had yet to be configured. However, within a day we had iSCSI running over Gigabit Ethernet to a LeftHand disk array, and OCFS2 up and running.

Next, of course, you install Cluster Ready Services (CRS). The client had the latest install disks sent directly from Oracle. I loaded the DVD and commenced installation. Apart from the new screens, such as multiple OCR configuration files (they now allow a mirror, bravo!) and multiple voting disk locations (up to three, another kudo), the screens are similar to the ones we know and love. However, once it reached the final screen, where it does the actual install, link, configuration and setup, the fun began.

On the link step the system complained:

### Error Messages: ###
INFO: Start output from spawned process:
INFO: ----------------------------------
INFO:
INFO: /var/oracle/product/10.2.0/crs/bin/genclntsh
INFO: /usr/bin/ld: skipping incompatible /var/oracle/product/10.2.0/crs/lib/libxml10.a when searching for -lxml10
/usr/bin/ld: cannot find -lxml10
INFO: collect2: ld returned 1 exit status
INFO: genclntsh: Failed to link libclntsh.so.10.1
INFO: make: *** [client_sharedlib] Error 1
INFO: End output from spawned process.
INFO: ----------------------------------

The libxml10 library (linked via -lxml10) deals with XML parsing in the client stack, so naturally the first thing out of Oracle support was: OCFS2 is NOT yet certified to use with RHEL4.0 10g R2 in RAC env.

Workaround is to use:
a) raw device or ASM
b) RHEL3 with OCFS1

So, please go to a supported config and if the problem still persists, please update this TAR.

Huh? What the heck does OCFS2 have to do with the XML parsing library not being found? We were not using a shared ORACLE_HOME and were not using OCFS2 for anything yet (that doesn’t happen until root.sh is run in the last step).

After escalating to a duty manager we were able to get a bit more help. After we uploaded several sets of logs and traces that all said basically the same thing (libxml10 was missing), it was suggested that downloading the OTN version might help since it was newer.

This we did. It linked. Now I realize this is a radical suggestion, but shouldn’t Oracle QC have taken one of the production run DVDs and done a full test install on the target platform before making it available?

Call me crazy I guess…but it seems to me something this obvious would have been caught by a one-eyed QC inspector with one arm tied behind his back, wearing an iPod blasting heavy metal into his ears, while driving down the 101 freeway watching the install on his web-enabled cell phone…

Now we got to root.sh execution, and of course it went without a hitch…not! Next we got:

/var/oracle/product/10.2.0/crs/bin/crsctl create scr oracle
/var/oracle/product/10.2.0/crs/bin/crsctl.bin:
error while loading shared libraries:
libstdc++.so.5: cannot open shared object file:
No such file or directory
/bin/echo Failure initializing entries in /etc/oracle/scls_scr/rhora1.

Long sigh….LD_LIBRARY_PATH includes /usr/lib, and the libstdc++.so.5 soft link to the libstdc++.so.5.0.7 library is there, as is the link's target library. I even tried placing a soft link named libstdc++.so.5 (of course) in the $ORACLE_CRS/lib directory, pointing to /usr/lib/libstdc++.so.5.0.7. Can’t wait to see the next response from support… Will keep you all posted!
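If you hit the same "cannot open shared object file" wall, ldd will list every dependency the loader cannot resolve for a binary such as the crsctl.bin above. A minimal sketch, assuming glibc's ldd output format (lines like "libfoo.so => not found"); the function name unresolved_libs is hypothetical, not anything Oracle ships:

```shell
#!/bin/bash
# Hypothetical helper: read ldd output on stdin and print only the
# shared libraries the dynamic loader could not resolve.
# Assumes glibc ldd formatting: "<tab>libname => not found".
unresolved_libs() {
  awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}
```

For example, `ldd /var/oracle/product/10.2.0/crs/bin/crsctl.bin | unresolved_libs` would be expected to print libstdc++.so.5 on a node in the state described above.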

Latest news: we found the DVD ordered and sent was the X86, not the X86_64, version; however, since the library in the first issue came from that DVD, it still seems to be an issue. The downloaded version, on the other hand, is definitely X86_64. Of course the latest twist is that, even though support asked us to download it and see if it would install, they now say that since we loaded a downloaded version it is not supported until we get new CDs. You just can't win...however, the support analyst says he is still pursuing the issue internally and will keep the TAR open. Someone is showing some sense!
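The "skipping incompatible" message in the first link error means ld found libxml10.a but rejected it as the wrong format for the link, which is consistent with 32-bit X86 media on an X86_64 box; the runtime loader likewise only accepts a shared library whose ELF class (32- or 64-bit) matches the binary. A quick sketch for checking which flavor a binary or shared library is, assuming it is an ELF file; elf_class is a hypothetical helper that reads the EI_CLASS byte of the ELF header:

```shell
#!/bin/bash
# Hypothetical helper: report whether an ELF file is 32-bit or 64-bit.
# Checks the ELF magic (bytes 0-3), then the EI_CLASS byte at offset 4:
# 1 = ELFCLASS32 (x86), 2 = ELFCLASS64 (x86_64).
elf_class() {
  local f=$1 magic cls
  magic=$(head -c 4 "$f" | od -An -tx1 | tr -d ' \n')
  if [ "$magic" != "7f454c46" ]; then
    echo "not-elf"
    return 1
  fi
  cls=$(od -An -tu1 -j4 -N1 "$f" | tr -d ' \n')
  case "$cls" in
    1) echo "32-bit" ;;
    2) echo "64-bit" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

For example, compare `elf_class /var/oracle/product/10.2.0/crs/bin/crsctl.bin` with `elf_class /usr/lib/libstdc++.so.5.0.7`. A static archive such as libxml10.a starts with an ar header rather than an ELF header, so extract a member first (`ar x libxml10.a`) and check that.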

While we were waiting I decided to check a reboot on both machines. During a reboot the system calls the halt command, which stops all processes; this causes the system to spin on the ocfs2 error o2hb_bio_end_io:222 Error: IO error -5 because the OCFS2 file systems are not yet unmounted. We tried placing the umount commands into an init.d script and running it at priorities as high as 2; however, it seems the init scripts are run after the halt command, so it does no good.

The question now becomes: How can we force the unmount of the OCFS2 filesystems before the halt call during a shutdown/reboot?

The latest response on our sev 2 TAR:

Thank you for the update. The owning support engineer, xxxx.AU, has gone off shift for the weekend in Australia and is not currently available; however, they will have the opportunity to review and progress the issue during their next scheduled shift. In the mean time, if you feel that this is a critical down production issue or an issue for which you require prompt assistance from an available support engineer, please advise us of this specifically by phoning your local support number (see http://www.oracle.com/support/contact.html for a listing) to advise call response of your need for attention; otherwise, no update is required on your part at this time and xxxx.AU will follow-up with you during their next scheduled shift.

OK, we accepted that OCFS2 is not supported. We switched to raw devices and got the same error on the CRS install. If there is nothing new from Oracle by the week after next, we will backtrack to 10gR1 and go with raw devices and ASM.

First lesson learned...Oracle demands the libstdc++.so.5 library; it will accept no substitutions! This is the only thing Oracle has helped with so far...but we had to find the issue and the proper RPM to install.

Second lesson learned...do not use symbolic links to the raw devices needed for CRS (two for the OCR configuration disk and its mirror, and three, yes three!, for the voting disks) or root.sh will fail. Once we pointed directly at the raw devices themselves, CRS installed. We solved this ourselves.
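Since root.sh choked on symbolic links, it is worth checking the paths before you type them into the installer. A minimal sketch; check_no_links is a hypothetical helper, and any /dev/raw/rawN names you pass it are placeholders for your own two OCR and three voting-disk devices:

```shell
#!/bin/bash
# Hypothetical pre-flight check: flag any OCR/voting-disk path that is a
# symbolic link (and show what it resolves to) or that does not exist.
# Returns 0 only if every path is present and is not a link.
check_no_links() {
  local p rc=0
  for p in "$@"; do
    if [ -L "$p" ]; then
      echo "LINK: $p -> $(readlink -f "$p")"
      rc=1
    elif [ ! -e "$p" ]; then
      echo "MISSING: $p"
      rc=1
    else
      echo "OK: $p"
    fi
  done
  return "$rc"
}
```

Something like `check_no_links /dev/raw/raw1 /dev/raw/raw2 /dev/raw/raw3 /dev/raw/raw4 /dev/raw/raw5` returns non-zero and prints the real device for any path that is a link, so you can enter that target in the installer instead.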

Third lesson learned...Oracle expects routable IP addresses for the VIPs. If it doesn't get them, the cluster configuration verification step fails (ignore it if you use unroutable IPs), and the silent vipca run at the end of root.sh will also fail; just run vipca manually from the command line. We found this ourselves after the first error.
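You can tell before running the installer whether your VIPs will trip this. A minimal sketch, assuming "unroutable" here means the RFC 1918 private ranges (10/8, 172.16/12, 192.168/16); is_rfc1918 is a hypothetical helper, not part of the Oracle tooling:

```shell
#!/bin/bash
# Hypothetical helper: exit 0 if a dotted-quad IPv4 address is in an
# RFC 1918 private range, 1 if it is not, 2 if the input is malformed.
is_rfc1918() {
  local IFS=. octet
  set -- $1                    # split the address on "."
  [ "$#" -eq 4 ] || return 2
  for octet in "$@"; do
    case "$octet" in
      ''|*[!0-9]*) return 2 ;; # empty or non-numeric octet
    esac
    [ "$octet" -le 255 ] || return 2
  done
  [ "$1" -eq 10 ] && return 0                                  # 10/8
  [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ] && return 0  # 172.16/12
  [ "$1" -eq 192 ] && [ "$2" -eq 168 ] && return 0             # 192.168/16
  return 1
}
```

If `is_rfc1918 "$VIP"` succeeds for your VIP addresses (VIP being whatever variable holds them), expect the verification complaint and the silent vipca failure, and plan to run vipca (e.g. /var/oracle/product/10.2.0/crs/bin/vipca) by hand as root after root.sh.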

Fourth Lesson Learned: Don't let the SA configure the entire system disk as one large partition; leave room to add swap if needed.
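Before the SA carves up the disk, it helps to know how much RAM and swap you actually have so you can compare them against the swap requirement in Oracle's installation guide for your release. A minimal sketch that reads /proc/meminfo; mem_and_swap_mb is a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: print physical memory and configured swap in MB,
# read from /proc/meminfo (or any file in the same "Key: value kB" format).
mem_and_swap_mb() {
  local f=${1:-/proc/meminfo}
  awk '/^MemTotal:/ { m = $2 } /^SwapTotal:/ { s = $2 }
       END { printf "RAM_MB=%d SWAP_MB=%d\n", m / 1024, s / 1024 }' "$f"
}
```

Run it with no argument on each node to read the live /proc/meminfo.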

Fifth Lesson: If ssh user equivalence won't work on the second (third, fourth, etc.) node even though you've done everything right, do this on the offending node:

Log in as the oracle user:
$ cd $HOME
$ chmod 755 .
$ chmod 700 .ssh
$ cd $HOME/.ssh
$ chmod 600 authorized_keys
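The commands above leave the home directory at 755, ~/.ssh at 700 and authorized_keys at 600; with sshd's default StrictModes setting, looser modes on these can make key-based logins fail. A small sketch to verify those modes on each node; check_ssh_perms is a hypothetical helper and assumes GNU stat:

```shell
#!/bin/bash
# Hypothetical check: compare the permission bits of $HOME, ~/.ssh and
# ~/.ssh/authorized_keys with the modes set by the chmod commands above.
# Uses GNU stat; "-c %a" prints the octal permission bits.
check_ssh_perms() {
  local home=${1:-$HOME} rc=0 spec path want have
  for spec in ".:755" ".ssh:700" ".ssh/authorized_keys:600"; do
    path=$home/${spec%%:*}
    want=${spec##*:}
    have=$(stat -c %a "$path" 2>/dev/null) || have=missing
    if [ "$have" != "$want" ]; then
      echo "FIX: $path is $have, expected $want"
      rc=1
    fi
  done
  return "$rc"
}
```

Run `check_ssh_perms` as the oracle user; any "FIX:" line tells you which chmod to repeat.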

Also, to get rid of the annoying last login message, which will cause an error, add these lines to the sshd_config file on the RAC nodes:

PrintMotd no
PrintLastLog no
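With several nodes to touch, a small idempotent helper keeps the edit from piling up duplicate lines. A minimal sketch; ensure_sshd_option is hypothetical and assumes GNU (or compatible) sed with -i:

```shell
#!/bin/bash
# Hypothetical helper: set "Key value" in an sshd_config-style file,
# replacing an existing (possibly commented-out) line for that key,
# or appending one if none exists. Safe to run more than once.
ensure_sshd_option() {
  local file=$1 key=$2 value=$3
  # BRE: optional leading spaces, optional "#", then the keyword.
  local pattern="^[[:space:]]*#\{0,1\}[[:space:]]*${key}[[:space:]]"
  if grep -q "$pattern" "$file"; then
    sed -i "s/${pattern}.*/${key} ${value}/" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}
```

Run as root on each node, e.g. `ensure_sshd_option /etc/ssh/sshd_config PrintMotd no` and `ensure_sshd_option /etc/ssh/sshd_config PrintLastLog no`, then check the file with `/usr/sbin/sshd -t` and restart with `service sshd restart`.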

Now on to the actual database and ASM install. I can't wait...

Well...ASM wouldn't link properly... so:

Article-ID: Note 339367.1
Title: Installing 10.2.0.1 Db On A Redhat Linux X86-64 Os Version 4.0, Errors
SOLUTION/ACTION PLAN
=====================
To implement the solution, please execute the following steps:
Download this file from http://oss.oracle.com/projects/compat-oracle/files/RedHat/ (listed under Red Hat):

binutils-2.15.92.0.2-13.0.0.0.2.x86_64.rpm 2005.10.05

RHEL 4 Update 1 patched binutils necessary for 10gR2 install on x86_64

Then try the relink all again

Ok...now everything worked but lsnrctl threw segmentation faults, so...

Article-ID: Note 316746.1 (which, by the way, can't be found on MetaLink)
Title: Segmentation Fault When Execute Sqlplus, Oracle, Lsnrctl After New/Patchset Install
ACTION PLAN
============
Please do the following:

1. cd /usr/bin (as root)
2. mv gcc gcc.script
3. mv g++ g++.script
4. ln -s gcc32 gcc
5. ln -s g++32 g++
6. login as oracle software owner (make sure environment is correct)
7. cd $ORACLE_HOME/bin
8. $ script /tmp/env_relink.out
9. $ env
10. $ ls -l /usr/bin
11. $ relink all
12. $ exit
13. Send env_relink.out to Oracle support
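Steps 1 through 5 are easy to fumble on a second node, so here is a sketch of them as a function that can be re-run safely. It is an illustration, not part of the Oracle note; use_gcc32 and its directory argument are hypothetical, and it assumes gcc32 and g++32 are already installed alongside gcc and g++:

```shell
#!/bin/bash
# Sketch of steps 1-5: rename gcc/g++ to gcc.script/g++.script and point
# gcc/g++ at gcc32/g++32. The directory argument (default /usr/bin) is a
# hypothetical parameter; run as root.
use_gcc32() {
  local bin_dir=${1:-/usr/bin} tool
  for tool in gcc g++; do
    # Already switched on an earlier run: nothing to do for this tool.
    if [ -L "$bin_dir/$tool" ] && [ "$(readlink "$bin_dir/$tool")" = "${tool}32" ]; then
      continue
    fi
    if [ ! -e "$bin_dir/${tool}32" ]; then
      echo "missing $bin_dir/${tool}32" >&2
      return 1
    fi
    if [ -e "$bin_dir/$tool.script" ]; then
      echo "refusing to overwrite $bin_dir/$tool.script" >&2
      return 1
    fi
    mv "$bin_dir/$tool" "$bin_dir/$tool.script" || return 1  # steps 2-3
    ln -s "${tool}32" "$bin_dir/$tool" || return 1           # steps 4-5
  done
  return 0
}
```

Run `use_gcc32` as root on each node, then continue from step 6 as the oracle software owner.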

Finally! A system with CRS, Oracle Net and ASM running, hopefully ready for the databases to be created...11/17/2005.

Isn't attention to prompt and courteous customer support a wonderful thing?


Mike

Sunday, October 16, 2005

Bureaucracy at its Finest

A young, recently married, couple I know did something few would have the courage or conviction to do. They gave up their jobs, sold or stored their possessions and joined the Peace Corps. I wish I could say that they finished their training, have been assigned to an interesting post and are enthusiastically pursuing their new endeavor together. Unfortunately that is not the case.

Upon arrival at their training assignment, as I understand it, they were informed that they had no place to stay. Then they were told that it would be at least another three months before a training slot became available, and even then they probably couldn’t be trained or assigned together. They were then sent back home. Needless to say, they are both very upset and have completely reversed their decision to join the Peace Corps. The Peace Corps is losing two fine candidates, and the young couple a great life-enhancing opportunity, all through bureaucratic incompetence.

Luckily both of the couple are well qualified (one is a teacher, the other has a good degree), so finding new jobs or getting back the jobs they left should not be a problem. However, the sheer magnitude of the bureaucratic incompetence shown in this episode is staggering. To send people thousands of miles away from home, spending thousands of tax dollars to do so, only to have no housing, no training available and finally to tell them that even if they tough it out they wouldn’t be assigned together just boggles the mind. It has to make you wonder how well they are tracking their personnel in the field and what level of support they are providing to them.

In this day of terrorist activity, kidnappings and hatred of all things American, that these young people would be willing to serve others at great personal risk is highly commendable. That they cannot fulfill this desire to be of service to others due to the stupidity and incompetence of others is deplorable.