Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Wednesday, October 26, 2005

Through the Looking Glass: 10gR2 RAC Installation

On my most recent assignment I had the dubious pleasure of installing Oracle10g Release 2 for a RAC installation on an OCFS2 file system on RedHat 4.0 on a 64-bit Xeon system. Of course the first issue is that, as of this writing, the OCFS2 file system is not certified for use on 64-bit RedHat 4.0 with 10gR2 RAC. Just because 10gR2 is the latest RAC and OCFS2 is the latest cluster file system from Oracle, don’t let that confuse you…far be it from me to state they should work together.

The client had the boxes up and running just as asked. Of course we had to add a few packages, notably the config libraries that don’t get installed with the base “total” install of RedHat, the vim editor and a couple of other convenience packages. We were also using iSCSI, which had yet to be configured. However, within a day we had iSCSI running over Gig Enet to a Left-Hand disk array and OCFS2 up and running.

Next of course you install Cluster Ready Services. The client had the latest install disks sent directly from Oracle. I loaded the DVD and commenced installation. Other than the new screens, such as multiple OCR configuration files (they now allow a mirror, bravo!) and multiple voting disk locations (up to three, another kudo), the screens are similar to the ones we know and love. However, once it got to the final screen, where it does the actual install, link, configuration and setup, the fun began.

On the link step the system complained:

### Error Messages: ###
INFO: Start output from spawned process:
INFO: ----------------------------------
INFO:
INFO: /var/oracle/product/10.2.0/crs/bin/genclntsh
INFO: /usr/bin/ld: skipping incompatible /var/oracle/product/10.2.0/crs/lib/libxml10.a when searching for -lxml10
INFO: /usr/bin/ld: cannot find -lxml10
INFO: collect2: ld returned 1 exit status
INFO: genclntsh: Failed to link libclntsh.so.10.1
INFO: make: *** [client_sharedlib] Error 1
INFO: End output from spawned process.
INFO: ----------------------------------

The lxml10 library deals with XML parsing in the client stack, so naturally the first thing out of Oracle support was: OCFS2 is NOT yet certified to use with RHEL4.0 10g R2 in RAC env.

Workaround is to use:
a) raw device or ASM
b) RHEL3 with OCFS1

So, please go to a supported config and if the problem still persists, please update this TAR.

Huh? What the heck does OCFS2 have to do with the XML parsing library not being found? We were not using a shared ORACLE_HOME and were not using OCFS2 for anything yet (that doesn’t happen until root.sh is run on the last step).

After getting to a duty manager we were able to get a bit more help. After we uploaded several sets of logs and traces that all said basically the same thing (lxml10 was missing), it was suggested that perhaps downloading the OTN version might help since it was newer.

This we did. It linked. Now I realize this is a radical suggestion, but shouldn’t Oracle QC have taken one of the production run DVDs and done a full test install on the target platform before making it available?

Call me crazy I guess…but it seems to me something this obvious would have been caught by a one-eyed QC inspector with one arm tied behind his back wearing an ipod blasting heavy metal into his ears while driving down the 101 freeway watching the install on his web enabled cell phone…

Now we got to root.sh execution, and of course it went without a hitch…not! Next we got:

/var/oracle/product/10.2.0/crs/bin/crsctl create scr oracle
/var/oracle/product/10.2.0/crs/bin/crsctl.bin: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory
/bin/echo Failure initializing entries in /etc/oracle/scls_scr/rhora1.

Long sigh… LD_LIBRARY_PATH shows /usr/lib as being a part of it, and the libstdc++.so.5 soft link to the libstdc++.so.5.0.7 library is there, as is the link’s target library. I even tried placing a soft link to /usr/lib/libstdc++.so.5.0.7 in the $ORACLE_CRS/lib directory (calling it libstdc++.so.5 of course). Can’t wait to see the next response from support… Will keep you all posted!

Latest news: We found the DVD ordered/sent was the X86 version, not the X86_64 version. However, since the library in the first issue is from the DVD, it seems like it is still an issue. The downloaded version is definitely X86_64. Of course the latest twist is that, even though support asked us to download it and see if it would install, they now say that since we loaded a downloaded version it is not supported until we get new CDs. You just can't win...however, the support analyst says he is still pursuing the issue internally and will keep the TAR open. Someone is showing some sense!
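
The "skipping incompatible" message from ld is what the linker prints when it finds a library built for a different architecture than the one being linked, which fits an X86 DVD on an X86_64 box. A minimal sketch for checking whether an ELF binary or shared library is 32-bit or 64-bit follows; the function name elf_class is made up for illustration, and it only reads the standard ELF header (byte 5, EI_CLASS).

```shell
#!/bin/sh
# elf_class FILE: print "32-bit", "64-bit", "unknown" or "not ELF".
# Hypothetical helper; it relies only on the ELF header layout:
# bytes 1-4 are 0x7f 'E' 'L' 'F', byte 5 (EI_CLASS) is 1 for 32-bit, 2 for 64-bit.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    class=$(od -An -tx1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        01) echo "32-bit" ;;
        02) echo "64-bit" ;;
        *)  echo "unknown"; return 1 ;;
    esac
}

# Usage (path taken from the install log above):
#   elf_class /var/oracle/product/10.2.0/crs/bin/crsctl.bin
```

A static archive such as libxml10.a is an ar archive rather than an ELF file, so this reports "not ELF" for it; running objdump -f on the archive, or extracting one member with ar x first, should show the architecture of its objects.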

While we were waiting I decided to check the reboot on both machines. During a reboot the system issues a call to the halt command, which stops all processes. This causes the system to spin on the ocfs2 error o2hb_bio_end_io:222 Error: IO error -5 because the ocfs2 filesystems are not unmounted yet. We have attempted to place the umount commands into an init.d script and run it at priorities as high as 2; however, it seems the init scripts are run after the halt command, so it does no good.

The question now becomes: How can we force the unmount of the OCFS2 filesystems before the halt call during a shutdown/reboot?
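
This is not an answer from the TAR, just a sketch of the piece such a shutdown script would need: finding every mounted OCFS2 filesystem and unmounting it. The function names are hypothetical, and it assumes /proc/mounts lists the filesystem type as ocfs2 in the third field.

```shell
#!/bin/sh
# ocfs2_mounts [FILE]: print the mount point of every ocfs2 entry in a
# /proc/mounts-style file (defaults to /proc/mounts). Hypothetical helper.
ocfs2_mounts() {
    awk '$3 == "ocfs2" { print $2 }' "${1:-/proc/mounts}"
}

# umount_ocfs2: unmount each OCFS2 filesystem, deepest paths first,
# and report any that refuse to unmount. Must run as root.
umount_ocfs2() {
    ocfs2_mounts | sort -r | while read -r mnt; do
        umount "$mnt" || echo "could not unmount $mnt" >&2
    done
}

# Usage, e.g. from the stop) branch of an init.d script:
#   umount_ocfs2
```

One thing worth verifying on the target system (an assumption here, not something confirmed by support): Red Hat's /etc/rc.d/rc script typically runs a K script at shutdown only if a matching lock file exists under /var/lock/subsys, so a stop script whose start action never created that file may simply never be called.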

The latest response on our sev 2 TAR:

Thank you for the update. The owning support engineer, xxxx.AU, has gone off shift for the weekend in Australia and is not currently available; however, they will have the opportunity to review and progress the issue during their next scheduled shift. In the mean time, if you feel that this is a critical down production issue or an issue for which you require prompt assistance from an available support engineer, please advise us of this specifically by phoning your local support number (see http://www.oracle.com/support/contact.html for a listing) to advise call response of your need for attention; otherwise, no update is required on your part at this time and xxxx.AU will follow-up with you during their next scheduled shift.

Ok, we bought it that OCFS2 is not supported. We switched to RAWs and got the same error on the CRS install. Week after next, if there is nothing new from Oracle, we backtrack to 10gR1 and go with RAWs and ASM.

First lesson learned...Oracle demands the libstdc++.so.5 library; it will accept no substitutions! This is the only thing Oracle has helped with so far...but we had to find the issue and the proper RPM to install.
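
A quick way to see which shared libraries a binary cannot resolve is to run ldd on it and pick out the "not found" lines. The small filter below is only a sketch; the name missing_libs is hypothetical, and it assumes the usual ldd output format of "libname => not found".

```shell
#!/bin/sh
# missing_libs: read ldd output on stdin and print the name of every
# library reported as "not found". Hypothetical helper.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (path taken from the root.sh error above):
#   ldd /var/oracle/product/10.2.0/crs/bin/crsctl.bin | missing_libs
```

As far as I know, a library that is present but built for the wrong architecture (a 32-bit libstdc++.so.5 for a 64-bit binary, say) is skipped by the loader during the directory search and ends up reported the same way, which may be why a copy sitting in /usr/lib did not help.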

Second lesson learned...do not use links with the raws needed for CRS (2 for the config disk and its mirror and 3, yes 3! for the voting disks) or the root.sh will fail. Once we went directly to the raw devices themselves CRS installed. Solved this ourselves.
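
A pre-flight check along these lines would catch the problem before root.sh does. It is a sketch only: the function name and the /dev/raw/rawN device names in the usage comment are hypothetical, so substitute whatever devices were given to the installer.

```shell
#!/bin/sh
# reject_symlinks PATH...: print every argument that is a symbolic link
# and return 1 if any was found. Hypothetical helper.
reject_symlinks() {
    status=0
    for p in "$@"; do
        if [ -L "$p" ]; then
            echo "symlink: $p -> $(readlink "$p")"
            status=1
        fi
    done
    return $status
}

# Usage with made-up device names (2 OCR devices, 3 voting disks):
#   reject_symlinks /dev/raw/raw1 /dev/raw/raw2 /dev/raw/raw3 /dev/raw/raw4 /dev/raw/raw5
```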

Third lesson learned...Oracle expects routable IP addresses on the VIP. If it doesn't get them, the cluster configuration verification step fails (ignore it if you use unroutable IPs) and the vipca silent install will fail at the end of root.sh; just run vipca manually from the command line. We found this ourselves after the first error.
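
To know ahead of time whether to expect the verification failure and the manual vipca run, the planned VIP addresses can be checked against the private ranges. This sketch assumes "unroutable" means the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16); the function name is hypothetical.

```shell
#!/bin/sh
# is_private_ipv4 ADDR: succeed if ADDR is in an RFC 1918 private range.
# Hypothetical helper; it does simple prefix matching on dotted-quad text
# and does not validate that ADDR is a well-formed address.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage with a made-up VIP address:
#   if is_private_ipv4 192.168.1.101; then
#       echo "private VIP: expect to run vipca by hand after root.sh"
#   fi
```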

Fourth Lesson Learned: Don't let the SA configure the entire system disk as one large partition, leave room to add swap if needed.

Fifth Lesson: If ssh user equivalence won't work on the second (third, fourth, etc.) node even though you've done everything right, do this on the offending node:

Login as oracle user:
$ cd $HOME
$ chmod 755 .
$ chmod 700 .ssh
$ cd $HOME/.ssh
$ chmod 600 authorized_keys
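
To confirm which node is the offender before changing anything, the same three modes can be checked rather than set. A minimal sketch, assuming GNU stat (for the -c %a option) and using a hypothetical function name:

```shell
#!/bin/sh
# check_ssh_perms HOME_DIR: report each entry whose mode differs from the
# values set by the chmod commands above (755, 700, 600).
# Hypothetical helper; assumes GNU stat.
check_ssh_perms() {
    home=$1
    rc=0
    for pair in "$home:755" "$home/.ssh:700" "$home/.ssh/authorized_keys:600"; do
        path=${pair%:*}
        want=${pair##*:}
        have=$(stat -c %a "$path" 2>/dev/null) || have="missing"
        if [ "$have" != "$want" ]; then
            echo "$path: mode $have, expected $want"
            rc=1
        fi
    done
    return $rc
}

# Usage, as the oracle user on each node:
#   check_ssh_perms "$HOME"
```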

Also, to get rid of the annoying last login message, which will result in an error, add this to your sshd_config file on the RAC system nodes:

PrintMotd no
PrintLastLog no

Now on to the actual database and ASM install. I can't wait...

Well...ASM wouldn't link properly... so:

Article-ID: Note 339367.1
Title: Installing 10.2.0.1 Db On A Redhat Linux X86-64 Os Version 4.0, Errors
SOLUTION/ACTION PLAN
=====================
To implement the solution, please execute the following steps:
Download this file at http://oss.oracle.com/projects/compat-oracle/files/RedHat/Red Hat:

binutils-2.15.92.0.2-13.0.0.0.2.x86_64.rpm 2005.10.05

RHEL 4 Update 1 patched binutils necessary for 10gR2 install on x86_64

Then try the relink all again

Ok...now everything worked but lsnrctl threw segmentation faults, so...

Article-ID: Note 316746.1 (Which by the way, can't be found on metalink)
Title: Segmentation Fault When Execute Sqlplus, Oracle, Lsnrctl After New/Patchset Install
ACTION PLAN
============
Please do the following:

1. cd /usr/bin (as root)
2. mv gcc gcc.script
3. mv g++ g++.script
4. ln -s gcc32 gcc
5. ln -s g++32 g++
6. login as oracle software owner (make sure environment is correct)
7. cd $ORACLE_HOME/bin
8. $ script /tmp/env_relink.out
9. $ env
10. $ ls -l /usr/bin
11. $ relink all
12. $ exit
13. Send env_relink.out to Oracle support

Finally! A system with CRS running, OracleNet running and ASM running hopefully ready for the databases to be created...11/17/2005.

Isn't attention to prompt and courteous customer support a wonderful thing?


Mike

11 comments:

Noons said...

and then they call people who have been complaining about this state of affairs the "c.d.o.s lunatics".

Because other than those, precious few others are making any noises!

Your description of the QC person is absolutely spot-on: someone is NOT doing their job inside Oracle. And for a long time as well.

Tom said...

Unfortunately, the OUI requires the addition of the 32-bit libraries as well as the 64-bit ones.

This is where your problem lies.... In addition to the required packages posted by oracle, install their 32-bit counterparts as well, and this problem should go away.

You may also be hitting an error solved by patch 2617419 in metalink.

We hit this months ago playing with a 64-bit R1 RAC install which was at that time unsupported...but we did get it to work :)

Tom Callahan
TESSCO Technologies
410-229-1361

Mike said...

In the $ORA_CRS_HOME directory both the /lib and /lib32 directories are present and loaded. As far as I can determine the patch you mention is for the opatch utility which we are not using for this as it is a new install.

Thanks for the comments though!

Mike said...

The problem with deleting OCFS is that in a RAC environment you must share the voting and configuration disks. Since they must be there to load CRS and CRS must be there to load ASM, it leaves you with either a vendor provided CFS, Oracle's OCFS or RAW. If you have a vendor provided CFS you usually don't need ASM or OCFS. Most don't like to mess with RAWs. Leaves us in a bit of a pickle if they take away OCFS.

Mike

Tom said...

Just for giggles, these are the packages we had to install as mentioned before.... I especially note the compat-libstdc++-33-3.2.3-47.3.i386.rpm package for your issue....

bzip2-1.0.2-13.EL4.2.x86_64.rpm gedit-2.8.1-4.x86_64.rpm mikmod-3.1.6-32.EL4.x86_64.rpm
bzip2-devel-1.0.2-13.EL4.2.x86_64.rpm glibc-devel-2.3.4-2.9.i386.rpm openssl-0.9.7a-43.2.x86_64.rpm
bzip2-libs-1.0.2-13.EL4.2.x86_64.rpm gnutls-1.0.20-3.2.1.x86_64.rpm openssl-devel-0.9.7a-43.2.x86_64.rpm
compat-gcc-32-c++-3.2.3-47.3.x86_64.rpm gzip-1.3.3-15.rhel4.x86_64.rpm sudo-1.6.7p5-30.1.1.x86_64.rpm
compat-glibc-2.3.2-95.30.i386.rpm krb5-devel-1.3.4-17.x86_64.rpm sysreport-1.3.15-2.noarch.rpm
compat-glibc-headers-2.3.2-95.30.x86_64.rpm krb5-libs-1.3.4-17.x86_64.rpm tcpdump-3.8.2-10.RHEL4.x86_64.rpm
compat-libgcc-296-2.96-132.7.2.i386.rpm krb5-workstation-1.3.4-17.x86_64.rpm telnet-0.17-31.EL4.3.x86_64.rpm
compat-libstdc++-296-2.96-132.7.2.i386.rpm libaio-0.3.103-3.x86_64.rpm zlib-1.2.1.2-1.1.i386.rpm
compat-libstdc++-33-3.2.3-47.3.i386.rpm libaio-devel-0.3.103-3.x86_64.rpm zlib-1.2.1.2-1.1.x86_64.rpm
compat-libstdc++-33-3.2.3-47.3.x86_64.rpm libgcc-3.4.3-22.1.i386.rpm zlib-devel-1.2.1.2-1.1.x86_64.rpm
elfutils-0.97-7.x86_64.rpm libpcap-0.8.3-10.RHEL4.x86_64.rpm
elfutils-libelf-0.97-7.x86_64.rpm libstdc++-devel-3.4.3-22.1.i386.rpm

Thanks,
Tom Callahan

Tom said...

Also...make sure the LD_LIBRARY_PATH in your Oracle user's .profile contains /lib64

Mike said...

I have not seen anything relating to this, however it is odd they would have it not compatible with 10gR2

Mike

Mike said...

Could be that they have since fixed the issues. They were in the process of confirming OCFS2 with 10gR2 when we were doing the install.

Mike said...

Separate files on separate disks is my understanding, else why have dups if one disk failure kills them all? That being said, in a SAN or NAS where it is striped to there and back again it may make little difference.

bmw330idba said...

What about NFS? Just curious why not a single reply or mention of NFS.

Mike said...

Afraid this was a while ago, but we didn't want to use NFS cross mounting (I assume this is what you are referring to) because of the issues with performance and such that were perceived to exist.

Oh, OCFS2 is now part of the RedHat kernel and works great with Oracle11g (good thing since they deprecated RAW!)