[x4500 drive-slot diagram]

THUMPER - First Contact with Sun x4500

In August of 2008 we received our first Thumper, a Sun x4500. This machine has two dual-core AMD Opteron processors (Model 290 at 2.8 GHz), 16 GB of memory, and six 8-port SATA controllers, and comes with IPMI 2.0 support as well as the Sun ILOM. This page documents how I got our Thumper up and running. This was my first serious contact with Sun hardware and, as you'll read, I had some troubles, simply because I'd never seen how Sun does things.

I'm sure Sun aimed this machine at installations with massive storage requirements, or at use as an NFS server. The processors are relatively slow for our statistics work, but are surely enough for an NFS server. We, however, wanted it just to help back up our fast-growing 20TB+ of data. The machine packs 48 SATA drives into 4U and, as you might expect, it is really, really heavy (about 200 pounds).

Physical Set Up

Lugging this thing around for testing and eventual setup was a chore. Fortunately, it comes apart (49 pieces, to be precise), so anytime weight was an issue I pulled the drives and made a stack of them. I thought the rails were really flimsy (compared to rails from Dell and IBM for boxes half this weight) and this eventually became a problem. Despite my efforts to align things correctly, the rails did not line up just right and they ate themselves, crushing some flimsy aluminum in the rail assembly. When the ball bearings started to fall on the floor, I knew these rails were history. Fortunately, I had a very heavy-duty shelf from some old IBM gear, so I used that. I've seen others manage to get their Thumpers installed on these flimsy rails, so I know it can be done - just not by me, the first time at least.

Since this was my first experience with a Sun device, I actually did read the Installation Guide. It had enough to help this Sun newbie get things working - but not right away. I plugged in the USB keyboard, the VGA cable, the service processor ethernet and the regular ethernet (eth0), and powered the device on. It roared for a few seconds and then the fans settled down, but I saw nothing on the VGA screen. Is it up? DOA? Heck if I could tell.

Booting

Well, this initial welcome was a little disappointing. I looked at my DHCP server logs and found it had assigned an address, so that meant something was working. I tried to SSH to it (e.g. ssh root@myipaddress) and got a password prompt. The documentation said the password was 'changeme' (see, I told you I read the Installation Guide) and I was in. Knowing no better, I thought I had a Solaris session. Oh no, grasshopper.

I quickly figured out I was in the Service Processor (SP), which Sun calls the Integrated Lights Out Manager (ILOM). This is actually pretty easy to use. It has enough online help that I was quickly able to see devices and things ILOM-ish. I managed to create a few extra SP accounts and then carefully proved to myself that I could log in successfully.
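Roughly what that looked like from the ssh session on the SP; I'm quoting the ILOM CLI from memory, so the exact syntax and role names may differ by firmware version, and 'fred' is a made-up account:

    -> create /SP/users/fred                  # prompts for a password
    -> set /SP/users/fred role=Administrator  # or Operator for a read-mostly account
    -> show /SP/users                         # confirm the new account exists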

I found the SP MAC address and set it up in my DHCP server so it would hand the SP the IP address I expected. Another boot and then my tools which use IPMI worked just like everything else (see IPMI notes). Now I could at least power-cycle the machine.
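A hedged example of the sort of thing that now works from any admin host; the hostname, user and password here are placeholders, not my real setup:

    ipmitool -I lanplus -H thumper-sp -U root -P changeme chassis power status
    ipmitool -I lanplus -H thumper-sp -U root -P changeme chassis power cycle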

Since I was messing with Service Processor things, I tried the web interface, opening a web browser connection to port 80 at the SP's IP address. I got a login page as expected and logged in with one of the accounts I had created. From these web pages I could see various details of the hardware.

It took a bit of guesswork to sort out the correct JAVA packages (this is Sun, you know, all solutions use JAVA) to install. My Debian etch system needed these installed: ia32-sun-java5-bin java-common java-gcj-compat java-gcj-compat-plugin sun-java5-jre. You'll need 'main contrib non-free' in your /etc/apt/sources.list file. What a joy it is to have a functioning remote console, compared to Dell's DRAC software which hasn't actually worked in a couple of years now!
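A minimal sketch of the Debian side, assuming a placeholder mirror URL; the package list is the one above:

    # /etc/apt/sources.list needs 'main contrib non-free' on the etch line, e.g.
    #   deb http://your.mirror.example/debian etch main contrib non-free
    apt-get update
    apt-get install ia32-sun-java5-bin java-common java-gcj-compat \
        java-gcj-compat-plugin sun-java5-jre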

I had a functional Service Processor, but there was no sign of an operating system, and Sun claimed to have Solaris 10 installed and ready for me. I plugged a serial cable into the RJ-45 management port (the 'ethernet'-style jack). While the SP was booting I could see messages on the serial console. It gave me no more real information, but at least I now knew what was going on during the roar. Still nothing on the VGA screen.

I tried all sorts of things, and during one of the many power cycles of the machine, I noticed a 0037 in the lower right-hand corner of the VGA display. "Ahh!", me thinks, "It's a hardware problem!" I placed a call to Sun and got a tech to call me back a few minutes later. By getting the SP working up front, I saved myself lots of hassle. The tech had me run a 'reset /SP' command in the SP session. This caused the SP to power-cycle, and when it came back, the VGA screen began to show things. Yea! Finally an operating system!

After finally getting the VGA console to show things, I could check out the BIOS, which looked much like the BIOS on most other machines. Nice to finally see familiar things. Now that I had a VGA console, Solaris 10 setup prompted me for a few things and my Thumper was up.

[x4500 picture]

Testing the Drives

For new hardware I always run bonnie++ to see how well the disks behave. Of course, since this was Solaris I felt completely crippled immediately. I fetched a dozen or so useful tools from sunfreeware.com and then I had some basic tools (e.g. GNU tar, gzip, wget and, of course, bonnie++). How can you do anything on a machine without these?

The Solaris system comes with a ZFS (raidz) pool of 42 drives. I immediately ran bonnie++ on this and, to my great surprise, it was more than twice as fast as any other hardware on which I've run bonnie++. Mind you, I have not been exceedingly careful doing these tests, but twice as fast was a great surprise. Perhaps bonnie++ is getting fooled by Solaris in some way, and caching is making the hardware appear faster than it really is. See these notes for more details.
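For what it's worth, the runs looked roughly like this (not my exact command line); the directory is a placeholder and the -s size is set to about twice physical RAM to blunt the caching effect I just speculated about:

    bonnie++ -d /pool/bonnietest -s 32g -u nobody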

My next step was to remove a few drives from the ZFS pool and create a single UFS filesystem on one drive. I mounted this and ran the same bonnie++ test. Once again the Sun system was surprisingly fast - 1.5 times faster than all the others. I did nothing to optimize anything - I just took the defaults and ran the test on an idle machine.
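A hedged sketch of the single-drive UFS test; the cXtYdZ device name and slice are illustrative - the real names come from Solaris's format utility:

    newfs /dev/rdsk/c4t0d0s0           # build a UFS filesystem on the slice
    mount /dev/dsk/c4t0d0s0 /mnt
    bonnie++ -d /mnt -s 32g -u nobody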

Now it was time to see how Linux behaves. I replaced the 1TB Hitachi drive in slot 0 with another SATA drive. I had the option of using a USB CDROM to boot my Debian etch distribution, but I chose to do a PXE boot and install a minimal system - later upgrading it to include all the packages I'd normally want.

I never take the defaults when asked about the drives, and I was a little surprised when the manual partition manager actually showed me all 48 drives to choose from - even /dev/sdy, the boot drive. Everything went as expected until it tried to write the grub configuration.

Some poking around (Alt-F2 gets a second console) showed me that grub/device.map did not go all the way to disk 'y'. I tried changing the file to include more drives, but kept running into problems. I spent way too much time trying to hack the environment in the installation image.
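For reference, this is the shape of the file I was fighting with; GRUB legacy's device.map just maps BIOS drive numbers to Linux devices, and the entries below are illustrative, since the BIOS ordering was exactly what I couldn't pin down:

    (hd0)  /dev/sda
    (hd1)  /dev/sdb
    # ... and so on, up through an entry for the real boot drive, /dev/sdy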

We all know that SCSI devices 'move' when you insert or remove devices. This is a giant pain we all have to deal with. I thought, "If devices get ADDED when I add a device, how about I remove some devices - like 47 drives!" I popped out 47 drives, leaving only slot 0 with a drive, and re-installed.

It worked exactly as you'd expect - the installer wrote data to /dev/sda and the machine booted just fine. The first time up, I changed the devices in /etc/fstab and grub/device.map, powered the machine down, put all the drives back, and came up again. It worked like a champ - way better than trying to out-guess the install process.
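Here's a sketch of that first-boot fix-up, assuming (as turned out to be the case) that the boot disk shows up as /dev/sdy once all 48 drives are back; the use of sed rather than an editor is just illustrative:

    sed -i 's|/dev/sda|/dev/sdy|g' /etc/fstab             # root and swap entries
    sed -i 's|/dev/sda|/dev/sdy|g' /boot/grub/device.map  # (hd0) now points at sdy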

It was tempting to go back to Solaris and ZFS, which seemed very easy to use and certainly performed well. In the end, however, I decided that since the machine was just for backup, performance was not a big factor. If backups take 4 hours instead of 2 hours, I don't really care. It is far more important that the Thumper run the same operating system and tools as my 60+ other machines. Heck, the machine didn't even come with a useful tar command. I'd spend a huge amount of time trying to make it look like my Linux machines - compiling lots of libraries and tools. Or I could just use Debian and it'd all work. Sorry, Solaris, you're not even close to being a contender.

Organizing the Drives

I must admit I was really surprised that the Linux boot drive for a Thumper was drive sdy, in disk0 (slot 0; see the map at the top of this page). Of course, I'd never seen anything like it before. Even more surprising was that the alternate boot drive (slot 1, disk1) is drive sdac. That's not very intuitive since the disks are right next to each other, but it works, so I guess that's OK.

That is, it works if the drives never change. I needed to get a RAID unit up, so I stole 15 of those great big 1TB drives and used them elsewhere. After removing the drives, I tried to guess what the boot drive would be. Unfortunately, I could find no way to determine which Linux drive was in which slot. Several reboots, pulling all the drives, and some more gyrations later, I finally figured out my new boot drive was sdp. I really hate that aspect of the Linux kernel's drive naming scheme. Solaris maps drives by hardware addresses and they don't move. While the names are strange to my eye and I can never remember them, at least they don't move.

My next task was to get some space set up for my backups. We always use the xfs filesystem, but what else should I use? I had these choices:

  • Native - just mount dozens of drives all over the filesystem. It's easy to do, very hard to manage and bad if a drive fails.

  • MD - use mdadm to create software RAID devices. It's not that hard, and results in one device (more if you want) which looks like one large drive to the kernel. If a drive fails, you aren't supposed to lose data (I never tested this). If the drives move because I take out more drives or put more back, the MD device is broken until you reassemble it. Another really bad thing about MD is that the daemon is apt to run off and check the drives. On a small MD device, this probably isn't much of a burden. On a 10TB MD device, it is a killer. It didn't take long before I abandoned MD because of this.

  • LVM - use a number of pv*, vg* and lv* commands to create a logical volume, which results in one device (more if you want) that looks like one large drive to the kernel. The commands are a bit complex and confusing, but not that hard once you figure them out (see the sketch after this list). Unlike MD, when devices move after a reboot, LVM finds them without intervention. The bad thing is that this is not RAID, and a failed drive means losing some or all of your data.

  • ZFS - is available on Linux in a pretty crippled mode. It runs in user space via FUSE and works, but is miserably slow. Yes, I know I said the machine was for backup, but... The good part is that ZFS is extremely easy to administer, provides true software RAID so hardware failures are not fatal, and automatically recovers from a wide set of problems. I was very impressed with ZFS and will revisit it someday, I hope.
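
Here is a minimal sketch of what the LVM route looks like in practice; the device names, volume group and LV names, and sizes are illustrative, not the exact ones I used:

    pvcreate /dev/sdb /dev/sdc /dev/sdd             # label drives as physical volumes
    vgcreate backupvg /dev/sdb /dev/sdc /dev/sdd    # pool them into one volume group
    lvcreate -L 2000G -n backuplv backupvg          # carve out one big logical volume
    mkfs.xfs /dev/backupvg/backuplv                 # we always use xfs
    mount /dev/backupvg/backuplv /backup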

The good part of having all these drives is that I could, and did, try all of these choices. Of course I ran bonnie++ on each, and you can see the results here.

I really like ZFS, but in the end I decided that the convenience of LVM was too compelling. Yes, if we have a hardware failure, I could lose data... but then it is only backup data and while important, the risk is acceptable.

Managing All Those Drives

A few weeks later I was ready to replace the drives I had stolen from the Thumper. Actually, I decided for the moment to replace the 1TB drives with an odd-lot collection of 400GB, 500GB, 750GB and 1TB drives. Someday I suspect we'll buy 1.5TB or 2TB drives to replace the smaller ones. One nice aspect of my choice to use LVM is that I can, with some effort, remove these smaller drives and upgrade them.
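A hedged sketch of how one of those smaller drives could later be evacuated and pulled under LVM; the device name is an example, and it assumes the volume group has enough free space elsewhere to absorb its extents:

    pvmove /dev/sdk              # migrate its extents to other PVs in the group
    vgreduce backupvg /dev/sdk   # drop it from the volume group
    pvremove /dev/sdk            # clear the LVM label; the drive can now be pulled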

Which brings me to this last topic. How will I ever figure out which slot drive sdan (for example) is in? Another problem is that I did not allocate all the drives when I created my initial set of LVM devices. I didn't even do them all in order. How can I figure out which drives are being used for what?

LVM provides commands to figure this out, but when dealing with dozens and dozens of drives, this gets really confusing. I know it doesn't take much to confuse me, so I wrote a tool to sort this out.
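For the record, these are the sort of stock commands I mean; they're accurate but get mind-numbing across 48 drives (the device name is just an example):

    pvs                       # one line per physical volume: which VG, size, free
    pvdisplay -m /dev/sdan    # which logical volumes live on this particular drive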

showdisks.pl is a Perl program which creates an HTML page showing what's where. As you might expect, it uses the LVM commands to collect LVM information. Since I had them working, I added support for MD and ZFS too. It uses the /dev/disk/by-* links to collect UUIDs and serial numbers.
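The same by-* links can be eyeballed by hand, which is roughly what showdisks.pl automates; a quick sketch ('grep -v part' just hides the per-partition links):

    ls -l /dev/disk/by-id/   | grep -v part   # names embed model + serial -> ../../sdX
    ls -l /dev/disk/by-uuid/                  # filesystem UUIDs -> ../../sdX1, etc.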

I eventually decided to include support for a user-supplied (that's me) map of serial numbers and slots. This allows me to generate an HTML page like this. Now at a glance I can see exactly how things are arranged, and when I want to upgrade that 400GB drive to 2TB, I'll know which drive to remove from the LVM and which slot it lives in.

Sun told me they had a tool to show the layout of the drives and slots (from here), but all I found were tools locked into RedHat and SuSE. Unfortunately, they chose not to make the source available, so I never got to try it. I'm sure they can show a drive-slot map on Solaris, but I can't see how it can be done with a Linux kernel. I found a number of diagrams on the Sun site showing drive-slot maps, but all of them were incorrect. Perhaps things have changed in the Linux kernel. Anyway, I ended up doing it myself.