Sat 20 April 2013

Filed under Howto

Tags Cool

failed disks

Not every admin is equally intelligent, interested, disciplined or motivated. Those attributes represent the balancing act of professional service in any capacity. However, when both motivation and interest are lacking, it's probably a sign you hate your job. Cleaning up after messes made by disinterest and apathy has become something of a pastime for me. Employers discover this the way skiers discover avalanches - a silent but rapidly rising disaster. Most of us are simply blindsided when a situation hits critical mass. That's why Documentation, Budgets, RAIDs and Backups are fundamental to basic competence.

recovery

The package today was a so-called Case Management server, which began life as a Compaq Presario CQ5320F desktop. Perhaps not a bad machine, but never in its life should it have been called a server. When I try to picture the prior admin, all I see is a man spending a great deal of effort finding out how close he can get to a cliff's edge while doing handstands. It does not matter that the new admin is speedily retrofitting all of the infrastructure while remaining on budget. Virtualization, redundant equipment, central storage and backups can't fix the two old machines which haven't yet been migrated. The failing system is our Case Management server, which had failed to back up properly for some weeks. Second-guessing won't help; suffice it to say that numerous attempts did not yield success. (Which makes it just about identical to everything else the predecessor left us.) This weekend's scheduled power outage seems to have triggered the failure. All other systems restarted and ran properly, but the Windows 2003 Case Management server hung on login. Not even Safe Mode worked. In desperation, the admin attempted a 'Last Known Good Configuration' boot, which practically guarantees problems if it does not work. It too failed.

The onsite admin spent a good deal of time before calling me in. He wasn't sure what he was dealing with, and we needed Case Management, if at all possible, by Monday. My offsite thirty-second diagnosis was: likely disk failure. I arrived in the role of troubleshooter. The admin has been onsite for about ninety days and he's been busy. Since day one, his hands have been full with maintaining service, designing solutions, building and implementing. His keen eye and patient effort have resulted in a general migration away from peril with zero downtime. We spent just under an hour debating various approaches to the problem, since everything would require hours of data copying, and we didn't want to tarnish his sterling image. Multiple attempts at extraction might crash drives or push us out of our window of opportunity.

careful assembly

The predecessor had taken a Compaq desktop with a 500GB drive, and later added a Barracuda 1TB disk. Since he lacked mounting rails, and didn't want to be bothered with the other 3.5" slot, he felt it appropriate to mount the second disk at a 35-degree angle, in a full-width slot, with only one screw. Due to bad luck, both drives began to fail at about the same time. This machine had been slow for months, but everything was slow from the beginning. The only change after restart was the sort of permanent slowness most of us call "crashing". I don't know when slow becomes static, but this server was offering us an excellent opportunity to measure.

resources

Resources: NetApp FAS2220, QNAP NAS, VMware ESXi 5.1 cluster, and the failing Server with both disks.

QNAP - set up the NFS export, casemgmt, and a 1.1TB iSCSI Target.

NetApp - Keep on trucking, serving persistence to the VMware Cluster.

We booted the Compaq with SysRescueCD. We found the disks and ran smartmontools against the drives. They passed basic inspection, but reported "replace immediately". After mounting QNAP:/casemgmt, we started in on dd_rescue.

fdisk -l revealed two disks, number one ~500GB, number two ~1TB. In a second virtual terminal, I loaded iftop. I quickly saw the bandwidth usage climb to 95mb/sec. I didn't know whether that was megabits or megabytes, and after a little digging, I sadly determined that it was megabits, which meant our link was only 10/100. The Ethernet chipset was an Nvidia forcedeth, the card was a 10/100, and the unchanging "Estimated Time to Completion" was slightly under 12 hours. We scrambled around, found nothing, and eventually headed to the only Portland computer store carrying gigabit PCI-Express NICs. Computek, just west of I-405 on Jefferson, sold us an Intel PCI-E 1Gb NIC. When we shut down to install the NIC, we relocated the Data Drive to a different workstation (also with a gigabit NIC).
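
The back-of-envelope math behind the NIC shopping trip can be sketched in shell; the numbers are approximate (a 100Mbit link delivers at best ~12MB/sec of payload), but they land right where dd_rescue's estimate did.

```shell
# Rough estimate: why a 10/100 NIC meant ~12 hours for a ~500GB disk.
# 100Mbit/sec is at best about 12MB/sec of actual payload.
disk_bytes=$((500 * 1000 * 1000 * 1000))
rate_bytes_per_sec=$((12 * 1000 * 1000))
seconds=$((disk_bytes / rate_bytes_per_sec))
echo "$((seconds / 3600)) hours, give or take"
```

Which is why "slightly under 12 hours" was no surprise once we knew the link was megabits, not megabytes.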

The Case Management server was restarted with SysRescueCD, and we resumed the data extraction with dd_rescue. The System Drive had two major faults, which I didn't log, but they occurred between the 37GB and 41GB boundaries. After that, the disk was fine. The speed was adequate, beginning near 72MB/sec and ending at just under 19MB/sec, with occasional sustained drops to around 10MB/sec. The average, based on extraction time, was a little more than 40MB/sec - assuming my math and my tools were correct.
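
That average can be cross-checked with the same kind of arithmetic; the elapsed time below is a back-computed illustration (I didn't log the exact duration), not a measurement.

```shell
# Cross-check on the ~40MB/sec average for a ~500GB extraction.
# The elapsed time here (~3.47 hours) is illustrative, not logged.
disk_mb=$((500 * 1000))
seconds=$((3 * 3600 + 28 * 60))
echo "$((disk_mb / seconds)) MB/sec average"
```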

The Data Drive was much slower. Despite the fact that it only held ~385GB of data, it took more than twice as long to copy. Reads started at 20MB/sec and slowly rose to a peak of 32MB/sec, with many, MANY drops to under 10MB/sec. We tried a few multi-threaded copy tools. Primarily we used FastCopy for the initial sweep, and Robocopy (because of the /ZB flag) to clean up. FastCopy ran for more than five hours and was able to reach all but ~5GB of the data. Robocopy ran for a little more than twenty minutes and had no trouble at all.

Once the copy was completed, we ran a read-only chkdsk to give us an overall view of disk usage, and we also examined Properties in Windows Explorer for the top-level directories. All the numbers totalled, so we archived the failing disk.

p2v, manual edition

It's been a long time since I manually P2V'd a Windows server. This one wasn't working properly to begin with, the System Drive had experienced significant failures, and I would be forced to run an FS-altering chkdsk prior to real recovery. The QNAP held our image file on an NFS export. Windows 7 would provide us with Microsoft-written NTFS repair tools. All I needed was a shim layer to provide access to the System Drive image file.

For this, we deployed an Ubuntu 12.04 LTS VM in the VMware Cluster. We gave it 2GB of RAM and 4 vCPUs. We installed the minimal packages, then added htop, iftop, mc, nfs-utils, open-iscsi-utils, and iscsitarget (et al.). With our shiny VM, I mounted the QNAP with "hard,nointr" NFS options and moved on to iSCSI setup. First I set the IQN, then I grabbed an ietd.conf template and changed the path to reflect the location of the System Drive image. I left all masking wide open and then used the Microsoft iSCSI Initiator tools on Windows 7 to add the Ubuntu server as a Discovery Portal. The LUN immediately appeared, and I connected it to a drive letter. We ran chkdsk /f, and the file system recovered nicely. Only three files were lost, and I was unable to identify them from their contents, which leads me to suspect the data in them was completely corrupt anyway.

We did not have a licensed NTFS copy tool, so we decided to use the ntfs-3g tools under Linux to accomplish the transfer. After a few futile attempts at a loop-back iSCSI connection, I tried to get a loop mount going instead. The problem with loop-mounting is that it only works directly if you have a partition image, not a disk image. I needed to locate the start of the partition, and as I examined the problem I found that I was not familiar enough with the math of sectors and offsets to feel comfortable basing a critical system recovery on it.
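
For the record, the arithmetic itself is small: a loop mount of a whole-disk image just needs the byte offset of the partition's first sector. A sketch, where the start sector (2048 is the common modern MBR default; old layouts used 63) and the image name are assumptions for illustration:

```shell
# Manual alternative to kpartx: compute the partition's byte offset
# from its start sector (read from `fdisk -l <image>`), then loop-mount.
# Start sector and image name below are hypothetical.
start_sector=2048
sector_size=512
offset=$((start_sector * sector_size))
echo "mount -o loop,offset=${offset},ro casemgmt-disk0.img /mnt/sysdrive"
```

The math is trivial once you trust your fdisk output; the part I didn't trust, mid-recovery, was my reading of that output.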

Five minutes of Google turned up kpartx. I don't know who wrote it, but that poor bastard has probably received a few direct courses in the School of Hard Knocks. (Despite it being my alma mater, I prefer mail-order courses these days.) I think you can always tell software written at the pointy end of the sharp stick called "Experience". When a utility is VERY easy and requires NO BRAINS to operate, it probably means that the author is either a genius or had a lot of practice. I don't want to diminish the author's genius, but given that no one I've heard of did this before him, I am voting for the latter.

kpartx was very small, it's in the main Ubuntu repo, and it has two easy flags - one to interpret the image and tell you what it's going to do, and another to do it. I completely ignored the remaining options. I ran the test, and received exactly what I expected. Then I ran the live command, and I had a new loop device in /dev/mapper. The loop device worked smoothly with ntfsclone --save-image, and the resulting image was 9GB.

That's right: 3 hours, a full disk, and 9GB of actual data. I tried very hard not to be deeply annoyed. Even with that in hand, it was still a damaged image of a Windows 2003 Standard SP2 server originally running on SATA disks, and I needed to load it into a new guest machine, on an emulated LSI controller, in the VMware cluster to make it run.

We set up a Windows 2003 guest, "CaseMgmt", and set the virtual drive to 40GB. With the CaseMgmt VM shut off, we attached the virtual disk to the Ubuntu VM and used fdisk to create an aligned partition. This was when we discovered that ntfsclone won't allow you to restore an image to a partition smaller than the original. After a little monkeying about, we gave up and increased the size of the CaseMgmt virtual drive to 500GB. After a lot of testing, including using Windows 2008 to create an aligned partition on the virtual drive, we gave up on alignment. Given the nature of the problems we were experiencing, it was becoming a waste of time.

Our testing cycle went like this: Shutdown the CaseMgmt VM, boot the Ubuntu VM, attempt an operation, then Shutdown Ubuntu and start CaseMgmt. It was pretty simple, and our VMware Cluster running atop our FAS2220 meant that it took about 30 seconds to start and 10 seconds to stop - for both Ubuntu and the CaseMgmt VMs.

At length, we abandoned aligned partitions, and created a new Virtual Drive. We started the CaseMgmt VM on a Windows 2003 Standard CD ISO Image. We used the initial phase of Setup to partition and format the C: Drive. Then we halted the CaseMgmt VM, and restarted Ubuntu. We used ntfsclone to restore the partition, and anecdotally, it seemed slower. Since I had not planned for this moment, I had not timed previous ntfsclone operations. Hence, it was pointless to attempt to get any performance data.
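
For reference, the ntfsclone round trip we leaned on looks like the sketch below. The device and file names are hypothetical stand-ins, and the script only prints the commands rather than touching a disk; --save-image stores only allocated clusters (which is why 500GB shrank to 9GB), and the restore side refuses any target partition smaller than the original.

```shell
# Sketch of the ntfsclone save/restore pair (hypothetical names).
# Prints the commands instead of running them against real devices.
src=/dev/mapper/loop0p1                   # partition mapped by kpartx
img=/nfs/casemgmt/sysdrive.ntfsimg        # image on the NFS export
save_cmd="ntfsclone --save-image --output $img $src"
restore_cmd="ntfsclone --restore-image --overwrite /dev/sdb1 $img"
echo "$save_cmd"
echo "$restore_cmd"
```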

When next I booted the CaseMgmt VM, Windows 2003 did indeed attempt to Boot, however, it stopped very quickly with "INACCESSIBLE_BOOT_DEVICE". We booted to the Windows 2003 Standard Install Disk again, and found that the local-system Administrator password was not properly documented by the previous admin. We booted a copy of Hiren's Boot CD and used the Password Reset tools to clear the Administrator password. From there, we tried enabling a few drivers, but all this yielded was an unbootable, and unrecognizable, Windows instance.

We restored again with Ubuntu, and this time, abandoned all pretense at p2v. I executed a Windows 2003 Standard Repair Install. Everything worked exactly as I had hoped. The resulting image booted, ran, remained joined to the domain, and the services of the Case Management software even operated correctly.

grinding

All the time we spent working on the System Drive, we kept our eyes on the Data Drive. Now that we had time, we also loaded HDD Guardian to give us a GUI for the SMART info on the Data Drive. There were two firmware updates pending for it, and it had just under 600 bad sectors. The recommendation, in orange, was: Back up your data and replace this disk immediately.

As I stated above, we carefully examined the results and when all was complete, we detached the iSCSI Drive from the Workstation and attached it to the CaseMgmt VM with Microsoft iSCSI Initiator. Once we set the drive letter, the CaseMgmt services immediately resumed operation and behaved normally.

The only remaining hitch is Windows Updates. Internet Explorer, and many other services, wouldn't run until we re-installed Service Pack 2. Thankfully, the VMware Tools did not care either way. The first two things we did after our successful boot-up were to set the local Administrator password and install VMware Tools. I would not attempt deployment of any .Net software on this host. The install of .Net is in an uncertain state; the patches keep trying to apply, but there are errors about mscoree.dll and other .Net-related DLLs. My past with the 1.1 and 2.0 releases of the .Net CLR informs me that it's simpler to rebuild a server than it is to fix .Net. This server, and its configuration, are hardly suitable for long-term use. However, this was not a migration; it was a holding action intended only to circumvent disaster. Long-term services will be deployed in a more suitable fashion.

gratitude

I am grateful to all the fellows of the Open Source movement for providing me with tools which can be assembled into useful, powerful and subtle solutions. This operation was one step shy of genuine data recovery, and none of it would have been possible without Linux, GNU and the massive host of utilities. Thanks, in particular, to kpartx, which takes the complex math out of loop-mounting partitions within hard drive images.

notes

kpartx

I only used -l and -a. The -l flag lists the partitions it detects; the -a flag sets up a loop device for each of them.

kpartx -l <image file>
kpartx -a <image file>

(for more info, see the nfolamp article.)

nfs and dd_rescue

For those who might be curious about the exact commands....

Mounting the share

mount -t nfs qnap:/casemgmt /nfs/casemgmt -o nfsvers=3,rsize=8192,wsize=8192,soft,intr,nolock

Listing partitions

fdisk -l

Starting dd_rescue

dd_rescue /dev/sda /nfs/casemgmt/casemgmt-disk0.img -l /nfs/casemgmt/casemgmt-disk0.log

nvidia-forcedeth

Nvidia chipsets - video or motherboard - have always been a mix of frustration and pleasure. I don't know if forcedeth even supports Gig-E, but in this case it was pretty lame. Also, ethtool is a morass of complexity and obscure options. FFS, can someone please make "ethtool ethX" give a report of the well-known stats for any card? TSO, TX/RX checksum, PHY state, PHY speed, etc. I hate how much I have to read the manual the two times per year that I need to find out the damn link speed.
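
For what it's worth, the lines I actually want are greppable out of a plain `ethtool eth0` run. The sample output below is canned for illustration (not captured from the Compaq); on a real box you'd pipe ethtool itself through the same filter.

```shell
# Filter the useful lines from `ethtool ethX` output.
# The sample text here is canned; replace it with a real ethtool run.
sample='Settings for eth0:
	Supported ports: [ TP ]
	Speed: 100Mb/s
	Duplex: Full
	Link detected: yes'
summary=$(printf '%s\n' "$sample" | grep -E 'Speed|Duplex|Link detected')
printf '%s\n' "$summary"
```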

aligned partitions

I love linux tools, but it's painful to use traditional fdisk to align a partition. Now that the heat is off, I think I could have used parted to perform this work - but I didn't. Now, I don't intend to test again.
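
The rule itself is simple even when the tooling isn't: a partition is 1MiB-aligned when its start, in 512-byte sectors, divides evenly by 2048. parted handles this for you if you give it MiB units (something like `mkpart primary ntfs 1MiB 100%`) - untested here, as noted. The check in shell, with example sector values:

```shell
# 1MiB alignment check: start sector must be a multiple of 2048
# (2048 sectors * 512 bytes = 1MiB). Sector values are examples:
# 63 is the classic misaligned MBR start, 2048 the modern default.
for start in 63 2048 206848; do
    if [ $((start % 2048)) -eq 0 ]; then
        echo "sector $start: aligned"
    else
        echo "sector $start: misaligned"
    fi
done
```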

nfs

It seems that most NAS solutions dislike locking. When we first attempted to connect to the QNAP, we had a lot of failure. Finally I used this:

showmount -e xx.xx.xx.xx

That revealed... NOTHING.... and we discovered that NFS had never been used from the QNAP. After enabling NFS, it was very nice.

NFS Command used during Image Extract:

mount -t nfs xx.xx.xx.xx:/*exportname* /local/path -o nfsvers=3,rsize=8192,wsize=8192,soft,intr,nolock

NFS Command used during iSCSI Target Export:

mount -t nfs xx.xx.xx.xx:/*exportname* /local/path -o nfsvers=3,rsize=8192,wsize=8192,hard,nolock

iSCSI

If you're on Gig-E, this is probably a great thing. It's better if you don't have to do iSCSI across your production edge network, and better still with dedicated interfaces. That said, we can stream ~90MB/sec from our QNAP over a 1500-byte-frame production edge network. It's Good Enough™ for me.
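
A quick sanity check on that number: Gig-E is 125MB/sec raw, and with standard 1500-byte frames roughly 94% of that survives Ethernet/IP/TCP overhead, so ~117MB/sec is the practical ceiling. Sustaining ~90MB/sec through a shared edge network is a healthy fraction of wire speed.

```shell
# Practical ceiling for iSCSI over Gig-E with 1500-byte frames.
raw_mb_per_sec=$((1000 / 8))                        # 125 MB/s on the wire
payload_mb_per_sec=$((raw_mb_per_sec * 94 / 100))   # ~94% left after overhead
echo "~${payload_mb_per_sec}MB/sec usable"
```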

#/etc/iet/ietd.conf
Target iqn.2012-07.com.ashbyte:MomsLT_NTFS
    Lun 1 Path=/media/ExternalVol00/rootSnapshot01/Sunbeam2/MomsLT_drive.img
    Alias MomsLT
    InitialR2T Yes
    ImmediateData Yes
    MaxOutstandingR2T 1
    MaxConnections 1
    MaxSessions 0
    HeaderDigest None,CRC32C
    DataDigest None,CRC32C
    QueuedCommands           32              # Number of queued commands

ntfs-3g

These guys should probably be showered with money from Microsoft and most MS Users. They have provided the only legitimate alternative to Windows Boxes for accessing, managing and munging NTFS volumes. I dislike using it, but I am ALWAYS thankful for their tools.

driver injection

I would love to see a GOOD Windows 2003/XP/etc offline driver-injection toolkit. I have experimented with this before, but it's very tedious and unrewarding. If I keep this up, I may write one. I would love it if the injection framework only installed VMware drivers =). But I would similarly love it if there were an offline tool for simply changing partition or other information in a Windows registry, allowing boot-drive adjustments which currently require a re-install. God Bless *nix for their love of "/" (root).


Up To Something © Joshua M Schmidlkofer Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More