The past few days in the life of mattyb, Oracle DBA
Professional Poster
Join Date: Feb 2008
Location: France
Not sure how many of you might be interested in this, but I thought that I’d post anyway - it’s my excuse for the lack of posting recently. I know how important my posts are to you all.
We are 90% of the way through a datacentre move. We’ve now got a much more modern facility, with a large properly air-conditioned machine room, decent network for the servers as well as the offices, IP phones - a place that you’d be proud to show potential clients (we host as well as run managed servers for quite a few clients).
During the weekend of the 28th/29th of June, we moved the servers and the SAN (hard drives basically - lots of them) of our largest clients. Our SAN is from a company called EMC and they decided to add some more disks on the Monday morning, during production hours. Things started to go pear-shaped. Our performance for all clients using filesystems on the SAN was terrible. Users couldn’t connect, run their applications, their batches etc. Apparently the SAN needed to verify the disks that were added (can’t remember the terminology, plus it’s in French so I’m not sure if I could translate it properly). They couldn’t give us a timescale.
So, very little client activity Monday, no batch runs overnight. Tuesday same thing, plus lots of backups had failed over Monday night. EMC still couldn’t give us a timescale for when the disks would finish their verification. Wednesday more of the same, and now the backups are spitting up bad block errors on several large databases (over 100G). The data warehouses are no problem: performance was so bad that we haven’t been able to update them since the weekend. We have about 10 data warehouses of several terabytes.
We are down to one large client’s backup not working. One file is giving a bad block error.
Bit of tech speak : a tablespace in Oracle is basically a logical container and the container has tables, indexes etc in it. You allocate files (datafiles) to a tablespace. Some people store data in one tablespace and indexes (for example) in another - we do this. Indexes can (usually) be easily rebuilt. The file in question was one of two files that belonged to a tablespace full of data - no indexes. The file was 32G.
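For anyone who prefers code to prose, here is a toy model of that tablespace/datafile relationship in Python. It is only a sketch and not Oracle code: the class names, file paths, table names and every size except the 32G file are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Datafile:
    """One physical file on disk that belongs to a tablespace."""
    path: str
    size_gb: int
    has_bad_block: bool = False


@dataclass
class Tablespace:
    """A logical container: segments (tables, indexes) live in it,
    datafiles provide the actual storage behind it."""
    name: str
    datafiles: list = field(default_factory=list)
    segments: list = field(default_factory=list)

    def total_gb(self):
        # Capacity of the tablespace is the sum of its datafiles.
        return sum(f.size_gb for f in self.datafiles)

    def is_damaged(self):
        # One bad datafile is enough to put the whole container at risk.
        return any(f.has_bad_block for f in self.datafiles)


# Hypothetical layout: data in one tablespace, indexes in another.
data_ts = Tablespace(
    name="APP_DATA",
    datafiles=[
        Datafile("/u01/app_data01.dbf", 32, has_bad_block=True),
        Datafile("/u01/app_data02.dbf", 32),  # size assumed
    ],
    segments=["orders", "order_lines"],
)
index_ts = Tablespace(
    name="APP_IDX",
    datafiles=[Datafile("/u01/app_idx01.dbf", 16)],
    segments=["orders_pk"],
)

for ts in (data_ts, index_ts):
    print(ts.name, ts.total_gb(), "GB", "DAMAGED" if ts.is_damaged() else "ok")
```

The point of the split is that a damaged index tablespace can usually be fixed by rebuilding the indexes, whereas a damaged data tablespace means going back to a backup.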
I looked at what was in this tablespace. Largest table 42 million lines. Next 35 million. The next 27 million. I stopped looking after that.
So, we’re running a 200G database for this client, and we have a two-day-old backup. He’s screaming down the phone about when his app will be available. We’re screaming at EMC about when the disks will be verified. Great fun. Thursday, and the decision is taken that when the disks are verified, we restore the good backup for our 200G friend. We get the all clear from EMC at 17H00. We brainstorm, we put a plan into action - just before launching the restore I re-explain that if what we are about to do goes wrong, we will lose the present database (the only ‘good’ database being the backup from a few days previously). We decide to put another plan into place the next day.
Friday, EMC gives us some extra disks to restore the backup onto. They have to do this since during the project they are the only ones able to allocate disks to a server. We create the filesystems, I restore the database backup, I recover the database using the archivelogs (full of the work done on the database since the last good backup). We take a backup of the newly restored database. The client is happy the database is back up and that we now have a recent backup. We run the batch that hasn’t been able to run for the past few days. Late Friday night I get an email saying what a hero I am (along with the others that worked on the problem) and that we can sleep soundly. Over the weekend our Citrix server farms are moved, along with the developers’ machines (we develop some software that can cost several millions of euros). The data from the development servers needs to be replicated onto the new EMC SAN.
Monday morning. I am informed by my boss that EMC took the disks that we restored the database on and allocated them to another machine, wiping the disks in the process.
Monday and Tuesday were not fun days.
Apparently after all the crap that we’ve put up with over the past week, there is a grand total of 7.50€ that is missing from the client’s database. My boss took out a 10€ note from her wallet and offered to send it to the client. The head honcho didn’t really appreciate it.
I’ve been doing Oracle DBA work for over four years now. I’ve never had to recover a production database before. I’ve practiced it lots of times, but I never did it ‘under fire’. 90% of the time I’ve really loved it. That other 10% of the time is a really really really high stress, WTF, we are losing money, the client isn’t happy, home at 4am nightmare.
If anybody reading this is thinking about being a DBA - of any database - then I’d still say go for it. The work is interesting and varied. The technology surrounding databases like Oracle, SQL Server, DB2 and MySQL can be cutting edge. Most of the largest applications in use by the largest companies have a database behind them. There is always something new to learn, to develop, to speed up, to back up etc.
If you get into the database administration arena, FFS, learn how to back up and restore/recover your databases. You aren’t a DBA if you cannot get that database running again.
XBL : Ze Veteran
Professional Poster
Join Date: Apr 2000
Location: Berkshire, UK
Dump EMC. Buy NetApp. When you do this, you will thank me.
Paco is bitter about the loss of his .mac webpage. Image will return when his sadness lessens.
Clinically Insane
Join Date: Dec 1999
Originally Posted by Paco500
Dump EMC. Buy NetApp. When you do this, you will thank me.
He might be using Cisco, and they (unfortunately) are tied to EMC.
"…I contend that we are both atheists. I just believe in one fewer god than
you do. When you understand why you dismiss all the other possible gods,
you will understand why I dismiss yours." - Stephen F. Roberts
Professional Poster
Join Date: Apr 2000
Location: Berkshire, UK
Originally Posted by olePigeon
He might be using Cisco, and they (unfortunately) are tied to EMC.
Wha? If you are talking about Cisco FC switches- they are just FC switches and work just dandy with NetApp, and are fully supported both ways. NetApp will even provide them as they are an authorized reseller. Sounds like you have been exposed to the mighty EMC F.U.D. machine. Whatever EMC tells you- verify it with a 3rd party.
While I don't expect he can really dump EMC, had he been on NetApp, this would have been a non-issue. There are quite a few reasons Oracle hosts all of their customers in their Austin Data Center on NetApp- it's more better.
If one is forced to use EMC- at least throw a NetApp V-Series in front of it and make it useful.
Clinically Insane
Join Date: Dec 1999
Originally Posted by Paco500
Wha? If you are talking about Cisco FC switches- they are just FC switches and work just dandy with NetApp, and are fully supported both ways. NetApp will even provide them as they are an authorized reseller. Sounds like you have been exposed to the mighty EMC F.U.D. machine. Whatever EMC tells you- verify it with a 3rd party.
Twas Cisco that stated they're tied to EMC, not EMC. But if it works, then it works. I'd take Cisco over anything just from experience.
I guess it could also work with XSAN. 
Addicted to MacNN
Join Date: Mar 2001
Location: The Rockies
I'm surprised there isn't a primetime drama in the mode of ER or NYPD Blue on your life yet, mattyb.
Addicted to MacNN
Join Date: Apr 2007
Location: Iowa
The company I'm working for uses Oracle, and having only seen it from the client side, I envy no one that deals with it server-side.
"Specific knowledge on a topic usually demonstrates in-depth knowledge."
Professional Poster
Join Date: Feb 2008
Location: France
Originally Posted by BRussell
I'm surprised there isn't a primetime drama in the mode of ER or NYPD Blue on your life yet, mattyb.
I don't want George Clueless as me. The negotiations are ongoing.
Originally Posted by Laminar
The company I'm working for uses Oracle, and having only seen it from the client side, I envy no one that deals with it server-side.
Oracle databases rule the world. Dunno about the apps, except Siebel which is a great big steaming pile of dog shite.
Today at work it's calm. Bit scary actually.
Mac Elite
Join Date: Sep 2006
Location: Punta Cana, República Dominicana
Originally Posted by mattyb
If you get into the database administration arena, FFS, learn how to back up and restore/recover your databases. You aren’t a DBA if you cannot get that database running again.
Don't forget to periodically test your backups. I can't tell you how many client sites I've been at where they never tested the backups only to find out they were bad.
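A rough idea of what a file-level check could look like, sketched in Python. This assumes you record a SHA-256 checksum for each file at backup time and then restore to a scratch directory; the function names and the manifest format are my own invention, not part of any backup tool. It only proves the restored bytes match what was backed up. A proper test still means restoring the database and opening it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Checksum a file in chunks so large datafiles don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest, restore_dir):
    """Compare restored files against checksums recorded at backup time.

    manifest: dict mapping file name -> expected SHA-256 hex digest
              (hypothetical format, written when the backup was taken).
    Returns the list of file names that are missing or do not match.
    """
    bad = []
    for name, expected in manifest.items():
        restored = Path(restore_dir) / name
        if not restored.is_file() or sha256_of(restored) != expected:
            bad.append(name)
    return bad
```

An empty list back from `verify_restore` means every file in the manifest was restored intact; anything else is a backup you cannot rely on.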
My first-hand experience with this was at Motorola. My company had implemented a huge order entry/manufacturing system for them. Long story short, hard disk failures, backups were bad, DBAs and SysAdmins are in deep trouble. Luckily, we had been dumping the contents of EVERY database table to flat files cuz we didn't trust the DBA. We were able to restore the entire database and have them up and running in no time. Always nice when us consultants can show up the local boys.
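The flat-file dump idea is simple enough to sketch. This is not what we ran back then (that was Informix); it is a minimal Python illustration using the standard-library `sqlite3` module as a stand-in database, with a made-up function name and one CSV file per table.

```python
import csv
import sqlite3
from pathlib import Path


def dump_all_tables(conn, out_dir):
    """Write every user table in a SQLite database to its own CSV file.

    Returns a dict of table name -> number of data rows written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Collect the table names first so the catalog query is finished
    # before we start reading the tables themselves.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]

    written = {}
    for table in tables:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        header = [col[0] for col in cursor.description]
        with open(out / f"{table}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)  # column names as the first line
            rows = 0
            for row in cursor:
                writer.writerow(row)
                rows += 1
        written[table] = rows
    return written
```

Call it with an open connection and a target directory, for example `dump_all_tables(sqlite3.connect("orders.db"), "dumps")` (both names hypothetical). It is no substitute for a real backup, since you lose constraints, indexes and point-in-time consistency, but it is a cheap second copy of the data.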
Turns out the DBA was a major Oracle bigot, and we happened to be using Informix (I really miss that RDBMS). They refused to give us administrative privileges to administer the DB and were less than helpful otherwise. After that fiasco, the tide certainly shifted.
Professional Poster
Join Date: Feb 2008
Location: France
Originally Posted by Atheist
Don't forget to periodically test your backups. I can't tell you how many client sites I've been at where they never tested the backups only to find out they were bad.
I got here three months ago. There were no written procedures, and the four DBAs that were already here (but in another team) had never done a restore / recovery from a cold backup. Due to budget constraints they hadn't even tested any restores. I told them that they weren't DBAs since they hadn't done any of this. They didn't like that. My boss told me off, though she was laughing when she did it.
The move project finishes in a week. We are going to use RMAN instead of the current system of cold backups.
I've asked for a fairly substantial machine as a dev/test/RMAN/Grid Control box. Hopefully they'll have seen the need after our recent events.
P.S. EMC SAN was forced onto us by 'teh mother company'. We got a good deal OK ? tw@s
Professional Poster
Join Date: Apr 2000
Location: Berkshire, UK
Originally Posted by mattyb
P.S. EMC SAN was forced onto us by 'teh mother company'. We got a good deal OK ? tw@s
Bah! Good deal? You wasted days! Pissed off clients! It cost you time, money and good will! This is not a "good deal"!
Testing your backups is so important because traditional backup methods were devised in the dark ages. That tape is still in the data center is a travesty. If you'd been on NetApp using SnapShots, FlexClones, Snap Manager for Oracle, etc, you would have been back up and running in seconds, not days. Plus, you would have likely been getting much better disk utilisation (RAID-DP vs RAID 10) and performance, so even if raw TB to raw TB EMC was less, you would make up the cost differences in efficiencies.
And only in the most extreme cases does running Oracle on FC SAN have a benefit. NFS has turned out to be damn good for this. That's the way Oracle themselves do it for their hosted customers. Even more savings by ditching the FC infrastructure!
Here endeth my rant.