alt.oracle: We don't need no steennkking recoveries!

Since this is a new blog, let's start with something basic - backups. Everybody knows you do need those 'steenking backups'. You know it, the rest of your team knows it, even the suits know it (they read it in a Delta Airlines inflight magazine). But there are a couple of problems I see with backups these days. The first lies with the DBA and, sadly, it can get your ass fired.

Yes, you did a nice and proper RMAN backup of your database. You did the right syntax, you reviewed the log, you even did a 'report need backup' command and made sure it came back clean. The real question is: if you lose everything, do you know how to put it back together? In my observation, it's utterly confounding how few DBAs today know how to do a recovery. Because let's face it - doing a backup is a deceptively simple process. You start RMAN and type 'backup database'. You can make it more complicated than that, but it doesn't always have to be. Backup is clean, orderly and comfortable. Recovery is messy, complicated and scary if you don't know what you're doing. Ask yourself these questions.

You lose a datafile. Do you know how to do a complete recovery without restoring the whole database? Do you know how to do it while the database is online?
You lose an entire database – control files, redo logs and datafiles. Can you do a complete recovery from that? Try it – it's an enlightening exercise.
A brilliant developer drops the most important table in your production database. Can you do a point-in-time incomplete recovery to roll the database forward to the point right before the table was dropped?
You lose an entire database, including the archivelogs. Have you tried the process of pulling the last backup off of tape and then doing a restore?

The list goes on. So how do you learn to do this stuff? You learn by doing it. I tell students in my classes that if you want to know how do recoveries, break stuff and then see if you can fix it. Literally. Build a database that no one else is using. Then, delete the system datafile and try to recover. Delete two redo logs and see how far you can get. Delete the system datafile AND the control files and see what you can do. It's one of the most enlightening experiences a DBA can go through. You'll learn what really makes a database tick. Consider this scenario – your mission critical, never-goes-down, life-blood of the company database experiences a media failure. The suits are surrounding you in your cube, watching your every move, questioning your every decision, telling you how much every moment of downtime is costing them, while sweat pours off your face onto your shaking hands as you try to think of the next command to type. I've seen it happen before.

Several years ago, I worked for a company that had a division that decided they needed their own personal DBA – the "regular" DBA group wasn't giving them enough TLC, I guess. They hired a guy who claimed to have extensive knowledge of Oracle version 9iR2 (way back then, yeah). He started on a Monday. That same day, the server on which his database lived had a drive controller error that wrote corrupt data across the entire RAID array. Time for recovery! Unfortunately, new guy (who was a really nice fellow) didn't have a clue as to what to do, and, worse, he didn't know some basic 9i stuff. He didn't know what an spfile was and he kept trying to "connect internal". Long story short, new guy started on a Monday and was gone by Wednesday.

Spare yourself the agony. Practice, practice, practice. Test, test and test again. We'll talk about the second problem next time. Until then, go break something – and try to fix it.

alt.oracle

Tuesday, January 4, 2011

We don't need no steennkking recoveries!

1 comment: