When I describe (or rather, attempt to describe) technical subjects to people that aren't familiar with them, I tend to fall back on automotive metaphors. No, I'm not a gearhead, and no, it doesn't work every time, but it does seem to be the first thing that pops in my head. In the case of the SQL Server Integration Services Data Flow, I think it works - and I'd like your thoughts on it.
The General Idea
I tend to describe the data flow pipeline as an automotive assembly line. Most people are familiar with the general physical constraints of an assembly line, and a lot of the problems that happen in assembly lines happen in the data flow as well - it's just that they're easier to describe because you're talking about physical things. In particular, the concept of data buffers "moving past" the components in the data flow is very analogous to car chassis' moving past assembly stations in the line. This is typically a very difficult concept to convey, but becomes simpler in this metaphor. A common example I give is that you can think of the Derived Column component like a station on the assembly line that bolts on doors to the car. (That would be a new column.) Or a Lookup component having to attach a customer's desired "options list" to the car, knowing only the car's ID on the assembly line.
Describing Performance Problems
With the assembly line metaphor, you can fairly easily describe the common performance problems that affect a data flow. The difference between blocking and non-blocking components typically means the audience has to try to hold numbers and concepts in their heads, but gets easier with cars. For example, you can describe the Sort component as being like a "limited edition" badge for the cars. This badge is required to have a "1 of 1000" type imprint on it... except it needs to be accurate and know how many cars (and badges) need to be created before it can affix the first one to the first car. Therefore, it has to wait until ALL the cars have reached that point in the assembly line so that it can count them and correctly generate the set of badges.
Once everyone hears that explanation, it's fairly easy to get them to understand that with a "production run" of ten cars, it might not be too bad to have all the chassis wait on the line itself. But when you get to one thousand cars, you need to take floor space inside the building to "hold" the cars. Get to a million, and you realize that you'll have to park the cars outside of the plant, because there just isn't enough floor space (RAM) to hold on to all of them. Keeping some of the cars inside the plant doesn't help at all, and actually reduces the space you have left over for all the other stations (transforms) on the line.
You can even describe the EngineThreads property as actual workers. You have ten, and they can work at any station - but you only have ten...
Does This Work For You?
Of course, the metaphor breaks down handily when you get to the Multicast component... but so far it's seemed fairly robust in getting people to understand how the pipeline can concurrently operate multiple transforms on the data in parallel, even when the flow looks very serial in the designer.
I'd like your feedback on this metaphor - does it work for you? Where else does it break down - and is that breakdown at an acceptable point in how knowledgeable someone would have to be in SSIS in order to spot it?
If this idea doesn't go down in flames, I have some follow on thoughts to take it further...
Book Review: Big Red - Voyage of a Trident Submarine
I've grown up reading Tom Clancy and probably most of you have at least seen Red October, so this book caught my eye when browsing used books for a recent trip. It's a fairly human look at what's involved in sailing on a Trident missile submarine...
2009-03-10
1,439 reads