December 1, 2017 at 8:56 am
Hi,
I'm looking for advice on how to set up a process for data warehousing some data. We currently use an Excel spreadsheet with some VBA code that calls a connection string and then uploads the data into the right tables. As we collect data from different vendors, we need to look at the data (hence Excel), but eventually we store it in the database in an (as much as possible) normalized way. Users open the spreadsheet, import some of the data from the server, correct and normalize it, then send the data back to the database.
Right now the code is very inefficient in terms of run time and error prone. What I would like to do is eliminate the load part of the process with SSIS. The issue is I have only limited experience with Visual Studio and none with SSIS. I see SSIS, SSDT, BIDS etc. mentioned online but can't make sense of most of it. Right now all I use is SSMS to look at the data.
Any ideas on how to go about it?
Much appreciated.
December 1, 2017 at 10:50 am
What version of SSIS/SSDT would you be using?
The reason I ask is that at some point (I can't remember which version) MS added an Excel Data Source connector into SSIS. I think it was some point after 2008 R2, but I can't be sure. Before the Excel Data Source you needed to treat Excel as a flat file source, I think, which is trickier.
Here is something to consider (assuming you have a more recent version of SSDT):
In SSIS add an Excel data source and extract the data from the Excel file.
Use a lookup to compare the values to the original data in the database (this will require another data source).
Any values that don't match are edited rows, so configure a 'No Match Output' for the lookup, and then connect that to an OLE DB Destination (your database).
Something like that should work. Obviously the above is quite vague.
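If it helps to see the compare-and-load step in T-SQL terms as well (the same idea can run from a stored procedure once the Excel data has been landed in a staging table), here is a minimal sketch. dbo.VendorData, dbo.VendorData_Staging and the column names are just placeholders for whatever your real schema looks like.

-- Update rows whose values changed and insert rows that don't exist yet.
MERGE dbo.VendorData AS tgt
USING dbo.VendorData_Staging AS src
    ON tgt.VendorID = src.VendorID
WHEN MATCHED AND (tgt.VendorName <> src.VendorName OR tgt.Amount <> src.Amount) THEN
    UPDATE SET tgt.VendorName = src.VendorName,
               tgt.Amount     = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (VendorID, VendorName, Amount)
    VALUES (src.VendorID, src.VendorName, src.Amount);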
One thing to consider: if your users download a lot of data to Excel, it may be a good idea to write an additional column into the table with VBA. This would be a rowversion-style column that lets you flag edited rows. The reason I mention this is that if you are trying to do a lookup on lots of columns, for lots of rows, your lookup transformation could take a long time. With the row version you could extract only the edited rows... a lot more efficient.
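To illustrate with the same placeholder names as above: assuming the VBA macro stamps something like an IsEdited flag (a hypothetical column) on each row the user changes, and that flag arrives in the staging table with the rest of the data, the write-back only has to touch the flagged rows.

-- Only the rows the macro flagged need to go through the comparison.
UPDATE tgt
SET    tgt.VendorName = src.VendorName,
       tgt.Amount     = src.Amount
FROM   dbo.VendorData         AS tgt
JOIN   dbo.VendorData_Staging AS src
       ON src.VendorID = tgt.VendorID
WHERE  src.IsEdited = 1;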
You may also want to consider some logging, for quality purposes (e.g. extracted rows, updated rows, start and end times).
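A rough sketch of the kind of logging table I mean (names and columns are only illustrative):

CREATE TABLE dbo.EtlRunLog
(
    RunID         int IDENTITY(1,1) PRIMARY KEY,
    ProcessName   nvarchar(128) NOT NULL,
    StartTime     datetime2(0)  NOT NULL,
    EndTime       datetime2(0)  NULL,
    RowsExtracted int           NULL,
    RowsUpdated   int           NULL
);

-- One row per run: insert it at the start, then update it with the counts
-- and the end time when the load finishes (or fails).
INSERT INTO dbo.EtlRunLog (ProcessName, StartTime)
VALUES (N'Vendor data load', SYSDATETIME());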
A couple of pointers to consider with the Excel data source in SSDT: you may need to alter the 64-bit runtime setting in the project properties, otherwise you may not be able to run the project in debug mode, and you may find it doesn't run as you'd hoped when you schedule it. Also, you may need to do some data conversion in SSIS if you have string values in your data (e.g. varchar), as I believe Excel uses Unicode.
good luck
December 3, 2017 at 3:25 pm
Sounds like a pretty big ask. Here are some thoughts from me.
When I started down the path of data warehousing with SQL Server, I too had very little experience with SSIS. I worked with a senior DBA who taught me that a great deal of what we could do in SSIS could also be done in T-SQL. So you can create ETL pipelines that are primarily T-SQL: bulk load flat files into a staging table, clean and conform the data, then normalize (or denormalize) it into a final model, with automation through SQL Agent jobs.
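To make that concrete, here is a stripped-down sketch of what one such T-SQL pipeline might look like; the file path, format options and table names are all placeholders, not a prescription.

-- 1. Land the raw file in a staging table.
TRUNCATE TABLE dbo.VendorData_Staging;

BULK INSERT dbo.VendorData_Staging
FROM 'C:\etl\incoming\vendor_data.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2. Clean/conform in place, e.g. trim stray whitespace from text columns.
UPDATE dbo.VendorData_Staging
SET    VendorName = LTRIM(RTRIM(VendorName));

-- 3. Move only the rows that aren't already in the normalized model.
INSERT INTO dbo.VendorData (VendorID, VendorName, Amount)
SELECT s.VendorID, s.VendorName, s.Amount
FROM   dbo.VendorData_Staging AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.VendorData AS t WHERE t.VendorID = s.VendorID);

-- Each step can live in its own stored procedure and be called in sequence
-- from a SQL Agent job.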
SSIS is a great tool and has some pros and cons compared to just sticking with T-SQL for your ETL. But I would say that learning SSIS is a good step to take, along with learning more advanced T-SQL topics. It will give you the option to go either way, or to do what I actually do today, which is mix and match T-SQL and SSIS together.
Some key points I look at after the fact are how to distribute the ETL process across many resources (i.e. disks, servers, etc.) and how to load data in parallel with the tools available. Another big one is validation of the data going in and out, along with defining business rules such as taxonomies that ultimately help users consume what you are ingesting, cleaning, conforming and exposing.
December 4, 2017 at 1:28 am
Thanks guys for the feedback; I understand the question was quite vague. Having previously worked as an analyst, most of the work I did was building queries to extract data from an existing database through SSMS. All the DBA work was done by somebody else on the team. The administration part is a step up in terms of skills/responsibilities but currently necessary. I like the T-SQL solution for the ETL with SQL Agent jobs as it doesn't stray too far from what I know.
We run the spreadsheets and SSMS on a virtual machine with everybody having access to the server (not ideal), so distributing the process across many resources shouldn't be too much of an issue (I hope) once I've got the initial process done.
December 4, 2017 at 7:13 am
This is a pretty open question. You may want to consider hiring a consultant to get you started. Even that, unfortunately, can be hit and miss. I got started in this area when a consultant we hired produced a cube that had some inaccurate data, and as the DBA I was tapped to figure out what was wrong. But there are lots of good ones. SSIS has a steep learning curve, but it's well worth it if you have any significant volume of data. I would also suggest you invest in a book on data warehousing by Ralph Kimball. He has several, but one is a good intro. The title escapes me at the moment, but I can get it if you need it.