Reading data from Oracle using values in MSSQL2005

  • I have a requirement where I need to read data from a huge Oracle database using parameters / values stored in a MSSQL2005 database.

    The oracle database contains a table that grows by 100 million records a day. It is therefore not practical or time efficient to pull all the records and then try and find the records that I need to process. The key values into the oracle table for the records that I need are stored as values in the MSSQL database.

    It is not possible at this stage to create a table on the oracle DB and load the values (+- 100K a day) and then run an oracle job / scrip to extract the rows that I need.

    I use SSIS to do the all other ETL tasks. Can somebody perhaps suggest a solution? Conceptually I know what to do but for some reason cant get it to work. I fail to pass the value to the foreach loop.

    I know I must read the values from the sql table and then for each of the values pass the value as a parameter to a sql statement that will select the records with parameter value and insert into another MSSQL table for further processing.

    Any suggestions will be greatly appreciated.

  • You can use an expression on the query for the oracle connection. Programmatically, you can construct the query you wish to run via script tasks, and then in the expression for the query running against the oracle connection, put the variable in.

    I have done this with success connecting to an OpenLink ODBC data connection.

    For example,

    You have a data flow task pulling from Oracle and dumping into a SQL Server table.

    The Oracle connection needs to have a "dummy query" that will parse against the connection successfully.

    On that same connection, you can change the expression of the Source Query (not exactly sure on the verbiage here) to use the variable expression from the package variable storing the query you'd like to use.

    The variable may also need to contain a "dummy query" that will parse against the Oracle connection (not sure if it compiles that way or not)

    Please feel free to e-mail me and I can provide some screen shots of what I am referring to.

  • Hello

    I saw ur link and have aquick question. help greatly appreciated. i need to have an odbc connection to oracle db and then access the info via sql select. pls help me.

  • Create a global package variable called "strSQL" as a String. Put a value in it that will parse against the connection and provide the columns you are trying to get. (ie, SELECT col1, col2 from myTable where col1 = 1234, so as to only pull back one record to parse and verify with)

    Create a Script Task and set "User::strSQL" as a read/write variable.

    In the code of the script task, set the value of the SQL you want to use against the ODBC connection as:

    Dts.Variables("User::strSQL").Value = "SELECT col1, col2 FROM myTable"

    Create a new connection manager (right click in conn managers window and choose "New Connection..." and create your ODBC connection.

    Create a data flow task; Add a "DataReader Source" object to the data flow and assign it to use the connection manager you just created.

    Return to the "Control Flow" and right click on the Data Flow Task you created and choose properties to bring up the properties window. From that window, click in the empty box next to "Expressions" and then click on "..." to bring up the Expressions Editor. From there, choose [DataReader Source].[SqlCommand] for the property in the drop down menu, and for the expression itself, drop down the Variables in the Expression Builder and drag and drop "User::strSQL" into the Expression window below. The expression will look like "@[User::strSQL]" (without the quotes). Click on "Evaluate Expression" to verify the query you placed in the value for strSQL shows up in the evaluated value window.

    Now, when you run the package, the query that you set in the Script Task will overwrite the value that you entered into the strSQL value manually in the variable window.

    I like to put the sample query in the window so that you know what columns are being selected in the oracle connection as well as allowing you to verify that the expression is in fact pulling the variable value properly.

    Then you just need to set up an OLE DB data destination and set up the appropriate destination objects to get the data into your SQL Instance.

    Let me know how you make out.

    This also has some generic information about what I explained above:

    http://www.sqlservercentral.com/Forums/Topic422374-148-1.aspx

  • Thank you Stephen for such a detailed explanation, this helped me a lot!!!

  • Glad to help!

  • Thank you for taking time to write up the explanation and how to's for this. I've been researching a great bit to see how to accomplish this and finally encountered your thread. Faniev's original posting, including the constraints, sounds very much like what I need to do. I was attempting to do this by building a Data Flow with two input sources-- the MSSQL and Oracle Db tables flowing into a Merge Join, then into a Row Counter and then outputting the results into a staging table in the MSSQL Db. So far I've had some success but, obviously since my MSSQL table contains approx 1.9 million rows/values that need to be included in the query against a 70+million row table in Oracle that approach is not going to work.

    Since I'm still a little new with SSIS, I apologize if this would be evident to a more seasoned SSIS user but I have some questions concerning the setup for the various package object. Mainly, they have to do with where each of the SELECT's should be pointing:

    1. I've placed the Control Flow Script Task just before, and then connected to, the Data Flow Task. Is that the correct placement?

    2. In the code of the Script Task where we enter "...Value = "SELECT col1, col2 FROM myTable" is the myTable representing the MSSQL or the Oracle table?

    3. In the global package variable, where we need to provide the SQL query string: "...(SELECT col1, col2 from myTable where col1 = 1234, ...)" Would this need to be coded to Select from the Oracle db source or the MSSQL source?

    4. Where you indicated "...set the value of the SQL you want to use against the ODBC connection as: Dts.Variables("User::strSQL").Value = "SELECT col1, col2 FROM myTable" Again, would this be the MSSQL source or the Oracle source?

    So far, I've managed to have the package work when using a data-value that's present in both the MSSQL and Oracle tables. (i.e. where you have the variable set to pull across only one record.) It's when I take steps to remove the hardcoded data value (so the MSSQL table values will be used) is when the package simply defaults to selecting all of the rows from the Oracle source (instead of matching up with the values already residing on the MSSQL table.) I'm sure it's because I'm missing something in how the various objects are setup/configured but I'm not able to determine which? Can you help?

    _____________________________________________________________
    [font="Comic Sans MS"]I'm constantly amazed when I learn about what I never knew that I didn't know.[/font]:smooooth:

  • Hi,

    I'm glad you appreciated the effort on that post.

    To answer your questions:

    1. This is the correct flow. You want the SQL statement to be set to the query you want to run before you enter the data flow task.

    2. The select statement would be, in your case, the Oracle statement. When you're setting up the statement in the script task, you're setting it up to query your data source.

    3. In the global variable, that would also be your Oracle source. This should return the same data types and column names as your modified query that you set in the script task will return. The reason is that when you set the data flow task to use the variable, it will look at what the variable contents are, and then that is what will allow you to set up your mappings properly. Think of this as overlaying your query in the package at your run time with what you're building within the script task (in most cases, the only different would be the addition of a where clause; in others, you might want to use cast/convert functions to return the correct length so that all the metadata matches up correctly). In a sense, you are tricking the SSIS engine prior to running the package.

    4. Again, this would be your source, Oracle.

    To help with that, this high level flow would be in play:

    Global variable = your source schema you want to move into SQL Server. (ie. SELECT int_col1, datetime2_col2 from t1)

    Script task = where you set the variable value , typically your source schema + your conditionals to set the global package variable back to the query you want to run on your source. (ie. SELECT int_col1, datetime2_col2 from t1 WHERE datetime2_col2 = '1/1/2009') -- this would be set with VB.NET code.

    Data flow task = source oledb connection would be set to the variable. (for ODBC connections, you'll need to look into expressions and setting the values that way, unfortunately i dont have any handy links for that).

    Hope that information helps!

    Steve

  • Hello Steve,

    Thanks for your reply. After working on this a bit more today, I guess I'm still a bit unclear on how/where the package is picking up the primary_key data values from the MSSQL Db source table so they can in turn be included in the Select/compare query against the Oracle table source? If I've read your reply correctly, it would seem that each of the package objects are all being configured to point to the Oracle source? Is it possible that I should be setting up the Script Task component to first read the MSSQL Db table in order to load the data values that I'l need in the subsequent Select Query of the Oracle source that runs under Data Reader Object?

    _____________________________________________________________
    [font="Comic Sans MS"]I'm constantly amazed when I learn about what I never knew that I didn't know.[/font]:smooooth:

  • Maybe I misread your initial post. Are you pulling from Oracle into SQL or the other way around?

    All of your queries should be set to pull from your source, whether that is oracle or SQL. Ultimately, you want that global package variable to contain your modified query that will execute on the source.

    If you are trying to compare the PK on the destination with a value on the source, that might be more tricky because you'd need to first query your destination table for the most recent value, then return that value to the query you want to run on the source. Can be done, just wasn't in my initial thought process.

    What I used to do was have a query to a Informatica data source that would pull all data from the previous day (Select * from table where datecolumn = today - 1), then i would have another task to query that raw table from within SQL because I was more familiar with the T-SQL syntax.

    Certainly not trying to confuse you at all, and want to help you get to your end state.

    Steve

  • Hi Steve,

    Let me back up a bit and clarify my components and how I'm trying to get them handled with the package. I have two data sources, with similar tables, that I'm working with: a MSSQL 05 Db table and an Oracle 9i Db table.

    - The SQL05 Db source has a table named NULL_ROWS. It contains one column named CLINICAL_EVENT_ID (decimal 18,0). CLINICAL_EVENT_ID also qualified and was set as the PK. There are 1.9+ million unique data values in each row of the table.

    - The Oracle9i Db source has a table, named CLINICAL_EVENT, with a PK column named CLINICAL_EVENT_ID (NUMBER). It also has 60 some additional columns that I need to pull in as well. This table is huge-- 71+ million rows.

    Essentially what I want the package to do is read in all the CLINIACL_EVENT_ID's from NULL_ROWS and then use them in a SELECT query over the Oracle table to pickup all 1.9 million matching rows. Each matching instance (row) is then shoved into a staging table that also resides on the SQL05 Db to be used in subsequent processing in the package.

    Unless I'm not understanding something, I think the example solution you provided should work in this case, no?

    _____________________________________________________________
    [font="Comic Sans MS"]I'm constantly amazed when I learn about what I never knew that I didn't know.[/font]:smooooth:

  • Ok- now it makes sense. The way my example would work is that you would need to generate a select statement based on all of the values in the table. I don't think it would be ideal based on the number of rows because your select statement would ultimately generate to "SELECT col1, col2 from table_71mil where col1 IN ('value1' ... 'value1_9mil') which would be a massive string that I don't think is possible, nor logical.

    You could painfully loop through all 1.9 million rows, generating a select statement for each one, then executing that select statement against the oracle source. An execute SQL task could return a full result set, then you could have a for each loop go through each row. The first step of the for each would be a script task to dynamically set the select statement, then the second would be the data flow task to do the data pull. However, that may or may not take a long time to complete, but it would work. The script task would build a single "select col1 from table_71mil where col1 = 'value1'", then on the second iteration it would be "select col1 from table_71mil where col1 = 'value2'", and so on until 'value1900000'.

    Is there any way to have your 1.9 million records added to the oracle DB? If so, you could simplify the solution and have your package simply query oracle with "select col1, col2, from table_71mil where col1 IN (Select col1 from table1_9mil)", but that would require your Oracle DBA to allow you to import that table.

    Off the top of my head, I'm not sure of a way other than row-by-row, but I'm sure there's a way. Maybe Jeff Moden will hit up this thread and offer some other suggestions. He's usually got some nifty ideas.

    Steve

  • Another option that I've been looking at is the Lookup transformation component within the data flow task toolbox.

    I was able to get this configured to do what you need to do based on two SQL Server tables, but I unfortunately don't have any Oracle DBs at my disposal to try it out.

    This article from Jamie Thomson should help you configure the Lookup:

    http://www.sqlis.com/post/Get-all-from-Table-A-that-isnt-in-Table-B.aspx

    Look at Method 2 further down in the page.

    It's tricky to configure. Rows that don't match will go through the Lookup transformation as "errors". So, you need to configure error output on the "Reference Table" tab and choose "Redirect Row" in the Error column. However, you don't need to configure the error output stream out of the error (red arrow). The success stream (green arrow) should go to your destination table.

    Reading this site seems like it will work for you:

    http://microsoftdw.blogspot.com/

    Looks like you'll want to use the Microsoft OLEDB Driver for Oracle.

    That lookup might be what you're looking to do.

    Steve

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply