January 13, 2015 at 5:46 pm
I want to insert data into a table based on folder descriptions, file descriptions, and the content within files.
I have multiple folders, and I want to insert each folder description as an attribute in a table. There are multiple .csv files within each folder; I want each file description as an attribute as well, and then the content of each file, all in the same table. Can a Foreach container be nested? Any help would be greatly appreciated. Thank you!
To summarize:
Folder #1
    File #1 (.csv)
        File#1.First
        File#1.Last
    File #2 (.csv)
        File#2.First
        File#2.Last
Folder #2
    File #1 (.csv)
        File#1.First
        File#1.Last
    File #2 (.csv)
        File#2.First
        File#2.Last
The table should look as follows:
FolderName  FileName  First   Last
1           1         John    Smith
1           1         Jim     Jones
1           2         Sara    Black
1           2         Jan     White
2           1         Bob     Blue
2           1         Jeremy  Red
2           2         John    Pink
2           2         Fred    Silver
January 14, 2015 at 12:18 am
Just a standard FOREACH loop should be sufficient for this.
If you configure the FEL so that, for every file processed, the full file path is written to a variable (select the 'Fully qualified' option when retrieving the file name), you will have everything you need.
You'll need to use expressions or a script task to parse out the file name and path from this variable.
The file content should be handled by a standard source --> destination data flow. Within your data flow, add the file name and path as derived columns and map them to the destination table columns as required.
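For illustration only (this sketch is not from the original reply, and assumes SSIS 2012 or later plus a hypothetical variable named User::FilePath that receives the fully qualified path from the loop), Derived Column expressions along these lines would split out the file name and folder path:

FileName:   RIGHT(@[User::FilePath], FINDSTRING(REVERSE(@[User::FilePath]), "\\", 1) - 1)
FolderPath: SUBSTRING(@[User::FilePath], 1, LEN(@[User::FilePath]) - FINDSTRING(REVERSE(@[User::FilePath]), "\\", 1))

For C:\Project\1\1.csv these evaluate to 1.csv and C:\Project\1 respectively.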
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 14, 2015 at 4:11 pm
Thank you so much Phil for the reply!
I probably made this a little harder than it could have been, but I solved it with the following steps:
1. A Foreach container is created
2. It grabs the full path of each file into a variable
3. It imports the file path variable into a temporary processing table
4. It imports the [first] and [last] content from each file into a second temporary processing table
5. Move [first] and [last] into the production table
6. Move the full file path into the production table for every record where [first] and [last] are populated but no file path has been set yet
7. Clear the processing tables for the next file
8. Once all folders and files have been processed, I use a SQL script task to identify the folder description from the imported file path (a rough T-SQL sketch of this step follows the list)
9. Clean up the file path string to include only the file name, also via a SQL script task
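For steps 8 and 9, something like the following T-SQL would do the parsing. This is a sketch of one way to do it, not the actual code used; the table and column names (dbo.ProcessingTable, FullPath) are assumptions for illustration.

-- hypothetical table and column names, for illustration only
-- FullPath holds values such as 'C:\Project\1\1.csv'
SELECT
    p.FullPath,
    f.FileName,                                                                            -- step 9: file name only
    RIGHT(d.FolderPath, CHARINDEX('\', REVERSE(d.FolderPath)) - 1) AS FolderDescription    -- step 8: last folder in the path
FROM dbo.ProcessingTable AS p
CROSS APPLY (SELECT RIGHT(p.FullPath, CHARINDEX('\', REVERSE(p.FullPath)) - 1)) AS f(FileName)
CROSS APPLY (SELECT LEFT(p.FullPath, LEN(p.FullPath) - LEN(f.FileName) - 1)) AS d(FolderPath);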
January 15, 2015 at 12:13 am
I probably made this a little harder than it could have been
Let me remove any doubt: yes you did! 🙂
But it works and that's great. Should you decide in future that some optimisation is required, feel free to start another thread and maybe we can help streamline things.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 15, 2015 at 1:55 pm
Thank you so much, Phil. A follow-up thread has been posted, as I am in dire need of optimization.
January 15, 2015 at 2:42 pm
J0shu@ (1/14/2015)
1. A Foreach container is created
2. It grabs the full path of each file into a variable
3. It imports the file path variable into a temporary processing table
4. It imports the [first] and [last] content from each file into a second temporary processing table
5. Move [first] and [last] into the production table
6. Move the full file path into the production table for every record where [first] and [last] are populated but no file path has been set yet
7. Clear the processing tables for the next file
8. Once all folders and files have been processed, I use a SQL script task to identify the folder description from the imported file path
9. Clean up the file path string to include only the file name, also via a SQL script task
Try something more like this:
1. A Foreach container is created
2. It grabs the full path of each file into a variable
3. Data flow selecting first and last from the file source
4. Derived Column transformation parsing your variable into one filename column and one folder path column
5. Lookup transformation looking up the folder description corresponding to the folder path (a sketch of a possible reference query follows this list)
6. Destination to the production table
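To make step 5 concrete, the Lookup would use a reference query against whatever table holds the folder descriptions. This is only a sketch; dbo.FolderDescription and its columns are assumed names, not objects mentioned in the thread.

-- hypothetical reference table for the Lookup transformation
SELECT FolderPath, FolderDescription
FROM dbo.FolderDescription;

In the Lookup, join the data flow's folder path column to FolderPath and return FolderDescription to the pipeline.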
Of course, with that many files, it will still be slow, because it is doing one file at a time, and because files aren't indexed the way tables are, so rows don't get read as efficiently.
Assuming there are no concurrency issues, consider running multiple foreach loops in parallel. For example, you could have one loop go through files with names of "a*.*", another doing "b*.*", etc. Or you could have different loops for different directories. The optimal choice would depend on the directory structure and file naming convention, but at the end of the day the two things you can do are:
1. Make each file's import as fast as possible
2. Do multiple imports at the same time
January 16, 2015 at 12:37 am
Nevyn has provided you with some good advice here; I agree with everything he has written. I'm sure that we can find you some links on setting up parallel FELs with data flows, if you cannot.
You'll have to watch out for locking issues though, if the target table is the same. There are strategies that you can employ to do that too, should it be a problem.
I would be interested to know more about the nature of the 'folder description' – what is that?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 16, 2015 at 8:28 am
Hello Phil,
I am working on the alterations suggested by Nevyn. The folder descriptions are integer IDs, incrementing from 1 to 2500. Also, each folder has 200 files whose names are also integer IDs, 1.csv through 200.csv.
January 16, 2015 at 8:39 am
J0shu@ (1/16/2015)
Hello Phil, I am working on the alterations suggested by Nevyn. The folder descriptions are integer IDs, incrementing from 1 to 2500. Also, each folder has 200 files whose names are also integer IDs, 1.csv through 200.csv.
OK, got that, but where are the descriptions stored, or how are they accessed?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 16, 2015 at 9:02 am
The physical folders are on a local drive of the machine running Visual Studio. I'm just running this in debug mode, as this is a one-time-only execution; I don't have to deploy and schedule it.
On the DB side I am capturing the entire path to the file:
C:\Project\1\1.csv
C:\Project\1\2.csv
.
.
.
C:\Project\2\1.csv
C:\Project\2\2.csv
I'm capturing the entire path to the file and inserting that into the DB. Then I'm parsing the path via SQL to capture the folder description.
January 16, 2015 at 9:29 am
J0shu@ (1/16/2015)
The physical folders are on a local drive of the machine running Visual Studio. I'm just running this in debug mode, as this is a one-time-only execution; I don't have to deploy and schedule it. On the DB side I am capturing the entire path to the file:
C:\Project\1\1.csv
C:\Project\1\2.csv
.
.
.
C:\Project\2\1.csv
C:\Project\2\2.csv
I'm capturing the entire path to the file and inserting that into the DB. Then I'm parsing the path via SQL to capture the folder description.
If the description you want is just the number of the folder, without the rest of the path, you should be able to get that directly with an expression in the Derived Column transformation, without doing a lookup or parsing via SQL.
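For example (a sketch only, assuming SSIS 2012 or later, where TOKEN and TOKENCOUNT are available, and a hypothetical variable named User::FilePath holding the full path): the folder number is the second-to-last backslash-delimited segment, so a Derived Column expression like this extracts it directly:

TOKEN(@[User::FilePath], "\\", TOKENCOUNT(@[User::FilePath], "\\") - 1)

For C:\Project\1\1.csv this returns 1, and the file name is simply TOKEN(@[User::FilePath], "\\", TOKENCOUNT(@[User::FilePath], "\\")).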
January 16, 2015 at 9:21 pm
Hey Nevyn,
A higher-level question with regard to implementation strategy: when implementing an SSIS package, should as much of the processing as possible be put on the SSIS side rather than the SQL side? And how do you determine when the DB should handle something directly vs. SSIS?
January 17, 2015 at 8:19 am
J0shu@ (1/16/2015)
Hey Nevyn, a higher-level question with regard to implementation strategy: when implementing an SSIS package, should as much of the processing as possible be put on the SSIS side rather than the SQL side? And how do you determine when the DB should handle something directly vs. SSIS?
I'd say the answer is "it depends".
There are a number of factors that could impact the decision. There are some things SSIS is better at, and some things T-SQL would be better at, and some of it will depend on your setup (whether the package is running from the same machine, etc).
In the case of your problem, we suggested parsing the filenames in a derived column transformation mostly for the sake of simplicity. It is a simple formula, and that transformation applies to rows as they pass through, so it should not slow things down much. If, on the other hand, you insert the data and then run SQL on it, you impact your ability to do things in parallel, make your DB server do a bit more work, and write data (the full path) to the DB that you won't be using.
January 18, 2015 at 1:32 am
J0shu@ (1/16/2015)
Hey Nevyn, a higher-level question with regard to implementation strategy: when implementing an SSIS package, should as much of the processing as possible be put on the SSIS side rather than the SQL side? And how do you determine when the DB should handle something directly vs. SSIS?
Well, I would say that certain things should generally be avoided in SSIS:
1) Anything that blocks the data pipeline
This includes (in-SSIS) aggregation and sorting.
Better idea: use the database engine for such things.
2) Anything that drastically slows the pipeline
The OLE DB Command component in the data flow comes immediately to mind, with its RBAR implementation.
Better idea: fire the rows to a staging area and run a set-based process after the data flow has completed.
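As a rough illustration of that staging pattern (a sketch only; the table, column, and key names are assumptions, not objects from this thread): instead of an OLE DB Command issuing one UPDATE per row, land the rows in a staging table from the data flow and run a single set-based statement afterwards.

-- hypothetical staging and production tables, for illustration only
UPDATE p
SET    p.FolderName = s.FolderName,
       p.FileName   = s.FileName
FROM   dbo.Production AS p
       JOIN dbo.Staging AS s
         ON s.ProductionId = p.ProductionId;   -- assumed join key

TRUNCATE TABLE dbo.Staging;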
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin