May 5, 2010 at 2:35 pm
I have a SQLServer agent job where the first step is to copy an existing file (backing it up), and the second step is to use the WinGNU sed.exe program to edit a file I need to upload. A non-technical customer keeps making mistakes sending us data with null values, so the data comes across as || instead of |""|. sed (UNIX Stream Editor) is supposed to replace || with |""| everywhere in the file.
This is what I've typed into the Operating System CmdExec job step in SQL Server Agent using SQLServer Management studio:
sed -i s/^|^|/^|\"\"^|/g "\\directory\filename.ext"
I am sure there are millions of better ways to do this, and I don't care. I just want to know how I can debug this and see what is output by sed and what the system sees when it runs this command. The History Viewer in SSMS doesn't tell me anything useful. I am using the correct error code checking.
When I open "\\directory\filename.ext" to view changes, || remains |""|. Running the same command in cmd.exe produces the correct results. What is the deal with CmdExec in SQLServer Agent? I *HATE* Blackboxes.
May 5, 2010 at 2:42 pm
Thinking about another way to do this, why not stage the data as is to a staging table in SQL Server and use T-SQL to fix the null values before importing the data into its final location? I'd rather do it that way than try to edit the file before importing it.
May 5, 2010 at 2:52 pm
I'm not sure you understand the problem.
bcp does not have a good way to represent NULL values in format files. If character data that contains a delimiter is escaped using, say, double quotes (what I am doing), then to represent NULL you have no choice but to do "". Otherwise, bcp dies.
I've asked the customer a couple times to get various facets of this extract right, and they are simply a prospect and not a paying customer. All I want to do is load some sample data for them successfully without issues, so they can see how much our tools rock.
bcp is a pretty lame utility, sure, but this is not a problem that can be solved using staging tables (which I am already using). If I were to go another route, I would've gone Windows Task Scheduler, which is what I would've preferred, but unfortunately do not have control over that option.
May 5, 2010 at 2:59 pm
Permissions problem? When you run it under your credentials, it works, but the agent is using the credentials of the account the agent is running under. The first thing I would check would be the ability of the agent account to access the file and run the sed comand.
May 5, 2010 at 3:03 pm
johnzabroski (5/5/2010)
I'm not sure you understand the problem.bcp does not have a good way to represent NULL values in format files. If character data that contains a delimiter is escaped using, say, double quotes (what I am doing), then to represent NULL you have no choice but to do "". Otherwise, bcp dies.
I've asked the customer a couple times to get various facets of this extract right, and they are simply a prospect and not a paying customer. All I want to do is load some sample data for them successfully without issues, so they can see how much our tools rock.
bcp is a pretty lame utility, sure, but this is not a problem that can be solved using staging tables (which I am already using). If I were to go another route, I would've gone Windows Task Scheduler, which is what I would've preferred, but unfortunately do not have control over that option.
Actually, I do understand the problem, I have done hundreds (if not more) imports of data from text files.
It may take a couple of days to put together a test suite, but let's see what I can do. If you could put toether a test suite that mirrors your current problem, that would help. You would need to provide a format file and the CREATE TABLE statement for the final destination of the data.
May 5, 2010 at 3:28 pm
CREATE TABLE [dbo].[zz_u_test_TEMP](
[col1_dirty] [varchar](50) NULL,
[col2_dirty] [varchar](50) NULL,
[col3_dirty] [varchar](500) NULL,
[col4_dirty] [varchar](50) NULL,
[col5_dirty] [varchar](50) NULL,
[col6_dirty] [varchar](50) NULL,
[col7_dirty] [varchar](50) NULL,
[col8_dirty] [varchar](50) NULL,
[col9_dirty] [varchar](50) NULL,
[col10_dirty] [varchar](50) NULL
) ON [PRIMARY]
9.0
16
1 SQLCHAR 0 0 "\"" 0 dummy1 ""
2 SQLCHAR 0 50 "\"" 1 col1_dirty ""
3 SQLCHAR 0 0 "|\"" 0 dummy2 ""
4 SQLCHAR 0 50 "\"" 2 col2_dirty ""
5 SQLCHAR 0 0 "|\"" 0 dummy3 ""
6 SQLCHAR 0 500 "\"" 3 col3_dirty ""
7 SQLCHAR 0 0 "|\"" 0 dummy4 ""
8 SQLCHAR 0 50 "\"|" 4 col4_dirty ""
9 SQLCHAR 0 0 "\"" 0 dummy5 ""
10 SQLCHAR 0 50 "\"|" 5 col5_dirty ""
11 SQLCHAR 0 0 "\"" 0 dummy6 ""
12 SQLCHAR 0 50 "\"|" 6 col6_dirty ""
13 SQLCHAR 0 0 "|" 7 col7_dirty ""
14 SQLCHAR 0 50 "|" 8 col8_dirty ""
15 SQLCHAR 0 50 "|" 9 col9_dirty ""
16 SQLCHAR 0 50 "" 10 col10_dirty ""
The last line should have end of line escape \ n but I don't know how to enter that in this forum's textbox and am not gonna waste time figuring that out.
Sample data:
"1"|"7"|"D"|"2"|"_"||05/25/2007|05/25/2007|10.00|2.15
Should be:
"1"|"7"|"D"|"2"|"_"|""|05/25/2007|05/25/2007|10.00|2.15
Successful agent job message:
Date5/4/2010 6:07:10 PM
LogJob History ((TEST) SED TEST)
Step ID2
Serversql_server3
Job Name(Daily Covenant) Daily Upload
Step Namesed files zz_u_test_TEMP Table x2
Duration00:00:00
Sql Severity0
Sql Message ID0
Operator Emailed
Operator Net sent
Operator Paged
Retries Attempted0
Message
Executed as user: DOMAIN\sql_server3. The step did not generate any output. Process Exit Code 0. The step succeeded.
May 5, 2010 at 3:31 pm
Have you tried running the command out of Managment Studio. I often find I get much better feedback here than when I tests directly from the SQL Agent.
My typical build process for something like this is:
1) Run it from the cmd line, when it works..
2) Run it from MS, when it works...
3) Put it into a Job
Leo
Is it Friday yet?
Leo
Nothing in life is ever so complicated that with a little work it can't be made more complicated.
May 5, 2010 at 3:33 pm
Yes, of course, I've done all that debugging. The problem is there is no way I know of to use SSMS + Agent such that I can see what the problem is. This is why Windows Task Scheduler would be so much easier, because once I tested it in cmd, I just stuff it in a .bat and am done with it.
May 5, 2010 at 4:41 pm
johnzabroski (5/5/2010)
Yes, of course, I've done all that debugging. The problem is there is no way I know of to use SSMS + Agent such that I can see what the problem is. This is why Windows Task Scheduler would be so much easier, because once I tested it in cmd, I just stuff it in a .bat and am done with it.
Have you tried including the exact path for sed.exe like this:
c:\gnused\sed.exe -i s/^|^|/^|\"\"^|/g "\\directory\filename.ext"
(Replace gnused for the right subdir)
---------------------------------------------------------------------
Use Full Links:
KB Article from Microsoft on how to ask a question on a Forum
May 5, 2010 at 4:49 pm
johnzabroski (5/5/2010)
Sample data:
"1"|"7"|"D"|"2"|"_"||05/25/2007|05/25/2007|10.00|2.15
Should be:
"1"|"7"|"D"|"2"|"_"|""|05/25/2007|05/25/2007|10.00|2.15
First, thanks for the sample data.
In going with Lynn's suggestion, let's first load this stuff into a temp (aka staging) table.
IF OBJECT_ID('tempdb..#test') IS NOT NULL DROP TABLE #test
CREATE TABLE #test(RowID int IDENTITY, testdata varchar(1000))
insert into #test
select '"1"|"7"|"D"|"2"|"_"||05/25/2007|05/25/2007|10.00|2.15'
Instead of inserting the data this way, just use the BULK INSERT command to load the file. Do NOT use the format file!
Next, you're going to need the DelimitedSplit function. I recommend that you create a permanent TALLY table, and change this function to reference that directly. See the link in the code below for how to do this. However, this function makes a DelimitedSplit function with a virtual tally table... you can use this also, but it will usually be better to have one already set up in a permanent table.
IF OBJECT_ID('dbo.DelimitedSplit') IS NOT NULL DROP FUNCTION dbo.DelimitedSplit
GO
CREATE FUNCTION [dbo].[DelimitedSplit] (
@list varchar(max),
@Delimiter char(1)
)
RETURNS TABLE
AS
RETURN
-- first, need to break down into separate items.
-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
-- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.
WITH -- for a permanent tally table, remove the Tens - Tally lines below
Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ),
Thousands (N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),
Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),
Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions),
ItemSplit (ItemOrder, Item) AS (
SELECT N,
RTRIM(LTRIM(SUBSTRING(@Delimiter + @list + @Delimiter,N+1,
CHARINDEX(@Delimiter,@Delimiter + @list + @Delimiter,N+1)-N-1)))
FROM Tally -- for a permanent tally table, put the schema (aka dbo.) in here.
WHERE N < LEN(@Delimiter + @list + @Delimiter)
AND SUBSTRING(@Delimiter + @list + @Delimiter,N,1) = @Delimiter
)
Now, let's break that string down into its parts, and then put it back together in the columns that you need.
;WITH CTE AS (
select t1.*, s.*
from #test t1
CROSS
APPLY dbo.DelimitedSplit(REPLACE(t1.testdata, '"', ''), '|') s
-- the replace function removes all of the double-quotes in the line.
)
-- the following command will show you what it looks like broken down
-- select * from CTE
-- now, let's show the columns from the delimited string.
-- You will probably want to convert them to the proper data types here,
-- as shown below with the date fields.
select RowID,
Col1 = MAX(CASE WHEN ItemID = 1 THEN NULLIF(Item,'') ELSE NULL END),
Col2 = MAX(CASE WHEN ItemID = 2 THEN NULLIF(Item,'') ELSE NULL END),
Col3 = MAX(CASE WHEN ItemID = 3 THEN NULLIF(Item,'') ELSE NULL END),
Col4 = MAX(CASE WHEN ItemID = 4 THEN NULLIF(Item,'') ELSE NULL END),
Col5 = MAX(CASE WHEN ItemID = 5 THEN NULLIF(Item,'') ELSE NULL END),
Col6 = MAX(CASE WHEN ItemID = 6 THEN NULLIF(Item,'') ELSE NULL END),
Col7 = MAX(CASE WHEN ItemID = 7 THEN CONVERT(DATETIME, NULLIF(Item,'')) ELSE NULL END),
Col8 = MAX(CASE WHEN ItemID = 8 THEN CONVERT(DATETIME, NULLIF(Item,'')) ELSE NULL END),
Col9 = MAX(CASE WHEN ItemID = 9 THEN NULLIF(Item,'') ELSE NULL END),
Col10 = MAX(CASE WHEN ItemID = 10 THEN NULLIF(Item,'') ELSE NULL END)
FROM CTE
GROUP BY RowID
ORDER BY RowID
And, here you go. One nice, simple routine for loading it into a staging table, breaking it apart, and showing things row by row.
Edit: spelling changes...
Edit2: corrected name misspelling
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 5, 2010 at 6:02 pm
johnzabroski (5/5/2010)
I am sure there are millions of better ways to do this, and I don't care. I just want to know how I can debug this and see what is output by sed and what the system sees when it runs this command.
Well the simple answer is that you can't. Jobs may behave differently when running under the SQL Server Agent process account (the default) and the debugging facilities are poor. That said, it is what it is, so the best advice is probably just to get over your dislike of Black Boxes, and set the job up correctly 🙂
You may find it necessary to set up a proxy account so that the job runs in an appropriate permissions context. There are some details here: How to: Create a CmdExec Job Step (SQL Server Management Studio) and here: Creating SQL Server Agent Proxies David Webb mentioned all this a while back, but hopefully those links will help you progress.
Don't be too hard on bcp - it's pretty flexible (given properly-formatted input) but it isn't magic! Even SSIS can't handle embedded delimiters properly without resorting to custom components or .NET scripting. The NULL problem only arises because of the work-around you are trying to employ to bypass another work-around: the double quotes.
Paul
May 5, 2010 at 8:49 pm
I've had a much better idea. Create an XML format file like so:
<?xml version="1.0" ?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="01" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="02" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="03" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="04" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="05" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="06" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="07" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="08" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="09" xsi:type="CharTerm" TERMINATOR="|" MAX_LENGTH="50"/>
<FIELD ID="10" xsi:type="CharTerm" TERMINATOR="\r" MAX_LENGTH="50"/>
</RECORD>
<ROW>
<COLUMN SOURCE="01" NAME="col1_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="02" NAME="col2_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="03" NAME="col3_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="04" NAME="col4_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="05" NAME="col5_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="06" NAME="col6_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="07" NAME="col7_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="08" NAME="col8_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="09" NAME="col9_dirty" xsi:type="SQLVARYCHAR" />
<COLUMN SOURCE="10" NAME="col10_dirty" xsi:type="SQLVARYCHAR" />
</ROW>
</BCPFORMAT>
And use the BULK OPENROWSET provider to allow you to pre-process the raw data before inserting into the destination table:
INSERT dbo.zz_u_test_TEMP WITH (TABLOCK)
(
col1_dirty, col2_dirty, col3_dirty, col4_dirty, col5_dirty,
col6_dirty, col7_dirty, col8_dirty, col9_dirty, col10_dirty
)
SELECT REPLACE(RowSource.col1_dirty, CHAR(34), SPACE(0)),
REPLACE(RowSource.col2_dirty, CHAR(34), SPACE(0)),
REPLACE(RowSource.col3_dirty, CHAR(34), SPACE(0)),
REPLACE(RowSource.col4_dirty, CHAR(34), SPACE(0)),
REPLACE(RowSource.col5_dirty, CHAR(34), SPACE(0)),
REPLACE(RowSource.col6_dirty, CHAR(34), SPACE(0)),
RowSource.col7_dirty,
RowSource.col8_dirty,
RowSource.col9_dirty,
RowSource.col10_dirty
FROM OPENROWSET
(
BULK 'C:\Documents and Settings\Paul\Test.data',
FORMATFILE = 'C:\Documents and Settings\Paul\format.xml',
CODEPAGE = 'RAW',
FIRSTROW = 0,
LASTROW = 0,
MAXERRORS = 0,
ROWS_PER_BATCH = 0
) RowSource;
This preserves all the benefits of minimally-logged fast bulk load, while giving you the opportunity to apply transformations in the SELECT clause. Very cool.
The data loaded using the simple example above is:
col1_dirty col2_dirty col3_dirty col4_dirty col5_dirty col6_dirty col7_dirty col8_dirty col9_dirty col10_dirty
1 7 D 2 _ NULL 05/25/2007 05/25/2007 10.00 2.15
Paul
May 8, 2010 at 6:38 pm
johnzabroski (5/5/2010)
I *HATE* Blackboxes.
Heh... I'm right there with you on that one. I've been known to impliment code just to say away from them.
I haven't checked it but it looks like Wayne took an approach similar to what I would do to avoid the BB. Of course, a couple of folks might also suggest writing a CLR to do the cleaning. Well, at least it would be your own black box that way. 🙂
The only thing I hate worse than black boxes is users who can't provide the right kind of data even though it's important to them.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 10, 2010 at 7:46 am
Paul White NZ,
Your solution doesn't work on edge cases where data elements themselves may contain pipes.
As for using CLR, we use CLR heavily here in many exotic ways, especially for small-size clients (<50GB databases). Most articles on SQLServerCentral target performance reasons for using the CLR, which misses the point entirely: it should also make you more productive.
I've thought of using Parsing Expression Grammars (PEG) to automatically build a bcp format file; whether it is XML or the older format file plaintext-format doesn't matter, since the constraints on the solution are the same. XML is merely simpler to read but takes slightly longer for BCP to parse. XML is also not "human type-able". With a PEG, I would also be able to input the PEG to SQL CLR and verify the PEG against the database's dictionaries. This is the cleanest solution, and nobody does it because nobody understands grammar theory. XML is fundamentally for data interchange and not for DSLs.
May 10, 2010 at 8:50 am
johnzabroski,
"Your solution doesn't work on edge cases where data elements themselves may contain pipes."
No, of course it doesn't, but that's not the point. :doze:
The point is to give you away to perform arbitrary processing as part of a bulk load, without losing the benefits of minimal logging.
If you really can't ensure that the source data excludes delimiters, you could still bulk load a whole row at a time, and break it apart using custom logic (maybe in an in-line table-valued function) and still perform optimised bulk load.
The code I provided works for the sample data you provided, and illustrates the basic technique nicely. I do concede that making it work efficiently for your own data sets might require some effort on your part.
I think it is regrettable that you have not found it necessary to thank anyone for the time they have put in to helping you, but that's your choice I guess. I will not be participating further here.
Paul
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply