April 26, 2005 at 7:29 pm
The query below returns the records in the table shown below. It returns more than what I need. (I want just the records where CERTIFICATE is followed by OF which is followed by LIABILITY – those records highlighted in yellow.)
Does anyone know how I can construct a query that will select the three records only when their values in the wSequence field are sequential?
I cannot depend on the values of wSequence starting at 1 (or any other value).
Query:
select * from wordmetadata where word = 'CERTIFICATE' OR word = 'LIABILITY' or word = 'OF'
Return records
Page | wSequence | dLeft | dTop | dRight | dBottom | word |
1 | 1 | 104.250 | 22.568 | 212.392 | 40.998 | CERTIFICATE |
1 | 2 | 216.798 | 22.568 | 239.622 | 40.998 | OF |
1 | 3 | 244.028 | 22.568 | 322.113 | 40.998 | LIABILITY |
1 | 23 | 336.540 | 44.284 | 388.570 | 53.499 | CERTIFICATE |
1 | 29 | 487.849 | 44.284 | 499.057 | 53.499 | OF |
1 | 38 | 487.071 | 52.534 | 538.102 | 61.749 | CERTIFICATE |
1 | 41 | 372.664 | 60.784 | 423.336 | 69.999 | CERTIFICATE |
1 | 95 | 459.750 | 126.648 | 507.177 | 136.845 | LIABILITY |
1 | 101 | 84.047 | 176.963 | 94.597 | 185.340 | OF |
1 | 124 | 191.240 | 185.213 | 201.790 | 193.590 | OF |
1 | 135 | 478.460 | 185.213 | 528.959 | 193.590 | CERTIFICATE |
1 | 159 | 587.079 | 192.713 | 597.623 | 201.090 | OF |
1 | 177 | 101.060 | 214.320 | 109.366 | 221.022 | OF |
1 | 180 | 87.068 | 225.570 | 114.860 | 232.272 | LIABILITY |
1 | 225 | 142.419 | 236.820 | 168.863 | 243.522 | LIABILITY |
1 | 243 | 98.043 | 309.570 | 126.080 | 316.272 | LIABILITY |
1 | 312 | 83.833 | 406.320 | 111.167 | 413.022 | LIABILITY |
April 26, 2005 at 8:50 pm
How about this?
drop table test_seq
create table test_seq
(wpage smallint,
wsequence smallint
, word char(11))
go
insert into test_seq
select 1,1,'certificate'
union
select 1,2,'of'
union
select 1,3,'liability'
go
insert into test_seq
select 2,5,'certificate'
union
select 2,3,'of'
union
select 2,1,'liability'
go
select * from test_seq
go
select a.* from test_seq a
where a.wpage =
(select a.wpage
from test_seq a
join test_seq b on
a.wpage = b.wpage
join test_seq c on
b.wpage = c.wpage
where a.word = 'certificate'
and b.word = 'of'
and c.word = 'liability'
and a.wsequence = b.wsequence - 1
and b.wsequence = c.wsequence - 1
)
go
Michelle
April 26, 2005 at 9:43 pm
If you can't assume that wsequence is going to be x, x+1, x+2 in the case you want, then you might have to use a rownumber field on the table. And then, because you're going to have to refer to it three times, you'll want to stick that into a view. But I think what you're saying is that it is x, x+1, x+2.
Mind you, I'm not sure why Michelle used a subquery. I would've just used:
select *
from wordmetadata t_cert (NOLOCK)
join wordmetadata t_of (NOLOCK) on t_of.wsequence = t_cert.wsequence + 1
join wordmetadata t_liab (NOLOCK) on t_liab.wsequence = t_of.wsequence + 1
where t_cert.word = 'CERTIFICATE'
and t_of.word = 'OF'
and t_liab.word = 'LIABILITY'
Rob Farley
LobsterPot Solutions & Adelaide SQL Server User Group
Company: http://www.lobsterpot.com.au
Blog: http://blogs.lobsterpot.com.au
April 26, 2005 at 10:08 pm
Rob, your join is not considering the page number. you should probably include the page in the join.
Michele, your final query pulls all the records of the matching page.. but Gary only needs the page with the sequence.
My solution is :
Select W.* from #wordmetadata W
Join (
Select w1.page,w1.wsequence as seq1,w2.wsequence as seq2,w3.wsequence as seq3
From
#wordmetadata w1 inner join #wordmetadata w2
on w1.page=w2.page and w2.wsequence<=w1.wsequence+1
inner join #wordmetadata w3
on w1.page=w3.page and w3.wsequence<=w1.wsequence+2
Where
w3.wsequence-w2.wsequence = 1 and
w2.wsequence-w1.wsequence = 1 and
w1.word='certificate' and w2.word='of' and w3.word='liability'
) as X
ON
W.page = x.page and (w.wsequence = seq1 OR w.wsequence = seq2 OR w.wsequence = seq3)
April 26, 2005 at 11:00 pm
Hariharan,
He doesn't say he wants the page number. Couldn't it be that the words go across two different pages?
Of course, it depends on the actual context of the query.
Your "w3.wsequence-w2.wsequence = 1" isn't very efficient. Much better to use "w3.wsequence = 2.wsequence + 1".
I'm not sure why you'd do: "on w1.page=w2.page and w2.wsequence<=w1.wsequence+1" - why bother with the '<=', when you have '=' in the where clause.
I reckon just throw the page checking into my query if that's what's required. I'm assuming that wsequence keeps growing across all pages, so that the smallest wsequence on page 2 is maybe 423. That will allow cross-page checking. If it resets to one, then you do need to put the page checking in.
My query returns the three records as separate fields. Hariharan's returns them as separate records, which would mean you should order the results by page & sequence to make sure that they turn up in order.
Rob
Rob Farley
LobsterPot Solutions & Adelaide SQL Server User Group
Company: http://www.lobsterpot.com.au
Blog: http://blogs.lobsterpot.com.au
April 26, 2005 at 11:39 pm
April 27, 2005 at 5:07 am
Thank you all for very insightful and helpful input.
I studied and tried all the SQL statements provided. The first, by Michelle , generated a 'Subquery generated more than 1 value' error. The sub query did return values when run on its own and I will look into seeing how it can be made to work.
Rob's query returned only records where word = 'Certificate', but it does appear to 'weed out' the records I do not want (where word = 'Certificate' but the record after it does not contain word = 'Of').
Hariharan's query statement generated the results I was looking for. I was able to append the statement with "where W.page = 1" and it returned exactly what I was looking for.
I noticed some discussion about the intent of my desired query, and want to just address them.
1. "... can't assume that wsequence is going to be x, x+1, x+2 ..." In the database their are all types of possibilities, but I am looking for that sequence, just that we cannot assume x = 1, as is the case in the example table I provided in the original post.
2. Issues about the page number - The sequence will always have to be on the same page, and I am going to be specifying the page value in the query.
Thanks for all your help. I really appreciate your thoughtful comments.
April 27, 2005 at 7:23 pm
Don't specify the page value in the query, just make sure that it matches from record to record.
My version only returned one record for each 'triple', but returned three times as many fields. If you scroll to the right, you'll see the others.
Glad you've got it sorted now.
Rob
Rob Farley
LobsterPot Solutions & Adelaide SQL Server User Group
Company: http://www.lobsterpot.com.au
Blog: http://blogs.lobsterpot.com.au
May 2, 2005 at 7:01 am
I posted this earlier (around 4/25) but somehow it is gone. Here it is again:
This solution is exactly what I was looking for ... originally. I have implemented your code and it works, as you said, without using the stored procedure.
I think this discussion is so good, I figured I would test the performance of the SQL Insert verses that of the Stored Procedure (SP). Below are the results. Be aware that the table was truncated before each test. For example with test number 1 the table was truncated before the SQL Insert test and before the SP test. Also, the SQL Server instance is local to my workstation, so their should be no question about bandwidth and someone outside this experiment taking up the bandwidth. Really the only processing being performed on my local machine is some .Net code, SQL Server, and Microsoft Word.
However, after implementing the stored procedure version and reading the link posted
Test Number | SQL Insert | Stored Procedure | Number of Writes |
1 | 01.9677224 | 01.8472496 | 3156 |
2 | 01.8874072 | 01.8472496 | 3156 |
3 | 01.9476436 | 01.8773678 | 3156 |
4 | 31.9179775 | 26.4594330 | 46065 |
5 | 27.0767675 | 26.8167025 | 46065 |
6 | 27.5468850 | 26.3679080 | 46065 |
7 | 06:29.5346552 | 06:14.0346944 | 615220 |
Nunit test code is below. Note that the two tests are basically the same except for the "myWriter.ExecuteNonQuery () method. Its implementation is overloaded: one takes an sql string and passis it using to a sqlcommand of type "text", the other accepts an argument of a class (structure really) that has all the parameters used to build the stored procedure (see previous post) for a sqlcommand object whose type is "storedProcedure".
[Test] public void storePageWordsSQL ()
{
Xpdf.PdfReader myPdfReader;
myPdfReader = new PdfReader
(@"_allCcerts.pdf");
int numPages = myPdfReader.NumPages;
string strSQL;
long numRecsAdded = 0;
DateTime startTime = DateTime.Now;
for (int i = 0; i < numPages /*&& i < 100*/; i++)
{
WordInfo[] myWordInfo = myPdfReader.createWordAndDataList(i+1);
//myWriter.ExecuteNonQuery ("truncate table wordMetadata");
foreach (WordInfo wInfo in myWordInfo)
{
strSQL = "insert into wordMetadata (Page, Wsequence, dleft, dTop, dRight, dBottom, word) values " +
"(" + wInfo.Page + ", " + wInfo.Sequence + ", " + wInfo.Left + ", " +
wInfo.Top + ", " + wInfo.Right + ", " + wInfo.Bottom + ", '" +
wInfo.Word.Replace ("'","''") + "')";
//Console.WriteLine (strSQL);
myWriter.ExecuteNonQuery (strSQL);
}
numRecsAdded += myWordInfo.Length;
}
DateTime timeFinish = DateTime.Now;
Console.WriteLine ("SQL String - added " + numRecsAdded + " records / time : " + timeFinish.Subtract(startTime));
}
[Test] public void storePageWordsSP ()
{
Xpdf.PdfReader myPdfReader;
myPdfReader = new PdfReader
(@"_allCcerts.pdf");
//_allCcerts.pdf
// smallExtract.pdf
int numPages = myPdfReader.NumPages;
long numRecsAdded = 0;
DateTime startTime = DateTime.Now;
for (int i = 0; i < numPages /*&& i < 100*/; i++)
{
//myPdfReader.setCurrentPage (2);
WordInfo[] myWordInfo = myPdfReader.createWordAndDataList(i+1);
//myWriter.ExecuteNonQuery ("truncate table wordMetadata");
foreach (WordInfo wInfo in myWordInfo)
{
myWriter.ExecuteNonQuery (wInfo);
}
numRecsAdded += myWordInfo.Length;
}
DateTime timeFinish = DateTime.Now;
Console.WriteLine ("Stored Procedure - added " + numRecsAdded + " records / time : " + timeFinish.Subtract(startTime));
}
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply