SQL help on selecting records based on content and their neighbors content

  • The query below returns the records in the table shown below.  It returns more than what I need.  (I want just the records where CERTIFICATE is followed by OF which is followed by LIABILITY – those records highlighted in yellow.)

     

    Does anyone know how I can construct a query that will select the three records only when their values in the wSequence field are sequential?

     

    I cannot depend on the values of wSequence starting at 1 (or any other value).

     

    Query:

    select * from wordmetadata where word = 'CERTIFICATE' OR word = 'LIABILITY' or word = 'OF'

     

    Return records

    Page

    wSequence

    dLeft

    dTop

    dRight

    dBottom

    word

    1

    1

    104.250

    22.568

    212.392

    40.998

    CERTIFICATE

    1

    2

    216.798

    22.568

    239.622

    40.998

    OF

    1

    3

    244.028

    22.568

    322.113

    40.998

    LIABILITY

    1

    23

    336.540

    44.284

    388.570

    53.499

    CERTIFICATE

    1

    29

    487.849

    44.284

    499.057

    53.499

    OF

    1

    38

    487.071

    52.534

    538.102

    61.749

    CERTIFICATE

    1

    41

    372.664

    60.784

    423.336

    69.999

    CERTIFICATE

    1

    95

    459.750

    126.648

    507.177

    136.845

    LIABILITY

    1

    101

    84.047

    176.963

    94.597

    185.340

    OF

    1

    124

    191.240

    185.213

    201.790

    193.590

    OF

    1

    135

    478.460

    185.213

    528.959

    193.590

    CERTIFICATE

    1

    159

    587.079

    192.713

    597.623

    201.090

    OF

    1

    177

    101.060

    214.320

    109.366

    221.022

    OF

    1

    180

    87.068

    225.570

    114.860

    232.272

    LIABILITY

    1

    225

    142.419

    236.820

    168.863

    243.522

    LIABILITY

    1

    243

    98.043

    309.570

    126.080

    316.272

    LIABILITY

    1

    312

    83.833

    406.320

    111.167

    413.022

    LIABILITY

     

     

  • How about this?

    drop table test_seq

    create table test_seq

    (wpage smallint,

    wsequence smallint

    , word char(11))

    go

    insert into test_seq

    select 1,1,'certificate'

    union

    select 1,2,'of'

    union

    select 1,3,'liability'

    go

    insert into test_seq

    select 2,5,'certificate'

    union

    select 2,3,'of'

    union

    select 2,1,'liability'

    go

    select * from test_seq

    go

    select a.* from test_seq a

    where a.wpage =

    (select a.wpage

    from test_seq a

    join test_seq b on

    a.wpage = b.wpage

    join test_seq c on

    b.wpage = c.wpage

    where a.word = 'certificate'

    and b.word = 'of'

    and c.word = 'liability'

    and a.wsequence = b.wsequence - 1

    and b.wsequence = c.wsequence - 1

    )

    go



    Michelle

  • If you can't assume that wsequence is going to be x, x+1, x+2 in the case you want, then you might have to use a rownumber field on the table. And then, because you're going to have to refer to it three times, you'll want to stick that into a view. But I think what you're saying is that it is x, x+1, x+2.

    Mind you, I'm not sure why Michelle used a subquery. I would've just used:

    select *

    from wordmetadata t_cert (NOLOCK)

    join wordmetadata t_of (NOLOCK) on t_of.wsequence = t_cert.wsequence + 1

    join wordmetadata t_liab (NOLOCK) on t_liab.wsequence = t_of.wsequence + 1

    where t_cert.word = 'CERTIFICATE'

    and t_of.word = 'OF'

    and t_liab.word = 'LIABILITY'

    Rob Farley
    LobsterPot Solutions & Adelaide SQL Server User Group
    Company: http://www.lobsterpot.com.au
    Blog: http://blogs.lobsterpot.com.au

  • Rob, your join is not considering the page number. you should probably include the page in the join.

    Michele, your final query pulls all the records of the matching page.. but Gary only needs the page with the sequence.

    My solution is :

    Select W.* from #wordmetadata  W

    Join (

    Select w1.page,w1.wsequence as seq1,w2.wsequence as seq2,w3.wsequence as seq3

    From

    #wordmetadata w1 inner join #wordmetadata w2

    on w1.page=w2.page and w2.wsequence<=w1.wsequence+1

    inner join #wordmetadata w3

    on w1.page=w3.page and w3.wsequence<=w1.wsequence+2

    Where

    w3.wsequence-w2.wsequence = 1 and

    w2.wsequence-w1.wsequence = 1 and

    w1.word='certificate' and w2.word='of' and w3.word='liability'

    ) as X

    ON

    W.page = x.page and (w.wsequence = seq1 OR  w.wsequence = seq2 OR w.wsequence = seq3)

     

     


    Kindest Regards,

    Hari

  • Hariharan,

    He doesn't say he wants the page number. Couldn't it be that the words go across two different pages?

    Of course, it depends on the actual context of the query.

    Your "w3.wsequence-w2.wsequence = 1" isn't very efficient. Much better to use "w3.wsequence = 2.wsequence + 1".

    I'm not sure why you'd do: "on w1.page=w2.page and w2.wsequence<=w1.wsequence+1" - why bother with the '<=', when you have '=' in the where clause.

    I reckon just throw the page checking into my query if that's what's required. I'm assuming that wsequence keeps growing across all pages, so that the smallest wsequence on page 2 is maybe 423. That will allow cross-page checking. If it resets to one, then you do need to put the page checking in.

    My query returns the three records as separate fields. Hariharan's returns them as separate records, which would mean you should order the results by page & sequence to make sure that they turn up in order.

    Rob

    Rob Farley
    LobsterPot Solutions & Adelaide SQL Server User Group
    Company: http://www.lobsterpot.com.au
    Blog: http://blogs.lobsterpot.com.au

  • Rob,

    Agreed to your point on the wsequence.. since the reset point is not mentioned in the question, i assumed that it'll be reset for each page. that is why i used "w2.page=w2.page"

    noted the point on efficiency..

    Thanks dude.

    -Hari

     


    Kindest Regards,

    Hari

  • Thank you all for very insightful and helpful input.

    I studied and tried all the SQL statements provided.  The first, by Michelle , generated a 'Subquery generated more than 1 value' error.  The sub query did return values when run on its own and I will look into seeing how it can be made to work.

    Rob's query returned only records where word = 'Certificate', but it does appear to 'weed out' the records I do not want (where word = 'Certificate' but the record after it does not contain word = 'Of').

    Hariharan's query statement generated the results I was looking for.  I was able to append the statement with "where W.page = 1" and it returned exactly what I was looking for.

    I noticed some discussion about the intent of my desired query, and want to just address them. 

    1. "... can't assume that wsequence is going to be x, x+1, x+2 ..."  In the database their are all types of possibilities, but I am looking for that sequence, just that we cannot assume x = 1, as is the case in the example table I provided in the original post.

    2. Issues about the page number - The sequence will always have to be on the same page, and I am going to be specifying the page value in the query.

    Thanks for all your help.  I really appreciate your thoughtful comments.

  • Don't specify the page value in the query, just make sure that it matches from record to record.

    My version only returned one record for each 'triple', but returned three times as many fields. If you scroll to the right, you'll see the others.

    Glad you've got it sorted now.

    Rob

    Rob Farley
    LobsterPot Solutions & Adelaide SQL Server User Group
    Company: http://www.lobsterpot.com.au
    Blog: http://blogs.lobsterpot.com.au

  • I posted this earlier (around 4/25) but somehow it is gone.  Here it is again:

     

    This solution is exactly what I was looking for ... originally.  I have implemented your code and it works, as you said, without using the stored procedure.

     

    I think this discussion is so good, I figured I would test the performance of the SQL Insert verses that of the Stored Procedure (SP).  Below are the results.  Be aware that the table was truncated before each test.  For example with test number 1 the table was truncated before the SQL Insert test and before the SP test.  Also, the SQL Server instance is local to my workstation, so their should be no question about bandwidth and someone outside this experiment taking up the bandwidth.  Really the only processing being performed on my local machine is some .Net code, SQL Server, and Microsoft Word.

     

    However, after implementing the stored procedure version and reading the link posted

     

     

    Test Number

    SQL Insert

    Stored Procedure

    Number of Writes

    1

    01.9677224

    01.8472496

    3156

    2

    01.8874072

    01.8472496

    3156

    3

    01.9476436

    01.8773678

    3156

    4

    31.9179775

    26.4594330

    46065

    5

    27.0767675

    26.8167025

    46065

    6

    27.5468850

    26.3679080

    46065

    7

    06:29.5346552

    06:14.0346944

    615220

     

    Nunit test code is below.  Note that the two tests are basically the same except for the "myWriter.ExecuteNonQuery () method.  Its implementation is overloaded: one takes an sql string and passis it using to a sqlcommand of type "text", the other accepts an argument of a class (structure really) that has all the parameters used to build the stored procedure (see previous post) for a sqlcommand object whose type is "storedProcedure".

     

    [Test] public void storePageWordsSQL ()

    {

           Xpdf.PdfReader myPdfReader;

     

           myPdfReader = new PdfReader

                  (@"_allCcerts.pdf");

           int numPages = myPdfReader.NumPages;

     

           string strSQL;

     

           long numRecsAdded = 0;

           DateTime startTime = DateTime.Now;

           for (int i = 0; i < numPages /*&& i < 100*/; i++)

           {

     

                  WordInfo[] myWordInfo = myPdfReader.createWordAndDataList(i+1);

                  //myWriter.ExecuteNonQuery ("truncate table wordMetadata");

     

                  foreach (WordInfo wInfo in myWordInfo)

                  {

                         strSQL = "insert into wordMetadata (Page, Wsequence, dleft, dTop, dRight, dBottom, word) values " +

                               "(" + wInfo.Page + ", " + wInfo.Sequence + ",  " + wInfo.Left + ", " +

                               wInfo.Top + ", " + wInfo.Right + ", " + wInfo.Bottom + ", '" +

                               wInfo.Word.Replace ("'","''") + "')";

                         //Console.WriteLine (strSQL);

                         myWriter.ExecuteNonQuery (strSQL);

     

                  }

                  numRecsAdded += myWordInfo.Length;

           }

           DateTime timeFinish = DateTime.Now;

           Console.WriteLine ("SQL String - added " + numRecsAdded + " records / time : " + timeFinish.Subtract(startTime));

     

    }

     

    [Test] public void storePageWordsSP ()

    {

           Xpdf.PdfReader myPdfReader;

     

           myPdfReader = new PdfReader

                  (@"_allCcerts.pdf");

           //_allCcerts.pdf

           // smallExtract.pdf

           int numPages = myPdfReader.NumPages;

     

           long numRecsAdded = 0;

           DateTime startTime = DateTime.Now;

           for (int i = 0; i < numPages /*&& i < 100*/; i++)

           {

     

                  //myPdfReader.setCurrentPage (2);

                  WordInfo[] myWordInfo = myPdfReader.createWordAndDataList(i+1);

                  //myWriter.ExecuteNonQuery ("truncate table wordMetadata");

     

                  foreach (WordInfo wInfo in myWordInfo)

                  {

                         myWriter.ExecuteNonQuery (wInfo);

                  }

                  numRecsAdded += myWordInfo.Length;

           }

           DateTime timeFinish = DateTime.Now;

           Console.WriteLine ("Stored Procedure - added " + numRecsAdded + " records / time : " + timeFinish.Subtract(startTime));

     

    }

     

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply