Why don't these two queries return the same number of records?

  • I've been tasked with making some modifications to an application written some years ago by someone whose no longer with us. I think this application started its life as a Microsoft Access application (the documentation I've found on it suggests that), but someone along the way upgraded the database to SQL Server. It's currently in a SQL Server 2008 R2 instance. (FWIW, the front end is an ASP.NET Web Forms app, running in our Intranet.)

    Anyway, the users have made some requests for changes before they start using it in September. I've been looking into what it will take to make what they want, work. I believe I've found out why they can't do one of the things they want to do.

    First a bit of background. This app and database is to track employees work activities for a period of time. This is to satisfy a Federal requirement.

    In order to test my theory as to why I won't be able to satisfy one of the users' requested changes I decided to do a count of the number of records in the timetrack table, which is the table that stores the activities done. And then compare that to a straight forward listing of all of the activities in timetrack joined to the user name, named PHD_USER. Here's the DDL for PHD_USER:

    CREATE TABLE [dbo].[PHD_USER](

    [phd_user_id] [varchar](15) NOT NULL,

    [phd_user_name] [varchar](40) NULL,

    [staff_id] [int] NULL,

    [phd_user_status] [varchar](40) NULL,

    [phd_user_password] [nvarchar](128) NULL,

    [phd_user_telephone_nr] [char](10) NULL,

    [phd_user_fax_nr] [char](10) NULL,

    [create_userid] [varchar](15) NULL,

    [create_datetime] [datetime] NULL,

    [modify_userid] [varchar](15) NULL,

    [modify_datetime] [datetime] NULL,

    [password_last_changed_dt] [datetime] NULL,

    [phd_win_userid] [varchar](25) NULL,

    [phd_employee_id] [char](10) NULL

    ) ON [PRIMARY]

    And here's the definition for the timetrack table:

    CREATE TABLE [dbo].[timetrack](

    [id] [int] NOT NULL,

    [userid] [char](15) NULL,

    [taskid] [int] NULL,

    [dt] [datetime] NULL,

    [hour] [decimal](10, 2) NULL,

    [jobid] [int] NULL,

    [clinicid] [int] NULL,

    [notes] [varchar](100) NULL

    ) ON [PRIMARY]

    Um, this database has some issues. You probably noticed its missing the primary key definitions, relationships between the tables, etc. And you'd be correct. What I listed is precisely and only what's available for those tables. In fact that's the way it is with all of the tables in this database. None of the tables have any primary key defined. None of the tables are related to any other table. They're just there.

    OK, I've got to work with what I've got. I've gotten naked with the data and I think I know how these tables are supposed to be linked together. I believe that the userid column of timetrack is supposed to be related to the phd_user_id column of the PHD_USER table. I wanted to know if maybe we might be missing some users in the PHD_USER table; like maybe there'd be a record in the timetrack table that didn't match anyone in the PHD_USER table. If I were very lucky, then there'd be exactly the same number of records returned when I listed everything in timetrack joined to PHD_USER, or there might be fewer records someone blew away some user but left records in timetrack. So I ran the next two queries:

    select count(*) 'Total Records in timetrack' from

    timetrack

    select p.phd_user_name, t.* from

    timetrack t

    inner join PHD_USER p

    on t.userid = p.phd_user_id

    Very much to my surprise the first query resulted in a count of 10432 records. However the second query returned 10433 records?! How is that possible? Could the discrepancy be explained by the fact that none of the tables have primary keys, no foreign keys, no constraints?

    Kindest Regards, Rod Connect with me on LinkedIn.

  • Unless I'm missing something there's no reason to believe that that the two queries that you posted should return the same number of records.

    I wanted to know if maybe we might be missing some users in the PHD_USER table; like maybe there'd be a record in the timetrack table that didn't match anyone in the PHD_USER table.

    To check for userid's that exist in timetrack that don't have a matching phd_user_id in PHD_USER you could run this:

    SELECT userid

    FROM timetrack

    EXCEPT

    SELECT phd_user_id

    FROM PHD_USER

    To do the opposite you could run this:

    SELECT phd_user_id

    FROM PHD_USER

    EXCEPT

    SELECT userid

    FROM timetrack

    From the looks of it (just an educated guess based on the table names and the DDL that you posted) it appears that phd_user_id is a DISTINCT value and userid is not. dbo.PHD_USER appears to be a table with information about users and dbo.timetrack is a table that tracks information about those users.

    I know you inherited this but, if you have control over the table design, one low-hanging fruit performance/design improvement would be to change those NULLable fields to NOT NULL (provided that they are not storing any NULL values).

    Edit: spelling goofs

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Since timetrack is returning fewer rows , my guess its the other table tha has a duplicate userid , you really should add a PK or atleast a Unq constraint if the joins are expected the return the same result. in this case you need to delete the duplicate record before you can implement the PK on the user table

    Jayanth Kurup[/url]

  • Rod at work (7/21/2015)


    Very much to my surprise the first query resulted in a count of 10432 records. However the second query returned 10433 records?! How is that possible? Could the discrepancy be explained by the fact that none of the tables have primary keys, no foreign keys, no constraints?

    There's a duplicate row in PHD_USER table, two rows with the same phd_user_id.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jayanth and Gail,

    you're probably right, but how do I find a duplicate record when there's nothing to distinguish one from the other? I ask because when I'm sure that the act of trying to either add a primary key or unique constraint will result in an error message telling me there's a duplicate, but not telling me where the duplicate is.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • To find duplicates use this code.

    SELECT phd_user_id

    FROM PHD_USER

    GROUP BY phd_user_id

    HAVING COUNT(*) > 1

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks Luis.

    Kindest Regards, Rod Connect with me on LinkedIn.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply