Prevent duplicates in a table

Question

Prevent duplicates in a table

Viewing 15 posts - 16 through 30 (of 46 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 1

andycadley - Monday, August 20, 2018 3:43 PM
RonKyle - Monday, August 20, 2018 2:21 PM
The separation still sucks. If someone wants uniqueness by month, create a persisted computed column that resolves to the number of months since Jan 1900 and use that as a part of the key. And there's no knee jerking about this. I've seen people get into trouble even with reporting tables by separating month and year and other date parts
I'm not a fan of artificially generated keys if there's a business key that will work and not be "too many" columns. Don't ask me for an exact definition of that, but I usually draw the line at four if it can be a parent. Year 2018 Month 3 is more human readable that 148 or whatever month that would translate as in an OLTP environment. I do it that way in an OLAP environment, but that is a different matter.
I'd tend to agree, but I'd still follow Jeff's basic design. Have one date column with the combined day/month/year and then persisted computed columns (with the necessary unique constraint) to break out the year and month part.

You only need one column to do both breakouts while still lending itself to supporting the uniqueness of a row in a persisted manner. And it is, in fact, a natural key because, like the separate month and year columns, it's a representation of a month . There's just no way I would persist separate columns for month and year. It causes too many reporting problems, especially when the boss coughs up a special request that traverses the annual boundary and people start digging themselves into a hole with crazy ways to do what the boss wants done. It's so much easier if you leave the display stuff out until it's time to display. The data layer and display layer are kept separate everywhere else... I don't understand why people confuse the two in databases. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 2

drew.allen - Monday, August 20, 2018 2:32 PM
Jeff Moden - Monday, August 20, 2018 2:13 PM
RonKyle - Monday, August 20, 2018 9:37 AM
Jeff Moden - Monday, August 20, 2018 9:30 AM
Senchi - Monday, August 20, 2018 8:53 AM
I am entering 'salaries' information into a sql server table called 'salaries'.
Fields are :
id (pk,identity),c_id (int),month,year,salary.
I use a stored procedure to insert this like :
insert into salaries (c_id,month,year,salary) select c_id,month,year,salary from TEMP
Now I am trying to avoid duplicates from being inserted.
There can be no duplicates for c_id,month,year.
All 3 conditions at the same time.
c_id can be inserted as many times as you like but not with the same month and year data.
Month and year data are also not unique as I am inserting data for all company for that month.
How can I do this ? I used to know it but I forgot. I dont remember if I used IF EXISTS or if I
used some index. please help.
You'll actually learn to hate the fact that you have separate month and year columns someday in the not-so-distant future for a whole lot of reasons.
As for a PK, the month and year thing will make that difficult as well because there is no guarantee that people won't be paid more than once per month. And what the heck does "c_ID" mean if you take it out of context? If you must have an ID column, use the form of TableName_ID or TableNameID.
I won't get into what I think of pluralized tables names. 😉
He said there can be no duplicates for id, year, month. This may not have anything to do with actual pay but what they should be paid in a month. While I'm not keen on the separation of dates, I wouldn't as a knee jerk reaction see it as wrong. Otherwise multiple dates could be entered for a particular month, which again is apparently undesirable. The separation is probably best and the dates are easily constructed when necessary with the DATEFROMPARTS function.
While I agree with you on the table plurals, that's an internal preference.
The separation still sucks. 😉 If someone wants uniqueness by month, create a persisted computed column that resolves to the number of months since Jan 1900 and use that as a part of the key. And there's no knee jerking about this. I've seen people get into trouble even with reporting tables by separating month and year and other date parts.
I use the first (or last) day of the month. This has the advantage that it can easily be formatted to show only the year and month.
Drew

Absolutely agreed... I typically use the first of the month but for some reason, a lot of people don't get that because they're looking for a display value and think they're looking at a display value. When that level of confusion prevails, I take it all away from them to get them to think a different way. It's funny how they'll believe that some number is a representation for a month and year but won't believe that a date is also a representation for a month and a year (and, as you already know, a whole lot more).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 3

Senchi - Monday, August 20, 2018 3:18 PM
Though I appreciate your opinions can somebody actually tell me on how to accomplish this ?

Yes. The table design is bad and it will kill you in the future. Change it now while you can.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

andycadley SSCertifiable Points: 5296 More actions · Answer 4

Jeff Moden - Monday, August 20, 2018 8:03 PM
andycadley - Monday, August 20, 2018 3:43 PM
RonKyle - Monday, August 20, 2018 2:21 PM
The separation still sucks. If someone wants uniqueness by month, create a persisted computed column that resolves to the number of months since Jan 1900 and use that as a part of the key. And there's no knee jerking about this. I've seen people get into trouble even with reporting tables by separating month and year and other date parts
I'm not a fan of artificially generated keys if there's a business key that will work and not be "too many" columns. Don't ask me for an exact definition of that, but I usually draw the line at four if it can be a parent. Year 2018 Month 3 is more human readable that 148 or whatever month that would translate as in an OLTP environment. I do it that way in an OLAP environment, but that is a different matter.
I'd tend to agree, but I'd still follow Jeff's basic design. Have one date column with the combined day/month/year and then persisted computed columns (with the necessary unique constraint) to break out the year and month part.
You only need one column to do both breakouts while still lending itself to supporting the uniqueness of a row in a persisted manner. And it is, in fact, a natural key because, like the separate month and year columns, it's a representation of a month . There's just no way I would persist separate columns for month and year. It causes too many reporting problems, especially when the boss coughs up a special request that traverses the annual boundary and people start digging themselves into a hole with crazy ways to do what the boss wants done. It's so much easier if you leave the display stuff out until it's time to display. The data layer and display layer are kept separate everywhere else... I don't understand why people confuse the two in databases. 😉

Well technically you just need a Date column with a check constraint that enforces it to be first of the month and a unique constraint to guarantee there is only one per month. I still think there are occasions where breaking out the month/year as separate columns has advantages (I'm thinking in terms of GROUP BY Year etc) that a single sentinel value of "days since 0" doesn't really help with. Obviously you don't want to try and report on a time slice based on those separated columns, that way lies madness...

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 5

RonKyle - Monday, August 20, 2018 2:21 PM
The separation still sucks. If someone wants uniqueness by month, create a persisted computed column that resolves to the number of months since Jan 1900 and use that as a part of the key. And there's no knee jerking about this. I've seen people get into trouble even with reporting tables by separating month and year and other date parts
I'm not a fan of artificially generated keys if there's a business key that will work and not be "too many" columns. Don't ask me for an exact definition of that, but I usually draw the line at four if it can be a parent. Year 2018 Month 3 is more human readable that 148 or whatever month that would translate as in an OLTP environment. I do it that way in an OLAP environment, but that is a different matter.

Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

RonKyle SSC-Dedicated Points: 31557 More actions · Answer 6

Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.

Couldn't you have stated one or two of them? As it is, I have no way to evaluate your statement.

RonKyle SSC-Dedicated Points: 31557 More actions · Answer 7

Yes. The table design is bad and it will kill you in the future.

I don't agree. If you need to guarantee that there is no more than one entry per month, this is the cleanest way to do it.

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 8

RonKyle - Tuesday, August 21, 2018 7:08 AM
Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.
Couldn't you have stated one or two of them? As it is, I have no way to evaluate your statement.

Sure. The easiest is the right to be forgotten (which is in all the laws). So, with a natural key, you have a ton of labor to try to logically delete records. You might even have to do physical deletes. Also, with natural keys, ensuring that backups are clean after a restore means you have to keep the identifying values in order to clean them off the restore, but... you can't keep the identifying information, so that becomes pretty much impossible.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

RonKyle SSC-Dedicated Points: 31557 More actions · Answer 9

Grant Fritchey - Tuesday, August 21, 2018 7:25 AM
RonKyle - Tuesday, August 21, 2018 7:08 AM
Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.
Couldn't you have stated one or two of them? As it is, I have no way to evaluate your statement.
Sure. The easiest is the right to be forgotten (which is in all the laws). So, with a natural key, you have a ton of labor to try to logically delete records. You might even have to do physical deletes. Also, with natural keys, ensuring that backups are clean after a restore means you have to keep the identifying values in order to clean them off the restore, but... you can't keep the identifying information, so that becomes pretty much impossible.

I don't understand how this applies using a natural key of employee, year, month. First of all, an employee does not have the right to be forgotten. The table seems geared to storing what his salary pay should be starting that month. What does that have to do with GDPR? Even assuming they had that right, the employee name would just become disassociated with the id. There were several example articles along exactly these lines recently.

Grant Fritchey SSC Guru Points: 398909 More actions · Answer 10

RonKyle - Tuesday, August 21, 2018 7:31 AM
Grant Fritchey - Tuesday, August 21, 2018 7:25 AM
RonKyle - Tuesday, August 21, 2018 7:08 AM
Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.
Couldn't you have stated one or two of them? As it is, I have no way to evaluate your statement.
Sure. The easiest is the right to be forgotten (which is in all the laws). So, with a natural key, you have a ton of labor to try to logically delete records. You might even have to do physical deletes. Also, with natural keys, ensuring that backups are clean after a restore means you have to keep the identifying values in order to clean them off the restore, but... you can't keep the identifying information, so that becomes pretty much impossible.
I don't understand how this applies using a natural key of employee, year, month. First of all, an employee does not have the right to be forgotten. The table seems geared to storing what his salary pay should be starting that month. What does that have to do with GDPR? Even assuming they had that right, the employee name would just become disassociated with the id. There were several example articles along exactly these lines recently.

Completely depends on what the data is. I don't know in this case. We have the bare bones definition of a table. However, an employee id is an artificial key, so not something we're discussing here anyway. You said, "I'm not a fan of artificially generated keys". That's what I was referring to. That's what I was suggesting might not be the best choice in the future because of new privacy and protection laws.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

RonKyle SSC-Dedicated Points: 31557 More actions · Answer 11

Grant Fritchey - Tuesday, August 21, 2018 7:57 AM
RonKyle - Tuesday, August 21, 2018 7:31 AM
Grant Fritchey - Tuesday, August 21, 2018 7:25 AM
RonKyle - Tuesday, August 21, 2018 7:08 AM
Might want to reevaluate those plans with the oncoming storm of GDPR/CPP/SHIELD laws. Natural keys become extremely problematic for a lot of reasons.
Couldn't you have stated one or two of them? As it is, I have no way to evaluate your statement.
Sure. The easiest is the right to be forgotten (which is in all the laws). So, with a natural key, you have a ton of labor to try to logically delete records. You might even have to do physical deletes. Also, with natural keys, ensuring that backups are clean after a restore means you have to keep the identifying values in order to clean them off the restore, but... you can't keep the identifying information, so that becomes pretty much impossible.
I don't understand how this applies using a natural key of employee, year, month. First of all, an employee does not have the right to be forgotten. The table seems geared to storing what his salary pay should be starting that month. What does that have to do with GDPR? Even assuming they had that right, the employee name would just become disassociated with the id. There were several example articles along exactly these lines recently.
Completely depends on what the data is. I don't know in this case. We have the bare bones definition of a table. However, an employee id is an artificial key, so not something we're discussing here anyway. You said, "I'm not a fan of artificially generated keys". That's what I was referring to. That's what I was suggesting might not be the best choice in the future because of new privacy and protection laws.

Our difference may simply be the definition of "artificial key." I consider an artificial key to be an autogenerated number that has no business meaning and is generally invisible to the users. If someone's employee id is 80091, then even though they came in after employee 80090, it's computer generated, but it still has a business meaning. I deal with many tables that have a computer generated id that's the key but also have this business key.

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 12

RonKyle - Tuesday, August 21, 2018 7:11 AM
Yes. The table design is bad and it will kill you in the future.
I don't agree. If you need to guarantee that there is no more than one entry per month, this is the cleanest way to do it.

But it's not the cleanest way to do it for anything else. You just have to know there will be multiple reporting requirements for this table and that, because of the total lack of definition of other possible requirements, that the OP and the people he works for are missing out on other possibilities. I'll also state that any table that has separate columns for Months and Years that require enforcement of such uniqueness within that table is improperly designed. There's a much better way to do such a thing that will still allow it to be flexible for future reporting and usage requirements without having to worry about wrap-arounds for queries that span an annual boundary.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

jcelko212 32090 SSCrazy Eights Points: 9303 More actions · Answer 13

Senchi - Monday, August 20, 2018 8:53 AM
I am entering 'salaries' information into a sql server table called 'salaries'.
Fields are :
id (pk,identity),c_id (int),month,year,salary.
I use a stored procedure to insert this like :
insert into salaries (c_id,month,year,salary) select c_id,month,year,salary from TEMP
Now I am trying to avoid duplicates from being inserted.
There can be no duplicates for c_id,month,year.
All 3 conditions at the same time.
c_id can be inserted as many times as you like but not with the same month and year data.
Month and year data are also not unique as I am inserting data for all company for that month.
How can I do this ? I used to know it but I forgot. I dont remember if I used IF EXISTS or if I
used some index. please help.

Why did you fail to post DDL? Why do you think that the non-relational IDENTITY table property (it is not a column!) could ever be a key? Why don't you know the difference between a field and a column? Since an identifier can never be a numeric you got that wrong too.
Here is a skeleton for an auxiliary table of reporting periods.

CREATE TABLE Report_Periods (report_name CHAR(10) NOT NULL PRIMARY KEY, report_start_date DATE NOT NULL, report_end_date DATE NOT NULL, CONSTRAINT date_ordering CHECK (report_start_date <= report_end_date), ordinal_period INTEGER NOT NULL UNIQUE CHECK(ordinal_period > 0) etc);

These report periods can overlap; a fiscal quarter will be contained in the range of its fiscal year. I like the MySQL convention of using double zeroes for months and years, That is 'yyyyyyyy-mm-00' for a month within a year and 'yyyyyyyy-00-00' for the whole year. The advantage is that it will sort with the ISO-8601 data format required by Standard SQL.

CREATE TABLE Salaries (emp_nbr CHAR(10) NOT NULL, salary_month CHAR(10) NOT NULL REFERENCES Report_Periods, PRIMARY KEY (emp_nbr, salary_month), salary_amt DECIMAL(8,2) NOT NULL CHECK(salary_amt >= 0.00) );

>> Now I am trying to avoid duplicates from being inserted. <<

The primary key constraint will do that. It also meets the requirement that a table must have a key (this is why we wanted you to post DDL).

Please post DDL and follow ANSI/ISO standards when asking for help.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 14

jcelko212 32090 - Tuesday, August 21, 2018 11:00 AM
Senchi - Monday, August 20, 2018 8:53 AM
I am entering 'salaries' information into a sql server table called 'salaries'.
Fields are :
id (pk,identity),c_id (int),month,year,salary.
I use a stored procedure to insert this like :
insert into salaries (c_id,month,year,salary) select c_id,month,year,salary from TEMP
Now I am trying to avoid duplicates from being inserted.
There can be no duplicates for c_id,month,year.
All 3 conditions at the same time.
c_id can be inserted as many times as you like but not with the same month and year data.
Month and year data are also not unique as I am inserting data for all company for that month.
How can I do this ? I used to know it but I forgot. I dont remember if I used IF EXISTS or if I
used some index. please help.
Why did you fail to post DDL? Why do you think that the non-relational IDENTITY table property (it is not a column!) could ever be a key? Why don't you know the difference between a field and a column? Since an identifier can never be a numeric you got that wrong too.
Here is a skeleton for an auxiliary table of reporting periods.
CREATE TABLE Report_Periods (report_name CHAR(10) NOT NULL PRIMARY KEY, report_start_date DATE NOT NULL, report_end_date DATE NOT NULL, CONSTRAINT date_ordering CHECK (report_start_date <= report_end_date), ordinal_period INTEGER NOT NULL UNIQUE CHECK(ordinal_period > 0) etc);
These report periods can overlap; a fiscal quarter will be contained in the range of its fiscal year. I like the MySQL convention of using double zeroes for months and years, That is 'yyyyyyyy-mm-00' for a month within a year and 'yyyyyyyy-00-00' for the whole year. The advantage is that it will sort with the ISO-8601 data format required by Standard SQL.
CREATE TABLE Salaries (emp_nbr CHAR(10) NOT NULL, salary_month CHAR(10) NOT NULL REFERENCES Report_Periods, PRIMARY KEY (emp_nbr, salary_month), salary_amt DECIMAL(8,2) NOT NULL CHECK(salary_amt >= 0.00) );
>> Now I am trying to avoid duplicates from being inserted. <<
The primary key constraint will do that. It also meets the requirement that a table must have a key (this is why we wanted you to post DDL).
(function(){(function(e){if(e){var t=e.cloneNode;e.cloneNode=function(n){var o=t.call(e,n);if(e.classList.contains("mceContentBody"))o.innerHTML=e.innerHTML,r(o);else try{i(o)}catch(e){}return o},i(e)}function n(e){if(e.parentNode)if(e.childNodes.length>1){for(var t=document.createDocumentFragment();e.childNodes.length>0;)t.appendChild(e.childNodes[0]);e.parentNode.replaceChild(t,e)}else e.firstChild?e.parentNode.replaceChild(e.firstChild,e):e.parentNode.removeChild(e)}function r(e){if(e)try{for(var t=e.querySelectorAll(".gr_"),r=t.length,i=0;i<r;i++)n(t)}catch(e){}}function i(e){try{Object.defineProperty(e,"innerHTML",{get:function(){try{var t=e.ownerdocument.createRange();t.selectNodeContents(e);var n=t.cloneContents(),i=document.createElement("div");return i.appendChild(n),r(i),i.innerHTML}catch(e){return""}},set:function(t){try{var n=e.ownerdocument.createRange();n.selectNodeContents(e),n.deleteContents();var r=n.createContextualFragment(t);e.appendChild(r)}catch(e){}}})}catch(e){}}})(document.querySelector("[data-gramm_id='a101f096-9f17-e24f-cda4-a65b9f608aff']")) })()[/code]These report periods can overlap; a fiscal quarter will be contained in the range of its fiscal year. I like the MySQL convention of using double zeroes for months and years, That is '-mm-00' for a month within a year and '-00-00' for the whole year. The advantage is that it will sort with the ISO-8601 data format required by Standard SQL.CREATE TABLE Salaries(emp_nbr CHAR(10) NOT NULL,salary_month CHAR(10) NOT NULL REFERENCES Report_Periods,PRIMARY KEY (emp_nbr, salary_month),salary_amt DECIMAL(8,2) NOT NULL CHECK(salary_amt >= 0.00));>> Now I am trying to avoid duplicates from being inserted. <<The primary key constraint will do that. It also meets the requirement that a table must have a key (this is why we wanted you to post DDL).

Mr. Celko, ENOUGH ABOUT MySQL! If you love that product so much go find a forum for it and live there instead of here.

jcelko212 32090 SSCrazy Eights Points: 9303 More actions · Answer 15

Lynn Pettis - Tuesday, August 21, 2018 11:06 AM

jcelko212 32090 - Tuesday, August 21, 2018 11:00 AM

Senchi - Monday, August 20, 2018 8:53 AM
Mr. Celko, ENOUGH ABOUT MySQL! If you love that product so much go find a forum for it and live there instead of here.

That is one of my standard "cut and paste" replies to a common problem in SQL Server (lack of the ANSI/ISO standard interval temporal datatypes). It's not particularly a plug for the MySQL product. But as full disclosure, I will tell you that I did work for them many years ago doing videos on basic data design (not product specific training videos). Frankly, I find the MySQL dialect to be too far away from standards for my taste.

But there proposal for integrating certain temporal interval data types into ISO 8601 standards to be a neat solution, with many advantages and it seems to be catching on the standards community.

Can I assume you have a better solution that they should be using for this problem? I've never seen you posted anywhere that I can remember.

Please post DDL and follow ANSI/ISO standards when asking for help.