March 16, 2011 at 9:19 pm
Comments posted to this topic are about the item Implicit and Explicit Conversions
March 16, 2011 at 11:47 pm
I couldn't agree more. Whenever I carry out code reviews, especially for junior members of the team, I typically find these to be the most common errors they make. I have seen people using VARCHAR for storing everything (dates, money, floats)! Their excuse? Globalization/Internationalization.
We have a variety of data-types that can satisfy almost every computing and storage need - whenever a system is designed, time should be spent arguing the use of one data type of the other. Once the decision has been made, the code should not assume that the field will contain only one data pattern - unless the code is explicitly taking care of the same (eg. storing all dates in UTC, and then the representation is handled by the UI). Either way, exlicit checks & conversions should always be made.
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
March 17, 2011 at 4:31 am
I also agree. Absolutely. In my organization this is a pandemic. And it strikes specially with dates. Most dates are stored as numeric(8,0) in the central DB2 database, coming directly from fields with PIC(99999999) from transactional files. In SQL Server we need them to be real dates, because the datamart needs date arithmetics (and because that is the correct way to store a date value!). So we have to explicitly make every translation and to correct the many errors we find, filtering non-valid values at ETL time. Much of this effort could be saved by just defining DATE columns in the original database. We try to evangelize on this. But, as to date, even new tables are still created with numeric(8,0) columns for date values. "For agility purposes", they say. (????)
March 17, 2011 at 6:40 am
BITD(back-in-the-day), there were fewer options for storing data. Dates were generally stored either as 6-digits or 3, 2-digit fields. Fields were not changed when that data was ported to a SQL DB.
However, anything more recent should not be storing data incorrectly. A date should be a date or datetime data type. Editing should occur in the UI or the reports and not the database. Storing money with the currency sign the database is not necessary. Data warehouses may do that, though.
The closer you are to matching the datas intended use and its type, the better off you will be. It takes more time but it's worth it in the long run.
March 17, 2011 at 7:41 am
The representation of dates as character strings in a database (what I'll call a "VarDate") is one of the leading causes of business intelligence disasters.
The day will come when another peripheral application or ETL process will start inserting into or querying from your database, and they won't follow (or be aware of) the assumption for how the "VarDate" should be formatted.
Other RDBMS like Oracle follow different rules about how they implicitly convert data types. For example, SQL Server will reliably convert YYYY-MM-DD or YYYYMMDD to a Date, but Oracle will only implicitly convert a char value in the format 'DD-MMM-YYYY' or 'DD-MMM-YY'.
An even worse scenario than implicit conversion that throws an error is an implicit conversion that works, but works in a way that you didn't expect. The application appears to be running normally, but you discover months down the road that payroll or invoices submitted to clients been totaled incorrectly or duplicate records have been entered into tables where a VarDate column is part of the primary key.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 17, 2011 at 8:04 am
Nakul Vachhrajani (3/16/2011)
I have seen people using VARCHAR for storing everything (dates, money, floats)! Their excuse? Globalization/Internationalization.
How can anyone claim that storing a date as the string "09/01/2001" is better for globalization? :crazy: If you're using a US culture, that's "1st September 2001"; in most other cultures, it's "9th January 2001".
March 17, 2011 at 10:04 am
richardd (3/17/2011)
Nakul Vachhrajani (3/16/2011)
I have seen people using VARCHAR for storing everything (dates, money, floats)! Their excuse? Globalization/Internationalization.How can anyone claim that storing a date as the string "09/01/2001" is better for globalization? :crazy: If you're using a US culture, that's "1st September 2001"; in most other cultures, it's "9th January 2001".
What you can expect in this situation is that users in the Asian center will enter 9th January 2001 as '09/01/2001' and users in the US will enter 1st September 2001 as '09/01/2001', and of course management will complain that IT can never get the month end reporting working right.
If an application must use VarDate columns, then they should at least code it using ISO standard format YYYYMMDD and use a check constraint to enforce proper formatting like so:
create table MyTable
(
MyDate char(8) not null check (MyDate like '[1-2][09][0-9][0-9][01][0-9][0123][0-9]')
)
The advantage of the ISO format over 'MM/DD/YYYY' or 'MMM DD, YYYY' is that with 'YYYYMMDD' there is less ambiguity, and it will index, sort, and compare (<, >, between) correctly without requiring any datatype conversion to Date.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 17, 2011 at 10:08 am
Two comments:
1. If you're going to store dates in a varchar, at least store it in YYYYMMDD format.
2. Implicit conversions can also cause poor performing queries. Say you have a six-digit char column, zero-prefilled so that you have data like "001843". If you search for 1843, it will find the match, but it will also convert all of that column to an int. Even if you have an index on that column, it now won't be used.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
March 17, 2011 at 10:10 am
Interestingly, I inherited a process where the previous person stored integer values in a varchar(3) field and did poor quality checks on the data coming in. The people doing data entry often misread a zero for the letter D.
I was tasked with trying to code for some statistical samplings. The problem was, I could not match the previous statistics. Most years were very close and could mostly be attributed to different rounding methods between SQL Server and the stats package they had been using. One year of data was way off. This is when I found that I had two entries in the integer holding Varchar(2) field that were entered as D0. I had used logic to make the D a zero. However, the old stats package had apparently been treating it as hex and converted it to 208.
It was near impossible to convince the data owners that their data quality was responsible and that their previous reports were in error.
So, I guess my point is that, aside from poor QA on the data entry, that we do not always know how other tools will handle data when we do poor data typing.
By the way, it took 2 other analysts and a statistician to convince the data owners their data was of poor quality.
March 17, 2011 at 10:21 am
WayneS (3/17/2011)
2. Implicit conversions can also cause poor performing queries.
Especially in SQL CLR. If you pass in the wrong types for conversions, you might end up with multiple conversions of the same value back and forth.
March 17, 2011 at 10:31 am
Mike B in AK (3/17/2011)
Interestingly, I inherited a process where the previous person stored integer values in a varchar(3) field and did poor quality checks on the data coming in. The people doing data entry often misread a zero for the letter D.I was tasked with trying to code for some statistical samplings. The problem was, I could not match the previous statistics. Most years were very close and could mostly be attributed to different rounding methods between SQL Server and the stats package they had been using. One year of data was way off. This is when I found that I had two entries in the integer holding Varchar(2) field that were entered as D0. I had used logic to make the D a zero. However, the old stats package had apparently been treating it as hex and converted it to 208.
It was near impossible to convince the data owners that their data quality was responsible and that their previous reports were in error.
So, I guess my point is that, aside from poor QA on the data entry, that we do not always know how other tools will handle data when we do poor data typing.
By the way, it took 2 other analysts and a statistician to convince the data owners their data was of poor quality.
Sounds like fun! FWIW, I've yet to find anybody outside the database/developer world that really understood the difference between data and processing/software. One of the most common tasks I have is 'fixing' 'broken' software by locating the missing or incorrect data. It would be nice if I could validate the data, but I just don't see how to do it. I can't force anybody to open the software to record the data when it becomes available and that is by far the most common problem.
March 17, 2011 at 10:50 am
It seems we all agree that we should be using types that match the data and that our conversions should be explicit. So how did we get to the point where we even need to discuss this? I've never worked on systems without (for example) a date datatype, but I've also never worked on systems that would seem to have been migrated from 'type-less' systems, yet I've seen a lot of misuse of types. I try to cut the casual programmer some slack, but it's hard for me to imagine their thought processes. I taught myself this stuff and I think I always managed to select suitable data types. If not, I was at least spending time with books trying to figure out which data types made sense. My favourite was finding amount (money) in a character field and phone numbers in a numeric field 🙂
March 17, 2011 at 11:04 am
Ron Porter (3/17/2011)
It seems we all agree that we should be using types that match the data and that our conversions should be explicit. So how did we get to the point where we even need to discuss this? I've never worked on systems without (for example) a date datatype, but I've also never worked on systems that would seem to have been migrated from 'type-less' systems, yet I've seen a lot of misuse of types. I try to cut the casual programmer some slack, but it's hard for me to imagine their thought processes. I taught myself this stuff and I think I always managed to select suitable data types. If not, I was at least spending time with books trying to figure out which data types made sense. My favourite was finding amount (money) in a character field and phone numbers in a numeric field 🙂
I think importing legacy data is one cause we still have this conversation. More directly, the lack of time and resources ($) to do it correctly. it is always a fun battle to explain the reduced cost to do it right the first time. Another reason is probably laziness or inexperience to do the job right.
March 17, 2011 at 11:23 am
Ron Porter (3/17/2011)
Mike B in AK (3/17/2011)
Interestingly, I inherited a process where the previous person stored integer values in a varchar(3) field and did poor quality checks on the data coming in. The people doing data entry often misread a zero for the letter D.I was tasked with trying to code for some statistical samplings. The problem was, I could not match the previous statistics. Most years were very close and could mostly be attributed to different rounding methods between SQL Server and the stats package they had been using. One year of data was way off. This is when I found that I had two entries in the integer holding Varchar(2) field that were entered as D0. I had used logic to make the D a zero. However, the old stats package had apparently been treating it as hex and converted it to 208.
It was near impossible to convince the data owners that their data quality was responsible and that their previous reports were in error.
So, I guess my point is that, aside from poor QA on the data entry, that we do not always know how other tools will handle data when we do poor data typing.
By the way, it took 2 other analysts and a statistician to convince the data owners their data was of poor quality.
Sounds like fun! FWIW, I've yet to find anybody outside the database/developer world that really understood the difference between data and processing/software. One of the most common tasks I have is 'fixing' 'broken' software by locating the missing or incorrect data. It would be nice if I could validate the data, but I just don't see how to do it. I can't force anybody to open the software to record the data when it becomes available and that is by far the most common problem.
A classic example where you encounter typeless columns is the vertical all-purpose table that fits anything.
referral_id datatype_id data_value
100 4 9/3/2003
100 2 12.75
100 9 Y
101 4 Jan 15, 2007
101 2 $10.50
101 9 Y
Also, I see this thing a lot in datawarehouses where the data model is designed by someone who knows the business, but their expertise not really data modeling. Or perhaps a junior developer looks at the requirements for a report, which state that invoice date must be formatted as MM/DD/YYYY, and he confuses a data presentation requirement for a data storage requirement.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 17, 2011 at 11:32 am
I too agree, explicit conversions are always best, the problem is that implicit conversions are very easy to miss.
I was thinking recently it would be nice to have some sort of flag you could turn on like "IMPLICIT CONVERSION WARNINGS", and these would show up when you do a syntax check.
pef
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply