August 24, 2018 at 8:31 am
Hi,
Let's keep this simple (for my sake) and say we have two tables.
Table Employee contains:
e.EmployeeID e.Employee e.BadgeAwarded e.CentreID
Table Centre contains:
c.CentreID c.CentreName c.Region
The resultset required is:
EmployeeID CentreID
Should I retrieve the fields from just the Employee single table like:
SELECT e.EmployeeID, e.CentreID
FROM Employee AS e
Or should I use the Primary Key (well, the Primary Key in the original SQL db; it's since been loaded into Hadoop, and now I'm extracting from Hadoop, which is schema-on-read) like:
SELECT e.EmployeeID, c.CentreID
FROM Employee AS e JOIN Centre AS c
ON e.CentreID = c.CentreID
The latter uses a join, which adds to the response time. However, I'm wondering if it is best practice to use the Centre's version of the attribute since, back in the SQL db, I expect it would've been the Primary Key and therefore have better integrity. Plus, as a PK it would've had an index, but I'm not sure whether an index has been placed on it in Hadoop.
Thanks.
August 24, 2018 at 8:49 am
chocthree - Friday, August 24, 2018 8:31 AM
Best practice is to only pull in necessary tables. A Foreign Key and a Primary Key will have the exact same integrity if the Foreign Key is trusted. If it's not trusted, you better have a VERY GOOD REASON for it not to be trusted.
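For reference, in SQL Server you can check whether a foreign key is trusted and, if needed, re-validate it. A minimal sketch, assuming SQL Server and a hypothetical constraint name FK_Employee_Centre on the Employee table from the question:
-- is_not_trusted = 1 means the FK was created or re-enabled WITH NOCHECK,
-- so the optimizer cannot rely on it.
SELECT name, is_not_trusted, is_disabled
FROM sys.foreign_keys
WHERE name = 'FK_Employee_Centre';
-- Re-check all existing rows and mark the constraint as trusted again.
ALTER TABLE dbo.Employee
WITH CHECK CHECK CONSTRAINT FK_Employee_Centre;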
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
August 24, 2018 at 9:12 am
Ok. Thanks.
I don't have visibility into the original SQL db, so I can only presume the integrity constraints were/are in place throughout. I'm actually guessing that CentreID was a PK. I would assume so, but I have seen tables without the constraints set.
August 24, 2018 at 10:05 am
Agree with Drew. I would only query Employee, but I'd want a trusted FK in place here.
August 24, 2018 at 12:40 pm
chocthree - Friday, August 24, 2018 9:12 AM
In order for the FK constraint to be put in place, it has to reference columns covered by a unique constraint (or unique index) on the referenced table. It only matters that the referenced columns be unique, not that they form the PK.
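A hedged illustration (the table and column names here are made up for the example, not taken from the original schema): a foreign key can reference a UNIQUE constraint even when that column is not the primary key.
CREATE TABLE Centre
(CentreSurrogateID INT NOT NULL PRIMARY KEY,  -- the actual PK
 CentreID CHAR(5) NOT NULL UNIQUE,            -- a separate unique constraint
 CentreName VARCHAR(25) NOT NULL);
CREATE TABLE Employee
(EmployeeID INT NOT NULL PRIMARY KEY,
 CentreID CHAR(5) NOT NULL
     REFERENCES Centre (CentreID));           -- references the UNIQUE column, not the PK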
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
August 24, 2018 at 4:16 pm
>> Let's keep this simple (for my sake) and say we have two tables. <<
You are keeping things too simple. This is why, for the past 30+ years, posting DDL has been minimal netiquette. You also don't know some of the basics of data modeling. For example, unless you really have only one employee, the name of your table should be a collective or plural noun, because it models a set and not a single entity. Then, by definition (not as an option, as in your narrative), the table must have at least one key. Your awarded badges and centers are not really attributes; they are entities that should have their own tables, and their relationship to your personnel should be expressed in other tables. This is part of normalization (a sketch of separate badge tables follows the DDL below).
My spelling checker doesn't like the French spelling, so I'm using "center" instead of "centre"; that's kind of minor.
CREATE TABLE Centers
(center_id CHAR(5) NOT NULL PRIMARY KEY, --- Required!
 center_name VARCHAR(25) NOT NULL,
 region_name CHAR(10) NOT NULL);
CREATE TABLE Personnel
(emp_id CHAR(10) NOT NULL PRIMARY KEY, --- Required!
 emp_name VARCHAR(35) NOT NULL,
 awarded_badge_nbr CHAR(5) NOT NULL,
 center_id CHAR(5) NOT NULL
     REFERENCES Centers (center_id)); --- Important!!
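As a rough sketch of the "badges are entities" point above (these table and column names are assumptions for illustration, not something given in the post), the awards would live in their own tables rather than as a column on Personnel:
CREATE TABLE Badges
(badge_nbr CHAR(5) NOT NULL PRIMARY KEY,
 badge_name VARCHAR(35) NOT NULL);
CREATE TABLE Badge_Awards  -- the relationship between Personnel and Badges
(emp_id CHAR(10) NOT NULL
     REFERENCES Personnel (emp_id),
 badge_nbr CHAR(5) NOT NULL
     REFERENCES Badges (badge_nbr),
 award_date DATE NOT NULL,
 PRIMARY KEY (emp_id, badge_nbr));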
>> Should I retrieve the fields [sic] from just the Personnel single table like:
SELECT P.emp_id, P.center_id
FROM Personnel AS P;
<<
Yes. It involves no joins, and the REFERENCES clause (which you didn't show in your non-DDL narrative) will ensure that any center_id that is used has a match. In other SQL products, this would be enforced with pointer chains to a single occurrence of each center_id in the referenced table.
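To illustrate (the values are made up, assuming the DDL above), the constraint rejects any row whose center_id has no match in Centers:
-- Fails with a foreign key violation (error 547 in SQL Server)
-- because 'XX999' does not exist in Centers.
INSERT INTO Personnel (emp_id, emp_name, awarded_badge_nbr, center_id)
VALUES ('E000000001', 'Test Person', 'B0001', 'XX999');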
>> .. Or should I use the PRIMARY KEY (well the PRIMARY KEY in the original SQL db but it's since been loaded to Hadoop and now I'm extracting from Hadoop which has schema on read) <<
The lack of data integrity in Hadoop is one of the many reasons I never recommend it to a client. Those products are strictly for queries but have nothing to protect data integrity. My attitude is that if it doesn’t have to be right, then let’s do a Douglas Adams and always return 42 🙂
>> SELECT P.emp_id, C.center_id
FROM Personnel AS P
INNER JOIN
Centers AS C
ON P.center_id = C.center_id;
The latter uses a join, which adds to the response time. However, I'm wondering if it is best practice to use the Center's version of the attribute since, back in the SQL db, I expect it would've been the PRIMARY KEY and therefore have better integrity. Plus, as a PK it would've had an index, but I'm not sure whether an index has been placed on it in Hadoop. <<
Actually, if you have a decent SQL optimizer, it will detect that the INNER JOIN is redundant and remove it from the execution plan. All you've done is replace a simple query with a more complicated one that gained you absolutely nothing.
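A quick way to see this, assuming SQL Server, the DDL above, and a trusted NOT NULL foreign key on Personnel.center_id: compare the actual execution plans of the two forms; the optimizer can typically eliminate the join from the second one because the constraint already guarantees a match.
-- With a trusted FK, both of these should produce essentially the same plan:
-- a scan of Personnel only, with no join operator.
SELECT P.emp_id, P.center_id
FROM Personnel AS P;
SELECT P.emp_id, C.center_id
FROM Personnel AS P
INNER JOIN Centers AS C
    ON P.center_id = C.center_id;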
Please post DDL and follow ANSI/ISO standards when asking for help.