April 1, 2012 at 8:50 am
Hello everyone/
First of all, i'm not quite sure if my issue is already solved in another thread because i dont know which keywords should i use in the search bar for it to come up(noob issue). Furthermore i have very limited understanding of SQL in general, so i decided it was better to post my specific scenario so i can get feedback in terms i could understand more easily. Now having said that, im going to elaborate on my situation right now.
I have a very unnormalized database with a lot of data(Approx. 6 millon records in the main table to be exact):
tbl_People:
personId (pk)
name
lastname
socialSecurity#
street
houseNumber
City
State
(and much more fields)
This table has pretty much all the information about the People, and it has 6million records in it.
My task consists in normalize this table, which means taking out all the columns that should be in a separate table and assign just the id of that new table to tbl_People.
So, instead of having all the address fields inside tbl_People, I should only have the corresponding addressId on each row. This is what i want:
tbl_address
addressId (pk) (auto incremental)
street
houseNumber
City
State
tbl_People:
personId (pk)
name
lastname
socialSecurity#
AddressId <---
I have already created my normalized database, created new tables for the people and adress, imported all the data to my new people table without the address info.
In the address table i set up an incremental addressId PK and imported all the address rows from the old unnormalized tbl_people.
My problem is: How can i update the newTbl_People setting the addressId for each person with the addressId that correspond to each person?
I apologize in advance if this question was already solved. If so, please provide a link to a similar problem. But i would prefer to have advice on my specific issue here
April 1, 2012 at 9:20 am
It really depends what you need out of this reorganisation of the tables.
1 - If each person had a unique address that you will not save any space or time at all.
2 - If you wanted to allow for multiple people within a single address then the structure you have now designed does this.
3 - If you wanted to allow for multiple people within a single address and for people to have multiple addresses then this would not be the correct design.
OK given that option 2 is your chosen requirement.
1. Have the tables tbl_Address and tbl_People created but not populated to start with.
2. Run a select distinct command on the original source table using street, housenumber, city and state and insert these into the new tbl_Address table using INSERT INTO.
INSERT INTO tbl_Address (Street, HouseNumber, City, State)
select distinct street, housenumber, city, state
from tbl_OriginalPeople
3. Run a select command that joins the original source table into the tbl_Address joining on street, housenumber, city and state. INSERT the personid, name, lastname, socialsecurity# and addressID from the tbl_Address into the tbl_People.
INSERT INTO tbl_People (PersonID, Name, LastName, SocialSecurity, AddressID)
select P.personid, P.name, P.lastname, P.socialsecurity, A.AddressID
from tbl_OriginalPeople as P
inner join tbl_Address as A
on P.Street = A.Street AND P.HouseNumber = A.HouseNumber AND
P.State = A.State AND P.City = A.City
Hope this helps,
Fitz
April 1, 2012 at 9:25 am
This is what I would do.
Ensure the tables are there. Populate addresses followed by the new people table then rename the old objects.
You arent clear on whether you need to cleanse your old address data but given that this is on a per person basis this is quite likely.
CREATE TABLE [dbo].[tbl_People](
[personId] [int] NOT NULL,
[name] [nvarchar](100) NULL,
[lastname] [nvarchar](100) NULL,
[socialSecurity#] [varchar](20) NULL,
[street] [nvarchar](100) NULL,
[houseNumber] [varchar](5) NULL,
[City] [nvarchar](60) NULL,
[State] [char](2) NULL,
PRIMARY KEY CLUSTERED
(
[personId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[tbl_address](
[addressId] [int] IDENTITY(1,1) NOT NULL,
[street] [nvarchar](100) NULL,
[houseNumber] [varchar](5) NULL,
[City] [nvarchar](60) NULL,
[State] [char](2) NULL,
PRIMARY KEY CLUSTERED
(
[addressId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[tbl_Peoplenew](
[personId] [int] NOT NULL,
[name] [nvarchar](100) NULL,
[lastname] [nvarchar](100) NULL,
[socialSecurity#] [varchar](20) NULL,
[addressid] [int] NULL,
PRIMARY KEY CLUSTERED
(
[personId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE dbo.tbl_Peoplenew ADD CONSTRAINT
FK_tbl_People_tbl_address FOREIGN KEY
(
addressid
) REFERENCES dbo.tbl_address
(
addressId
) ON UPDATE NO ACTION
ON DELETE NO ACTION
GO
INSERT INTO [db_dba].[dbo].[tbl_address]
([street]
,[houseNumber]
,[City]
,[State])
SELECT DISTINCT [street]
,[houseNumber]
,[City]
,[State]
FROM [dbo].[tbl_People]
GO
INSERT INTO [db_dba].[dbo].[tbl_Peoplenew]
([personId]
,[name]
,[lastname]
,[socialSecurity#]
,[addressid])
SELECT p.[personId]
,p.[name]
,p.[lastname]
,p.[socialSecurity#]
,a.addressId
FROM [dbo].[tbl_People] p
INNER JOIN [dbo].[tbl_address] a ON p.street = a.street and p.houseNumber = a.houseNumber and p.City = a.City and p.[State] = a.[State]
GO
exec sp_rename 'tbl_People','tbl_PeopleOLD'
GO
exec sp_rename 'tbl_Peoplenew','tbl_People'
GO
April 1, 2012 at 10:05 am
My goal is to reduce address redundancy as much as possible since it is likely that there will be people living in the same address. Both answers seem pretty accurate and effective, i will work on it and let you guys know. Thanks a lot
April 1, 2012 at 10:43 am
nicoalmontec (4/1/2012)
My goal is to reduce address redundancy as much as possible since it is likely that there will be people living in the same address. Both answers seem pretty accurate and effective, i will work on it and let you guys know. Thanks a lot
If that's the case you will need to tidy the address table first and modify the join criteria
In the past I've done this using the fuzzy matching in enterprise as its pretty common to find typos in this kind of address data
April 1, 2012 at 11:56 am
1- I created tbl_Address and inserted all distinct addresses, which gave me around 4 million records.
use tbl_newPeople
INSERT INTO tbl_address(street, house#, building, floor, city, state)
select distinct street, house#, building, floor, city, state
from tbl_OriginalPeople
around 4 million rows affected
2- Inserted the data in tbl_newPeople
INSERT INTO tbl_newPeople(PersonId,name, lastname, (all other fields), idAddress )
select P.personId, P.name, P.lastname, P.(all other fields), A.addressId
from tbl_OriginalPeople as P
inner join tbl_Address as A
on P.Street = A.Street AND P.HouseNumber = A.HouseNumber AND
P.State = A.State AND P.City = A.City
However, this is only inserting a few 2 million and something into my new people table...
The original table has some serious integrity issues, there are records in which street or houseNumber or state are NULL or empty, i dont know if this might be causing the problem, anyways im not sure
Step 1 returned something around 4 million distinct addresses, so if im not mistaken. there must be at least that same amout of matches found on the query on step two, There is something wrong here..
April 1, 2012 at 1:22 pm
nicoalmontec (4/1/2012)
1- I created tbl_Address and inserted all distinct addresses, which gave me around 4 million records.
use tbl_newPeople
INSERT INTO tbl_address(street, house#, building, floor, city, state)
select distinct street, house#, building, floor, city, state
from tbl_OriginalPeople
around 4 million rows affected
2- Inserted the data in tbl_newPeople
INSERT INTO tbl_newPeople(PersonId,name, lastname, (all other fields), idAddress )
select P.personId, P.name, P.lastname, P.(all other fields), A.addressId
from tbl_OriginalPeople as P
inner join tbl_Address as A
on P.Street = A.Street AND P.HouseNumber = A.HouseNumber AND
P.State = A.State AND P.City = A.City
However, this is only inserting a few 2 million and something into my new people table...
The original table has some serious integrity issues, there are records in which street or houseNumber or state are NULL or empty, i dont know if this might be causing the problem, anyways im not sure
Step 1 returned something around 4 million distinct addresses, so if im not mistaken. there must be at least that same amout of matches found on the query on step two, There is something wrong here..
It may well be nulls that are causing the problem. You can guarantee that any case where one of the address fields is null will not go into the new table with the code you have there.
Also, it appears that you have 6 address fields (street, house#, building, floor, city, state) but match
only on 4, which may give you some problems too.
Changing the insert code for the new people table to be
INSERT INTO tbl_newPeople(PersonId,name, lastname, (all other fields), idAddress )
select P.personId, P.name, P.lastname, P.(all other fields), A.addressId
from tbl_OriginalPeople as P inner join tbl_Address as A
on (P.Street = A.Street or (P.Street is NULL and A.Street is NULL)) AND
(P.House# = A.House# or (P.House# is NULL and A.House# is NULL)) AND
(P.State = A.State or (P.State is NULL and A.State is NULL)) AND
(P.City = A.City or (P.City is NULL and A.City is NULL)) AND
(P.building = A.building or (P.building is NULL and A.building is NULL) AND
(P.floor = A.floor or (P.floor is NULL and A.floor is NULL))
should fix both those problems (although for example it's not clear whether the column name is state or State or one in one table and the other in another, is it housenumber or house#, is that column really called floor - you should avoid using names of built in sql functions as column or tabvle names - and so on, so you will need to get those bits right as well).
If you still have a problem after that, tell us some more detail.
Also you should look seriously at whether the NULLs can be eliminated. For address components it's sensible to use an empty string instead of a NULL - but only if your apps don't rely on seeing a null rather than an empty string for inapplicable fields (which is poisonously bad design, but unfortunately very common).
Tom
April 1, 2012 at 2:22 pm
Your solution does the work satisfcactory, however i decided to insert the letter 'M' in all those null fields indicating that the field data is "missing". I also fixed my query according to your advice and it works now. Each solution you guys provided are correct and thank you all for the help.
..Might stick around for a while in this forum.
Cheers
April 4, 2012 at 9:55 pm
Okay now im stuck again with this task..
Now that i have inserted all the corresponding addressId to each person, i have to insert another Id to this same table. I already know i cant do any inserts since the records are already opened. I need to update the table, setting this new id column = a joined select statement just like the one you recomended.
This is my update query:
update tbl_newPeople
set idDocument =
(select P.document_type, P.document_city, P.document_book, A.idDocument
from original_People as P
inner join Document as A
on P.document_type = A.document_type AND P.document_city = A.document_city
AND P.document_book = A.document_book)
It turns out that im getting this error : "Only one expression can be specified in the select list when the subquery is not introduced with EXISTS."
I have been reading about this error, according to what i learned, its not going to work since im giving more than one expression in the select statement and that i have to use EXISTS.. im not quite sure how am i supposed to achieve it..
April 4, 2012 at 9:57 pm
Try this. The only issue it may have is if the subquery returns more than one row.
update tbl_newPeople
set idDocument =
(select A.idDocument
from original_People as P
inner join Document as A
on P.document_type = A.document_type AND P.document_city = A.document_city
AND P.document_book = A.document_book)
April 4, 2012 at 10:13 pm
The query you suggested did bring more than one value, which is not supposed to happen because there should be only one documentId that matches the criteria given...
April 4, 2012 at 10:17 pm
nicoalmontec (4/4/2012)
The query you suggested did bring more than one value, which is not supposed to happen because there should be only one documentId that matches the criteria given...
You need to check your data to see why more than one row is being returned. Copy out the select and add back the columns needed to see what is being retunrned.
April 4, 2012 at 10:28 pm
tbl_newPeople store records of related people, therefore they might as well live together and have the same addressId.
tbl_Document and tbl_address data have something in common:
The same way people share an addressId, some people have the same documentId, so modifying the query or the records structure isnt a possibility.
We need to think of a way of getting everyone their corresponding documentId no matter if more than 1 person has it. Once again thank you all for your help
April 5, 2012 at 12:09 am
The real reason that the update fails is that the subquery is not related to the table being updated. This causes the subquery to join all rows from the table and attempt to update the single value to have all the documentIDs.
The code below would show the technique required. I have added a set of dummy data with sort of matches your data to show.
create table Documents(
DocumentID int,
Document varchar(max),
DocumentCity varchar(20),
DocumentType varchar(30)
)
create table OldPeople(
PersonID int,
PersonName varchar(30),
DocumentCity varchar(20),
DocumentType varchar(30)
)
create table NewPeople(
PersonID int,
PersonName varchar(30),
DocumentID int
)
-- Dummy Example Data
insert into Documents values (1000,'First Document','London','Dummy1')
insert into Documents values (1001,'Second Document','Manchester','Dummy1')
insert into Documents values (1002,'Third Document','London','Dummy2')
insert into OldPeople values (1,'Eric','London','Dummy1')
insert into OldPeople values (2,'Paul','London','Dummy2')
insert into OldPeople values (3,'David','Manchester','Dummy1')
insert into OldPeople values (4,'Simon','London','Dummy1')
-- Transfer OldPeople to NewPeople
insert into NewPeople
select PersonID, PersonName, NULL
from OldPeople
select * from NewPeople
-- First attempt at adding DocumentID (as per thread)
-- this fails as subquery returns every joined row and not the
-- row specific to the newpeople record being updated
update NewPeople
set DocumentID =
(select DocumentID
from Documents as D
inner join OldPeople as OP
on D.DocumentCity = OP.DocumentCity
AND D.DocumentType = OP.DocumentType)
-- Correct version
update NP
set DocumentID = D.DocumentID
from NewPeople as NP
inner join OldPeople as OP
on NP.PersonID = OP.PersonID
inner join Documents as D
on D.DocumentCity = OP.DocumentCity
AND D.DocumentType = OP.DocumentType
select * from NewPeople
Fitz
April 5, 2012 at 6:13 am
Your query works just perfect, it took a while on my coal powered laptop but it did the work as it should, thanks for the help..Please do not close this thread as i havent finished the task yet. Im good to go for now
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply