August 16, 2016 at 5:48 pm
Table Structures:
Student
-- ID
-- FirstName
-- LastName
StudentTelephoneNumber
-- ID
-- StudentID
-- TelephoneNumber
StudentAddress
-- ID
-- StudentID
-- City
-- PostalCode
StudyProgram
-- ID
-- Name
StudentStudyProgram
-- ID
-- StudentID
-- StudyProgramID
Relationships:
Student Parent - StudentTelephoneNumber / StudentAddress Child
Student / StudyProgram Parent - StudentStudyProgram Child
Given the above structure, what would be the best way of constructing a query in which the input would be a text value, and the output would be a wildcard (or, ideally, a proximity) search across ANY of the above tables? In other words, if I wanted to search for "Macaroni", it would search for any value in Student.FirstName, Student.LastName, StudentTelephoneNumber.TelephoneNumber, StudentAddress.City, StudentAddress.PostalCode, and StudyProgram.Name, ideally as a proximity search (values that resemble "Macaroni", like "Macarony" or "Macaronee"), though if necessary, as a wildcard serch (values that contain "Macaroni", like "I like Macaroni" or "Macaroni is the best"), how would I best construct the query?
The obvious answer is to build it like follows:
DECLARE @Input VARCHAR(MAX)
SET @Input = 'Macaroni'
SELECT DISTINCT
Student.ID
FROM Student
LEFT JOIN StudentTelephoneNumber ON StudentTelephoneNumber.StudentID = Student.ID
LEFT JOIN StudentAddress ON StudentAddress.StudentID = Student.ID
LEFT JOIN StudentStudyProgram ON StudentStudyProgram.StudentID = Student.ID
LEFT JOIN StudyProgram ON StudyProgram.ID = StudentStudyProgram.StudyProgramID
WHERE(
(Student.FirstName LIKE '%' + @Input + '%') OR
(Student.LastName LIKE '%' + @Input + '%') OR
(StudentTelephoneNumber.TelephoneNumber LIKE '%' + @Input + '%') OR
(StudentAddress.City LIKE '%' + @Input + '%') OR
(StudentAddress.PostalCode LIKE '%' + @Input + '%') OR
(StudyProgram.Name LIKE '%' + @Input + '%')
)
I worry about performance issues with constructing the query like that, though. While there will only be around 15000 active records in each table, these records will be supplanted every few months with a new list of records, with those that no longer constitute the list being set to "deleted", though still remaining in the table. Since I'm using wildcard searches on both sides, indexes would serve no purpose, so I worry that the query will start taking longer and longer to run. Also, as the project grows, it is likely that more columns will be requested to be searched, so I'd like to minimize both the extra work necessary to modify the query, and the extra time added to the searching. I'd also of course have to validate the input to avoid potential injection attacks.
I've heard that full-text search could work for this, but I'm not really sure how I would go about constructing the query to handle it. I'm also not sure if the performance of the full-text search would be up to par, or if it would suffer the same problems. Finally, I'm not sure how the full-text search maintenance would affect the database, IE, when importing new lists, or modifying the existing ones.
Anyone have any suggestions? I don't need someone to construct a full answer for me, just to point me in the right direction.
Thanks!
NOTE - if any potential solutions involve a version of SQL Server beyond 2008, these are also acceptable.
August 17, 2016 at 7:30 am
Every IT organization I've ever worked for has had a Patient/Customer/Student search application similar to this, and optimizing this type of thing is a classic problem. My default advice is not to go the route of a single left-join everything query, because the execution is complicated and near impossible to optimize or cache for reuse. Instead, take try unionizing the result of multiple more discrete SELECTs. Therefore, each SELECT is simpler and can leverage an INNER JOIN, rather than less efficient OUTER JOINs. Also, there is no need to use the DISTINCT clause in each SELECT statement, because the UNION keyword will perform it's own distinct operation as it combines resultsets.
SELECT ID
FROM Student
WHERE (Student.FirstName LIKE '%' + @Input + '%') OR
(Student.LastName LIKE @Input + '%')
UNION
SELECT ID
FROM StudentStudyProgram ON StudentStudyProgram.StudentID = Student.ID
INNER JOIN StudyProgram ON StudyProgram.ID = StudentStudyProgram.StudyProgramID
WHERE StudyProgram.Name LIKE @Input + '%'
UNION
...
...
...
https://msdn.microsoft.com/en-us/library/ms180026.aspx
The part about needing to perform a partial match filter is problematic, because SQL Server can only perform a sargable (indexed) keyword match on leading characters.
For example, this is sargable:
StudyProgram.Name LIKE @Input + '%'
But this is not sargable:
StudyProgram.Name LIKE '%' + @Input + '%'
So, for each SELECT statement, try to use the sargable version of the LIKE comparison.
You can also leverage the SOUNDEX function to perform similarity comparisons. This can be done by adding a computed column to the table and then adding an index on the computed column.
https://msdn.microsoft.com/en-us/library/ms187384.aspx
http://blog.sqlauthority.com/2010/08/22/sql-server-computed-columns-index-and-performance/
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
August 17, 2016 at 8:28 am
I'm aware of the limitations of wildcards on indexes, which is one of my principal concerns of performance - normally, a table with several hundred thousand rows marked as "deleted" would not be cause for concern, since a simple covering index reduces the searchable records considerably.
Your suggestion of splitting the query up into bits and then merging the results together though, is an interesting one. I hadn't considered that, and it would at very least allow me to go from a clustered index scan to an index scan - for example, having an index on DateDeleted and TelephoneNumber for the StudentTelephoneNumber table.
I've used SOUNDEX in the past, and another variant called the Double Metaphone search, but the limitations there are that both algorithms are intended for use in searching a single word against another word, and don't work nearly as well when you are searching for combinations of words, like "Computer Engineering" matching with "Bachelor's in Computer Engineering".
Right now I'm thinking the best solution might be to put full-text indexes on each table, and then search each table independently and UNION the results together.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply