In the first part of this series we focused on three main objectives: 1) we considered the question of why we might bit mask (bitmask) in the database, 2) we reviewed numbers, counting, base 2, base 10, and 3) we solved a simple problem using bitmasking. For that first part, we engaged a method which could be termed "bit cramming" if you will; packing multiple bits of data into a single binary data type, and represented the values of those bits. When it came time to unmask those values, we use the various bitwise operators and then decoded the values back to independent elements. That example demonstrated the theoretical aspects of bitmasking – today, you would probably not use this technique as a general rule in a relational database, although I still receive plenty of email from SQL programmers needing information and looking to use the technique in some form or fashion within their own environments.
In the days where space was expensive and scarce, the notion of writing compact code and conserving resources was anything but an afterthought. Today, although we have fast processors, lots of RAM, and disk space, we still can use this technique to solve problems for which an alternative solution is suboptimal. In Part 2 of this series, we expand the technique to a more practical problem in SQL Server, and in Part 3 we will expand the programmability of bitmasking.
Bitmasking representations
Bitmasking is an exercise of compiling multiple values that are normally represented in a number of data types to a single computer word. In a previous example, we used the binary data type to see how bit manipulation is represented visually. Four bits, a nibble, allowed us to see bits map to the visual representation of our data in hexadecimal form; aside from this, there was really no other reason to use the binary data type, so for this article we will use integers for our representations.
Consider a web form with 25 checkboxes and 25 dropdowns. How would a programmer or database developer manage putting the SQL together to send as a query to the database? A WHERE clause with 50 elements? LINQ? A "homespun" split function in TSQL? Appending 50 parameters in c# code? Or maybe use sp_executesql directly to n number of tables? All are valid ways of achieving the desired result, yet none is as elegant as using the technique described here. But bitmasking is also more than just representing our data easily. We can also use the bitwise operators to check, search, sort, and alter masked data. Simply combining data values into one column is very nice, but the magic occurs when you can peek inside of the mask to analyze the values.
At this point, I suggest that you (re)visit the first part of the series to gain a refresher on bitmasking, operators, and related terms. If you are comfortable with what has been covered up to now, or are familiar with the topic, skip past the next section. Otherwise, let’s continue. For a quick warm-up, recall that our bit counting went something like 0, 1, 2, 4, 8, etc. Since we are using integers for our assignments, the absence of "bit cramming" and bitshifting will make our operation this time more straightforward and more palatable. So, for each attribute that we want to represent, a single value – integer pair will suffice.
Masking using integers
Let us take a look at bitmasking with integers. We will limit our masking to only two simple operators, the bitwise AND (&) and the bitwise OR (|) for our calculations here. For our scenario, we are taking orders for automobiles, and we wish to store the options that a customer has chosen in a masked integer. This allows us to represent those values easily in one column in the database.
/* * * * * * * * * * * * * * Simple Masking exercise using bitwise AND and OR. Adding car options to an order. Attribute Value
-------------------------------------- Car only 1 400 hp 6.0 liter engine 2 Automatic Transmission 4 Custom stereo w/ JBL speakers 8 Tint 16 Supercharger 32 Alarm system 64 Turbocharger 128 * * * * * * * * * * * * * * */
-- This is a car with no options (resulting mask in parenthesis) SELECT 1 | 1 -- (1) -- Let's add the big motor! SELECT 1 | 2 -- (3) -- And the cool stereo SELECT 1 | 2 | 8 -- (11) -- Got to have the supercharger! SELECT 1 | 2 | 8 | 32 -- (43) -- Oops, wife says no supercharger SELECT 43 & 11 -- (11) -- Add the alarm SELECT 11 | 64 -- (75 our final mask value that represents our options) -- I forgot - did we add the big motor? SELECT CASE WHEN 75 & 2 = 2 THEN 'Yes we did' ELSE 'No' END
These are very straightforward operations for building a mask can be using bitwise AND and bitwise OR operators. Notice how the | builds the mask (or adds to it) and the "&" takes away from, so to speak, and assists us with checking the mask value. From previous knowledge, we know that all we are doing is flipping bits on and off behind the scenes. These values can then be easily inserted into a table and used in a search:
IF OBJECT_ID('CustOrders')IS NOT NULL DROP TABLE CustOrders GO CREATE TABLE CustOrders (CustID int NOTNULL PRIMARY KEY ,CustFirst varchar (10) ,CustLast varchar (10) ,Car varchar (20) ,OptionsMask int ) GO
INSERT INTO dbo.CustOrders
(CustId, CustFirst, CustLast, Car, OptionsMask) VALUES (1,'Sally','Smith', 'Camaro', 3) ,(2,'Joe', 'Celko','Monaro', 25) ,(3,'Steve', 'Jones','GTO', 51) -- So, who bought car w/6.0 litre and supercharger? Why, none other than Steve Jones! SELECT * FROM dbo.CustOrders WHERE OptionsMask & 34 =(2|32)
Do the possibilities of performing such an operation begin to appear? If not, they will shortly. Notice that we have taken several "options", or attributes, and represented them by a single integer value. No need for eight columns here as we can combine all of our values into one value. The value is easy to manage, compact, and the concept surprisingly simple.
On leveraging .Net and the SQL CLR
Many of us have either dabbled in or used the SQL CLR a time or two since the release of SQL Server 2005, and some of us have actually incorporated .Net assembly into the database namespace; if nothing else you have at least heard about or have read documentation on the its existence. While this is not a forum to debate the merits of the SQL CLR, or to engage the reader in a debate on whether or not C# or VB.NET code should reside in the database or in the middle-tier, using a programming language to assist with bitmasking is a worthwhile technique. Because it is a procedural process, this technique naturally fits well with either c# or vb.net.
We have oftentimes heard the discourse that TSQL is a set-based language, and is somewhat lacking when it comes to procedural coding. Because we have the CLR available now, gone are the days of having to fight through creating and making safe extended stored procedures in SQL Server. Available to us now are all of the rich methods and properties of the .Net library. And while all of the code in this article may be written in TSQL (I choose to parlay its strengths into the following CLR code) it can readily be handled wherever the programmer may choose based on preference and/or performance.
A bit of history
Over ten years ago I worked at Match.com in Dallas, TX as a DBA. The original Match programmers, based in the San Francisco Bay Area, implemented for their search engine using pl/sql and C++ a really nice facility for potential love mates to search for their "one and only", and more specifically, a search routine that relied heavily on bitmasking. When the company was sold and operations moved to the Dallas/Fort Worth area, the database was converted to SQL Server from the Oracle/Unix environment, and the search engine rewritten in TSQL continued to perform effectively. The rest is history, as they say. Match.com continues to be the number one online dating web site in the world.
The reason that I bring up Match is twofold. First, I appreciate the programming techniques that were used back then, and second, it provides a platform which we can create meaningful and relevant examples for bitmasking.
Our Challenge
Our challenge is to create a search engine. To set the stage, we will use the real-world scenario from a fictitious dating web site – multiple dropdowns that require fast searching. (For the sake of brevity, we will limit the number of options). This type of search is suitable for bitmasking as all of the items are predefined and are not of the free-form, full text search variety. The assumptions and requirements for our search engine using bitmasking are 1) three drop-downs on a web form, 2) search attributes include hair, age, and gender, and 3) the matches must return a rank for hit percentage. We will use TSQL to query the database, and deploy a CLR Table-Valued Function (TVF) to handle our masking operation. Remember – the logic for handling this can be moved anywhere, including to either TSQL or business logic layer (BLL). Additionally, a traditional scalar UDF or CLR UDF could replace the TVF. Once again, I stress that all of the locations for the code are interchangeable, so I suggest that you experiment with various combinations in order to decide which will work best for your scenario.
The power of this technique will now become clear as we search on a masked value against another mask. Each searching customer will have two masks – one mask that represents "who he or she wants", and another to represent "what he or she is". When a person is searching, we will take the attributes for the persons that they are looking for - "searchees" - and match them to the attributes of the "searchor".
Steps to Build the Example Search Engine
1) Create a key for mask representations – Search on three hair colors, three age bands, and gender and assign them integer values. I suggest that you always jot down your data mapping prior to beginning any bit assignments.
/* * * * * * * * * * * * * * Search Engine Key to Assign Bits
Attribute Value
-------------------------------------- Blonde 1 Brunette 2 Red 4 Age_1822 8 Age_2327 16 Age_2835 32 Male 64 Female 128 * * * * * * * * * * * * * * */
2) Create the .Net CLR code – a Table-valued User-defined function written in c#
using System; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using System.Collections; using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions { #regionUser-defined function
//Attributes for table that is to be returned [Microsoft.SqlServer.Server.SqlFunction( FillRowMethodName = "RowFiller", TableDefinition = "Gender INT, Hair INT, Age INT, SearchResults DECIMAL (4,2)")] public static IEnumerable Mask(Int32 inputMask, Int32 inputSearch) { if ((inputMask < 1) || (inputSearch < 1)) throw newArgumentOutOfRangeException("Mask Value", "Mask Value less than 1 - please resubmit query.");
//local declaration decimal searchResults = 0M;
//attribute mask for matching check int hair = 1 | 2 | 4; int age = 8 | 16 | 32; int gender = 64 | 128;
//mask vs input and search int hairResult = inputMask & hair & inputSearch; int ageResult = inputMask & age & inputSearch; int genderResult = inputMask & gender & inputSearch;
//Check for hair, age, and gender match based on mask if (hairResult != 0) searchResults = .33M; if ((inputMask & age & inputSearch) != 0) searchResults += .33M; if ((inputMask & gender & inputSearch) != 0) searchResults += .33M;
yieldreturn new Results( genderResult , hairResult , ageResult , searchResults ); } #endregion
#regionRowFiller function public static void RowFiller(object obj, out SqlInt32 Gender , out SqlInt32 Hair, out SqlInt32 Age, out decimal SearchResults ) { Results rr = obj as Results; Gender = rr.Gender; Hair = rr.Hair; Age = rr.Age; SearchResults = rr.searchResults; } #endregion
#regionResults helper class private class Results { internal decimal searchResults; internal Int32 Gender; internal Int32 Age; internal Int32 Hair; internal Results( Int32 Gender, Int32 Hair, Int32 Age, decimal searchResults) { this.Gender = Gender; this.Hair = Hair; this.Age = Age; this.searchResults = searchResults; } } #endregion }
This function is used to compare our masks and return the search results to SQL server via TSQL. Because this particular lesson doesn’t require learning CLR code (you can work on this later or post any questions in the discussion), study the snippet of code below which is the portion that handles the comparison of the mask to the representation of attributes, essentially the crux of the search.
// mask vs input and search int hairResult = inputMask & hair & inputSearch; int ageResult = inputMask & age & inputSearch; int genderResult = inputMask & gender & inputSearch;
Both masks enter the function as input parameters, the input mask and the input search. The input mask is the mask value that represents the preferred search result, while the input search is the list of candidate masks of potential matches. Each of the variables - hair, age, and gender - represent our mask values so when the code executes the if statement, ANDing all three values not equal to zero allows us to determine whether or not a match exists. If it does match, a searchResults variable is upped by a given percentage; here 33%, since we have only three attributes to search upon.
3) Create our table
SET NOCOUNT ON IF OBJECT_ID ('Customers')IS NOT NULL DROPTABLE Customers; GO CREATE TABLE [dbo].[Customers]( [CustId] [int] NOT NULL, [CustMask] [int] NULL, [CustSearchMask] [int] NULL, PRIMARY KEY CLUSTERED ( [CustId] ASC )WITH (PAD_INDEX =OFF, STATISTICS_NORECOMPUTE =OFF, IGNORE_DUP_KEY=OFF , ALLOW_ROW_LOCKS =ON , ALLOW_PAGE_LOCKS =ON)ON [PRIMARY] ) ON [PRIMARY]; GO
3) Insert some customers
INSERT INTO Customers VALUES (1, 73, 145 ) -- blonde 18-22 male seeking blonde 23-27 female ,(2, 145, 74) -- blonde 23-27 female seeking brunette 18-22 male ,(3, 65, 1) -- blonde male seeking blonde ,(4, 129, 65) -- blonde female seeking blonde male ,(5, 129, 65) -- blonde female seeking blonde male ,(6, 129, 65) -- blonde female seeking blonde male ,(7, 164, 0) -- red 28-35 female
4) Query the data – study the values that are being searched and the returned mask. Look up CustId in the Customers table to verify that the search is working as intended. Then, create some of your own queries to see how the search performs using various parameter values. Study the execution plan and note the efficiency of the queries. Experiment with adding more rows, removing some of the query elements (such as the ORDER BY), and add indexes to the table as you see fit.
--Query 1: Seeking brunette 18_22 male DECLARE @mask int SET @mask = 2 | 8 | 64 SELECT CustId, CustMask, Gender, Hair, Age, SearchResults FROM Customers CROSS APPLY dbo.Mask(@mask, CustMask) WHERE Gender>0 ORDER BY SearchResults DESC GO --Query 2: Seeking blonde female 23-27 DECLARE @mask int SET @mask = 145 SELECT CustId, CustMask, Gender, Hair, Age, SearchResults FROM Customers CROSS APPLY dbo.Mask(@mask, CustMask) WHERE Gender>0 ORDER BY SearchResults DESC GO --Query 3: Seeking brunette male DECLARE @mask int SET @mask = 66 SELECT CustId, CustMask, Gender, Hair, Age, SearchResults FROM Customers CROSS APPLY dbo.Mask(@mask, CustMask) WHERE Gender>0 ORDER BY SearchResults DESC GO --Query 4: Seeking red 28_35 female DECLARE @mask int SET @mask = 164 SELECT CustId, CustMask, Gender, Hair, Age, SearchResults FROM Customers CROSS APPLY dbo.Mask(@mask, CustMask) WHERE Gender>0 ORDER BY SearchResults DESC GO
A few notes about this search engine
This is a simple example of a search engine for searching attributes with set domains, but the strength, flexibility, and ease of use is apparent. A list of some important business logic concepts were deliberately left out, but could include the following:
1) Filtering – this search engine could be set up to remove the largest chunks of data first. For example, filtering on zip code might be the first predicate used to narrow the result set quickly.
2) Two-way matching – this search engine does not provide the facility for two-way matching (matching one mask’s preferences to another’s)
3) Gender variations – this search engine does not provide the built-in facility for alternative gender matching
4) Percentage hits – this search engine manually calculates hit % based on hard-coded values
5) Others
Conclusion
We have explored another form of bitmasking in this article, and have expanded our initial survey of the topic from part 1 by creating a straightforward method to solving a realistic problem. The technique is simple to implement and easy to understand, and can be used in many situations to solve a variety of problems. Also, we limited our operators to bitwise AND and bitwise OR which coincide with the introductory nature of this series. Although some may think of it as an abstract or outdated method of data representation and manipulation, it is still widely used in procedural and database programming implementations.
About the author
Lee Everest M.S. is a principal consultant with Sogeti USA LLC. He is also an adjunct at North Lake College in Irving, Texas, and can be reached at his blog at www.texastoo.com.