Return a list of all columns in a table with more than X % filled

  • Say I have a table that allows nulls on some columns. As records are added, some columns are filled only say 20% of the time. is there an easy way to determine if a column is filled over X percent? Creating a cursor of column names is simple, as is eliminating all the columns that cannot be null.

    I think I have to use dynamic SQL to insert the  (column name, table name, % filled ) values for each column into another table. (I'm basically trying to automate which columns to include in a view.)

    Everything works fine except for when I try to use the column name from the cursor...

    DECLARE @ColumnList NVARCHAR(2000);
    DECLARE @column_name nvarchar(25);
    DECLARE @RowCount as decimal(5,2);
    DECLARE @Threshold DECIMAL(3,2) = 0.01; /* include row if it's at least 20% full */

    SELECT @RowCount = COUNT(*) FROM Toxicity_Original;

    PRINT @RowCount;

    DECLARE @ColumnsCursor AS CURSOR
    DECLARE @SQLString NVARCHAR(2000);

    SET @ColumnsCursor = CURSOR FOR
    SELECT ac.name AS column_Name
    --, ac.is_nullable
    FROM sys.all_columns ac
    INNER JOIN sys.all_objects ao
    ON ac.object_id = ao.object_id
    WHERE ao.name = 'Toxicity_Original';

    OPEN @ColumnsCursor;

    FETCH NEXT FROM @ColumnsCursor INTO @column_name;

    WHILE @@FETCH_STATUS = 0
    BEGIN
    /*
    How do I save the results of the successful counts? Or just append them all to a temporary table,
    and then use STRING_AGG() to munge them all together again.
    */
    FETCH NEXT FROM @ColumnsCursor INTO @column_name;
    END

    PRINT @ColumnList;

    CLOSE @ColumnsCursor;
    DEALLOCATE @ColumnsCursor;

    I was thinking that I could test the column's "fill factor" by getting a count of records in the table and then a count of records where @columnName IS NOT NULL, but writing the dynamic SQL for that part is giving me fits. The problem is that the basic pattern is

    SELECT Count(*) FROM MyTable WHERE @columnName = NULL

    and I need dynamic SQL for that, and then to do something like

    INSERT INTO LogTable (Tablename, ColumnName, PercentFilled) 
    SELECT @TableName, @ColumnName, fGetPercentFilled(@TableName, @ColumnName);

    but how do I write the dynamic SQL, and then use it to insert into a table? I can get it to return a value no problem, like this:

    SELECT @NonNullCount = COUNT(*)
    FROM dbo.Toxicity_Original t
    WHERE t.ALOPECIA IS NOT NULL;

    I can replace the column name in the WHERE clause with REPLACE()... nothing doing. But how do I execute the SQL statement and get the result inserted into a table? (Or how do I do an INSERT with a dynamic WHERE clause?)

    Hope this makes sense. At least sort of.

  • pietlinden wrote:

    Say I have a table that allows nulls on some columns. As records are added, some columns are filled only say 20% of the time. is there an easy way to determine if a column is filled over X percent? Creating a cursor of column names is simple, as is eliminating all the columns that cannot be null.

    Why store the nulls?  Are the number of columns variable and you're adding and dropping nullable columns according to some application logic?

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Not sure I follow why you are doing this - you stated you are trying to automate what columns to include in a view?

    Why would that even matter?  If the view needs that column to be available (because the calling code needs to reference that column) - then it should be in the view regardless of how many rows contain a NULL value.

    There is no efficiency in not including the column vs including the column.  SQL Server is able to factor out of final queries any tables/functions/views that are not utilized, but it cannot factor out a column on a row in a table.  If that row needs to be read - then all columns on that row are read.

    Assume you have a view with the column vs a view without the column - when you build a query that uses either view, how will one be more efficient/better than the other?  If both do not reference the column then both queries will be the same and will generate the same execution plan.

    Sorry - but this doesn't make sense.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • For efficiency, you should really just scan the table once to count all columns, as the code generated below does.  I've added the view name as input so the code can generate the final 'CREATE VIEW' code, based on the specific threshold.  It provides you the final view code (you'll need to remove some comment markers) before running.  Normally I don't use SELECT ... INTO directly.  If you want to avoid that too, creating the table stand-alone first and then inserting into it, let me know and I will adjust the code accordingly.

    DECLARE @TableName nvarchar(128) = 'Toxicity_Original'
    DECLARE @ViewName nvarchar(128) = @TableName + 'View'
    DECLARE @Threshold DECIMAL(3,2) = 0.20; /* include row if it's at least 20% full */--**************************************************************************************************

    DECLARE @column_name nvarchar(128);
    DECLARE @ColumnList nvarchar(max);
    DECLARE @ColumnsCursor CURSOR;
    DECLARE @objectId int;
    DECLARE @RowCount int;
    DECLARE @SQLString1 nvarchar(max);
    DECLARE @SQLString2 nvarchar(max);

    IF OBJECT_ID('tempdb.dbo.#columnTotals') IS NOT NULL
    DROP TABLE #columnTotals;

    SET @TableName = PARSENAME(@TableName, 1)

    SELECT @objectId = object_id
    FROM sys.tables
    WHERE name = @TableName

    SELECT @RowCount = SUM(rows)
    FROM sys.partitions
    WHERE object_id = @objectId AND index_id IN (0, 1)
    GROUP BY index_id

    PRINT 'Table Name = ' + @TableName + ', total row count = ' + CAST(@RowCount AS varchar(10)) + '.';

    SET @ColumnsCursor = CURSOR FOR
    SELECT c.name AS column_Name
    FROM sys.columns c
    INNER JOIN sys.tables t ON t.object_id = c.object_id
    WHERE c.object_id = @objectId /*AND c.is_nullable = 1*/ORDER BY c.column_id

    OPEN @ColumnsCursor;
    SET @SQLString1 = '';
    SET @SQLString2 = '';
    WHILE 1 = 1
    BEGIN
    FETCH NEXT FROM @ColumnsCursor INTO @column_name;
    IF @@FETCH_STATUS <> 0
    IF @@FETCH_STATUS = -2
    CONTINUE
    ELSE
    BREAK;
    SET @SQLString1 = @SQLString1 + ', SUM(CASE WHEN [' + @column_name + '] IS NULL THEN 0 ELSE 1 END) AS [' + @column_name + '__Count]'
    SET @SQLString2 = @SQLString2 + '+ CASE WHEN [' + @column_name + '__count] * 1.0 / TotalRows__Count >= ' + CAST(@Threshold AS varchar(10)) +
    ' THEN '', ' + @column_name + ''' ELSE '''' END'
    END /*WHILE*/DEALLOCATE @ColumnsCursor;

    SET @SQLString1 = 'SELECT ' + CAST(@RowCount AS varchar(10)) + ' AS [TotalRows__Count], ' + STUFF(@SQLString1, 1, 2, '') + ' ' +
    'INTO #columnTotals ' +
    'FROM [' + @TableName + ']; '
    SET @SQLString2 = 'SELECT
    ''/*CREATE VIEW ' + @ViewName + ' AS */'' +
    ''SELECT '' + STUFF(column_names, 1, 2, '''') + '' FROM ' + @TableName + ''' ' +
    'FROM (SELECT ' + STUFF(@SQLString2, 1, 2, '') + ' AS column_names ' +
    'FROM #columnTotals) AS derived; '
    PRINT @SQLString2
    SET @SQLString1 = @SQLString1 + @SQLString2
    EXEC(@SQLString1)

     

     

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply