Changing nvarchar to varchar

  • terry999 wrote:

    Subject of Orig Post:

    The data are logs from many different machines (thousands). The table was actually installed as a heap, and it is 98% of the DB size. It was growing very large, and I needed a way of purging it without getting the vendor to change their insert code. I added the autoID field so I could uniquely identify a row to purge. The purge doesn't get in the way of the constant inserts, as it's removing the data with the smallest IDs; I see it as removing from the bottom while adding at the top.

    Now you really would benefit from some data normalisation.

    In similar situations, when I cannot change the code (or it's too much work), I create a view that reads from a properly normalised data structure and shows the same set of columns as the original table. Then I add an INSTEAD OF INSERT trigger which populates the normalised structure with the data inserted into the original table. After it's all tested and approved, I rename the table and give its name to the view. The 3rd-party code inserts data into the same db object, and reads from the same object, as it used to do. Everything that happens behind the interface of the view stays behind the scenes.

    I found this approach can help even with the speed of inserts (even compared to a heap), because with no need to record the looked-up varchar values (names, statuses, etc.) you have much less data to store.

    The purge process would also be much lighter (if you still want to go with it), as you'd need to delete data from a much smaller central transactional table.
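    To make that concrete, here's a minimal sketch of the swap (all object and column names here are hypothetical stand-ins, not from the actual system):

    -- 1) After testing, rename the original table out of the way
    EXEC sp_rename 'dbo.MachineLog', 'MachineLog_old';
    GO
    -- 2) Give the original name to a view over the normalised structure
    CREATE VIEW dbo.MachineLog
    AS
    SELECT l.LogDate, s.StatusName AS [Status]
    FROM dbo.MachineLogData AS l
    JOIN dbo.Statuses AS s ON s.StatusID = l.StatusID;
    GO
    -- 3) Route inserts from the old interface into the normalised table
    CREATE TRIGGER trg_MachineLog_Insert ON dbo.MachineLog
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.MachineLogData (LogDate, StatusID)
        SELECT i.LogDate, s.StatusID
        FROM inserted AS i
        JOIN dbo.Statuses AS s ON s.StatusName = i.[Status];
    END;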

    Jeff Moden wrote:

    Sergiy wrote:

    Autoincremental bigint ID is never a good candidate for a clustered index.

    "Never" is a pretty big word here.  There are some exceptions.  For example, you're using the Autoincremental ID as  a necessary surrogate key like for CustomerID, EmployeeID, etc or using the Clustered Index as a storage control mechanism rather than as a performance mechanism.

    I've been stretching my mouth

    To let those big words come right out

    🙂

    The words might be strong, but apparently not strong enough, considering how the masses of developers mindlessly follow the pattern (the default from MS) and cluster everything by ID.

    _____________
    Code for TallyGenerator

  • Sergiy wrote:

    The words might be strong, but apparently not strong enough, considering how the masses of developers mindlessly follow the pattern (the default from MS) and cluster everything by ID.

    True enough, and very much in agreement there!  Still, "never" is the wrong word here. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sergiy wrote:

    But... if some reports don't filter on Date, they wouldn't benefit from the new index design.

    The only query that actually filters on this at the moment is the purge routine, which is doing:

    Delete from table where autoID < ###

    My only aim was to have a method to delete data. Now the data is going to be used, I will most probably put the clustered index on the Date field and create a nonclustered index on the autoID field.

    The INSTEAD OF trigger is a very nice solution, which I hadn't thought of. I have used views before to allow flexibility for change, but never an INSTEAD OF trigger.

    Wouldn't it be much slower if, for every insert, it had to check lookup tables to return their TypeIDs and/or insert new records? Inserting new TypeIDs into the lookup tables would become increasingly rare once it had been up and running, though, and like you say a tinyint would cover it.

  • terry999 wrote:

    Thanks Jeff, that's a good point, to which I don't know the answer. I don't have control over the code, and I doubt the support guy I'm talking to would know.

    From what I can see it's using JavaScript (I think). I do know it's using an ODBC data source saved on the machine; the data source is created using the SQL native driver.

    The table is being populated with big INSERT statements I can see in Activity Monitor, e.g. INSERT INTO table VALUES (1, 'Bob'), (2, 'Jane'), (3, 'Steve')

    The basic reporting queries we've discussed are

    Select …... where DateAdded > '20200903' and DateAdded < '20200903' and type='error'

    I would have thought SQL would say "'error' is Unicode, I will convert it to varchar ONCE, because the Type field is varchar", rather than converting all of the index?

    Terry...

    I missed this before.  If that's the typical query then, no, they're not using anything that would require NVARCHAR.  You're probably safe in creating a new table that uses VARCHAR.  If the dates aren't date columns, it would also be a good time to convert them to dates.
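    For what it's worth, the rule behind Terry's question is data type precedence: nvarchar outranks varchar, so a Unicode literal or parameter forces the varchar column side (and therefore the index) to be converted, not the literal. A sketch, assuming the column has been converted to VARCHAR:

    -- Matching types: an index on [type] can be used for a seek
    SELECT COUNT(*) FROM dbo.LogFiles WHERE [type] = 'error';
    -- NVARCHAR literal: the VARCHAR column is implicitly converted,
    -- which can turn the seek into a scan
    SELECT COUNT(*) FROM dbo.LogFiles WHERE [type] = N'error';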

    Shifting gears to your latest post above, I guess we need to know more about your "purge" operation.

    1. How often do you do it?
    2. What is it based on?  I know you said you're using the ID to do the purge, but what determines the rows that will ultimately be purged?  For example, are you looking up the highest ID for a given date and then using that as the "less than or equal" boundary for the purge?

    And, correct... changing the Clustered Index from the ID to a date column isn't going to benefit all queries.  Just the ones that are based on date criteria for the date column you're talking about.  Of course, right now, you have nothing that will benefit any queries except your purge query.

    You say the table has millions of rows in it at any one time.  How many rows are there and what is their average width?  Also, do you have columns that are greater than 100 characters wide, how many, and are any of them of the MAX datatype?  Rather than trying to answer all of that piecemeal, you could easily answer it, and a whole lot of other potential questions, if you were to post the CREATE TABLE statement for the table, including any constraints (defaults, checks, etc.).

    It would also be nice to know if there is some other unique key that could be applied to the table other than that ID column.  And, yeah... I agree... there might not be one, but I had to ask the question.

    It would also help us help you if we knew more about the reporting they intend to do.

    As for worrying about the size of what it will take to add extra indexes, that's the cost of doing reporting if no other changes can be made.  If it becomes an issue, the people with the purse strings are either going to have to buy more disk space or spend some money on doing things a bit differently.

    --Jeff Moden



  • Jeff Moden wrote:

    Of course, right now, you have nothing that will benefit any queries except your purge query.

    And the purge query can easily be changed to be based on date criteria, which would make it more logical and easier to understand.
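    A sketch of what that could look like, using the table and column names from this thread (the one-month retention and batch size are illustrative):

    -- Purge anything older than a month, in batches to keep each transaction small
    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000) FROM dbo.LogFiles
        WHERE DateAdded < DATEADD(MONTH, -1, GETDATE());
        IF @@ROWCOUNT = 0 BREAK;
    END;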


  • Sergiy and Jeff: both good points.

    CREATE TABLE [dbo].[LogFiles](
    [ProcessName] [nvarchar](64) NOT NULL,
    [type] [nvarchar](64) NOT NULL,
    [From] [nvarchar](260) NOT NULL,
    [To] [nvarchar](260) NOT NULL,
    [Error] [int] NOT NULL,
    [Message] [nvarchar](520) NOT NULL,
    [DateAdded] [datetime] NOT NULL,
    [CustomerID] [int] NULL,
    [LogFile] [nvarchar](260) NULL,
    [LogID] [bigint] IDENTITY(1,1) NOT NULL,
    CONSTRAINT [PK_LogFiles] PRIMARY KEY CLUSTERED
    (LogID ASC)
    )

    -- No constraints or indices at the moment apart from PK

    INSTEAD OF Trigger

    Going with Sergiy's suggestion of an INSTEAD OF trigger: ProcessName, Type, and Message could currently all be replaced with tinyint FK relationships to three separate lookup tables. However, wouldn't this slow inserts down? I could pre-populate each lookup table when converting to the new structure, but the trigger would still have to insert a new row into a lookup table whenever a value didn't exist.
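    One way to keep that overhead low is to do the lookup-insert set-based, once per INSERT statement rather than once per row; a sketch for the Type lookup (dbo.LogTypes and its Name/TypeID columns are assumed names, not existing objects):

    -- Inside the INSTEAD OF INSERT trigger: add only values not seen before,
    -- in a single set-based pass over the inserted rows
    INSERT INTO dbo.LogTypes (Name)
    SELECT DISTINCT i.[type]
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.LogTypes AS lt WHERE lt.Name = i.[type]);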

    Purge

    There are 23M rows, and I'm running the purge every night. Every night it also stores the current latest value of LogID in a table along with the current date; I'm using the LogID from this other table to purge one month's worth of data. I was going to add a logdatetime field instead of LogID to make the purge easier, but I wanted a unique identifier on this table.

  • Yes, when I was experimenting with my first log views I was prepared to trade some insert performance for saving space and increasing reporting speed.

    Surprisingly, inserts became faster too. It turned out the 3-4 lookup reads were more than compensated for by the reduced amount of writing to disk.

    But be careful to write very efficient code for your trigger. Triggers can be very unforgiving.


  • Terry,

    If this is truly a log table (the name of it implies that it is), there are two things that I'd do.  As you say, changing the NVARCHARs to VARCHARs will save a substantial amount of space, although you may first want to verify the datatypes of ProcessName and Type if they live in other tables that you may have to reference during reporting.  The other thing I'd do is see how well Page Compression might work on the table (Row Compression does little for variable-length columns).  See the following link for how to do that.

    https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-estimate-data-compression-savings-transact-sql?view=sql-server-ver15
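    For example, against the table from this thread (the procedure ships with SQL Server):

    -- Estimate how much PAGE compression would save on the log table
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'LogFiles',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';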

    Since the DateAdded column doesn't default to a date function, and you also stated that people can add backdated stuff to the table, I'm thinking that the LogID column you added might cause you to delete rows that shouldn't be deleted at any given point in time.  The only good part is that it CAN make the Clustered Index on the DateAdded column that Sergiy was talking about UNIQUE (SQL Server loves unique keys) if you make LogID the second column in the clustering key.
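    A sketch of that change (the index name is illustrative, and the existing clustered PK has to move aside first):

    -- Recluster on DateAdded, with LogID as the uniquifying second column
    ALTER TABLE dbo.LogFiles DROP CONSTRAINT PK_LogFiles;
    CREATE UNIQUE CLUSTERED INDEX IX_LogFiles_DateAdded_LogID
        ON dbo.LogFiles (DateAdded, LogID);
    -- Keep LogID unique in its own right via a nonclustered PK
    ALTER TABLE dbo.LogFiles ADD CONSTRAINT PK_LogFiles PRIMARY KEY NONCLUSTERED (LogID);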

    I don't see much advantage to changing ProcessName or Type to a key from a "process" table or "type" table.  Compared to the other columns, it's a pain in the neck to do so with not much ROI.  I'd probably wait on making a decision there (especially since you'll have the extra overhead of an INSTEAD OF trigger, and a lot of people whining about how their reports now need to join to additional tables).

    All in all, I'd treat this table as a single table mini data warehouse and see how Page Compression works out with the only possible changes being...

    1. Changing NVARCHAR to VARCHAR (big savings there and worth the ROI).
    2. Changing the Clustered Index to DateAdded/Log_ID.
    3. Seeing if Page Compression helps the size enough to be justified.
    4. Consider turning it into a partitioned VIEW in a different database.  See below for more details on some pretty serious advantages, and note that I did NOT say partitioned TABLE... it's just not needed for this type of thing and has some pretty bad disadvantages.

    You might want to consider changing it to a partitioned VIEW (not TABLE) of, say, week-long partitions, which will seriously speed up your purges (done once per week and nearly instantaneous, because you'd just drop a "weekly table", add a new one, and very quickly rebuild the partitioned VIEW over it).  If you put the "weekly tables" each in their own identically named File Group, you could make the older, static File Groups READ_ONLY, which would save a lot on backup sizes and times as well as DR "get back in business" times.  You might also want to consider putting this table (tables, as a partitioned view) in a different database and pointing the partitioned VIEW at that, which would make a DR restore of the other stuff in your current database much quicker, not to mention that the database where the partitioned view's tables live could be backed up at a different frequency than your main data.
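    A minimal sketch of the shape of such a view, with two hypothetical weekly member tables and the column list trimmed for brevity (each member's CHECK constraint uses fixed dates on the partitioning column, which is what lets the optimizer skip the other members):

    CREATE TABLE dbo.LogFiles_20200831 (
        DateAdded datetime NOT NULL
            CONSTRAINT CK_LogFiles_20200831
            CHECK (DateAdded >= '20200831' AND DateAdded < '20200907'),
        LogID bigint NOT NULL,
        [Message] varchar(520) NOT NULL,
        CONSTRAINT PK_LogFiles_20200831 PRIMARY KEY CLUSTERED (DateAdded, LogID)
    );

    CREATE TABLE dbo.LogFiles_20200907 (
        DateAdded datetime NOT NULL
            CONSTRAINT CK_LogFiles_20200907
            CHECK (DateAdded >= '20200907' AND DateAdded < '20200914'),
        LogID bigint NOT NULL,
        [Message] varchar(520) NOT NULL,
        CONSTRAINT PK_LogFiles_20200907 PRIMARY KEY CLUSTERED (DateAdded, LogID)
    );
    GO
    CREATE VIEW dbo.LogFiles
    AS
    SELECT DateAdded, LogID, [Message] FROM dbo.LogFiles_20200831
    UNION ALL
    SELECT DateAdded, LogID, [Message] FROM dbo.LogFiles_20200907;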

    You also wouldn't have a single monster Clustered Index to REBUILD if those ad hoc out-of-date-order inserts caused low page densities or unsatisfactory logical fragmentation (which will affect reporting).  If you do end up moving it to its own database, you can also enjoy the fact that you could slip the database into the BULK LOGGED Recovery Model during index REBUILDs (NOT Reorganize); the index(es) would rebuild in a minimally logged fashion and beat the pants off of Reorganize insofar as duration goes.

    --Jeff Moden



  • Jeff Moden wrote:

    I don't see much advantage to changing ProcessName or Type to a key from a "process" table or "type" table.  Compared to the other columns, it's a pain in the neck to do so with not much ROI.  I'd probably wait on making a decision there (especially since you'll have the extra overhead of an INSTEAD OF trigger, and a lot of people whining about how their reports now need to join to additional tables).

    Table LogFiles would be renamed to table_LogFiles, with the new view LogFiles and an INSTEAD OF trigger taking its place. The end user would have no idea about the joins going on.

    CREATE VIEW dbo.LogFiles
    AS
    SELECT LF.LogID, LT.Name AS [type] -- rest of fields
    FROM table_LogFiles AS LF
    LEFT JOIN LogTypes AS LT
        ON LF.TypeID = LT.TypeID

    Also, a good side effect of using a LEFT JOIN is that the view will not allow updates on [type], because it resides in a lookup table (LogTypes); it will only allow updates on the primary table, i.e. table_LogFiles.

    Jeff, you are spot on about DR and just general admin. I've asked the vendor to move the report table to a new DB. I guess they were not interested because that would mean a non-standard additional DB connection in their code.

    Ignoring partitioned views for the moment, what would be the performance impact if I moved the table to another DB and put in its place a new view, e.g.

    CREATE VIEW dbo.LogFiles
    AS
    SELECT [FieldList] FROM ReportsDB.dbo.LogFiles

    Partitioned Views

    I've not used partitioning before. I guess I couldn't partition dynamically, i.e. where the check constraint says something like

    DATEDIFF(week,DateAdded,GETDATE()) = 1 -- week 1 table
    DATEDIFF(week,DateAdded,GETDATE()) = 2 -- week 2 table

    Because after a week the existing data would break the constraint? So the weekly purge would create a new week-5 table, drop the week-1 table (or move it to a read-only file), and rewrite the view to replace the week-1 table with the week-5 table?

    What happens if old data is sent, would it just ignore it or would it error?

  • terry999 wrote:

    Table LogFiles would be renamed to table_LogFiles, with the new view LogFiles and an INSTEAD OF trigger taking its place. [...] Ignoring partitioned views for the moment, what would be the performance impact if I moved the table to another DB and put in its place a new view? [...] Because after a week the existing data would break the constraint? So the weekly purge would create a new week-5 table, drop the week-1 table (or move it to a read-only file), and rewrite the view to replace the week-1 table with the week-5 table? [...] What happens if old data is sent, would it just ignore it or would it error?

    I don't see a need for an Instead Of trigger for any of this because I don't believe you need to or even should "normalize" a table used exclusively for logs.

    You don't really need the vendor to get involved with moving the table to a separate database because the original database will have the partitioned view (named the same as the table was) pointing at it.  Unless they're doing something totally wonky, they wouldn't even know the change occurred.

    As you said, ignoring partitioned views for a moment: you don't need to create a view as you suggest if you just move the table as-is.  Just create a synonym in the current database with the same name as the table after you move it to the new database, and point the synonym at the new location.
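    For example, after moving the table as-is (assuming the ReportsDB name from the earlier post):

    -- The vendor app keeps using dbo.LogFiles in the original database
    CREATE SYNONYM dbo.LogFiles FOR ReportsDB.dbo.LogFiles;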

    On partitioned views, yes... if you make a separate weekly table and a separate File Group to hold each table (which is where all the maintenance and DR benefits come into play), each table will need its own constraint to identify its effective week span.  All of that is pretty easy to automate.  When you do automate it, make sure that you always have "next week" already formed.  That way, new rows will auto-magically start flowing into the "next week" partition when the date makes it the "current week".  The automated process should just need to make a new "next week".

    You wrote...

    Because after a week the existing data would break the constraint? So the weekly purge would create a new week-5 table, drop the week-1 table (or move it to a read-only file), and rewrite the view to replace the week-1 table with the week-5 table?

    Yes... that's the idea except I wouldn't number the weeks.  Instead, I'd name each underlying table as dbo.LogFiles_YYYYMMDD where YYYYMMDD is the starting date for the week.  Since that follows a nice pattern and is forever increasing, it makes it real easy to automate because you don't have to juggle wrapping a week number around.  The partitioned view can be easily and automatically recreated after you drop the leading week and create a new trailing "next week" with a dated name.

    You wrote...

    What happens if old data is sent, would it just ignore it or would it error?

    Yes... if you try to insert something that's "out of range" for dates, it will error.  You could resolve this by creating an underlying table called something like dbo.LogFiles_Old, making it a part of the partitioned view, and just changing the constraint on it each week to be less than the earliest dbo.LogFiles_YYYYMMDD table.  It should usually not have much in it (if anything), and your weekly purge would need to check it and delete stuff (or drop the table and rebuild it).
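    The weekly adjustment on that catch-all could be as simple as swapping its constraint (the table, constraint name, and boundary date here are illustrative):

    -- Keep dbo.LogFiles_Old covering everything below the earliest weekly table
    ALTER TABLE dbo.LogFiles_Old DROP CONSTRAINT CK_LogFiles_Old;
    ALTER TABLE dbo.LogFiles_Old ADD CONSTRAINT CK_LogFiles_Old
        CHECK (DateAdded < '20200907');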

    --Jeff Moden



  • P.S.  If you do decide to move the table (whether you turn it into a partitioned view or not), you should keep the original table for a couple of weeks, just in case something goes haywire with the vendor app.  Just rename it.  That way, if something does go haywire, you can just drop either the synonym (for a non-partitioned table) or the partitioned view and rename the old table back to its original name.

    --Jeff Moden



  • Thanks Jeff and Sergiy for all your help.

    I have a much better understanding now, and some design ideas (INSTEAD OF triggers, partitioned views) that I can use beyond just this task.

  • terry999 wrote:

    Thanks Jeff and Sergiy for all your help.

    I have a much better understanding now, and some design ideas (INSTEAD OF triggers, partitioned views) that I can use beyond just this task.

    Thanks for the feedback... Up next, how to use a frozen pork chop launcher at point blank range with unhelpful 3rd party vendors. 😀

    --Jeff Moden


