June 30, 2023 at 4:44 pm
I have a process that generates a batch of documents. In these documents it would be 'nice' to embed the Identity of the row storing this document. The purpose is if anyone has an issue with the authenticity of the document, they can easily reference the original on the system of record.
Today we generate a batch (150,000 documents or so) and then do a bulk insert. Later when the document is processed, a RegEx process inserts the Identity value into the document. If we know the number of records to create, then assign the Identity value to the document as it is generated, we would eliminate the downstream process of inserting the value.
I imagine I can perform an insert of 150,000 dummy rows and then use the Identity, but this means I need to follow up with an Update on 150,000 rows immediately after. It would be nice to tell SQL to give me a block of Identity values and then I can just perform the insert when the documents are finalized.
Thanks greatly for the expertise / advise.
Chris
June 30, 2023 at 4:45 pm
Does the document identity need to be an integer or guid?
June 30, 2023 at 5:01 pm
In this case, we're already using an integer. The reason is the identity is also PK so sorted on that value. I think if we used a GUID then searching for the batch might do a table scan as the child records would be spread all across the table. There are many down stream processes that operate on the batch so retrieving the set fast was a priority. It might work either way, but we stuck with patterns we are familiar with.
June 30, 2023 at 5:41 pm
I was going to suggest using a hash for the document then insert the computer hash into the table with the document. including document metadata should uniqueify each document even if their contents are the same. It would also help attest that the document hasn't been altered. It wouldn't be human readable however.
The way you are using that identifier - if you have SQL generate it, I don't think that there is any better way to do it than the way you are doing it now. Updating the document after the fact would be the only way to ensure that the key and the document are consistent. If you attempted to prestage autoincremented integers to anticipate each document ID, you could eventually have a failed insert and rollback, then necessitate either rerunning the process or resetting the identity seed to allow the previously generated IDs to be inserted.
does the document have a unique identifier in it already? I think it would be better to deserialize the document then insert its document identifier than generate and back port it from SQL.
June 30, 2023 at 5:45 pm
Using a sequence, can a process get a set of 5000 values and then use those values as the PK? WHILE another process performing the same task in parallel gets the next set of variable size (anywhere between 1 and 5000 as needed) of values and these two processes will not cross over on each other's sets?
The goal is to generate an integer that is guaranteed unique for that table which acts as an Identity, but is known before the insert event.
It seems like what you're suggesting will work, but then the table still needs the PK for the column to be used as a foreign key in another child table.
Your thoughts (once again), but I'll start playing with this as proof of concept.
Thanks!
June 30, 2023 at 5:48 pm
What I'd do is this. Grab the current identity, then reseed to that + the number you need. So if the current identity is 1100, and you need 5000, then reseed to 6101. Now, generate a sequence and use that to assign values for your insert, with SET IDENTITY INSERT on.
This only works for one process, however, as SET IDENTITY INSERT is single threaded. If you need concurrency, then this doesn't work. In that case, I think you're best with your current process. Insert them, then go back and fix things.
June 30, 2023 at 7:28 pm
Using a sequence, can a process get a set of 5000 values and then use those values as the PK? WHILE another process performing the same task in parallel gets the next set of variable size (anywhere between 1 and 5000 as needed) of values and these two processes will not cross over on each other's sets?
The goal is to generate an integer that is guaranteed unique for that table which acts as an Identity, but is known before the insert event.
It seems like what you're suggesting will work, but then the table still needs the PK for the column to be used as a foreign key in another child table.
Your thoughts (once again), but I'll start playing with this as proof of concept. Thanks!
you can do that and the number will be unique - how you set the sequence, in terms of increment, will give you the immediate range you need - I advise that it isn't too big, so in the cases you don't really need the full range you don't loose too many numbers.
as an example - assume a sequence increment of 500 - and you need 5000.
you do a "SELECT NEXT VALUE FOR sequencename; " 10 times and you immediately allocate 5000 numbers - (do note that if your process can run in parallel there is no warranty that the numbers will be fully sequential -e.g. each "next value" block will be, but between blocks another process may have grab a 500 block)
one issue with sequences is that you can (and likely will) loose numbers unless you reset the sequence to the last number used after your process finishes.
e.g. if your increment is 500 but you only use 450 of those, the last 50 will not be used unless you reset the sequence.
June 30, 2023 at 9:44 pm
as an example - assume a sequence increment of 500 - and you need 5000. ...
you do a "SELECT NEXT VALUE FOR sequencename; " 10 times and you immediately allocate 5000 numbers - (do note that if your process can run in parallel there is no warranty that the numbers will be fully sequential -e.g. each "next value" block will be, but between blocks another process may have grab a 500 block)
No need to go thru all yet. Just do a single exec of sys.sp_sequence_get_range and request 5000 values. You will then be guaranteed a sequential set of numbers.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
July 2, 2023 at 3:21 am
thank you for the suggestion and clarification on the threading. My process can run multiple concurrently on one service (presently only 1 batch) but there are multiple instances of the service running the same process in pods on the AWS cloud. I must be mindful of parallel processes still.
July 2, 2023 at 5:38 am
I'll look into the sys.sp_sequence_get_range SP as I've seen someone said it is thread safe. This may be a good solution. (if thread safe)
July 2, 2023 at 11:38 pm
Post removed.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2023 at 12:02 am
I'll look into the sys.sp_sequence_get_range SP as I've seen someone said it is thread safe. This may be a good solution. (if thread safe)
Looking at it from a different angle, I believe the sys.sp_sequence_get_range stored procedure might disappoint.
For example, you said in your original post ...
I have a process that generates a batch of documents. In these documents it would be 'nice' to embed the Identity of the row storing this document.
If you want to identify each batch separately, it would be far better to have a control table where you store things like the date the batch was created (1 row per batch), which would allow you to get a single IDENTITY to identify ALL the documents processed in the given batch. Then, you could simply use ROW_NUMBER() to identify the individual rows in the batch.
It would mean a 2 column PK or AK but we use this method for documents and loan information at work and it works a real treat... especially if you need to drop everything from a given batch and rerun it.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2023 at 2:13 am
I'll also tell you why sys.sp_sequence_get_range will disappoint. The only thing it does is give you the first value and reserves the rest of the values. If you want those values from the range you created, you have to build them using something like GENERATE_SERIES(). Because the damned thing returns SQL_VARIANT, you have to jump through some conversion hoops because it won't do implicit conversions from SQL_VARIANT.
Seriously a Batch_ID and, possibly a Batch_Document_ID using ROW_NUMBER(), will be about 4 times faster (400% as a number to impress) and easier and with much less CPU time.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 12, 2023 at 1:22 pm
This was removed by the editor as SPAM
Viewing 15 posts - 1 through 15 (of 29 total)
You must be logged in to reply to this topic. Login to reply