June 24, 2020 at 6:07 am
Hi Jeff
We have to pull in everything from the documents as we don't know what is needed for reporting purposes (orders from the higher-ups).
This is the only document that is causing an issue (presumably the one with the most fields, although I haven't checked).
The "documents" database sounds interesting, have you got any links with a decent amount of detail on this?
June 24, 2020 at 11:35 am
You should take all those varchar(max) columns off the table and put them into a good old EAV table:
DocumentId (FK to the main table)
InfoType NOT NULL
InfoContents VARCHAR(MAX) NOT NULL
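In DDL, a minimal sketch of that table (the InfoType size and the key type for DocumentId are assumptions - match them to your actual data):
CREATE TABLE dbo.DocumentInfo (
    DocumentId   INT          NOT NULL, -- FK to the main table; match the main table's key type
    InfoType     VARCHAR(50)  NOT NULL, -- which field this row holds
    InfoContents VARCHAR(MAX) NOT NULL
);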
_____________
Code for TallyGenerator
June 24, 2020 at 1:12 pm
I think I've just had a brainwave.
We currently store all the document data vertically in Table1. It contains (amongst other things):-
Document name
Field name
Field value
Now, what we then do is split this data out into individual tables using the Document name from Table1.
This obviously takes a while and creates 600+ tables holding the data just for that document. This is done automagically.
Now, what I thought was, instead of creating tables to hold the data, why not create views instead to do the same thing.
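To make that concrete, here's roughly what I have in mind for one document - a minimal sketch with made-up document and field names, assuming Table1 also carries a DocumentId:
CREATE VIEW dbo.vw_SomeDocument
AS
SELECT
    t.DocumentId,
    MAX(CASE WHEN t.FieldName = 'PatientName'   THEN t.FieldValue END) AS PatientName,
    MAX(CASE WHEN t.FieldName = 'AdmissionDate' THEN t.FieldValue END) AS AdmissionDate
FROM dbo.Table1 AS t
WHERE t.DocumentName = 'SomeDocument'
GROUP BY t.DocumentId;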
Pros :-
Cons :-
I can only see one potential con which is that it may take longer to get data back from the view rather than the table.
Can anyone see any other flaws in my plan?
Thanks
Richard
June 24, 2020 at 1:50 pm
Hi Jeff
We have to pull in everything from the documents as we don't know what is needed for reporting purposes (orders from the higher-ups).
This is the only document that is causing an issue (presumably the one with the most fields, although I haven't checked).
The "documents" database sounds interesting, have you got any links with a decent amount of detail on this?
I am, by no stretch of the imagination, even a casual user of document databases, never mind any kind of expert on the subject. My understanding is that a lot of them use JSON "documents" to store information and that this can be very useful, but I've never worked with one closely enough to know if that's actually correct.
With that understanding, I've been told that document databases are good at this type of thing (storing lots of non-relational data). One of those database engines is MongoDB; it seems to be supported by MS, and a fair number of people I know seem to think it's ok. That is not a judgement on my part; it's only what I've heard. There are other "document" databases that you can search for. Here's a "top level" link for MongoDB.
https://www.mongodb.com/what-is-mongodb
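Just to give a rough idea of the model (and again, I'm no expert here), a single "document" in such a database might look something like this - the structure and field names are purely made up:
{
    "documentId": "ABC123",
    "documentName": "CarePlan",
    "emotionalConcerns": [
        { "subType": "AngerOrFrustration", "sequence": 1, "snomed": "12345678", "comments": "..." }
    ]
}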
As with any possible new endeavor, especially one of this nature, there's going to be a need on your part for a fair bit of research. The first thing I'd do is research a whole lot more about what a document database actually is and what its capabilities, limits, caveats, requirements, and level of support are. Then I'd start checking and comparing different providers of such databases.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 24, 2020 at 2:47 pm
Thanks Jeff, looks like I've got plenty of reading to do.
Any thoughts on my new plan?
June 24, 2020 at 6:00 pm
Also, when you have columns that have the same names except for ending with a 1 or a 2, you may also want to look at normalizing the data into multiple tables, or possibly even an EAV table. I know, Jeff, you hate that, but it may be worthwhile in this case.
June 24, 2020 at 6:16 pm
Using Mongo is a possibility, and JSON can be fairly easy to use. I personally have not used MongoDB, but the product is used in development for a product we sell. There are costs involved even if you are using the free version. It is different to use, and many of the features you expect in an RDBMS aren't there and need to be built manually, such as indexes. Be careful there. The developers here turned MongoDB into a relational database with all the indexes they felt were needed. Not sure if that has changed or not.
June 24, 2020 at 7:52 pm
Also, when you have columns that have the same names except for ending with a 1 or a 2, you may also want to look at normalizing the data into multiple tables, or possibly even an EAV table. I know, Jeff, you hate that, but it may be worthwhile in this case.
Heh... I don't hate EAVs. They're quite useful... "It Depends". I just hate some of the reasons people use them.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 24, 2020 at 9:11 pm
I would recommend reviewing the data being populated and seeing if that data really should be in VARCHAR(MAX) columns. For example, SNOMED is a code and generally doesn't exceed 18 characters, and a MNEMONIC is generally a single-word short description or code.
It would also be a much better design to move all of these codes/identifiers/etc... to another table. For example, instead of separate columns in this one big table as:
[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_Mnemonic] [VARCHAR](MAX) NULL,
[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_SNOMED] [VARCHAR](MAX) NULL,
[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2] [VARCHAR](MAX) NULL,
[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_WithComments] [VARCHAR](MAX) NULL,
[PracticalConcerns_MyMedication_2_Mnemonic] [VARCHAR](MAX) NULL,
[PracticalConcerns_MyMedication_2_SNOMED] [VARCHAR](MAX) NULL,
[PracticalConcerns_MyMedication_2] [VARCHAR](MAX) NULL,
[PracticalConcerns_MyMedication_2_WithComments] [VARCHAR](MAX) NULL,
[EmotionalConcerns_RegretAboutThePast_1_Mnemonic] [VARCHAR](MAX) NULL,
[EmotionalConcerns_RegretAboutThePast_1_SNOMED] [VARCHAR](MAX) NULL,
[EmotionalConcerns_RegretAboutThePast_1] [VARCHAR](MAX) NULL,
[EmotionalConcerns_RegretAboutThePast_1_WithComments] [VARCHAR](MAX) NULL,
[EmotionalConcerns_AngerOrFrustration_1_Mnemonic] [VARCHAR](MAX) NULL,
[EmotionalConcerns_AngerOrFrustration_1_SNOMED] [VARCHAR](MAX) NULL,
[EmotionalConcerns_AngerOrFrustration_1] [VARCHAR](MAX) NULL,
[EmotionalConcerns_AngerOrFrustration_1_WithComments] [VARCHAR](MAX) NULL,
These types of fields all repeat and could be moved to another table named: '...Concerns'
ConcernType varchar(20)
ConcernSubType varchar(50)
Sequence int
Mnemonic varchar(10) --size this to largest mnemonic/code being sent
SNOMEDCode varchar(18) --SNOMED codes will not be larger than 18 characters
WithComments varchar(max)
Set the Sequence to a default of 1 - for those entries that don't have repeating segments. Take the ConcernType from the first part of the column name as either: Physical, Practical, Emotional, FamilyOrRelationship, Spiritual, InformationOrSupport, etc...
The ConcernSubType will be the second part of your naming (you probably need to identify a default value for those columns that do not include this, or use a blank or null value).
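As a rough DDL sketch of that table (the name and sizes are assumptions to verify against the actual data being sent):
CREATE TABLE dbo.DocumentConcerns (
    DocumentId     VARCHAR(30)  NOT NULL,           -- FK back to the main document table
    ConcernType    VARCHAR(20)  NOT NULL,           -- Physical, Practical, Emotional, etc.
    ConcernSubType VARCHAR(50)  NOT NULL,
    Sequence       INT          NOT NULL DEFAULT 1, -- 1 for entries without repeating segments
    Mnemonic       VARCHAR(10)  NULL,               -- size this to the largest mnemonic/code being sent
    SNOMEDCode     VARCHAR(18)  NULL,               -- SNOMED codes will not be larger than 18 characters
    WithComments   VARCHAR(MAX) NULL
);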
Looking at this a bit more, it looks like you are receiving data in a single field and parsing it into separate, discrete elements. I am guessing that the column [PhysicalConcerns] contains all of the data - either an XML or JSON document - which is then parsed out to get the further discrete data elements.
If that is the case, then keep that column in the primary table, but separate the parsed data out to the secondary table as above.
Either way, I am guessing that a lot of these columns are not populated on every row since a patient would not have every possible concern (problem) available. By moving to a separate table you only insert rows where there is data which will reduce the overall table usage.
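And if that single field does turn out to be JSON, something like OPENJSON could shred it straight into the secondary table. This is only a sketch - the table name, column names, and JSON paths are all guesses:
INSERT INTO dbo.DocumentConcerns
    (DocumentId, ConcernType, ConcernSubType, Sequence, Mnemonic, SNOMEDCode, WithComments)
SELECT
    d.DocumentId, 'Physical', j.SubType, j.Sequence, j.Mnemonic, j.SNOMED, j.Comments
FROM dbo.BigDocumentTable AS d
CROSS APPLY OPENJSON(d.PhysicalConcerns)
    WITH (
        SubType  VARCHAR(50)  '$.subType',
        Sequence INT          '$.sequence',
        Mnemonic VARCHAR(10)  '$.mnemonic',
        SNOMED   VARCHAR(18)  '$.snomed',
        Comments VARCHAR(MAX) '$.comments'
    ) AS j
WHERE d.PhysicalConcerns IS NOT NULL;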
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
June 25, 2020 at 1:59 am
That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX). As you've pointed out, I seriously doubt it.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 25, 2020 at 4:16 am
That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX). As you've pointed out, I seriously doubt it.
Funny that you did not question varchar(30) for the IDs.
🙂
_____________
Code for TallyGenerator
June 25, 2020 at 6:36 am
Hi all
Apologies for missing any questions, I'll get to them now.
Jeff - VARCHAR(MAX) is used because we don't know what amount of data will be in each column. The other consideration was to try and push the data off-row (that hasn't worked in this case).
Sergiy - The IDs are internal application IDs and are, in fact, 30 characters long, and we need to store them (I'm assuming you mean the DocumentID field?).
Jeffrey - if only it were that simple. The individual fields are stored and then the application combines them all together. The documents are built in "sections", which is why the fields are named that way. The bit before the underscore is the section, and the bit after it is the actual question. A lot of the time, all the questions are completed, which is why we need all the fields to be imported.
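For what it's worth, the section/question split can be pulled straight back out of the column names - a sketch against the system catalog, with a made-up table name:
SELECT
    c.name AS ColumnName,
    LEFT(c.name, CHARINDEX('_', c.name) - 1)     AS Section,  -- bit before the first underscore
    STUFF(c.name, 1, CHARINDEX('_', c.name), '') AS Question  -- everything after it
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.BigDocumentTable')
  AND CHARINDEX('_', c.name) > 0;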
Hopefully this covers everything, but feel free to point out anything I've missed.
Richard
June 25, 2020 at 12:23 pm
Sergiy - The IDs are internal application IDs and are, in fact, 30 characters long, and we need to store them (I'm assuming you mean the DocumentID field?).
Well, I cannot know the reasons behind the decision to use 30-character strings as IDs. I can only make my assumptions, but they're not gonna be flattering - ask Jeff, he knows. So, I'd better keep them to myself and share some food for thought with you.
30 non-Unicode characters occupy almost 8 times more storage space than an integer.
That inflates not only the storage for a single table: it applies to every table referencing DocumentID, and it increases the memory demand of every query mentioning that identifier.
It also takes at least 10 times more CPU time (depending on the collation) to compare two varchar(30) strings than to compare two integers.
Front-end developers tend to ignore that processing overhead, as they deal with a handful of records at most, but in databases with millions of records queried at any time, such overhead makes a noticeable difference.
Positive integers can hold more than 2 billion IDs (2 billion seconds is about 60 years). From what I see in your samples, your database won't get anywhere near that kind of capacity without a major refactoring. So, you're not gonna run out of IDs.
If you want to store some kind of GUID, then the [uniqueidentifier] data type takes 16 bytes of storage, which is half of what's needed for varchar(30), and is 3-4 times faster to process (again, due to collation rules).
If those ID strings are some kind of mnemonic codes, then it would be better to map them to internally generated database IDs (int) and use those IDs for processing within the database, retrieving the mnemonic codes only for the presentation layer.
The same applies to ClientID or any other varchar ID you may have in the database.
ClientID actually adds another overhead, but with so many updateable varchar(MAX) columns that overhead is so insignificant it's not even worth discussing here.
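A minimal sketch of that mapping, with made-up names:
CREATE TABLE dbo.DocumentIdMap (
    DocumentKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- internal surrogate used for joins and FKs
    DocumentId  VARCHAR(30)       NOT NULL UNIQUE       -- the vendor's 30-character application ID, kept for presentation
);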
_____________
Code for TallyGenerator
June 25, 2020 at 1:12 pm
Hi Sergiy
The reasoning escapes me as well; I'd need to speak to the vendor on that one.
All I do know is that we need to store it in that fashion to link various sections together (you'd need to speak to the people that develop the documents for that one).
As I mentioned earlier, I'm attempting to write some views to bring all the bits together rather than fixed tables.
I've written one view (it uses a table with 1.3 million rows that is quite wide).
It took 4 and a half minutes to display all the records from the table, and 5 minutes 18 seconds using the view.
June 25, 2020 at 8:01 pm
That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX). As you've pointed out, I seriously doubt it.
Yeah - and it doesn't appear that there is any initiative to actually analyze the data being sent to determine what it contains, so of course everything gets the default of varchar(max).
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs