July 26, 2021 at 12:00 am
Comments posted to this topic are about the item Do You Have ALL the YAML?
July 26, 2021 at 10:12 am
With the proliferation of file formats, perhaps it is time to revisit the delimited file and how its problems can be overcome.
CSV stands for comma-separated values. As we know, there is always the risk that your data contains either the value delimiter (the comma) or the line delimiter (carriage return, line feed).
This problem can be avoided by adopting the surprisingly little-used standard Unicode control characters for record separator ("\u001E") and unit separator ("\u001F"). We can add the column names in the first row, separating them with the Unicode group separator ("\u001D"). Mapping column names to values is then simply a matter of matching each name's position to the corresponding column index. This data format is far more compact than JSON or XML and just as readable (any program that can split a file into columns by delimiter can parse it).
If we want to confirm that all the data has been delivered, we can append the End of Transmission Block character ("\u0017").
These control characters have been in ASCII and Unicode from the beginning and are very unlikely to go away.
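Here is a minimal sketch of a parser in JavaScript (the function name and error handling are illustrative only):
const GS = "\u001d", RS = "\u001e", US = "\u001f", ETB = "\u0017";
// Parse a delimited file: the first record holds group-separated column
// names; records are record-separated, values are unit-separated.
function parseDelimited(text) {
  if (!text.endsWith(ETB)) {
    throw new Error("Incomplete transmission: no ETB terminator");
  }
  const [header, ...rows] = text.slice(0, -1).split(RS);
  const names = header.split(GS);
  // Index each column name against its position in every row.
  return rows.map(row => {
    const values = row.split(US);
    const record = {};
    names.forEach((name, i) => { record[name] = values[i]; });
    return record;
  });
}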
July 26, 2021 at 12:32 pm
Since I am the Accidental TFS Administrator, and since I'm trying to move my organization to the cloud for VCS instead of our out-of-support TFS, I've gotten into YAML. But like everything else where I work, they don't provide any training or support. I'm just thrown headfirst into the deep end of the pool.
Until I read today's editorial, Steve, I'd never even heard of a "closing key:value". After posting this, I'm going to have to look that up.
To answer your question, no, I've not done anything to ensure that the YAML I'm working with is complete. Although I'm as sure as I can be that it's complete, as much as anything else in a Git repo is complete. The YAML I work with is part of our Git repos. So either Git restores the whole file or it doesn't, and that's as likely to be true of any source code file in the Git repo.
Rod
July 26, 2021 at 1:15 pm
When it comes to data interchange file formats, the merits of one versus another come down to a handful of criteria, and the delimited format is superior in all of them.
However, the other markup and object notation file formats are better for things like configuration files.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
July 26, 2021 at 1:20 pm
Regarding MS Excel, that's my least favorite data interchange format. In a previous job, I found out the hard way that when someone in accounting hides a range of rows from view, the SSMS Import Wizard considers them deleted and will ignore them.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
July 26, 2021 at 2:22 pm
I forgot to mention earlier that I like YAML, because it's easier for me to understand than the convoluted PowerShell script the former TFS admin wrote for lots of our on-premises builds. And our TFS is so old it doesn't work with YAML.
Kindest Regards, Rod
Connect with me on LinkedIn.
July 26, 2021 at 5:15 pm
I do like YAML for the most part, but I'd never considered that a truncated file might not be noticed. If you upload something manually for config, it might not be a problem, but if you're sending files across systems automatically, a silent truncation could be something to worry about when those files control critical processes.
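For instance, a closing sentinel key at the very end of the file would make a cut-off detectable before anything acts on it (a rough sketch; the key name end_of_file is made up):
# deploy.yaml (hypothetical): the last key doubles as a completeness check
pipeline:
  steps:
    - build
    - test
    - deploy
end_of_file: true   # if this key is absent, the file was truncated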
July 26, 2021 at 7:39 pm
With the proliferation of file formats, perhaps it is time to revisit the delimited file and how its problems can be overcome.
CSV stands for comma-separated values. As we know, there is always the risk that your data contains either the value delimiter (the comma) or the line delimiter (carriage return, line feed).
This problem can be avoided by adopting the surprisingly little-used standard Unicode control characters for record separator ("\u001E") and unit separator ("\u001F"). We can add the column names in the first row, separating them with the Unicode group separator ("\u001D"). Mapping column names to values is then simply a matter of matching each name's position to the corresponding column index. This data format is far more compact than JSON or XML and just as readable (any program that can split a file into columns by delimiter can parse it).
If we want to confirm that all the data has been delivered, we can append the End of Transmission Block character ("\u0017").
These control characters have been in ASCII and Unicode from the beginning and are very unlikely to go away.
+1000. That and control characters 28 thru 31 ("\u001D", "\u001E" and "\u001F" are 3 of those 4). And, to be sure, there is no problem with Delimited Data... it works perfectly when done correctly. It's what people do with it and to it that puts the screws to it. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
July 26, 2021 at 8:41 pm
Thanks, Jeff.
That and control characters 28 thru 31.
I'm writing them the way I would decode them in JavaScript:
Row split:
data.split("\u001e")
Column split:
data.split("\u001f")
It is possible to define a whole set of JavaScript functions that map column names to data and so on - or for that matter implement the relational operators (restrict, project, union, intersect, minus and join).
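For example, a rough sketch (all function and parameter names here are illustrative, and rows are assumed to be already-split arrays of values):
// Turn a parsed row (array of values) into an object keyed by column name.
const toRecord = (names, values) =>
  Object.fromEntries(names.map((n, i) => [n, values[i]]));

// Relational "project": keep only the named columns of each record.
const project = (records, keep) =>
  records.map(r => Object.fromEntries(keep.map(k => [k, r[k]])));

// Relational "restrict": keep only the records matching a predicate.
const restrict = (records, pred) => records.filter(pred);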
I'm principally concerned at present with sending data to a client. In this scenario, formats like JSON, XML and YAML are simply too bulky - when you consider that the network is one of the slower components in the system - and I want to be a good citizen and not hog bandwidth.
July 26, 2021 at 8:55 pm
Something else folks may not consider is the "native" format for transmitting data from SQL Server to SQL Server. It does really cool stuff such as leaving INTs in a 4-byte format, DATETIMEs in an 8-byte format, and DATEs in a 3-byte format. You can even generate a BCP format file to include as a "meta-data" file when you send the data as a separate file.
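For example (the table, server, and file names below are made up), something along these lines with the bcp utility:
bcp dbo.SomeTable out SomeTable.dat -n -S SourceServer -T
bcp dbo.SomeTable format nul -n -f SomeTable.fmt -S SourceServer -T
bcp dbo.SomeTable in SomeTable.dat -f SomeTable.fmt -S TargetServer -T
The first command exports in native format, the second generates the matching format file, and the third loads the data on the receiving side using that format file.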
--Jeff Moden
Change is inevitable... Change for the better is not.
July 26, 2021 at 8:58 pm
Thanks, Jeff.
That and control characters 28 thru 31.
I'm writing them the way I would decode them in JavaScript:
Row split:
data.split("\u001e")
Column split:
data.split("\u001f")
It is possible to define a whole set of JavaScript functions that map column names to data and so on - or for that matter implement the relational operators (restrict, project, union, intersect, minus and join).
I'm principally concerned at present with sending data to a client. In this scenario, formats like JSON, XML and YAML are simply too bulky - when you consider that the network is one of the slower components in the system - and I want to be a good citizen and not hog bandwidth.
Yeah... we "crossed streams" on this one a bit. I was in the process of updating my post to say that you used 3 of the 4 control characters that I was talking about. Totally agree with you.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 27, 2021 at 12:55 am
I do like YAML for the most part, but I'd never considered that a truncated file might not be noticed. If you upload something manually for config, it might not be a problem, but if you're sending files across systems automatically, a silent truncation could be something to worry about when those files control critical processes.
Very good point. In the scenarios I've been working with, it's simpler than what you've described.
Rod