Back in 2015 I noticed a response to one of the editorials from a developer who had been lambasted by a colleague for using Linux scripting to clean up files in preparation for them being loaded into SQL Server by SSIS.
As a data architect I understand the importance of controlling technical diversity, clarity of processes, each component having a single responsibility and so on, and yet… Well, I was going through a Linux course at the time and had hit Chapter 13, "Manipulating Text". I was struck by both the simplicity and power of the standard Linux tools.
I have been using Microsoft DOS and Windows for most of my career, and my recurring thought was "I wish I had had access to these Linux tools years ago". In theory I could have had access to them through the Cygwin project, although I would not be comfortable putting Cygwin on a production server unless it was dedicated to the task.
The introduction of PowerShell back in 2005 was a welcome addition and its continuing evolution is something for which I am profoundly grateful. However, Bash for Linux and PowerShell for Windows still present us with the challenge of mentally changing gear when we move between operating systems in a heterogeneous environment.
If you are not running in a heterogeneous environment yet then I would be surprised if that were still the case in two years' time. I also think that SQL Server DBAs cannot afford not to embrace such an environment. Some of us will eventually run SQL Server natively on Linux. In fact, this may represent a sea change for SQL Server deployments.
The genius of AND vs the tyranny of OR
In a heterogeneous world we need to be able to use the tool of choice regardless of the operating system. This possibility is closer than I dared hope now that Bash for Windows is in beta, if only on Windows 10. Microsoft are also working on PowerShell for Linux. In addition, Red Hat and Microsoft are working together to bring an officially supported version of the .NET framework to Linux! It looks like the guys at the Mono project were on to something all along!
One of the judgement calls I have to make as a data architect is whether a course of action represents an unnecessary complication or a necessary and hopefully valuable addition to the technical landscape of the organisation. Alarm over technical diversity can be caused by many things:
- Genuine factors. In a disciplined team the addition will hinder rather than help because of the additional skills required.
- Traditional skills silos. People are constrained in the skillset they are allowed to acquire within the organisation.
- Office politics. The breaching of tribal boundaries; scrambling to avoid blame if it fails, land grabs if it is successful.
- Fear of the unknown.
- Poor governance and weak leadership, resulting in the wrong tool being applied to the task at hand.
When considering the use of some sort of pre-processing task for data in preparation for an ETL process, I think it helps to focus on why ETL tools were built. ETL tools exist to simplify, and therefore speed up, the development of Extract, Transform, Cleanse, Verify, Load pipelines. They provide an abstraction layer that masks the complexity of the tasks they carry out.
The best of them offer a number of advantages:
- Data pipelines can be built entirely using the native facilities of the tool; no SQL or external coding language is required.
- Components encapsulate complex functionality using a simple interface
- A rich and mature feature set
- The visual nature of the tools provides a high level data lineage diagram
- Almost everything can be parameterised, so code and environment reuse is possible.
- Multiple data flows can be co-ordinated within the tool
- Facilities to capture metadata.
However, ETL tools are sometimes overkill for the tasks to which they are applied. When all you have is a hammer, everything looks like a nail. The ability to carry out simple file cleansing and manipulation tasks at the command line gives us power, flexibility and efficiency that could prove an incredibly useful weapon in the armoury of an ETL developer.
I also believe that passing ALL problems downstream from source systems to an ETL facility is a huge mistake. The eventual result of doing so is that the downstream ETL process ends up spending more time cleaning up other people's messes than it does doing work that adds value. This in turn leads to the upstream processes becoming logjammed because the downstream team cannot accept any more work.
So why so enthusiastic about Linux and Bash?
Given that a lot of the source systems I work with are on Linux, I do not always have the choice of using PowerShell. The power of the Unix/Linux commands is that they accept STDIN and produce STDOUT and, in some cases, STDERR as well.
- STDIN can be a file, or the output of a command.
- STDOUT is usually the console but can be a file or the STDIN of the next command.
If we can pipe the output of one command into the input of the next command then what we have is an ETL pipeline using operating system commands.
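As a purely illustrative sketch, assuming a hypothetical comma-delimited file called SourceExtract.csv with a one-line header, and with a filter pattern and column positions that are also made up for the example, a single pipeline could strip the header, discard unwanted rows, switch the delimiter and keep only the columns of interest:
tail -n +2 SourceExtract.csv | grep -iv "test customer" | tr ',' '\t' | cut -f2,3,7 > CleanExtract.tab # hypothetical names and positions
Each command reads the previous command's STDOUT as its STDIN, which is the extract, transform and load pattern in miniature.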
The table below gives a list of commands described (at a basic level) on the course. This is by no means a definitive list and there are many other utilities for which I have offered a silent prayer of thanks.
| Command | Purpose |
|---------|---------|
| cat | Concatenate files. Where a single file is supplied this is similar to the Windows "type" command. |
| tac | As per cat but performs a backwards serial read of the file. |
| head | By default this selects the 1st 10 lines of the input, though this is configurable. |
| tail | As for the head command but spits out the last lines of the input. |
| split | Splits a file into smaller files. The default is 1000 lines but this is configurable using optional arguments. |
| join | Joins two files on a nominated key position in each file. Where no key is specified then it will join record 1 in file 1 to record 1 in file 2, record 2 in file 1 with record 2 in file 2 etc. It can do the equivalent of a left join. |
| sed | Stream editor program. This can convert data flowing from the input stream and push the result to the output stream. It can perform straight replacements and match RegEx patterns to accomplish its aim. |
| awk | Incredibly powerful data extraction utility that can chop out certain parts of a line and apply a function to that extracted part. This command can also perform aggregations on numeric data. |
| tr | Translates characters. Useful for converting case or character sets. |
| wc | Splits out word, character and line counts for a file. |
| grep | Very powerful line selection command. |
| cut | Very like a substring command for input files but can cut on the basis of a delimiter. |
Some examples for simple diagnostics
For me, one of the key frustrations when receiving external data files is the lack of documentation, or, where documentation is supplied, that it is insufficient for the job in hand.
The ability to interrogate a file and get some core statistics from it is essential.
WC
The "wc" command produces four columns of information
- Number of lines in the file
- Number of Words in the file
- Number of Bytes in the file
- The name of the input source for the command
wc MyBigDataFile.csv
The output will look like the following
96 442 7202 MyBigDataFile.csv
If we just wanted a line count then we could have issued the following command
wc -l MyBigDataFile.csv
This has told us how many records we should expect to read in our input file.
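If the first row turns out to be a header, a sketch for counting only the data rows (assuming the header occupies exactly one line) is to let tail skip that line before wc does the counting:
tail -n +2 MyBigDataFile.csv | wc -l # assumes a single header line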
Head and Tail
By default both head and tail will list 10 lines of a file. Let us imagine that we have received a CSV file with no supporting documentation and want to look at the first row of that file to see if it contains any header information
head -n1 MyBigDataFile.csv
Occasionally files are received that are prematurely truncated or corrupted. Looking at the last two lines will indicate whether this has happened to our CSV file.
tail -n2 MyBigDataFile.csv
Grep
Grep is a global regex pattern matching tool, which makes it an incredibly powerful filtering tool. Let us suppose that we wanted to find all instances of the word "leisure", regardless of case, in our file. The two commands below are equivalent.
grep -i "leisure" ./MyBigDataFile.csv cat MyBigDataFile.csv |grep -i "leisure"
The command above will list the matching lines; however, we may want to know on which lines the word is found.
grep -i -n "leisure" ./MyBigDataFile.csv
cat MyBigDataFile.csv | grep -i -n "leisure"
This will prefix each matching line with <line number>:
If we are digging into problems with an XML file then we may want to find a matching line and also 5 lines either side.
grep -i -n --context=5 "leisure" ./MyBigDataFile.xml
cat MyBigDataFile.xml | grep -i -n --context=5 "leisure"
By default these commands will run on the entire file. If we want to abort after 2 matches then we would try one of the following:
grep -i -n -m2 --context=5 "leisure" ./MyBigDataFile.xml
cat MyBigDataFile.xml | grep -i -n -m2 --context=5 "leisure"
Finally, we may wish to draw attention to the matching text by highlighting it in colour.
grep -i -n -m2 --context=5 --color "leisure" ./MyBigDataFile.xml
cat MyBigDataFile.xml | grep -i -n -m2 --context=5 --color "leisure"
The examples above barely scratch the surface of what grep can do.
Simple transformations
In previous roles I have had to deal with data that has been supplied to my organisation from external sources. In many cases both organisations find it mutually beneficial to share data, but not sufficiently so to instigate a project to build a fully automated and robust mechanism. The files involved in the data exchange are produced by individuals on a best endeavours basis and, inevitably, minor errors creep in or there are inconsistencies between one supply and the next. An amazing number of files are produced with MS Excel as some form of intermediary, with all the implications of involving this much-abused tool.
Simple deviations can break an ETL process utterly. Designing a robust mechanism for dealing with these inconsistencies is a rather more involved undertaking than is immediately apparent.
The tr command
The "tr" command receives input and substitutes individual characters on the way through and provides output. It can change specific characters or classes of characters Let us suppose that we have a process that demands that a file be TAB delimited rather than COMMA delimited. Both of the following commands are equivalent.
cat MyBigDataFile.CSV | tr ',' '\t' > MyTabDelimitedFile.csv
tr ',' '\t' < MyBigDataFile.CSV > MyTabDelimitedFile.csv
Similarly we may have a file that has more white space characters than we require and we want to get rid of them.
cat MyBigDataFile.CSV | tr -s '[:blank:]' ' ' > MyTabDelimitedFile.csv
tr -s '[:blank:]' ' ' < MyBigDataFile.CSV > MyTabDelimitedFile.csv
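To see the squeeze behaviour in isolation, here is a contrived, self-contained example (the input string is made up purely for illustration):
echo "too    many    spaces" | tr -s '[:blank:]' ' ' # prints: too many spaces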
The -s switch specifies that where the translation results in multiple consecutive occurrences of the replacement value, only one will be retained. If we need more sophisticated string replacement then we need to use the "sed" utility.
The sed command
The sed command is a powerful and flexible stream editor utility. It can choose which parts of the input to allow through, alter that data and even replace the data in the original source. As a utility sed can handle hugely complex tasks but when we need to stray into the realms of the complex then we should either rely on our ETL tool or reject the source file entirely as being not fit for purpose.
Let us suppose that I want to grab all but the 1st record of a file
cat MyBigDataFile.CSV | sed -n '2,$ p'
The $ symbol signifies the end of the range. I could have put an integer value in there if I wanted to generate small files of non-confidential data to be used in software development. Although a bit off topic, I use sed to amend the MariaDB/MySQL configuration file during an automated build process.
sudo sed -i '/#bind-address=0.0.0.0/c\bind-address=0.0.0.0' /etc/mysql/my.cnf
sudo sed -i '/^bind-address.*127\.0\.0\.1/c\#bind-address=127.0.0.1' /etc/mysql/my.cnf
The bind-address parameter tells MariaDB/MySQL what IP addresses are allowed to connect to it. By default there are two lines in the my.cnf file
#bind-address=0.0.0.0
bind-address=127.0.0.1
I use sed to uncomment the first line then comment out the 2nd line.
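After both commands have run, those two lines should therefore read as follows:
bind-address=0.0.0.0
#bind-address=127.0.0.1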
The awk command
Books have been written about awk so again I suggest that if what we need to do starts to look complex then we either rely on the ETL tool or reject the input file. The particular case where I found awk useful was when I received a file that had clearly come out of an old mainframe COBOL system. A single file contained multiple record structures which needed to be shredded out into separate files to allow bulk loading. The characteristics of the file were as follows:
- The field delimiter was a * character
- The first field identified the record type
- A new parent record was record type 01
- The parent key was field 2 on record type 01
awk -F'*' 'BEGIN{ OFS="*";}{if($1 == "01") a=$2; print a,$0}' StrangeInput.DAT > IntermediateOutput.dat
awk -F'*' '{print > $2}' IntermediateOutput.dat
The first command carries out the following tasks:
- Opens up the StrangeInput.DAT file and shreds it out based on the * field delimiter
- Keeps the Output Field Separator (OFS) as a *
- If the first field indicates that this is an 01 record then the value in the 2nd field is stored in variable 'a'
- Prints variable 'a' and the entire record out to the IntermediateOutput.dat file
In our IntermediateOutput.dat file the 2nd field is now our record type, so the 2nd command does the following:
- Opens up IntermediateOutput.dat and shreds it based on the * delimiter
- Prints the entire record but redirects the output to a file named after the 2nd field
Once I had proven that the two-step process worked as expected I simply joined the two statements together:
awk -F'*' 'BEGIN{ OFS="*";}{if($1 == "01") a=$2; print a,$0}' StrangeInput.DAT | awk -F'*' '{print > $2}'
This simple one line command shredded a 45 million record file into 8 separate files which could be bulk loaded straight into SQL Server. I could have written a script component in SSIS to achieve the same thing but not as quickly or as simply as this one line equivalent.
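As a quick sanity check, and purely as a sketch that assumes the record types (and therefore the output file names) really do run from 01 to 08, wc can confirm that the shredded files account for every line of the input:
wc -l StrangeInput.DAT
wc -l 01 02 03 04 05 06 07 08
The second command prints a line count per file plus a total; that total should match the line count reported for StrangeInput.DAT.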
The cut command
Given that I now have a file per record type I do not need the record type in those files. As the record type is now in field 2 of each file and field 1 is the primary key of the parent record I can use the cut command to extract the fields I want.
cat MyFile.txt | cut -d'*' -f1,3- > MyNewfile.txt
-d* tells the command that the * character is a delimiter
-f tells the command which fields I want to extract. In this case field 1 and fields 3 onwards.
If I know that only a few fields in a file are of relevance to me then the ability to extract these early can dramatically reduce the size of the files that I need to pass across the network to my ETL tools.
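For example, if only three fields were of interest, a sketch along the following lines (the field positions and the output file name are hypothetical) would slim the file down before it ever leaves the source server:
cut -d'*' -f1,4,9 MyFile.txt > MySlimFile.txt # hypothetical field positions and output name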
Closing thoughts
The examples I have given here can all be carried out using PowerShell if you are running on a Windows server. For that matter, they could all be carried out in SSIS or the ETL tool of your choice.
However, the point of using Bash or PowerShell as a pre-processor is to push file correction tasks as close to the point of origin as possible. This is a practice that has been adopted on the production lines of Nissan, Toyota and others. The relevance of manufacturing processes to IT production processes is covered brilliantly in The Phoenix Project.