There is no level of experience or intelligence at which we cease to make mistakes. The simpler the mistake, the harder we kick ourselves. The good news is that the simple mistakes often have a simple means of ensuring we do not make the same mistake again.
I saw a Tweet from someone kicking themselves for committing a large file to their git repository. It is an easy mistake to make, a bit clunky to fix, but very easy to prevent. This is where the pre-commit tool is useful.
What is the pre-commit tool?
At its most basic it is a tool that runs various linters, code formatters and file checkers using git hooks. Git hooks are scripts you wish to run before (pre) or after (post) you carry out some Git action. In this case, before we git commit, hence the name pre-commit. When you try to do a git commit, if your files fail the checks you have configured, then your commit will be rejected.
Look at the list of supported hooks on the pre-commit website. There are a huge number covering most of the languages and many of the day-to-day file formats that a data engineer is likely to use.
- Docker
- File format validation (JSON, TOML, XML, YAML)
- Python
- R
- SQL
- Terraform
Why use pre-commit?
Many of the code quality checks you would want in a CI/CD pipeline have an equivalent hook that lets you run those checks locally. In some cases, the CI/CD pipeline uses the very same hook. This means that before your code is submitted to your CI/CD pipeline you know that it is unlikely to fail for linting or code quality rules. I say "unlikely" because, for those tools where we can tweak the configuration, we must make sure that our local and CI/CD configurations have the same settings.
My work computer is an M1 MacBook, which is the most powerful work machine I have ever had. It easily outperforms the servers used to run CI/CD pipelines. If your pipeline is in the cloud then a CI/CD run will cost you money. This can be the direct cost of the run or the cost of time lost while work queues up in the CI/CD pipeline. It is frustrating to wait only to find that your run fails, or that you were queued behind something that took ages and eventually failed anyway. If a CI/CD pipeline run fails then whatever broke it must be fixed and the work has to queue up to run again. So, briefly, the reasons for using pre-commit are as follows:
- Speed
- Cost saving
- Consistency
- Quality enforcement
- Security best practice detection
Getting started with pre-commit
After following the installation instructions you will have to create a file called .pre-commit-config.yaml. This tells the tool where to get git hooks from and which hooks you wish to use.
As I mentioned earlier, I work on a MacBook, so the Windows equivalent commands will be similar but with some differences. Fortunately, the Windows Subsystem for Linux will give you a near-identical experience.
The tool allows you to generate a small configuration simply by running the following command.
pre-commit sample-config > .pre-commit-config.yaml
Your .pre-commit-config.yaml file should look similar to the one shown below.
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
    -   id: trailing-whitespace
    -   id: end-of-file-fixer
    -   id: check-yaml
    -   id: check-added-large-files
These four hooks do the following:
- Remove any trailing whitespace from the end of a line of code
- Ensure that the last character in a file is a new line character
- Make sure that any yaml files are valid yaml
- Enforce a maximum size of 500KB (the default setting) for the files you try to commit.
The latter hook would have helped our Tweeter. As mentioned earlier, some hooks have optional parameters to allow us to tweak their behaviour. For example, if we wanted our check-added-large-files hook to prevent us from committing files larger than 100KB then the parameter would appear as follows.
-   id: check-added-large-files
    args: ['--maxkb=100']
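In context, this entry sits under the pre-commit-hooks repo from the generated sample configuration, roughly as shown below (same repo and rev as before).
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
    -   id: check-added-large-files
        args: ['--maxkb=100']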
To attach the pre-commit hooks to your project run the following command.
pre-commit install
This will make sure that whenever you run a git commit the files in that commit will be scanned by the hooks you chose. If any of the hooks signal a failure then the commit will be prevented.
You can run the hooks for all the files in your project that git knows about, not just the ones in a commit. This is useful when you want to run a check without having to go through git add, git commit.
pre-commit run --all-files
If you haven't run pre-commit install, the above command will still work, though failures will not prevent you from committing the files.
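You can also limit a run to a single hook by giving its id. For example, to run only the YAML check across the whole project (assuming check-yaml is one of your configured hooks):
pre-commit run check-yaml --all-files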
Detaching pre-commit from your commits
Why would you want to? If you make changes to .pre-commit-config.yaml and test them, you may find that your change affects more files than you intended. Ideally you want to git add and git commit just the changes to .pre-commit-config.yaml, and then git add and git commit the other changes as their own commit.
For Python programming I use the MyPy type checker in a git hook. Trying to comply with MyPy can be a frustrating experience so again, when experimenting, I detach pre-commit using the command below.
pre-commit uninstall
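For reference, one common way to wire MyPy into pre-commit is via the mirrors-mypy repository; an entry would look something like the sketch below. This is an assumption about the setup rather than a prescribed configuration, and the rev shown is only a placeholder.
-   repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.0.0  # placeholder; run pre-commit autoupdate to pin the latest release
    hooks:
    -   id: mypy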
Pre-commit hooks for SQL code
For linting SQL code, and for fixing any linting errors detected, SQLFluff is a popular package with two hooks:
- sqlfluff-lint
- sqlfluff-fix
We would add these to our .pre-commit-config.yaml file as follows.
-   repo: https://github.com/sqlfluff/sqlfluff
    rev: 1.4.5
    hooks:
    -   id: sqlfluff-lint
        args: [ '--dialect=tsql' ]
    -   id: sqlfluff-fix
        args: [ '--dialect=tsql' ]
This will check your .sql files against a large set of SQL code style rules. You may find that some of the rules are not suitable for your particular scenario.
Telling SQLFluff to ignore rules
As with other linters, SQLFluff does have the capability for you to configure which rules to ignore. As a principle I try not to use rule avoidance, and if I must, to do it in specific circumstances only. I have found rule avoidance to be a slippery slope.
For specific cases we can use an inline comment in our SQL code such as -- noqa: L044. This would stop a SELECT * from being flagged as a rules violation.
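For example, placed at the end of the offending line (the table name here is purely illustrative):
SELECT * FROM dbo.customers -- noqa: L044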
Even inline, instructions can be more sophisticated than simply ignoring a single rule or single violation of a rule. The SQLFluff documentation covers this quite well.
Where we want to configure SQLFluff's behaviour as a set of general rules, we can use a .sqlfluff configuration file. Python programmers have the option to use pyproject.toml instead.
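As a minimal sketch, a .sqlfluff file for a T-SQL project that wants to allow SELECT * might look like this:
[sqlfluff]
dialect = tsql
exclude_rules = L044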
Housekeeping for pre-commit
When new versions of your chosen hooks become available you will want to upgrade to those versions. For any repo, the associated rev: line tells us which version we are using. Fortunately, pre-commit provides a command to update the versions.
pre-commit autoupdate
This command is also useful if you do not know what version is appropriate for the hooks you wish to use. If in doubt, set rev: 0.0.1 and run pre-commit autoupdate.
Periodically, I also run two clean up commands to get rid of old cached versions of the hooks.
pre-commit clean
pre-commit gc
I use pre-commit in all my projects, so I have a script that runs pre-commit autoupdate in every repo in which it is installed.
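A rough sketch of such a script is shown below. It assumes the repositories all live under a single ~/projects directory, which may well not match your layout.
#!/usr/bin/env bash
# Hypothetical helper: run pre-commit autoupdate in every repo under ~/projects
# that has a .pre-commit-config.yaml file.
for repo in ~/projects/*/; do
    if [ -f "${repo}.pre-commit-config.yaml" ]; then
        echo "Updating hooks in ${repo}"
        (cd "${repo}" && pre-commit autoupdate)
    fi
done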
Concluding thoughts
I have found that the pre-commit tool provides a good first line of defence against code style and standards violations. It is also a good way to share a common set of rules across the team, or beyond.
Many open-source repositories use it too which I take to be an endorsement.
Many of the hooks are written in Python or Ruby, with the source code available both for modification and as an educational resource. As an open-source project, if a hook doesn't quite have the facilities you require there is nothing to stop you applying your skills to add them. If you are successful then you should consider contributing your modification back to the main code base.