Modern Development

  • Comments posted to this topic are about the item Modern Development

  • The majority of what I do is in the cloud.  A substantial amount of what is exposed to me as data sources is in some form of cloud bucket.

    What I have started to do is to set up folders that mimic the buckets I am supposed to use.  For example, my project will have two main folders:

    • Src (or something similar)
    • Tests

    Within Tests there are usually subfolders with code artefacts that express the tests.  There is also a subfolder called data.  Within Tests/data/ there will be these subfolders:

    • Buckets
    • ConfigData
    • SQLScripts
    • ExpectedData
    • ActualData

    "Buckets" will have subfolders named for the actual bucket names that are used by the data pipeline.  Within each of those folders will be a replica of the folder structure and files that the real bucket will contain.

    When a test run is instantiated I use a mocking framework to create a local mock of each bucket, then upload the files in the bucket folders to it exactly as the data pipeline would in the real world.  For testing a pipeline using AWS the mocking framework is called moto.  The AWS SDK library for Python is called boto (boto3 in its current version).
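
A minimal sketch of that setup step, assuming Python with boto3 and moto 5 or later (where the decorator/context manager is `mock_aws`; older moto releases expose `mock_s3` instead). The function names `upload_bucket_folders` and `demo_with_moto`, the region and the default path are illustrative assumptions, not part of the original post.

```python
from pathlib import Path


def upload_bucket_folders(s3_client, buckets_root):
    """Create one bucket per subfolder of buckets_root and upload its files.

    s3_client is anything with boto3-style create_bucket and upload_file
    methods. Returns the (bucket, key) pairs that were uploaded.
    """
    uploaded = []
    for bucket_dir in sorted(Path(buckets_root).iterdir()):
        if not bucket_dir.is_dir():
            continue
        # The folder name doubles as the bucket name.
        s3_client.create_bucket(Bucket=bucket_dir.name)
        for file_path in sorted(bucket_dir.rglob("*")):
            if file_path.is_file():
                # The path below the bucket folder becomes the object key.
                key = file_path.relative_to(bucket_dir).as_posix()
                s3_client.upload_file(str(file_path), bucket_dir.name, key)
                uploaded.append((bucket_dir.name, key))
    return uploaded


def demo_with_moto(buckets_root="tests/data/Buckets"):
    """Hypothetical wiring: run the upload against moto's in-memory S3."""
    import boto3  # imported lazily so the helper above has no hard dependency
    from moto import mock_aws  # assumption: moto >= 5

    with mock_aws():
        s3 = boto3.client("s3", region_name="us-east-1")
        return upload_bucket_folders(s3, buckets_root)
```

Passing the client in as a parameter keeps the upload logic independent of whether it talks to moto or to a real bucket.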

    There are obvious size constraints posed by both GitHub and the mocking framework, and not all AWS features are supported.  For the most part I can test locally.

    By adopting a naming convention I can keep my tests simple.  Let us suppose I have four files:

    • Buckets/mybucket/my_prefix/01_sales_data.parquet
    • SQLScripts/01_sales_data.sql
    • ExpectedData/01_sales_data.csv
    • ActualData/01_sales_data.csv - generated by the test

    If my test is written to read a source file, run an equivalent SQL script file and compare the expected and actual data, then to add tests I just need to drop the required files where they should be.  This allows people whose skills do not yet include writing test code to contribute to the body of tests.
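
The naming convention above could be sketched roughly as follows, using only the Python standard library. The `Case` class, `discover_cases`, `run_case` and the `run_sql_script` callable are hypothetical names; in particular, how the SQL script is actually executed against the source file is not described in the post, so it is left as a function the caller supplies.

```python
import csv
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Case:
    name: str
    source: Path
    sql_script: Path
    expected: Path
    actual: Path


def discover_cases(data_root):
    """Pair files that share a stem, e.g. 01_sales_data, across the folders."""
    root = Path(data_root)
    cases = []
    for sql_script in sorted((root / "SQLScripts").glob("*.sql")):
        name = sql_script.stem
        sources = [p for p in sorted((root / "Buckets").rglob(f"{name}.*"))
                   if p.is_file()]
        expected = root / "ExpectedData" / f"{name}.csv"
        if not sources or not expected.is_file():
            continue  # incomplete set of files, so not a runnable test yet
        actual = root / "ActualData" / f"{name}.csv"
        cases.append(Case(name, sources[0], sql_script, expected, actual))
    return cases


def read_rows(csv_path):
    with open(csv_path, newline="") as handle:
        return list(csv.reader(handle))


def run_case(case, run_sql_script):
    """Generate the actual CSV and compare it with the expected CSV.

    run_sql_script(source, sql_script, actual) is supplied by the caller
    and must write its result to the `actual` path.
    """
    case.actual.parent.mkdir(parents=True, exist_ok=True)
    run_sql_script(case.source, case.sql_script, case.actual)
    return read_rows(case.expected) == read_rows(case.actual)
```

With pytest, the discovered cases could feed `@pytest.mark.parametrize("case", discover_cases("tests/data"), ids=lambda c: c.name)`, so each dropped-in set of files shows up as its own test.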

    If testing cannot be done on my workstation, or relies on a carefully curated shared set of data, then that test data can be stored in specific cloud buckets for whatever heavyweight testing is required.

  • I really like your suggestion of putting the data used for creating a test database into SQL scripts with INSERT statements! That does make it possible to put it into source control. I'll have to try it when next I get an opportunity.

    Kindest Regards, Rod
