Azure Synapse Analytics

  • Dear all,

    I hope this message finds you well.

    I did not see anything related to Synapse in these forums, so if you don't mind I will add my questions here. Maybe you can help me.

    In my company (where I was recently contracted as an IT architect) I came across something that seems unusual to me.

    I see the following architecture:

    1 - Sources

    2 - Datahub layer (Theobald pulls information from SAP into this layer, which sits in Azure Storage; the files generated by AecorSoft are CSV files)

    3 - Raw layer (using Data Factory, the files are moved into Azure Data Lake Storage Gen2)

    4 - Standardized layer (using Data Factory, the files are converted to Parquet in Azure Data Lake Storage Gen2)

    5 - Historized layer (triggered by Data Factory and using Synapse Spark, the files are transformed to carry history, stored as Delta files; a rough sketch of this step follows the list)

    6 - Curated layer (triggered by Data Factory and using Synapse Spark, the files are transformed again into Parquet files)

    7 - Core layer (triggered by Data Factory; views in Azure SQL Managed Instance that use these Parquet files as their source are used to import the data into tables)
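
    For context, the Historized step above looks roughly like this in Synapse Spark: a Delta Lake MERGE that upserts each day's changes into the history table. This is only a minimal sketch; the storage paths, the container names and the business key (material_id) are placeholders I made up, not our actual pipeline.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Placeholder paths and key column, for illustration only.
    standardized_path = "abfss://standardized@<storageaccount>.dfs.core.windows.net/sap/material"
    historized_path = "abfss://historized@<storageaccount>.dfs.core.windows.net/sap/material"

    daily_changes = spark.read.parquet(standardized_path)

    if DeltaTable.isDeltaTable(spark, historized_path):
        history = DeltaTable.forPath(spark, historized_path)
        # Upsert: update rows whose key already exists, insert the new ones.
        (history.alias("h")
            .merge(daily_changes.alias("c"), "h.material_id = c.material_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    else:
        # First run: create the historized Delta table from the initial extract.
        daily_changes.write.format("delta").save(historized_path)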

    There are more layers on top of this architecture (after the Core).

    But the reason I am writing is not related to those.

    It is instead related to the movement from layer 5 (Historized) to layer 6 (Curated).

    Unlike the other steps, this one does a full load, meaning it takes the entire contents of the Delta files and, on a daily basis, transforms them into Parquet all over again.

    Some files have 2 billion rows, and many have around 200 million.

    This is a lot of data to move from one layer to the other, and it does not seem reasonable to me.

    I believe we should promote an incremental load on a daily basis instead of this.
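
    For example, if the Historized Delta tables carry a column that marks when a row was loaded or changed, the Curated step could read only that day's rows and overwrite only the affected partitions of the Parquet output, instead of rewriting everything. A minimal sketch, assuming a load_date column and Parquet partitioned by that same column (both are assumptions on my side):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Replace only the partitions that actually received new data.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    historized_path = "abfss://historized@<storageaccount>.dfs.core.windows.net/sap/material"
    curated_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/sap/material"

    changed = (spark.read.format("delta").load(historized_path)
               .filter(F.col("load_date") == F.current_date()))

    # Curated transformations would go here; pass-through for the sketch.
    (changed.write
        .mode("overwrite")        # with dynamic mode, only the touched partitions are replaced
        .partitionBy("load_date")
        .parquet(curated_path))

    If no such column exists, Delta's change data feed could be another way to identify the changed rows, but it has to be enabled on the Historized tables first.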

    Now, what the team states is that this is necessary because the views in use (in the Core layer) connect directly to the Parquet files, and to generate those Parquet files the full dataset is needed.

    I don't think we need to do this. Could we, for example, use Synapse tables instead? That is, do the incremental load from Historized to Curated, and then connect the views (which are in the Core layer on Azure SQL Managed Instance) to those tables instead of having them connected to these Parquet files.
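
    On the Synapse side, one way to expose a table instead of loose Parquet files would be to register the existing Historized Delta data as a table in the Spark / lake database, so nothing has to be rewritten in full just so the views have something to read. A sketch only; the database name, table name and path are placeholders I invented:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder database, table and location, for illustration only.
    spark.sql("CREATE DATABASE IF NOT EXISTS curated")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS curated.material
        USING DELTA
        LOCATION 'abfss://historized@<storageaccount>.dfs.core.windows.net/sap/material'
    """)

    # Consumers can now query curated.material as a table; no daily full
    # rewrite of the underlying data is required for that.
    spark.sql("SELECT COUNT(*) AS row_count FROM curated.material").show()

    Whether the views in Azure SQL Managed Instance could be pointed at such a table directly is a separate question (today they read the Parquet files, as described above), so please take this only as an illustration of the "tables instead of files" idea, not as something I have validated.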


    What is your opinion and suggestions? Can you please advise?

    Thanks a lot,

    Pedro


  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.
