Data Lake - Data Warehouse Strategies

  • Hi,

    Please forgive me if this is wordy.  I am working on a future strategy for our BI environments.  We will have (among misc source systems) an ERP that provides their own data lake integration as well as a BI reporting tool with imbedded transformation logic.  We also have another BI reporting tool (Power BI).  My prior experiences (over 10 years) has been in on-prem traditional data warehouse/mart approaches.  Both jobs have been for large companies with complex ETL/integration so I would expect transformation logic more complex that what is included in Power BI or our ERP BI reporting tool, although I am hoping as this is a smaller organization, it will not be as complex. My goal is to provide an environment (not for data scientists/exploring data) for reporting that minimizes redundancy and maximizes data governance and common metrics.  My list of assumptions/questions are:

    1.  Assumption: I am planning on using the ERP Data Lake as my ultimate source whenever possible.  Non ERP info can also be pushed into this Data Lake.
    2. Assumption: Minimize impact for releases of ERP which also updates Data Lake and the out of the box reports that are provided.
    3. Q1 - Would a best practice be to 'cleanse'/transform raw data into staging tables in the Data Lake to be sourced into either Power BI or the ERP BI tool?  These tools would use their imbedded ETL to build facts/dims in each tool for reporting. However at least common rules can be applied into the logic for building staging tables in the Data Lake (or would these need to be stored in another cloud DB)
    4. Or - 'cleanse'/transform raw data into dims/facts using staging tables/stored procedures and store the dims/facts in the Data Lake?
    5. Or - 'cleanse'/transform raw data into dims/facts using staging tables/stored procedures and store dims/facts within another cloud database?
    6. Or -'cleanse' raw data in the data lake into staging tables in the data lake (which can also be used for adhoc reporting), and then build dims/facts (either inside or outside of the Data Lake) that both tools would use.
    7. Or - use the ETL that is provided in Power BI and ERP BI tool independently depending on source of data and have duplicate business rules/transformations and multiple dimensions/facts within each tool?

    Thank you all for your feedback.

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Hi, you might want to read Uber engineering blog about their Data Lake strategies, as they explained some amazing architecture. Also see platforms like Hudi by Uber to crunch and munch data. Here is the link https://eng.uber.com/uber-big-data-platform/

     

     

  • hey even i read about the same strategy by samsung. their team have come up with brilliant ideas.

    you can go through the same here- https://insights.samsung.com/2018/05/29/meet-customer-demand-warehouse-management-turns-to-iot/

    Endy

  • This website is in Spanish but I liked it ¿Qué es un Data Warehouse?

    • This reply was modified 4 years, 6 months ago by  josehn.
    • This reply was modified 4 years, 6 months ago by  josehn.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply