The world of data warehouses is full of implementation projects that didn’t deliver because the technologies and/or methodologies were inadequate, or because the team was chasing rabbits instead of focusing on business needs. So there are also wrong ways of building a data warehouse. The applicability of a technology needs to be judged against the existing requirements, because the requirements can make a technology, or even a methodology, unfit for purpose.
Data warehouse projects are complex by nature; otherwise we wouldn’t have so many projects that don’t deliver what they were supposed to deliver.
In theory, a data warehouse is based on multiple data pipelines.
The data lake sits on top of all the sources that hold raw business data, including data warehouses and data marts. Of course, one can use data lake-related technologies to provision the data warehouse, though a project’s complexity can then easily increase unnecessarily.
I would consider better definitions for the concepts of data warehouse, data mart and data lake. A data warehouse is not a single source of truth per se, but an enabler for achieving one. A single source of truth is a state of being.
If I’m not mistaken, stream processing can handle batches of data as well.
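A minimal sketch of what I mean, assuming a hypothetical processor that handles one event at a time: a batch is just a bounded stream, so the same code can consume both. The function name and the sample amounts are made up for illustration.

```python
def running_totals(events):
    """Consume events one at a time, as a stream processor would,
    and yield the running total after each event."""
    total = 0
    for amount in events:
        total += amount
        yield total


def event_stream():
    """Hypothetical stream: events arrive one by one from a generator."""
    for amount in (5, 10, 20):
        yield amount


# A finite batch (a list) is processed by exactly the same code as the stream.
batch = [5, 10, 20]
batch_result = list(running_totals(batch))
stream_result = list(running_totals(event_stream()))
```

Both calls go through the same per-event logic; only the source of the events differs.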
ELT appeared from the need to first load the data into a repository and only afterwards transform it, as the data’s characteristics impeded an ETL approach. That doesn’t mean ETL will no longer be used.
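To make the difference concrete, here is a small sketch using an in-memory SQLite database as a stand-in for the repository. The table names, the sample rows and the cleaning rules are all assumptions made for illustration, not anything taken from a real warehouse.

```python
import sqlite3

# Hypothetical source records: untrimmed names, amounts as text, one bad value.
RAW_ROWS = [("  alice ", "10.5"), ("BOB", "3"), ("carol", "n/a")]


def _is_number(text):
    """Return True when the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(conn, rows):
    """ETL: transform in application code first, then load only the clean result."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    cleaned = [
        (name.strip().lower(), float(amount))
        for name, amount in rows
        if _is_number(amount)
    ]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn, rows):
    """ELT: load the raw rows untouched, then transform inside the repository."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    # The transformation is a SQL statement run by the repository itself.
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE amount GLOB '[0-9]*'
        """
    )
```

Both functions end up with the same clean rows. In the ELT variant the raw table is kept as well, so the transformation can be rerun or changed later without going back to the source.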
It would be helpful to define what orchestration is about. Not sure how many people have written a cron job… I know I haven’t (until now).
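One possible working definition, as a sketch rather than a description of any particular tool: orchestration runs the tasks of a pipeline in an order that respects their dependencies, which a time-based scheduler alone does not guarantee. The task names below are hypothetical, and the example relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks: maps a task name to a callable taking no arguments.
    dependencies: maps a task name to the set of task names that must run first.
    Returns the order in which the tasks were executed.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Hypothetical pipeline: each step depends on the previous one.
demo_tasks = {
    "extract": lambda: None,
    "load": lambda: None,
    "transform": lambda: None,
    "report": lambda: None,
}
demo_dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
}
```

Calling `run_pipeline(demo_tasks, demo_dependencies)` executes the four steps in dependency order. A real orchestrator adds scheduling, retries and logging on top of this core idea.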
As long as monitoring is used to check the status of a job during or after the job run, as a process it can only be reactive. In theory, one can check beforehand whether the needed resources and services are available, though that doesn’t make the process proactive.
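The distinction can be sketched as follows, assuming hypothetical availability checks passed in as callables. The pre-flight check only tells us whether it makes sense to start the job. Anything that goes wrong during the run is still detected after the fact.

```python
def preflight(checks):
    """Return the names of the resources that are not available right now.

    checks: maps a resource name to a callable returning True when available.
    """
    return [name for name, is_available in checks.items() if not is_available()]


def run_monitored(job, checks):
    """Run a job and report its status.

    The pre-flight check can skip a job that cannot start, but a failure
    during the run is only observed once it has happened.
    """
    missing = preflight(checks)
    if missing:
        return {"status": "skipped", "detail": "unavailable: " + ", ".join(missing)}
    try:
        job()
    except Exception as exc:  # reactive: we learn about the failure afterwards
        return {"status": "failed", "detail": str(exc)}
    return {"status": "succeeded", "detail": ""}
```

A job whose database connection drops halfway through still ends up as `failed`, even though the pre-flight check passed.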
The Configuration Management (CM) process ensures that a baseline for CM exists and is maintained as changes occur.
In general, it’s helpful to define processes in terms of the goals/objectives, scope and activities associated with them.
A roadmap implies planning and a timeline, usually how to get from point A to point B via some intermediate points. Describing a few processes provides at most a map.
It’s important to consider the mentioned processes when designing and planning a data warehouse, as some of the aspects associated with them need to be addressed by design. Of these processes, probably only data orchestration can improve a pipeline’s performance.
The value of a post increases if it draws on more than one research source. This is especially important when the definitions are fishy and defy common sense.