Question: Which ETL tools can support Data Vault modeling out-of-the-box? What are the challenges and issues building a data vault with ETL tools?
Discussion still going on
1. Marcel de Wit – See another discussion on LinkedIn (still active; Dutch)
2. Daan van Beek – Thanks Marcel, I did read the comments of that discussion too; it was actually the reason I started this one. After reading it, it was still unclear to me whether ETL tools support Data Vault out-of-the-box, the way they support slowly changing dimensions, or not. So, who knows the answer(s)? The vendors?
Strictly speaking none
3. Jos van Dongen – Confused by your question: imho ETL tools are for ETL, data modeling tools are for data modeling. You ask for ETL tools that support DV modeling. So strictly speaking: none ;-). But out of the box support for DV (to some extent, depending on whether you actually want to generate a model from the source system) can be found in BI-Ready, Quipu and Wherescape 3D (still in beta); at Antonius Hospital we developed a Pentaho Kettle framework for loading a DV.
Data Vault aware
4. Juan José van der Linden – Hi, I don’t think there are ETL tools that support loading a Data Vault out of the box with special steps/transformations. Pentaho Kettle supports adding new steps as plugins, so if someone develops steps for loading hubs/links/sats it can be made Data Vault aware. Quipu and BI-Ready support generating a Data Vault model and loading code from a data model. This model can be a business model or a source model. BI-Ready generates SQL as loading code. Quipu currently generates SQL code by default. The code generation of Quipu is template based. There are already users of Quipu building templates that generate Pentaho Kettle ETL transformations for loading the Data Vault. Regards, JJ.
5. Marco Schreuder – Compared to Dimensional modeling, Data Vault loading is very simple: 95% inserts. You can use a combination of a lookup transformation to determine new rows and a data transformation task to insert them. For satellites you have to determine the old rows and end-date them. You can use a simple SQL execute task for this. So… you really don’t need a dedicated component.
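Marco's "lookup + insert, then end-date old satellite rows" pattern can be sketched in a few lines. This is a minimal illustration using SQLite; the table and column names (`hub_customer`, `sat_customer`, `load_dts`, `end_dts`) are hypothetical, and a real loader would also compare an attribute hash to skip unchanged rows:

```python
import sqlite3

# Hypothetical hub + satellite schema for one business entity.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_dts TEXT);
CREATE TABLE sat_customer (customer_key TEXT, name TEXT,
                           load_dts TEXT, end_dts TEXT);
""")

def load_hub(con, keys, load_dts):
    # "Lookup + insert": only business keys not already in the hub land.
    con.executemany(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?)",
        [(k, load_dts) for k in keys])

def load_sat(con, rows, load_dts):
    # End-date the current satellite row for each changed key, then insert
    # the new version (real implementations first check a hashdiff).
    for key, name in rows:
        con.execute(
            "UPDATE sat_customer SET end_dts = ? "
            "WHERE customer_key = ? AND end_dts IS NULL",
            (load_dts, key))
        con.execute(
            "INSERT INTO sat_customer VALUES (?, ?, ?, NULL)",
            (key, name, load_dts))

load_hub(con, ["C1", "C2"], "2011-01-01")
load_sat(con, [("C1", "Alice")], "2011-01-01")
load_sat(con, [("C1", "Alicia")], "2011-02-01")  # change: end-date old row
```

As Marco says, nothing here needs a dedicated Data Vault component: an insert-if-absent and an update-then-insert cover it.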
6. Ronald Damhof – Agree with Marco. My simple answer: they all do. Nuff said 🙂
7. Juan José van der Linden – Loading hubs and links is always easy in a Data Vault, and loading a satellite is also easy when there is only one change per business key per load cycle. It becomes tricky when you’re loading from a CDC system and have multiple changes per business key per load cycle, or when you have a bi-temporal satellite. But even then a template system helps you speed up development.
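The tricky CDC case JJ mentions, several changes per business key in one load cycle, mostly comes down to ordering the changes per key and deriving each version's end date from the start of the next version. A minimal sketch, with a hypothetical CDC batch of `(business_key, change_ts, value)` tuples:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical CDC batch: several changes for C1 within one load cycle.
batch = [
    ("C1", "2011-01-01", "Alice"),
    ("C1", "2011-01-15", "Alicia"),   # second change for C1, same cycle
    ("C2", "2011-01-02", "Bob"),
]

def satellite_rows(batch):
    """Derive (key, value, load_dts, end_dts) satellite rows: each version
    is end-dated by the start of the next version for the same key."""
    rows = []
    batch = sorted(batch, key=itemgetter(0, 1))   # groupby needs sorted input
    for key, changes in groupby(batch, key=itemgetter(0)):
        changes = list(changes)
        for cur, nxt in zip(changes, changes[1:] + [None]):
            end_dts = nxt[1] if nxt else None     # open-ended current version
            rows.append((key, cur[2], cur[1], end_dts))
    return rows
```

With a single change per key per cycle this degenerates to the simple end-date-and-insert case; the per-key ordering is exactly what makes the multi-change CDC case "tricky" by hand but easy to template.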
8. Erik Fransen – Only those ETL tools that are used by qualified DWH designers are able to deal with Data Vault and data mart schemas, and the related data logistics, correctly.
9. Daan van Beek – @jos you’re right, but why do almost all ETL tools available in the market support star schemas with slowly changing dimensions? Because it makes the life of a DWH developer much easier.
@marco @ronald the same holds true for star schemas: 95% inserts. In my view ETL tools were created and are used to get rid of hand-coding SQL? Are you suggesting we just load a data vault with SQL and forget about ETL tools?
10. Erik Fransen – @Daan that’s what we are doing at our customers, and it works great. No need for expensive ETL anymore, but the right skills and holding on to your architecture rules are crucial.
2 different types of ETL steps
11. Martijn Evers – @Daan, you have to separate two different types of ETL steps:
1. Formal (semantically equivalent) model-to-model transformations (e.g. 3NF to Data Vault)
2. Pure data transformations (cleaning & matching, conforming, etc.)
Model transformation steps change the model but leave the data as is, while data transformation steps change the data (sometimes significantly) but do not change the model/shape of the data. ETL tools can do both, and usually in one BIG push. Most ETL tools, however, are not very smart at model-to-model transformations (exceptions: Kalido, BIReady, Wherescape Red) and hence deliver poor ROI there.
Most Data Vault (oriented DWH) architectures separate these two types of ETL actions by default (they should, IMO) to accomplish an effective ETL divide & conquer strategy. Hence they can use different and far simpler tooling, like (somewhat complex) metadata-driven SQL or even business rule engines, to accomplish their task. With Data Vault oriented DWH architectures you can still use (powerful) ETL software, but their weakness in model-to-model transformations and their focus on handling big and complex ETL steps (of any kind) is seen more as a liability than as an asset. Especially combining model transformations and data transformations can compromise things like reusability, maintainability, auditability & performance.
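The "metadata-driven SQL" alternative Martijn mentions (and the template-based generation JJ describes for Quipu) can be illustrated with a small sketch: a template plus a metadata dictionary is enough to generate a hub-load statement per hub. The hub names, key columns, and template below are hypothetical:

```python
# Hypothetical metadata: hub name -> business key column(s).
HUBS = {
    "hub_customer": ["customer_nr"],
    "hub_product":  ["product_nr", "company_code"],
}

# One template covers every hub: insert business keys not yet in the hub.
TEMPLATE = (
    "INSERT INTO {hub} ({keys}, load_dts, record_source)\n"
    "SELECT DISTINCT {src_keys}, '{dts}', '{src}'\n"
    "FROM {staging} s\n"
    "WHERE NOT EXISTS (SELECT 1 FROM {hub} h WHERE {join})"
)

def generate_hub_load(hub, keys, staging, dts, src):
    # Pure string templating: the "ETL" is generated, not hand-coded.
    return TEMPLATE.format(
        hub=hub,
        keys=", ".join(keys),
        src_keys=", ".join(f"s.{k}" for k in keys),
        staging=staging, dts=dts, src=src,
        join=" AND ".join(f"h.{k} = s.{k}" for k in keys))

sql = generate_hub_load("hub_customer", HUBS["hub_customer"],
                        "stg_customer", "2011-01-01", "CRM")
```

Because Data Vault loading patterns are this uniform, the template set stays small, which is why generated SQL can undercut a full ETL tool on ROI here.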
12. Daan van Beek – @Erik in general ETL tools should let us lower the costs of data warehouse projects and maintenance by a factor of 3 to 5; are you, so to speak, sending all the ETL developers back to the stone age 😉 @Martijn, a helpful distinction, but a transformation for slowly changing dimensions (formal) is available in almost all ETL tools, so why do the vendors not embrace Data Vault? Are users not asking for it? Or is it too complex?
Research on ETL tools
13. Ronald Damhof – @daan; pls send me the research that states that ETL tools lower the costs of DWH projects & maintenance by a factor of 3 to 5. I have never seen that kind of research; software is not responsible for lower project or maintenance costs – it never is – so pls link.
@daan; “why do vendors not embrace Data Vault”, since when is normal-sense modeling something that needs to be supported by ETL vendors? Do not get it. I was not kidding when I said all ETL tools support Data Vault. I suppose you have researched the typical ‘complexity’ of DV mappings in ETL – can you show us what you find complex?
@daan; “users not asking for it” – that made me laugh.
14. Martijn Evers – @Daan, Except for slowly changing dimensions, of which DV uses a specific and simple case for loading satellites, there is not a lot of DV specific work a classic ETL tool can do, unless it can also do model transformations, which is usually not the case. Current ETL tools just can’t provide a very good ROI on Data Vault, so you can skip them in favor of simpler approaches like generated SQL.
15. Daan van Beek – @Ronald, “software is not responsible for lower project or maintenance costs – it never is” – really confused: almost all of the increase in productivity of the last twenty years or so is due to automation (and over the last centuries mainly due to technology), and you are saying software (in combination with hardware) can never be held responsible for that?? Why do you think that?
@Martijn, thanks for giving a quick reply to @Ronald 🙂 “current ETL tools just can’t provide a very good ROI on Data Vault”
16. Ronald Damhof – @daan; lol – not gonna get that research, am I?? You mentioned research regarding ETL tools (not automation in general; you can automate ETL without ETL tools)… show me – you’ve got numbers – a factor of 3 to 5, according to you, in DWH projects and maintenance. I am curious what research you are referring to. If you provide me the research I will answer your question – ok? @daan; all ETL tools support DV – Martijn’s statement does not change a thing about that.
SQL or ETL tools?
17. Daan van Beek – @ronald, where is this going? You see things that aren’t there. I didn’t mention the word research at all; it’s common sense, like we know that bikes (SQL) can’t ever beat cars (ETL tools) or a Ferrari (PowerCenter, for example) in terms of speed, functionality and usability. And by the way, you said earlier in this discussion that ‘all do’. It would be fair if you at least tried to prove your claim.
Please feel free to comment.