Selection criteria for ETL tools

Photo Daan van Beek MSc
Author: Daan van Beek MSc
Business Partner
Table of Contents

Below we have listed most of the criteria we examined in our ETL & Data Integration Guide. For every vendor we have collected some general information such as the location of the headquarters and the start date of the company (see Company & Market). The bulk of the criteria relate to functional and technical subjects, such as:

What to consider when selecting an ETL tool?

Usability
✓ Platforms supported
Debugging facilities
✓ Data Quality & profiling
Reusability
Scalability
Batch vs Real-time
✓ Future prospects

Conditions for a successful implementation

Important conditions for a successful implementation are explained in more detail in the full ETL & Data Integration Guide 2024. The guide can be purchased here. The guide also contains expert opinions on each vendor.

Company & Market

  • Entry into the market (year): In which year was the product released for the first time?
  • Customers worldwide: The number of customers worldwide
  • Installations worldwide: The number of installations worldwide. For code generators the number of developer seats

Architecture & infrastructure

  • Symmetric Multiprocessing (SMP): Is Symmetric Multiprocessing supported? Standard in Windows NT and UNIX. The processors in SMP systems share their internal and external memory.
  • Massively parallel processing (MPP): Every processor in a MPP system has its own internal and external memory and database, allowing high performance to be achieved. These databases should be synchronized.
  • Cluster Aware: Is the ETL server ‘cluster aware’ and does it support load balancing, fail-over and other cluster facilities?
  • Grid: Can an ETL process run on a ‘grid’ of computers or servers?

Scalability

  • Job distribution: Is it possible to run ETL processes on different machines or processors?
  • Data pipelining: Can the different steps of an ETL process be run on different machines or processors?
  • Partitioning: Is it possible to partition based on, for example, product codes, to determine on which machine or processor the data has to be processed?
  • Basic architecture: h = hub & spoke (all data runs through one point), d = distributed (multiple lines between sources and targets) and m = multi hub/spoke
  • End-to-end BI infrastructure: Does the ETL tool exchange metadata, for example star schemas, with OLAP or reporting tools, from their own range of products or third parties products?
  • Supports CWM: Is the ETL tool CWM-compliant, in other words does it support the Common Warehouse Meta Model?
  • Version control system: Does the product contain a version control system with check-in and check-out features?

Core Functionality

  • Splitting data streams/multiple targets: Is it possible to read a data source once and load the results into two or more tables?
  • Union: Put rows of different tables with the same structure into one table or data set
  • Conditional splitting: The same, but then in a condition, for example, if revenue is higher than 1000 put the results in table 1 otherwise in table 2
  • Pivoting: Is it possible to transform denormalized data, putting data in the column names, into rows?
  • Depivoting: The other way around, transform (highly) normalized data to denormalized data, putting data in the columns
  • Key lookups in memory: Can you load a table completely into internal memory and search the table? (without having to make joins)
  • Key lookups reusable across processes: Are these tables reusable across different loading processes in such a way the key lookup table is loaded once into memory?
  • Read non-structured data: Can the product read non structured data like text, email, video, etc? And, if so, which file types?
  • Slowly changing dimensions: Does the tool support slowly changing dimensions (hm = manually; wizard = wizard; auto = out-of-the-box)
  • Scheduler: Is there a mature scheduler that supports dependencies?
  • Error handling within job: Can errors be detected within the job, and is it possible to change the route within the process flow when a specific error arises?
  • Impact analysis: Is it possible to make an impact analysis of proposed changes (when an attribute or table must change)
  • Data lineage: Can one easily track down the source of an attribute/information-element (reversed impact analysis)
  • Automatic documentation: Can you publish and document a process/transformation automatically and navigate through it with a browser?
  • Support for data mining models: Is it possible during the loading process to make use of the results of a data mining process?
  • Support for analytical functions: During the loading process, is it possible to invoke different kinds of analytical functions like forecasting, basket analysis, regression?

Usability

  • Ease-of-use: The ease-of-use of the product. Is it easy to learn and easy to use on a daily basis?
  • WYSIWYG: Is the principle ‘What You See Is What You Get’ applied to the DATA?
  • Screen design: Is the screen design ergonomic and well-balanced?
  • Task compatibility ETL / EAI: Does the tool support the tasks (in the same sequence) as the ETL developer?
    Training Requirement What is the recommended training both for the developer and the end user?
  • Completeness of the GUI: What percentage of the functionality can be generated directly from the GUI?
  • Training requirements both for the professional developer and the business user (where appropriate)

Reusability

  • Reusability of components: Are components reusable and can they handle parameters (this is not the same as copy-paste)?
  • Decomposition: Can processes be divided into named small pieces of building blocks (modular programming)?
  • User-defined functions: Is it possible to define user-defined functions and to use them in the process flow?
  • Comments on selection of objects: Can one make comments on a selection of objects in such a way that these comments are tightly connected

Debugging

  • Step-by-step running: Can you run the process flow step-by-step?
  • Row-by-row running: Can you run the process flow row-by-row?
  • Breakpoints: Can you set a breakpoint on a particular process step or a row of data?
  • Watch Points: Can you define watch points, so the system postpone running when a certain condition is met?
  • Compiler/validater: Is it possible to validate the process flow (and/or code) with one click of the mouse and are errors reported and marked?

Real-time vs batch

  • Integration batch – real-time: Is it possible to define – within the ETL tool – process flows moving and transforming data in real-time and in batch?
  • Mechanism: How are changes in the source systems detected and put through (mq = message queuing; logging = database logs or journals; trigger = database triggers)?
  • On-demand data integration: Can you publish an ETL process and/or target attribute as a Web Service?

Connectivity

  • Native connections: How much and which native connections does the ETL tool support? (ODBC, OLE DB and flat files excluded)
  • Packages / enterprise applications: How many packages / enterprise applications can the tool read meta data from with one click of the mouse (for example SAP, Siebel, Peoplesoft, JD Edwards, Baan)
  • Real-time connections: How many and which type of message queuing products can the tool connect to?
  • Support for joined tables as source: Can you join two tables in a graphical manner letting the database execute the join as opposed to letting the ETL tool join the tables?
  • Changed data capture: Does the ETL tool support the principle of Changed Data Capture (select only the changes (deltas) from the database)
  • Built-in functions for data quality: Are there functions available to control the quality of the data (for example a matching transformation or an address cleanser) # fuzzy logic
  • Built-in functions for data validation: Does the tool have functions for data validation like ‘delete duplicates’, ‘missing values’, ‘incorrect data’) # 100% match
  • Data profiling: Options for data profiling, uniqueness of values, maximum, minimum, distribution of values, and so on.

General ETL Tool characteristics

  • Platforms: The number of platforms the ETL tool runs on
  • Version: The version number of the product that has been evaluated
  • Engine-based / code generator: Is the tool Engine-based or a Code-generator (eg = engine based; cg = code generator)
  • Type: ‘Process’ when the user can define an unlimited number of tasks/steps between source and target else ‘Map’

Download our guide: 20+ ETL tools reviewed

Get all information to select the best (enterprise) ETL tooling for the best price by ordering the ETL & Data Integration Guide 2024. You’ll get real insight into using ETL tools to build successful ETL applications and we’ll send you results of comparing twenty one ETL tools across 90 criteria.

to the etl & data integration guide

You May Also Like

Featured image ETL tools
ETL tools
Featured image Selecting an ETL tool
Selecting an ETL tool
Featured image Comparison of 20 ETL & Data Integration tools
Comparison of 20 ETL & Data Integration tools

A selection of our customers

Become a customer now

Do you also want to become a customer of ours? We are happy to help you with data-driven working or other things that will make you smarter.

Photo Daan van Beek - Business PartnerDAAN VAN BEEK MScBusiness Partner

Contact me directly

Fact sheet

___
customers
___
training courses
___
people trained
9.3
customer satisfaction
___
consultants & teachers
3
offices
19
years of experience