Seven Pillars of an Effective Extract, Transform, and Load (ETL) Process

By Jim Earley, Director of Engineering

ETL imageOne of the biggest challenges that organizations of all sizes face is the expansion of data across a broad range of applications and storage platforms. Typically, there’s a set of active systems that maintain current operational data, and often there are legacy applications in the same domains running “legacy” data.

No matter where your data resides, it’s a vital asset that must be leveraged to its fullest potential for organizations to maximize revenue potential. The challenge is, how can you harness, harmonize, and normalize disparate data sources into platforms that allow you to run predictive analytics and other emerging artificial intelligence tooling on your data and maximize the value of your data in other ways?

We understand that data comes in all shapes and sizes, and leveraging all data is critical to business today. We built the ETL Acceleration Suite to addresses common challenges of the ETL process so you can move your data from legacy systems to an archive faster and with high confidence in data quality.

One way is to move data from legacy systems to an archive. Extract, transform, and load (ETL) is a critical component of this process.

ETL is often thought of as a standalone, one-off operation to convert data from one format to another. That’s part of it, but an effective ETL strategy needs to consider seven core pillars:

1.    Do I need to take everything? Can I take everything?

A useful analog for this is to think about moving from one home to another. As you undertake the challenge of packing up your belongings, you may inevitably ask yourself, “Do I really want to take that with me?”

The same is true with ETL. It’s often the case that data storage systems contain a lot of information for the underlying application state (e.g., configuration settings), and even application access controls (e.g., user login information). Our extraction tools can be configured to process everything or handle only what you need.

2.    How do I get at the data?

This is perhaps one of the biggest challenges of extracting data from a legacy system. It’s not uncommon for sophisticated relational databases to have complicated table structures with dozens, even hundreds, of joins to return records. Depending on the mainframe system, for example, data could be stored in any of several formats, including VSAM, IMS, or IDMS.

Our RDBMS Extract tool can create materialized views to retrieve data, and the Cobol Data Extractor can read mainframe copybooks to read the data from mainframe file dumps. If you have data stored in character separated value (CSV) files, we can process these as well.

3.    What’s my target format and data structure?

In any ETL strategy, understanding where the extracted data will be stored is critical. Today, that strategy will often depend on the platform and the intended use cases.

Many data applications store a combination of both structured and unstructured data, such as financial loan documents, medical X-rays, MRIs, and others. Maintaining the integrity and relationships between structured and unstructured data is paramount – our Extraction tools are designed to do this.

4.    How can I run ETL processing efficiently? What happens if something fails?

It’s not uncommon for us to see legacy systems with multiple terabytes of data that need to be processed. It’s critical to consider how segmentation, partitioning, and parallelism play a key role in handling the vast amount of data in an efficient manner. ETL Acceleration Suite tools are designed to run in parallel and in multithreaded processes to handle the load. They also can be configured to execute on different segments of the data for greater scalability.

Make no mistake, CPU, disk speed, and network bandwidth matter, so it’s important to think about ETL as a first-class production issue – at least temporarily – and given the appropriate resources to bear.

Resiliency, or the ability to recover from unexpected failure, is critical. Things happen. A network switch can hiccup, a disk can fail, or a system update could inadvertently reboot a system in the middle of an ETL process. When we’re dealing with millions and billions of records, it’s not efficient to have to restart an ETL process from the beginning in the event of a failure. For this reason, our ETL tools are designed with resume features to pick up from where they left off.

5.    How do I ensure the integrity and validity of my extracted data?

There are numerous ways that data can fail during an ETL process. For example, how do you ensure that a native database date format is properly formatted to an ISO-8601 date value in XML or JSON? Did the extraction properly escape special characters, like quotes and ampersands?

And what about data structure? How can you validate that a denormalized record that joins data from several tables matches the data structures you expect to have in your target data system? You don’t want to ingest terabytes of data unless you’re confident that the data you’ve extracted is valid.

This is where our Preflight Validation can help. Using defined schemas for the target data environment, you can quickly validate all the data to ensure that its format and structure are valid.

6.    Loading – Eating the Elephant

Most ETL tools today can reasonably support the “E” and the “T,” and they leave the “L” to other tooling that natively understand the target system. This is usually one of the most expensive and intensive parts of the entire process.

Like extraction, data loading shares the same potential points of failure. The axiom goes like this: “How do you eat an elephant? One bite (byte) at a time.” You can see how the potential risks of failure can creep into a single threaded linear process. This is where our Loading tools are essential to create a seamless, scalable, and resilient ingestion process.

7.    Did everything get loaded?

When you’re moving from one home to another, you typically take one last look around the home you’re leaving to make sure that nothing was left behind. Our Chain of Custody tools give you the confidence that your target system has all the data you expected from your source system.

Wrapping it Up

To recap, data is one of the most valuable assets in your organization. Often it’s locked into disparate and frequently incompatible systems and data platforms. Many of these may be legacy systems.

As organizations modernize their application portfolios, move to the cloud, and evolve their applications in other ways, there’s an incredible opportunity to harness information from legacy systems to maximize its value. An ETL strategy with tools like the ETL Acceleration Suite is a key enabler to making legacy data available.

In my next blog, I’ll discuss how to implement some of these ETL strategies, starting with data extraction and transformation from various data platforms.