Data Extraction Strategies

By Jim Earley, Director of Engineering

data extraction conceptIn my previous article, I provided a high-level overview for the main pillars of data extraction, transformation, and loading (ETL). I addressed the key principles for a successful ETL strategy.

In this segment, I’ll dive deeper into the “E” of ETL – data extraction. I’ll talk about types of data extraction, the reasons for choosing a particular type of extraction, and some tools that support these strategies. As organizations move more applications to the cloud and decommission older and obsolete applications, there is also a strong desire to migrate the data from these applications.

Relational Database Extraction

For the past three decades, relational database systems (RDBMS) have been the foundation of enterprise-level software. As organizations evolve and grow, and as software applications shift and change, valuable data is left behind in the legacy applications. Our experience with customers is that many of these legacy database versions are frequently out-of-date, and upgrading them to current versions is not worth the expense. Nevertheless, the data stored in them still has value.

Now as applications shift to the cloud, there is even greater opportunity to take advantage of newer data storage and archive platforms that lower total cost of ownership while allowing businesses to harness the data to its fullest potential.

There are several potential extraction strategies that can be employed to extract and archive relational archive data:

  • We can extract everything and create a replica dataset transformed into a standard data format like XML for portability. This is the simplest path forward with the least amount of risk of leaving data behind, especially for data that may be highly regulated or with strict retention policies. But it also incurs the greatest overhead and cost since many of these legacy application databases also store application data not relevant to the business. Moreover, depending on the age of the application, older data that may not be as valuable can be discarded to save on storage and management costs.
  • Select only the relevant tables and records containing the data needed by the business. This certainly simplifies storage and overhead and eliminates the “cruft” left behind by applications. Depending on your overall data archiving and business analytics strategy, leaving the data in its original table schema may require additional transformation to leverage the value from the data.
  • Create “Materialized Views” to extract data. This strategy provides the greatest level of control over the data you want to migrate with potentially added benefits of transforming it into data structures that could maximize your cloud data storage strategy.

Flatirons’ RDBMS Extract supports all these strategies, allowing organizations to choose the approach that best fits their business needs and/or regulatory requirements. We have connectors for a wide range of legacy relational databases, including Oracle 7 and up, SQL Server 2005 and up, AS400, DB2, MySQL, Progress, PostgreSQL, and others, ensuring that you can migrate from older databases.

Mainframe Data

One of the greatest needs we hear from organizations today is to safely decommission their mainframe applications, some of which date back several decades. Some of these applications, still in use today, provide mission-critical services such as state unemployment benefits, electronic healthcare records, finance and banking, and insurance. Many of these COBOL-based applications are being supported and maintained by a diminishing number of COBOL programmers. As a result, CIOs and CTOs need to figure out what their strategy is for reducing reliance on COBOL-based programs. One way to lessen reliance and safeguard COBOL-based data is to migrate to commodity and cloud storage or to an archive in data formats that are portable and easily maintained.

The uniqueness of COBOL applications, the underlying data storage platforms, and character encoding schemes like EBCDIC are often seen as challenges for data extraction and migration. With Flatirons’ COBOL Data Translator, we can configure a full transformation of mainframe data by reading in the application’s copybooks to understand the application’s data structure, including LEVELS, data offset locations, data types including PICTURE types and default VALUE clauses, and other features such as REDEFINES, RENAMES, and USAGE (with COMP-1, COMP-2 and COMP-3) clauses.

Once this configuration is generated, we can transform data binary files from VSAM, IMS, and IDMS storage platforms and support translating from older encoding schemes like EBCDIC to modern encodings like UTF-8. We also preserve the data structure and hierarchy using XML (and are working on a JSON serialization), making it easy to archive.

Character Separated Data Files

There are some uses cases where legacy data is stored in data formats that use proprietary tools and storage formats, or where access to the data source is not readily available. In these cases, we can transform “data dumps” formatted as character separated data files using Flatirons’ CSV Translator, which can be configured to handle huge files (in one example, we processed a 1TB file) with a low memory footprint.

Data Integrity and Reliability

One of the critical aspects of any data extract/transform strategy is ensuring that the integrity of the data is maintained. Whether handling Packed Decimals in COBOL data, DATE and TIME formats in various relational database platforms, embedded binaries, or different character encodings, the extract and transform tools in the ETL Acceleration Suite from Flatirons are designed to handle all these different cases and ensure that the underlying data is transformed without losing the “intent and meaning.”

Wrapping Up

In this segment, we discussed some of the key considerations and scenarios that organizations face today when thinking about decommissioning legacy applications and archiving the data. Whether it’s for maintaining regulatory compliance for business continuity, the big challenge has been to define an archiving strategy that can encompass and support the broad range of data platforms used by the numerous enterprise applications still being maintained today – all at significant cost and overhead for these organizations. As organizations pivot to the cloud, a key part of their strategy will be to decommission the myriad of legacy applications. Flatirons is uniquely positioned to help.

In my next blog, I’ll go deeper into data validation and chain of custody.