Choosing the Right Data Model for Data Archiving

By Patrick Moan, Program Manager


If you’ve been following along with this set of blogs, congrats! If not, welcome to the discussion! As we’ve discussed in prior blogs, data modeling has come a long way over the past 50 years. In the ’60s, data models were very application-private – frequently expressed in COBOL copybooks – with VSAM, IMS, or IDMS storage models gradually introducing the notion of distinct logical and physical data models.

In the ’70s we had CODASYL (network) and then relational models. In the ’80s we started seeing the concepts of object-oriented design impacting storage models (MUMPS on Caché, ObjectStore, etc.). In the 2000s, the explosion of the World Wide Web fueled the growth of a variety of databases collectively known as NoSQL databases – XML databases such as X-Hive, MarkLogic, and eXist; JSON-centric stores such as MongoDB and CouchDB; stores such as Neo4j and HBase; scalable, append-only file systems such as Hadoop; and massively scalable storage systems such as Google’s BigQuery and Bigtable.

That’s a lot of technology and a lot of underlying storage and data models to consider.

For the purposes of this conversation, let’s narrow our discussion to the problem of archiving data. We can simplify the conversation quite a bit by noting that there are two broad archiving data model strategies – one that carries the legacy system’s relational data model into the archive (using a relational model for both the physical and the archive data models), and one that focuses on expressing business objects into the archive, regardless of the storage architecture of the original system.

In this article, we will explain a bit about each and describe common scenarios where it makes sense to use one versus the other.

Business Objects, SIP, and OAIS

There are many object model tools and standards in the market that are useful for describing an application’s object model. One of the standards we’ve found useful for expressing the interchange of business objects is the ISO reference model called the Open Archival Information System, or OAIS.

OAIS-compliant interchange between a legacy system and an archive is typically done by the construction of SIPs (Submission Information Packages). The SIPs can express any set of business objects that is deemed to be useful when accessing the data in the archive.

The intent with a business object approach and the use of SIPs for interchange is three-fold:

  • A SIP definition is an ISO standard, so vendor neutral.
  • A SIP has a payload that can be configured to express any business object model that can be expressed in a (set of) XML schema(s), so the standard itself is quite application agnostic.
  • Typical targets for SIP ingestion are highly scalable stores, either providing massive horizontal scaling (e.g., Google Bigtable) or optimizing for scale and cost through transparent use of tiered storage (e.g., OpenText™ InfoArchive®).
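To make the SIP idea concrete, here is a minimal sketch of assembling a SIP-like XML package in Python. The element names (`sip`, `package_information`, `payload`, `business_object`) and fields are hypothetical, chosen only to illustrate the shape of a package with descriptive metadata plus an application-agnostic payload – they do not reproduce the actual OAIS or InfoArchive schemas.

```python
import xml.etree.ElementTree as ET

def build_sip(producer, records):
    """Assemble a minimal SIP-like XML package.

    Illustrative element names only -- not an actual OAIS or
    InfoArchive schema.
    """
    sip = ET.Element("sip")

    # Package-level information about the submission itself.
    info = ET.SubElement(sip, "package_information")
    ET.SubElement(info, "producer").text = producer
    ET.SubElement(info, "record_count").text = str(len(records))

    # The payload carries the business objects; its structure is
    # whatever the archive's XML schema(s) define.
    payload = ET.SubElement(sip, "payload")
    for rec in records:
        item = ET.SubElement(payload, "business_object")
        for field, value in rec.items():
            ET.SubElement(item, field).text = str(value)

    return ET.tostring(sip, encoding="unicode")

xml_doc = build_sip("LegacyEMR", [{"patient_id": 42, "name": "Jane Doe"}])
```

Because the payload schema is configurable, the same packaging code can carry patient records, invoices, or any other business object the archive needs to serve.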

Relational Model Source Systems and Table-based Archiving

Our other primary data model choice can be used when a source system uses a relational data model. In this approach, we may recreate both the logical and physical data models of the source system in the archive, or we may perform optimizations such as denormalizing data, creating materialized views, or changing the indexing structure to reflect the fact that most of the archive will be read-only.
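Here is a small sketch of those archive-side optimizations using Python’s built-in SQLite. The table names and columns are hypothetical, and since SQLite lacks materialized views, a `CREATE TABLE AS SELECT` stands in for one; the point is the pattern of joining once at archive time and indexing for a read-mostly workload.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical source tables, recreated in the archive.
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders    VALUES (10, 1, 99.50);

    -- Denormalize at archive time: perform the join once and store the
    -- flattened result (CREATE TABLE AS stands in for a materialized view).
    CREATE TABLE archive_orders AS
        SELECT o.id AS order_id, c.name AS customer_name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id;

    -- The archive is read-mostly, so index for lookup speed rather than
    -- worrying about write overhead.
    CREATE INDEX idx_archive_customer ON archive_orders(customer_name);
""")

rows = con.execute(
    "SELECT order_id, customer_name, total FROM archive_orders"
).fetchall()
```

Queries against `archive_orders` never touch the original join, which is exactly the trade a read-only archive can afford to make.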

Key Differences Between Table-based and Object-based Archiving

When retiring an application using a table-based approach, the structure of the data in the archive will closely match that of the source legacy application (typically built on a relational database): for many tables, every column of every row, along with links to unstructured data such as images and PDFs, will be recreated in the archive.

For customers with limited knowledge of the source application and/or stringent regulatory requirements, the entire source database can be archived. The data in this new archive is organized to be very similar to the legacy database, perhaps excluding operational tables such as auditing, temporary tables, or other structures that will not be needed in a read-only archive. Using this approach, queries that leverage knowledge of the source system can be developed years after archiving as new business requirements are identified.

On the other hand, object-based archiving typically requires more upfront analysis of the application data and business requirements than table-based archiving does. This is because each object is designed as the archival equivalent of an item such as a book, a record album, or a motion picture: it consists of multiple diverse elements from the source system’s physical data model, including both metadata and file content. We make design decisions to simplify query development on the business objects and to facilitate massive scale: we denormalize data to eliminate joins and sub-selects, and we cluster metadata and content files together. The resulting data model is fast to ingest, fast to index, fast to search, and well suited to tiered storage technologies.
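As an illustration of that denormalization step, the sketch below folds normalized rows (a patient, their encounters, their scanned documents) into one self-contained business object that the archive can serve without any joins. All field names here are hypothetical, not taken from any particular EMR system.

```python
def to_business_object(patient, encounters, documents):
    """Fold normalized source rows into one self-contained business
    object, so the archive can serve it without join queries.

    Field names are hypothetical, for illustration only.
    """
    return {
        "patient_id": patient["id"],
        "name": patient["name"],
        # Denormalized: encounter rows are embedded in the object
        # rather than joined in at query time.
        "encounters": [
            {"date": e["date"], "diagnosis": e["diagnosis"]}
            for e in encounters
        ],
        # Content files are clustered with their metadata.
        "documents": [d["path"] for d in documents],
    }

obj = to_business_object(
    {"id": 7, "name": "Jane Doe"},
    [{"date": "2019-03-01", "diagnosis": "A09"}],
    [{"path": "scans/7/visit1.pdf"}],
)
```

Each object is retrieved in a single read, which is what makes this shape fast to index and search at scale.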

Let’s review some scenarios where it makes sense to use one approach versus the other.

Scenarios Where Object-based Archiving Makes Sense

If your legacy data set is extremely large and the query patterns are known a priori, then an object model approach may be the way to go since the SIPs used during ingestion are already optimized to your specific search and reporting needs. Searching and reporting against a large, pre-defined object model can be much more performant than a large table archive.

If you need to combine data from multiple applications into a consolidated archive with a well-known data model, a business object approach is a natural candidate, since an object model based archive can be populated with data from multiple legacy applications. For example, if you need to archive data from several different electronic medical record (EMR) systems into a single, consolidated archive, a business object model archive enables a 360-degree view of patient data from all applications in one place.

In addition, this approach enables the use of simple, extremely scalable backend data stores because this approach minimizes the use of complicated backend joins or cross-collection queries.

Even if your legacy application is built upon a relational database format, a table-based approach may not be the best way to go. Often, the legacy application logic and queries required to get to the data are very complex and CPU-intensive, involving joins across numerous tables.

If you need to retrieve data quickly and the object model and reporting requirements are well known, a business object model approach is often the way to go, since data can be reorganized into a business object that is very close to what the application needs, thus eliminating CPU-intensive joins and speeding up search and reporting. This data modeling approach underpins some petabyte-scale systems, such as Google Earth and Google’s web index.

Another scenario to consider is subsets of the legacy application data representing different business objects. With a business object model approach, the archive can be designed to just hold the specific data from the legacy application that is needed for business continuity, separated by business object types. For example, your legacy application contains both patient and accounts receivable data. These are distinct business objects with different reporting requirements. The relevant data for each may be collected from the legacy application and stored to the archive in distinct data sets for efficient access and reporting.
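That separation can be as simple as routing each extracted record into a data set keyed by its business object type. The sketch below shows the idea; the type names (`patient`, `accounts_receivable`) and record fields are hypothetical.

```python
from collections import defaultdict

def partition_by_type(records):
    """Route extracted legacy records into distinct archive data sets,
    one per business object type.

    Type names and record fields are hypothetical examples.
    """
    datasets = defaultdict(list)
    for rec in records:
        # Each business object type gets its own data set, so patient
        # and accounts receivable reporting never scan each other's data.
        datasets[rec["object_type"]].append(rec)
    return dict(datasets)

datasets = partition_by_type([
    {"object_type": "patient", "id": 1},
    {"object_type": "accounts_receivable", "invoice": "INV-9"},
    {"object_type": "patient", "id": 2},
])
```

Because each data set carries only one object type, its reporting requirements can be tuned independently of the others.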

Scenarios Where Table-based Archiving Makes Sense

Not surprisingly, our other approach, which we refer to as table-based archiving, has a design point that’s quite different from the business object model approach. A table-based archive approach is designed to optimize requirements that include the following:

  • Smaller archives. There may be a few use cases that require the creation of materialized views from the source system, but if the source system holds less than 1 TB of data, most enterprise-scale archives can deliver reasonable performance without the creation of business objects.
  • Rapid project schedule. Remember that defining a scalable business object model may involve discussions with many stakeholders, which can introduce long delays into the project schedule.
  • With table-based archiving, you can begin the extract, transform, and load process right away, without significant up-front data analysis for reporting. All you need to do is set up your archiving infrastructure; analysis for reporting can follow at a time of your choosing. If subsequent analysis requires the addition of materialized views, this can be done as needed, since the source physical data model has been recreated in the archive.
  • New reporting requirements or ad-hoc reporting can be enabled post-archive as new business needs are identified. Note that ad-hoc reports can be created against either a table-based or a business object model archive, but a table-based archive will also provide access to source system physical data model components that might not have been included in a business object model archive.

A final note is to consider that you may have some applications that are suited to a business object model approach and some better suited to a table-based approach. That’s totally OK – today’s archiving platforms can host business object model-based and table-based solutions in the same deployment fabric, whether it is on-prem, cloud, or hybrid cloud-based.

Not sure where to begin? We’re glad to help you get started. Contact us.