Monday, April 6, 2009

IBM Optical Data Conversion (EBCDIC)

Here at FusionWare we love a good challenge involving disparate systems, data and business logic. We've been doing this for so long that very few things can stump us.

Recently we had a customer with a large amount of historical data stored on optical drives attached to an AS/400. They had an application on the AS/400 that let them read and process this data so they could view historical information. They had since migrated their application to another system, but compliance regulations (and collections) required them to retain access to their historical data.

Unfortunately, this meant they kept paying maintenance on the AS/400. Worse, the hardware was old enough that it was about to drop off IBM maintenance, which would have forced a hardware upgrade.

The customer visited trade shows like COMMON and contacted all sorts of companies, but everyone they spoke to said "No, we can't migrate this data off - you're stuck with the AS/400". Their own creative in-house efforts were making gradual progress, but they simply didn't have the time to do all the conversions themselves, and they needed tooling to make the work efficient.

Finally, the customer found us. We discussed what they were trying to do and provided a proposal and estimate to do the following:

  • Convert their historical data to a SQL database.
  • Build a web-based application to access this data, providing at least the same functionality as the existing AS/400-based lookup program.
  • Deliver good performance when accessing the database.

In consultation with the customer, we decided to use SQL Server for the converted data (any SQL database that could handle the volume would have done) and IIS with ASP.NET to rapidly create the web GUI for accessing the new data store. Again, other web options could have been used; we work with our customers to find the solution that will give the best results and meet their corporate standards.

In doing the work, there were a number of interesting challenges that we had to work through:

  • The optical data was in EBCDIC format. We needed special tools to convert from EBCDIC to ASCII, including handling packed-decimal and other special formats (see the first sketch after this list).

  • The optical data was huge: over 60 GB of raw data. We needed a target database that could handle the volume, and indexing was critical to ensure reasonable performance for the resulting application.
  • The optical files consisted of three parts: header, metadata, and data. Over time, the format of the data written changed, so that there were six variations of metadata for one file type.
  • Occasionally, garbage files were written to the optical drive and/or garbage data was written to some of the files. The only way to know was to convert the file and detect the problem when processing the converted data.
  • Because the data set was so large, attempts to anticipate data problems by sampling were largely unsuccessful. You really had to go for it and deal with anomalies as you encountered them, so a good restart approach was critical.
  • Some critical values were embedded in free-format text fields whose placement and wording changed over time, so we had to devise an algorithm to extract them reliably (see the second sketch after this list).
  • There were several collections of data. We started with one of the better-defined, but larger, sets. One objective was to come up with reusable components and code that would make subsequent collections easier to work with.
  • Security and privacy. The customer's data raised privacy concerns, so we transferred it between the customer's office and ours on Maxtor BlackArmor secure USB drives (http://www.maxtor.com/en/hard-drive-backup/external-drives/maxtor-blackarmor.html). We did our development and testing work locally, with all the data (including the SQL Server database) on our own BlackArmor drives, ensuring maximum security and protection of the customer's data.
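To give a flavor of the first challenge above, here is a minimal sketch in Python of the kind of decoding involved. The code page (037) and the sample field are assumptions for illustration; the real field layouts came from the layout files described below.

    def ebcdic_to_ascii(raw):
        """Decode an EBCDIC text field (assuming code page 037)."""
        return raw.decode("cp037")

    def unpack_comp3(raw, scale=0):
        """Decode an IBM packed-decimal (COMP-3) field: two digits per
        byte, with the low nibble of the final byte holding the sign
        (0xD means negative)."""
        digits = []
        for b in raw[:-1]:
            digits.append((b >> 4) & 0x0F)
            digits.append(b & 0x0F)
        digits.append((raw[-1] >> 4) & 0x0F)
        value = int("".join(str(d) for d in digits))
        if raw[-1] & 0x0F == 0x0D:
            value = -value
        return value / (10 ** scale)

    # Example: bytes 0x12 0x34 0x5C hold +12345; with 2 implied
    # decimal places that reads as 123.45.
    assert unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2) == 123.45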

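The free-format text challenge is worth its own illustration. Here is a hedged sketch of the layered-pattern approach: suppose an account number had to be pulled from a description field whose surrounding text drifted over the years. The patterns are invented for the example and do not reflect the customer's actual formats.

    import re

    # Hypothetical patterns for an account number whose surrounding
    # text changed over time, ordered from most to least specific.
    ACCOUNT_PATTERNS = [
        re.compile(r"(?:ACCT|ACCOUNT)\s*(?:NO|#)?[:.]?\s*(\d{6,10})"),
        re.compile(r"\bA/C\s+(\d{6,10})\b"),
        re.compile(r"\b(\d{10})\b"),  # last resort: a bare 10-digit number
    ]

    def extract_account(text):
        """Try each pattern in turn; return the first match, or None so
        that unmatched records can be flagged for manual review."""
        for pattern in ACCOUNT_PATTERNS:
            m = pattern.search(text.upper())
            if m:
                return m.group(1)
        return None

    print(extract_account("Payment received acct no: 00123456"))  # 00123456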
The solution involved a number of tools and steps:

First, we used a product called VEDIT from Greenview Data, Inc. (http://www.vedit.com/), including their Level 2 EBCDIC conversion tools, to facilitate the conversion. This product lets you inspect the data and view it in both ASCII-converted and hex mode on the same screen (split window). It also supports a macro mode, so you can automate operations from a command line. VEDIT uses something called a layout file to do its EBCDIC conversion.

We used the FusionWare Integration Server (our own product) to orchestrate the steps, transfer the resulting ASCII-delimited files, run SQL DML scripts, and create layout files and SQL DDL scripts.

Because formats changed over time, we had to do the conversion in several steps:

The first step was a preprocessing phase. We started by breaking the EBCDIC files up into their header, metadata, and data portions. Then we processed the metadata and used XSLT to create layout files and SQL DDL scripts. We had to associate each converted file with the appropriate layout files. When this pass was done, we had numerous variations of both the SQL DDL scripts and the layout files.
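As an illustration of this idea (our actual pipeline used XSLT and produced VEDIT layout files, which have their own format), here is a hedged Python sketch of turning one parsed metadata variation into a SQL DDL script. The field names and type mapping are invented for the example.

    # Hypothetical parsed metadata: (field name, type, length, decimals).
    # The real metadata formats varied over time; this is illustrative only.
    METADATA_V1 = [
        ("STMT_DATE", "CHAR", 8, 0),
        ("ACCT_NO", "CHAR", 10, 0),
        ("AMOUNT", "PACKED", 7, 2),
    ]

    SQL_TYPES = {
        "CHAR": lambda length, dec: "VARCHAR(%d)" % length,
        "PACKED": lambda length, dec: "DECIMAL(%d,%d)" % (length, dec),
    }

    def make_ddl(table, fields):
        """Emit a CREATE TABLE statement for one metadata variation."""
        cols = ",\n  ".join(
            "%s %s" % (name, SQL_TYPES[ftype](length, dec))
            for name, ftype, length, dec in fields
        )
        return "CREATE TABLE %s (\n  %s\n);" % (table, cols)

    print(make_ddl("STATEMENTS_V1", METADATA_V1))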

We used these initial steps to create the SQL Server tables and to build the application for viewing the data. This was an ASP.NET application, accessed through a browser, with Windows authentication and role-based access.

Then we started the conversion itself. The conversion process had to detect which variation of the layout file to use to create the ASCII comma-delimited data. Because of some data issues, we also had to add a data-cleansing step for some of the files. Then we transferred the data into the SQL Server tables. A sketch of this loop follows.
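In outline, the conversion loop looked something like the following hedged sketch. The section splitting, variant detection, and checkpoint file are stand-ins for what we actually built; the point is the restartable structure, which (as noted above) proved critical.

    import json, os

    CHECKPOINT = "progress.json"  # hypothetical checkpoint file

    def split_sections(raw):
        """Stand-in for splitting an optical file into its header,
        metadata, and data parts; raises ValueError on garbage."""
        if len(raw) < 3:
            raise ValueError("file too short")
        return raw[:1], raw[1:2], raw[2:]

    def detect_variant(metadata):
        """Stand-in for the rule matching a file's metadata to one of
        the six layout variations."""
        return "v%d" % (metadata[0] % 6 + 1)

    def write_csv(path, variant, data):
        """Stand-in for EBCDIC conversion plus data cleansing."""
        print("converted %s using layout %s (%d bytes)" % (path, variant, len(data)))

    def convert_all(files):
        # Remember which files are already done, so a failed run can
        # restart where it left off instead of starting from scratch.
        done = set()
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                done = set(json.load(f))
        for path in files:
            if path in done:
                continue
            try:
                with open(path, "rb") as f:
                    header, metadata, data = split_sections(f.read())
                write_csv(path, detect_variant(metadata), data)
                done.add(path)
                with open(CHECKPOINT, "w") as f:
                    json.dump(sorted(done), f)  # checkpoint after every file
            except ValueError as e:
                print("skipping garbage file %s: %s" % (path, e))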

Actually, the above three steps were iterative. As we got to the data phase, we discovered data issues that required us to restart the final phase, and in some cases redo parts of the preprocessing phase as well. Some of this required changes to the SQL tables, as we discovered that at one point the application had added new fields to the database.

The application we built for accessing their data provided greater searchability than their original AS/400 application, and performance was not only better than accessing the AS/400 optical data, but actually faster than accessing historical data on their new system.

The end result of this process was a set of reusable code components that can be applied to the additional collections.

The customer now has their initial collection (statements) sitting in a SQL Server database (about half a billion rows, taking up about 130 GB of data, index, and tempdb space), and a process and components that they can repurpose to convert their other collections of historical data. Once we complete the conversion of those collections, they will be able to decommission the old AS/400 system and its optical drives while still meeting their legal obligations.
