Sunday, September 2, 2018

Adventures in Mapping Multivalues - Data Cleansing on the Fly

In my last post, I talked about how we used a powerful feature that we alternately called Select Methods and simply Filters, to solve a problem with control files.  In this post, I'll take you through one of the powerful tricks that Select Methods let you play to control your data.

Select Methods, or Filters, are simply Pick/BASIC subroutines that get called for each record read from the file, before any processing is done on that record.  They have the following signature:


It is generally recommended that you not change ID, but we won't stop you. ITEM is the item body, and as you will see, changing it is quite acceptable. In the last post we showed a way that you would use PROCEED to indicate if a particular record should be processed for the current view, or skipped.

Data Cleansing on the Fly

To understand the importance of this feature, you need to understand what we've mentioned in previous posts. You can define a dictionary definition for a file, if you wish, but this is merely suggestive. Your BASIC programs, and any number of TCL commands (Terminate Control Language was the command-line for Pick systems) could update any format of data into the file.  Examples are a programmer, fixing bad data, accidentally copies a fixed record into the wrong file, with the wrong format for that file.  A BASIC program with a bug, writes a text string into a numeric field. Or some kind of data corruption occurs and the data in a record becomes gibberish.

In the SQL world, you just can't do that, so we would crash a query if we encountered bad data.  We used to have customers insist that their data was good, so we would run our validator on their files and it would start kicking out reports of bad data. In all the years that I worked with Pick customers, I have NEVER run into one that did not have bad data in at least a few of their key files. In most cases, every file with more than 10,000 records had some bad data.

Customers had two options. One was to fix the bad data our validator found. This was great except that we would always ask how the bad data got there. The first time you got bad data again, the query would crash again.  In cases where data was being corrupted by a program or process unknown, there were a couple of simple fixes that could be done using Select Methods:

The first option would be to use a Select Method as a filter. Let's say that you have two fields, attributes 5 and 7, that need to be numeric.  The following program would detect if the data contained was not numeric and tell the program to skip it. As noted in the comments and the omitted code, you can also write a message with the item-id into a log file so you can fix it later.  This is a better option when you think the record is possibly complete garbage and you really don't want to play with it until you've had a look.

Here is what that subroutine would look like:

   * Optionally add code here to write the ID to an error log
   * file so someone can review it

The other option is to cleanse the data on the fly. This is what that might look like:


This option makes more sense if you know what attributes are getting the bad data, but have reasonable confidence that the other data in the record is good.  This latter approach was one of the most common uses of our Select Methods.

No comments:

Post a Comment