Saturday, August 11, 2018

What is a PICK MultiValue Database?

Why I'm Blogging This

In some upcoming posts, I'm going to be talking about how we implemented an ANSI/92 compliant SQL engine and ODBC driver for this, and some of the interesting challenges we encountered. In some cases, I'll be talking about the relative merits of these systems vs. relational databases.  They all have their strengths.

In order to do this, I thought it would be good to have a bit of a primer about how Pick data is stored and processed. I'll probably reference this little blog post quite a bit.

It's Still Here

The first thing I want to say is that systems running on Pick based multivalue are ubiquitous and they are EVERYWHERE! In some cases they are legacy solutions that have been in place and working for ever.  Probably every fortune 500 company has at least one system, running in some department, that runs on it.  It's also still being sold as part of applications around the world.

Examples of industries and solutions that continue to run and/or sell best-of-breed solutions running on PICK include (but are not limited to) Automotive Retail, Libraries, Banking, Healthcare, Insurance, Manufacturing, Distribution, EMS, Hospitality, Municipal, Government, and many more.

Ignore them at your peril!

Where Did it Come From?

In about 1965, the US department of defense awarded a contract to TRW to create a system to manage Cheyenne helicopter parts.  Two engineers, Don Nelson and Dick Pick were assigned to the task, and started work on an IBM System/360 computer. They named their system GIRLS, which was an acronym for Government Inventory Relational Language System.

This was before commodity hardware was a thing. Somehow they got the idea to create a system based on a virtual machine that they could implement on any hardware.  There was some really good architecture and design that went into this system.

How Does PICK Data Work?


The database was a very simple concept.  Whatever document you needed to represent, you would have a data structure that contained all the information in one place.  With a single read, a programmer could pull back all the key data that was needed for an invoice.  In real practice, your invoice would reference a customer by an id, and the customer name would remain in the customer master file. Similar things would happen with part numbers and the part names.  The invoice would include both the header information and the detail lines for the parts being ordered.

The data is stored in a hashed file structure and each record is stored as a sparse string with delimiters separating fields, values and subvalues.  Let's look at an example.

Still working with the invoice concept, lets consider a manufacturing organization that sells three key products, widgets, gidgets and gadgets.  The invoice file has an invoice date, customer id, then it has 4 fields that contain multiple values (multivalues) in them.  These are product id, options, quantity and price.  When you display an invoice it might look like this:

Invoice 0001 Date 11 Aug 2018 Customer ABC - ABC Company

Product Options Quantity   Price............ Extended...
WIDGET  GOLD           2            $115.95    $231.90
GIDGET  RED            1            $220.75    $220.75
        SMALL
========================================================
Subtotal:                                      $452.65
Taxes
Totals

When looking at the document, you can see that you have header data that occurs just once per document, including the invoice number, date, and customer information. You would look up the customer's name from the customer master file.  You also have computed or calculated fields, like extended costs, subtotals, taxes and final totals.

The interesting thing to note is the invoice lines data. In a real invoice all the lines, an set of arbitrary size, is part and parcel of, and contained in, the invoice document.  For a single invoice you could have an arbitrary number of products being purchased. for each you would have a quantity and a price.  In our case, we further complicated it by allowing different products to have different options. You might choose a colour and a size as with the gidget we had on the second invoice line.  At this point we not only have multiple values, but one of the values (the OPTIONS value for the GIDGET line) has two subvalues, one for the colour and one for the size.

Don and Dick came up with a database structure that allowed that.  Your fields were delimited by field or attribute mark delimiters. Your multiple values were delimited by value mark delimiters and your subvalues were delimited by subvalue mark delimiters. The data for the above INVOICE file might look like this:


Note that the blocks are the field delimiters. the superscripted '2' is a value mark, and the superscripted 'n' is a subvalue mark. These are high-ASCII delimiters. Going left to right, we have the first field, the item-id (similar to a primary key) which is the invoice number '00001'. Then you have a date. This is represented as number of days since Dec 31, 1967.  Next we have the customer id 'ABC' for ABC Company. Then we get a field that has two values. This field and all the subsequent fields have two values, the first one is line 1 of the detail lines and the second is line 2 of the detail lines. This first multivalued field has the product ids of the products being bought.The next one has the options selected for each line. You can see that for detail line 1 there is only one option, but for the 2nd detail line (the GIDGET), we have two options: The colour RED and the size SMALL. Then we have the quantities of each of these and the unit price for each line. The price has an implied 2 decimals (it is storing the value as cents, not decimal dollars).

For a programmer, the invoice number gave you the whole document. You might need to do ancilliary lookups in the CUSTOMER, PRODUCTS and OPTIONS tables, but a single read and a single write gets you your data and saves it to the database.

In a relational database, the INVOICE file would need to be a minimum of 3 tables. The record represented above would take a minimum of 6 disk reads to get the same data. As you increase the number of detail lines and the number of product options, the complexity grows rapidly.  A programmer in this case needs to retrieve the data from the database and make sure he gets it all. Then he needs to properly relate the data sets in his program. The complexity is huge compared with the multivalued approach. 

So, in relational terms, you would probably have these tables:  INVOICE_HEADER, INVOICE_LINES, INVOICE_LINES, INVOICE_LINES_OPTIONS.

Think of how this would work if you stored your physical invoices like this. You'd go to the filing cabinet with header information and look up the header part of the invoice. Then you'd go to another filing cabinet where you'd have to find a separate page for each invoice line. You'd use the invoice number and line number. You'd keep looking until you couldn't find another line number for that invoice. Then you'd go to yet another filing cabinet, and for each invoice line for that invoice, you'd look for one or more pages that had information for the options (possibly multiple) for those invoice lines. I've seen invoices that had hundreds of lines in them. Can you imagine then having to organize and manage those hundreds of pages?  Instead of one invoice that was maybe 3 pages long?

There is also a performance component to this.  I remember working with a customer who had moved from a mainframe to a minicomputer platform, and had determined that they could not afford the disk head movement of a relational database.  They were a big customer of our ODBC technology as we enabled them to have a relation view of the data and use it with tools that required this, while still giving them the reduced disk head movement for their core application.

How Did You Access That Data?

Pick had a concept called a dictionary. You would have a file that contained field definition records associated with every file. It was completely optional, and once the BASIC language came out for Pick, it didn't enforce anything in it. In fact, you could have two dictionaries that defined the same attribute completely differently. Generally, one would be wrong and the other right, but I've seen where a single file has multiple different record types, based on a prefix in the key. Which dictionary you wanted to use depended on the record key!

You could create a dictionary definition item that actually used a value from the current record to read a value from another file. This was called a translate, and was very powerful.  For the example index above, you would take the customer id (let's say it's attribute 2) and use it to pull attribute 1 (customer name) from the CUSTOMERS file. That translate correlative would look like this:

001 A
002 2
003 CUSTOMER NAME
...
007 TCUSTOMERS;X;1;1
...

Similarly, you could use this to pull information from a PRODUCTS file.

When these systems were first commercially sold, salespeople would walk a machine in from a panel van (they were the size of a fridge, so this was quite a feat.)  Then they'd use it to create a couple of tables, use the editor to create some dictionary definitions. Create a script program using PROC and BATCH to enter data. Then they'd use a LIST command to print out the data.

So the command "LIST INVOICES" would be automatically expanded out to:
LIST INVOICES INVOICE.DATE CUSTOMER.ID PRODUCT.ID OPTIONS QUANTITY PRICE

which would give us this output:


It was not uncommon for this to be so impressive for the business users watching the demo that a cheque would be written and the computer left there. Then they had to get programmers in and write a system... but it was also not uncommon for a business person to dictate what the system had to do and create his own reports.  Compared with the options available at that time, this was a huge step forward, and many commercial systems that are still in use today had their start with a business owner looking over a programmer's shoulders. This was the first instance of agile and pair programming!

Later Add-ons

The first systems were written using a combination of PROC (a scripting language), BATCH (which was horrible, and impossible to make pretty - totally character based), and when you got stuck, Assembler.

Later, a BASIC implementation was added with built-in support for the multivalued database, and that allowed you to use extended features of an ASCII terminal.  These databases still exist and go by a number of names:


  • D3
  • Universe
  • Unidata
  • mvBase
  • mvEnterprise
  • jBase
  • OpenInsight
  • QM and OpenQM
  • RealityX
  • And others

Relational Value and Pushback

An interesting thing to note is that many applications are being written today to use what are called NoSQL databases. The rationale for these databases sounds like an ad for Pick.  While relational databases have their uses, there are clearly applications where they are just not the right solution.  

That said, a huge amount of investment has been made by companies like Oracle, IBM, Microsoft, Cognos, Business Objects and others into enhancements and tools that leverage relational databases. This has not been the case with Pick, so there are many applications where it is imperative that the regardless where your data originates, you need to get it into relational in order to leverage these tools and technologies.

The industry is clearly seeing a divide form between SQL and NoSQL databases, yet there is also growing clarity about when you should use each.  What's not as clear is that there is a NoSQL option that has already existed for a long time and that is widely in use, that's another option.

No comments:

Post a Comment