Thursday, August 16, 2018

Adventures in Mapping Multivalues - Two Common Approaches

The Problem

In a previous post, I talked about what a Pick database was.  Pick had a way of representing complex data that was intuitive and matched the structure of common documents that it was modeling very closely. It was a very human-friendly way to represent that data. Note that this is much the same argument for XML data (that's for another blog post...)

In this post, I'm going to talk about how this might look in a relational database, then I'll talk about two different approaches that Pick uses to represent this kind of data:

  • Correlated Multivalues
  • Subfiles

Our Document Data

So, let's consider the example we gave of our Invoice. Here is how it might look in a printed document, with no consideration of how it actually is stored in the database:

Invoice 0001 Date 11 Aug 2018 Customer ABC Company

Product Options Quantity   Price............ Extended...
WIDGET  GOLD           2            $115.95    $231.90
GIDGET  RED            1            $220.75    $220.75
        SMALL
========================================================
Subtotal:                                      $452.65

Note that you have an invoice number, which should be a unique key, a date, and customer, that are all fields that have a single value.  Then you have a bunch of fields that have more than one value: Product, Options, Quantity, Price. We also have two calculated fields, one being a subtotal, and the other being the extended price (Quantity times Price), which has multiple values, one for each product line.  Note also that the Options value can have multiple values for each product value, so this is a sub-multivalued field, or subvalued for short.

Relational (SQL) Model

Here is one possible relational representation of that data:

SELECT * FROM INVOICE_MASTER
Invoice_Id Invoice_date Customer...........................
      0001  11 Aug 2018 ABC Company

SELECT * FROM INVOICE_LINES
Invoice_Id Line_Id Product Quantity Price....
      0001       1 WIDGET         2    115.95
      0001       2 GIDGET         1    220.75

SELECT * FROM INVOICE_OPTIONS
Invoice_Id Line_Id Option_Id Option
      0001       1         1 GOLD
      0002       1         1 RED
      0002       2         1 SMALL

Let's talk about some points here.  In the INVOICE_LINES table, we've added a Line_Id field, which combined with the Invoice_Id would be the primary key (unique identifier). If you knew for sure that you would never repeat a product in another line of your invoice, that any product would be unique within any invoice, you could skip that Line_Id field and use the Product value with the Invoice_Id as the primary key. That assumption would probably bite you. (In a multivalued database, you never have to create artificial keys like this.)  The same thing happens twice over in the INVOICE_OPTIONS table.  This mess of decisions, keys, and additional values creates code entropy, as programmers are required to do multiple reads, then navigate multiple data sets to process the data. One is generally safer to create a new key in these cases, but then you have a task of managing these keys when you write the data back, or insert (or delete) new invoice lines or options.

Correlated Multivalues

Here is the most common representation of the above in Pick:

0001^18486^ABC Company^WIDGET]GIDGET^C*GOLD]C*RED\SZ*SMALL^2]1^
11595^22075


Notice that in this structure, you have minimal wasted space. It was developed when memory and disk were both expensive, so keeping things small was beneficial.  The up-arrow character '^' is the attribute, or field delimiter. The square bracket ']' is a multivalue delimiter, and the backslash '\' is the subvalue delimiter. The actual delimiters are high-ASCII characters that were not commonly used in data in typical ASCII systems.

The first thing you see is what Pick calls the Item-Id. In Relational terms, this is a single field that is always the entire primary key. It is a unique identifier that is used to determine which bucket of the hash-file to put the data  in, or to find it in, if you are looking data up.

The next two fields are the invoice date (number of days since Dec 31, 1967) and the customer name. They both have just one value in them.

The next 4 fields contain multivalues. The first value of each of these fields is the data for line 1 of the invoice. The second value of each of these fields is the data for line 2 of the invoice.  For the Options field, the first value contains just that value. The second value contains two subvalues. The first one is the first option line for the second invoice line, the second is the second option line for the second invoice line.

For a Pick BASIC programmer who has read this record into a variable called X, he would access the first invoice line's product with this code:

PRODUCT.ID=X<3,1>

The second line's product value would be this:

PRODUCT.ID=X<3,2>

The Item-Id is not part of what comes back from a read (you need it to do the read), so it's often referred to as attribute 0.

To get the second option of the second invoice line, you would use this:

OPTION.CODE=X<4,2,2>

You basically had to know the attribute numbers for your fields and you could get at anything. One read statement got you everything for your document. Your record really was your document.

Subfiles

A less common, but still often used approach to representing this data is called a subfile. In a subfile, all the detail line data is contained in a single field. Here is how it would be represented:

0001^18486^ABC Company^WIDGET\C*GOLD\2\11595]GIDGET\C*RED,
SZ*SMALL\1\22075

All of your detail lines are contained in attribute 3. The first multivalue is all of invoice line 1's data. The second multivalue is invoice line 2's data.  The first subvalue of each multivalue is always the product, the second subvalue is the options, the third is the quantity and the fourth subvalue of each multivalue is the price.

Note that we ran out of delimiters, so we had to use a comma separator two delimit the second line's two option values.

So here is what a Pick BASIC programmer would do to get the first invoice line's product:

PRODUCT.ID=X<3,1,1>

You need to specify that you only want that first subvalue, or you'll get the whole invoice line. That may be handy in itself, but then you'd have to do a second line to get that data out.

The second line's product value would be this:

PRODUCT.ID=X<3,2,1>

To get the second option of the second invoice line, you would use this:

OPTION.CODE=FIELD(X<3,2,2>,",",2)

This line gets a bit more involved. The angle brackets only work with system delimiters, so it retrieves both option values, with the comma separator. The FIELD statement says to pull the 2nd comma delimited value out.

Again, your record is really the entire document.  Note also, that the BASIC implementation has data access and handling functionality built in. You don't have to call out to an API as it's all there!

Liberty Mapping Terminology

Liberty ODBC's SQL Mapping layer handled both of these structures very simply.

The Correlated Multivalues were termed a "Group Floating" table.  If you were dealing with Correlated Subvalues, you had a "Group Floating within Group Floating" table.

The subfiles were called a "Group Positional" table.  By using these structures, you would get the equivalent to the above.  You could define SQL columns that were the multivalue or subvalue offsets, effectively giving you the Line_Id and Option_Id fields.

When joining two tables where one was actually contained inside the other, our engine would optimize and read the Pick record only once for efficiency sake.

Upcoming

In some of my upcoming posts, I'll talk about some of the interesting and odd problems we had to deal with in order to efficiently map and process multivalued data in our SQL engine.

No comments:

Post a Comment