So Totally Software

Saturday, October 13, 2018

Adventures in Mapping Multivalues - Joins

What's a Join?

Relational databases are built on a mathematical principal called Set Theory. To help us along, let's look at two sets.

We'll call set A the PRODUCTS set. Each element of the set contains the following properties:
PRODUCT_ID - always numeric, 1-4 digits
PRODUCT_NAME - text, up to 100 characters
COLOUR_ID - always numeric, 1-4 digits, may be empty

Note that not all products come in multiple colours, so the PRODUCT_COLOUR_CODE field may be empty in those cases. Most SQL implementations treat an empty value in a string field as a NULL. I explain what that means later.

We'll call set B the COLOURS set. Each element of the set contains the following properties:
COLOUR_ID - always numeric, 1-4 digits
COLOUR_NAME - text, up to 100 characters

So let's look at the elements of our PRODUCTS set (I've shortened the column names to fit the blog):

P_ID P_NAME................. C_ID
1 Widget 1
2 Gidget 2
3 Gadget 1
4 Thing-a-ma-bob 4
5 What-cha-ma-call-it NULL (not assigned)
6 Whatever 3

Here are the elements of our COLOURS set:

C_ID C_NAME..............
1 Red
2 Green
3 Purple
4 Blue

In the purest sense, a join is a super-set made up of combining two sets into one new set. Let's look at the "simplest" example:

Full Cartesian Joins

The simplest join of all is a full Cartesian join. While it's simple, it's very counter-intuitive for humans. Here is what a Cartesian join of the two sets above would look like:

P_ID P_NAME.............. P.C_ID C.C_ID C_NAME
1 Widget 1 1 Red
1 Widget 1 2 Green
1 Widget 1 3 Purple
1 Widget 1 4 Blue
2 Gidget 2 1 Red
2 Gidget 2 2 Green
2 Gidget 2 3 Purple
2 Gidget 2 4 Blue
...

Wait!!! What's happening here? Well, if you don't constrain a join, it will create a new set in which every member of set A gets paired up with every member of set B. In our case above, the result is 24 rows. As stated earlier, while the logic is very simple, it's very counter-intuitive. Note that I've added the prefix P. or C. to the front of the C_ID columns to distinguish which one came from the PRODUCTS or COLOURS sets.

While the is the default processing, it is almost never what you want to do in real commercial systems. In SQL terms you would get this from the following syntax:

SELECT * FROM PRODUCTS, COLOURS

What this says in SQL terms is "Create a new set (the result set), containing all properties and all combinations of elements, without constraint, of the PRODUCTS and COLOURS sets".

Inner Joins

As a human being, I look at the two sample sets (table) above, and realize immediately, that I only want to combine elements from the COLOURS set in my superset when the COLOUR_ID property in the PRODUCTS set matches the COLOUR_ID property in the COLOURS set. This is called an Inner Join. There are two ways to accomplish this with SQL. The following two SQL commands product the identical result set:

SELECT * FROM PRODUCTS P, COLOURS C WHERE P.COLOUR_ID = C.COLOUR_ID

SELECT * FROM PRODUCTS P INNER JOIN COLOURS C ON P.COLOUR_ID = C.COLOUR_ID

The key thing is that we are constraining the result set to be only those combinations with matching COLOUR_ID properties in both sets.

For our example this produces the following result set:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
1 Widget 1 1 Red
2 Gidget 2 2 Green
3 Gadget 1 1 Red
4 Thing-a-ma-bob 4 4 Blue
6 Whatever 3 3 Purple

Hmmm. What happened to the element of the PRODUCTS set that had PRODUCT_ID 5? Well, given our join condition, there was no element of the COLOURS set whose COLOUR_ID property matched the empty COLOUR_ID value of the PRODUCTS table element, so it didn't fit our criteria and got dropped. While we're getting closer, that's still not quite what we were looking for! This brings us to...

Left Outer Joins

What you really want is this syntax:

SELECT * FROM PRODUCTS P LEFT OUTER JOIN COLOURS C ON P.COLOUR_ID = C.COLOUR_ID

What this is saying is that, reading your sets (tables) from left to right in the SQL command, the result set should always include all elements of the left-most set (in this case PRODUCTS), even if there isn't a matching element in the sets to the right (COLOURS). In this case you get this result set:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
1 Widget 1 1 Red
2 Gidget 2 2 Green
3 Gadget 1 1 Red
4 Thing-a-ma-bob 4 4 Blue
5 What-cha-ma-call-it
6 Whatever 3 3 Purple

I'm going to display it as most raw SQL outputs would display it:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
1 Widget 1 1 Red
2 Gidget 2 2 Green
3 Gadget 1 1 Red
4 Thing-a-ma-bob 4 4 Blue
5 What-cha-ma-call-it NULL NULL NULL
6 Whatever 3 3 Purple

NULLs are Nasty!

NULLs are not the same as an empty string. They have nasty implications. Think of NULL as an "unassigned" or "unknown" value.

One artefact of set theory is tertiary logic. Lets think of phone numbers. If I have a customer table where I track phone numbers, there are actually three possibilities for any customer.

They have a phone number and I have captured it
They don't have a phone number and I know that for sure
I have no idea if they have a phone number or not, and/or don't know what it is

Note the the syntax "NULL = NULL" technically returns false. I may not know your phone number and I may not know your best friend's phone number, so I use the NULL value to indicate that. That does not make them the same. Unless you two are room-mates, the two numbers are almost certainly NOT the same. That's why NULL = NULL returns FALSE in most SQL databases.

In the case of our PRODUCTS table, it may be desirable to pull back all PRODUCTS that don't define colours. In those cases, all those NULLs really define the same state. It's like the N/A value. Doesn't apply and all "doesn't applies" are the same when it comes to colours. NULL causes more problems than it solves, but because it's an important part of set theory, it became entrenched in all relational databases, for better or worse.

How PICK Handles This

In the Multivalued PICK world, we do not bother ourselves with set theory. It's nice if you want to apply mathematics to your data, but even there, the only math is really assists you with is set theory.

In the PICK world, you'd add a dictionary definition record to the dictionary of the PRODUCTS file and use a translate correlative code, "TCOLOURS;X;1;1" to pull the name of the COLOUR in, using the COLOUR_ID field on the PRODUCTS table. the "X" in that correlatives is a code that says, "If there's no item in the target table (COLOURS) that contains that id, return an empty string." The result would look something like this:

P_ID P_NAME............... C_ID.. C_NAME
1 Widget 1 Red
2 Gidget 2 Green
3 Gadget 1 Red
4 Thing-a-ma-bob 4 Blue
5 What-cha-ma-call-it
6 Whatever 3 Purple

No NULL values exist in the PICK world. You would have to come up with a special value for "unknown" or "N/A" and handle it explicitly. That can be done and would be far more obvious and intuitive than SQL's NULL value.

Dick Pick and Don Nelson designed their database to store its data as a sparsely populated string, called a dynamic array. Empty strings were quite common and very efficient. Empty strings were often called NULL, but did not have the same meaning as in SQL.

How Did Liberty ODBC Handle This?

PICK users were (and in some cases still are) a very unique breed. They were typically business people, with no computer science training, who found the PICK system so intuitive that they could create their own queries, and in many cases help design their own databases. This sometimes caused problems because with their lack of computer science background, they'd do things that would cause their system to grind to a halt.

As we started moving these people into the ODBC world, they would try to pull data into Excel, and things would go sideways quickly.

One of the problems was that they would do a full Cartesian join, assuming that the SQL processor would be "smart enough" to figure out the obvious join conditions. Our very tiny, simple example above produced 24 rows. Let's imagine that you have your largest table, with 1 million rows of data, and you join it to a parts table with 10 thousand rows. 1 million times 10 thousand is 10 billion rows. The result set will be 10 thousand times bigger than your biggest table. This likely results in two undesirable problems:

It will take forever to come back with results
You will likely run out of disk space and crash your system

We added logic to our SQL processor to detect a Cartesian join. If we saw one, we'd check a system parameter called "allow_full_cartesian_joins". The default value was FALSE. A user could change the setting. If they did it using our admin tool, we'd put up a big warning and give the the option to cancel out or proceed. This alone solved a lot of support problems.

The other thing we did was to ALWAYS teach our users what a LEFT OUTER JOIN was and why the ALMOST ALWAYS wanted it!

And lastly, we would train our users to understand what a SQL NULL was and what the pitfalls were, and that when they transferred an empty string to a SQL VARCHAR field, it would become a nasty NULL, and would often behave in ways that were counter-intuitive.

Friday, September 7, 2018

Adventures In Mapping Multivalues - Complex Data Parsing

Splitting Fields

A common situation in Pick is where you have a compound key in one file that links to another file. As in the case of the control file, you may have a record with a key like C*RED. Somewhere in your data, you may store that option value in a multivalue. You may wish to display the fact that you have selected the colour "RED". In order to do that, you could reference a dictionary that does a group extract on that field. The group extract "G1*1" says find the first "*" and extract 1 "*" delimited value. In our case it would give you the value "RED".

With this structure, our mapping tool would allow you to simply pick a dictionary that gave you the whole value for the foreign key, but also pick a dictionary that used the group extract to show you that extracted, or computed, value.

Sometimes the logic for doing this would get quite convoluted, and while I have seem massive correlatives in dictionaries, these are hard to maintain and very hard to create properly. In these cases, our Select Method subroutines had a simple way of allowing you to do this type of processing. To illustrate this, let's look at this example Pick record:

0001^18486^ABC Company^WIDGET]GIDGET^C*GOLD]C*RED\SZ*SMALL^2]1^
11595^22075

Note that everything up to the first attribute mark (^) is attribute 0 (attributes are zero-based), and the "]" signifies a value mark, and the "\" signifies a sub-value mark.

In order to process this option data, we want SQL to have a field that's a Colour field and a field that's a Size field. No such attribute exists in PICK, so what we could do is start by choosing two attribute positions that are not used by real data. Let's say we look at the file and there are never more than 23 attributes, and no valid dictionaries reference anything higher than 23. We could leave some room for growth and pick attributes 31 and 32 as dummy attributes.

Next we'd create a dummy dictionary for attribute 31, called Colour and one for attribute 32 called Size. We'd use the comment field of the dictionary to note that it's used by the Liberty mapping for a computed value. We need the dictionaries for the mapping wizard to create the SQL column mapping.

COLOUR
001 A
002 31
003 Colour
...
009 L
010 10
011 Liberty Mapping Virtual Field

SIZE
001 A
002 32
003 Size
...
009 L
010 10

011 Liberty Mapping Virtual Field

Then we'd create this subroutine:

SUBROUTINE SELECT.METHOD.INVOICE(ID, ITEM, PROCEED)
EQU VM TO CHAR(253)
EQU SVM TO CHAR(252)
* Run through multivalues - invoice lines
FOR M=1 TO DCOUNT(ITEM<4>,VM)
* Run through subvalues in the current multivalue
* these are the option values for that invoice line
FOR S=1 TO DCOUNT(ITEM<4,M>,SVM)
OPTION=ITEM<4,M,S>
TYPE=FIELD(OPTION,"*",1)
BEGIN CASE
CASE TYPE EQ "C"
* the multivalue position must match
* but there will only be one Colour option
ITEM<31,M>=FIELD(OPTION,"*",2)
CASE TYPE EQ "SZ"
* the multivalue position must match
* but there will only be one Size option
ITEM<32,M>=FIELD(OPTION,"*",2)
END CASE
NEXT S
NEXT M
RETURN
END

The SQL might look like this:

SELECT * FROM INVOICE_MASTER
Invoice_Id Invoice_date Customer...........................
0001 11 Aug 2018 ABC Company

SELECT * FROM INVOICE_LINES
Invoice_Id Line_Id Product Quantity Price.... Colour. Size.
0001 1 WIDGET 2 115.95 GOLD NULL
0001 2 GIDGET 1 220.75 RED SMALL

The benefit of this is that you don't need a third view for the options, as in our hypothetical case we know (hypothetically) that any invoice line will only have one of any option type assigned to it.

Mixed Types

One of our customers that did deliveries had a very unique delivery field. You could put a date and time in, or you could put text in. The date might be 2018-09-12 15:30:00 or it might be "Tomorrow, any time before noon". There was value in being able to sort any date fields that were entered as dates, but you also had to deal with text. We simply did a format check, and created two virtual fields. One would have the properly formatted dates and times, the other would have the text.

As you can see, Select Methods were an incredibly powerful tool for mapping multivalues.

Sunday, September 2, 2018

Adventures in Mapping Multivalues - Data Cleansing on the Fly

In my last post, I talked about how we used a powerful feature that we alternately called Select Methods and simply Filters, to solve a problem with control files. In this post, I'll take you through one of the powerful tricks that Select Methods let you play to control your data.

Select Methods, or Filters, are simply Pick/BASIC subroutines that get called for each record read from the file, before any processing is done on that record. They have the following signature:

SUBROUTINE SUB-NAME(ID, ITEM, PROCEED)

It is generally recommended that you not change ID, but we won't stop you. ITEM is the item body, and as you will see, changing it is quite acceptable. In the last post we showed a way that you would use PROCEED to indicate if a particular record should be processed for the current view, or skipped.

Data Cleansing on the Fly

To understand the importance of this feature, you need to understand what we've mentioned in previous posts. You can define a dictionary definition for a file, if you wish, but this is merely suggestive. Your BASIC programs, and any number of TCL commands (Terminate Control Language was the command-line for Pick systems) could update any format of data into the file. Examples are a programmer, fixing bad data, accidentally copies a fixed record into the wrong file, with the wrong format for that file. A BASIC program with a bug, writes a text string into a numeric field. Or some kind of data corruption occurs and the data in a record becomes gibberish.

In the SQL world, you just can't do that, so we would crash a query if we encountered bad data. We used to have customers insist that their data was good, so we would run our validator on their files and it would start kicking out reports of bad data. In all the years that I worked with Pick customers, I have NEVER run into one that did not have bad data in at least a few of their key files. In most cases, every file with more than 10,000 records had some bad data.

Customers had two options. One was to fix the bad data our validator found. This was great except that we would always ask how the bad data got there. The first time you got bad data again, the query would crash again. In cases where data was being corrupted by a program or process unknown, there were a couple of simple fixes that could be done using Select Methods:

The first option would be to use a Select Method as a filter. Let's say that you have two fields, attributes 5 and 7, that need to be numeric. The following program would detect if the data contained was not numeric and tell the program to skip it. As noted in the comments and the omitted code, you can also write a message with the item-id into a log file so you can fix it later. This is a better option when you think the record is possibly complete garbage and you really don't want to play with it until you've had a look.

Here is what that subroutine would look like:

SUBROUTINE VALIDATE.MYFILEDATA(ID, ITEM, PROCEED)
IF NOT(NUM(ITEM<5>)) OR NOT(NUM(ITEM<7>)) THEN
PROCEED=0
* Optionally add code here to write the ID to an error log
* file so someone can review it
END ELSE
PROCEED=1
END

The other option is to cleanse the data on the fly. This is what that might look like:

SUBROUTINE FIX.MYFILEDATA(ID, ITEM, PROCEED)
IF NOT(NUM(ITEM<5>)) THEN
ITEM<5>=0
END
IF NOT(NUM(ITEM<7>)) THEN
ITEM<7>=0
END

This option makes more sense if you know what attributes are getting the bad data, but have reasonable confidence that the other data in the record is good. This latter approach was one of the most common uses of our Select Methods.

Friday, August 24, 2018

Adventures in Mapping Multivalues - Control Files

What Shape is Your Data?

In a SQL world, a specific column in a specific table has exactly one data type and it cannot change. In a pinch, you can represent everything with a varchar, but that's not very useful. Arithmetic gets tricky for numbers, and dates become impossible to sort!

Pick, on the other hand, much like XML, is essentially free-form. You can define dictionaries for your data, but there is nothing that forces you to follow the dictionary layout, or even warn you if you depart, and in many Pick dictionaries, you don't have to look far to find two dictionary records that point at the same attribute, but have completely different definitions.

Just about every Pick system I ever worked on had a control file. About 75 to 80% of the time, it was called "CF". The item-id of this file would normally have a structure like this:

Control-record-type "*" Identifier

The asterix was probably the most common delimiter in early Pick systems and carries through to many applications. So some examples might be:

SIZE*SMALL
COLOUR*RED
INDUSTRY.CODE*312000

The thing with this file is that each record type would have a unique structure. For SIZE the first attribute might be an integer, which is the number of inches. COLOUR might have 3 integers for red, green and blue, and INDUSTRY.CODE might have a text name of the industry, followed by some tax rates.

In order to process this and a whole bunch of other problems, we came up with a strategy called a "Select Method" or "Filter". This was a subroutine that you would associate with a table that you defined. This filter was a Pick BASIC subroutine with this signature:

SUBROUTINE SUB-NAME(ID, ITEM, PROCEED)

As soon as our SQL engine read a record for processing, it called this subroutine, passing the item-id in ID, and the full dynamic array text of the item body in ITEM. You got to set PROCEED to 1 or 0 (true or false). If you set it to false, we'd skip the record.

So, for our example above, we'd define 3 table views on the CF file. The first might be called CF_SIZE and the filter method would look like this:

SUBROUTINE LF.CF.SIZE(ID, ITEM, PROCEED)
IF FIELD(ID,"*",1) EQ "SIZE" THEN PROCEED=1 ELSE PROCEED=0
RETURN
END

It would only process records that started with "SIZE*".

You'd then create another view called CF_COLOUR with this filter:

SUBROUTINE LF.CF.SIZE(ID, ITEM, PROCEED)
IF FIELD(ID,"*",1) EQ "COLOUR" THEN PROCEED=1 ELSE PROCEED=0
RETURN
END

This would only process records that started with "COLOUR*".

And you'd wind up with one called CF_INDUSTRY_CODE with this filter method:

SUBROUTINE LF.CF.SIZE(ID, ITEM, PROCEED)
IF FIELD(ID,"*",1) EQ "INDUSTRY.CODE" THEN PROCEED=1 ELSE PROCEED=0
RETURN
END

This would only process records that started with "INDUSTRY.CODE".

We would then define a different mapping for each record type.

Note that we avoided periods in table names as periods were significant separators in the SQL world. If you used them you'd have to quote your table names. In the early days, many ODBC clients wouldn't do this, so you were asking for trouble if you used them!

In addition to Row Select Methods, there were Row Delete Methods, Row Insert Methods and Row Update Methods that you could use to tweak how you wrote data back. We'll deal with this in another post. Our next few posts will look at some more magic that we did with Row Select Methods, otherwise known as Filter Methods.

Thursday, August 16, 2018

Adventures in Mapping Multivalues - Two Common Approaches

The Problem

In a previous post, I talked about what a Pick database was. Pick had a way of representing complex data that was intuitive and matched the structure of common documents that it was modeling very closely. It was a very human-friendly way to represent that data. Note that this is much the same argument for XML data (that's for another blog post...)

In this post, I'm going to talk about how this might look in a relational database, then I'll talk about two different approaches that Pick uses to represent this kind of data:

Correlated Multivalues
Subfiles

Our Document Data

So, let's consider the example we gave of our Invoice. Here is how it might look in a printed document, with no consideration of how it actually is stored in the database:

Invoice 0001 Date 11 Aug 2018 Customer ABC Company

Product Options Quantity Price............ Extended...
WIDGET GOLD 2 $115.95 $231.90
GIDGET RED 1 $220.75 $220.75
SMALL
========================================================
Subtotal: $452.65

Note that you have an invoice number, which should be a unique key, a date, and customer, that are all fields that have a single value. Then you have a bunch of fields that have more than one value: Product, Options, Quantity, Price. We also have two calculated fields, one being a subtotal, and the other being the extended price (Quantity times Price), which has multiple values, one for each product line. Note also that the Options value can have multiple values for each product value, so this is a sub-multivalued field, or subvalued for short.

Relational (SQL) Model

Here is one possible relational representation of that data:

SELECT * FROM INVOICE_MASTER
Invoice_Id Invoice_date Customer...........................
0001 11 Aug 2018 ABC Company

SELECT * FROM INVOICE_LINES
Invoice_Id Line_Id Product Quantity Price....
0001 1 WIDGET 2 115.95
0001 2 GIDGET 1 220.75

SELECT * FROM INVOICE_OPTIONS
Invoice_Id Line_Id Option_Id Option
0001 1 1 GOLD
0002 1 1 RED
0002 2 1 SMALL

Let's talk about some points here. In the INVOICE_LINES table, we've added a Line_Id field, which combined with the Invoice_Id would be the primary key (unique identifier). If you knew for sure that you would never repeat a product in another line of your invoice, that any product would be unique within any invoice, you could skip that Line_Id field and use the Product value with the Invoice_Id as the primary key. That assumption would probably bite you. (In a multivalued database, you never have to create artificial keys like this.) The same thing happens twice over in the INVOICE_OPTIONS table. This mess of decisions, keys, and additional values creates code entropy, as programmers are required to do multiple reads, then navigate multiple data sets to process the data. One is generally safer to create a new key in these cases, but then you have a task of managing these keys when you write the data back, or insert (or delete) new invoice lines or options.

Correlated Multivalues

Here is the most common representation of the above in Pick:

0001^18486^ABC Company^WIDGET]GIDGET^C*GOLD]C*RED\SZ*SMALL^2]1^
11595^22075

Notice that in this structure, you have minimal wasted space. It was developed when memory and disk were both expensive, so keeping things small was beneficial. The up-arrow character '^' is the attribute, or field delimiter. The square bracket ']' is a multivalue delimiter, and the backslash '\' is the subvalue delimiter. The actual delimiters are high-ASCII characters that were not commonly used in data in typical ASCII systems.

The first thing you see is what Pick calls the Item-Id. In Relational terms, this is a single field that is always the entire primary key. It is a unique identifier that is used to determine which bucket of the hash-file to put the data in, or to find it in, if you are looking data up.

The next two fields are the invoice date (number of days since Dec 31, 1967) and the customer name. They both have just one value in them.

The next 4 fields contain multivalues. The first value of each of these fields is the data for line 1 of the invoice. The second value of each of these fields is the data for line 2 of the invoice. For the Options field, the first value contains just that value. The second value contains two subvalues. The first one is the first option line for the second invoice line, the second is the second option line for the second invoice line.

For a Pick BASIC programmer who has read this record into a variable called X, he would access the first invoice line's product with this code:

PRODUCT.ID=X<3,1>

The second line's product value would be this:

PRODUCT.ID=X<3,2>

The Item-Id is not part of what comes back from a read (you need it to do the read), so it's often referred to as attribute 0.

To get the second option of the second invoice line, you would use this:

OPTION.CODE=X<4,2,2>

You basically had to know the attribute numbers for your fields and you could get at anything. One read statement got you everything for your document. Your record really was your document.

Subfiles

A less common, but still often used approach to representing this data is called a subfile. In a subfile, all the detail line data is contained in a single field. Here is how it would be represented:

0001^18486^ABC Company^WIDGET\C*GOLD\2\11595]GIDGET\C*RED,
SZ*SMALL\1\22075

All of your detail lines are contained in attribute 3. The first multivalue is all of invoice line 1's data. The second multivalue is invoice line 2's data. The first subvalue of each multivalue is always the product, the second subvalue is the options, the third is the quantity and the fourth subvalue of each multivalue is the price.

Note that we ran out of delimiters, so we had to use a comma separator two delimit the second line's two option values.

So here is what a Pick BASIC programmer would do to get the first invoice line's product:

PRODUCT.ID=X<3,1,1>

You need to specify that you only want that first subvalue, or you'll get the whole invoice line. That may be handy in itself, but then you'd have to do a second line to get that data out.

The second line's product value would be this:

PRODUCT.ID=X<3,2,1>

To get the second option of the second invoice line, you would use this:

OPTION.CODE=FIELD(X<3,2,2>,",",2)

This line gets a bit more involved. The angle brackets only work with system delimiters, so it retrieves both option values, with the comma separator. The FIELD statement says to pull the 2nd comma delimited value out.

Again, your record is really the entire document. Note also, that the BASIC implementation has data access and handling functionality built in. You don't have to call out to an API as it's all there!

Liberty Mapping Terminology

Liberty ODBC's SQL Mapping layer handled both of these structures very simply.

The Correlated Multivalues were termed a "Group Floating" table. If you were dealing with Correlated Subvalues, you had a "Group Floating within Group Floating" table.

The subfiles were called a "Group Positional" table. By using these structures, you would get the equivalent to the above. You could define SQL columns that were the multivalue or subvalue offsets, effectively giving you the Line_Id and Option_Id fields.

When joining two tables where one was actually contained inside the other, our engine would optimize and read the Pick record only once for efficiency sake.

Upcoming

In some of my upcoming posts, I'll talk about some of the interesting and odd problems we had to deal with in order to efficiently map and process multivalued data in our SQL engine.

Saturday, August 11, 2018

What is a PICK MultiValue Database?

Why I'm Blogging This

In some upcoming posts, I'm going to be talking about how we implemented an ANSI/92 compliant SQL engine and ODBC driver for this, and some of the interesting challenges we encountered. In some cases, I'll be talking about the relative merits of these systems vs. relational databases. They all have their strengths.

In order to do this, I thought it would be good to have a bit of a primer about how Pick data is stored and processed. I'll probably reference this little blog post quite a bit.

It's Still Here

The first thing I want to say is that systems running on Pick based multivalue are ubiquitous and they are EVERYWHERE! In some cases they are legacy solutions that have been in place and working for ever. Probably every fortune 500 company has at least one system, running in some department, that runs on it. It's also still being sold as part of applications around the world.

Examples of industries and solutions that continue to run and/or sell best-of-breed solutions running on PICK include (but are not limited to) Automotive Retail, Libraries, Banking, Healthcare, Insurance, Manufacturing, Distribution, EMS, Hospitality, Municipal, Government, and many more.

Ignore them at your peril!

Where Did it Come From?

In about 1965, the US department of defense awarded a contract to TRW to create a system to manage Cheyenne helicopter parts. Two engineers, Don Nelson and Dick Pick were assigned to the task, and started work on an IBM System/360 computer. They named their system GIRLS, which was an acronym for Government Inventory Relational Language System.

This was before commodity hardware was a thing. Somehow they got the idea to create a system based on a virtual machine that they could implement on any hardware. There was some really good architecture and design that went into this system.

How Does PICK Data Work?

The database was a very simple concept. Whatever document you needed to represent, you would have a data structure that contained all the information in one place. With a single read, a programmer could pull back all the key data that was needed for an invoice. In real practice, your invoice would reference a customer by an id, and the customer name would remain in the customer master file. Similar things would happen with part numbers and the part names. The invoice would include both the header information and the detail lines for the parts being ordered.

The data is stored in a hashed file structure and each record is stored as a sparse string with delimiters separating fields, values and subvalues. Let's look at an example.

Still working with the invoice concept, lets consider a manufacturing organization that sells three key products, widgets, gidgets and gadgets. The invoice file has an invoice date, customer id, then it has 4 fields that contain multiple values (multivalues) in them. These are product id, options, quantity and price. When you display an invoice it might look like this:

Invoice 0001 Date 11 Aug 2018 Customer ABC - ABC Company

Product Options Quantity Price............ Extended...
WIDGET GOLD 2 $115.95 $231.90
GIDGET RED 1 $220.75 $220.75
SMALL
========================================================
Subtotal: $452.65
Taxes
Totals

When looking at the document, you can see that you have header data that occurs just once per document, including the invoice number, date, and customer information. You would look up the customer's name from the customer master file. You also have computed or calculated fields, like extended costs, subtotals, taxes and final totals.

The interesting thing to note is the invoice lines data. In a real invoice all the lines, an set of arbitrary size, is part and parcel of, and contained in, the invoice document. For a single invoice you could have an arbitrary number of products being purchased. for each you would have a quantity and a price. In our case, we further complicated it by allowing different products to have different options. You might choose a colour and a size as with the gidget we had on the second invoice line. At this point we not only have multiple values, but one of the values (the OPTIONS value for the GIDGET line) has two subvalues, one for the colour and one for the size.

Don and Dick came up with a database structure that allowed that. Your fields were delimited by field or attribute mark delimiters. Your multiple values were delimited by value mark delimiters and your subvalues were delimited by subvalue mark delimiters. The data for the above INVOICE file might look like this:

Note that the blocks are the field delimiters. the superscripted '2' is a value mark, and the superscripted 'n' is a subvalue mark. These are high-ASCII delimiters. Going left to right, we have the first field, the item-id (similar to a primary key) which is the invoice number '00001'. Then you have a date. This is represented as number of days since Dec 31, 1967. Next we have the customer id 'ABC' for ABC Company. Then we get a field that has two values. This field and all the subsequent fields have two values, the first one is line 1 of the detail lines and the second is line 2 of the detail lines. This first multivalued field has the product ids of the products being bought.The next one has the options selected for each line. You can see that for detail line 1 there is only one option, but for the 2nd detail line (the GIDGET), we have two options: The colour RED and the size SMALL. Then we have the quantities of each of these and the unit price for each line. The price has an implied 2 decimals (it is storing the value as cents, not decimal dollars).

For a programmer, the invoice number gave you the whole document. You might need to do ancilliary lookups in the CUSTOMER, PRODUCTS and OPTIONS tables, but a single read and a single write gets you your data and saves it to the database.

In a relational database, the INVOICE file would need to be a minimum of 3 tables. The record represented above would take a minimum of 6 disk reads to get the same data. As you increase the number of detail lines and the number of product options, the complexity grows rapidly. A programmer in this case needs to retrieve the data from the database and make sure he gets it all. Then he needs to properly relate the data sets in his program. The complexity is huge compared with the multivalued approach.

So, in relational terms, you would probably have these tables: INVOICE_HEADER, INVOICE_LINES, INVOICE_LINES, INVOICE_LINES_OPTIONS.

Think of how this would work if you stored your physical invoices like this. You'd go to the filing cabinet with header information and look up the header part of the invoice. Then you'd go to another filing cabinet where you'd have to find a separate page for each invoice line. You'd use the invoice number and line number. You'd keep looking until you couldn't find another line number for that invoice. Then you'd go to yet another filing cabinet, and for each invoice line for that invoice, you'd look for one or more pages that had information for the options (possibly multiple) for those invoice lines. I've seen invoices that had hundreds of lines in them. Can you imagine then having to organize and manage those hundreds of pages? Instead of one invoice that was maybe 3 pages long?

There is also a performance component to this. I remember working with a customer who had moved from a mainframe to a minicomputer platform, and had determined that they could not afford the disk head movement of a relational database. They were a big customer of our ODBC technology as we enabled them to have a relation view of the data and use it with tools that required this, while still giving them the reduced disk head movement for their core application.

How Did You Access That Data?

Pick had a concept called a dictionary. You would have a file that contained field definition records associated with every file. It was completely optional, and once the BASIC language came out for Pick, it didn't enforce anything in it. In fact, you could have two dictionaries that defined the same attribute completely differently. Generally, one would be wrong and the other right, but I've seen where a single file has multiple different record types, based on a prefix in the key. Which dictionary you wanted to use depended on the record key!

You could create a dictionary definition item that actually used a value from the current record to read a value from another file. This was called a translate, and was very powerful. For the example index above, you would take the customer id (let's say it's attribute 2) and use it to pull attribute 1 (customer name) from the CUSTOMERS file. That translate correlative would look like this:

001 A
002 2
003 CUSTOMER NAME
...
007 TCUSTOMERS;X;1;1
...

Similarly, you could use this to pull information from a PRODUCTS file.

When these systems were first commercially sold, salespeople would walk a machine in from a panel van (they were the size of a fridge, so this was quite a feat.) Then they'd use it to create a couple of tables, use the editor to create some dictionary definitions. Create a script program using PROC and BATCH to enter data. Then they'd use a LIST command to print out the data.

So the command "LIST INVOICES" would be automatically expanded out to:
LIST INVOICES INVOICE.DATE CUSTOMER.ID PRODUCT.ID OPTIONS QUANTITY PRICE

which would give us this output:

It was not uncommon for this to be so impressive for the business users watching the demo that a cheque would be written and the computer left there. Then they had to get programmers in and write a system... but it was also not uncommon for a business person to dictate what the system had to do and create his own reports. Compared with the options available at that time, this was a huge step forward, and many commercial systems that are still in use today had their start with a business owner looking over a programmer's shoulders. This was the first instance of agile and pair programming!

Later Add-ons

The first systems were written using a combination of PROC (a scripting language), BATCH (which was horrible, and impossible to make pretty - totally character based), and when you got stuck, Assembler.

Later, a BASIC implementation was added with built-in support for the multivalued database, and that allowed you to use extended features of an ASCII terminal. These databases still exist and go by a number of names:

D3
Universe
Unidata
mvBase
mvEnterprise
jBase
OpenInsight
QM and OpenQM
RealityX
And others

Relational Value and Pushback

An interesting thing to note is that many applications are being written today to use what are called NoSQL databases. The rationale for these databases sounds like an ad for Pick. While relational databases have their uses, there are clearly applications where they are just not the right solution.

That said, a huge amount of investment has been made by companies like Oracle, IBM, Microsoft, Cognos, Business Objects and others into enhancements and tools that leverage relational databases. This has not been the case with Pick, so there are many applications where it is imperative that the regardless where your data originates, you need to get it into relational in order to leverage these tools and technologies.

The industry is clearly seeing a divide form between SQL and NoSQL databases, yet there is also growing clarity about when you should use each. What's not as clear is that there is a NoSQL option that has already existed for a long time and that is widely in use, that's another option.

Sunday, July 1, 2018

ODBC for PICK - Why You Shouldn't

Fundamental Mismatches

There were a number of very key, fundamental mismatches between PICK databases, and the ODBC model. Enough in number and severity that many would say it's impossible, or "don't even try", but the group of pioneers that assailed this opportunity were not deterred by a few minor challenges. Here are some of the things that we had to address.

Two Main Parts of ODBC

ODBC consisted of two main components. One was a SQL syntax, the other was an implementation of a Call Level Interface (CLI), which is much the same as an Application Programming Interface (API). As an implementer of a driver, you could declare yourself as supporting one of three levels of each. In our case we implemented the middle level of both.

Structured Query Language (SQL)

This was the biggest one, and I'll address it better in a separate post, but will touch on it here. ODBC assumed SQL. While there were ways to bypass that, most tools that worked with ODBC would assume a certain level of SQL, and if you reported that you didn't support it, they'd just give up. So, if you wanted to support ODBC, you had no choice. You had to support SQL. At the time that ODBC was released, while it was gaining popularity rapidly, both ISAM and Multivalued databases still outnumbered SQL databases.

SQL demanded certain norms, including that it was strongly typed and that it abhorred multivalues. Multivalue databases like PICK and MUMPS (which was the other popular multivalued system, especially in health care) were, as the name implies, designed to work with multivalues. Again, I'll cover this in more detail elsewhere. In short, the data models supported by each were very widely different.

Strong vs. Weak Data Typing

In PICK, everything is a string. If anyone is familiar with XML, there are some definite similarities. In XML you can have numbers (but they are simply strings containing numeric characters), and you can even have a schema that declares a certain XML element to be numeric, but in the end, if you don't enforce the schema in code, you can put what you want in that XML element.

PICK had similar structures. In PICK, much like an XML schema, you had dictionaries. It was quite common over time, for PICK dictionaries to collect garbage. You'd have several conflicting dictionaries pointing at the same attribute (the PICK name for a field), and only one of them was really correct, or perhaps none of them were correct! And it was not uncommon for a file's dictionary level to be empty. Dictionaries were not enforced, but were useful for doing LIST, SORT, SELECT or SSELECT commands. LIST and SORT were for creating user reports and SELECT or SSELECT were for activating a list of unique primary keys (called item-ids in PICK terminology.)

The CLI portion of ODBC, meanwhile, was designed for programming languages designed to produce machine code, and perform machine-level calls (hence "Call Level Interface"). This included Assembler, C, C++, and COBOL among other languages. These languages were all strongly typed. You would declare a variable as being a string (null terminated - no 16-bit Unicode in the early days - that came later) or an integer. The CLI had its own data types and would even declare the size of the storage for an integer. The CLI had details of the sizing and precision of each data type, which you had to map to your machine's (or compiler's) data types. This guaranteed that regardless whether your machine called a 32-bit integer a small int or a long int, you knew how big the data coming back from, or going into, the CLI had to be, and therefore could allocate enough space for it. This allowed for interoperability between machine types, which was a key requirement for the standard.

Multi-threaded vs. Single User

PICK was a legacy system, originally built on the premise that a user was connecting to an expensive user license through a single, predefined serial port that ran through the building right to their terminal (or PC with a terminal emulator, like PK Harmony or Wintegrate.) There was no multi-tasking and definitely no multi-threading in that system.

While ODBC didn't require you to support multi-threading, most applications using it expected to open multiple connections to do their work. Now, if you had two serial ports, you could open two connections, but this cost a lot, and required you to run another serial cable to your PC. Even when we went to networked connections, the license cost for an extra PICK user was prohibitive, so this was another issue we needed to resolve.

Authentication and Authorization

In a pre-public-internet, pre-malicious-hacker age, we were very much aware of the security issues that we were likely to raise, and took these very seriously. Although it was only a reality for a small number of users, network connectivity was already a reality for some, and we knew it was only a matter of time before it became the norm. We already had people dialing up systems over the telephone network, and I had seen people using dial-up, Prime-Net and then Internet (pre-commercial), and get connected to the wrong Prime Information system. This system was a NORAD system and had no password on the SYSPROG account (the PICK equivalent of root!) We logged off very quickly!

PICK's authentication and authorization model at that time was very weak and not designed for a broad network-connected world.

Performance

We were very much aware that we were likely to run into interesting performance problems for a number of reasons:

Data Architecture Mismatch

We were forcing round pegs into square holes. There were fundamental differences in how SQL and Multivalue databases architected their data and trying fool PICK into thinking it was SQL was likely to present some interesting challenges for performance.

Data transfer speed

When we started this, most PICK systems still used Serial I/O for their connectivity. There were solutions from Netware that allowed you to use the network, but the last bit of the connection was through a serial port concentrator, so you were still limited to the maximum speed of the serial I/O. This was exacerbated by the fact that most PICK I/O channels designed for terminals would fall over if you burst data into them too quickly.

Challenge Accepted!

I've never found a challenge like this to be discouraging, but rather invigorating, and I've been blessed to work with many people who share that enthusiasm for solving difficult problems in elegant ways, so we got to work addressing the issues. Challenge Accepted!