Saturday, October 13, 2018

Adventures in Mapping Multivalues - Joins

What's a Join?

Relational databases are built on a mathematical principal called Set Theory.  To help us along, let's look at two sets. 

We'll call set A the PRODUCTS set.  Each element of the set contains the following properties:
PRODUCT_ID - always numeric, 1-4 digits
PRODUCT_NAME - text, up to 100 characters
COLOUR_ID - always numeric, 1-4 digits, may be empty

Note that not all products come in multiple colours, so the PRODUCT_COLOUR_CODE field may be empty in those cases. Most SQL implementations treat an empty value in a string field as a NULL. I explain what that means later.

We'll call set B the COLOURS set.  Each element of the set contains the following properties:
COLOUR_ID - always numeric, 1-4 digits
COLOUR_NAME - text, up to 100 characters

So let's look at the elements of our PRODUCTS set (I've shortened the column names to fit the blog):

P_ID P_NAME................. C_ID
   1 Widget                     1
   2 Gidget                     2
   3 Gadget                     1
   4 Thing-a-ma-bob             4
   5 What-cha-ma-call-it     NULL (not assigned)
   6 Whatever                   3

Here are the elements of our COLOURS set:

C_ID C_NAME..............
   1 Red
   2 Green
   3 Purple
   4 Blue

In the purest sense, a join is a super-set made up of combining two sets into one new set.  Let's look at the "simplest" example:

Full Cartesian Joins

The simplest join of all is a full Cartesian join.  While it's simple, it's very counter-intuitive for humans. Here is what a Cartesian join of the two sets above would look like:

P_ID P_NAME.............. P.C_ID C.C_ID C_NAME
   1 Widget                    1      1 Red
   1 Widget                    1      2 Green
   1 Widget                    1      3 Purple
   1 Widget                    1      4 Blue
   2 Gidget                    2      1 Red
   2 Gidget                    2      2 Green
   2 Gidget                    2      3 Purple
   2 Gidget                    2      4 Blue
...

Wait!!! What's happening here? Well, if you don't constrain a join, it will create a new set in which every member of set A gets paired up with every member of set B. In our case above, the result is 24 rows.  As stated earlier, while the logic is very simple, it's very counter-intuitive. Note that I've added the prefix P. or C. to the front of the C_ID columns to distinguish which one came from the PRODUCTS or COLOURS sets.

While the is the default processing, it is almost never what you want to do in real commercial systems. In SQL terms you would get this from the following syntax:

SELECT * FROM PRODUCTS, COLOURS

What this says in SQL terms is "Create a new set (the result set), containing all properties and all combinations of elements, without constraint, of the PRODUCTS and COLOURS sets".

Inner Joins

As a human being, I look at the two sample sets (table) above, and realize immediately, that I only want to combine elements from the COLOURS set in my superset when the COLOUR_ID property in the PRODUCTS set matches the COLOUR_ID property in the COLOURS set.  This is called an Inner Join.  There are two ways to accomplish this with SQL. The following two SQL commands product the identical result set:

SELECT * FROM PRODUCTS P, COLOURS C WHERE P.COLOUR_ID = C.COLOUR_ID

SELECT * FROM PRODUCTS P INNER JOIN COLOURS C ON P.COLOUR_ID = C.COLOUR_ID

The key thing is that we are constraining the result set to be only those combinations with matching COLOUR_ID properties in both sets.

For our example this produces the following result set:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
   1 Widget                     1      1 Red
   2 Gidget                     2      2 Green
   3 Gadget                     1      1 Red
   4 Thing-a-ma-bob             4      4 Blue
   6 Whatever                   3      3 Purple

Hmmm.  What happened to the element of the PRODUCTS set that had PRODUCT_ID 5?  Well, given our join condition, there was no element of the COLOURS set whose COLOUR_ID property matched the empty COLOUR_ID value of the PRODUCTS table element, so it didn't fit our criteria and got dropped. While we're getting closer, that's still not quite what we were looking for!  This brings us to...

Left Outer Joins

What you really want is this syntax:

SELECT * FROM PRODUCTS P LEFT OUTER JOIN COLOURS C ON P.COLOUR_ID = C.COLOUR_ID

What this is saying is that, reading your sets (tables) from left to right in the SQL command, the result set should always include all elements of the left-most set (in this case PRODUCTS), even if there isn't a matching element in the sets to the right (COLOURS). In this case you get this result set:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
   1 Widget                     1      1 Red
   2 Gidget                     2      2 Green
   3 Gadget                     1      1 Red
   4 Thing-a-ma-bob             4      4 Blue
   5 What-cha-ma-call-it
   6 Whatever                   3      3 Purple

I'm going to display it as most raw SQL outputs would display it:

P_ID P_NAME............... P.C_ID C.C_ID C_NAME
   1 Widget                     1      1 Red
   2 Gidget                     2      2 Green
   3 Gadget                     1      1 Red
   4 Thing-a-ma-bob             4      4 Blue
   5 What-cha-ma-call-it    NULL  NULL   NULL
   6 Whatever                   3      3 Purple

NULLs are Nasty!

NULLs are not the same as an empty string.  They have nasty implications. Think of NULL as an "unassigned" or "unknown" value. 

One artefact of set theory is tertiary logic. Lets think of phone numbers. If I have a customer table where I track phone numbers, there are actually three possibilities for any customer.
  1. They have a phone number and I have captured it
  2. They don't have a phone number and I know that for sure
  3. I have no idea if they have a phone number or not, and/or don't know what it is
Note the the syntax "NULL = NULL" technically returns false. I may not know your phone number and I may not know your best friend's phone number, so I use the NULL value to indicate that.  That does not make them the same. Unless you two are room-mates, the two numbers are almost certainly NOT the same. That's why NULL = NULL returns FALSE in most SQL databases.

In the case of our PRODUCTS table, it may be desirable to pull back all PRODUCTS that don't define colours.  In those cases, all those NULLs really define the same state. It's like the N/A value.  Doesn't apply and all "doesn't applies" are the same when it comes to colours.  NULL causes more problems than it solves, but because it's an important part of set theory, it became entrenched in all relational databases, for better or worse.

How PICK Handles This

In the Multivalued PICK world, we do not bother ourselves with set theory. It's nice if you want to apply mathematics to your data, but even there, the only math is really assists you with is set theory.

In the PICK world, you'd add a dictionary definition record to the dictionary of the PRODUCTS file and use a translate correlative code, "TCOLOURS;X;1;1" to pull the name of the COLOUR in, using the COLOUR_ID field on the PRODUCTS table. the "X" in that correlatives is a code that says, "If there's no item in the target table (COLOURS) that contains that id, return an empty string."  The result would look something like this:

P_ID P_NAME............... C_ID.. C_NAME
   1 Widget                     1 Red
   2 Gidget                     2 Green
   3 Gadget                     1 Red
   4 Thing-a-ma-bob             4 Blue
   5 What-cha-ma-call-it
   6 Whatever                   3 Purple

No NULL values exist in the PICK world. You would have to come up with a special value for "unknown" or "N/A" and handle it explicitly. That can be done and would be far more obvious and intuitive than SQL's NULL value.

Dick Pick and Don Nelson designed their database to store its data as a sparsely populated string, called a dynamic array.  Empty strings were quite common and very efficient. Empty strings were often called NULL, but did not have the same meaning as in SQL.

How Did Liberty ODBC Handle This?

PICK users were (and in some cases still are) a very unique breed. They were typically business people, with no computer science training, who found the PICK system so intuitive that they could create their own queries, and in many cases help design their own databases.  This sometimes caused problems because with their lack of computer science background, they'd do things that would cause their system to grind to a halt.

As we started moving these people into the ODBC world, they would try to pull data into Excel, and things would go sideways quickly.

One of the problems was that they would do a full Cartesian join, assuming that the SQL processor would be "smart enough" to figure out the obvious join conditions.  Our very tiny, simple example above produced 24 rows.  Let's imagine that you have your largest table, with 1 million rows of data, and you join it to a parts table with 10 thousand rows.  1 million times 10 thousand is 10 billion rows.  The result set will be 10 thousand times bigger than your biggest table. This likely results in two undesirable problems:
  1. It will take forever to come back with results
  2. You will likely run out of disk space and crash your system

We added logic to our SQL processor to detect a Cartesian join. If we saw one, we'd check a system parameter called "allow_full_cartesian_joins". The default value was FALSE.  A user could change the setting. If they did it using our admin tool, we'd put up a big warning and give the the option to cancel out or proceed.  This alone solved a lot of support problems.

The other thing we did was to ALWAYS teach our users what a LEFT OUTER JOIN was and why the ALMOST ALWAYS wanted it!

And lastly, we would train our users to understand what a SQL NULL was and what the pitfalls were, and that when they transferred an empty string to a SQL VARCHAR field, it would become a nasty NULL, and would often behave in ways that were counter-intuitive.

No comments:

Post a Comment