Structured Query Language/NULLs and the Three Valued Logic

The Problem

edit

As mentioned in a previous chapter of this wikibook and in wikipedia sometimes there is no value in a column of a row, or - to say it the other way round - the column stores the NULL marker (a flag to indicate the absence of any data), or - to use the notion of the SQL standard - the column stores the NULL value. This NULL marker is very different from the numeric value zero or a string with a length of zero characters! Typically it occurs when an application yet hasn't stored anything in the column of this row.

(A hint to Oracle users: For Oracle, the NULL marker is identical to a string of zero characters.)

The existence of the NULL marker introduces a new fundamental problem. In the usual boolean logic, there are the two logical values TRUE and FALSE. Every comparison evaluates to one of the two - and the comparisons negation evaluates to the opposite one. If a comparison evaluates to TRUE, its negation evaluates to FALSE and vice versa. As an example, in the usual boolean logic, one of the following two comparisons is TRUE, and the other one is FALSE: 'x < 5', 'x >= 5'.

Imagine now the new situation that x holds the NULL marker. It is not feasible that 'NULL < 5' is true (1). But can we say 'NULL < 5' is false (2) and its negation 'NULL >= 5' is true (3)? Is (3) more feasible than (1)? Of course not. (1) and (3) have the same 'degree of truth', so they shall evaluate to the same value. And this value must be different from TRUE and FALSE.

Therefore the usual boolean logic is extended by a third logic value. It is named UNKNOWN. All comparisons to the NULL marker results per definition in this new value. And the well-known statement 'if a statement is true, its negation is false' gets lost because there is a third option.

SQL's logic is an implementation of this so-called trivalent, ternary or three-valued logic (3VL). The existence of the NULL marker in SQL is not without controversy. But if NULLs are accepted, the 3VL is a necessity.

This page proceeds in two stages: First, it explains the handling of NULLs concerning comparisons, grouping, etc. . Second, it explains the boolean logic for the cases where the new value UNKNOWN interacts with any other boolean value - including itself.

Example Table

edit

To demonstrate NULL behaviors, we define an example tables: t1, and t2.

CREATE TABLE t1 (
  id    DECIMAL PRIMARY KEY,
  col_1 DECIMAL,
  col_2 VARCHAR(20),
  col_3 DECIMAL
);

INSERT INTO t1 VALUES ( 1,  1,    'Hello World', 1);
INSERT INTO t1 VALUES ( 2,  2,    NULL,          2);
INSERT INTO t1 VALUES ( 3,  3,    'Hello World', NULL);
INSERT INTO t1 VALUES ( 4,  4,    'Hello World', NULL);
INSERT INTO t1 VALUES ( 5,  5,    'Hello Their', NULL);
INSERT INTO t1 VALUES ( 6,  NULL, 'Hello World', NULL);
INSERT INTO t1 VALUES ( 7,  NULL, 'Hello World', NULL);
INSERT INTO t1 VALUES ( 8,  8,    'Hello World', NULL);
INSERT INTO t1 VALUES ( 18, 18,   'Hello World', NULL);

CREATE TABLE t2 (
  id DECIMAL PRIMARY KEY,
  col_x DECIMAL
);

INSERT INTO t2 VALUES ( 1, 1 );
INSERT INTO t2 VALUES ( 2, NULL );
INSERT INTO t2 VALUES ( 3, 3 );
INSERT INTO t2 VALUES ( 4, 4 );
INSERT INTO t2 VALUES ( 5, 5 );
INSERT INTO t2 VALUES ( 18, 18 );
COMMIT;


Step 1: Evaluation of NULLs

edit

Comparison Predicates, IS NULL Predicate

edit

SQL knows the six comparison predicates <, <=, =, >=, > and <> (unequal). Their main purpose is the arithmetic comparison of numeric values. Each of them needs two variables or constants (infix notation). This implies that it is possible that one or even both operands hold the NULL marker. As stated before, the common and very simple rule is: "All comparisons to the NULL marker results per definition in this new value (unknown).". Here are some examples:

  • NULL = 5 evaluates to UNKNOWN.
  • 5 = NULL evaluates to UNKNOWN.
  • NULL <= 5 evaluates to UNKNOWN.
  • col_1 = 5 evaluates to UNKNOWN for rows where col_1 holds the NULL marker.
  • col_1 = col_2 evaluates to UNKNOWN for rows where col_1 or col_2 holds the NULL marker.
  • NULL = NULL evaluates to UNKNOWN.
  • col_1 = col_2 evaluates to UNKNOWN for rows where col_1 and col_2 holds the NULL marker.

The WHERE clause returns such rows where it evaluates to TRUE. It does not return rows where it evaluates to FALSE or to UNKNOWN. In consequence, it is not guaranteed that the following SELECT will return the complete table t1:

-- This SELECT will not return such rows where col_1 holds the NULL marker.
SELECT *
FROM   t1
WHERE  col_1 > 5
OR     col_1 = 5
OR     col_1 < 5;

Of course, there are use cases where rows with the NULL marker must be retrieved. Because the arithmetic comparisons are not able to do so, another language construct must do the job. It is the IS NULL predicate.

-- This SELECT will return exactly these rows where col_1 holds the NULL marker.
SELECT *
FROM   t1
WHERE  col_1 IS NULL;

Other Predicates

edit

For the other predicates, there is no simple rule of thumb. They must be explained one after the other.

The IN predicate is a shortcut for a sequence of OR operations:

Only the two comparisons 'col_1 = 3' and 'col_1 = 18' are able to retrieve rows (possibly many rows). The comparison 'col_1 = NULL' will never evaluate to TRUE. It's always UNKNOWN, even if col_1 holds the NULL marker. To retrieve those rows, it's necessary - as shown above - to use the 'IS NULL' predicate.

-- Shortcut for: col_1 = 3 OR col_1 = 18 OR col_1 = NULL
SELECT *
FROM   t1
WHERE  col_1 IN (3, 18, NULL);  -- the NULL case will never hit with the IN predicate!

This is a little more complex. This will only return 1, 3, and 4, the items that don't have NULL in t2.col_x or t1.col_1.

SELECT *
FROM   t1
WHERE  col_1 IN (SELECT col_x FROM t2 WHERE id < 10);

The subselect of an EXISTS predicate evaluates to TRUE if the cardinality of the retrieved rows is greater than 0, and to FALSE if the cardinality is 0. It is not possible that the UNKNOWN value occurs.

-- The subselect to t2 can hit some rows - or not. If there are hits in the subselect, ALL rows of t1
-- are returned, else no rows of t1 are returned.
SELECT *                                 -- The select to table t1
FROM   t1
WHERE  EXISTS
       (SELECT * FROM t2 WHERE id < 10); -- The subselect to table t2

The LIKE predicate compares a column with a regular expression. If the column contains the NULL marker, the LIKE predicate returns the UNKNOWN value, which means that the row is not retrieved.

-- The LIKE retrieves only rows where col_2 matches the WHERE statement and col_2 is not NULL.
SELECT *
FROM   t1
WHERE  col_2 LIKE 'Hello %';

Predefined Functions

edit

The aggregate functions COUNT(<column_name>), MIN(<column_name>), MAX(<column_name>), SUM(<column_name>) and AVG(<column_name>) ignores such rows where <column_name> contains the NULL marker. On the other hand COUNT(*) includes all rows.

If a parameter of one of the scalar functions like UPPER(), TRIM(), CONCAT(), ABS(), SQRT(), ... contains the NULL marker the resulting value is - in the most cases - the NULL marker.

Grouping

edit

There are some situations where column values are compared to each other to answer the question, whether they are distinct. For usual numbers and strings, the result of such decisions is obvious. But how shall the DBMS handle NULL markers? Are they distinct from each other, are they equal to each other or is there no answer to this question at all? To get results, which are expected by (nearly) every end-user, the standard defines "Two null values are not distinct.", they build a single group.

SELECT DISTINCT col_1 FROM t1; retrieves one and only row for all rows where col_1 holds the NULL marker.

... GROUP BY col_1 ...; builds one and only one group for all rows where col_1 holds the NULL marker.

Step 2: Boolean Operations within three-valued logic (3VL)

edit

After we have seen how various comparisons and predicates on the NULL marker produce TRUE, FALSE, and UNKNOWN, it's necessary to explain the rules for the new logic value UNKNOWN.

Inspection

edit

A first elementary operation is the inspection of a truth value: is it TRUE, FALSE or UNKNOWN? Analogous to the IS NULL predicate there are three additional predicates:

  • IS [NOT] TRUE
  • IS [NOT] FALSE
  • IS [NOT] UNKNOWN
-- Check for 'UNKNOWN'
SELECT *
FROM   t1
WHERE  (col_1 = col_3) IS UNKNOWN;  -- parenthesis are not necessary

-- ... is semantically equivalent to
SELECT *
FROM   t1
WHERE  col_1 IS NULL 
OR     col_3 IS NULL;

In the abstract syntax of logical systems p shall represent any of its truth values. Here is the three-valued logic truth table:

p IS TRUE IS FALSE IS UNKNOWN   IS NOT TRUE IS NOT FALSE IS NOT UNKNOWN
TRUE TRUE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE TRUE FALSE TRUE FALSE TRUE
UNKNOWN FALSE FALSE TRUE TRUE TRUE FALSE

All predicates lead to TRUE or FALSE and never to UNKNOWN.

The next operation is the negation of the new value. Which values evaluate to 'NOT UNKNOWN'? The UNKNOWN value represents the impossibility to decide between TRUE and FALSE. It is not feasible that the negation of this impossibility leads to TRUE or FALSE. Likewise, it is UNKNOWN.

-- Which rows will match?  (1)
SELECT *
FROM   t1
WHERE  NOT col_2 = NULL; -- 'col_2 = NULL' evaluates to UNKNOWN in all cases, see above.

-- Is this SELECT equivalent to the first one?  (2)
SELECT *
FROM   t1
  EXCEPT
SELECT *
FROM   t1
WHERE  col_2 = NULL;

-- No, it's different!! Independent from NULL markers in col_2, (1) retrieves
-- absolutely NO row and (2) retrieves ALL rows.

The above SELECT (1) will retrieve no rows as 'NOT col_2 = NULL' evaluates to the same as 'col_2 = NULL', namely UNKNOWN. And the SELECT (2) will retrieve all rows, as the part after EXCEPT will retrieve no rows, hence only the part before EXCEPT is relevant.

In the abstract syntax of logical systems p shall represent any of its truth values and NOT p its negation. Herein the following table applies:

p NOT p
TRUE FALSE
FALSE TRUE
UNKNOWN UNKNOWN

AND, OR

edit

There are the two binary operations AND and OR. They evaluate as follows:

p q p AND q p OR q
TRUE TRUE TRUE TRUE
TRUE FALSE FALSE TRUE
TRUE UNKNOWN UNKNOWN TRUE
FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE
FALSE UNKNOWN FALSE UNKNOWN
UNKNOWN TRUE UNKNOWN TRUE
UNKNOWN FALSE FALSE UNKNOWN
UNKNOWN UNKNOWN UNKNOWN UNKNOWN

The precedence of the operations is defined as usual: IS predicate, NOT, AND, OR.

Some Examples

edit
--
-- Add a new row to the test data base
INSERT INTO person (id, firstname, lastname)  -- Omit some columns to generate NULL markers
VALUES             (99, 'Tommy',   'Test');
COMMIT;

SELECT *
FROM   person
-- focus all tests to the new row
WHERE  id = 99                  -- (1): TRUE 
AND                             -- (3): (1) AND (2) ==> TRUE AND UNKNOWN ==> UNKNOWN
(
       date_of_birth = NULL     -- (2): UNKNOWN
);                              -- no hit

SELECT *
FROM   person
WHERE  id = 99                  -- (1): TRUE 
AND                             -- (3): (1) AND (2) ==> TRUE AND TRUE ==> TRUE
(
       date_of_birth IS NULL    -- (2): TRUE
);                              -- hit

SELECT *
FROM   person
WHERE  id = 99                  -- (1): TRUE 
OR                              -- (3): (1) OR (2) ==> TRUE OR UNKNOWN ==> TRUE
(
       date_of_birth = NULL     -- (2): UNKNOWN
);                              -- hit

SELECT *
FROM   person
WHERE  id = 99                  -- (1): TRUE 
AND                             -- (4): (1) AND (3) ==> TRUE AND FALSE ==> FALSE
(
NOT                             -- (3): NOT (2) ==> NOT TRUE ==> FALSE
       date_of_birth IS NULL    -- (2): TRUE

);                              -- no hit (same as AND date_of_birth IS NOT NULL)

-- Clean up the test database
DELETE FROM person WHERE id = 99;
DROP TABLE IF EXISTS t1, t2;     
COMMIT;