Structured Query Language/SELECT: Set Operations

Structured Query Language
SELECT: Set Operations

Tables, views, and results of SELECT commands are in somewhat similar to sets of set theory. In this comparison the elements of sets correspond to rows of tables, views, and SELECT results. The differences between set theory and the itemized SQL constructs are:

Sets of set theory do not allow duplicates, whereas SQL allows duplicates. (Even different rows of one table may be identical as there is no duty to use the concept of primary keys.) In the following, we use the term multiset when we speak about sets in SQL where duplicates are possible.
Sets of set theory and multisets are not ordered. But for the result of a SELECT command, we can enforce an ordering through the optional ORDER BY clause.

The comparison between set theory and SQL goes even further. In SQL, we have operations which act on multisets in the sense of set theory: The SQL operations UNION, INTERSECT, and EXCEPT (some name it MINUS) process intermediate multisets generated by differents SELECT commands. The operations expect the multisets are of the same type. This means mainly that they must have the same number of columns. Also, their data type should correlate, but this is not mandatory. If they do not, the DBMS will cast them to a common data type - if possible.

UNION

The UNION operation pushes the result of several SELECT commands together. The result of the UNION contains those values, which are in the first or the second intermediate result.

-- Please consider that this is only one command (only ONE semicolon at the very end)
SELECT firstname  -- first SELECT command
FROM   person
  UNION           -- push both intermediate results together to one result
SELECT lastname   -- second SELECT command
FROM   person;

The UNION of two intermediate results

This is a single SQL command. It consists of two SELECTs and one UNION operation. The SELECTs are evaluated first. Afterward, their results are pushed together to one single result. In our example, the result contains all lastnames and firstnames in a single column (our example may be of limited help in praxis, it's only a demonstration for the UNION).

DISTINCT / ALL
If we examine the result carefully, we will notice that it consists only of 17 values. The table person contains ten rows so that we probably expect twenty values in the result. If we perform the 'SELECT firstname ...' and 'SELECT lastname ...' as separate commands without the UNION, we receive for both commands ten values. The explanation for the three missing values is the UNION command. By default, UNION removes duplicates. Therefore some of the intermediate values are skipped. If we want to obtain the duplicate values we have to modify the UNION operation. Its behavior can be changed with one of the two keywords DISTINCT or ALL. DISTINCT is the default, and it removes duplicate values as we have seen before. ALL will retain all values, including duplicates.

-- remove (that's the default) or keep duplicates
SELECT ...
  UNION [DISTINCT | ALL]
SELECT ...
[ -- it is possible to 'UNION' more than 2 intermediate results
  UNION [DISTINCT | ALL]
SELECT ...
];

A hint for Oracle users: The use of the keyword DISTINCT, which is the default, is not accepted by Oracle. Omit it.

General hint
In most cases, the UNION combines SELECT commands on different tables or different columns of the same table. SELECT commands on the same column of a single table usually use the WHERE clause in combination with boolean logic.

-- A very unusual example. People apply such queries on the same table only in combination with very complex WHERE conditions.
-- This example would normally be expressed with a phrasing similar to: WHERE lastname IN ('de Winter', 'Goldstein');
SELECT *
FROM   person
WHERE  lastname = 'de Winter'
  UNION ALL
SELECT *
FROM   person
WHERE  lastname = 'Goldstein';

INTERSECT

INTERSECT

The INTERSECT operation evaluates to those values, which are in both intermediate results, in the first as well as in the second.

-- As in our example database, there is no example for the INTERSECT we insert a new person.
-- This person has the same last name 'Victor' as the first name of another person.
INSERT INTO person VALUES (21, 'Paul', 'Victor', DATE'1966-04-02', 'Washington', '078-05-1121', 66);
COMMIT;
-- All firstnames which are used as lastname.
SELECT firstname  -- first SELECT command
FROM   person
  INTERSECT       -- looking for common values
SELECT lastname   -- second SELECT command
FROM   person;

A hint to MySQL users: MySQL (5.5) does not support INTERSECT operation. But as it is not an elementary operation, there are workarounds.

EXCEPT

EXCEPT

The EXCEPT operation evaluates to those values, which are in the first intermediate result but not in the second.

-- All firstname except for 'Victor', because there is a lastname with this value.
SELECT firstname  -- first SELECT command
FROM   person
  EXCEPT          -- are there values in the result of the first SELECT but not of second?
SELECT lastname   -- second SELECT command
FROM   person;

A hint to MySQL users: MySQL (5.5) does not support the EXCEPT operation. But as it is not an elementary operation, there are workarounds.
A hint to Oracle users: Oracle uses the keyword MINUS instead of EXCEPT.

-- Clean up the example database
DELETE FROM person WHERE id > 10;
COMMIT;

Order By

We can combine set operations with all other elements of SELECT command, in particular with ORDER BY and GROUP BY. But this may lead to some uncertainties. Therefore, we would like to explain some of the details below.

SELECT firstname  -- first SELECT command
FROM   person
  UNION           -- push both intermediate results together to one result
SELECT lastname   -- second SELECT command
FROM   person
ORDER BY firstname;

To which part of the command belongs the ORDER BY? To the first SELECT, the second SELECT, or the result of the UNION? The SQL rules determine that set operations are evaluated before ORDER BY clauses (as always parenthesis can change the order of evaluation). Therefore the ORDER BY sorts the final result and not any of the intermediate results.

We rearrange the example in the hope that things get clear.

-- Equivalent semantic
SELECT * FROM
  (SELECT firstname  -- first SELECT command
   FROM   person
     UNION           -- push both intermediate (unnamed) results together to the next intermediate result 't'
   SELECT lastname   -- second SELECT command
   FROM   person
  ) t                -- 't' is the name for the intermediate result generated by UNION
ORDER BY t.firstname;

First, the two SELECTS are evaluated, afterward the UNION. This intermediate result gets the name 't'. 't' is ordered.

Often one would like the rows from the first SELECT to be ordered independent from the rows of the second SELECT. We can do this by adding a virtual column to the result of each SELECT statement and using the virtual columns in the ORDER BY.

SELECT '1' as dummy, firstname
FROM   person
  UNION
SELECT '2', lastname
FROM   person
ORDER BY dummy, firstname;

Group By

With the GROUP BY clause, things are a little more complicated than with ORDER BY. The GROUP BY refers to the last SELECT or - to say it the other way round - to the SELECT of its direct level.

-- Will not work because the GROUP BY belongs to the second SELECT and not to the UNION!
SELECT firstname
FROM   person
  UNION
SELECT lastname
FROM   person
GROUP BY firstname;
-- 
-- Works, but possibly not what you want to do.
-- The alias name for the (only) column of the UNION is 'firstname'.
SELECT firstname
FROM   person
  UNION
-- We group over the (only) column of the second SELECT, which is 'lastname' and results in 7 values
SELECT lastname
FROM   person
GROUP BY lastname;
--
-- Make things clear: rearrange the query to group over the final result 
SELECT * FROM
  (SELECT firstname  -- columnnames of the first SELECT determins the columnnames of the UNION
   FROM   person
     UNION
   SELECT lastname
   FROM   person
  ) t
GROUP BY t.firstname; -- now we can group over the complete result

Exercises

Show the lowest, highest and mean weight as a) 3 values of 1 row and b) 1 value in 3 rows.

Click to see solution

-- 1 row
SELECT min(weight), max(weight), avg(weight)
FROM   person;

-- 3 rows
SELECT min(weight)
FROM   person
  UNION
SELECT max(weight)
FROM   person
  UNION
SELECT avg(weight)
FROM   person;

Extend the previous 3-rows-solution to meet two additional criteria: a) consider only persons born in San Francisco and
b) add a virtual column to show 'Min', 'Max' and 'Avg' according to the correlating numeric values.

Click to see solution

SELECT 'Min', min(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT 'Max', max(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT 'Avg', avg(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco';

Extend the previous solution to order the result: the minimum value first, followed by the average and then the highest value.

Click to see solution

-- 'ugly' solution
SELECT '1 Min' AS note, min(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT '3 Max' AS note, max(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT '2 Avg' AS note, avg(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
ORDER BY note;

-- 'clean' solution
SELECT 1 AS note, 'Min', min(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT 3 AS note, 'Max', max(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
  UNION
SELECT 2 AS note, 'Avg', avg(weight)
FROM   person
WHERE  place_of_birth = 'San Francisco'
ORDER BY note;

Create a list of lastnames for persons with a weight greater than 70 kg together with
all e-mail values (one value per row). There is no concordance between lastnames and e-mails.
(This example is not very helpfull for praxis, but instructive.)

Click to see solution

SELECT lastname
FROM   person
WHERE  weight > 70
  UNION
SELECT contact_value
FROM   contact
WHERE  contact_type = 'email';

In the previous example the lastname 'de Winter' is shown only once. But there are more than one persons of the family with a weight greater than 70 kg.
Why?
Extend the previous solution to show as much resulting rows as hits to the criteria.

Click to see solution

-- Extend 'UNION' to 'UNION ALL'. The default is 'UNION DISTINCT'
SELECT lastname
FROM   person
WHERE  weight > 70
  UNION ALL
SELECT contact_value
FROM   contact
WHERE  contact_type = 'email';

SELECT: Predefined Functions

Structured Query Language
SELECT: Set Operations

SELECT: Case Expression