Last modified on 11 June 2014, at 09:07

Structured Query Language/SELECT: Join Operation

← SELECT: Fundamentals | SELECT: Grouping →

Data should be stored in a way, that no redundant information exists in the database. If more than one person indulge a hobby, for example, we avoid storing information about hobbies (designation, remarks, ...) within each affected persons row as this would double the hobbies information. And we avoid storing information about persons (name, ssn, ...) within each affected hobby row for the same reason. Instead we create independent person and hobby tables and point from one to the other. This technique of grouping data items redundancy-free into separate tables is called database normalization. As a result, information about complex objects is scattered across multiple tables. And we need a opportunity to reassemble the whole, original situation. This reverse technique is called a 'JOIN operation'.

The IdeaEdit

In our example database there are the two tables person and contact. The contact table contains the column person_id, which correlates with the Primary Key column id of the person table. By evaluating the column values we can join contacts and persons together.


person table P

ID LASTNAME FIRSTNAME ...
1 Goldstein Larry ...
2 Burton Tom ...
3 Hamilton Lisa ...
4 Goldstein Kim ...
... ... ... ...
... ... ... ...


contact table C

ID PERSON_ID CONTACT_TYPE CONTACT_VALUE
1 1 fixed line 555-0100
2 1 email larry.goldstein@acme.xx
3 1 email lg@my_company.xx
4 1 icq 12111
5 4 fixed line 5550101
6 4 mobile 10123444444
... ... ... ...
... ... ... ...


Joined (virtual) table, created out of person and contact

P.ID P.LASTNAME P.FIRSTNAME ... C.ID C.PERSON_ID C.CONTACT_TYPE C.CONTACT_VALUE
1 Goldstein Larry ... 1 1 fixed line 555-0100
1 Goldstein Larry ... 2 1 email larry.goldstein@acme.xx
1 Goldstein Larry ... 3 1 email lg@my_company.xx
1 Goldstein Larry ... 4 1 icq 12111
2 Burton Tom ...  ?  ?  ?  ?
3 Hamilton Lisa ...  ?  ?  ?  ?
4 Goldstein Kim ... 5 4 fixed line 5550101
4 Goldstein Kim ... 6 4 mobile 10123444444
... ... ... ... ... ... ... ...

As we see, Larry Goldstein, who exists only once in the real table person now occurs four times in the joined, virtual table, every time in combination with one of his four contact items. The same applies for Kim Goldstein with his two contact items.

But what is going on with Tom Burton and Lisa Hamilton, whose contact information is not available? We may have some trouble with the attempt to join their person with - not existing - contact information. For the moment we have flagged the situation with question marks. Later on this page we will explain in detail how to transform the problem into a solution.

The Basic SyntaxEdit

Obviously it's necessary to specify two things with the JOIN operation

  • the names of involved tables
  • the names of involved columns

The basic syntax extends the SELECT command with this two elements

SELECT <things_to_be_displayed>       -- as usual
FROM   <tablename_1> <table_1 alias>  -- a table alias
JOIN   <tablename_2> <table_2 alias> ON <JOIN condition> -- the join criterion
...    -- optionally all the other elements of SELECT command
;

Let's take a first attempt.

SELECT *
FROM   person p
JOIN   contact c ON p.id = c.person_id;


The table names are mentioned behind the FROM (as usual) and behind the new keyword JOIN, which instructs the DBMS to perform a join operation. Next there is the ON keyword, which introduces the columnnames together with a comparison operator (or a general condition as you will see later). The columnnames are prefixed with the aliases of the tablenames p and c respectively. This is necessary, because in different tables there may exist columns with identical names like id.

When the DBMS executes the command it delivers 'something' that contains all columns of table person and all columns of table contact - amongst others the two columns id of person and id of contact. The result contains 9 rows, one per existing combination of person and contact ('existing' in the sense of the formulation behind the 'ON' keyword). Please notice that persons without any stored contact are not part of the result.

The delivered 'something' looks like a new table. And in fact it has the same structure, behaviour and data like a table. If it is created during a subselect or from a view, we can perform new SELECTs on it. But there is an important difference to tables: The data is not stored in the DBMS at its own place, the data is computed at runtime out of the values of real tables.

The key feature of assembling complex information out of simple tables is possible by using the two simple keywords JOIN and ON. At the same time - as you will see - the syntax is extendable to manage very complex queries, so that there may be a great number of additional details in specifying your join criteria. People sometimes gets confused about the gap between their intention and some unexpected results. If you run into a problem of that kind, try to reduce your query to the shown simple form. Confusion often results from the fact, that the JOIN syntax itself may be complex due to a lot of details. Additionally joining can be combined with all other syntactical elements of the SELECT command, which also may lead to lack of clarity.

The combination of the join syntax with other language elements is shown in the following examples.


--
-- show only important columns
SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number"
FROM   person p
JOIN   contact c ON p.id = c.person_id;
 
-- show only desired rows
SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number"
FROM   person p
JOIN   contact c ON p.id = c.person_id
WHERE  c.contact_type IN ('fixed line', 'mobile');
 
-- apply any sort order
SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number"
FROM   person p
JOIN   contact c ON p.id = c.person_id
WHERE  c.contact_type IN ('fixed line', 'mobile')
ORDER BY p.lastname, p.firstname, c.contact_type DESC;
 
-- use functions: min() / max() / count()
SELECT COUNT(*)
FROM   person p
JOIN   contact c ON p.id = c.person_id
WHERE  c.contact_type IN ('fixed line', 'mobile');
 
-- JOIN a table with itself. Example: Search different persons with same lastname
SELECT p1.id, p1.firstname, p1.lastname, p2.id, p2.firstname, p2.lastname
FROM   person p1
JOIN   person p2 ON p1.lastname = p2.lastname -- for second incarnation of person we must use a different alias
WHERE  p1.id != p2.id
-- sorting of p2.lastname is not necessary as it is identical to the already sorted p1.lastname
ORDER BY p1.lastname, p1.firstname, p2.firstname;
 
-- JOIN more than two tables. Example: contact information of different persons with same lastname
SELECT p1.id, p1.firstname, p1.lastname, p2.id, p2.firstname, p2.lastname, c.contact_type, c.contact_value
FROM   person p1
JOIN   person p2 ON p1.lastname = p2.lastname
JOIN   contact c ON p2.id = c.person_id       -- contact info from person2. p1.id would lead to person1
WHERE  p1.id != p2.id
ORDER BY p1.lastname, p1.firstname, p2.lastname;

 

Four Join TypesEdit

At the beginning of this page we have seen a strange situation where no corresponding rows exist. We have inserted a question mark into this cells. With the basic syntax of the JOIN operation all rows, where those question marks would appear, are not part of the result. This kind of JOIN is called an INNER join. Additionally there are three kinds of so called OUTER joins. An OUTER join likewise contains all the corresponding rows as an INNER join, but is supplemented with rows with no counterpart in one or both of the two tables: LEFT OUTER, RIGHT OUTER and FULL OUTER.

So we can widen the basic JOIN syntax to the four possibilities:

  • [INNER] JOIN
  • LEFT [OUTER] JOIN
  • RIGHT [OUTER] JOIN
  • FULL [OUTER] JOIN

Keywords surrounded by [ ] are optional. Omitting the INNER keyword leads to the basic syntax we have seen in the previous paragraph.

Inner JoinEdit

In most cases, people use the inner join. As we have seen previously the result in this version contains exactly those rows, which achieves exactly the criterion specified behind the ON. As a first example we create a list of persons and their contacts.

-- A list of persons and their contacts
SELECT p.firstname, p.lastname, c.contact_type, c.contact_value
FROM   person p
JOIN   contact c ON p.id = c.person_id  -- identical meaning: INNER JOIN ...
ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;

The important thing is that persons without any contact information are not part of the result.

Left (outer) JoinEdit

Sometimes we need a little more, for example a list of all persons, which contains - as an additional information about the person - their contact information. Please consider the difference to the above example. The list should contain all persons, also such without contact information.

-- A list of ALL persons plus their contacts
SELECT    p.firstname, p.lastname, c.contact_type, c.contact_value
FROM      person p
LEFT JOIN contact c ON p.id = c.person_id  -- identical meaning: LEFT OUTER JOIN ...
ORDER BY  p.lastname, p.firstname, c.contact_type DESC, c.contact_value;

Obviously the result contains all rows just like in the above inner join. Additionally there are rows for persons which have no contact information. As they do not have those information the values for the contact columns are not known. The DBMS fills them with the so called 'null value' or 'null special marker' (which is different from the string 'null value' or 'null' and different from the binary 0 - and for SQL users it's waste to know how it is implemented). You will find more informations about the null special marker and how to retrieve them in a later chapter.

In summary the left (outer) join is an inner join plus one row for each row without a counterpart.

Consider the word 'left'. It indicates that the evaluation of the formulation "FROM <table_1> LEFT JOIN <table_2>" is done from left to right. All rows of the left listed table (here: table_1) are included in the result, they may have a counterpart in table_2 or not.


Another example:

SELECT    p.firstname, p.lastname, c.contact_type, c.contact_value
FROM      contact c
LEFT JOIN person p  ON p.id = c.person_id  -- identical meaning: LEFT OUTER JOIN ...
ORDER BY  p.lastname, p.firstname, c.contact_type DESC, c.contact_value;

What makes the difference? We have changed the order of tablenames. As contact is now mentioned on the left side, it is evaluated first (all the contact rows go to the result - plus their according person information). As in our example database there is no contact without a person, the result is the same as with an inner join. But it's different from the first example of the left join paragraph.

Right (outer) JoinEdit

The right join obeys the same rules as the left join. First all joining rows goes to the result, supplemented by all rows with no corresponding rows in the other table. The DBMS fills all missing column values with the null special marker. The only difference is that the evaluation sequence of tables takes place from right to left. In a formulation like "FROM <table_1> RIGHT JOIN <table_2>" all rows of table_2 are included in the result, they may have a counterpart in table_1 or not.

-- A list of ALL contacts plus according persons
SELECT     p.firstname, p.lastname, c.contact_type, c.contact_value
FROM       person p
RIGHT JOIN contact c ON p.id = c.person_id  -- identical meaning: RIGHT OUTER JOIN ...
ORDER BY   p.lastname, p.firstname, c.contact_type DESC, c.contact_value;

In our example database their is no contact without a person, so the result of this query is identical to an inner join. In general a right join is identical to a left join with reverse order of table names, see our 'another example' above.

Full (outer) JoinEdit

A full join combines the proceedings of left and right join such that the result contains all corresponding rows plus rows of left table without match in right table plus rows of right table without match in left table.

SELECT    p.firstname, p.lastname, c.contact_type, c.contact_value
FROM      person p
FULL JOIN contact c ON p.id = c.person_id  -- identical meaning: FULL OUTER JOIN ...
ORDER BY  p.lastname, p.firstname, c.contact_type DESC, c.contact_value;


Suppose there are two tables table_1 and table_2.

table_1

ID X
1 11
2 12
3 13

table_2

ID TABLE_1_ID Y
1 1 21
2 5 22

The full join

SELECT    *
FROM      table_1 t1
FULL JOIN table_2 t2 ON t1.id = t2.table_1_id;

will result in

T1.ID T1.X T2.ID T2.TABLE_1_ID T2.Y
1 11 1 1 21
2 12 null null null
3 13 null null null
null null 2 5 22

The result contains the (single) matching row plus all other rows, which are extended by the null special marker in the missing cells.


Hint: The full join is not supported by all DBMS. But as it isn't an atomic operation it is always possible to create the desired result by a combination of multiple SELECTs with SET operations.

Cartesian ProductEdit

In the case of the inner join version of the command we can omit the ON part. SQL interprets this situation as a request to combine every row of the left table with every row of the right table. The result will contain a large number of rows: number of rows within left table multiplied by number of rows within right table.

This special kind of an inner join is called a cartesian product, see also: cartesian product in German Wiki. The cartesian product is an elementary operation of the relationa algebra, which is the foundation for all rDBMS implementations.

-- all persons combined with all contacts (some implementations insists in replacing the
-- keyword 'JOIN' by an comma)
SELECT p.firstname, p.lastname, c.contact_type, c.contact_value
FROM   person p
JOIN   contact c   -- missing ON keyword: p X c will be created
ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
 
-- count the arising rows
SELECT COUNT(*)
FROM   person p
JOIN   contact c;

Be careful: If we (falsely) forget the ON term we will receive a large number of resulting rows. If, for example, the first table contains 10.000 rows and the second one 20.000 rows, we will receive 200 million rows.

The n:m SituationEdit

How can we create a list of persons and their hobbies? Remember: one person may run many hobbies and several persons may run the same hobby. So there is no direct connection from persons to hobbies. Between the two tables we have created a third one person_hobby. It holds the id of persons as well as the id of hobbies.

We have to 'walk' from person to person_hobby and from there to hobby.

-- persons combined with their hobbies
SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id
FROM   person       p
JOIN   person_hobby ph ON p.id = ph.person_id
JOIN   hobby        h  ON ph.hobby_id = h.id
ORDER BY p.lastname, p.firstname, h.hobbyname;

Please notice that no column of the table person_hobby goes to the result. This table acts only during intermediate execution steps. Even its column id is not of interest.

Some people do not perform a hobby. As we performed an INNER JOIN they are not part of the above list. If we want to see in the list also persons without hobbies, we must do what we have done before: use LEFT OUTER JOINs instead of INNER JOINs.

-- ALL persons plus their hobbies (if present)
SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id
FROM   person            p
LEFT JOIN   person_hobby ph ON p.id = ph.person_id
LEFT JOIN   hobby        h  ON ph.hobby_id = h.id
ORDER BY p.lastname, p.firstname, h.hobbyname;

Hint: If necessary we can combine every kind of join with every other kind of join in every desired sequence, eg: LEFT OUTER with FULL OUTER with INNER ... .

More DetailsEdit

Criteria for join operations are not restricted to the usual formulation:

SELECT ...
FROM   table_1 t1
JOIN   table_2 t2 ON t1.id = t2.fk
...


First, we can use any column, not only primary key and foreign key columns. In one of the above examples we used the lastname for a join. Lastname is of type character and has no meaning of any key. To avoid poor performance some DBMS restrict the use of columns to those having an index.


Second, the comparator is not restricted to the equal sign. We can use any sensfull operator, for example the 'greater than' for numeric values.

-- Which person has the greater body weight - restricted to 'de Winter' for clarity
SELECT p1.id, p1.firstname AS "is heavier", p1.weight, p2.id, p2.firstname AS "than", p2.weight
FROM   person p1
JOIN   person p2 ON p1.weight > p2.weight
WHERE  p1.lastname = 'de Winter'
AND    p2.lastname = 'de Winter'
ORDER BY p1.weight DESC, p2.weight DESC;


Third, we can use an arbitrary function.

-- short lastnames vs. long lastnames
SELECT p1.firstname, p1.lastname AS "shorter lastname", p2.firstname, p2.lastname
FROM   person p1
JOIN   person p2 ON LENGTH(p1.lastname) < LENGTH(p2.lastname)
-- likewise ORDER BY can use functions
ORDER BY LENGTH(p1.lastname), LENGTH(p2.lastname);

ExercisesEdit

Show first- and lastname plus icq number for persons having an icq number

Click to see solution
SELECT p.id, p.firstname, p.lastname, c.contact_value
FROM   person p
JOIN   contact c ON p.id = c.person_id
WHERE  c.contact_type = 'icq';

Show first- and lastname plus ICQ number plus fixed line number for persons having an ICQ number AND a fixed line. You need to join the contact table twice.

Click to see solution
SELECT p.id, p.firstname, p.lastname,
       c1.contact_value AS icq,
       c2.contact_value AS "fixed line"   -- looks like previous, but is different
FROM   person p
JOIN   contact c1 ON p.id = c1.person_id
JOIN   contact c2 ON p.id = c2.person_id  -- it's a second (virtual) incarnation of contact table
WHERE  c1.contact_type = 'icq'            -- from first incarnation
AND    c2.contact_type = 'fixed line';    -- from second incarnation
 
-- In this example of an INNER JOIN we can convert the WHERE part to an additional JOIN criterion.
-- This may clarify the intention of the command. But be careful: This shifting in combination with
-- one of the OUTER JOINs may lead to different results.
SELECT p.id, p.firstname, p.lastname, c1.contact_value AS icq, c2.contact_value AS "fixed line"
FROM   person p
JOIN   contact c1 ON p.id = c1.person_id AND c1.contact_type = 'icq' 
JOIN   contact c2 ON p.id = c2.person_id AND c2.contact_type = 'fixed line';

Show first- and lastname plus (if present) the ICQ number for ALL persons

Click to see solution
-- To retrieve ALL persons it's necessary to use a LEFT join.
-- But the first approach is not what we expect! In this example the LEFT JOIN is evaluated first
-- and creates an intermediate table with null-values in contact_type (eliminate the
-- WHERE clause to see this intermediate result). These rows and all other except the 
-- one with 'ICQ' are then thrown away by evaluating the WHERE clause.
SELECT    p.id, p.firstname, p.lastname, c.contact_value
FROM      person p
LEFT JOIN contact c ON p.id = c.person_id
WHERE     c.contact_type = 'icq';
-- It's necessary to formulate the search criterion as part of the JOIN. Unlike with
-- the INNER JOIN in the previous example with (LEFT/FULL/RIGHT) OUTER JOINs it is not possible 
-- to shift it to the WHERE clause.
SELECT    p.id, p.firstname, p.lastname, c.contact_value
FROM      person p
LEFT JOIN contact c ON p.id = c.person_id AND c.contact_type = 'icq';

Create a list which contains ALL hobbies plus according persons (if present)

Click to see solution
SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id
FROM   person              p
RIGHT  JOIN   person_hobby ph ON p.id = ph.person_id
RIGHT  JOIN   hobby        h  ON ph.hobby_id = h.id
ORDER BY h.hobbyname, p.lastname, p.firstname;

Is it possible that one of the three outer joins contains fewer rows than the corresponding inner join?

Click to see solution
No.

All four join types contain the same rows with column-matching-values. In addition
outer joins contain rows where column values do not match - if such a situation exists.