SQL technique: multiple joins and the distinct keyword

It is important to realize that if you have a properly designed and linked database, you can retrieve information from as many tables as you want, specify retrieval conditions based on any data in the tables, and show the results in any order that you like.

Example: We’ll use the order entry model. We’d like a list of all the products that have been purchased by a specific customer.

• Unless you can memorize the exact spelling of every attribute in the database, along with all the PKs and FKs (I can’t), you should keep a copy of the relation scheme diagram handy when you build the query. We’ll show it again here, circling the attributes that we need for this query.

Sales model relation scheme diagram

Other views of this diagram: Large image - Description (text)

• Our query is still a simple one that could be expressed in RA using only the select, project, and join operators:

πcLastName, cFirstName, prodName σcFirstName='Alvaro' and cLastName='Monge' (customers join orders join orderlines join products)

• The information that we need is found in the Products table. You could think of this as “output” from the query that will be part of the SELECT clause attribute list. The retrieval condition, or “input” to the query, is based on the Customers table—these attributes will be needed for the WHERE clause. You should include these attributes in the SELECT list, so that you are sure that your query is showing the data that you want.

• We will also need to include the intervening tables in the FROM clause of the query, since this is the only way to correctly associate the Customers data with the Products data. (You can’t join Customers to Products, because they don’t have any common attributes.)

• Another way to think about this is to simply “follow the PK-FK pairs” from table to table until you have completely linked all of the information you need. If you look carefully at the relation scheme for this query, you will realize that we could have bypassed the Orders table (since the custID is also part of the OrderLines scheme). If there were an orderID, or if we needed data from the Orders table, it would have to be included, so we’ll keep it here for illustration. In SQL, we’ll build the query using the same step-by-step procedure that you have seen before.

1. Look at the result set (all of the linked data). We won’t show the result here because of space limitations on the web page.

        SELECT *
        FROM customers 
          NATURAL JOIN orders
          NATURAL JOIN orderlines
          NATURAL JOIN products;

• Your database system might not support the NATURAL JOIN syntax that we show here. We’ll discuss this issue further when we look at join types. The multiple natural joins in our example work correctly because there are no non-pk/fk attributes in any of our tables that have the same name. In larger, more complicated databases, this might not be true.

2. Pick the rows you want, and be sure that all of the information makes sense and is really what you are looking for. (This one is still too wide for the page.)

        SELECT *
        FROM customers 
          NATURAL JOIN orders
          NATURAL JOIN orderlines
          NATURAL JOIN products
        WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge';

3. Now pick the columns that you want, and again check the results. Notice that we are including the retreival condition attributes in the SELECT clause, to be sure that this really is the right answer.

        SELECT cFirstName, cLastName, prodName
        FROM customers 
          NATURAL JOIN orders
          NATURAL JOIN orderlines
          NATURAL JOIN products
        WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge';

Products purchased
cFirstNamecLastNameprodName
AlvaroMongeHammer, framing, 20 oz.
AlvaroMongeScrewdriver, Phillips #2, 6 inch
AlvaroMongeHammer, framing, 20 oz.
AlvaroMongeScrewdriver, Phillips #2, 6 inch
AlvaroMongePliers, needle-nose, 4 inch

The distinct keyword

Oops! We only wanted a list of the individual product names that this customer has purchased, but some of them are listed more than once. What went wrong?

• If the RA version of our query could have actually been executed, each row of the result table above would be distinct—remember that a relation is a set of tuples, and sets can’t have duplicates—and there would of course be fewer rows than you see here.

• SQL doesn’t work the same way. The reason for the duplicates is that the SELECT clause simply eliminated the unwanted columns from the result set; it left all of the rows that were picked by the WHERE clause.

• The real problem in SQL is that the SELECT attribute list is not a super key for the result set. Look again very carefully at the relation scheme to understand why this is true. Any time that this happens, we can eliminate the duplicate rows by including the DISTINCT keyword in the SELECT clause. While making this revision, we’ll also list the product names in alphabetical order.

        SELECT DISTINCT cFirstName, cLastName, prodName
        FROM customers 
          NATURAL JOIN orders
          NATURAL JOIN orderlines
          NATURAL JOIN products
        WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge'
        ORDER BY prodName;

Distinct products
cFirstNamecLastNameprodName
AlvaroMongeHammer, framing, 20 oz.
AlvaroMongePliers, needle-nose, 4 inch
AlvaroMongeScrewdriver, Phillips #2, 6 inch

• A sloppy way to be sure that you never have duplicate rows would be to always use the DISTINCT keyword. Please don’t do this—it just keeps you from understanding what is really going on in the query. If the SELECT attribute list does form a super key of the FROM clause (result set), the DISTINCT keyword is not needed, and should not be used.