Everything You Need to Know About the SQL GROUP BY Statement

Much of the power of relational databases comes from filtering data and joining tables together. This is why we represent those relations in the first place. But modern database systems provide another valuable technique: grouping.

Grouping allows you to extract summary information from a database. It lets you combine results to create useful statistical data. Grouping saves you from writing code for common cases such as averaging lists of figures. And it can make for more efficient systems.

What Does the GROUP BY Clause Do?

GROUP BY, as the name suggests, groups results into a smaller set. The results consist of one row for each distinct value of the grouped column. We can show its usage by looking at some sample data with rows that share some common values.

The following is a very simple database with two tables representing record albums. You can set up such a database by writing a basic schema for your chosen database system. The albums table has nine rows with a primary key id column and columns for name, artist, year of release, and sales:

        +----+---------------------------+-----------+--------------+-------+
| id | name                      | artist_id | release_year | sales |
+----+---------------------------+-----------+--------------+-------+
|  1 | Abbey Road                |         1 |         1969 |    14 |
|  2 | The Dark Side of the Moon |         2 |         1973 |    24 |
|  3 | Rumours                   |         3 |         1977 |    28 |
|  4 | Nevermind                 |         4 |         1991 |    17 |
|  5 | Animals                   |         2 |         1977 |     6 |
|  6 | Goodbye Yellow Brick Road |         5 |         1973 |     8 |
|  7 | 21                        |         6 |         2011 |    25 |
|  8 | 25                        |         6 |         2015 |    22 |
|  9 | Bat Out of Hell           |         7 |         1977 |    28 |
+----+---------------------------+-----------+--------------+-------+

The artists table is even simpler. It has seven rows with id and name columns:

        +----+---------------+
| id | name          |
+----+---------------+
|  1 | The Beatles   |
|  2 | Pink Floyd    |
|  3 | Fleetwood Mac |
|  4 | Nirvana       |
|  5 | Elton John    |
|  6 | Adele         |
|  7 | Meat Loaf     |
+----+---------------+

You can understand various aspects of GROUP BY with just a simple data set such as this. Of course, a real-life data set would have many, many more rows, but the principles remain the same.

Grouping by a Single Column

Let’s say we want to find out how many albums we have for each artist. Start with a typical SELECT query to fetch the artist_id column:

        SELECT artist_id FROM albums

This returns all nine rows, as expected:

        +-----------+
| artist_id |
+-----------+
|         1 |
|         2 |
|         3 |
|         4 |
|         2 |
|         5 |
|         6 |
|         6 |
|         7 |
+-----------+

To group these results by the artist, append the phrase GROUP BY artist_id:

        SELECT artist_id FROM albums GROUP BY artist_id

Which gives the following results:

        +-----------+
| artist_id |
+-----------+
|         1 |
|         2 |
|         3 |
|         4 |
|         5 |
|         6 |
|         7 |
+-----------+

There are seven rows in the result set, reduced from the total nine in the albums table. Each unique artist_id has a single row. Finally, to get the actual counts, add COUNT(*) to the columns selected:

        SELECT artist_id, COUNT(*)
FROM albums
GROUP BY artist_id

+-----------+----------+
| artist_id | COUNT(*) |
+-----------+----------+
|         1 |        1 |
|         2 |        2 |
|         3 |        1 |
|         4 |        1 |
|         5 |        1 |
|         6 |        2 |
|         7 |        1 |
+-----------+----------+

The results group two pairs of rows for the artists with ids 2 and 6. Each has two albums in our database.

How to Access Grouped Data With an Aggregate Function

You may have used the COUNT function before, particularly in the COUNT(*) form as seen above. It fetches the number of results in a set. You can use it to get the total number of records in a table:

        SELECT COUNT(*) FROM albums

+----------+
| COUNT(*) |
+----------+
|        9 |
+----------+

COUNT is an aggregate function. This term refers to functions that translate values from multiple rows into a single value. They are often used in conjunction with the GROUP BY statement.

Rather than just count the number of rows, we can apply an aggregate function to grouped values:

        SELECT artist_id, SUM(sales)
FROM albums
GROUP BY artist_id

+-----------+------------+
| artist_id | SUM(sales) |
+-----------+------------+
|         1 |         14 |
|         2 |         30 |
|         3 |         28 |
|         4 |         17 |
|         5 |          8 |
|         6 |         47 |
|         7 |         28 |
+-----------+------------+

The total sales shown above for artists 2 and 6 are their multiple albums’ sales combined:

        SELECT artist_id, sales
FROM albums
WHERE artist_id IN (2, 6)
+-----------+-------+
| artist_id | sales |
+-----------+-------+
|         2 |    24 |
|         2 |     6 |
|         6 |    25 |
|         6 |    22 |
+-----------+-------+

Grouping by Multiple Columns

You can group by more than one column. Just include multiple columns or expressions, separated by commas. The results will group according to the combination of these columns.

        SELECT release_year, sales, count(*)
FROM albums
GROUP BY release_year, sales

This will typically produce more results than grouping by a single column:

        +--------------+-------+----------+
| release_year | sales | count(*) |
+--------------+-------+----------+
|         1969 |    14 |        1 |
|         1973 |    24 |        1 |
|         1977 |    28 |        2 |
|         1991 |    17 |        1 |
|         1977 |     6 |        1 |
|         1973 |     8 |        1 |
|         2011 |    25 |        1 |
|         2015 |    22 |        1 |
+--------------+-------+----------+

Note that, in our small example, just two albums have the same release year and sales count (28 in 1977).

Useful Aggregate Functions

Aside from COUNT, several functions work well with GROUP. Each function returns a value based on the records belonging to each result group.

COUNT() returns the total number of matching records.
SUM() returns the total of all values in the given column added up.
MIN() returns the smallest value in a given column.
MAX() returns the largest value in a given column.
AVG() returns the mean average. It’s the equivalent of SUM() / COUNT().

You can also use these functions without a GROUP clause:

        SELECT AVG(sales) FROM albums

+------------+
| AVG(sales) |
+------------+
|    19.1111 |
+------------+

Using GROUP BY With a WHERE Clause

Just as with a normal SELECT, you can still use WHERE to filter the result set:

        SELECT artist_id, COUNT(*)
FROM albums
WHERE release_year > 1990
GROUP BY artist_id

        +-----------+----------+
| artist_id | COUNT(*) |
+-----------+----------+
|         4 |        1 |
|         6 |        2 |
+-----------+----------+

Now you have only those albums released after 1990, grouped by artist. You can also use a join with the WHERE clause, independently from the GROUP BY:

        SELECT r.name, COUNT(*) AS albums
FROM albums l, artists r
WHERE artist_id=r.id
AND release_year > 1990
GROUP BY artist_id

        +---------+--------+
| name    | albums |
+---------+--------+
| Nirvana |      1 |
| Adele   |      2 |
+---------+--------+

Note, however, that if you try to filter based on an aggregated column:

        SELECT r.name, COUNT(*) AS albums
FROM albums l, artists r
WHERE artist_id=r.id
AND albums > 2
GROUP BY artist_id;

You’ll get an error:

        ERROR 1054 (42S22): Unknown column 'albums' in 'where clause'

Columns based on aggregate data are not available to the WHERE clause.

Using the HAVING Clause

So, how do you filter the result set after a grouping has taken place? The HAVING clause deals with this need:

        SELECT r.name, COUNT(*) AS albums
FROM albums l, artists r
WHERE artist_id=r.id
GROUP BY artist_id
HAVING albums > 1;

Note that the HAVING clause comes after the GROUP BY. Otherwise, it’s essentially a simple replacement of the WHERE with HAVING. The results are:

        +------------+--------+
| name       | albums |
+------------+--------+
| Pink Floyd |      2 |
| Adele      |      2 |
+------------+--------+

You can still use a WHERE condition to filter the results before the grouping. It will work together with a HAVING clause for filtering after the grouping:

        SELECT r.name, COUNT(*) AS albums
FROM albums l, artists r
WHERE artist_id=r.id
AND release_year > 1990
GROUP BY artist_id
HAVING albums > 1;

Only one artist in our database released more than one album after 1990:

        +-------+--------+
| name  | albums |
+-------+--------+
| Adele |      2 |
+-------+--------+

Combining Results With GROUP BY

The GROUP BY statement is an incredibly useful part of the SQL language. It can provide summary information of data, for a contents page, for example. It is an excellent alternative to fetching large quantities of data. The database handles this extra workload well since its very design makes it optimal for the job.

Once you understand grouping and how to join multiple tables, you’ll be able to utilize most of the power of a relational database.