• A key is any subset of the attributes or columns from a relational database table.
  • A superkey is any key that uniquely identifies an entity or row, including the entire entity itself.
  • A candidate key is a minimal superkey, that is, a superkey from which no attributes may be removed without making it no longer a superkey. For this reason, entire entities are seldom candidate keys (except in many-to-many relationships with no attributes). Examples of candidate keys in the real world are (social_security_number), (username), or (first_name, middle_initial, last_name, date_of_birth, address). Candidate keys are so-called because they're candidates for being primary keys.
  • A primary key is the candidate key chosen by the database designer to identify the entity. Inserting an entity with an identical primary key to another entity in the same table is a key violation.
  • A foreign key is a key that references a primary key in another table (or, in recursive relationships, the same table). Referencing a nonexistent key is another type of key violation, as is deleting an entity that another entity references (unless, of course, you on delete cascade the key). Although standard SQL defines foreign key constraints, DBMS vendors have historically implemented (or ignored) them in incompatible ways.
Here's a Venn diagram of the concepts.
,-------------------------------------.
|      Keys                           |
|  ,-------------------------------.  |
|  |   Superkeys        ,-------.  |  |
|  |  ,-----------------+-.Whole|  |  |
|  |  |  Candidate keys | | rows|  |  |
|  |  |                 | |     |  |  |
|  |  |  o Primary key  `-+-----'  |  |
|  |  |                   |        |  |
|  |  `-------------------'        |  |
|  `-------------------------------'  |
`-------------------------------------'
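The definitions above can be made concrete with a small DDL sketch. The tables and columns (person, account) are invented for illustration; the syntax is the common PRIMARY KEY / FOREIGN KEY ... REFERENCES ... ON DELETE CASCADE form.

    -- Hypothetical parent table: ssn is the chosen primary key;
    -- username is another candidate key, declared UNIQUE.
    CREATE TABLE person (
        ssn       CHAR(9)      NOT NULL PRIMARY KEY,
        username  VARCHAR(30)  NOT NULL UNIQUE,
        full_name VARCHAR(100)
    );

    -- Hypothetical child table whose foreign key references person.
    -- Deleting a person cascades to that person's accounts instead of
    -- raising a key violation.
    CREATE TABLE account (
        account_id INTEGER NOT NULL PRIMARY KEY,
        owner_ssn  CHAR(9) NOT NULL
                   REFERENCES person (ssn) ON DELETE CASCADE
    );

    INSERT INTO person (ssn, username) VALUES ('123456789', 'alice');
    INSERT INTO person (ssn, username) VALUES ('123456789', 'bob');       -- key violation: duplicate primary key
    INSERT INTO account (account_id, owner_ssn) VALUES (1, '000000000');  -- key violation: no such parent key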

Common primary keys

History & Principles
The phrase "primary key" dates back to the invention of VSAM ("Virtual Sequential/Storage Access Method"), KSAM ("Keyed..."), and ISAM ("Indexed...") which are still used by batch-processing mainframes today.

Where the amount of data a process must retrieve is so large that random access becomes problematic (because of the memory needed to cache a significant number of blocks¹ and the time needed to move magnetic read heads to distant addresses), there is a strong incentive to keep the data in a physical (sequential) order that matches the most likely processing order for the bulk of the workload. The combination of columns whose values, sorted according to the prevailing collating sequence, define that physical order is and was called the "primary key".

Besides the primary key, other orderings were allowed. These orderings were all called "secondary keys". A secondary key could not be defined if a primary key had not been defined previously.

In the good ol' days of computing (early 1970s), computer systems executed programs that were very small and simple by today's standards. This author assumes that there was consequently a desire on the part of the access method developers to settle on just two access routines to handle the two types of keys, and this led to enforcing that the primary key be unique while secondary keys were either unique or not, without a performance consequence either way.

The concept of primary keys continued being useful through the period where hierarchical databases were popular, and today's (2002) relational databases still provide for their definition. However, unless there is a need to specify an aspect of table organization that relies on the physical ordering of the rows (such as key-range partitioning of a table), primary keys are optional in many relational databases.

Necessity & Debate
The reasons that primary keys are generally optional in relational databases are two-fold: the mission of relational databases is to provide fast random access, so sequential access is not considered especially important; and the number of access routines is greater, so the uniqueness of an index can be explicitly and separately declared during its creation.
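As a sketch of that second point (the orders table and index names here are invented), uniqueness is a property declared per index at its creation, rather than something bundled into a primary key:

    CREATE TABLE orders (
        order_no   INTEGER,
        cust_no    INTEGER,
        order_date DATE
    );

    -- One access path is declared unique, the other is not;
    -- neither requires a primary key on the table.
    CREATE UNIQUE INDEX orders_no_ix ON orders (order_no);
    CREATE INDEX        orders_dt_ix ON orders (order_date, cust_no);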

Any performance benefit to be gained from defining a primary key in a relational database stems from the fact that being able to rely on the physical order of the rows means that a range scanning process supported by that key does not have to go back to the index to find the next row. Not having to use the index means saving disk I/O and processing time.
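For example (continuing the hypothetical orders table from the sketch above, and assuming its rows are physically stored in order_date order via whatever clustering mechanism the DBMS offers), a range scan such as the following reads consecutive blocks sequentially rather than revisiting the index for each row:

    SELECT order_no, cust_no, order_date
      FROM orders
     WHERE order_date BETWEEN DATE '2002-01-01' AND DATE '2002-03-31'
     ORDER BY order_date;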

Clearly this benefit does not rely on the uniqueness property, since it is related to range scans, not to cardinal/identity/equijoin matches. It follows that uniqueness is not a logically necessary property of primary keys.

Occasionally someone argues that foreign key relationships must rely on unique keys for the parent entity, and therefore primary keys must be unique for backward compatibility; but this too is anachronistic. First of all, in order to maintain referential integrity, all that is required is that some rows exist in the parent entity for that key value - there is no logical requirement that the number of rows be one. Secondly, database designers that wish to enforce that foreign keys relate to unique parent rows can do so without a global uniqueness constraint on primary keys. And thirdly, all foreign keys that currently point to unique primary keys would not be affected by making it possible to define new primary keys that are not unique; so there is no backward compatibility issue.
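As a sketch of the second point (tables invented for illustration): standard SQL already lets a foreign key reference any column set declared UNIQUE, so parent-row uniqueness can be enforced per relationship without leaning on the primary key at all.

    CREATE TABLE product (
        product_id INTEGER     NOT NULL PRIMARY KEY,
        sku        VARCHAR(20) NOT NULL UNIQUE
    );

    -- The foreign key targets the separately declared UNIQUE column,
    -- not the parent's primary key.
    CREATE TABLE order_line (
        line_id INTEGER     NOT NULL PRIMARY KEY,
        sku     VARCHAR(20) NOT NULL REFERENCES product (sku)
    );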

Relational databases do currently enforce that primary keys be unique, but this often forces database designers to pad their keys with columns whose only purpose is to manufacture unique key values, which is not an efficient way of doing things. Consequently, this author thinks that budding database designers should not assume that "primary key" means "unique key". It is likely that database manufacturers will see the error of their ways and drop the mandatory uniqueness of primary keys, on the basis that the logical necessity of intended relationships and the physical necessity of efficient processing should drive the development of DDL syntax, and not the other way around.

In addition, most RDBMS engines also enforce that all the columns composing a primary key be constrained to be never null. Once it is admitted that primary keys need not be unique, the "not null" requirement can also go away.

Another common misconception about primary keys is that the key which supports the access path with the highest volume should be declared the primary key. This is an incomplete understanding.

In many databases, primary keys can simply be omitted as unnecessary. The only times a primary key is useful are when defining key-range partitioning and when supporting the efficiency of processes that perform range scans. For example, a number of processes might use a certain key 999 times out of a thousand; but if that access is based on cardinal matching, then the single process that does a range scan on another key should have its key declared the primary key. And among multiple keys supporting range-scan access, the one with the highest volume or the most critical response-time requirements should be declared primary.
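As one sketch of key-range partitioning (the syntax follows PostgreSQL's declarative partitioning; the table is the same invented orders example as above, re-declared with a partitioning clause):

    -- Rows are physically grouped by order_date range, so a range scan
    -- on order_date touches only the relevant partitions.
    CREATE TABLE orders (
        order_no   INTEGER,
        order_date DATE NOT NULL
    ) PARTITION BY RANGE (order_date);

    CREATE TABLE orders_2002_q1 PARTITION OF orders
        FOR VALUES FROM ('2002-01-01') TO ('2002-04-01');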

¹ See block device or block mode file.
