Database Design: Some Notes
IBM computer scientist E. F. Codd first developed the relational model. While the Structured Query Language (SQL) is an implementation of the relational model, it is sometimes helpful to understand the underlying relational model itself.
Thinking abstractly, a relational database can be thought of in terms of three interlocking parts: The object, the integrity and the operators. The object is how the data is structured. The integrity restrains the data to fit the appropriate formats outlined in the Object, as well as keep everything tidy and, well, relational. Finally, the operators are the ways in which that data is allowed to be manipulated.
A database requires that a set of data to be entered in a uniform fashion. Each set of data is called a relation. A database can have multiple sets of relations, perhaps joined by foreign keys (more on that later).
A relation defines a set of columns in which each column, called an attribute, is devoted to a predefined subset of the data. Each entry into this relation, called a tuple, should have a data for each attribute (though null values can be permitted, depending on the database's predicate, or definition of possible contents for each tuple).
For example, a video store may keep a database of customers. This relation may consist of attributes such as “customer ID,” “last name,” “first name,” “zip code,” and so on. In this case, a tuple is an individual entry describing one person (#42, “John” “Smith” “20707”).
Thinking in terms of attributes, the set of all the implied possible values for an attribute is called a domain. For instance, the Domain for Social Security numbers would be all the possible sets of numbers that fit in the NNN-NN-NNNN format (though an entry with a letter or letters inserted in place of a number would not be part of the domain).
For more defining characteristics of relational databases, see Dodd's 12 Rules.
Wrapping up, the entire set of relations is called the schema. The collected values at any one point of time is called its state.
Keep in mind that the terminology for the relational model does not always map exactly to terminology in SQL. For the most part they are pretty close. In SQL, term relation is equivalent to table, tuple means row, and attribute is a column.
Integrity constraints are the rules that ensure the relation keeps order of its data.
At the tuple level, identifying each tuple with some unique characteristic is important. A superkey is a combination of attributes that uniquely defines a tuple. Alternately, one of the keys can be designated as a primary key. A primary key must be completely different for each tuple. If a new entry is added for each new date, with no two entries ever falling on the same date, then the date could be the primary key. Or, in a sales environment, an “order number” could serve as a primary key, since each order number is unique. Obviously, no primary key can have a null value.
Alternately, the could be as simple as an incremented number assigned to each new entry, by the database (This entry, generated solely for the purpose of creating a primary key, is called a surrogate key). If more than one entry is used is to fashion a unique identifier, this approach is called a composite primary key.
Foreign keys are the way to link different tuples. A e-commerce store may have a regular customer buy a new item. In the customer relation, the customer ID is the primary key, while in the transactions relation, the order number is the primary key. However, the order tuple includes the customer ID. In this tuple, the customer ID is a foreign key.
It should be noted that these two relations must maintain what is called referential integrity, meaning that any customer ID in the transactions relation must be in the customer relation.
In addition to these internal constraints, the database operator may apply some own rules, derived from the operations being recorded in the database. These rules can be enforced through the CHECK constraints in SQL.
Now that we've described the database structure (the “object”), as well as how the pieces are related to one another (the “integrity”), lastly we get to the operations, or what can be done with the database.
Basically, you can do one of four things with a database data:
*You can retrieve existing data *You can insert new data *You can update existing data *You can delete existing data
Of course, there are a lot of different things you can do within these four broad categories. All of these options are contained within what is generically called SQL's data manipulation language, or DML.
At their most basic, relational databases are composed of three sets of elements: An entity, a relationship or an attribute. Relationships tie together different entities, and both can have attributes.
This rule-of-thumb comes from the Entity-Relationship (ER) model, which sets the formal rules for how different attributes relate to one another, at an abstract level.
An entity can be anything, or, to be more precise, any thing. An entity can be a person, or a type of motorbike, or a particular motorbike. An entity's attributes are a set of properties used to describe the entity. An person may have a first name, last name, birthdate---entities all.
Attributes can be single-valued or have multiple values (the names of the person's siblings, for instance). An attribute can be a composite of multiple values: A person's address itself an attribute. It could be composed of an address, city name, zip code, and other elements.
Attributes may also be derived, meaning that they don't actually reside anywhere on the database, but can be calculated when needed, i.e. how many people live within a certain zip code.
Also, in many cases, one of the attributes will what is known as a key attribute. This attribute must be a unique value, at least against for each and every other entity kept in the database. The key attribute allows a query to pick out an individual entry, among all the entries. A person's social security number, for instance, can work as a primary key, because every person's SSN is different.
Relationships are used to tie together different entities, explaining the relation between the two. A PERSON (one entity) WORKS FOR (a relationship) a COMPANY (another entity).
Relationships can have different levels of cardinality, meaning that one entity can have one or more than one relationships with other entities. A PERSON can work for more than one COMPANY. And a COMPANY can employ more than one person. Such rules are usually defined in the database model.
Relationships, like entities, can have attributes. In the example above, the WORKS FOR relationship can have an attribute such as DATE STARTED.
First, we can specify different types of entities a bit further, into subclasses and superclasses. Subclasses are groupings within a set of entities that are clustered in different roles they play, within the category of the entity. If "employee" is an entity, then "technician" and "manager" could be two different subclasses. Inversely, "employee" is a superclass of "technician" and "manager."
Defining how subclasses is a process called specialization. The employee entity can be broken into specialization subclasses, such as "technician" or "secretary." Some attributes can only be applied to certain subclasses (a secretary's typing speed, for instance). These are called specific attributes or local attributes.
In some cases, inclusion into a subclass can be determined by the existence of a certain attribute. These subclasses are called predicate-defined subclasses. Membership is defined by a particular value of an attribute. (Subclasses can also be defined without any particular attribute, though they must be defined manually--Those are called user-defined subclasses).
The reverse operation of specialization--namely to summarize a set of subclasses into a master entity, is called a generalization.
"The single most important concept in relational schema design theory is that of a functional dependency," write Ramez Elmasri and Shamkant Navathe in "Fundmanetals of Database Systems."
But what is a functional dependency? It is the actual relationships in the relational database. It is the relations among the attributes within a table. It is a constraint between two sets of attributes. The relations are permanent and unalterable.
For instance, a table may have two attributes, or columns. If one is the primary key, we can then say we can always determine the value of the second attribute using the primary key. This means we can use the primary key as an index to look up the second attribute. It is a mathematical certainty. The primary key is the determinant and the other is the dependent, the parlance of database-speak.
(Keep in mind that this relationship does not work in reverse. You can not use a dependent value to definitively determine the primary key, chiefly because the dependent value may not be unique in a given table).
Beyond the simple connection drawn between the primary key and the dependents, a number of other inferences can be made as well, using Armb's inference rules, which cover the laws of logic such as transitivity and reflexivity.