Normalization is the process of organizing data in a database. This includes creating tables and
establishing relationships between those tables according to rules designed both to protect the
data and to make the database more flexible by eliminating two factors: redundancy and
inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more
than one place must be changed, the data must be changed in exactly the same way in all
locations. A change in the customer is much easier to implement if that data is stored only in the
Customers table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table
for the address of a particular customer, it may not make sense to look there for the salary of the
employee who calls on that customer. The employee's salary is related to, or dependent on, the
employee and thus should be moved to the Employees table. Inconsistent dependencies can
make data difficult to access; the path to find the data may be missing or broken.
Goals
- Ensuring data dependencies make sense.
- Role of Normalization.
Role of Normalization
In the E-R model, a conceptual schema using an entity- relationship diagram is built and then
mapped to a set of relations. This technique ensures that each entity has information about only
one thing and once the conceptual schema is mapped into a set of relations, each relation would
have information about only one thing. The relations thus obtained would normally not suffer from
any of the anomalies. This bottom-up approach is likely to lead to a relation that is likely to suffer
from all the problemsn. For example, the relation is highly likely to have redundant information and
update, deletion and insertion anomalies. Normalization of such large relation will then be essential
to avoid (or at least minimize) these problems.
Now to define the normal forms more formally, we first need to define the concept of functional
dependence.
Functional Dependency
Consider a relation R that has two attributes A and B. The attribute B of the relation is functionally
dependent on the attribute A if and only if for each value of A no more than one value of B is
associated. In other words, the value of attribute A uniquely determines the value of B and if there
were several tuples that had the same value of A then all these tuples will have an identical value
of attribute B.
If B is functionally dependent on A (or A functionally determines B).
For example, in the student database that we have discussed earlier, we have the following
functional dependencies:
sno -> sname
sno -> address
cno -> cname
cno -> instructor
instructor -> office
These functional dependencies imply that there can be only one name for each sno, only one
address for each student and only one subject name for each cno. It is of course possible that
several students may have the same name and several students may live at the same address. If
we consider cno -> instructor, the dependency implies that no subject can have more than one
instructor (perhaps this is not a very realistic assumption). Functional dependencies therefore
place constraints on what information the database may store. In the above example, one may be
wondering if the following FDs hold
sname -> sno
cname -> cno
First Normal Form (1NF)
A table satisfying the properties of a relation is said to be in first normal form. As discussed in an
earlier chapter, a relation cannot have multivalued or composite attributes. This is what the 1NF
requires.
A relation is in 1NF if and only if all underlying domains contain atomic values only.
The first normal form deals only with the basic structure of the relation and does not resolve the
problems of redundant information or the anomalies discussed earlier. All relations discussed in
these notes are in 1NF.
For example consider the following example relation:
student (sno, sname, dob)
Add some other attributes so it has anomalies and is not in 2NF
The attribute dob is the date of birth and the primary key of the relation is sno with the functional
dependencies sno -> sname and sno -> dob. The relation is in 1NF as long as dob is considered
an atomic value and not consisting of three components (day, month, year). The above relation of
course suffers from all the anomalies that we have discussed earlier and needs to be normalized.
(add example with date of birth)
Second Normal Form (2NF)
The second normal form attempts to deal with the problems that are identified with the relation
above that is in 1NF. The aim of second normal form is to ensure that all information in one
relation is only about one thing.
A relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate
key of the relation.
To understand the above definition of 2NF we need to define the concept of key attributes. Each
attribute of a relation that participates in at least one candidate key of is a key attribute of the
relation. All other attributes are called non-key.
The concept of 2NF requires that all attributes that are not part of a candidate key be fully
dependent on each candidate key. If we consider the relation
student (sno, sname, cno, cname)
and the functional dependencies
sno -> cname
cno -> cname
and assume that (sno, cno) is the only candidate key (and therefore the primary key), the relation
is not in 2NF since sname and cname are not fully dependent on the key. The above relation
suffers from the same anomalies and repetition of information as discussed above since sname
and cname will be repeated. To resolve these difficulties we could remove those attributes from
the relation that are not fully dependent on the candidate keys of the relations. Therefore we
decompose the relation into the following projections of the original relation:
S1 (sno, sname)
S2 (cno, cname)
SC (sno, cno)
Use an example that leaves one relation in 2NF but not in 3NF.
We may recover the original relation by taking the natural join of the three relations.
If however we assume that sname and cname are unique and therefore we have the following
candidate keys
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
The above relation is now in 2NF since the relation has no non-key attributes. The relation still has
the same problems as before but it then does satisfy the requirements of 2NF. Higher level
normalization is needed to resolve such problems with relations that are in 2NF and further
normalization will result in decomposition of such relations.
Third Normal Form (3NF)
Although transforming a relation that is not in 2NF into a number of relations that are in 2NF
removes many of the anomalies that appear in the relation that was not in 2NF, not all anomalies
are removed and further normalization is sometime needed to ensure further removal of
anomalies. These anomalies arise because a 2NF relation may have attributes that are not directly
related to the thing that is being described by the candidate keys of the relation. Let us first define
the 3NF.
A relation R is in third normal form if it is in 2NF and every non-key attribute of R is non-transitively
dependent on each candidate key of R.
To understand the third normal form, we need to define transitive dependence which is based on
one of Armstrong's axioms. Let A, B and C be three attributes of a relation R such that A -> B and
B -> C. From these FDs, we may derive A -> C. As noted earlier, this dependence A -> C is
transitive.
The 3NF differs from the 2NF in that all non-key attributes in 3NF are required to be directly
dependent on each candidate key of the relation. The 3NF therefore insists, in the words of Kent
(1983) that all facts in the relation are about the key (or the thing that the key identifies), the whole
key and nothing but the key. If some attributes are dependent on the keys transitively then that is
an indication that those attributes provide information not about the key but about a kno-key
attribute. So the information is not directly about the key, although it obviously is related to the key.
Consider the following relation
subject (cno, cname, instructor, office)
Assume that cname is not unique and therefore cno is the only candidate key. The following
functional dependencies exist
sno -> cname
cno -> instructor
instructor -> office
We can derive cno -> office from the above functional dependencies and therefore the above
relation is in 2NF. The relation is however not in 3NF since office is not directly dependent on cno.
This transitive dependence is an indication that the relation has information about more than one
thing (viz. course and instructor) and should therefore be decomposed. The primary difficulty with
the above relation is that an instructor might be responsible for several subjects and therefore his
office address may need to be repeated many times. This leads to all the problems that we
identified at the beginning of this chapter. To overcome these difficulties we need to decompose
the above relation in the following two relations:
s (cno, cname, instructor)
ins (instructor, office)
s is now in 3NF and so is ins.
An alternate decomposition of the relation subject is possible: