Database Normalization

Database System Concepts: Database Normalization During the construction of a database, it is tremendously beneficial to have a good understanding of database system concepts. The database system concept I will discuss here is database normalization. The term 'database normalization' relates to the concept of reducing redundant datathroughout a database.

Redundant data is problematic. If the same value (example: name information) is being stored in multiple places, it is possible for an update to occur (example: last name to change) in one place, but not the other. By normalizing a database, the database administrator is removing this possible problem.

Instead of having a piece of data located in multiple locations, there is a link to the one spot where the data is. A database can meet several models for being normalized.

Normal Forms: One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an "idealized schema". (The mathematical supports come back into play in proofs regarding the process of transforming from one form to another.)

Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is widely considered to be sufficient for many practical applications. Most tables when reaching 3NF are also in BCNF. 4NF and 5NF are further extensions, and 6NF only applies to temporal databases. Normalizing beyond 3NF can be tricky with current SQL technology as of 2005, but a non-fully normalized database may be vulnerable to data corruption (referred to as update anomalies). Full normalization, even when not fully implemented in the target technology, is considered a good exercise to help discover all potential internal database consistency problems.

First Normal Form: First Normal Form ( 1NF ) is also known as atomic normal form. The word atomic, or atom, comes to us from the Greek word atom that mean indivisible. We now know that atoms are divisible, but the atomic normal form is not meant to be. The field data within first normal form should store the least amount of information possible (making the field atomic / indivisible). This breakdown of field data will result in no redundant columns.

Database systems during the time Codd was writing were primitive. Newer databases now support abstract data types and other data-storage methods that usually offer better performance for the management of such data. However, such methods could be considered denormalization, as they often require the hierarchical model, which was abandoned mostly due to its inflexibility, to be reintroduced into the architecture of the system.

Second Normal Form: Second normal form ( 2NF ) prescribes full functional dependancy on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e. not partially dependent on candidate key).

Third normal form: Third normal form ( 3NF ) requires that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A table is in 3NF if none of the non-primary key attributes is a fact about any other non-primary key attribute. In summary, all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).

Other forms of Normalization include Boyce-Codd Normal Form, Fourth Normal Form, Fifth Normal Form, Domain/key Normal Form, and Sixth Normal Form.

A functional dependency (FD) is a constraint between two sets of attributes in a relation from a database. Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) iff each X value is associated with precisely one Y value. Customarily we call X the determinant set and Y the dependent attribute. Thus, given a tuple and the values of the attributes in X, one can determine the corresponding value of the Y attribute. For the purposes of simplicity, given that X and Y are sets of attributes in R, X → Y denotes that X functionally determines each of the member of Y - in this case Y is known as the dependent set. Thus, a candidate key is a minimal set of attributes that functionally determine all of the attributes in a relation. (Note: the "function" being discussed in "functional dependency" is the function of identification.) A functional dependency FD: is called trivial if Y is a subset of X.

The determination of functional dependencies is an important part of designing databases in the relational model, and in database normalization and denormalization. The functional dependencies, along with the attribute domains, are selected so as to generate constraints that would exclude as much data inappropriate to the user domain from the system as possible.

For example, suppose one is designing a system to track vehicles and the capacity of their engines. Each vehicle has a unique vehicle identification number (VIN). One would write VIN → EngineCapacity because it would be inappropriate for a vehicle's engine to have more than one capacity. (Assuming, in this case, that vehicles only have one engine.) However, EngineCapacity → VIN, is incorrect because there could be many vehicles with the same engine capacity.

This functional dependency may suggest that the attribute EngineCapacity be placed in a relation with candidate key VIN. However, that may not always be appropriate. For example, if that functional dependency occurs as a result of the transitive functional dependencies then that would not result in a normalized relation.

Irreducible function depending set

A functional depending set S is not irreducible if the set has three following properties:

1. Each right set of a functional dependency of S contains only one attribute.

2. Each left set of a functional dependency of S is irreducible. It means that reducing any one attribute from left set will change the content of S (S will lose some information).

3. Reducing any functional dependency will change the content of S.

Sets of functional dependencies with these properties are also called canonical or minimal.

Properties of functional dependencies

Given that X, Y, and Z are sets of attributes in a relation R, one can derive several properties of functional dependencies. Among the most important are Armstrong's axioms, which are used in database normalization:

1. Subset Property (Axiom of Reflexivity): If Y is a subset of X, then X → Y
2. Augmentation (Axiom of Augmentation): If X → Y, then XZ → YZ
3. Transitivity (Axiom of Transitivity): If X → Y and Y → Z, then X → Z

From these rules, we can derive these secondary rules:

Union: If X → Y and X → Z, then X → YZ
Decomposition: If X → YZ, then X → Y and X → Z
Pseudotransitivity: If X → Y and YZ → W, then XZ → W
Accumulation: If X → YZ and Z → V, then X → YZV
Extension: If X → Y and W → Z, then WX → YZ

Equivalent sets of functional dependencies are called covers of each other. Every set of functional dependencies has a canonical cover

Multivalued Dependencies Recall that when we discussed database modelling using the E-R Modelling technique, we noted difficulties that can arise when an entity has multivalue attributes. It was because in the relational model, if all of the information about such entity is to be represented in one relation, it will be necessary to repeat all the information other than the multivalue attribute value to represent all the information that we wish to represent. This results in many tuples about the same instance of the entity in the relation and the relation having a composite key (the entity id and the mutlivalued attribute). Of course the other option suggested was to represent this multivalue information in a separate relation. The situation of course becomes much worse if an entity has more than one multivalued attributes and these values are represented in one relation by a number of tuples for each entity instance such that every value of one the multivalued attributes appears with every value of the second multivalued attribute to maintain consistency. The multivalued dependency relates to this problem when more than one multivalued attributes exist. Consider the following relation that represents an entity employee that has one mutlivalued attribute proj:

emp (e#, dept, salary, proj)We have so far considered normalization based on functional dependencies; dependencies that apply only to single-valued facts. For example, e# -> dept implies only one dept value for each value of e#. Not all information in a database is single-valued, for example, proj in an employee relation may be the list of all projects that the employee is currently working on. Although e# determines the list of all projects that an employee is working on, e# -> proj is not a functional dependency. So far we have dealt with multivalued facts about an entity by having a separate relation for that multivalue attribute and then inserting a tuple for each value of that fact. This resulted in composite keys since the multivalued fact must form part of the key. In none of our examples so far have we dealt with an entity having more than one multivalued attribute in one relation. We do so now. The fourth and fifth normal forms deal with multivalued dependencies. Before discussing the 4NF and 5NF we discuss the following example to illustrate the concept of multivalued dependency.

programmer (emp_name, qualifications, languages) The above relation includes two multivalued attributes of entity programmer; qualifications and languages. There are no functional dependencies.

The attributes qualifications and languages are assumed independent of each other. If we were to consider qualifications and languages separate entities, we would have two relationships (one between employees and qualifications and the other between employees and programming languages). Both the above relationships are many-to-many i.e. one programmer could have several qualifications and may know several programming languages. Also one qualification may be obtained by several programmers and one programming language may be known to many programmers.

The above relation is therefore in 3NF (even in BCNF) but it still has some disadvantages. Suppose a programmer has several qualifications (B.Sc, Dip. Comp. Sc, etc) and is proficient in several programming languages; how should this information be represented? There are several possibilities.

Other variations are possible (we remind the reader that there is no relationship between qualifications and programming languages). All these variations have some disadvantages. If the information is repeated we face the same problems of repeated information and anomalies as we did when second or third normal form conditions are violated. If there is no repetition, there are still some difficulties with search, insertions and deletions. For example, the role of NULL values in the above relations is confusing. Also the candidate key in the above relations is (emp name, qualifications, language) and existential integrity requires that no NULLs be specified. These problems may be overcome by decomposing a relation like the one above as follows:

Enroll now!

Theoretical Paper

Practical Paper

Industrial Training

Database Normalization