Note 5 Relational Database Design and Normalization
Author: Zhen Tong 120090694@link.cuhk.edu.cn Lecturer: Jan YOU
Before Everything
This note will mainly focus on what makes a suitable schema.
A Good Design
Before learning what a good design is, let’s first see what a bad one looks like.
Update Anomaly
Take the example from the lecture:
If we change to another design like this:
EMP_PROJ
EmpSsn | Pnumber | Ename | Pname | No_hours |
We combine the Employee with the Project, which can be a bad design, because, now when we want to change the name of this project from "Billing" to "Customer Accounting" (presumably because the project's focus has changed), this update operation may have unintended consequences.
The anomaly is that, to change the name for "P1", the Pname value must be updated in every tuple of every employee who works on project "P1"; if any of those tuples is missed, the database ends up with inconsistent project names. The update operation therefore affects far more data than what was initially intended.
This problem is called Update Anomaly.
Insertion Anomaly
The same schema can also cause an insertion anomaly:
- Cannot naturally insert a project unless an employee is assigned to it
- Conversely, cannot insert an employee unless he/she is assigned to a project
This part of the statement suggests that you can't add information about an employee to the database unless that employee is already assigned to a project and vice versa. In other words, employee records depend on project assignments, and you can't add employees unless they are working on a project.
Redundancy
Besides insertion and update anomaly, redundancy is also a problem. Look at this table below:
In the EMP_PROJ table, Ename, Pname, and Plocation are stored redundantly. Someone may object: “These data are not redundant, they are useful! You can access them directly without a join, so it saves time!”
To answer this question, let’s do some math. Suppose you have m employees and n projects, and every employee works on every project; then the number of redundant name copies grows on the order of m × n. Every time you add one new entry to either employees or projects, you consume roughly n or m extra units of space for the redundant data. In the original (separate) design, only about m + n units are needed to store the names.
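As a quick worked example with hypothetical numbers, suppose m = 100 employees and n = 20 projects, and every employee works on every project:

$$
m \cdot n = 100 \cdot 20 = 2000 \qquad \text{vs.} \qquad m + n = 100 + 20 = 120
$$

That is, about 2000 combined EMP_PROJ rows each repeat an Ename and a Pname, whereas the separate EMPLOYEE and PROJECT tables store only 120 name entries in total.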
First Normal Form
Now that we know what a bad design looks like, let’s introduce the main idea of this note: normalization.
In 1NF, a relational schema is organized in a specific way to ensure that data is stored and retrieved consistently and unambiguously.
The key idea of the First Normal Form is the concept of an "atomic" domain, which means that the values in a column (attribute) of a relational table should be indivisible or considered as single, discrete units. In other words, each value in a column should be a simple, elementary piece of data, not a collection of related or grouped data.
For example, a university has courses like “csc3170” and “mat2050”, where the leading letters are the subject and the digits form a unique number within that subject. Such a course code is not an atomic domain, because it can be further decomposed.
Also, let’s see how we can improve the EMP_PROJ example above.
Do you still remember the update problem we had? We decompose EMP_PROJ into relations EMP_PROJ1 and EMP_PROJ2 by propagating the primary key. Then we can update either of them independently.
Functional Dependency and Decomposition
From the two examples above, we know that normalization always decomposes a larger design into smaller ones. But is the decomposed structure always a good design? What problems can it cause, and how can we detect them?
First, let’s define an important relationship among attributes: the functional dependency.
A functional dependency is denoted X → Y, where X is the determinant attribute(s) and Y is the dependent attribute(s). It means that for every unique combination of values in X, there is exactly one corresponding value of Y.
Here's an example to illustrate functional dependency:
Let's say we have a table called "Students" with the following attributes:
- Student_ID (unique identifier for each student)
- Student_Name (the name of the student)
- Student_Age (the age of the student)
- Student_Grade (the grade or class the student is in)
In this scenario, we can observe the following functional dependencies (a small code sketch for checking such dependencies against actual rows is given after the list):
- Student_ID → Student_Name:
- Every unique Student_ID is associated with a unique Student_Name. For a given Student_ID, there is only one corresponding Student_Name.
- Student_ID → Student_Age:
- Similarly, each Student_ID is associated with a unique Student_Age. For a given Student_ID, there is only one corresponding Student_Age.
- Student_ID → Student_Grade:
- Again, there is a functional dependency. Each Student_ID is associated with a unique Student_Grade.
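To make this concrete, here is a minimal sketch in Python (the dict-based rows and the sample data are hypothetical, introduced only for illustration). It checks whether an FD X → Y holds in a set of rows by recording, for each X-value, the Y-value it maps to, and failing as soon as two rows disagree.

```python
def fd_holds(rows, X, Y):
    """Check whether the functional dependency X -> Y holds in `rows`.

    rows: list of dicts, one dict per tuple
    X, Y: lists of attribute names
    """
    seen = {}  # maps each X-value tuple to the single Y-value tuple it determines
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False  # the same X-value maps to two different Y-values
        seen[x_val] = y_val
    return True

# Hypothetical sample rows for the Students relation
students = [
    {"Student_ID": 1, "Student_Name": "Alice", "Student_Age": 20, "Student_Grade": "A"},
    {"Student_ID": 2, "Student_Name": "Bob", "Student_Age": 21, "Student_Grade": "B"},
]

print(fd_holds(students, ["Student_ID"], ["Student_Name"]))  # True
```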
Lossless Decomposition
Then, let’s come back to the question: “Is the decomposed structure always a good design?” Well, the answer is no; only a lossless decomposition is good.
For example
If we decompose the employee schema into two relations that share only the name attribute, then when we need the combined data we join them back. If two people happen to have identical names, the join produces 2 × 2 = 4 tuples for that name, whereas the right answer is 2, because there are only two people.
Formally, let a relation schema R be decomposed into R1 and R2 with R1 ∪ R2 = R. The decomposition is lossless if, for every legal instance r of R, joining the two projections gives back exactly r, i.e. Π_R1(r) ⋈ Π_R2(r) = r. Conversely, the decomposition is lossy if Π_R1(r) ⋈ Π_R2(r) ⊋ r, i.e. the join contains spurious tuples.
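Here is a small sketch of that lossy case, assuming the employee schema is split so that the two parts share only the name attribute (the attribute names and rows are made up for illustration). The natural join of the two projections produces spurious tuples:

```python
def project(rows, attrs):
    """Project a list of dict-tuples onto `attrs`, removing duplicates."""
    result = []
    for row in rows:
        p = {a: row[a] for a in attrs}
        if p not in result:
            result.append(p)
    return result

def natural_join(r1, r2):
    """Natural join of two lists of dict-tuples on their common attributes."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [
        {**t1, **t2}
        for t1 in r1
        for t2 in r2
        if all(t1[a] == t2[a] for a in common)
    ]

# Hypothetical instance: two different people happen to share the name "Kim"
employee = [
    {"ID": 1, "name": "Kim", "city": "Shenzhen"},
    {"ID": 2, "name": "Kim", "city": "Hong Kong"},
]

r1 = project(employee, ["ID", "name"])    # schema (ID, name)
r2 = project(employee, ["name", "city"])  # schema (name, city)

joined = natural_join(r1, r2)
print(len(employee), len(joined))  # 2 4 -> two spurious tuples, so the decomposition is lossy
```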
Closure
Closure is a set-based concept, and I want to spend some extra time and effort to define everything clearly.
A "closure" refers to the set of all functional dependencies that can be logically implied or derived from a given set of functional dependencies. The closure helps us determine all the relationships between attributes (columns) in a database based on the initial set of dependencies.
Here's an explanation of the concept:
- Set of Functional Dependencies (F): This is the initial set of functional dependencies that you have in a database. Each functional dependency in this set consists of an implication between two sets of attributes. For example, "A -> B" means that attribute A determines attribute B.
- Closure of F (F+): The closure of F, denoted as F+, is the set of all functional dependencies that can be logically derived from the initial set F. In other words, F+ contains all the dependencies that are implied by the transitive and reflexive properties of functional dependencies.
- Transitive Property: If you have functional dependencies A -> B and B -> C, you can infer the dependency A -> C. This means that if A determines B, and B determines C, then A determines C.
- Reflexive Property: a set of attributes determines every subset of itself; in particular, every attribute determines itself, so A → A always holds.
By applying these properties, you can determine all the implied functional dependencies within the database. The closure F+ ensures that you have a complete and consistent understanding of how attributes depend on one another in your database schema.
Here's an example to illustrate the concept:
Suppose you have the following set of functional dependencies:
- A → B
- B → C
From these dependencies, you can derive the following implied dependencies:
- A → C (using the transitive property)
So, in this case, the closure F+ includes:
- A → B
- B → C
- A → C
The closure F+ provides a comprehensive understanding of the relationships between attributes based on the initial set of functional dependencies.
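In practice, the most useful computation related to F+ is the attribute closure X⁺, the set of all attributes functionally determined by X. Below is a minimal sketch of the standard fixed-point algorithm, using the A → B, B → C example above; representing FDs as pairs of Python sets is just an assumption for illustration.

```python
def attribute_closure(attrs, fds):
    """Compute the closure of the attribute set `attrs` under the FDs in `fds`.

    attrs: iterable of attribute names, e.g. {"A"}
    fds:   list of (lhs, rhs) pairs of attribute sets, e.g. [({"A"}, {"B"})]
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything on the left is already known, the right side follows.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# FDs from the example: A -> B and B -> C
F = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(attribute_closure({"A"}, F)))  # ['A', 'B', 'C'] -> so A -> C is implied
```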
Keys and Functional Dependencies
I hope you still remember the concepts of superkey, candidate key, primary key, and secondary key. Here is a recap:
- K is a superkey for relation schema R if and only if K → R
- K is a candidate key for R if and only if
    - K → R, and
    - α → R holds for no proper subset α ⊂ K (that is, K is minimal)
- One of the candidate keys is designated to be the primary key, and the others can be called secondary keys
- All data fields other than the pk can be called secondary key
Here we introduce some new concepts:
- A prime attribute is a member of some candidate key
- If there is only one candidate key – the primary key – then a prime attribute is an attribute that is a member of the primary key
- A nonprime attribute is not a prime attribute; that is, it is not a member of any candidate key
- If there is only one candidate key – the primary key – then a nonprime attribute is an attribute that is not a member of the primary key
Trivial Functional Dependencies
- When Y ⊆ X and X → Y holds, the FD is trivial
It is called trivial because it doesn’t tell you any extra information; the dependency is already contained in X itself.
For example:
- Ssn, Pnumber → Ssn
- Ename → Ename
Lossless Decomposition and Functional Dependency Property
From above, we know what a lossless decomposition is: joining the projections gives back exactly the original relation. This condition is equivalent to the following property.
A decomposition of R into R1 and R2 is a lossless decomposition if at least one of the following dependencies is in the closure F+ of the functional dependencies:
- R1 ∩ R2 → R1
- R1 ∩ R2 → R2
Let me further explain this condition. R1 ∩ R2 is the set of common attributes we match on when performing the join; I like to picture it as the zipper teeth that zip the two tables back together.
The condition says that these common attributes must determine all attributes of at least one side, so that each zipper tooth finds exactly one counterpart on that side. Imagine what the joined table would look like if a tooth could find more than one counterpart: the zipper would get longer than the original, producing spurious rows, and the cloth would no longer close properly.
Here is an example: decompose in_dep(ID, name, salary, dept_name, building, budget), which appears again in the BCNF section below, into instructor(ID, name, salary, dept_name) and department(dept_name, building, budget).
Because instructor ∩ department = {dept_name},
and dept_name → dept_name, building, budget (i.e. dept_name → department),
this is a lossless decomposition.
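Here is a small sketch of this lossless-join test, reusing the attribute-closure algorithm from the Closure section. The relation and FD names follow the in_dep example; the FD ID → name, salary, dept_name is an assumption consistent with ID being the instructor key.

```python
def attribute_closure(attrs, fds):
    """Fixed-point attribute closure of `attrs` under `fds` (as in the Closure section)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def is_lossless(r1, r2, fds):
    """Binary lossless-join test: (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 must be in F+."""
    common = r1 & r2
    closure = attribute_closure(common, fds)
    return r1 <= closure or r2 <= closure

instructor = {"ID", "name", "salary", "dept_name"}
department = {"dept_name", "building", "budget"}
F = [({"ID"}, {"name", "salary", "dept_name"}),
     ({"dept_name"}, {"building", "budget"})]

print(is_lossless(instructor, department, F))  # True: {dept_name} determines all of department
```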
Second Normal Form
Full functional dependency: an FD Y → Z where removal of any attribute from Y means the FD does not hold anymore.
A relation schema R is in second normal form (2NF) if it is in first normal form, and every nonprime attribute A in R is fully functionally dependent on the primary key (assuming it is the only candidate key).
What we improve when moving from a 1NF design with partial FDs to a 2NF design with full FDs is, once again, decomposition: we iteratively check the nonprime attributes, separate out the partial FDs, and turn them into full FDs.
Eliminating Partial Functional Dependencies
2NF helps reduce data redundancy by ensuring that non-prime attributes depend on the entire primary key. This means that for a relation to be in 2NF, non-prime attributes are functionally dependent on the entire primary key, and there are no partial dependencies. By adhering to 2NF, you store data efficiently, without duplication.
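As a rough illustration of the 2NF check (the composite key (Ssn, Pnumber) and the FD list below are assumptions modeled on the EMP_PROJ example used in this note), the following sketch flags nonprime attributes that depend on only a proper part of the primary key:

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """Fixed-point attribute closure of `attrs` under `fds`."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def partial_dependencies(primary_key, nonprime, fds):
    """Return (proper_subset_of_key, nonprime_attribute) pairs that violate 2NF."""
    violations = []
    for size in range(1, len(primary_key)):
        for subset in combinations(sorted(primary_key), size):
            closure = attribute_closure(set(subset), fds)
            for a in sorted(nonprime & closure):
                violations.append((set(subset), a))
    return violations

# Assumed FDs for EMP_PROJ with primary key (Ssn, Pnumber)
F = [({"Ssn", "Pnumber"}, {"Hours"}),
     ({"Ssn"}, {"Ename"}),
     ({"Pnumber"}, {"Pname", "Plocation"})]

print(partial_dependencies({"Ssn", "Pnumber"},
                           {"Ename", "Pname", "Plocation", "Hours"}, F))
# [({'Pnumber'}, 'Plocation'), ({'Pnumber'}, 'Pname'), ({'Ssn'}, 'Ename')]
```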
Third Normal Form
Transitive functional dependency: an FD X→Z that can be derived from two FDs X → Y and Y → Z, where Y is a nonprime attribute
3NF:
A relation schema R is in third normal form (3NF) if it is in 2NF and no nonprime attribute A in R is transitively dependent on the primary key (assuming it is the only candidate key)
Let’s take a further look at these two examples.
A tip when doing a 2NF normalization task is to look at the nonprime attributes one by one and check whether each depends on only part of the key. Hours depends on both Ssn and Pnumber; Ename depends only on Ssn; Pname and Plocation depend only on Pnumber. Therefore the nonprime attributes fall into three groups, and we decompose accordingly. Because there is no transitive functional dependency left, we stop at 2NF.
A tip when doing a 3NF normalization task is to look at the dependencies among the nonprime attributes themselves. In example (b) we do the same thing, checking the nonprime attributes one by one, and we find that EMP_DEPT is already in 2NF. However, because Dnumber is not a key of EMP_DEPT while Dname and Dmgr_ssn depend on it, we decompose it further to reach 3NF.
In this example (the LOTS relation), we first check for partial FDs and find that County_name → Tax_rate is a partial FD, so we decompose to reach 2NF. Then we find that Area → Price is a transitive FD, so we continue decomposing to reach 3NF.
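For concreteness, and assuming the usual textbook LOTS schema LOTS(Property_id, County_name, Lot#, Area, Price, Tax_rate) with candidate key (County_name, Lot#), the decomposition chain looks roughly like this:
- 2NF: split off the partial FD County_name → Tax_rate, giving LOTS1(Property_id, County_name, Lot#, Area, Price) and LOTS2(County_name, Tax_rate).
- 3NF: split off the transitive FD Area → Price inside LOTS1, giving LOTS1A(Property_id, County_name, Lot#, Area) and LOTS1B(Area, Price).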
General Normal Form
All the definitions of 1NF, 2NF, and 3NF above are based on the primary key only. Then, what if there are multiple candidate keys? The following statements broaden the definitions to that more general setting.
- prime attribute: any attribute that is a member of some candidate key is a prime attribute.
- All other attributes are called non-prime attributes
A relation schema R is in second normal form (2NF) if it is in 1NF, and every nonprime attribute A in R is fully functionally dependent on every candidate key of R. You can refer to the situation of LOTS above.
To be more specific, consider another example:
- A is pk of R
- (B, C) is a candidate of R
- D depends on A, and B, i.e. (A, B)→D
Then, does this structure violate 2NF? No. Although (A, B) → D is not a full dependency, A alone is a candidate key and A → D is automatically a full dependency on that single-attribute key; and since no FD says that D depends on B or C alone, D is not partially dependent on the candidate key (B, C) either.
Boyce-Codd Normal Form (BCNF)
Before introducing BCNF, let’s first recap what a superkey is:
- Superkey of relation schema R - a set of attributes S of R that contains a candidate key of R
Then, the BCNF is defined based on the concept of Superkey:
- A relation schema R is in BCNF with respect to a set F of functional dependencies if, for every functional dependency in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following statements holds:
    - α → β is trivial (i.e. β ⊆ α)
    - α is a superkey for R
If we call the left-hand side α of an FD a determinant, BCNF roughly says that every determinant of a non-trivial FD is a superkey.
For example:
in_dep
ID | name | salary | dept_name | building | budget |
If there is one FD: dept_name → building, budget
This in_dep is not in BCNF, because dept_name by itself is not a superkey (a superkey here would be {ID, dept_name}). Therefore, to bring in_dep into BCNF, we need to decompose it into instructor and department:
instructor
ID | name | salary | dept_name |
department
dept_name | building | budget |
Decomposing a Schema into BCNF
With the above example, we can now translate what we did into mathematical language. Let R be a schema that is not in BCNF, and let α → β be the FD that causes the violation, i.e. α is not a superkey of R. To make the decomposed schemas BCNF while keeping the decomposition lossless, we need to isolate the violating dependency. Here again is a metaphor: the BCNF violation is like a tumour on the body of the original schema, and the only fix is to cut out the bad part built around α → β while keeping the healthy part attached through α.
Therefore, R is decomposed into:
- α ∪ β
- R − (β − α)
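A minimal sketch of this single decomposition step, with schemas and FD sides represented as Python sets (the in_dep attributes and the violating FD mirror the example above):

```python
def bcnf_decompose_step(R, alpha, beta):
    """Split schema R on the BCNF-violating FD alpha -> beta.

    Returns the two schemas (alpha ∪ beta) and (R − (beta − alpha)).
    """
    r1 = alpha | beta
    r2 = R - (beta - alpha)
    return r1, r2

in_dep = {"ID", "name", "salary", "dept_name", "building", "budget"}
alpha, beta = {"dept_name"}, {"building", "budget"}  # the violating FD dept_name -> building, budget

department, instructor = bcnf_decompose_step(in_dep, alpha, beta)
print(sorted(department))  # ['budget', 'building', 'dept_name']
print(sorted(instructor))  # ['ID', 'dept_name', 'name', 'salary']
```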
However, although we try our best to preserve dependencies, i.e. to keep every FD of the original closure checkable within a single decomposed relation, we cannot guarantee that a BCNF decomposition is dependency preserving.
Third Normal Form Revisited
Every 2NF relation is in 1NF, and every 3NF relation is in 2NF; is every BCNF relation also in 3NF? 3NF was defined via transitive functional dependencies, while BCNF is defined via superkeys, so let’s restate the definition of 3NF in the same style:
- A relation R is in 3NF if, for every α → β in F+, it is not a harmful transitive functional dependency, which means at least one of the following holds:
    - α → β is trivial; a trivial dependency is not a transitive functional dependency.
    - α is a superkey for R; the determinant in a transitive functional dependency is a nonprime attribute, so if α is a superkey the dependency can be neither partial nor transitive through a nonprime attribute.
    - Each attribute A in β − α is contained in a candidate key for R. (NOTE: this condition does not say that a single candidate key must contain all the attributes in β − α; each attribute A in β − α may be contained in a different candidate key.)
Therefore, if a relation is in BCNF it is in 3NF (since in BCNF one of the first two conditions above must hold). The third condition is a minimal relaxation of BCNF that ensures dependency preservation.
Conversely, for a relation that is in 3NF but not in BCNF, there must be some nontrivial FD α → β where α is not a superkey but every attribute in β − α is a prime attribute.
For example, schema dept_advisor
s_ID | i_ID | dept_name |
The two candidate keys are {s_ID, dept_name}, {s_ID, i_ID }
- i_ID → dept_name
- s_ID, dept_name → i_ID
dept_advisor is not in BCNF because i_ID is not a superkey.
However, it is in 3NF: for the FD i_ID → dept_name, i_ID is NOT a superkey, but
- {dept_name} − {i_ID} = {dept_name}, and
- dept_name is contained in a candidate key
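As a sanity check, here is a small sketch that mechanically tests the three 3NF conditions for this example. The candidate keys are listed by hand to match the two keys above; the set-based representation is just an assumption for illustration.

```python
def attribute_closure(attrs, fds):
    """Fixed-point attribute closure, as in the Closure section."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def satisfies_3nf(R, fd, fds, candidate_keys):
    """Check the three 3NF conditions for a single FD alpha -> beta."""
    alpha, beta = fd
    if beta <= alpha:                       # 1. the FD is trivial
        return True
    if R <= attribute_closure(alpha, fds):  # 2. alpha is a superkey
        return True
    prime = set().union(*candidate_keys)    # 3. beta - alpha consists of prime attributes
    return (beta - alpha) <= prime

R = {"s_ID", "i_ID", "dept_name"}
F = [({"i_ID"}, {"dept_name"}),
     ({"s_ID", "dept_name"}, {"i_ID"})]
keys = [{"s_ID", "dept_name"}, {"s_ID", "i_ID"}]

print(all(satisfies_3nf(R, fd, F, keys) for fd in F))  # True: dept_advisor is in 3NF
```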
Pros and Cons in 3NF
It is always possible to obtain a 3NF design without sacrificing losslessness or dependency preservation
The relevant example here is the textbook schema R = (J, K, L) with FDs J K → L and L → K, whose candidate keys are J K and J L. By the definition of 3NF, this schema is in 3NF (K is a prime attribute).
However, it still allows repetition of information (the association between L and K is repeated several times), and it needs to use null values (e.g., to represent an association between L and K for which there is no corresponding value of J).
Pros and Cons in BCNF
The example here is a relation of the form (ID, name, phone), where one ID can be associated with several names and several phones. There are no non-trivial functional dependencies, and therefore the relation is in BCNF. It still suffers from insertion anomalies, though: if we add a phone 981-992-3443 for ID 99999, which is associated with both David and William, we need to add two tuples:
(99999, David, 981-992-3443) and (99999, William, 981-992-3443)