Data Normalization

 

Start with Views

  1. Design the forms (input) and reports (output) your user wants to see
  2. try to anticipate user needs
  3. What questions will user want to ask?
  4. What information will they need as part of the response?
  5. How might the data change in the future?
  6. Each of these is a view.

Entity - Relationship model

  1. Break view into entities
  2. Describe relationships between entities
  3. Frequently use some sort of diagram tool
  4. This should imply a set of tables
  5. The tables will (probably) not yet be normalized
  6. They will have problems

Un-normalized view (Schedule)

Section Course # Course name # seats # enrolled Room  time days  Instructor Students
B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris Andrea
Bob
Carolyn
Deanne
Problems with this approach
Could enter incorrect course number
Course name and number might not match
# enrolled could not match the actual number
We need much more information on students, instructor
Many students per record - very confusing, potential problems here
 
 

First Normal Form (1NF):

Section Course # Course name # seats # enrolled Room  time days  Instructor Students
B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris Andrea
B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris Bob
B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris Carolyn
B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris Deanne
 
 

Discussion:

Second Normal Form (2NF)

Goal of 2NF - Make tables with single field keys
Strategy:

Third normal form (3NF)

Goal of 3NF - Separate tables into functional dependancies
strategy: The problem: The tables I may need: Obviously some of these (esp. students) will require same treatment. Let's just get sections working correctly for now.

Section

section (PK) course Room Time Days Instructor
B513 N335(FK) SL247(FK) 4:00 T, R A. Harris (FK)
The sections table is now in 3NF. It contains only the data needed at this level. Whenever possible, a foreign key to another table is provided.

Later when we build views, we can re-construct values from the constituent tables

Courses

Number (PK) Name Lab type needed
N335 Advanced Visual Basic Windows with VB 6.0
This table contains information about any particular course. We only need to enter (and update) the course title in one place.

Rooms

Number (PK) Equipment type Number seats
SL247 Windows NT 33
This table contains the data needed for classroom information. If any table contains a reference to the room number, any details can be garnered from this table

Instructors

Instructor ID First Name Last Name Office Phone Skills Current courses
A. Harris Andy Harris SL280D 274-8491 N100, N201, N207, N241,
N301, N331, N335, N341,
N345, N355
N301, N335, N341
Note that this table is not normalized yet, as the skills and current courses fields contain listed data. How might we solve this?

Students

Student ID First Name Last Name Courses Taken Current courses
111-11-1111 Andrea Anderson N241, N331 N335
222-22-2222 Bob Billingsly N241, N331, N341 N335
333-33-3333 Carolyn Carmicheal N331 N341, N335
444-44-4444 Deanne Doyle N241, N331, N341, N351 N311, N335
Again, this table is not yet normalized, as it has listed fields.

Link Tables

In circumstances like this where you still have lists inside otherwise normalized tables, you can often utilize a special table that consists entirely of foreiegn keys. For example, consider the student table above. I could completely remove the courses_taken field from this table, and create another table, like this:

Students_Courses

ID Student_ID Course_ID
1 111-11-1111 N241
2 111-11-1111 N331
3 222-22-2222 N241
This type of table is called a "link table", because it is used to link two other tables. It really doesn't need any new information. Its job is to supply linkages between existing data. To get a listing of all students currently taking N241, you could write a query like this:
SELECT student.LName FROM 
students INNER JOIN students_courses ON
student.ID = students_courses.student_ID WHERE
students_courses.course_ID = "N241"
To complete the normalization process, you would make two different tables to link teachers to courses - one to identify courses an instructor is currently teaching, and another to indicte courses an instructor can teach.

Advantages of normalization

Simpler updates

If a database is well designed, you update data in only one place

integrity

Each entity is described only once, so it is impossible to have one student with two different names, for example

No need for 'dummy data'

In an unnormalized database, you are sometimes required to enter dummy data. For example, what if you hired a teacher but had not yet assigned her to a section. In our first table, you could not put instructor information in without a section number, as the section was the primary key. In our 3NF version, the section number is fine.

Some Views of this data:

Scheduler View
Field Section Course Course Name Seats avail Enrolled Room Time Days Instructor
Derivation Section.section Section.Course Courses.name Rooms.seats Students.count(Current courses = section.section) Section.Room Section.time Section.Date Instructors.LName
Data B513 N335 Adv. VB 33 24 SL247 4:00 T, R Harris
Note that students turned out to not be necessary in this view. Also, although this is based on the section table, that table contained fewer fields than this. We were able to derive many fields from other tables, or through a formula.