The slide deck for the presentation I plan to give at Buffalo Barcamp on Febuary 2, 2013 is at http://cjwertz.files.wordpress.com/2013/01/jobs.pdf
Maybe, I’ll get back to this and write something later on. If you look around this blog,you will find several other entries on this general topic.
These are notes for a presentation on Database Normalization I will be giving at the Database Seminar meetup on January 16, 2013. This is a relatively concise discussion, suitable for an hour or hour and a half talk. More detailed discussions with examples can be found in most database texts such as: Introduction to Database Systems, C.J. Date, Addison-Wesley; Data Management, Richard T. Watson, Wiley.
The slides for this talk are available here.
According to my Random House College Dictionary, the word normalization means, “to cause to conform to a standard or norm.” In mathematics this usually means to make variables comparable to or compatible with each other. In computer hardware, normalization is the process of aligning floating point numbers to facilitate arithmetic. In database design, normalization reduces a collection of data to its most basic form.
Slide 3 is a representation of a database table containing the columns empno, empname, deptno, and deptname. This is not normalized because the fact that a particular deptn and deptname go together can appear in multiple rows. Should deptname for a particular deptno change, it must be updated in more than one row (slide 4). Deleting an employee could result in the deletion of a department (slide 5). Adding a new department can result in a malformed row (slide 6). These are referred to as update, deletion, and insertion anomalies.Slide 7 shows the normalized version. The anomalies are eliminated and the original data can be recovered by joining the tables. If many queries are performed, performance may suffer. If many updates are performed, there may be a performance gain. Most importantly, the normalized version represents the true nature of the data.
Let’s review some basic notions about relational databases. A table (relation) contains columns (attributes) that belong together because they are somehow related. Each row of a table (relation) is unique. No two rows(tuples) are identical. The key is defined as the column(s) that can be used to tell one row (tuple) from all others. This is a sticky point. We tend to want the key to identify the data we are searching for. This is not always the case. Foreign keys are used to form relationships with other tables (relations). We can say that the key determines the non key elements or the non key elements are functionally dependent on the key. In simpler English, the non key columns or attributes should represent single valued facts about the key. This could mean one and only one employee first name and one and only one department name are associated with an employee number. It’s that simple, but you do have to understand the data. Is it true that one and only one department name is associated with an employee number? What about history (slide 14)?
The normal forms are rules for avoiding ill-formed relations. These rules can also be used to correct ill-formed relations – they are the don’ts – but why not do things right the first time if you can?
First normal form (slide 18) is achieved when there are no repeating columns or groups. Each non-key element must be determined by the key.
Second normal form (slide 20) is achieved when each non-key element is determined by the entire key. No non-key element should be determined by part of the key.
Third normal form (slide 22) is achieved when each non-key element is determined by the key and not by some other element which is, in turn, determined by the key. There should be no transitive dependencies.
Boyce/Codd normal form provides a rule for resolving a special case involving two or more composite overlapping keys. Since this doesn’t come up that often, we’ll gloss over it in this presentation.
The forms we’ve discussed so far deal with single valued facts. The two that remain deal with multi-valued facts. The connection between employees and skills might be an example of a multi-valued fact. Each employee may possess multiple skills and each skill may be possessed by multiple employees.
Fourth normal form (slide 24) prohibits storing two or more multivalued facts in the same table. There are problems with the key if this rule is broken.
Fifth normal form (slide 26) prohibits a misinterpretation of what is and is not a multivalued fact. If an employee performs a particular task on a particular project, this is really a three way relationship between employee, project, and task. Misinterpreting this as two or three two way relationships will result in the loss of information. It will not be possible to reconstruct the original facts by joining the tables. This is referred to as a lossy join situation. (I used to force my students to sit through an elaborate demonstration of this; I’ll just ask you to take my word for it.)
We can restate all of this in another way. A database is normalized if each non-key attribute in each table is determined by the key, the whole key, and nothing but the key.
We had some interesting discussion and questions about normalization. I’d like to reiterate my considered answer to a couple of the questions. It will always be beneficial to first discover the fully normalized form. Subsequently, some will feel they have valid reasons for creating surrogate or fake keys that are not actual data. These may, for example, make queries easier. If, however, you want to maintain data integrity and consistency, you will have to enforce uniqueness and referential integrity based on the keys identified by normalization. This might be accomplished by creating unique indexes or by creating database triggers. Using application code is the least desirable. We can say similar things about denormalization to achieve performance. I think I may have been more of a purist than I should have been during the presentation.
© Charlie Wertz, January 2013
Relational database theory, generally credited to E. F. Codd, provides a complete and mathematically coherent way of storing, viewing, and manipulating data. (See A Relational Model of Data for Large Shared Data Banks; E. F. Codd, IBM Research Laboratory, San Jose, Ca; Communications of the ACM, June 1970.) This so-called relational model is a collection of rules and definitions that provide for storing and manipulating data. SQL is a language that was subsequently developed for expressing relational manipulations. SQL does not always adhere strictly to relational theory. SQL also contains features and facilities that are not part of the theory.
Oracle, DB2, Access, Sybase, and Informix are examples of database management systems that provide access to data by means of SQL. These and others are specific software products that may be purchased from specific vendors. Since it isn’t such an orderly world, there are different dialects of SQL. While the basics are the same, details do vary from one product to another. The examples shown in this document utilize an SQL dialect provided by Oracle in the past. Oracle does now support a newer standard, but to the best of my knowledge, these examples will work with a newer version of Oracle.
IMHO (in my humble opinion), you cannot really understand SQL unless you understand some relational database concepts.
Codd’s rationale for the relational model goes like this (in his own words):
Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
Three of the principal kinds of data dependencies which [still] need to be removed are: ordering dependence, indexing dependence, and access path dependence.
Application programs which take advantage of the stored ordering of a file are likely to fail to operate correctly if for some reason it becomes necessary to replace that ordering by a different one. Similar remarks hold for a stored ordering implemented by means of pointers.
Application programs and terminal activities [should] remain invariant as indices come and go
Application programs developed to work with [older] systems tend to be logically impaired if the trees or networks are changed in structure.
Consider the case of Customers, Orders, and Products, discussed in the last article.(The following will be clearer if you look at the slides for this article.)
Using a hierarchical representation, we might consider Products as components of orders and orders as related to Customers if our goal is to learn what Products have been ordered by a particular Customer. (Customer –> Order –> Product) Alternatively, if our goal is to learn which Customers have ordered particular products, we might proceed from Products to Orders to Customers. (Product –> Order –> Customer) To be useful, a hierarchical database must somehow maintain both representations. It is necessary to either replicate the data or deploy a fair dose of smoke and mirrors.
A network model providing one linkage between Customers and Orders and another between Orders and Products is conceptually simpler and allows for direct access to either Customers or Orders. (Customer –> Order <–> Product) Unfortunately, this approach requires a program to be aware of the structure and contain a strategy for traversing the connections.
And, in either case, if the structure is changed, the program will no longer work.
A relational database may be viewed as a collection of named tables (referred to as relations), consisting of rows (referred to as tuples) and columns (referred to as attributes). Rows are identified by data content. Columns are identified by names.
Within a given table, no two columns may have the same name. Within a given table, all data contained within a given column is of the same type. Relational theory refers to this as a domain. A particular column might contain product identifiers for example. Two different tables may contain columns with the same name. Column name qualified by table name will still be unique.
Within a given table, no two rows may contain the same data. (SQL systems sometimes allow this requirement to be violated and have been criticized for this.) The only way to tell one row from another is by specifying the data values for one or more columns. Thus, if two rows are allowed to contain the same data, they cannot be identified or manipulated individually. (SQL systems often provide for a fake column called row id, but this is not part of relational theory; in fact, it is contrary to the theory.)
The column or set of columns that can be used to differentiate one row from another is referred to as the primary key. One column may serve as an adequate key. Or, it may take several. It may be necessary to view the combination of all columns in the table as the key. A table containing duplicate rows really has no key. It is possible for a table to contain more than one possible key. A column or combination of columns that might serve as a primary key is called a candidate key.
Relationships between tables are implemented solely by data values. Two rows from two different tables, each containing identical values for one or more specified columns will be considered related. If a column or set of columns of table A contains values referring to the primary key of table B, those columns in table A are referred to as a foreign key. Foreign keys usually maintain significant relationships.
Rows of a table are generally not considered to be stored in any particular physical sequence. (Mathematical sets aren’t ordered.) Often they will appear in a sequence reflecting the order of entry. Even then, there is no guarantee that the results of a specific query will appear in any particular sequence. Relational database management systems often provide mechanisms for returning the data into a particular sequence. This is usually accomplished through indexing or sorting. This is not a component of relational theory.
Relational theory general deals with the logic, semantics, and meaning of the data. Physical details are left to the implementers of specific database management systems.
The following examples refer to a database discussed in Chapter 4 of the Fourth Edition of Data Management: Databases and Organizations by Richard T. Watson © 2004. It looks like this.
STOCK (STKCODE, STKFIRM, STKPRICE, STKQTY, STKDIV, STKPE, NATCODE) Each row represents an individual stock. STKCODE is a unique identifer. NATCODE is a reference to the NATION table.) NATION (NATCODE NATNAME EXCHRATE) Each row represents an individual nation. NATCODE is a unique identifier.
Before the relational approach was introduced, database programming and access were procedural, navigational, and record by record. This means it was necessary to develop a program or step by step procedure and a specific record accessing strategy like the following.
open stock open nation open report while not eof(stock) get next row of data from stock search for matching nation data from nation put report line() end while close all files
This might be replaced by SQL similar to this.
select * from stock, nation where nation.natcode = stock.natcode;
Well, that’s not 100% accurate. The SQL will return as many rows of data as satisfy the query. If this is embedded in a program and multiple rows are returned it will be be necessary to examine the result row by row. Details will vary from one programming environment to another. It might go something like this.
execute query open report while not eof(result) get next row of data from result put report line() end while close report
Using the relational approach, we write a statement that specifies the desired output and leaves it to the system to handle the details. This is often referred to as a nonprocedural approach. It also provides for the manipulation of sets of data. While this approach is very powerful, it does mean that we must be very exact in writing the specification.
In the stock and nation example, stkcode is the primary key for stock, nation.natcode is the primary key for nation, stock.natcode is a foreign key indicating which row from nation is related to a specific row from stock.
Relational algebra and relational calculus provide a mathematically complete collection of operations for manipulating relations. The subtle differences between relational algebra and relational calculus are not important to us here. Both provide the same set of operations: select (sometimes called restrict), project, join, product, union, intersection, difference, and division. Both the algebra and the calculus are based on mathematical set theory and treat tables or relations as mathematical sets. They allow us to specify manipulations on sets or collections of data.
The possibility of NULL values is another distinguishing concept. NULL means there is no value. NULL is not the same as zero or blank.
For more detailed discussions of relational theory, see:
An Introduction to Database Systems, Volume I, 6th Edition
C. J. Date
The Relational Model for Database Management
E. F. Codd
An understanding the nature of select (sometimes called restrict), project, join, product, union, intersection, difference, and division provides a good foundation for understanding SQL. Each of these is a fundamental operation that may be performed on one or more relations. The result of applying one of these operations to one or more relations is always another relation. This is often referred to as closure. It is an important property.
The select operator produces a new table or relation containing some subset of the rows contained in the original table. The sql statement
SELECT * FROM SHR WHERE SHRPE < 12;
creates a new relation containing only the rows or tuples from the table SHR that match the condition SHRPE < 12. In SQL, “*” means show all the columns or attributes in the table.
SELECT * FROM SHR;
creates a new relation containing all rows or tuples from SHR. SHR can be thought of as a subset of itself. The SQL keyword SELECT really identifies any query. It is the addition of some selection criterion that makes the query a relational select. Omitting any criterion results in the selection of the entire table. Mathematically, this is OK, but it may be confusing. The term restrict is sometimes used instead of select to avoid confusion. Observe that we apply select or restrict to an entire table or relation. Depending on the selection criterion and the data the resulting relation may contain all rows from the original, some of the rows, a single row, or even no rows.
The project operator produces a new table or relation containing some subset of the columns or attributes contained in the original table. In SQL, this is specified by listing the desired attributes.
SELECT SHRFIRM, SHRPE FROM SHR;
This is one of the places where SQL and the theory diverge. The result of a project operation should be a relation. A relation does not contain any duplicate rows. Given just the right data the preceding SQL statement would produce a result containing duplicate rows. To be sure of a true project, it is necessary to write
SELECT DISTINCT SHRFIRM, SHRPE FROM SHR;
The keyword DISTINCT in this context calls for the elimination of any duplicate rows. In practice, it is necessary to understand the data, to know if duplicate rows might result, and to know if they should be displayed.
Restrict and project are often combined in one SQL statement
SELECT SHRFIRM, SHRPRICE, SHRQTY, SHRDIV FROM SHR WHERE SHRQTY >= 100000;
Union, intersection, and difference provide ways of operating on two tables.
Union calls for all rows that appear in either table.
You might write
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 UNION SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRQTY > 100000;
This is really the same as writing
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 OR SHRQTY > 100000;
The latter makes more sense if we are forming a union of subsets from the same table.
Intersection calls for any row that appears in both tables.
You might write
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 INTERSECT SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRQTY > 100000;
This is really the same as writing
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 AND SHRQTY > 100000;
The latter makes more sense if we are forming an intersection of subsets from the same table.
Difference calls for all rows that appear in one table but not the other
You might write
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 MINUS SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRQTY > 100000;
This is really the same as writing
SELECT SHRCODE, SHRFIRM, SHRPRICE, SHRQTY FROM SHR WHERE SHRPRICE > 30 AND SHRQTY <= 100000;
If it’s <= 100000 it’s NOT > 100000.
The latter makes more sense if we are forming a difference of subsets from the same table.
The SQL UNION, INTERSECT, and MINUS operations automatically do away with duplicate rows. DISTINCT is not necessary.
You might think of union, intersection, and difference as vertical combinations of tables.
TABLE1 <combine> TABLE2
The join operation calls for a horizontal combination.
TABLE1 <combine> TABLE2
It’s necessary to specify some matching condition. (This is similar the first example we looked at).
SELECT STKCODE, NATION.NATCODE FROM STOCK, NATION WHERE STOCK.NATCODE = NATION.NATCODE;
Produces a result containing matching combinations. This is called the inner join. If some row in either stock or nation has no matching row, it will not appear in the result.
Suppose the data were as follows.
STOCK NATION STKCODE NATCODE NATCODE AA ZZ VV BB ZZ WW CC XX ZZ DD WW
The result of the inner join would be like this.
STOCK NATCODE AA ZZ BB ZZ DD WW
Notice that both AA and BB match with ZZ. The rows for stock CC and for nation VV are lost because they do not match with the other table. Sometimes this is what you want. Sometimes it isn’t.
The outer join will include rows that have no match. The outer join is usually classified as left or right.
Given the data above, the left outer join would produce this result.
STOCK NATCODE AA ZZ BB ZZ CC NULL DD WW
Some products provide for use of the terms inner, left and right in sql statements like this.
SELECT STKCODE, NATION.NATCODE FROM STOCK LEFT JOIN NATION ON STOCK.NATCODE = NATION.NATCODE;
Oracle does not use this syntax. The equivalent oracle query is like this.
SELECT STKCODE, NATION.NATCODE FROM STOCK, NATION WHERE STOCK.NATCODE = NATION.NATCODE(+);
A similar result could be produced by writing this.
SELECT STKCODE, NATION.NATCODE FROM STOCK, NATION WHERE STOCK.NATCODE = NATION.NATCODE UNION SELECT STKCODE, NULL FROM STOCK WHERE NATCODE NOT IN (SELECT NATCODE FROM NATION);
Product calls for all possible combinations of all rows from both tables. In SQL, it results from omitting the matching condition. Suppose the following.
SELECT STKCODE, NATION.NATCODE FROM STOCK, NATION;
Given the data above, the result would be.
STKCODE NATCODE AA VV AA WW AA ZZ BB VV BB WW BB ZZ and so on.
This isn’t particularly useful. It usually means you’ve made a mistake. Including it makes the mathematics complete.
Divide is not an SQL command. It is, however, defined as a component of the relational algebra and calculus. According to Date, “DIVIDE takes two relations, one binary and one unary, and builds a relation consisting of all values of one attribute of the binary relation that match (in the other attribute) all values in the unary relation.” That’s a mouthful! We won’t pursue this one.
Most relational database management systems maintain the so-called ACID properties:
Atomicity – a transaction updating multiple items will be completed in its entirety or it will not be done at all;
Consistency – a consistent initial state will be retained after a transaction;
Isolation -concurrent transactions will be processed as if the occurred one at a time;
and Durability – the system can recover from errors and failures.
We’ve discussed these features at greater length in a previous article. We’ve also discussed journaling, commit and rollback, locking, and the separation of the external and storage views. A common technique involves the mapping of one or more tables to one or more tablespaces which are in turn mapped to one or more files which may be stored on one or more devices.
Modern relational database management systems also provide for distribution, replication and sharding. They usually also provide for Database Procedures (program modules which are thought of as residing in the database) and Triggers (Database Procedures to be executed when particular events occur).
They also provide for referential integrity. This means that in a table containing a foreign key, the data value for any foreign key must match the primary key of a row in the table to which it refers.
Relational systems also typically incorporate security mechanisms.
Slides for the presentation are available at http://cjwertz.files.wordpress.com/2012/12/relations.pdf
© Charlie Wertz, November 2012
These are notes for the second meeting of the Database Seminar to be held at Buffalo Lab on Wednesday, October 24, at 7:00 PM. The topic is Files and Data Structures.
We noted last time that the use of the operating system’s file system frees us from the complexity and tedious detail of managing data at the hardware level and dealing with many different formats and addressing schemes.
It’s still useful to have some concept of what’s going on under the covers. We can view a file as if it were a sequential array of records or blocks or sectors or allocation units. If, however, data were stored this way, it would be very difficult to manage space effectively and efficiently. We’d have to plan in advance for each file to occupy a contiguous space and as files were added and deleted, odd sized amounts of space would become available resulting in wasted space. This is probably not as big an issue as it was in the “old days,” but it still matters. So, the operating system allocates space for a file in relatively small fixed size units, maintains a record of where each file is on the device and allows us the illusion of a contiguous array; this can be accessed sequentially or randomly by referring to a displacement from “the beginning.” Unix and Linux systems commonly use an inode table for this purpose; current versions of Windows have a Master File Table.
The point here is that this will thwart any attempt we make to be sure that two data items are physically adjacent unless we are able to make sure they reside in the same allocation unit. Given the nature of modern storage devices, the use of solid state devices, and the use of in memory caching, this is less of an issue than it was in “the good old days,” but we still look to minimize the number of trips to the data well when we can.
We typically categorize files as sequential, random, and indexed.
Sequential access is simple enough. We work our way through the file a piece at a time in order. The system will be kind enough to read or write the “next” record for us. Many “legacy’ systems, designed when the use of magnetic tape storage was common, use this technique. Several features are worth noting. It is only suitable for batch processing. The ordering of the data within the file is important. If we want to deal with a specific record, we examine the file a record at a time until we find what we are looking for. Updates are usually accomplished by creating a new copy reflecting the changes, additions, and deletions. This is still an effective way to apply large numbers of transactions to a file, if the updates can be processed as a batch. It is possible to maintain a logically sequential file as a linked list instead of relying on the physical sequence. As far as I know, this is a special case and is not common.
Random processing involves the utilization of an algorithm that translates a record identifier or key into a location within the file. This must be repeatable. The same key must produce the same address every time so we can retrieve the data when we want it. These are usually cousins of pseudo random number generators. This can be very fast for retrieving relatively small amounts of data when we know what we are looking for. Sequential access in key sequence is usually not possible because the the data will not be stored that way. It’s necessary to include extra room in the file space owing to the way the addresses are generated and it’s also necessary to deal with the possibility that two keys may map to the same address.
Indexing is a special way of gaining the advantages of random access. Here, the system maintains a separate file or table or partition that tracks the key values and the location of the data relating to each key. The data may still be stored sequentially by key but it wouldn’t have to be. We may, if we like, maintain multiple indexes, allowing access to the data by more than one key or search argument. Obviously, if we attempt to manage the sequence of the data within the file, it can only be ordered according to one key. Retrievals and Boolean types of searches can be extremely efficient and fast. The downside of multiple indexes is that the require additional storage and updates and additions require the maintenance of all the indexes. While a primitive index might be in key sequence, this would me expensive to maintain. Usually some sort of tree structure is used. This sort of thing is probably best left to the master programmers who work on file systems and operation systems.
Just a reminder: every time I’ve mentioned the sequence of the data, it’s really been about the illusory or logical sequence because the file system is likely deceiving us about the physical sequence; this has recently become even more the case. Solid state storage devices add another layer of deception. Here the device presents the illusion of a sequential array to the filesystem but is free by incorporating a Flash Translation Layer to employ a physical allocation that is quite different. This is for a couple reasons. One is wear leveling. Because each storage location within the device is only good for some finite number of write cycles, the device attempts to assure that each cell receives the same use. It’s also necessary to deal with updates, additions, and deletions within the device’s allocation units. This is where trim/unmap and garbage collection come in.
What does all this mean to us as database designers and administrators? Once upon a time, database administrators devoted a lot of energy to managing exactly where data items were stored on disk. Fortunately, modern devices are much faster because this is nearly impossible now. We can still attempt to limit accesses by making things easy to find and by forcing related items into the same allocation unit.
Now we’re on to the so-called data models built into database management systems. When we say data model here, we’re referring to a style or technique used for visualizing data and relationships. There are really only a small number of these. They’re easier to picture than to describe.
We want to take a look at these because some of the newer object oriented and nosql systems appear to embody some of the same ideas.
Hierarchical structures are tree-like. For example, Customers place Orders calling for specific products. Usually, these would be stored by following each Customer segment by the appropriate Order segments. Each Order segment is, of course, followed by the appropriate Product segments.
Unfortunately, it’s not as simple as it seems. There are a couple catches. First, a single hierarchy is usually not good enough. In addition to wanting to find products on orders from specific customers, we might also like to find Customers who ordered a specific product. A cumbersome solution would involve multiple indexes into the hierarchy, but, generally you’d really like to maintain more than one hierarchy. Second, you’d also like to see the description of a single customer stored but once, the description of a single product stored but once, and so on. The solution is to create the multiple hierarchies and provision many of the segments with only pointers to the one copy of the data. IBMs one time premier IMS DB/DC system referred to these alternate hierarchies as logical databases. It was actually fairly straightforward for a programmer to access the appropriate hierarchy and find all desired information about a given customer or product. To make this possible, the data base administrator would have created an elaborate structure of pointers, smoke, and mirrors that made it possible without storing multiple copies of the data. The important thing to remember about this is that if the database design doesn’t provide explicitly for a desired connection, you’re sunk; there is no easy way to get from there to here and a restructuring of a large and complex database is difficult and time consuming.
The main alternative approach deployed prior to the advent of relational systems, was the so-called network database. Here there would be a single Customer record, a single Product record, and a single order record. The relationships between specific instances or each would be maintained by linked lists or pointer chains. Conceptually, it seems simpler than the hierarchical approach and it is easier on the database administrator. The database is, however, a maze of pointers; it’s also harder on the programmer. The thought process is one of navigating the database. by more or less hopping from record to record. Find the desired customer. Find the first order. Find the first product. Find each product in sequence. Find the next order and so on. The important thing to remember about this is that if the database design doesn’t provide explicitly for a desired connection, you’re sunk; there is no easy way to get from there to here and a restructuring of a large and complex database is difficult and time consuming.
Another thing we’d like to note about both network and hierarchical database management systems is that the user or programmer views may be quite different from the way the data is stored. This notion was addressed in the 1970’s by the ansi/sparc committee. (American National Standards Institute, Standards Planning And Requirements Committee) As Wikipedia tells us it’s an abstract design standard for a Database Management System, calling for three views of the data: a conceptual level describing the semantics and meaning of a database; and internal or physical level describing the way the data is stored in all its gory detail; one or many external or user views describing the way a particular program or user sees the data; and mappings that translate one to another. I don’t know that anyone ever really implemented this in software, but some folks claimed to have done so and we had a lot of fun talking about it.
Slides for this presentation are available at http://cjwertz.files.wordpress.com/2012/10/structure.pdf.
© Charlie Wertz, October 2012
These are my notes for a presentation on database integrity to be given in October of 2012 or later as part of a database seminar series I am conducting at Buffalo Lab.
At about the same time relational databases became popular, people began discussing the desirability of a database system affording the so-called acid properties: atomicity, consistency, isolation, and durability. Atomicity means that a transaction involving multiple updates will either be completed in its entirety or not done at all. If, for example, funds are transferred from one bank account to another, both account balances will be altered or neither will. Consistency means that the database will always be left in a correct and consistent state. Given that we begin in a consistent state, atomicity and the other properties assure that it will remain in a consistent state. Isolation means that if if two or more transactions are interleaved, the result will be as if they were accomplished one at a time. Durability means the system provides ways to recover from error or failure.
The first line of defense is the provision of some way to commit a transaction and mark it as complete and some way to undo or rollback to a previous consistent state. Thus a program can change its mind based on circumstances and rollback whatever it has done or it can signal completion by committing. In addition a return code is tested for success or failure after each operation and the program can then determine whether to rollback or proceed to commit. This is usually achieved via some sort of journalling mechanism whereby the database system has a record of both the prior state and new state of each data item involved in a transaction. The journal can be played against a backup copy of the database to recover from a catastrophic failure. Because the latter can be problematic for a truly large database, replication is often a better solution.
The next problem that must be addressed is the concurrent update problem. If two independent transactions attempt to update the same record at the same time, one may overwrite the other. The traditional solution allows the first transaction to lock the record, requiring the second to wait for the record to be released. This is fine if a transaction updates a single record, but when transactions update multiple records, it is possible for each to end up waiting for the other in a so-called deadlock or deadly embrace. In this case, the DBMS must return an error to one or the other and the transaction receiving the error must rollback so the other can proceed.
The traditional method of dealing with this is referred to as pessimistic locking. Assuming that conflicts are likely, it can introduce considerable overhead and can become impractical in large or highly distributed systems.
Optimistic locking does away with locks as such. Here, when a transaction makes an update to a record, a timestamp is updated. Then, each transaction must check the timestamp to assure that the record has not been changed by some other transaction in the interim since it was read. An alternative involves checking all data in a record against a previous known state. Whichever alternative is chosen, this technique generally requires more sophisticated programming.
The preceding has been the briefest possible description. Specific details vary from DBMS to DBMS and programming language to programming language.
Slides for this presentation are available at http://cjwertz.files.wordpress.com/2012/10/acid.pdf
© Charlie Wertz, October 2012
I’m writing this particular blog entry in order to collect my thoughts for my presentation at the kickoff of a database seminar to be held at BuffaloLab on September 26, 2012. I may also present this at Rochester BarCamp next month.
In simpler times, when I had hair, computerized data were stored directly into whatever filesystem was available on the hardware to be used. (The very earliest systems had no filesystem as such and data was written directly to devices.) Files or named collections of data could be sequential, random, or indexed. Sequential files are read or written, one piece or record at a time in the order in which they added to the file. Updates are generally made by creating a new copy of the entire file. In so-called random access, a mathematical algorithm translates a key or identifier into a specific location on the storage device or in the file in a repeatable manner. An indexed filesystem maintains a separate index or directory that keeps track of the connections between identifiers and storage locations. In each of these situations, updates are accomplished by replacing the old data with the new. While there are variations on these themes and ways of combining them, this is really all there is. The important point here is that any program accessing the data must contain a detailed description of the data and its structure and must also contain detailed procedures for storing and accessing the data. This was usually inflexible; any change to the data or its structure would require changes to all programs involved. It was also necessary to write special programs for reporting and querying.
In the 1960s and 1970s, software vendors began to develop database management systems to simplify data storage and access. Such a system usually incorporates a specific data model or approach to organizing data. The usual suspects are hierarchical, networked, and indexed. Hierarchical systems organize data into a tree like structure similar to a typical organization chart. Network systems create a spider web of data wherein one record “knows” the physical location of another. Indexed systems, sometimes referred to as inverted file systems, employ multiple indexes to allow access by a variety of data elements. Each of these has advantages and drawbacks. What’s significant is that, for each, it is necessary to develop a database design employing the specific data model at hand. This usually involves torturing the logic of the information to make it fit the required structure. While many technical details could move out of the application programs and into the DBMS, it was still necessary for the program to contain detailed knowledge or the manner in which the data had been structured. We talked and wrote about navigating the database and changes to the structure still required sometimes painful program changes. There were some general tools for reporting and querying, but they were limited. These database management systems usually also provided for concurrency control, recovery from errors, backup and restore, and the like. They also moved concerns about data storage locations and other technical issues out of individual programs; external or programmer/user views of the database may be very different from the internal or storage view. Typical products you might research include IMS, IDMS, Total, System 2000, and Adabas; this is an incomplete list.
In the 1970s and 1980s, Edgar F. (Ted) Codd (our hero) introduced the concepts of relational databases. Codd was a mathematician who developed a coherent and logically consistent approach to viewing and manipulating data. The relational model, based on mathematical set theory, provides well defined operations for manipulating collections of data organized into a tabular format. An important feature of this design is that the system is mathematically complete; this means that an operation on two or more tables or relations produces another table which may then by processed by the same set of operations. Collections or sets of data are manipulated rather than individual rows, tuples, or records. Once again, the external or programmer/user views of the database may be very different from the internal or storage view.
Relational database management systems typically contain all the features of the older style systems combined with a dictionary or metadata store that is itself a relational database. There is usually a powerful data manipulation language based on the relational operations. SQL is by far the most common. More modern systems usually provide for distributed data, replication, and sharding. They generally also provide for stored procedures which are executed within the database management system and provide consistency and enhance performance by eliminating code from application programs. Oracle, DB2, MySQL,SQL Server, and Postgres are typical examples.
There have been two styles of relational database management systems. Some, such as Microsoft Access are complete program development systems containing relational style database management in addition to tools for creating input and output. Today, M$ Access is the best know of these; there have been others such as dBase and Foxpro. (Access is sometimes used as a client side system that communicates with a back end or server.) These usually do not play well with external programming languages and usually lack many of the features of what might be called full featured database management systems. The full featured systems are designed to live in a separate address space or on one or more servers separate from the application code and are typically capable of interacting with a variety of programming languages.
Be the database management system relational or based on one of the older data models, database design is an important issue. In those early days, performance, response, and throughput were key issues resulting in designs that did not accurately reflect the true logic of the data. Codd is generally credited with developing the concept of normalization, a mathematical approach to storing the data in accordance with its meaning and logic.
Relational systems are frequently used to provide Data Warehouses or Data Marts. This are usually created by extracting data from production systems, scrubbing or editing it, then transforming it into a standardized form, then loading it into a separate database. These databases are usually considered a coherent and consistent snapshot that can be used for analysis. They are generally large collections of historical data. The criteria for and effectiveness of techniques for scrubbing, standardizing and merging data from disparate sources becomes a big issue.
Since the introduction of relational systems, object-oriented programming has become widespread. Object-oriented code generally deals with objects representing real world entities such as customers and products. Typically, an object maps to multiple tables or record types in the database. People talk about the object/relational impedance mismatch; it becomes necessary to to update many separate pieces when anything is changed; consistency and management of concurrency become major issues. There have been and still are so-called object-oriented database management systems such as Gemstone. In my experience this frequently look like network systems internally, have not been widely adopted, and are used for specialized applications.
As time has passed and as the use of the internet has expanded, systems have changed. In some cases, database sizes, numbers of users,and transaction volumes have grown to a scale unthinkable just a few years ago. Internet based systems have also become highly interactive. Think Facebook, Amazon, and Google. As these truly massive systems have grown, it’s become necessary to develop highly distributed systems running on hundreds and possibly thousands of servers. Databases are now partitioned or sharded and duplicated or replicated. Managing concurrency, consistency, and synchronization then becomes a major challenge. Some simpler approaches include denormalization and cached or in memory databases. Implementation of systems in the cloud(s) also raises new and unique challenges.
Vendors of more traditional relational systems such as Oracle and Postgres have made many performance oriented additions to their systems. There are also many new database management systems, commonly lumped together and labeled NOSQL (often thought of as not only SQL). These are variously identified as Document Stores, Graph Databases, and Key Value Stores or column databases. Some use SQL; some do not. Their names are legion: Akiban, BigTable, Cassandra, Couchbase, FlockDB, Hadoop, MongoDB, Neo4J, Prism, and RIAK – to name some of the better known.
It’s like the days when Chairman Mao asked for a thousand flowers to bloom. I hope the results won’t be as catastrophic and I wonder if these new systems are for everybody. There are surely applications requiring extraordinary measures, but not everyone is like Google, Facebook, and Amazon. Denormalization is a mixed bag that can make maintenance and change more expensive and difficult. So-called SQL systems are relatively easy to program to; they provide well for querying and reporting; and when appropriately scaled, they provide well for consistency and concurrency control. My recommendation is to look before you leap and determine whether or not the features, reliability, and support available with a new product suit your needs.
Slides for this talk are at
© Charlie Wertz, September 2012
I am writing these notes as preparation for a 10 minute talk I expect to give at Buffalo Barcamp #5 on May 30, 2012. (It is unlikely I will say exactly what you see here; it is unlikely I will include every point I have here; it is likely I will add something not included here.)
As the election draws closer, we are likely to hear more and more about the economy, jobs, the trade deficit, taxes, the federal budget deficit, who is to blame for our problems, and how to fix all of our economic troubles with one easy action. I will be more and more tempted to close my ears and mind because most of this will be overly simplistic and very misleading. While we do have some significant economic problems, neither the causes nor the solutions are simple and easy to implement. Complex systems are difficult to diagnose and repair. I don’t intend this to be a political talk; I would like to point out a couple things we should be thinking about.
For some, jobs are hard to find right now – especially well-paying and rewarding jobs. At the same time, prospective employers are complaining qualified employees are hard to find. Why is this? More than one thing is behind it. Many of our difficulties result from two things: the United States is not alone in the world and there is a long term trend to replace human labor with the work of machines. We need to adjust our thinking to deal with these.
The United States is not alone in the world. As recently as one hundred years ago, the oceans separated us from the rest of the world and it was easy to view our economy as a distinct and separate entity. When World War II ended in 1945, The manufacturing capabilities of most, if not all, European and Asian countries had been bombed out of existence. Thanks to those oceans, our plants were intact. Thanks to that war, there was a ready market here and abroad for anything we could produce. There was also a tremendous pent up demand as a result of five or more years of deprivation. The good times rolled for American Corporations and both management and labor seemed to think this would last forever. This really did not catch up with us in a significant manner until the 1970s and 1980s.
There are plenty of different ideas about why we have a significant trade deficit and how it might be rectified. I wanted to mention it, but I will now resist the temptation to go on about it at great length and move on to the main topic for this blog and talk. The United States is still a major manufacturer; we are first or second in the world depending on who does the counting and how they do it. Also, it is likely that even if all of the manufacturing that has gone overseas came back, all the jobs that have disappeared would not.
There is a long term trend here. In the late 1800s and into the 1900s, mechanization and automation replaced a lot of workers in agriculture. After some troubled times, more and more people found work in manufacturing and in services. In the later 1900s and on into the 2000s, mechanization and automation have been eliminating jobs in manufacturing. A lot of other jobs have disappeared as well. I can remember when people were employed as elevator operators and attendants who pumped gas for you, to give a couple homely examples.
The evidence suggests this will continue. In one of myrecent blogs, Can we stop robots from eating our lunch, I noted that recently, Farhod Manjoo authored a series of articles at slate.com based on the theme, Will Robots Steal Your Job. He looks at the possibilities for automated pharmacists, doctors, writers, lawyers, and scientists and his findings are sobering. A recent CNN article, Could a computer write this story, tells us that Chicago-based company Narrative Science has set out to prove that computers can tell stories good enough for a fickle human audience and it has created a program that takes raw data and turns it into a story, a system that’s worked well enough for the company to earn its own byline on Forbes.com. An Inside Higher Ed article, Score One for the Robo-Tutors, tells us that in experiments at six public universities, students assigned randomly to statistics courses that relied heavily on “machine-guided learning” software — with reduced face time with instructors — did just as well, in less time, as their counterparts in traditional, instructor-centric versions of the courses. Take my word for this – I have many more articles and links I could cite here, but I’ll curb the urge to go on and on.
The direct labor content of goods and services is decreasing and the evidence suggests it will continue to decrease. If a market economy would work as perfectly as some economics texts suggest, one of two things would happen; we’d all have to work less to enjoy our present standard of living or we’d all be able to enjoy more goods and services. Both things have partially happened; 60 and 80 hour weeks are gone for most of us and we do have many things our parents didn’t dream of.
But, as has been the case in the past, this transition isn’t happening as smoothly as we’d like to see. Some people are benefiting; others are being left behind. A Bloomberg Businessweek article, The Recovery Squeezes the Middle Class, observes that “There’s no better time to be a talented entrepreneur,” [but,] “There’s no worse time to be a middle-class person without special skills.”
Is there a solution to this? What is it? It may be that things have to work out slowly and somewhat painfully as has been the case in the past. There may be some things we can do.
Probably the most important thing is to understand what is happening. The global economy, the housing bubble, and the fiscal crisis are all things that need our attention. That’s not enough. We also need to understand that the ways goods and services are produced are changing and the ways the fruits of work are distributed may also need to change. In recent times, many people who still have jobs are working harder and harder while many others who are not working at all rely on unemployment insurance or welfare for survival.
It’s likely not possible to simply reduce hours for the employed, move the unemployed into those positions, and adjust wages appropriately, but we do need policies that will encourage this over the long term.
Education clearly seems to be a significant part of the long term solution. This works two ways. More people need to be appropriately educated to fill the so-called knowledge jobs. We could also do a much better job of preparing people for hands on kinds of jobs such as plumbing and drywall. We should also recognize that education can better prepare people to take advantage of any increased leisure. Finally, people need to be better prepared for a lifetime of learning.
While this seems a little bit of a dream, we need to understand that we are not dealing with a simplistic case of productive workers and lazy bums who won’t work. I could go on about this at greater length, but I don’t have time. My goal is to get you thinking about it.
Several of my other blog entries deal with this in one way or another.
- Can we stop robots from eating our lunch?
- manufacturing + regulation = dilemma
- Can Our Manufacturing Really Be Gone?
- Economics: There Are Other Countries Out There
The slides for the presentation are here.
© Charlie Wertz, May 2012