030416

Data models

by J. W. Rider
http://www.jwrider.com

Data models vary in both complexity and richness. However, all data models are equivalent in their ability to model information. What matters more when selecting a model is how well it matches the inherent structure of the problem being modeled. This structure varies as the problem is investigated and refined. In the beginning, when little is known about what the final model will be, the simplest, most flexible and least structured scheme provides the greatest freedom of expression. As time passes, it becomes more important to fashion the model using a scheme that closely matches the final implementation.

It's important to remember that data models are used for both conceptual and implementation purposes. Emphasizing one over the other may distort how one perceives the way models fit together. No one size fits all situations. Each model has strengths and weaknesses.

Typical data models

Flat
Hierarchical
Network (CODASYL)
Relational
Entity-Relationship
Object-Oriented
Multi-dimensional

Text file

Although it's not a typical approach, this stems from the premise that all information can be stored in written format. As far as data storage is concerned, the text file is unsurpassed in the quality and variety of information that may be stored. A text editor can be used to access/update information. There are no restrictions on what kinds of data may be stored or how long the individual data items may be.

No one usually thinks of a text file as a data base. However, the scheme shows the basic notion of storing information in files, and text files capture the notion of holding the information as an integral whole. Other schemes look at information in other ways.


Flat file

If text files capture information as the whole loaf of bread, then flat files capture information as slices of the loaf. Loosely, flat files view information as an ordered collection of "records." Well-defined subsets of each record are known loosely as "fields." However, there might only be a single field for each record.

Flat file fields may contain data of any data type. There is no requirement for fields to be named.

While records and fields may be fixed in size, there is no requirement for records or fields to be all the same size. Variable-length records and fields are usually implemented with some kind of separator between items. Some simple implementations of flat files are stored in text files with each record constituting a single line of text separated by end of line characters (linefeed characters or return-linefeed character pairs).

There is no requirement for records to be all the same type (that is, to contain the same kinds of fields in the same order). However, mixing record types within a single file is usually done to implement some other kind of data model.

The order of the records may be important within a flat file. The record "key" (a way to distinguish individual records from others in a collection) is usually the index of the record into the file. Duplicate records (records that contain the same content) are acceptable.

Flat files may be implemented in either text or binary form. There is no requirement to make flat files text-editor readable.

Examples of flat files include shopping lists, configuration files and single-page spreadsheets.
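The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample shopping list, the comma separator and the function name parse_flat_file are all invented for the example.

```python
# Hypothetical example data: a shopping list stored one record per line.
SHOPPING_LIST = "bread,2,loaves\nmilk,1,gallon\nbread,2,loaves\neggs\n"


def parse_flat_file(text, separator=","):
    """Split text into records (lines) and each record into fields."""
    records = []
    for line in text.splitlines():
        # Fields are unnamed, and records need not have the same length.
        records.append(line.split(separator))
    return records


records = parse_flat_file(SHOPPING_LIST)

# The record "key" is simply the index of the record into the file.
second_record = records[1]

# Duplicate records are acceptable in a flat file.
has_duplicates = records[0] == records[2]
```

Note that the last record has a single field, and that nothing in the file itself names the fields or says what they mean.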


Hierarchical

In flat files, records are deemed to exist at the same level. Hierarchical data files permit records to be grouped together. This allows a superior-subordinate or parent-child relationship (a single one-to-many relationship) to be defined between records. In simple forms, the superior or parent records are used to collect information that is common to all the subordinate/child records of the same group. This has an immediate effect of reducing redundancy within the data base.

The record "key" usually includes the "key" of all superiors/parents. Keys tend to recognize the order of a particular record within a group rather than within the data file. Duplicate records are acceptable.

Examples: topical outlines, organizational telephone directories, Microsoft Windows INI files, Microsoft Windows registry, automated report formats. Hierarchical data bases have a natural alignment with report formats, even if the information is being extracted for data bases implementing other models.
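As a rough sketch of the parent-child grouping, consider a small organizational telephone directory held as nested Python structures. The departments, names and extensions are made up, and the helper names lookup and flatten are assumptions for the example.

```python
# Hypothetical telephone directory: departments (parents) group the
# records of their staff (children).
directory = {
    "Engineering": {
        "location": "Building 2",  # stored once for the whole group
        "staff": [
            {"name": "Ada", "extension": "1001"},
            {"name": "Grace", "extension": "1002"},
        ],
    },
    "Sales": {
        "location": "Building 5",
        "staff": [
            {"name": "Ada", "extension": "2001"},
        ],
    },
}


def lookup(tree, parent_key, position):
    """Find a child by its parent's key plus its order within the group."""
    return tree[parent_key]["staff"][position]


def flatten(tree):
    """Repeat the parent data on every child, as a flat file would have to."""
    rows = []
    for department, group in tree.items():
        for person in group["staff"]:
            rows.append(
                (department, group["location"], person["name"], person["extension"])
            )
    return rows


record = lookup(directory, "Sales", 0)
flat_rows = flatten(directory)
```

The flattened rows repeat "Building 2" for each engineer, which is the redundancy that the hierarchical grouping avoids.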

Continuing with the staff of life analogy, while flat files look at information in slices, hierarchical data bases look at information as an ordered collection of different kinds of slices, like a club sandwich.


Network (CODASYL)

While hierarchical data bases emphasize the use of a single "path" to access all records, network data bases may provide multiple paths to locate individual records and sets of records. The term "network" has little to do with communications between computers. Instead, the "network" refers to the ways in which records may reference other records. The term "CODASYL" (from the abbreviated name of the group that formalized the use of such data bases) is often used, but it does not connote the multi-relationship nature and is usually inaccurate.

Records and fields tend to be fixed in size, but are storable at arbitrary locations within the data file (or files). The record "key" represents a physical location (e.g., file, block, and offset) in the data-base file structure.

Network data bases allow records to participate in multiple relationships or sets. Performance tends to be extraordinarily efficient. A well-designed network data base would permit application programs and extemporaneous queries simply to follow key links between records, without having to look individual records up through a separate index or directory.

Because of the efficiency, this model tends to be followed in DBMS implementations (in contrast with DB implementations) for disk storage. However, the network-like structure usually is hidden, quite deliberately, from DBMS users.
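A toy illustration of following key links might look like the Python below. It is a sketch under loud assumptions: a list stands in for the data file, a list position stands in for a physical file/block/offset address, and the field names (first_part, next_by_supplier, next_by_project) are hypothetical rather than taken from any CODASYL product.

```python
# A list stands in for the data file; a record's "key" is its position,
# a stand-in for a physical location in the file structure.
storage = [
    {"kind": "supplier", "name": "Acme", "first_part": 2},
    {"kind": "project", "name": "Bridge", "first_part": 2},
    # A part record may belong to two sets at once: one owned by a
    # supplier and one owned by a project. Each set is a chain of links.
    {"kind": "part", "name": "bolt", "next_by_supplier": 3, "next_by_project": None},
    {"kind": "part", "name": "nut", "next_by_supplier": None, "next_by_project": None},
]


def members(owner_key, link_field):
    """Follow key links from an owner record through one of its sets."""
    names = []
    key = storage[owner_key]["first_part"]
    while key is not None:
        record = storage[key]
        names.append(record["name"])
        key = record[link_field]
    return names


parts_from_acme = members(0, "next_by_supplier")
parts_in_bridge = members(1, "next_by_project")
```

The same "bolt" record is reached along two different paths, and neither traversal consults a separate index.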

If hierarchical data bases look at information like an assembly-line sandwich, network data bases look at information as a food service table in a delicatessen where sandwiches may be ordered custom-made and substitutions are welcomed. There's more than one way to make a sandwich.


Relational

Before the relational data model, existing data models had no particularly good way to separate the conceptual designs from implementations. Pre-relational models depended upon being able to determine explicitly where and how individual records were stored. Early relational proponents argued that the relational data model viewed information logically rather than physically, but this is not quite correct. Earlier data models associated the logical and physical aspects of information together; logically-related information was stored in physical proximity within a data file. The relational data model first separated the logical from the physical aspects.

The relational data model looks at information as an unordered collection of "relations." Each relation is populated with unordered "tuples" of the same unordered "field" structure. Fields may only contain values of a well-defined ("atomic") domain or the null value. The unordered aspect needs to be emphasized. For expository purposes, relations are often viewed as "tables". The tuples constitute the "rows" of the table; values for a specific field constitute "columns". However, the "table data model" tends to impose a very non-relational ordering on both tuples and fields. Relations are an abstraction of how data is stored; tables are just one of many possible implementations.
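The unordered aspect can be made concrete with Python sets. The sketch below assumes a tuple can be modelled as a frozenset of (field, value) pairs and a relation as a set of such tuples; the helper name make_tuple and the sample data are invented for the example.

```python
# A tuple is modelled as a frozenset of (field, value) pairs, so field
# order does not matter; a relation is a set of such tuples, so tuple
# order does not matter either.
def make_tuple(**fields):
    return frozenset(fields.items())


employees = {
    make_tuple(emp_id=1, name="Ada"),
    make_tuple(emp_id=2, name="Grace"),
}

# The same content, with tuples and fields written in a different order.
same_employees = {
    make_tuple(name="Grace", emp_id=2),
    make_tuple(name="Ada", emp_id=1),
}

relations_equal = employees == same_employees

# A "table" view has to pick some order for rows and columns, an order
# that the relation itself does not carry.
as_table = sorted(sorted(t) for t in employees)
```

Sorting here is purely a presentation choice; any other ordering of rows or columns would display the same relation.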

Some of the relational terms are crafted to emphasize the distinction between logical and physical features, to avoid confusing one concept with another. However, vocabulary leakage from other disciplines has sprinkled into the conversation of relational proponents. There is a strong tendency to refer to an individual tuple/row as a "record" because collections of fields in other models are called records. "Attribute" is often used synonymously with field.

To be sure, "unordered" implies neither "chaotic" nor "random". Relations and fields are named uniquely and identified easily. Distinguishing between tuples is more subtle since the order is not pre-defined. Rather than depending upon relative (as in hierarchy) or absolute (as in network) locations, tuples may only be differentiated according to their contents.

Consequently, duplicate tuples are not permitted within a single relation. Even more strongly, every tuple must be distinguishable by a unique "key" (some combination of a relation's named fields). The minimal keys are the "candidate keys"; one of them is designated the "primary key". Within a tuple, references to other tuples are expressed as a "foreign key," which should contain the values of the referenced tuple's primary key.
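A small Python sketch may help show what keys and foreign keys demand of the data. The relations, field names and helper functions below are hypothetical. Note also that is_key only tests uniqueness over the tuples currently present; it does not test minimality, and a real key is a constraint on every possible population of the relation.

```python
# Rows are dicts here for readability; the field names are hypothetical.
departments = [
    {"dept_id": 10, "dept_name": "Engineering"},
    {"dept_id": 20, "dept_name": "Sales"},
]
employees = [
    {"emp_id": 1, "name": "Ada", "dept_id": 10},
    {"emp_id": 2, "name": "Grace", "dept_id": 10},
    {"emp_id": 3, "name": "Ada", "dept_id": 20},
]


def is_key(rows, fields):
    """True if the named fields distinguish every tuple from all others."""
    seen = {tuple(row[f] for f in fields) for row in rows}
    return len(seen) == len(rows)


def dangling_references(rows, foreign_key, referenced_rows, primary_key):
    """Return foreign-key values that match no referenced primary key."""
    valid = {row[primary_key] for row in referenced_rows}
    return [row[foreign_key] for row in rows if row[foreign_key] not in valid]


emp_id_is_key = is_key(employees, ["emp_id"])
name_is_key = is_key(employees, ["name"])
name_dept_is_key = is_key(employees, ["name", "dept_id"])
dangling = dangling_references(employees, "dept_id", departments, "dept_id")
```

Here "name" alone fails as a key because two employees share it, while the combination of "name" and "dept_id" happens to be unique in this sample.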

Relational theory provides a firm mathematical foundation for data management. Set theory could be applied to relations using relational algebraic operations (union, intersection, join, projection, etc.). Assertions about the existence or non-existence of some condition within a data base could be proven with a rigor unachievable with earlier models.
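A few of these operations can be sketched directly on Python sets. This is an illustrative toy, not how any RDBMS implements them; the helper names row, project and natural_join and the sample relations are assumptions for the example.

```python
def row(**fields):
    """Build one tuple as a hashable, order-free set of (field, value) pairs."""
    return frozenset(fields.items())


def project(relation, fields):
    """Keep only the named fields; duplicate results collapse automatically."""
    return {frozenset((f, v) for f, v in t if f in fields) for t in relation}


def natural_join(left, right):
    """Combine tuples that agree on every field name they share."""
    joined = set()
    for a in left:
        for b in right:
            da, db = dict(a), dict(b)
            shared = da.keys() & db.keys()
            if all(da[f] == db[f] for f in shared):
                joined.add(frozenset({**da, **db}.items()))
    return joined


employees = {
    row(name="Ada", dept=10),
    row(name="Grace", dept=10),
    row(name="Linus", dept=20),
}
departments = {row(dept=10, title="Engineering"), row(dept=20, title="Sales")}

depts_in_use = project(employees, {"dept"})
staffing = natural_join(employees, departments)
everyone = employees | {row(name="Ken", dept=20)}  # union is plain set union
```

Because relations are sets, projecting three employees onto "dept" yields only two tuples; no separate step is needed to remove duplicates.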

Implementation of vendor-specific RDBMS has created some confusion about what features are required in a relational model. Most vendor RDBMS take a decidedly "table-oriented" view that is not strictly relational.

Among the delicatessens of data management, relational data models represent an open buffet with trays of breads, meats and cheeses ready for customers to make their own sandwich.


Entity-Relationship (E-R)

The abstractness of the relational data model was an essential step toward eliminating the reliance of data models upon machine implementations. However, the abstractness also obscured how tuples in one relation were associated with tuples in other relations. In some ways, the entity-relationship model (or ERM, as long as you don't confuse it with enterprise resource management) can be viewed as an extension of the relational model where the associations between relations are made explicit. Relational purists have suggested that ERM is totally unnecessary. Just the same, the entity-relationship model exists independently of the relational model and should be judged on its own merits.

The ERM approach presumes that all information can be stored in entities and relationships between entities. An "entity" is similar to a relation (of the relational data model), except that any references to other entities are removed. This would include all foreign keys definitely and may include other association information as well. What constitutes a tuple in the relational model is called an "entity instance" by ER purists. In practice, both schematic entities and entity instances are considered "entities."

It's in the area of relationships where the ERM approach takes a decidedly different twist from the relational. For one thing, it's important to distinguish between the relational "relation" (a collection of tuples with the same structure) and the ERM "relationship" (an explicit association between two entities). The closest parallel to an ERM relationship in the relational model is the notion of foreign keys. In ERM, relationships are much richer. For instance, relationships are characterized by cardinality: whether a relationship is one-to-one, one-to-many, or many-to-many.

In practice, many ERMs are implemented as relational models within an RDBMS. This has been a remarkably successful approach, but has been known to blur the distinction between ERM and Relational models.
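To make the idea of cardinality concrete, the Python sketch below classifies a relationship from a list of (left entity, right entity) instance pairs. This rests on an assumption worth stating loudly: in ERM the cardinality is declared as part of the model, whereas this function only reports what a particular sample of instances happens to exhibit. The function name and sample data are invented for the example.

```python
def cardinality(pairs):
    """Classify a relationship from its (left, right) instance pairs."""
    pairs = set(pairs)
    lefts = [left for left, _ in pairs]
    rights = [right for _, right in pairs]
    left_repeats = len(set(lefts)) < len(lefts)      # a left entity has many rights
    right_repeats = len(set(rights)) < len(rights)   # a right entity has many lefts
    if left_repeats and right_repeats:
        return "many-to-many"
    if left_repeats:
        return "one-to-many"
    if right_repeats:
        return "many-to-one"
    return "one-to-one"


manages = [("Ada", "Engineering"), ("Linus", "Sales")]
employs = [("Engineering", "Ada"), ("Engineering", "Grace"), ("Sales", "Linus")]
enrolled = [("Ada", "Math"), ("Ada", "Art"), ("Grace", "Math")]

kinds = [cardinality(manages), cardinality(employs), cardinality(enrolled)]
```

Reading "employs" in the opposite direction gives the many-to-one case, which is the same relationship seen from the other entity.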

If the relational model is like a buffet of different sandwich ingredients, ERM is like a picture of a sandwich on the wall over the buffet that reminds users how the sandwich is supposed to look.


Object-Oriented (O-O)

It should come as no surprise by now that the OO approach to data modeling starts from the premise that all information can be stored in objects. The problem arises in trying to get OO proponents to agree upon exactly what an "object" is supposed to be. Objects tend to be defined at a very general level. There are many things that an object could be; many features that may be implemented. There are relatively few that an object is required to be.

For instance, in some ways, an object looks a lot like an "entity instance" from the ERM or a "tuple" from the relational model. Objects may also include some kind of behavior that manipulates their fields or attributes. On the other hand, an object need not have any such behavior, which makes entities and tuples perfectly good examples of objects. Unfortunately, this provides a distorted notion of what objects are to practitioners who have mastered the ERM or relational model. It would quickly raise questions about the value of OO approaches if it did not seem to provide anything different from what could be done with existing models.

Consequently, the principles for the OO data model are not as well-established as for other models. Even the vocabulary is sometimes at odds. There are at least two distinct viewpoints for OO that have emerged: an analysis view that defines a class as the common intersections of features shared by distinct objects, and the development view that defines a class as a blueprint for instantiating objects with common features. To confuse the issue further, another definition of class means the collection of all objects either instantiated from the same blueprint (development view) or that happen to possess the same shared properties (analysis view). Implementation of OO programming languages (where OO has been remarkably successful) has created some confusion about what features are needed in an OO data model. Programming languages are definitely in the development camp.

OO masters deem something to be an object because it is useful to consider that thing as if it were an object. Objects include state (attributes or fields) and dynamic behavior (methods). Object behavior is triggered by receiving a "message" or "event". These triggers may originate externally or within the object. With regard to data models, objects have proven useful in the following roles:

Object may be...

...a value in a field: Binary large objects (blobs). OO programmers consider blobs to be "snapshots" of some object instantiated in a computer's memory. OO data modellers consider the blob to be the object and a clone of that object may be constructed transiently in memory to implement behavior. Some RDBMS vendors implement this alone and then advertise that they support OO. This is technically correct, but it's about as useful as equating tuples with objects. Prospective RDBMS users should look closely at the features to determine whether the RDBMS OO means the same to them.

...a field type (or column): One of the hallmarks of OO is the ability to create user-defined types. Most RDBMS permit only a limited number of well-defined types for fields (TEXT, NUMERIC, VARCHAR, DATE, CURRENCY, etc.).

...a tuple or row: Objects may include complex states using variables filled with simple values. Objects without significant behavior defined are little more than wrappers around some record of information. The behavior usually enforces consistency constraints when the state is changed.

...a relation or table or view: An object may incorporate information contained within a whole relation (base or derived).

...a whole data base: An object may incorporate information about all relations contained within a data base. The notion that an object is like a miniature data base goes a long way to characterize the complexity that objects are able to encapsulate.
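The first role, an object stored as a value in a field, can be sketched in Python as follows. The Document class, its to_blob and from_blob methods, and the choice of JSON as the snapshot format are all assumptions made for illustration; real systems use many different serialization schemes.

```python
import json


class Document:
    """A hypothetical object whose state can be snapshotted into a blob."""

    def __init__(self, title, paragraphs):
        self.title = title
        self.paragraphs = paragraphs

    def word_count(self):
        return sum(len(p.split()) for p in self.paragraphs)

    def to_blob(self):
        state = {"title": self.title, "paragraphs": self.paragraphs}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def from_blob(cls, blob):
        state = json.loads(blob.decode("utf-8"))
        return cls(state["title"], state["paragraphs"])


# The row stores the object as an opaque value in one field.
original = Document("Sandwiches", ["Bread first.", "Then fillings."])
row = {"doc_id": 1, "body": original.to_blob()}

# To use the object's behavior, a clone is rebuilt transiently in memory.
clone = Document.from_blob(row["body"])
words = clone.word_count()
```

The data base sees only bytes in the "body" field; the behavior (here, word_count) is available only after the clone is constructed.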

The OO approach relies upon a number of features that have become recognized as defining how OO is different from other approaches.

OO features

encapsulation: Objects form natural collections of state variables and the behavior needed to manipulate such variables on an object-wise basis.

inheritance: Different objects can be defined in a way that allows common parts to be shared rather than duplicated.

persistence: Once an object is constructed, the object exists until it is no longer required. This feature is not as important in programming languages where objects may be either persistent or transient (the same as other types of variables).

polymorphism: Different objects may exhibit different reactions to the same message invocation. This permits objects that are known to respond to the same set of messages to be implemented in different ways.
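Three of these features (encapsulation, inheritance and polymorphism) can be shown in a short Python sketch. The Account and OverdraftAccount classes and the overdraft allowance of 100 are hypothetical, and persistence is deliberately left out because these objects live only in memory.

```python
class Account:
    """State (a balance) plus the behavior that guards it (encapsulation)."""

    def __init__(self, balance=0):
        self._balance = balance

    def withdraw(self, amount):
        # The behavior enforces a consistency constraint on state changes.
        if amount > self.limit():
            raise ValueError("insufficient funds")
        self._balance -= amount

    def limit(self):
        return self._balance

    def balance(self):
        return self._balance


class OverdraftAccount(Account):
    """Shares everything with Account except how the limit is computed."""

    def limit(self):
        return self._balance + 100  # hypothetical overdraft allowance


accounts = [Account(50), OverdraftAccount(50)]
outcomes = []
for account in accounts:
    try:
        account.withdraw(80)  # same message, different reactions
        outcomes.append(account.balance())
    except ValueError:
        outcomes.append("refused")
```

The loop sends the same withdraw message to both objects without knowing which kind it holds; the plain account refuses while the overdraft account goes negative.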

OO practitioners have identified a few fields that are valuable but not always essential in data modeling efforts. One is a reference to the object's class (but there is disagreement about what a class is). Another is an object's "key," a system-wide unique identifier. The key is a little controversial; relational theory abhors system-generated values, but objects may be so large in scope that it's not reasonable to address their contents all at once.

Examples include composite word-processing documents with embedded graphics. The OO approach tends to make situations that involve complicated sequences of data types more manageable.

The flat file model took a loaf of bread and sliced it up in order to make the bread useful for sandwiches. The OO approach would be to start with bread-like objects (e.g., biscuits, rolls, crackers, buns, bagels, muffins).


Multi-dimensional

The multi-dimensional approach starts with the premise that all information can be stored in multidimensional spreadsheet-like structures. This is no surprise; it's been possible to store information in two-dimensional spreadsheets ever since flat files were first introduced. The real story comes when we start to recognize that storing information in multi-dimensional spreadsheet-like structures "makes sense." That is, there is some value in storing information in that format. It allows the data modeler to manipulate the information in a useful way that could not be accomplished as easily with other models.

The multi-dimensional approach may be considered an extension of the relational model where denormalization has been carried to an extreme. In addition to the relational data algebraic operations, operations for "slicing," "pivoting" and other transforms are included. This makes the multi-dimensional approach useful for trend-line analyses and statistical correlations within data warehouses.
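As a minimal sketch of slicing and pivoting, the Python below keeps a small "cube" in a dictionary keyed by three dimensions. The sales figures, the dimension names and the function names slice_cube and pivot are all assumptions for the example; OLAP products define these operations in their own ways.

```python
# Hypothetical sales figures keyed by (region, product, quarter).
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "bread", "Q1"): 10,
    ("East", "bread", "Q2"): 12,
    ("East", "rolls", "Q1"): 4,
    ("West", "bread", "Q1"): 7,
    ("West", "rolls", "Q2"): 9,
}


def slice_cube(cube, dimension, value):
    """Fix one dimension at a value and drop it from the keys."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: amount
        for key, amount in cube.items()
        if key[axis] == value
    }


def pivot(cube, row_dim, col_dim):
    """Total the measure into a two-dimensional view of any two dimensions."""
    r, c = DIMENSIONS.index(row_dim), DIMENSIONS.index(col_dim)
    table = {}
    for key, amount in cube.items():
        cell = (key[r], key[c])
        table[cell] = table.get(cell, 0) + amount
    return table


q1_sales = slice_cube(cube, "quarter", "Q1")
by_region_product = pivot(cube, "region", "product")
by_product_quarter = pivot(cube, "product", "quarter")
```

The same five facts yield a region-by-product view or a product-by-quarter view on request, without any relationship between those dimensions having been declared ahead of time.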

In comparison with ERM and network models, the multi-dimensional approach is much more flexible about the way the user may want to relate information. The flexibility comes at a cost, however. Generally, ERM will out-perform multi-dimensional because pre-defined relationships can be optimized for access. The multi-dimensional approach may not be what you want to use for routine operations on your data, but becomes more important when you are trying to mine information for those elusive and lucrative nuggets of gold.

Many multi-dimensional data models are implemented as relational models. Others are usually implemented in some proprietary network data model. Neither implementation hampers the use of the multi-dimensional model for conceptual purposes.

Examples include decision support software for online analytical processing (OLAP).

Most other data models are going to look at a fruitcake as a sweet bread with a fruit and nut mixture. The multi-dimensional approach says that sometimes that makes sense, but sometimes you want to be able to look at fruitcake as a piece of fruit surrounded by sweet bread.


Future

There is no reason to expect that all the important data models that there ever will be have already been engineered. The evolution of data models has shown remarkable ingenuity on the part of data modelers to apply foreign disciplines to their craft. When one problem is solved, modelers compete to optimize some other aspect. It's not a case of trying to store new kinds of information; being able to represent all kinds of information has been fundamental to data models since it was shown that data could be stored in files. The battle is on to match data model structure with domain-related problem structure.

For example, the object-oriented paradigm encapsulates both data and behavior in a way that obscures the difference between the two. For programmers, this is perfect. It means that you don't have to know how an object is implemented in order to send an object messages. However, obscuring information and processes is not always well suited to many business management practices, where managers prefer seeing the process made quite visible. It's plausible that in the not-so-distant future, a data modelling scheme derived from object-oriented principles that models processes separate from information could become an important aspect of strategic planning.

Of course, we have no idea what it might be called. Someone will always be able to create a new kind of sandwich.


Copyright 2003, J. W. Rider
Webcrafted by J. W. Rider