Flat file database

From Citizendium
Revision as of 00:14, 7 June 2012 by imported>Olav Næss (New section: Simple substitutes)
Jump to navigation Jump to search
This article is developing and not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
 
This editable Main Article is under development and subject to a disclaimer.

A flat file database is a database with only one data table, and its file representation is usually: a plain text file. A hierarchical or relational data structure may contain several data tables, and hence be regarded as comprising several flat file databases.

A database should not be confused with the software to manage the data it contains. The latter is a database management system (DBMS).

Logical structure

This structure is as simple as possible for a database: There is one table - a list of database records, all with the same number of fields. A field has a certain data type, which may or may not be recognizable by inspection of the data. The most common field types (with recognition criteria) are:

  • Integer number (starting with a digit or a sign)
  • Floating-point number (a number with a decimal point or the letter 'e' indicating exponential notation, e.g. 1e6 (=1000000))
  • Text (delimited like 'this' or "this")
  • Date (digits with special delimiters like '/', ':' or multiple '.')

All the records in a table must have the same field type sequence, and each record must be identifiable, normally by an identifier like a number or a text field (a name) in the beginning of the record.

In relational database terminology, a flat file database is called a relation. A relational database consists of one or more relations. The theory for relational databases demands databases to conform to certain normal forms. The first normal form requires rectangular relations: All records must have the same length. It also requires unique (non-colliding) identifiers for the records. Database management software often permit a sequence of values within a field, where the records may have different numbers of values. A relational database system must handle this by creating a new relation (table) having separate records for these values, but then the database ceases to be a flat file database.

File structure

A flat file database is commonly implemented as a simple text file, where each record occupies one line of text. Such a file is commonly referred to as comma-delimited, tab-delimited etc., depending upon which character is used as field separator. Some common field separators:

  • a comma ,
  • a colon :
  • a tab character (for jumping to a preset tabulator position)
  • a sequence of one or more spaces

In the first three cases, missing data may be represented as really missing, causing consecutive delimiters to indicate this fact without confusion. In the last case, missing data must be represented by a unique value or character string. Such a representation is also required if tab characters are translated into spaces. This is commonly done, but may be done in two different manners:

  • a tab is translated into a certain number of spaces
  • a tab causes a jump (with insertion of spaces) to the next tabulator position

The second alternative is the sofisticated word processor treatment, and it is likely to cause confusion when no spaces need to be inserted.

Database creation

The flat (text-based) databases are very convenient to use, because so many different kinds of programs can deal with such text files. The file can be written by:

  • a simple text editor, provided the user inserts the correct field separator characters
  • a spreadsheet program, or a word processor which can handle tables, provided a text file export command is available
  • a program written in virtually any programming language, whether or not the program really stored the data table as a relevant data structure (array, matrix, vector of records, complex object...).

These simple tools can be used for tables with several thousand rows, corresponding to e.g. items collected or club memberships, as long as only one person needs to work with the file at a time. A flat file can easily be converted back and forth - to be formatted for any of the abovementioned simple tools, or for database software.

The text file is typically a simple sequential text file. This means it will not allow modifications inside the file without a complete rewrite, and new data may only be added by appending at de end. Although the file is intended for sequential access, direct access for reading will often be possible in certain cases:

  • All records have the same length, so it is easy to calculate the start position of a record with a given sequence number.
  • The data file has an associated index file, containing a table with pointers to the start positions of the records.

History

Handwritten tables on paper sheets may be regarded as the first flat file databases, but the first machine readable ones were stacks of punched cards, introduced for the 1890 US Census. Such a card has 80 hole columns for an 80 character text line. This was suitable for describing a US citizen, and processing the database implied putting stacks of cards into machines which sorted and counted cards on the basis of the values in specified columns.

In the sixties, text processing was done by punching each text line on a card, and 80 column text was for a while the standard for text processing, as well as for the text tables which had now become real flat file databases for computer analysis. When data were stored on disks, it was common to store them as 80 column card images (text lines), and the computer file database was born. In the eighties and nineties, flat file database programs were popular among computer users. But in the 21st century, computer users are likely to have (and to know) spreadsheet programs which are well suited for most flat file database operations. The main advantage of the "real" database system is the ability for interlinking several tables into a complex model, as well as to display and analyze such interlinked structures.

Observation matrices

Research data commonly constitute an observation matrix, in which each row is an observation vector, containing a standardized set of measurements for each case/subject to be investigated. The table format is now appropriate for database technology, but database technology will usually be of little interest in research because:

  • Data are collected during a data collection process during which little access to stored data is needed.
  • After the collection process, few file modifications are needed, and there is little time pressure (counting milliseconds), so a file rewrite is OK.
  • Data analysis is normally done by sweeping through the whole data file.

The data are often unsuitable for database analysis, even though the table shape seems right. This can be explained for two very different types of research data.

Time series analysis

This is commonly done in geophysics, where a few variables are measured with fixed time intervals through perhaps a long period. The little observation vector has no identifier which could be used as a database access key, as the position in the series is regarded as a valid identifier. This disqualifies this file from standard database processing.

Questionnaire analysis

When questionnaire responses are gathered, very many variables are measured. In psychological research, when subjects may answer questionnaires repeatedly, and the questionnaire response variables may be duplicated in transformed versions, there may be hundreds of variables. In such investigations, a separate flat file database will be needed for describing the variables, and a record describing a variable such as a questionnaire question, must be long enough to contain the question text, and also describe transformations done on the original variable. But when this list of texts has been prepared, data files conforming to this setup can be collected repeatedly as simple flat files.

A database system assuming a programmer-configured set of variables with simple names, will be unsuitable for dealing with a vector of long names. The standard structure of a research data record for case data (not time series data) is: A text field identifying the case, followed by a (floating point) observation vector. Some of the variables may be only on a nominal scale (i.e. indicate one out of several unrankable alternatives), but still have to be numerically coded in order to conform to the numerical standard for observation vector values. Missing data are coded as impossibly small or large values.

Simple substitutes

A flat file database may be stored as a spreadsheet, or even as a plain text file. This may be practical for personal use, as familiar text search and edit operations may now be used in familiar spreadsheet and text editing programs. But with several thousand records, the time needed for sequential search in these programs will become objectionable. The index mechanism of a database will make the search much faster - even if it simply points into a large sequential file, or even a text file. And any changes in the database will entail a cumbersome rewriting of this large file.

But the use of such simple substitutes will really collapse when the database is to be used by several users simultaneously. Simultaneous updates made by two or more users necessitate the complex logic built into the database software.

When the flat file structure breaks down

The previous paragraphs described a need for a supplementary database table. The flat file structure starts getting weak now, but the serious problems come when data values no longer indicate an alternative, but try to describe properties of the alternative. An example: A list of employees could have a code indicating which department they belong to, but should not try to put the name of the department head in an employee record. If this is done, a change of department head would necessitate extensive changes in many employee records, in stead of a single change in a separate department table.

Related Topics

  • MySQL, a Relational Database Management System (RDBMS)