The use of text format files to store and transmit data for use in a spreadsheet involves three somewhat complex decisions which determine how the file expresses and separates each data value. These complexities must be understood for a user to be able to use the Text Import druid effectively. These complexities exist because of the limitations of early computers and because or the historical development of computer systems by different manufacturers and programmers, in different countries, targeting different types of users, speaking different languages.
The first complexity involves the different systems which relate the contents of a computer file to the characters in a written language. All text files on a computer consist of a long sequence of binary digits. Text files are files in which these digits are used to indicate different textual characters. Character 'encodings' are standardized systems which relate the binary digits in a computer file to a formal system of characters which includes both text glyphs (shapes) and formatting indicators. Each encoding defines a way to interpret the binary digits and uses the characters from a particular character set. The alternative character encoding strategies are explained in greater detail in Section 14.4.1.1 ― Character Encodings, below.
The second complexity involves the decision of how to separate the characters in a file into different lines. Text files explicitly determine the end of each line of a file with a specific character or sequence of characters. The complexity involves the particular character sequence used to determine the end of each line. Different conventions have been used in different computer systems. The alternative line breaking strategies are explained in greater detail in Section 14.4.1.2 ― Line break delimiters, below.
The third complexity involves the decision of how to separate the characters in each line into separate value fields. Again, different strategies exist. These can be separated into two broad categories: strategies which use a character or sequence of characters to separate the values, so called 'delimited' or 'separated' strategies, and strategies which use the position of the character in the line to separate the values, so called 'fixed-width' strategies. The alternative data structuring strategies are explained in greater detail in Section 14.4.1.3 ― Data Structuring Strategies, below.
Fortunately, the Gnumeric Text Import druid provides users with a way to preview the information in a text file. This enables users to change the settings which determine each of these three conventions until the text in the preview correctly shows the contents of the data file. Therefore, while the details of these three steps are complex, the practical impact on users is minimal. Users can simply experiment until the file appears correct without having to understand each of these complexities in detail.
The use of text files to store data in a structured fashion for use by spreadsheet programs, and more generally all text files, require some scheme to relate the binary number in the computer file itself to the characters of a written language. Such schemes are called 'encodings'.
The origin of computers led to the invention of a number of different encoding schemes. Due to the limitation of early computer hardware, these encoding schemes all restricted themselves to character sets which contained only the most essential characters of the English language. The desire to support characters which were not in this basic set of characters led to the creation of new encoding schemes, many of which restricted themselves to the characters in specific languages. One encoding scheme, called UTF-8, has now emerged as the best encoding scheme for the future for a multitude of reasons including its ability to co-exist with current operating systems and its ability to encode all of the characters in the largest set of characters which has been consistently defined, the Universal Character Set. However, the existence of the diversity of encoding schemes means that for the foreseeable future, files will be created and distributed using several different schemes. This is especially true for files containing text in languages other than English.
This complex situation generally does not impact users. Gnumeric has been designed to deal with most of the complexity. Many kinds of flies, such as the Gnumeric file format itself, describe their encoding scheme internally in such a way that it can be easily recognized. Gnumeric also provides an easy approach to changing the encoding scheme in case this proves necessary.
Encoding schemes merely prove a hindrance to users when opening files. There is no danger that data be lost or that any other serious problem arise by selecting the wrong scheme. If the wrong scheme is selected, either the file will contain characters which are nonsensical and Gnumeric will open an error dialog asking the user to select a different encoding scheme, or the preview area will display nonsensical characters. These nonsensical characters may simply be characters grouped together which do not occur in any language, such as "åÕÛÛÞ", or may be characters for which a graphical representation (a glyph) does not exist in the font being used and is therefore displayed using a small box with four numbers inside. Each of these errors indicates that the encoding scheme used to read the file was not the same encoding scheme as was used to create the file. The difficulty is then to determine what encoding scheme to use. A simple process of trial and error should lead to picking the right scheme.
A basic strategy to find the right encoding for a file being imported into Gnumeric is, first, to use the scheme proposed by Gnumeric and, then, to hunt for the correct encoding. The default encoding scheme is the one defined by the locale setting of the user and this is also the default scheme Gnumeric uses to create text files. If the default encoding is incorrect, the correct encoding must be found by trial and error. One strategy to use is to examine the major western encodings and then the major regional encodings. The major western encoding schemes are ASCII, ISO-8859-1, and UTF-8, but ASCII is a subset of the other two so it does not need to be tried on its own. The major regional encodings are the IS0-8859-x schemes since these have become quite popular in GNU operating systems. Alternatively, the various character sets used by the Microsoft operating systems can be attempted. The encoding schemes are listed under "Western", "Unicode", and the alphabet names.
The World Wide Web has many resources dedicated to explaining encoding systems and other related information. One of the best sites discussing UTF-8 and Unicode is the UTF-8 and Unicode FAQ for UNIX/Linux page maintained by Markus Kuhn. The Unicode project has a web site which includes an online copy of their standard character set. A discussion of the ISO-8859 family of encodings can be found at a page titled: "The ISO-8859 Alphabet Soup", which may alternatively be found here. A similar discussion on Wikipedia, focusing on the western alphabets, can be found here.
The use of text files to store data in a structured fashion for use by spreadsheet programs requires a scheme to separate each line of the file. Structured text files rely on the files having explicitly defined rows within the file as one component in the structuring system. Each of these rows is defined by a character sequence indicating the end of a row.
Two characters that are part of the ASCII code, an early encoding that became a widely followed standard, were included to help define the end of the line. These are the 'linefeed' character and the 'carriage return' character, named after the two processes which occur when a typewriter starts a new line: first the typewriter barrel rolls - the linefeed - then the whole carriage with the sheet of paper moves back to the starting point -the carriage return. In the same way that different computing systems have used different encoding schemes, three different approaches became common for defining the end of the line.
In GNU operating systems and other systems that inherit from the UNIX legacy, the end of a line was defined simply using the 'linefeed' character. The pre-OSX Macintosh operating system chose instead to use only the 'carriage return' character. The Windows operating system uses both characters in the sequence 'carriage return' then 'linefeed'.
A user opening a file into Gnumeric will see, in the preview area of the Text Import druid, whether or not the line breaks have been recognized correctly and will be able to alter the recognition settings. An incompatible setup will either yield a single unbroken line of text, lines of text with extra, empty rows between them, or lines of text with extra symbols at the start or end of each line.
The correct line break delimiters can be established by checking or unchecking the alternatives. The preview area will then show the result of the file interpreted with these settings.
The use of text files to store data in a structured fashion for use by spreadsheet programs also requires some scheme to separate each value within every line. Two different approaches are used to separate these values. The first strategy, uses a particular character or character sequence to denote the start and end of each value. Such strategies are called 'Separated Value' or 'Delimited Value' systems. The second strategy places each value stating at a specified position in the line. Such strategies are called 'Fixed Width' strategies because they inherently require that each value have a pre-determined size.
Separated Value structuring systems distinguish the contents of each value using pre-determined characters to separate the values. Certain characters have become common in such schemes, for-example 'Comma Separated Value' files use a comma character to separate values while 'Tab Separated Value' files use a tab character. Gnumeric allows the user to define the value separator to be any one of several common characters or a specific sequence of characters, either on their own or in combination. For example, a file could use both space characters and tab characters to separate values. Similarly, a file could be read which used the entire word 'STOP' to separate values like the common scheme to separate sentences in a telegram.
Separate Value structuring systems often also include a method to surround a single text value which may itself contain the character used to separate values. The quote character is often used in this role but Gnumeric allows users to configure any character in this role. For example, a file which used the comma to separate values could nonetheless contain a value like "Zoe, Sally, Dodji" if this value had appropriate text indicating characters at either end.
Fixed Width structuring systems are common formats for the output of database tables since the contents of these tables have often been defined as variables of a particular size. To import these files, users must specify exactly the start of each column so that the importer can separate the values on each row.