PHP Classes

Using the PHP RTF Parser to Process Word Processing Documents Part 1: the RTF File Format - PHP RTF Tools package blog


Package: PHP RTF Tools

RTF is a portable file format for representing the content of word processing documents, such as those generated by Microsoft Word, OpenOffice and others.

Read this article to understand the RTF file format. The next articles in the series build on it to show how to use the PHP RTF Tools package for tasks such as using RTF templates, merging multiple documents, or simply extracting text from a document.





Introduction

This article is the first of a series that will cover the RTF file format, and explain how you can use the PHP RTF Tools package classes to perform several types of useful document processing in PHP.

A Bit of History

The Rich Text Format (RTF) is a Microsoft proprietary file format, whose specifications were first published in 1987. It was originally intended to facilitate document interchange between different Microsoft products over different platforms, but gradually gained some popularity among software vendors.

The reason was simple: the Microsoft Word .DOC binary format specifications remained unpublished until 1997, when they became temporarily available under certain conditions.

This lack of transparency regarding the .DOC file format led software vendors to see RTF as an alternative for supporting documents created with Microsoft Word and thus to provide end users with better interoperability. The RTF format evolved until its latest version, 1.9.1, in March 2008.

Why Care About the RTF File Format?

And why care about programming in Cobol, Fortran or RPG? Simply because there is an installed base of companies that still have to deal with it. This is part of their history and, as a professional, you may have to deal with such environments.

Of course, RTF would not be my preferred choice as a document interchange format across different platforms or systems, unless I had to face strong technical constraints coming from the IT environment, from a long company history, or both.

On the other hand, the RTF file format is really easy to parse and is human-readable (although in this particular case, "human-readability" may sometimes become a highly subjective topic).
From a syntactic point of view at least, it is easy to read and it is easy to generate, so sometimes, it may be a good compromise.

Moreover, parsing RTF documents does not require a complex framework of objects or libraries written in C/C++ (or your favorite language), as is sometimes the case for PDF files. A basic RTF parser requires only a few dozen lines of code.

What You Will Find Here

This series of articles focuses on the RTF document structure.

They will not explain how to create RTF contents such as headers, footers, tables and embedded images, but will rather focus on what you need to know if one day you have to read RTF documents and try to extract some useful information from them.

They will also explain how the PHP RTF Tools package classes can be used to address some of your needs when it comes to processing RTF files.

This first article will introduce you to the RTF file format, providing you with the basic information you will need in order to understand an RTF document when viewing it in a text editor, or when processing it through a script.

The RTF File Format by Example

Maybe the simplest way to introduce the RTF file format is to create the simplest possible document using the simplest word processing software, then have a look at the generated RTF data.

We have chosen the Microsoft Wordpad application for that. This small application, bundled with all versions of the Microsoft Windows operating system, has the great advantage of generating very simple RTF output. Of course, it lacks many of the most common features that modern word processor programs offer, but it is ideal for our purpose.

The test document, created in Wordpad, contains two lines of very simple text without any formatting: «Hello world» and "Special characters : { } \".

For those of you who are familiar with the Windows API, the Wordpad application uses the Rich Edit control for all its operations.

We used a few common characters, such as angle quotes, curly braces and backslashes. We will see later that even such common characters require some interpretation when translated into RTF format.

First, let's have a look at the RTF file generated by Wordpad. After saving the document and reopening it in a text editor such as Notepad or Notepad++, we get the following output:

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1036
{\fonttbl
{\f0\fnil\fcharset0 Arial;}
{\f1\fnil\fcharset0 Calibri;}
}
{\*\generator Riched20 10.0.10586}
\viewkind4\uc1\pard\fs26\'ab\f1\fs22\lang12 Hello world\f0\fs26\lang1036\'bb\par
\f1\fs22\lang12 Special characters : \{ \} \\ \par
}	

At first glance, we can notice a few things:

  • RTF data can be enclosed within curly braces; in fact, the whole document itself is enclosed within curly braces. These are known as groups in the RTF specification, and can be nested to any depth.
  • Many items start with a backslash followed by a keyword (\rtf1, \ansicpg1252, etc.). These items are referred to as Control words in the RTF specification. They generally provide document formatting information, such as \pard, which resets paragraph formatting parameters to their default values, or \ansicpg1252, which defines the code page to be used throughout the document.
    They are also referred to as tags in the PHP RTF Tools documentation.
  • Our original text still appears here, at least partly. The angle quotes, for example, have been replaced with the special constructs \'ab and \'bb, which give the hexadecimal codes of the opening and closing angle quotes in the current code page.
    Such constructs are referred to as Control symbols in the RTF specification, and as escaped character symbols in the PHP RTF Tools documentation.
  • Our special characters (curly braces and backslashes) are themselves prefixed with a backslash. They are escaped because they have a special meaning as RTF syntactic elements.
    Such elements are also referred to as Control symbols in the RTF specification, but the PHP RTF Tools package makes a further distinction and calls them Escaped expressions.

There are other notions that cannot be simply deduced from the above data: for example, an RTF document has a header part, line breaks are completely optional, and there are other specifics that this series of articles will explain in greater detail.

Overview of the RTF File Format: Elements of Syntax

RTF documents usually contain data encoded in 7-bit ASCII, consisting of groups, control words, control symbols and plain text. Line breaks (CRLFs) can be present in the document, but they have no purpose other than improving the readability of the raw RTF data: they will never be included in the document text.

The following sections describe the various components that can be found in an RTF document; they provide the official terms used in the Microsoft RTF Specification, as well as their equivalents in the PHP RTF Tools package, where more detailed distinctions have been made for better clarity.
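The four kinds of elements listed above can be told apart with a single regular expression. The following is only a minimal sketch, written in Python rather than PHP, and it is not the PHP RTF Tools API: the names TOKEN_RE, tokenize and SAMPLE are made up for this illustration. It assumes that a backslash followed by any non-letter is a control symbol, and it ignores special cases such as binary data.

```python
import re

# Hypothetical helper for illustration only; not part of PHP RTF Tools.
TOKEN_RE = re.compile(
    r"""
      (?P<open>\{)                                 # start of a group
    | (?P<close>\})                                # end of a group
    | \\'(?P<hex>[0-9a-fA-F]{2})                   # backslash, apostrophe, two hex digits
    | \\(?P<word>[A-Za-z]+)(?P<param>-?\d+)?[ ]?   # control word, optional parameter and space
    | \\(?P<symbol>[^A-Za-z])                      # any other backslash sequence
    | (?P<text>[^\\{}\r\n]+)                       # plain document text
    | [\r\n]+                                      # raw line breaks carry no meaning
    """,
    re.VERBOSE,
)


def tokenize(rtf):
    """Yield (kind, value) pairs for each syntactic element of an RTF string."""
    for m in TOKEN_RE.finditer(rtf):
        if m.group("open"):
            yield ("open", "{")
        elif m.group("close"):
            yield ("close", "}")
        elif m.group("hex"):
            yield ("escaped_char", m.group("hex"))
        elif m.group("word"):
            param = m.group("param")
            value = int(param) if param is not None else None
            yield ("control_word", (m.group("word"), value))
        elif m.group("symbol"):
            yield ("control_symbol", m.group("symbol"))
        elif m.group("text"):
            yield ("text", m.group("text"))
        # Matches of the last alternative (line breaks) are silently dropped.


# A shortened version of the Wordpad output shown earlier.
SAMPLE = (
    r"{\rtf1\ansi\pard\'ab\f1 Hello world\f0\'bb\par" "\n"
    r"Special characters : \{ \} \\ \par" "\n}"
)

tokens = list(tokenize(SAMPLE))
for kind, value in tokens:
    print(kind, repr(value))
```

Note that the line breaks of the sample produce no token at all, while the space after \f1 is absorbed by the control word and never reaches the text.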

Control Words

A control word can be regarded as an instruction that affects the way characters are displayed, or modifies the settings of a page, section or paragraph.

Control words can also define elements that can be later referred to from inside the document contents. They include for example footnotes, which are not displayed at the place they are defined in the RTF document, but are rather generally referenced from within the document contents.

Such control words are called Destination control words in the RTF specification. The PHP RTF Tools package simply calls both forms Control words.

A control word has the following syntax :

  • It always starts with a backslash
  • It is followed by the name of the control word itself, which is a sequence of alphabetic characters. Note that names are case-sensitive.
  • It can be followed by an optional integer value, which can be negative.

Examples :

  • \pard : the control word pard, which resets a paragraph to its default values.
  • \ansicpg1252 : the control word ansicpg, followed by the integer parameter 1252 (defines the codepage to be used throughout the document, unless otherwise stated).
  • \margl-200 : the control word margl, which defines the width of the left margin to be -200 twips.

A control word ends when a character that cannot be part of the control word itself (neither its name nor its numeric parameter) is encountered. Such characters include:

  • A backslash
  • An opening or closing brace
  • A space or a line break

If the control word is followed by a space, then the space is considered to be part of the control word itself, not part of the document text contents. This probably comes from an effort to improve the readability of RTF contents.

Note however that if a control word is followed by two or more spaces, then:

  • The first space will be part of the control word
  • The second and subsequent spaces will be part of the document contents
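The syntax and the space rule described above fit in one regular expression. This is a minimal sketch in Python, not the way PHP RTF Tools does it; CONTROL_WORD_RE and parse_control_word are hypothetical names introduced for the example.

```python
import re

# Backslash, letters, optional (possibly negative) integer, at most one space.
CONTROL_WORD_RE = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")


def parse_control_word(rtf, pos=0):
    """Parse the control word at rtf[pos]; return (name, parameter, next_pos)."""
    m = CONTROL_WORD_RE.match(rtf, pos)
    if m is None:
        raise ValueError("no control word at position %d" % pos)
    name, param = m.group(1), m.group(2)
    return name, (int(param) if param is not None else None), m.end()


rtf = r"\b  Hello"                      # control word followed by two spaces
name, param, end = parse_control_word(rtf)
print(name, param, repr(rtf[end:]))     # only the first space was consumed
```

Because the pattern ends with " ?" and not " *", a second space is left in the input and will be read as document text by whatever processes the rest of the string.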

Control words can also be prefixed by the control symbol \*, as in \*\background. The RTF specification states that it is used for destination control words.

The basic purpose of such a construct is to tell an RTF processor that it should ignore the control word if it does not recognize it (furthermore, the control word will not be included in the output if the RTF processor is capable of writing back RTF documents).

The PHP RTF Tools documentation refers to both Control words and Destination control words using the same term: Control words.

Groups

Groups start with an opening brace ({) and end with a closing brace (}). Inside a group, any paragraph or character-formatting properties can be specified, along with the document text they apply to. Grouping is also used for destination control words, such as fonts, styles, footnotes, headers and footers, etc.

Groups can be nested. When applied to text formatting, you can think of them as a way to push the current section, paragraph and character formatting options onto a stack before temporarily modifying some local settings. The opening brace of a group pushes these settings; the contents of the nested group define some specific settings, such as font weight, font size or text color. The closing brace then restores the settings that were pushed when the opening brace was encountered.

The following example will output the string "Hello " in bold, "gentle " as normal text, and " world !" in bold again (the \plain control word inside the nested group resets character formatting to its defaults, whereas \pard would only reset paragraph settings):

{\b Hello {\plain gentle } world !}
	

The next article will describe the meaning of each space in the above RTF contents: which ones are part of the control words, and which ones are part of the document text.
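The push and restore behavior of groups can be sketched with an explicit stack. The Python code below is an illustration only, not the PHP RTF Tools implementation: bold_runs is a hypothetical helper that tracks a single setting (bold), understands only \b, \b0 and \plain, ignores every other control word, and does not handle escaped symbols.

```python
import re

# Groups: brace, control word name, control word parameter, plain text.
TOKEN_RE = re.compile(r"([{}])|\\([a-z]+)(-?\d+)? ?|([^\\{}]+)")


def bold_runs(rtf):
    """Return (text, is_bold) runs, saving and restoring state at group boundaries."""
    state = {"bold": False}
    stack = []
    runs = []
    for brace, word, param, text in TOKEN_RE.findall(rtf):
        if brace == "{":
            stack.append(dict(state))      # push a copy of the current settings
        elif brace == "}":
            state = stack.pop()            # restore the settings saved at "{"
        elif word == "b":
            state["bold"] = param != "0"   # \b switches bold on, \b0 switches it off
        elif word == "plain":
            state["bold"] = False          # \plain resets character formatting
        elif text:
            runs.append((text, state["bold"]))
    return runs


for text, is_bold in bold_runs(r"{\b Hello {\plain gentle } world !}"):
    print(repr(text), "bold" if is_bold else "normal")
```

The closing brace of the inner group is what makes " world !" bold again: nothing in the input switches bold back on explicitly.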

Control Symbols

Control symbols, like control words, start with a backslash character, which is followed by a non-alphanumeric character. Unlike Control words, Control symbols are never followed by an optional space. If a space is present after a control symbol, it will be considered to be part of the document contents.

Although the Microsoft RTF Specification does not make any distinction between the various kinds of control symbols, the PHP RTF Tools package divides them into three categories, which are described below.

Escaped Expression or Escaped Symbol

The basic syntactic elements of an RTF file consist of only three characters: the opening brace ({), the closing brace (}) and the backslash (\). With only these three characters, you should be able to parse any RTF document (in the case of the backslash, of course, you will need some additional effort to parse what follows: a control word or a control symbol).

But what happens if you use such characters in the document's contents? The answer is simple: they need to be escaped. This escaping is handled automatically by your RTF document processor, as shown by the example and the comments that follow it in the section The RTF File Format by Example.

Normally, you should find only the following escaped symbols in an RTF document:

  • \{
  • \}
  • \\

However, the PHP RTF Tools package will also correctly handle escaped symbols where the character following the backslash is neither an apostrophe (which introduces an escaped character) nor a control symbol character (see below).
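When generating RTF yourself, the escaping step amounts to prefixing each of the three special characters with a backslash. Here is a minimal sketch in Python; escape_rtf_text is a hypothetical name, not a function of the PHP RTF Tools package, and the sketch deliberately leaves non-ASCII characters untouched.

```python
def escape_rtf_text(text):
    """Prefix every backslash and curly brace with a backslash."""
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in text)


# The second line of the Wordpad test document.
print(escape_rtf_text("Special characters : { } \\ "))
```

Handling the three characters in a single pass avoids escaping the backslashes that the function itself has just added.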

Escaped Character

An escaped character is represented by a backslash followed by an apostrophe (ASCII character 0x27) and two hexadecimal digits. It allows an 8-bit character to be specified in hexadecimal notation, as in the following example, which maps to the Euro character (€) in certain code pages, such as Windows-1252:

\'80
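Decoding such a construct means turning the two hexadecimal digits into one byte and interpreting that byte in the document's code page. The following Python sketch assumes code page 1252 by default, as declared by \ansicpg1252 in the Wordpad sample; decode_escaped_char is a hypothetical helper, not part of PHP RTF Tools.

```python
def decode_escaped_char(escape, codepage="cp1252"):
    """Decode an RTF escaped character such as \\'80 using the given code page."""
    if len(escape) != 4 or not escape.startswith("\\'"):
        raise ValueError("expected a backslash, an apostrophe and two hex digits")
    byte_value = int(escape[2:], 16)           # "80" becomes 128
    return bytes([byte_value]).decode(codepage)


for escape in (r"\'80", r"\'ab", r"\'bb"):
    print(escape, decode_escaped_char(escape))
```

The same byte can decode to a different character under another code page, which is why the code page declared in the document matters.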

Control Symbol

A Control symbol (as the PHP RTF Tools package recognizes it) is neither an escaped expression nor an escaped character, both of which can be handled at the lexical analysis level.

Control symbols carry some extra meaning that is intended for RTF viewer software. They are therefore handled separately. You will find below a list of the currently recognized control symbols:

  • \~ : non-breaking space.
  • \- : optional hyphen.
  • \_ : non-breaking hyphen.
  • \: : specifies a subentry in an index entry.
  • \| : Formula character (used by Word 5.1 for the Macintosh as the beginning delimiter for a string of formula typesetting commands, probably not used anymore).

The RTF specification also includes the \* control symbol, which marks a destination whose contents should be ignored if the destination is not understood by the RTF reader. The PHP RTF Tools package, however, considers that a sequence such as:

\*\background
		

is a control word ("background") carrying a special attribute, set because of the \* control symbol that precedes it.
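For a reader that does not understand such a destination, ignoring it means skipping the whole group that contains it, nested groups included. The sketch below shows one way to do this in Python; skip_group is a hypothetical helper, not the PHP RTF Tools API, and it assumes the group contains no raw binary data.

```python
def skip_group(rtf, start):
    """Return the index just past the group that opens at rtf[start]."""
    if rtf[start] != "{":
        raise ValueError("a group must start with an opening brace")
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2                # skip the backslash and the character after it
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced group")


rtf = r"{\rtf1{\*\generator Riched20 10.0.10586}\pard Hi}"
start = rtf.index("{\\*")         # position of the ignorable destination
end = skip_group(rtf, start)
print(rtf[:start] + rtf[end:])
```

Skipping two characters at each backslash keeps escaped braces such as \{ and \} from being counted as group delimiters.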

Conclusion

This article covered a part of the RTF file format, introducing the basic notions that will allow you to become familiar with raw RTF contents, at least from a syntactic point of view.

It also described the basic entities that form RTF contents: control words, control symbols, groups and destinations, along with their counterparts in the PHP RTF Tools package.

The next article of this series will describe how an RTF document is structured: header, body, fonts, styles, color tables and so on. It will also present some syntactic elements that were not discussed here and that may require some additional "intelligence" from a lexical parser.

These include the handling of optional spaces after control words, and special control words such as \bin or \pict, which require some specific processing.

Useful links

You will find below some useful documents from Microsoft about the Rich Text Format :

And some useful links collected from elsewhere :

  • RTF Pocket Guide.

    A small cookbook about the RTF language, by Sean M. Burke. It will not help you to become a specialist, but it will rapidly give you an overview of what's happening inside.

    Sean M. Burke is the author of the RTF Pocket Guide, which is an ideal complement to the Microsoft RTF specifications, and gives useful information on how to generate RTF code such as tables, paragraph formatting, etc.

    If you have to deal with RTF file generation, then this small book is a must-have, because it gives concrete examples that you will never find in the Microsoft specifications. And you won't need heavy tools to test the examples: Notepad for writing RTF contents, and Wordpad for displaying them.

  • Rich Text Format

    The page of Sean M. Burke's website giving additional information and links about the RTF file format.

  • Wikipedia page

    A comprehensive article on Wikipedia about the RTF file format.


