Network Working Group
Request for Comments: XXXX
Ge'ez Frontiers Foundation
This memo defines an Experimental Protocol for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
Ethiopic web publishing is burdened with a legacy of numerous disparate systems. The probability is then very high that a document published in one encoding system becomes inaccessible to the majority of users who will wish to read the document. A conversion service is proposed to alleviate this problem.
This document describes a Common Gateway Interface API and Hypertext Markup Language extensions for the conversion of web documents on remote servers from one encoding system into another.
Although Ethiopic character set / character code conversion software is available in the public domain it remains inaccessible to the average user wishing to publish with it. At issue in the foreseeable future are web hosting services reluctant to install said software. The need then arises for such users for access to a remote processing service.
Although the mechanisms specified in this document are all described in prose, most are also described formally in the augmented BNF notation of RFC 822. Implementors will need to be familiar with this notation in order to understand this document, and are referred to RFC 822 for a complete explanation of the augmented BNF notation.
All CGI parameter and HTML markup and parameter names as defined are case-insensitive. However, parameter values are case-sensitive unless otherwise specified for the specific parameter.
FORMATTING NOTE: Notes, such as this one, provide additional nonessential information which may be skipped by the reader without missing anything essential. The primary purpose of these nonessential notes is to convey information about the rationale of this document or to place this document in the proper historical or evolutionary context. Such information may in particular be skipped by those who are focused entirely on building a conformant implementation, but may be of use to those who wish to understand why certain design choices were made.
RFC 2045 provides our definition of character set:
The term "character set" is used in MIME to refer to a method of converting a sequence of octets into a sequence of characters. Note that unconditional and unambiguous conversion in the other direction is not required, in that not all characters may be representable by a given character set and a character set may provide more than one sequence of octets to represent a particular sequence of characters.
It is noted in the RFC that the above definition is intentionally very general and in some communities will be interchanged with the term "character encoding". A general definition for character sets is practical for discussion of Ethiopic encoding methods as a plethora of different methods have been applied to electronically encode the Ethiopic system of writing. The major systems addressed by this specification are encodings in computer fonts, transliteration, and transcription. A review of these approaches follows.
Ethiopic fonts have been encoded on personal computer systems in one of three approaches. The governing design constraint in devising an encoding system has been the address space limitation for encoding glyphs that many font systems impose.
1. Diacritic Based Systems whereby the character set is broken down into base glyphs and diacritic symbols. Usually more than one of a given diacritic symbol is given to enhance screen and printed rendering against the geometry of the base glyph. The use of diacritic symbols is a means to encode more characters into available space. Less frequently used characters may be omitted.
2. Whole Glyph Systems whereby glyphs remain intact. Again, less frequently used characters may be omitted to meet font space limitations.
3. Distributed Encoding Systems whereby approach (1) or (2) is taken but characters are spread over multiple physical fonts. In this way the less frequent characters need not be omitted from the character set.
It is important to note before continuing that in Approach 1 the resultant encoding scheme will be multibyte while using a single physical font. Base glyphs are themselves valid characters and will need only a single byte of encoding, while other characters may need as many as 3 diacritic symbols along with a base glyph (4 bytes total). In Approach 3 an Ethiopic "character set" becomes a collection of character encodings from each of the independent fonts. When diacritic symbols are in a different font file from the base glyphs the encoding of a single character becomes not only multi-byte but multi-typeface (encoded by two coded character sets).
John Clews provides an apt definition of transliteration:
Transliteration is the process which consists of representing the characters of an alphabetical or syllabic system of writing by the characters of a conversion alphabet, this being the easiest way to ensure the complete and unambiguous reversibility of the conversion alphabet in the converted system.
Again John Clews writes:
Transcription is the process whereby the sounds of a given language are noted by the system of signs of a conversion language.
A transcription system is of necessity based on the orthographical conventions of the conversion language. Transcription is not strictly reversible.
Transcription may be used for the conversion of all writing systems. It is the only method that can be used for systems that are not entirely alphabetical or syllabic and for all ideophonographical systems of writing like Chinese.
A "Transfer Variant" refers to an additional encoding of a character set for the purpose of information interchange. For instance a document composed in the "Geezigna" character set might be transferred as UTF-7, UTF-8, Quoted-Printable, or UUEncoded.
The UTF encodings may cause some confusion as the Unicode character set is implied though it is not, ultimately, the target character set of the document.
Text may be formatted under the writing practices of a given linguistic region and period. The LiveGe'ez Remote Processing Protocol is designed to be language and locale aware for such services as date conversion and document formatting. No requirement is made however on the locales and languages supported by a LiveGe'ez implementation.
The LiveGe'ez Remote Processing Protocol exploits the HTTP Common Gateway Interface standard as a means to receive processing requests over the Internet. CGI services specified in this version of the protocol are those of document and string encoding and date conversions.
The grammar for LiveGe'ez processing requests through CGI is given as follows:
    parameter := attribute "=" value

    attribute := token
                 ; Matching of attributes
                 ; is ALWAYS case-insensitive.

    value     := token | token "." token
                 ; The second token is some
                 ; modifier of the first.

    token     := 1*<any alpha-numeric (US-ASCII) CHAR>
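The grammar above can be exercised with a small parser. The following is only a sketch: the function name `parse_request` is hypothetical, and keeping values that fall outside the strict token grammar (file paths, or hyphenated tokens such as "AGF-Zemen" used in this document's own examples) verbatim is an assumption rather than a rule stated here.

```python
import re
from urllib.parse import parse_qsl

# token := 1*<any alpha-numeric (US-ASCII) CHAR>
TOKEN = r"[A-Za-z0-9]+"
# value := token | token "." token
VALUE_RE = re.compile(rf"({TOKEN})(?:\.({TOKEN}))?")


def parse_request(query: str) -> dict:
    """Map each attribute (lower-cased) to a (token, modifier) pair.

    Attribute matching is case-insensitive, so names are folded to lower
    case; values stay case-sensitive and are never folded.
    """
    params = {}
    for name, raw in parse_qsl(query, keep_blank_values=True):
        match = VALUE_RE.fullmatch(raw)
        if match is None:
            # Assumption: values outside the strict grammar are kept as-is.
            params[name.lower()] = (raw, None)
        else:
            params[name.lower()] = (match.group(1), match.group(2))
    return params
```

For example, `parse_request("SYS=Brana.UTF7")` yields the attribute `sys` with the token `Brana` and the modifier `UTF7`.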
Essential parameters for an arbitrary stream translation are those that will inform the processing agent of:
Table 3.1 lists parameters that may be used throughout the LiveGe'ez Remote Processing Protocol in combination with other parameters associated with character code conversion.
Multiple pragma directives can be given by combining pragma requests in a comma-delimited list or by appending digits to the pragma attribute (e.g. pragma3=value). The latter form is intended to simplify sending multiple pragma directives through HTML forms.
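As an illustration of the two ways of passing pragma directives, the sketch below gathers them into one list. The function name `collect_pragmas` is hypothetical, and apart from "date-only" the directive names in the usage note are placeholders, not values defined by this protocol.

```python
import re
from urllib.parse import parse_qsl

# Matches "pragma" optionally followed by digits (pragma, pragma2, pragma3, ...).
PRAGMA_NAME = re.compile(r"pragma\d*", re.IGNORECASE)


def collect_pragmas(query: str) -> list:
    """Return every pragma directive found in a CGI query string, in order."""
    directives = []
    for name, value in parse_qsl(query):
        if PRAGMA_NAME.fullmatch(name):
            # A single pragma attribute may carry a comma-delimited list.
            directives.extend(p.strip() for p in value.split(",") if p.strip())
    return directives
```

With this sketch, `pragma=date-only,x-example` and `pragma=date-only&pragma2=x-example` produce the same list of directives.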
The macros given in Table 3.1 can be expanded following the grammar shown below. Terms correspond to the definitions given in Section 2 and a comprehensive list of character set "SYSTEM" tokens is given in Appendix A.
    LIVEGEEZSYS      := CHARACTERSET | CHARACTERSET "." TRANSFERVARIANT

    CHARACTERSET     := COMPUTERFONT | TRANSLITSYSTEM | TRANSCRIPTSYSTEM

    TRANSFERVARIANT  := "UTF7" | "UTF8"

    COMPUTERFONT     := SYSTEM

    TRANSLITSYSTEM   := SYSTEM | SYSTEM "." LANG

    TRANSCRIPTSYSTEM := SYSTEM | SYSTEM "." LANG

    SYSTEM           := token
                        ; Any implementation supported system.
                        ; Appendix A lists defined system tokens.

    LANG             := ISO639_LANGUAGECODE
                      | ISO639_LANGUAGECODE "." ISO3166_COUNTRYCODE
                      | LOCALE

    LOCALE           := <Any Implementation Supported Locale>
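One way to read the LIVEGEEZSYS grammar is sketched below. The names `SystemSpec` and `parse_livegeezsys` are hypothetical. The sketch assumes that a trailing "UTF7" or "UTF8" component is always a transfer variant and that any remaining dotted components form the LANG part; comparing the variant without regard to case is also an assumption.

```python
from typing import NamedTuple, Optional

TRANSFER_VARIANTS = {"UTF7", "UTF8"}


class SystemSpec(NamedTuple):
    system: str
    lang: Optional[str]
    transfer_variant: Optional[str]


def parse_livegeezsys(value: str) -> SystemSpec:
    """Split a LIVEGEEZSYS value into system, language, and transfer variant."""
    parts = value.split(".")
    variant = None
    # Assumption: a final UTF7/UTF8 component is the transfer variant.
    if len(parts) > 1 and parts[-1].upper() in TRANSFER_VARIANTS:
        variant = parts.pop().upper()
    system = parts[0]
    # Whatever follows the system token is treated as the LANG component.
    lang = ".".join(parts[1:]) or None
    return SystemSpec(system, lang, variant)
```

Under these assumptions "Brana.UTF7" is the Brana system with the UTF7 transfer variant, while "ethiop.amh.et" is the ethiop system with the language "amh.et".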
Essential parameters for a document translation are those that will inform the processing agent of:
The Universal Parameters combined with Table 3.2 provide attribute definitions for the LiveGe'ez Remote Processing Protocol that meet the above requirements:
    http://www.xyz.com/Xlate?file=/dir1/dir2/file.html&sys=AGF-Zemen
    http://www.xyz.com/Xlate?file=http://www.abc.edu/dir1/dir2/file.html&sysIn=Ethiop&sysOut=Geez-3
    http://www.xyz.com/Xlate?file=/dir1/dir2/file.html&sys=Brana.UTF7&lang=gez.et
NOTE: In general a document will in some form specify its own encoding system(s) and the "sysIn" parameter will not be required. When the "sysIn" parameter is given it should take precedence over the document-specified encoding.
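A client could assemble file translation requests and apply the "sysIn" precedence rule roughly as follows. This is a sketch only: `build_request` and `effective_input_system` are hypothetical names, and the gateway address is the placeholder used in the examples above.

```python
from typing import Optional
from urllib.parse import urlencode


def build_request(gateway: str, **params: str) -> str:
    """Return a request URL; '/' and ':' are left unescaped for readability."""
    return gateway + "?" + urlencode(params, safe="/:")


def effective_input_system(sys_in: Optional[str],
                           declared: Optional[str]) -> Optional[str]:
    """An explicit sysIn parameter overrides the document's own declaration."""
    return sys_in if sys_in else declared
```

For instance, `build_request("http://www.xyz.com/Xlate", file="/dir1/dir2/file.html", sys="AGF-Zemen")` reproduces the first example request.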
String translation service is also specified in the LiveGe'ez Remote Processing Protocol and is analogous to the file translation service described in Subsection 3.2. The "string" attribute takes the place of the file attribute in Table 3.2 and may be any URL encoded string.
The essential parameters for a date translation are those that will inform the processing agent of:
Table 3.4.1 provides attribute definitions in addition to the Universal Parameters of the LiveGe'ez Remote Processing Protocol that meet the above requirements:
The "CALSYS" macro expands according to:

    CALSYS := "ethio" | "euro" | "et" | "eu"
              ; Support for the Ethiopic and European calendars
              ; is a minimal requirement.
    http://www.xyz.com/Xlate?day=29&month=4&year=1991&cal=ethio
    http://www.xyz.com/Xlate?day=7&month=1&year=1999&cal=euro&calOut=ethio&sys=VG-Main&lang=tir
    http://www.xyz.com/Xlate?day=29&month=4&year=1991&cal=ethio&pragma=date-only
The response in the first example would be a formatted date in the European calendar of the form "Thursday January 7, 1999" though no formatting or language locale is specified as a default. The response in the second example would be the Tigrigna analog of the first example under "VG-Main" font encoding. The final example applies a pragma directive. Pragma values for date queries are presented in Table 3.4.2:
||day "," month "," year "\n"||A comma-separated list of digits terminated by newline.|
||A phrased response terminated by newline.|
||Boolean "0" or "1" response terminated by newline.|
||The two pragma directives passed together with a comma delimiter return a phrased response terminated by newline.|
NOTE: There is an implicit assumption that the Ethiopic and European (Gregorian) calendars will be the primary interest of the user community. Accordingly, when the "calOut" attribute is omitted it is assumed that dates in the companion system are returned.
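The date service can be illustrated with a small sketch. The conversion arithmetic is not defined by this document; it is a commonly used Julian Day Number formulation, and the epoch constant, the function names, and the choice of Python's proleptic Gregorian `date` type are all assumptions made for illustration. The sketch also shows the "companion system" default for an omitted "calOut".

```python
from datetime import date

ETHIOPIC_EPOCH_JDN = 1723856  # assumed era offset for the Ethiopic calendar
ORDINAL_TO_JDN = 1721425      # JDN = date.toordinal() + 1721425


def ethiopic_to_gregorian(day: int, month: int, year: int) -> date:
    """Convert an Ethiopic day, month, year to a Gregorian date."""
    jdn = (ETHIOPIC_EPOCH_JDN + 365 + 365 * (year - 1) + year // 4
           + 30 * month + day - 31)
    return date.fromordinal(jdn - ORDINAL_TO_JDN)


def gregorian_to_ethiopic(d: date) -> tuple:
    """Convert a Gregorian date to an Ethiopic (day, month, year) tuple."""
    offset = d.toordinal() + ORDINAL_TO_JDN - ETHIOPIC_EPOCH_JDN
    r = offset % 1461                    # position in the four-year cycle
    n = r % 365 + 365 * (r // 1460)      # zero-based day of the year
    year = 4 * (offset // 1461) + r // 365 - r // 1460
    return n % 30 + 1, n // 30 + 1, year


def default_cal_out(cal_in: str) -> str:
    """When calOut is omitted, answer in the companion calendar system."""
    return "euro" if cal_in in ("ethio", "et") else "ethio"
```

Applied to the first example request, `ethiopic_to_gregorian(29, 4, 1991)` gives 7 January 1999, which agrees with the response described above.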
English to Ethiopic numeral translation service is specified in the LiveGe'ez Remote Processing Protocol. The "number" parameter with a positive or negative numeric value may be combined with the "xfer" and "phrase" attributes to request queries.
The "enumber" parameter for Ethiopic to English numeral translation is not a required part of the specification at this time. The parameter name will remain as a reserved term until its usefulness is further studied for a future version of the specification.
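To give a sense of what the "number" service computes, the sketch below renders a positive integer with Ethiopic numerals. It is not taken from this specification: it follows a commonly described pairing algorithm, emits Unicode code points (U+1369 through U+137C) rather than any of the font encodings listed in Appendix A, and handles positive integers only. The function name `to_ethiopic_numeral` is hypothetical.

```python
ONES = [""] + [chr(0x1368 + d) for d in range(1, 10)]  # U+1369..U+1371 (1-9)
TENS = [""] + [chr(0x1371 + t) for t in range(1, 10)]  # U+1372..U+137A (10-90)
HUNDRED = "\u137b"
TEN_THOUSAND = "\u137c"


def to_ethiopic_numeral(n: int) -> str:
    """Render a positive integer using Ethiopic numerals (sketch)."""
    if n < 1:
        raise ValueError("this sketch handles positive integers only")
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits
    # Work on two-digit groups, most significant group first.
    groups = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    last = len(groups) - 1
    out = []
    for index, group in enumerate(groups):
        pos = last - index  # 0 is the rightmost group
        text = TENS[int(group[0])] + ONES[int(group[1])]
        if pos % 2 == 1:
            # Odd groups count hundreds; an empty group adds nothing.
            sep = HUNDRED if group != "00" else ""
        else:
            sep = TEN_THOUSAND if pos > 0 else ""
        # "One" is dropped before a hundred sign, and before a leading
        # ten-thousand sign.
        if group == "01" and sep and (sep == HUNDRED or index == 0):
            text = ""
        out.append(text + sep)
    return "".join(out)
```

A server would still need to map the resulting characters into the character set requested through "sys" or "xfer".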
In both service areas (file translation and calendar conversion) it is left to the server administrator to set default language settings for when the "lang" attribute is omitted. It is also left to implementers to decide how to handle error responses when a parameter is left out.
The use of persistent cookies is practical for storing preferences such as "sys" and "lang" but explicit requests in a CGI string will take precedence over cookie values.
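The precedence among explicit CGI values, cookie values, and administrator defaults might be resolved as in the following sketch. The function name `resolve_setting` and the dictionary-based inputs are assumptions for illustration only.

```python
from typing import Mapping, Optional


def resolve_setting(name: str,
                    cgi_params: Mapping[str, str],
                    cookies: Mapping[str, str],
                    server_defaults: Mapping[str, str]) -> Optional[str]:
    """Pick a setting: CGI string first, then cookie, then server default."""
    for source in (cgi_params, cookies, server_defaults):
        value = source.get(name)
        if value:
            return value
    return None
```

Returning `None` when no source provides a value leaves the error response to the implementer, as the text above allows.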
The LiveGe'ez Remote Processing Protocol also specifies how the body of an HTML document is to be interpreted and provides useful extensions to the markup language itself.
When remote documents are being processed (those requested through the "file" parameter and containing the "protocol://" structure) the document root will be set using the "base" tag with the "file" cgi parameter value:
The base tag will be inserted into the "head" section of the document only if one has not been provided.
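A minimal sketch of the base tag handling follows, reading the rule as "insert a base tag only when the remote document does not already carry one". The function name `ensure_base` is hypothetical, and leaving a document without a "head" section unchanged is an assumption, since that case is not addressed here.

```python
import re
from html import escape

BASE_TAG = re.compile(r"<base\b", re.IGNORECASE)       # does not match <basefont>
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def ensure_base(document: str, file_url: str) -> str:
    """Insert <base href=...> after <head> for remote files lacking one."""
    # Only remote files (those with a "protocol://" structure) are affected.
    if "://" not in file_url or BASE_TAG.search(document):
        return document
    match = HEAD_OPEN.search(document)
    if match is None:
        return document  # assumption: no head section, nothing to do
    tag = '<base href="%s">' % escape(file_url, quote=True)
    return document[:match.end()] + tag + document[match.end():]
```

Local requests such as `file=/dir1/dir2/file.html` pass through untouched.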
After setting the base URL in remote files, LiveGe'ez implementations will continue to process the content of the document "body" section. LiveGe'ez implementations will be required to:
sysIn" cgi parameter) into the requested "sysOut" character set.
The scope of an Ethiopic character set will of necessity fall between the "font" markup tags for TrueType font systems or the "span" tags for transliteration and transcription systems, which will be discussed shortly in Subsection 4.2.
NOTE: While "basefont" markup support is available in many browsers, support for the "face" attribute of "basefont" is less universal. The "basefont" markup is not supported in this version of the LiveGe'ez Remote Processing Protocol for typeface setting. The basefont markup does not affect font typefaces used within tables or headings. Further, using the basefont markup would invert the problem to one of using "font" tags for changes to Roman script (common for numerals and Roman punctuation mixed with Ethiopic script). Since there is no certainty of a specific Roman typeface being available on all window systems, this version of the LiveGe'ez specification leaves Roman script as the default underlying script.
The Hypertext Markup Language provides the generic "span" tag which has no function by itself. Its functionality comes through the attributes used with the entity.
LiveGe'ez document processors supporting transliteration and transcription systems should be able to detect the "trans", "lang" and "options" attributes in the span markup as per:
<span trans="TRANSLITSYSTEM">`selam `alem</span>
<span trans="TRANSLITSYSTEM" lang="LANG">`selam `alem</span>
<span trans="TRANSLITSYSTEM" lang="LANG" options="OPTIONS">`selam `alem</span>
The attribute order is not important. The "sera" tag with no attributes should be recognized by LiveGe'ez implementations supporting the SERA transliteration system. The "sera" markup dates from early 1995 and support for the markup is suggested for backwards compatibility with legacy systems.
<span trans="sera">`selam `alem</span>
<span trans="ethiop.amh">'salAm `Alam</span>
<span trans="mainz" lang="amh.et">MalAm OAlam</span>
<span trans="dehai" lang="tir.er" options="key=value&key=value">selamat Alem</span>
Note that after the initial "lang" and "options" attributes are set they need not be respecified in subsequent markups.
<span trans="ethiop.amh.et">'salAm `alam</span>
<span trans="ethiop">^cAw `alam</span>
The "lang" and "options" attributes would be employed again only to reset the attributes.
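The carry-over of "lang" and "options" between span markups can be sketched with Python's standard `html.parser` module. The class name `SpanScanner` is hypothetical, and treating a language embedded in the "trans" value (as in "ethiop.amh.et") as equivalent to a "lang" attribute is an assumption drawn from the examples above. Nested spans are not handled.

```python
from html.parser import HTMLParser


class SpanScanner(HTMLParser):
    """Collect span markups with a trans attribute, carrying lang/options."""

    def __init__(self):
        super().__init__()
        self.lang = None      # last language seen, reused until reset
        self.options = None   # last options seen, reused until reset
        self.current = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag != "span" or "trans" not in attributes:
            return
        system, _, embedded_lang = attributes["trans"].partition(".")
        # An explicit lang attribute or an embedded language resets the state.
        self.lang = attributes.get("lang") or embedded_lang or self.lang
        self.options = attributes.get("options", self.options)
        self.current = {"system": system, "lang": self.lang,
                        "options": self.options, "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self.current is not None:
            self.found.append(self.current)
            self.current = None
```

Feeding the two example spans above to a `SpanScanner` yields two entries that share the language "amh.et", although only the first span states it.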
To simplify implementation, HTML documents should be preconditioned such that the scope of the Ethiopic character set is nested within markups where HTML renderers will reset typefaces.
Bad:

<sera>
<h1>This is my Title</h1>
This is some body text...
<ul>
  <li> Item 1
  <li> Item 2
  <li> Item 3
</ul>
</sera>

Good:

<h1><sera>This is my Title</sera></h1>
<sera>
This is some body text...
</sera>
<ul>
  <li><sera>Item 1</sera>
  <li><sera>Item 2</sera>
  <li><sera>Item 3</sera>
</ul>
The "sera" tag above might also be "font" or "span". In the case of the Unordered List the scope of the "sera" markup does not include the list item to ensure that the "bullet" glyph is rendered with the underlying Roman script font, as the bullet may not be found in the conversion Ethiopic character set.
The nesting of the HTML "span" tags for transliteration and the "font" tag should be avoided to make implementation easier at this time.
<span trans="mainz">MalAm OAlam</span><span trans="sera">`selam `alem</span>
Unlike the generic "span" tag the HTML language does not provide an equivalent markup for in-place tags. Fortunately HTML interpreters will ignore unrecognized markups and we may introduce the "LIVEGEEZ" markup for LiveGe'ez processors to detect and interpret.
The LIVEGEEZ markup depends on its attributes to perform any service. The "date" and "fortune" attributes are the only ones specified at this time.
The "date" conversion attribute is analogous to the "day", "month", and "year" CGI parameters of Subsection 3.4. The difference in using a single "date" attribute is that the day, month, and year attribute values are grouped in a comma-separated list to form the "date" attribute value. The calendar system of the date numeric values may be included as the last item of the list:
<LIVEGEEZ date="DATE">
<LIVEGEEZ date="DATE" lang="LANG">
<LIVEGEEZ date="DATE" lang="LANG" cal="CALSYS">
<LIVEGEEZ date="DATE" lang="LANG" cal="CALSYS" xfer="LIVEGEEZSYS">

    DATE := "Now"
          | day "," month "," year
          | day "," month "," year "," CALSYS
            ; day, month, and year tokens were defined in Table 3.3.1
            ; CALSYS here corresponds to the calIn parameter.
            ; Ethiopic calendar is assumed when CALSYS is omitted.
The "xfer" field would only be required when the markup is embedded in a block of non-Ethiopic text. Otherwise the current LIVEGEEZSYS is applied in the formatted date output.
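The DATE grammar for the "date" attribute can be read as in the sketch below. The names `DateSpec` and `parse_date_attribute` are hypothetical, and matching the literal "Now" exactly (case-sensitively) is an assumption based on the general rule that parameter values are case-sensitive.

```python
from typing import NamedTuple, Optional


class DateSpec(NamedTuple):
    now: bool
    day: Optional[int]
    month: Optional[int]
    year: Optional[int]
    cal: str


def parse_date_attribute(value: str, default_cal: str = "ethio") -> DateSpec:
    """Parse "Now", "day,month,year" or "day,month,year,CALSYS"."""
    if value.strip() == "Now":
        return DateSpec(True, None, None, None, default_cal)
    fields = [field.strip() for field in value.split(",")]
    if len(fields) not in (3, 4):
        raise ValueError("expected day,month,year with an optional CALSYS")
    day, month, year = (int(field) for field in fields[:3])
    # The Ethiopic calendar is assumed when CALSYS is omitted.
    cal = fields[3] if len(fields) == 4 else default_cal
    return DateSpec(False, day, month, year, cal)
```

So `date="29,4,1991"` is taken as an Ethiopic date, while `date="7,1,1999,euro"` names its calendar explicitly.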
The "menu" attribute is provided such that implementation-supported Ethiopic character sets can be conveniently inserted into HTML forms as drop-down selection menus. This approach is intended to facilitate maintenance of Ethiopic documents. The document composer need only apply the "LIVEGEEZ" markup a single time and be relieved of the burden of having to update documents following changes in supported systems on the server side, or when translating the document at different servers.
No "menu" attribute value other than "standard" is specified at this time. The menu attribute may be accompanied by "name" and "script" attributes for passing the values of the selection field name in the form and option scripting codes.
<LIVEGEEZ menu="standard" name="sysMenu" script="onChange="openLink(this.options[this.selectedIndex].value);">
<LIVEGEEZ menu="standard" name="sysMenu" script="js-standard selected="Addis98" ">
select output given below:
<select name="sysMenu" onChange="openLink(this.options[this.selectedIndex].value);">
  <option value=Addis>Addis One</option>
  <option value=AddisWP>Addis One</option>
  <option value=Addis98 selected>Addis One</option>
  <option value=Agaw>Agaw</option>
  <option value=AGF-Dawit>AGF - Dawit</option>
  <option value=AGF-Zemen>AGF - Zemen</option>
  <option value=AGF-Ejji-Tsihuf>AGF - Ejji Tsihuf</option>
  <option value=AGF-Rejim>AGF - Rejim</option>
  <option value=AGF-Yigezu-Bisrat>AGF - Yigezu Bisrat</option>
  <option value=ALXethiopian>ALXethiopian</option>
        :
        :
        :
</select>
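A server might expand the "menu" markup into a selection list along the lines of the sketch below. The function name `render_menu` and the list of (CGI value, label) pairs are assumptions; unlike the sample output above, this sketch quotes the option values, and it passes the "script" text through unchanged.

```python
from html import escape


def render_menu(name, systems, script="", selected=None):
    """Build a <select> element from (cgi_value, label) pairs."""
    opening = '<select name="%s"' % escape(name, quote=True)
    if script:
        # The script text (e.g. an onChange handler) is inserted as given.
        opening += " " + script
    lines = [opening + ">"]
    for value, label in systems:
        flag = " selected" if value == selected else ""
        lines.append('  <option value="%s"%s>%s</option>'
                     % (escape(value, quote=True), flag, escape(label)))
    lines.append("</select>")
    return "\n".join(lines)
```

Because the list of systems comes from the server, the same document produces an up-to-date menu wherever it is processed.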
The "game" attribute is not a required part of the LIVEGEEZ markup specification. The implementation is optional; an API specification is given here for those developers wishing to provide an interface to BSD style "games" available for Ethiopic languages such as "fortune" and "banner":
<LIVEGEEZ game="fortune">
<LIVEGEEZ game="fortune" src="URLToAuthorDB" args="17" lang="amh">
<LIVEGEEZ game="banner" args="`selam `alem" lang="amh" xfer="LIVEGEEZSYS">
For "fortune" the "src" attribute is optional and specifies a URL to a specific (vs random) fortune-format database file of Qine, Geter, Enqoqlesh, Qeldoch, etc. A number passed in the "args" attribute requests a specific entry in the database.
For "banner" the expected "args" value is the string to be bannerized. For any game additional arguments may be passed through the args value as a comma-delimited list. Additional arguments are not specified here but are left to implementers to provide.
The literal term "LIVEGEEZSYS" found anywhere within the bounds of a markup will be exchanged with the requested Ethiopic character set.
The Local Link:

<a href="/cgi-bin/LiveGeez?day=29&month=4&year=1991&sys=LIVEGEEZSYS">

Becomes:

<a href="/cgi-bin/LiveGeez?day=29&month=4&year=1991&sys=Washra">

The "LIVEGEEZLINK" literal is provided to mark hypertext links that should reference the LiveGe'ez processor:

The Local Link:

<a href="newFile.html" LIVEGEEZLINK>
"http://www.xyz.com" and "/dir1/dir2/" come from the "href" attribute value of the "base" markup; and the "sys" value is that of the file being processed.
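Returning to the "LIVEGEEZSYS" literal described earlier in this section, its exchange can be sketched as a substitution restricted to markup. The function name `substitute_sys` is hypothetical, and reading "within the bounds of a markup" as "between the angle brackets of a tag" is an assumption. The rewriting of "LIVEGEEZLINK" links is not attempted here.

```python
import re

# Assumption: "within the bounds of a markup" means inside a <...> tag.
TAG = re.compile(r"<[^>]*>")


def substitute_sys(document: str, requested_sys: str) -> str:
    """Replace the LIVEGEEZSYS literal inside tags with the requested system."""
    def replace_in_tag(match):
        return match.group(0).replace("LIVEGEEZSYS", requested_sys)
    return TAG.sub(replace_in_tag, document)
```

Text outside of tags is left alone, so a document may still mention the literal in its visible content.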
Colors indicate interdependent encoding schemes spread over multiple typefaces. Generally red indicates the base encoding and green the complementary encoding.
TrueType Font Systems
|Vendor||Encoding Name||Type Faces||CGI Value|
Hahu Lite Gothic
Hahu Lite Serif
Hahu Lite Times
|C.B. Hale (?)||C.B. Hale A||Amharic-A||Amharic-A|
|C.B. Hale B||Amharic-B|
|Ethiopian Computers & Software||Ge'ezEdit||GeèzEdit Amharic P||GeezEditAmharicP|
|Ethiopic (?)||Ethiopic1||ETHIOPIC PIC-1-regular||ETHIOPIC|
|EthiO Systems||Ethiopian ASCII Primary||Ethiopia Primary
Ethiopia Primary Slanted
Washrasl Primary Slanted
|Ethiopian ASCII Secondary||Ethiopia Secondary
Ethiopia Secondary Slanted
Washraxsl Secondary Slanted
|Fidel Software (?)||Fidelxtr1||FIDEL~`_SOFTWARE||Fidel|
|Ge'ez Frontiers Foundation||Ethiopian ASCII Primary||GF Zemen Primary
GF Abebe Giorgis Primary
GF Yigezu Bisrat Primary
|Ethiopian ASCII Extended-Secondary||GF Zemen Secondary
GF Abebe Giorgis Secondary
GF Yigezu Bisrat Secondary
|PFR he||ENH Zena he
GF Zemen2K Ahadu
|PFR le||ENH Zena le
GF Zemen2K Kileitu
|PFR He||GF Zemen2K Selestu|
|Ge'ez II||Geez II|
|Monotype||Monotype 1||Amharic 1
Amharic Book 1
|Monotype 2||Amharic 2
Amharic Book 2
|Monotype 3||Amharic 3
Amharic Book 3
ET-SAMI Normal Italic
ET-SAMI Bold Italic
|NCIC||Agafari||AGF - Zemen
AGF - Dawit
AGF - Rejim
AGF - Yigezu Bisrat
|Agafari -Ejji Tsihuf||AGF - Ejji Tsihuf||AGF-Ejji-Tsihuf|
|OmniTech||NCI||Amharic Kechin Normal
Amharic Yigezu Bisrat Normal
Amharic Gazetta Ordinay
|Phonetic Systems||Phonetic Systems||GeezType||GeezType|
|Power Ge'ez||Power Ge'ez||Ge'ez-1
|Power Ge'ez Number||Ge'ez-1 Number, etc|
|SIL||Et 1||Et 1 SILAddisAbaba||SILAddisAbaba|
|Et 2||Et 2 SILAddisAbaba|
|Et 3||Et 3 SILAddisAbaba|
|Visual Ge'ez||Visual Ge'ez||VG2-Agazian
Typeface Independent Systems
|Vendor||Encoding Name||Type Faces||CGI Name|
|Japanese Industrial Standard||JIS||NA||JIS|
|Japanese Unix Network||JUNET||NA||JUNET|
|Vendor||Encoding Name||Type Faces||CGI Name|
|Ge'ez Frontiers Foundation||Image||NA||Image
|Sun MicroSystems||Java (\uABCD)||NA||Java|
|U+||U Plus (U+ABCD)||NA||UPlus|