Lexer

 

ISO translator: 7-bit to print

I wrote the lexer in 2003 to translate Bengali text written in the ISO 15919 scheme from 7-bit ASCII into 8-bit printable code. The compilation instructions are given in the header of the source file. The translated output can then be processed in Plain TeX, with the nialang.tex file as an input.

The lexer reads a plain text file, named either as an argument or supplied through the redirection symbol, and writes Plain TeX to standard output, which is normally redirected to a file. Only the 7-bit transliteration flagged by \bgnbg ... \endbg is translated; everything else is copied unchanged. It does not allow any TeX code inside the flagged stream of text. This is left to be accomplished, perhaps, never.
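The flagging can be pictured with a small sketch in plain C. It is not taken from isotex.lex: the function name translate_flagged is made up here, and as a stand-in for the full rule set it translates only "aa" into \={a}. Text outside the flags is copied as it is and the flags themselves are dropped.

```c
#include <string.h>

#define BGN "\\bgnbg"   /* flag that opens a translated stretch */
#define END "\\endbg"   /* flag that closes it */

/* Hypothetical helper, not part of isotex.lex. Copies `in` to `out`,
 * translating only between the two flags. The single translation done
 * here ("aa" -> "\={a}") stands in for the lexer's full rule set.
 * Returns the number of flagged input characters consumed, or -1 if
 * the output buffer is too small. */
int translate_flagged(const char *in, char *out, size_t outsz)
{
    int inside = 0;   /* 1 while between \bgnbg and \endbg */
    int chars = 0;
    size_t n = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*in != '\0') {
        char single[2] = { 0, 0 };
        const char *piece;
        size_t len;

        if (strncmp(in, BGN, strlen(BGN)) == 0) {
            inside = 1;                 /* flag is consumed, not printed */
            in += strlen(BGN);
            continue;
        }
        if (inside && strncmp(in, END, strlen(END)) == 0) {
            inside = 0;
            in += strlen(END);
            continue;
        }
        if (inside && in[0] == 'a' && in[1] == 'a') {
            piece = "\\={a}";           /* long a */
            in += 2;
            chars += 2;
        } else {
            single[0] = *in++;          /* anything else passes through */
            piece = single;
            if (inside)
                chars++;
        }
        len = strlen(piece);
        if (n + len + 1 > outsz)
            return -1;
        memcpy(out + n, piece, len + 1);  /* copies the terminating NUL too */
        n += len;
    }
    return chars;
}
```

Called on the string "x \bgnbg baa\endbg aa", the sketch leaves the "aa" outside the flags alone and turns only the flagged one into \={a}.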

The lexer can mark the long vowels e and o with a bar, although they are normally unambiguous without it. It can also keep the intermediate v, which is otherwise printed as b. Both happen only if the \strict flag is given after the \bgnbg flag. Switching between normal and strict transliteration within a file is not yet incorporated. This, too, can be accomplished.
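What the \strict flag changes can be summed up in one small C function. This is only a sketch: the name mode_dependent is made up here, and the mappings are read off the single-letter and two-vowel rules of isotex.lex, where ai and au also depend on the flag.

```c
#include <string.h>

/* Hypothetical helper, not part of isotex.lex. Returns the TeX output
 * for the tokens whose rendering depends on the \strict flag, or NULL
 * for any token that is rendered the same way in both modes. */
const char *mode_dependent(const char *tok, int strict)
{
    if (strcmp(tok, "e") == 0)  return strict ? "\\={e}" : "e";
    if (strcmp(tok, "o") == 0)  return strict ? "\\={o}" : "o";
    if (strcmp(tok, "v") == 0)  return strict ? "v" : "b";
    if (strcmp(tok, "ai") == 0) return strict ? "\\t{a\\i}" : "ai";
    if (strcmp(tok, "au") == 0) return strict ? "\\t{au}" : "au";
    return NULL;
}
```

In the lexer itself the flag sets the variable method to 1 and nothing sets it back, which is why the two modes cannot yet be mixed in one file.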

A sample input file and the output


% loads Neo Indo-Aryan Language macros
\input nialang.tex
% following line is a plain TeX command
\obeylines
% flags beginning of conversion
\bgnbg
% flags strict conversion, bar over o or e, to mean length
\strict
ni.hsa;ngataa
(ye tumi hara.na karo, aabula haasaana)
% flags end of conversion
\endbg
\vskip1pc
\bgnbg
ata.tuku caa;yani baalikaa!
ata ;sobhaa, ata svaadhiinataa!
ce;yechila aaro kichu kama,
aa;yanaara d~aa.re deha mele
base thaakaa saba.taa dupura, ce;yechila
maa bakuka, baabaa taara bedanaa dekhuka!
ata.tuku caa;yani baalikaa!
ata hai rai loka, ata bhii.ra, ata samaagama!
ce;yechila aaro kichu kama!
eka.ti jalera khani
taake dika t,r.s.naa ekhani, ce;yechila
eka.ti puru.sa taake baluka rama.nii!
\endbg
% plain TeX command
\bye

The poem, by Abul Hasan, is in Bangla script below:

নিঃসঙ্গতা
(যে তুমি হরণ করো; আবুল হাসান)

অতটুকু চায়নি বালিকা!
অত শোভা, অত স্বাধীনতা!
চেয়েছিল আরো কিছু কম,

আয়নার দাঁড়ে দেহ মেলে দিয়ে
বসে থাকা সবটা দুপুর, চেয়েছিল
মা বকুক, বাবা তার বেদনা দেখুক!

অতটুকু চায়নি বালিকা!
অত হৈ রৈ লোক, অত ভীড়, অত সমাগম!
চেয়েছিল আরো কিছু কম!

একটি জলের খনি
তাকে দিক তৃষ্ণা এখনি, চেয়েছিল
একটি পুরুষ তাকে বলুক রমণী।

The \strict option will render a strict print transliteration as in the following:

niḥsaṅgatā
(yē tumi haraṇa karō, ābula hāsāna)

ataṭuku cāẏni bālikā!
ata śōbhā, ata sbādhinatā!
cēẏēchila ārō kichu kama,
āẏnāra dā̃ṛē dēha mēlē
basē thākā sabaṭā dupura, cēẏēchila
mā bakuka, bābā tāra bēdanā dēkhuka!
ata hai rai lōka, ata bhīṛa, ata samāgama!
cēẏēchila ārō kichu kama!
ēkaṭi jalēra khani
tākē dika tr̥ṣṇā ēkhani, cēẏēchila
ēkaṭi puruṣa tākē baluka ramaṇī

Without the \strict option, the poem is rendered by default in normal print transliteration, as in the following:

niḥsaṅgatā
(ye tumi haraṇa karo, ābula hāsāna)

ataṭuku cāẏni bālikā!
ata śobhā, ata sbādhinatā!
ceẏechila āro kichu kama,
āẏnāra dā̃ṛe deha mele
base thākā sabaṭā dupura, ceẏechila
mā bakuka, bābā tāra bedanā dekhuka!
ata hai rai loka, ata bhīṛa, ata samāgama!
ceẏechila āro kichu kama!
ekaṭi jalera khani
tāke dika tr̥ṣṇā ekhani, ceẏechila
ekaṭi puruṣa tāke baluka ramaṇī

The lexer can be run with the command ./isotex < infile.txt > outfile.tex, and the output file can then be processed in TeX, provided the nialang.tex macro file is included in it.


/* isotex.lex -- a text parser for ISO 15919 transliteration scheme,
 * translates 7-bit ASCII file to 8-bit printable through Plain TeX
 *
 * Copyright (C), 2003
 * author: Abu Jar M Akkas
 * last modified: August 23, 2003
 *
 * Users are free to modify the source and in that
 * case, they are requested to send a copy back to
 * the author.
 *
 * compile:
 *   lex isotex.lex / flex isotex.lex
 *   gcc lex.yy.c -ll -o isotex (for gcc on Linux)
 *   USAGE: --> ./isotex < infile.txt > outfile.tex
 *   gcc lex.yy.c -lfl -o isotex.exe (for Mingw on Windows)
 *   NOTE: if flex use <-lfl>, if lex use <-ll>
 *   USAGE: --> isotex < infile.txt > outfile.tex
 */

%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */

#define version "0.1"
#define TEX_DAMB '\x2d'
#define TEX_NDAMB '\x3a'

void putbg(int);
int chars = 0;    /* number of flagged characters processed */
int method = 0;   /* 0: normal, 1: strict transliteration */
%}

/* SGN ,|\.|;|~ */
/* CON b|c|d|g|h|j|k|l|m|n|p|r|s|t|v|y */
/* VOW a|e|i|o|u */
SGNONE    [,\.;~]
CONONE    [dhlmnrsty]
SGNTWO    [,\.]
CONFTWO   [dlrt]
CONSTWO   [hlr]
VOWSNG    [aeiou]
VOWDBL    [aiu]
VOWSTWO   [aiu]
CONSNG    [bcdghjklmnprstvy]
CONFTHR   [bcdgjkpt]
CONSTHR   [h]
SGNSNG    [~]
ISO_DAMB  \x3a

/* rule (regexp) definitions */
BBG       "\\bgnbg"
EBG       "\\endbg"

%s BGTXT

%%

{BBG}   { BEGIN BGTXT; }

<BGTXT>{

"\\strict"   { method = 1; }

    /* ,l ,r ;m ;n ;s ;y .d .h .n .r .s .t ~n */
    /* \ring{l} \ring{r} \.{m} \.{n} \'{s} \.{y} \d{d} \d{h} \d{n} \d{s} \d{t} \~{n} */
{SGNONE}{CONONE}   {
        if (yytext[0] == ',') {
            switch (yytext[1]) {
            case 'l': printf("\\ring{l}"); break;
            case 'r': printf("\\ring{r}"); break;
            }
        }
        else if (yytext[0] == ';') {
            switch (yytext[1]) {
            case 'm': printf("\\.{m}"); break;
            case 'n': printf("\\.{n}"); break;
            case 's': printf("\\'{s}"); break;
            case 'y': printf("\\.{y}"); break;
            }
        }
        else if (yytext[0] == '.') {
            switch (yytext[1]) {
            case 'd': printf("\\d{d}"); break;
            case 'h': printf("\\d{h}"); break;
            case 'n': printf("\\d{n}"); break;
            case 'r': printf("\\d{r}"); break;
            case 's': printf("\\d{s}"); break;
            case 't': printf("\\d{t}"); break;
            }
        }
        else if (yytext[0] == '~') {
            if (yytext[1] == 'n') { printf("\\~{n}"); }
        }
        /* else {printf("x");}; */
        chars += yyleng;
    }

    /* ,ll ,rr .rh .dh .th */
{SGNTWO}{CONFTWO}{CONSTWO}   {
        if (yytext[0] == ',') {
            if ((yytext[yyleng-2] == 'l') && (yytext[yyleng-1] == 'l')) {
                printf("\\ring{\\=l}");
            }
            else if ((yytext[yyleng-2] == 'r') && (yytext[yyleng-1] == 'r')) {
                printf("\\ring{\\=r}");
            }
        }
        else if (yytext[0] == '.') {
            if ((yytext[yyleng-2] == 'r') && (yytext[yyleng-1] == 'h')) {
                printf("\\d{r}h");
            }
            else if ((yytext[yyleng-2] == 'd') && (yytext[yyleng-1] == 'h')) {
                printf("\\d{d}h");
            }
            else if ((yytext[yyleng-2] == 't') && (yytext[yyleng-1] == 'h')) {
                printf("\\d{t}h");
            }
        }
        /* else {}; */
        chars += yyleng;
    }

    /* b c d g h j k l m n p r s t v y */
{CONSNG}   {
        switch (yytext[0]) {
        case 'b': printf("b"); break;
        case 'c': printf("c"); break;
        case 'd': printf("d"); break;
        case 'g': printf("g"); break;
        case 'h': printf("h"); break;
        case 'j': printf("j"); break;
        case 'k': printf("k"); break;
        case 'l': printf("l"); break;
        case 'm': printf("m"); break;
        case 'n': printf("n"); break;
        case 'p': printf("p"); break;
        case 'r': printf("r"); break;
        case 's': printf("s"); break;
        case 't': printf("t"); break;
        case 'v': if (method == 1) printf("v"); else printf("b"); break;
        case 'y': printf("y"); break;
        }
        chars += yyleng;
    }

    /* a e i o u */
{VOWSNG}   {
        switch (yytext[0]) {
        case 'a': printf("a"); break;
        case 'e': if (method == 1) printf("\\={e}"); else printf("e"); break;
        case 'i': printf("i"); break;
        case 'o': if (method == 1) printf("\\={o}"); else printf("o"); break;
        case 'u': printf("u"); break;
        }
        chars += yyleng;
    }

    /* aa ai au ii uu */
{VOWDBL}{VOWDBL}   {
        if (yytext[0] == 'a') {
            switch (yytext[1]) {
            case 'a': printf("\\={a}"); break;
            case 'i': if (method == 1) printf("\\t{a\\i}"); else printf("ai"); break;
            case 'u': if (method == 1) printf("\\t{au}"); else printf("au"); break;
            }
        }
        else if (yytext[0] == 'i') {
            if (yytext[1] == 'i') { printf("\\={\\i}"); }
        }
        else if ((yytext[0] == 'u') && (yytext[1] == 'u')) {
            printf("\\={u}");
        }
        /* else {printf("x");}; */
        chars += yyleng;
    }

    /* bh ch dh gh jh kh ph th */
{CONFTHR}{CONSTHR}   {
        if (yytext[yyleng-1] == 'h')
            switch (yytext[yyleng-2]) {
            case 'b': printf("bh"); break;
            case 'c': printf("ch"); break;
            case 'd': printf("dh"); break;
            case 'g': printf("gh"); break;
            case 'j': printf("jh"); break;
            case 'k': printf("kh"); break;
            case 'p': printf("ph"); break;
            case 't': printf("th"); break;
            }
        chars += yyleng;
    }

    /* ~ */
{SGNSNG}   {
        if (yytext[0] == '~') { printf("\\cb"); }
        chars += yyleng;
    }

    /* ~a ~e ~i ~o ~u */
{SGNSNG}{VOWSNG}   {
        if (yytext[0] == '~') {
            if (yytext[yyleng-1] == 'a') { printf("\\~{a}"); }
            else if (yytext[yyleng-1] == 'e') { printf("\\~{e}"); }
            else if (yytext[yyleng-1] == 'i') { printf("\\~{\\i}"); }
            else if (yytext[yyleng-1] == 'o') { printf("\\~{o}"); }
            else if (yytext[yyleng-1] == 'u') { printf("\\~{u}"); }
        }
        chars += yyleng;
    }

    /* ~aa ~ai ~au ~ii ~uu */
{SGNSNG}{VOWSNG}{VOWDBL}   {
        if (yytext[0] == '~') {
            if ((yytext[1] == 'a') && (yytext[2] == 'a')) {
                printf("\\nasika{a}");
            }
            /* {
                 switch(yytext[2]) {
                 case 'a': printf ("\\nasika{a}"); break;
                 case 'i': printf ("\\nasika{a\i}"); break;
                 case 'u': printf ("\\nasika{au}"); break;
                 }
               } */
            else if ((yytext[1] == 'i') && (yytext[2] == 'i')) {
                printf("\\nasika{i}");
            }
            else if ((yytext[yyleng-1] == 'u') && (yytext[yyleng-2] == 'u')) {
                printf("\\nasika{u}");
            }
        }
        chars += yyleng;
    }

{ISO_DAMB}/[ ]   { putbg(TEX_NDAMB); }
{ISO_DAMB}       { putbg(TEX_DAMB); }

.|\n    { printf("%s", yytext); chars++; }

{EBG}   { BEGIN 0; chars += yyleng; }

}

.|\n    { printf("%s", yytext); }

%%

/* user C code */
void putbg (int code)
{
    printf("%c", code);
}

void verno()
{
    fprintf(stderr, "\nisotex, ajmakkas, 2003, version: %s\t", version);
}

int yywrap (void)
{
    return(1);
}

int main (int argc, char *argv[])
{
    verno();
    if (argc > 1) {
        FILE *file;
        file = fopen(argv[1], "r");
        if (!file) {
            fprintf(stderr, "Could not open %s\n", argv[1]);
            exit(1);
        }
        yyin = file;
    }
    yylex();
    if (chars == 0) {
        fprintf(stderr, "\nNo token parsed, probably text not flagged.\n");
    }
    else if (chars > 0) {
        fprintf(stderr, "\n\n%d characters processed\n", chars);
    }
    return 0;
}
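The nested switch in the sign-plus-consonant rule could also be written as a lookup table. The following is only a sketch of that alternative and not what isotex.lex does; the names sign_consonant and lookup_sign_consonant are made up here, while the pairs are those handled by the rule.

```c
#include <stddef.h>
#include <string.h>

/* One entry per two-character input token and its TeX rendering. */
struct pair {
    const char *iso7;   /* 7-bit input, sign followed by consonant */
    const char *tex;    /* Plain TeX output using nialang.tex macros */
};

static const struct pair sign_consonant[] = {
    { ",l", "\\ring{l}" }, { ",r", "\\ring{r}" },
    { ";m", "\\.{m}" },    { ";n", "\\.{n}" },
    { ";s", "\\'{s}" },    { ";y", "\\.{y}" },
    { ".d", "\\d{d}" },    { ".h", "\\d{h}" },
    { ".n", "\\d{n}" },    { ".r", "\\d{r}" },
    { ".s", "\\d{s}" },    { ".t", "\\d{t}" },
    { "~n", "\\~{n}" },
};

/* Hypothetical helper: returns the TeX string for the first two
 * characters of tok, or NULL if the pair is not in the table. */
const char *lookup_sign_consonant(const char *tok)
{
    size_t i;
    size_t count = sizeof sign_consonant / sizeof sign_consonant[0];

    for (i = 0; i < count; i++)
        if (strncmp(tok, sign_consonant[i].iso7, 2) == 0)
            return sign_consonant[i].tex;
    return NULL;
}
```

Inside the lexer action, yytext could be passed to such a function. A NULL return marks a pair such as ,d, which the pattern matches but for which the switch prints nothing.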

% *********************************************
% Neo-Indo-Aryan Language Transliteration Codes
% *********************************************
% Filename: nialang.tex
% Author: Abu Jar M Akkas
%
% January 2002
% *********************************************
% Vowels
% Usage:
% a \={a} i \={\i} u \={u} \ring{r}
% \ring{\=r} \ring{l} \ring{\=l}
% e(\=e) ai (\t {ai}) o(\=o) au (\t {au})
% Consonants
% Usage:
% k kh g gh \.{n}
% c ch j jh \~{n}
% \d{t} \d{t}h \d{d} \d{d}h \d{n}
% p ph b bh m y r l
% \'{s} \d{s} s h \d{r} \d{r}h \.{y}
% \.{m} \d{h} \~{}
%
% \nasika{a} \nasika{i} \nasika{u} % long nasalised
% \cb \cm
% *********************************************
% ,r ,rr ,l ,ll
% Usage \ring{r} \ring{\=r} \ring{l} \ring{\=l}
\def\ring#1{{\ifx#1r\oalign{\relax#1\crcr\kern-.11em\hidewidth% ring (below r)
\vbox to.2ex{\hbox{\char"17}\vss}\hidewidth}%
\else\oalign{\relax#1\crcr\hidewidth% ring (below l)
\vbox to.2ex{\hbox{\char"17}\vss}\hidewidth}\fi}}
% for slanted/italic version
\def\iring#1{{\ifx#1r\oalign{\relax#1\crcr\kern-.4em\hidewidth% ring (below r)
\vbox to.2ex{\hbox{\char"17}\vss}\hidewidth}%
\else\oalign{\relax#1\crcr\hidewidth% ring (below l)
\vbox to.2ex{\hbox{\char"17}\vss}\hidewidth}\fi}}
% ~aa, ~ii, ~uu
% Usage \nasika{a} \nasika{i} \nasika{u}
\def\nasika#1{{\ifx#1i\leavevmode\setbox0\hbox{h}\dimen0\ht0\advance\dimen0-1ex%
\rlap{\raise.3\dimen0\hbox{\kern-.2ex\char"7E}}\=\i%
\else\leavevmode\setbox0\hbox{h}\dimen0\ht0\advance\dimen0-1ex%
\rlap{\raise.3\dimen0\hbox{\char"7E}}{\=#1}\fi}}
% candrabindu
% Usage \cb
\def\cb{\leavevmode\lower0.6ex\hbox to .5em{\hss\char"7E\hss}\relax} % could be \lower1ex
% candrabindu over m
% Usage \cm
\def\cm{\leavevmode\setbox0\hbox{m}\dimen0\ht0\advance\dimen0-.5ex%
\rlap{\raise.3\dimen0\hbox{\kern.66ex\char"5F}}\u{m}}
% *********************************************
\endinput
% *********************************************

 

Rev.: vii·xi·mmxxii