Write eastern language Persian/Farsi/Arabic

uleysky · January 23, 2021, 12:17am

The ticks on the axes are in Iranian numbers, but without a degree sign and with a direction indicated by a Latin letter?
Symbols U+066A (Arabic Percent Sign, “٪”) and U+061A (Arabic Small Kasra, “◌ؚ”)? They were not in farsi.txt, so they were not replaced. I added them, but there is a problem: both characters are missing from the BZar fonts, although the percent sign can be replaced with a Latin percent. IRANSans.ttf contains the Arabic percent sign but not the kasra. I found the kasra only in the NotoSansArabic font. See the attached pict.zip (642.5 KB), positions CA and CB.

saeedsltm · January 23, 2021, 8:25am

@uleysky, OK, something seems strange to me. I checked inside the BZar font using both testtable (your script) and https://fontdrop.info/ and noticed some differences. The website shows 221 glyphs and 18 ligatures, but testtable reports 184 characters in total. So it seems some characters are missing, like the Farsi question mark with Unicode 061F (https://www.compart.com/en/unicode/U+061F), and a few others. Could you please check whether all 239 (221 + 18) characters are being read correctly from BZar.ttf by testtable?

uleysky · January 23, 2021, 9:34am

@saeedsltm, these are two different things. testtable shows how the font corresponds to the encoding in use. That is, there can be characters required by the encoding but absent from the font (yellow crosses in the table), and characters in the font that are not used in the encoding.

Let me describe in a little more detail the relationship between a TrueType font and its use in GMT. A TrueType font is a collection of glyphs; a glyph is a visual representation of a character. Each glyph in the font file is numbered from zero to Nglyphs-1. Each glyph also has a name, a string set by the font author or generated automatically. The font file also contains the encoding, which maps glyph numbers to Unicode code points. There may be glyphs in the font that have no Unicode match, but we do not cover them here. This is the structure of a font.

GMT uses PostScript as its backend. From the point of view of PostScript, a font is a dictionary of key-value pairs, where the key is the name of a glyph and the value is the procedure for rendering it (the image of the glyph, for simplicity). When PostScript draws a string, a glyph is assigned to each byte(!) of the string. Which glyph corresponds to which byte is determined by the encoding vector, which, as you can easily guess, contains 256 elements, each a glyph name. You can bind different encoding vectors to the same set of glyphs, thereby gaining access to all the glyphs of the font. But we need fewer than 128 glyphs, so one encoding vector is enough.
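As a rough illustration of the byte-to-glyph lookup described above, here is a toy model in Python. It is only a sketch: the glyph names and the slot 0xC1 are made up for the example, and a real font defines its own names.

```python
# Toy model of a PostScript encoding vector: 256 slots, each holding a glyph name.
# Glyph names and slot choices are hypothetical, not taken from any real font.
NOTDEF = ".notdef"  # conventional PostScript name for "no glyph"


def make_encoding_vector(assignments):
    """Return a 256-element list of glyph names; unassigned slots get .notdef."""
    vector = [NOTDEF] * 256
    for byte, glyph_name in assignments.items():
        if not 0 <= byte <= 255:
            raise ValueError(f"byte out of range: {byte}")
        vector[byte] = glyph_name
    return vector


def glyphs_for_string(data, vector):
    """Look up one glyph name per byte of the string."""
    return [vector[b] for b in data]


vector = make_encoding_vector({0x41: "A", 0xC1: "uni0627"})
names = glyphs_for_string(b"A\xc1", vector)
print(names)
```

Every byte of the string selects exactly one slot, which is why a single vector can address at most 256 glyphs at a time.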

Thus, our problem is reduced to creating a PostScript encoding vector that displays a certain subset of the font's glyphs. The main difficulty here is that the encoding vector must contain the names of the glyphs, and these can and will differ between fonts.

For the solution, we introduce an intermediate, font-independent entity that maps bytes to Unicode code points. This is the file farsi.txt. From the mapping (code point - glyph number) in the font, we get the mapping (code point - glyph name), and combining it with the (byte - code point) mapping from our farsi.txt, we get the mapping that we need: (byte - glyph name).
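The composition of the two mappings can be sketched in Python as follows. This is a minimal sketch, not the actual script: the function name `build_byte_to_glyph` and all table contents are assumptions for illustration, and the real format of farsi.txt is not shown here.

```python
# Sketch of composing (byte -> code point) with (code point -> glyph name).
# All table contents below are illustrative assumptions.

def build_byte_to_glyph(byte_to_cp, cp_to_glyph):
    """Return (vector, missing): a 256-slot list of glyph names, plus the
    code points the encoding asks for but the font does not provide."""
    vector = [".notdef"] * 256
    missing = []
    for byte, cp in sorted(byte_to_cp.items()):
        name = cp_to_glyph.get(cp)
        if name is None:
            missing.append(cp)  # required by the encoding, absent from the font
        else:
            vector[byte] = name
    return vector, missing


# Font-independent part (the role played by farsi.txt).
byte_to_cp = {0x41: 0x0041, 0xC1: 0x0627, 0xC2: 0x061F}
# Font-dependent part: this toy font has no glyph for U+061F.
cp_to_glyph = {0x0041: "A", 0x0627: "uni0627"}

vector, missing = build_byte_to_glyph(byte_to_cp, cp_to_glyph)
print(vector[0xC1], [f"U+{cp:04X}" for cp in missing])
```

Only `cp_to_glyph` changes when the font changes, so the same byte assignments can be reused with any font that has the required glyphs.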

The second part of the task is a translator from Unicode code points to bytes. It is also created from farsi.txt, but it is font-independent.
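A translator of this kind could look roughly like the Python sketch below. The names `build_translator` and `encode_text` and the table contents are hypothetical, chosen only to show the idea of inverting the (byte - code point) table.

```python
# Sketch of a code point -> byte translator built by inverting the
# (byte -> code point) table. Table contents are illustrative assumptions.

def build_translator(byte_to_cp):
    """Invert the table once so each character lookup is a single dict access."""
    return {cp: byte for byte, cp in byte_to_cp.items()}


def encode_text(text, cp_to_byte):
    """Translate a Unicode string into the single-byte encoding."""
    out = bytearray()
    for ch in text:
        try:
            out.append(cp_to_byte[ord(ch)])
        except KeyError:
            raise ValueError(f"U+{ord(ch):04X} is not in the encoding") from None
    return bytes(out)


byte_to_cp = {0x20: 0x0020, 0xC1: 0x0627, 0xC2: 0x0628}
cp_to_byte = build_translator(byte_to_cp)
encoded = encode_text("\u0627 \u0628", cp_to_byte)
print(encoded)
```

Because the inverted table is built once and then reused, translating a long input costs one dictionary lookup per character. A character that is not in the table raises an error instead of being silently dropped.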

Sorry for the long text, I wanted to get a fairly clear description of what was done. Ask if something is still not clear.

saeedsltm · January 23, 2021, 10:13am

@uleysky, OK, thanks for the clarification, I get it now. Changing the code point for the question mark (?) from 0x003F to 0x061F in farsi.txt solved the problem, so I can do the same for other characters if required.

uleysky · January 23, 2021, 10:23am

For now, I would advise you not to touch the characters with codes from 0x20 to 0x7E, in order to maintain compatibility with ASCII (0x7F, DELETE, is not needed and can be replaced too). There are still some free slots at the end of the vector, so it is better to insert new characters there. The final version will probably need to be posted somewhere so that others can use it.
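One way to follow this advice is sketched below in Python. The helper names, the choice of the highest free slot, and the placeholder table contents are all assumptions for the example; the real farsi.txt may be laid out differently.

```python
# Sketch: add a character to a free slot in the upper half of the vector,
# leaving 0x20..0x7E untouched. Table contents are placeholder assumptions.

def free_slots(byte_to_cp, first=0x80, last=0xFF):
    """Return unassigned byte values in [first, last], highest first."""
    return [b for b in range(last, first - 1, -1) if b not in byte_to_cp]


def add_character(byte_to_cp, code_point):
    """Assign code_point to the highest free slot and return that byte."""
    if code_point in byte_to_cp.values():
        raise ValueError(f"U+{code_point:04X} is already encoded")
    slots = free_slots(byte_to_cp)
    if not slots:
        raise ValueError("no free slots left in the vector")
    byte_to_cp[slots[0]] = code_point
    return slots[0]


# Toy table: printable ASCII kept as-is, upper half filled with placeholder
# code points except for the last two slots.
byte_to_cp = {b: b for b in range(0x20, 0x7F)}
byte_to_cp.update({b: 0x0600 + (b - 0x80) for b in range(0x80, 0xFE)})

slot = add_character(byte_to_cp, 0x00B0)  # degree sign
print(hex(slot))
```

The ASCII range is never considered for reuse here, so existing Latin text keeps working after new characters are added.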

saeedsltm · January 23, 2021, 10:30am

Yes, you’re right, so I will continue testing the code until it reaches a stable version.

saeedsltm · January 30, 2021, 11:54am

Dear @uleysky, after some tests I found the following:
1- Figure frame annotations need some extra characters in farsi.txt, such as the degree sign (https://www.compart.com/en/unicode/U+00B0). The wrong characters are also displayed for the single quote (minutes) and the double quote (seconds).
2- Some characters are also missing, like the Farsi question mark (https://www.compart.com/en/unicode/U+061F).

Also, when the text input file is fairly large (say, more than 10 lines), calling the functions that re-encode the text by Unicode code point takes a long time. So I am wondering whether it would be possible to merge all the functions and prepare the text file before running the GMT code, then pass it to pstext directly. Do you have a better idea?

uleysky · January 31, 2021, 3:00am

I think we need to create a localization file for GMT (as in /usr/share/gmt/localization/) and a separate encoding vector, with a reduced set of symbols, for localized annotations only.
Missing from the encoding or missing from the fonts? In the first case, you already know how to add a character to the encoding vector; in the second, the only option is to change the font.
The string re-encoding code is quick and dirty, so there is room for optimization. Send me a test case and I’ll see what I can do.