Write eastern language Persian/Farsi/Arabic

saeedsltm · January 14, 2021, 7:52pm

Hi,

Does anyone know how to write text in languages such as Persian (Farsi) or Arabic with GMT pstext? There seems to be a problem with encoding/decoding the strings.

Regards

seisman · January 14, 2021, 8:12pm

Others can help more if you post your example script and the resulting image, even though they don’t work as expected.

saeedsltm · January 14, 2021, 8:17pm

Ok, here is a very quick sample:

echo "10 10 15 10 0 2 سعید" | pstext -R-30/30/-10/20 -Jm0.1i -P -B5 -S0.5p > plot.ps

The text I want shown in the output figure is a Persian name (سعید), but there is a problem with it.
plot.ps (23.2 KB)

seisman · January 14, 2021, 8:21pm

I think you need to find a font that supports Persian and then tell GMT to use it.

This example may be helpful:

saeedsltm · January 14, 2021, 8:43pm

Thanks, I will try it.

saeedsltm · January 15, 2021, 8:20am

Well, I’ve tried something like example 31, and it works only for the border labels, not for text strings inside the map! Strange!

I zipped the script, fonts and PDF output here.
test.zip (78.1 KB)

CraneBoba · January 15, 2021, 5:27pm

I tried the same method that I used for Russian letters, but could not check it in the Russian Windows GMT version.
gmt begin GMT_tut_1

gmt set PS_CHAR_ENCODING ISO-8859-6
gmt basemap -R10/70/-3/8 -JX4i/3i -B -B+t"\621\622\623\624\625\626"
gmt end show
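
A note on the escapes above: backslash-octal escapes address single bytes in the active encoding, so they cannot exceed \377, and values such as \621 look like Unicode code points (U+0621 and so on) rather than byte values. The following is a minimal Python sketch, assuming that Python's built-in iso8859_6 codec matches the ISO-8859-6 table; the function name octal_escapes is only illustrative and is not part of GMT.

```python
def octal_escapes(text, encoding="iso8859_6"):
    """Return text as backslash-octal escapes of its bytes in `encoding`."""
    return "".join(f"\\{byte:03o}" for byte in text.encode(encoding))


# U+0621..U+0626, the code points that \621..\626 seem to refer to
letters = "".join(chr(cp) for cp in range(0x0621, 0x0627))
escapes = octal_escapes(letters)
print(escapes)  # \301\302\303\304\305\306
```

Whether GMT then draws these bytes as Arabic letters still depends on the encoding vector and the font, as discussed in the replies below.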

uleysky · January 16, 2021, 6:33am

@saeedsltm, this problem is complex, and it is not certain that it has a solution at all.
First, PS_CHAR_ENCODING ISO-8859-1 does not work at all, because ISO-8859-1 doesn’t contain Arabic characters. You must use PS_CHAR_ENCODING ISO-8859-6, but it is not functional in GMT. Without going into details, you should create your own encoding vector for the font you are using and insert it into the EPS file before gmt end. It is tricky, but possible. I can show you how.
Second, in your example the Persian name is given in Unicode, and this is guaranteed not to work. You must use a single-byte encoding, such as cp1256 or ISO-8859-6. I could not recode your example: iconv reported “illegal input sequence at position 8” for your file data.csv. Perhaps I chose the wrong encoding; unfortunately, I am not familiar with the Persian language.

In any case, it is necessary that the input text is in single-byte encoding. Then I can try to help solve your problem. However, I am almost sure that it will not be possible to solve the problem with writing from right to left.

P.S. The numbers in your example are correct because the font you used replaces the usual digits 0-9 with Persian ones. This will not happen with another font.

saeedsltm · January 16, 2021, 7:58am

@uleysky, many thanks for your very detailed information.
I had tried ISO-8859-6, which contains the Arabic characters, and it failed with my example, as you have already mentioned.
So the steps I have to take are something like the following (?):
1- Create an encoding vector for the Persian font (can you give me a hint on how to do it? Is it something like the ISO-*.ps files located in the …gmt/share/pslib directory?).
2- Find the right encoding for my data.csv file, which contains both English and Persian characters (the command file -i data.csv reports utf-8).
3- I’ve tried to convert my data.csv file from UTF-8 to cp1256 or ISO-8859-6 using the command "iconv -f utf-8 -t iso8859-6 data.csv > data_iso.csv", but it didn’t work: iconv: data.csv:1:7: cannot convert
4- Regarding R-to-L, we can ignore it for the moment.

Just for your information, the Persian script and its characters are similar to Arabic.

uleysky · January 16, 2021, 9:13am

Yes, something like that. A description of how to generate an encoding vector, together with scripts for doing it, is in this bug report: Fonts in GMT · Issue #4565 · GenericMappingTools/gmt · GitHub.

Yes, I figured it out: the problem is the character 'ARABIC LETTER FARSI YEH' (U+06CC), which for some reason is missing from the encodings CP1256 and ISO-8859-6. In any case, your font doesn’t have this character, so I replaced it with 'ARABIC LETTER YEH' (U+064A).

This simplifies the task.
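
The diagnosis above can be reproduced with a short Python sketch. It assumes that Python's iso8859_6 and cp1256 codecs match the tables iconv uses, and that the name is spelled with U+06CC as described; the helper name unencodable is hypothetical.

```python
FARSI_YEH = "\u06cc"   # ARABIC LETTER FARSI YEH
ARABIC_YEH = "\u064a"  # ARABIC LETTER YEH


def unencodable(text, encoding):
    """List (index, character) pairs that `encoding` cannot represent."""
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad


# The name from the first example, written as explicit code points
# (assumed spelling: seen, ain, Farsi yeh, dal).
name = "\u0633\u0639\u06cc\u062f"

for enc in ("iso8859_6", "cp1256"):
    print(enc, [(i, f"U+{ord(ch):04X}") for i, ch in unencodable(name, enc)])

# After the substitution, the text fits into the single-byte encoding.
recoded = name.replace(FARSI_YEH, ARABIC_YEH).encode("iso8859_6")
print(recoded.hex())
```

The substitution changes the spelling to the Arabic yeh, so whether the result looks acceptable depends on the font and on the position of the letter in the word.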

Your modified script, data.csv and the resulting PDF:
mod.zip (26.3 KB)

Since the font BZar does not contain the Latin alphabet, the figure shows squares in place of the symbol ‘N’.

saeedsltm · January 16, 2021, 11:02am

Your changes are great, thank you very much for that.
Regarding the bug report you mentioned, I found that making encoding vectors is very complicated!
There are two points now. The first is that Persian characters change shape depending on context, and they often have to be joined together. For example, the word "سلام" (which means “hello”) contains the letters of the alphabet س ل ا م, but in writing these characters take the connected, or stacked, form, which makes “سلام”.
The second point is the direction within each word. I thought you meant the wrong ordering of whole words within a paragraph, but now I realize that the direction is wrong even inside the words themselves (e.g. سلام will be written as ملاس, and of course not in stacked form). Therefore, with the current approach, words are written in reverse.
So there are two important requirements:

Words must be written in stacked form, unless the white-space character separates them.
Each word must be written from right to left.

I’m not sure whether changing the font itself can solve any of the issues mentioned above (sometimes it works; even in Microsoft Word we have something like that), but at least I can try different fonts.

uleysky · January 16, 2021, 11:56am

Please check this variant with “hello”: ex31.pdf (43.9 KB)
If it suits you, I will explain how you can achieve this. But this will require quite a lot of work, which I cannot do due to my ignorance of the language.

saeedsltm · January 16, 2021, 12:10pm

Well, everything is OK except the first character, which is drawn in isolated form (س) instead of stacked form (سـ). (Here سـ is actually composed of two characters, but in reality it is the stacked form of س.) The other characters are in their correct form, and the direction is also correct for the whole word.

uleysky · January 16, 2021, 12:39pm

Okay, maybe not everything is as hopeless as it seemed at first. Here is what I did:

I reversed the word “سلام” manually (just rewrote it in the reverse direction).
I found that “لا” is a lam-alef ligature, so I replaced the two symbols with one: https://www.compart.com/en/unicode/U+FEFC
Code point U+FEFC is missing from the ISO-8859-6 encoding, so I added it to the encoding at an empty position (F3) and generated a new encoding vector.
Since iconv can no longer recode text from UTF-8 to our new encoding, I used direct access to characters, as @CraneBoba demonstrated above. The string “سلام” became “\345\363\323”.
The font of the string is different here: I took NotoSansArabic-Regular for the test, but it doesn’t matter.
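
The manual steps above can be sketched in Python. This is only an illustration under stated assumptions: Python's iso8859_6 codec stands in for the base encoding, the names build_table and gmt_string are hypothetical, and every lam-alef pair is replaced by the final-form ligature U+FEFC, which is only appropriate when the preceding letter joins to it (as in this word).

```python
LAM_ALEF = "\u0644\u0627"   # lam followed by alef
LAM_ALEF_FINAL = "\ufefc"   # ARABIC LIGATURE LAM WITH ALEF FINAL FORM
LIGATURE_SLOT = 0xF3        # empty position in ISO-8859-6 chosen in the post


def build_table():
    """Map characters to bytes: ISO-8859-6 plus the ligature at 0xF3."""
    table = {}
    for byte in range(0x100):
        try:
            table[bytes([byte]).decode("iso8859_6")] = byte
        except UnicodeDecodeError:
            pass  # unassigned position in ISO-8859-6
    table[LAM_ALEF_FINAL] = LIGATURE_SLOT
    return table


def gmt_string(word, table):
    """Substitute the ligature, reverse the word, emit octal escapes."""
    visual = word.replace(LAM_ALEF, LAM_ALEF_FINAL)[::-1]
    return "".join(f"\\{table[ch]:03o}" for ch in visual)


hello = "\u0633\u0644\u0627\u0645"  # seen, lam, alef, meem
print(gmt_string(hello, build_table()))  # \345\363\323
```

The result matches the string quoted in the post; the sketch does not generate the PostScript encoding vector itself, which is covered by the scripts in the linked issue.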

As I understand from https://stackoverflow.com/questions/33718144/do-arabic-characters-have-different-unicode-code-points-based-on-position-in-str, an Arabic letter has a different shape (and a different Unicode code point!) depending on where in the word it appears. Thus, we have a difficult but solvable problem of translating a sequence of Arabic letters into a sequence of Unicode code points. This is something that only you can do. I can help with creating an encoding vector for this sequence and embedding it in the EPS file. If you are interested, let’s try it. I think that instructions for using Arabic characters in GMT would be useful to many people, because at present there is no adequate way to do this.
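
The positional code points can be listed with Python's standard unicodedata module. This is a small sketch, assuming that the compatibility decompositions in the Arabic Presentation Forms blocks (U+FB50..U+FEFF) are a sufficient source; the function name positional_forms is hypothetical.

```python
import unicodedata

POSITIONS = ("<isolated>", "<initial>", "<medial>", "<final>")


def positional_forms(base):
    """Return {position: presentation-form character} for one base letter."""
    forms = {}
    for cp in range(0xFB50, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Single-letter forms decompose to "<position> XXXX";
        # ligatures list several code points and are skipped here.
        if len(parts) == 2 and parts[0] in POSITIONS:
            if int(parts[1], 16) == ord(base):
                forms[parts[0].strip("<>")] = chr(cp)
    return forms


seen_forms = positional_forms("\u0633")  # ARABIC LETTER SEEN
for position, ch in sorted(seen_forms.items()):
    print(position, f"U+{ord(ch):04X}")
```

Choosing which form to use for each letter in a word is the harder part, since it depends on whether the neighbouring letters join; that is the job of a reshaping algorithm.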

saeedsltm · January 16, 2021, 12:55pm

That is great news, and I AM IN! As you rightly pointed out, many users need to use Persian or Arabic writing. Let’s start. I don’t know if this is helpful, but it is possible to write Persian in LaTeX, although I don’t know how it works. BTW, please tell me what I should do next, and I will do it ASAP.

uleysky · January 16, 2021, 1:25pm

As I said, the first step is to transform the Arabic text into a sequence of Unicode code points. In other words, we need the algorithm by which Arabic text typed on the keyboard turns into an image on the screen. A formal description can be found here: http://www.unicode.org/versions/Unicode8.0.0/ch09.pdf
I think there are ready-made solutions for this task, and it will be easier for you to find them than for me. Maybe this https://mpcabd.xyz/python-arabic-text-reshaper/ is what we need.

saeedsltm · January 16, 2021, 1:56pm

Yes, using arabic_reshaper we now have access to all Unicode code points of the Persian alphabet according to their position [isolated, initial, medial, final]. The next step is to use these code points to make an encoding vector, right?
We could also use two simple functions to get the RTL order and the corrected form of any Persian word.
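
The two functions mentioned above might look like the following sketch. It assumes the third-party arabic_reshaper package (installed with pip install arabic-reshaper) and its reshape function; the wrapper names reshape_word and visual_order are hypothetical, and plain reversal is an assumption that ignores combining marks and mixed-direction text.

```python
def reshape_word(word):
    """Replace letters by their positional presentation forms."""
    # Imported lazily so that visual_order works without the package.
    import arabic_reshaper  # third-party, not part of the standard library
    return arabic_reshaper.reshape(word)


def visual_order(reshaped_word):
    """Reverse a reshaped word so a left-to-right renderer draws it RTL."""
    return reshaped_word[::-1]


# Already-reshaped sample: seen initial, lam-alef final, meem isolated.
shaped = "\ufeb3\ufefc\ufee1"
print([f"U+{ord(ch):04X}" for ch in visual_order(shaped)])
```

With the package installed, visual_order(reshape_word(word)) would give the sequence of code points that then has to be mapped through the encoding vector.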

uleysky · January 16, 2021, 3:05pm

Seems to work ex31.rs.pdf (43.8 KB)
Please check.

Now another problem arose. I counted the number of code points used by the arabic_reshaper, and it turned out to be 372, which is significantly more than the maximum length of the encoding vector (256, but in fact even less, since it is necessary to leave room for numbers and punctuation). I suspect that in practice not all code points are needed, and only those that are actually used should be left. I will try to prepare a rendered list of these code points in the near future so that you can select the practically used ones.
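
One way to keep only the code points that are actually used is to collect them from the reshaped strings that will be plotted and give each one a byte position. The following is a sketch only: the range 0xA1..0xFF is an assumed set of free positions, not something GMT prescribes, and assign_slots is a hypothetical helper.

```python
def assign_slots(samples, first_slot=0xA1, last_slot=0xFF):
    """Assign one byte position to each non-ASCII character in `samples`."""
    needed = sorted({ch for text in samples for ch in text if ord(ch) > 0x7F})
    capacity = last_slot - first_slot + 1
    if len(needed) > capacity:
        raise ValueError(f"{len(needed)} code points do not fit in {capacity} slots")
    return {ch: first_slot + i for i, ch in enumerate(needed)}


# Two reshaped sample strings; ASCII characters keep their usual positions.
samples = ["\ufeb3\ufefc\ufee1", "\ufeb3 N"]
slots = assign_slots(samples)
for ch, byte in slots.items():
    print(f"U+{ord(ch):04X} -> 0x{byte:02X}")
```

A table built this way is specific to the text being plotted, so the encoding vector would have to be regenerated whenever a new character appears.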

saeedsltm · January 16, 2021, 3:20pm

Perfect. Yes, you are right, we do not necessarily need to use all of them. I’ve checked https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/letters.py and counted only LETTERS_ARABIC. It includes 77 characters, so after adding numbers and some special characters, I think the total would be less than 256 (hopefully).

uleysky · January 16, 2021, 3:34pm

Also check ligatures.py; most of the code points are there. If you can figure it out and make a list of the code points you need, that will be good. If not, then, as I said, I’ll make a graphical list later for convenience.

And let’s make some more complicated examples as a proof of concept. Write a few test lines, and I will try to display them in the program. But that will be tomorrow; my time zone is further east than yours.