[R] International phonetic symbols in R.

Wed Dec 4 01:30:37 CET 2013

Well, thanks for sending me the files, but I'm sorry to be rather
pessimistic for now...

That's exactly what I suspected after a first look at the data
in your first email... The short answer is: the files use an obsolete
IPA transcription system, so the student should rework the data.

The long answer follows...

To summarize, there are 3 ways of displaying phonetic transcriptions
(or rather there were, as only 2 are still conventional). There are
several other systems, but no need to complicate things: conceptually
they all fall into one of these 3 categories:

- the old one consists in using specific fonts that display phonetic
characters at the same positions (the 256 positions in the font table)
that regular fonts use for ordinary characters. So one had to change
font in order to display phonetics AND if one didn't own the specific
font used by the original author, one could never be sure that a
replacement font would do the job, as the same phonetic character may
sit at different positions in the encoding in different fonts...

- the current one (for at least 10 years, I would say) consists in
using Unicode fonts and taking advantage of the IPA range, for which
several fonts provide glyphs (among which Doulos SIL and DejaVu, which
respectively provide serif and sans-serif IPA glyphs along with the
"standard" (=lots of) characters).

- an alternative solution (especially good for computer manipulation)
is SAMPA and X-SAMPA (http://www.phon.ucl.ac.uk/home/sampa/), two
related systems that use only characters in the ASCII range and that,
provided one knows the coding conventions, let anyone transcribe
phonetics even with a typewriter! This is often a good choice for
analysing data by computer, as one does not need to know the Unicode
hexadecimal number to type when manipulating the characters. But it is
sometimes desirable to have both SAMPA and Unicode coding in the same
file (automatic conversion from one to the other is rather easy), as
SAMPA is easier to use when manipulating character strings on the
keyboard but IPA Unicode glyphs are easier to interpret for most
linguists when reading / looking at the data.
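
To give an idea of how easy the automatic conversion can be, here is a
minimal sketch (in Python, only because it is quick to try out; the
same logic carries over to R). The table is a small subset of X-SAMPA
that I picked for illustration, not the full standard, and the names
XSAMPA_TO_IPA and xsampa_to_ipa are mine. Longer symbols are tried
first so that two-character codes such as r\ are not split.

```python
# Partial X-SAMPA -> IPA table: an illustrative subset, NOT the full standard.
XSAMPA_TO_IPA = {
    "S": "\u0283",    # ʃ  voiceless postalveolar fricative
    "Z": "\u0292",    # ʒ  voiced postalveolar fricative
    "T": "\u03b8",    # θ  voiceless dental fricative
    "D": "\u00f0",    # ð  voiced dental fricative
    "N": "\u014b",    # ŋ  velar nasal
    "@": "\u0259",    # ə  schwa
    "I": "\u026a",    # ɪ
    "E": "\u025b",    # ɛ
    "O": "\u0254",    # ɔ
    "{": "\u00e6",    # æ
    "r\\": "\u0279",  # ɹ  (two-character X-SAMPA symbol: r followed by a backslash)
    ":": "\u02d0",    # ː  length mark
    '"': "\u02c8",    # ˈ  primary stress
}


def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA, leaving unknown characters unchanged."""
    # Try the longest symbols first so that multi-character codes win.
    symbols = sorted(XSAMPA_TO_IPA, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for symbol in symbols:
            if text.startswith(symbol, i):
                out.append(XSAMPA_TO_IPA[symbol])
                i += len(symbol)
                break
        else:
            # No table entry matched: plain letters such as "p" map to themselves.
            out.append(text[i])
            i += 1
    return "".join(out)


print(xsampa_to_ipa("SIp"))  # ʃɪp
print(xsampa_to_ipa("TIN"))  # θɪŋ
```

Going the other way (IPA to X-SAMPA) is the same lookup with the table
reversed, which is why keeping both codings in one file costs little.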

Depending on what you plan to do with the phonetic transcripts in the
analysis process, there may be arguments in favor of either
SAMPA/X-SAMPA or IPA or both.

So... Apart from the fact that the tabulated data will be a real pain to
organize, because the data coding does not seem to have been designed
with statistical analysis in mind (but that was not part of the
question), I see that the font used to display phonetic characters is:
"Ipa-samd Uclphon1 SILDoulosL" (no technical relationship at all with
the Unicode Doulos font from SIL mentioned above). Here, LibreOffice
displays nothing but "squares". Obviously I haven't got this font on my
computer, but I can read the expected font name, so I had a quick look
on the net and found this page:

http://www.phon.ucl.ac.uk/shop/fonts.php (which is obviously where it
comes from, as this was, years ago, a font disseminated by the speech
community at UCL, as its name may imply).

which states, with clear warnings, that:

"Please note: These fonts are now "legacy fonts": obsolete,
symbol-encoded fonts. Their use in new documents is discouraged. If you
decide to download and use these, please note there is no user
support for them. If your university or organization requires the use
of these fonts, please request they change their requirement to one of
the Unicode-encoded font which contains the complete IPA repertoire.
Many such fonts are now available, and several are supplied with all
new computers. Others are available from SIL."

Unfortunately, this clearly corresponds to the first case mentioned
above: use of an obsolete IPA transcription system that requires a
specific font and, most of all, makes data transfer particularly
difficult if not impossible, due to discrepancies between positions in
the font encoding and "standard" glyph (or shape) representations.
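
To make the difference concrete: with a symbol-encoded font, the text
stored in the file is made of ordinary characters from the 0-255 range
and only the font makes them look phonetic, whereas Unicode IPA stores
dedicated code points. The sketch below (Python, for illustration only;
the function names are mine) lists the code points of a string and
flags text that never leaves the 0-255 range. This is only a heuristic:
a short Unicode transcription such as "pat" also stays below 256, so it
makes sense for whole files rather than single words.

```python
import unicodedata


def describe_code_points(text):
    """Return (character, 'U+XXXX', Unicode name) for each character in text."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def uses_only_legacy_range(text):
    """True if every character lies in the 0-255 range (heuristic, see above)."""
    return all(ord(ch) < 256 for ch in text)


# A Unicode IPA transcription contains code points above 255 ...
unicode_ipa = "\u0283\u026ap"  # ʃɪp
for entry in describe_code_points(unicode_ipa):
    print(entry)
print(uses_only_legacy_range(unicode_ipa))  # False

# ... whereas text meant for a symbol-encoded font is just ordinary characters.
print(uses_only_legacy_range("SIp"))  # True
```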

I'm certain that this message has been on the UCL web site for several
years now! Though one may question the wisdom of keeping such fonts
available for download, one cannot say it's not clear from their
web page that they should not be used.

So, first step... tell the student to use "state-of-the-art" coding
for phonetic transcriptions (which is either IPA with Unicode
encoding or SAMPA), which means that he/she must rework all
the transcriptions in his/her files.

Perhaps, while doing that, tell him/her to think about a better
solution for storing data than these tables where 90% of the cells
are empty...
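
One possible alternative, given here only as a sketch, is a "long"
layout with one row per observation, so that empty cells simply
disappear. The column names below (speaker, word1, ...) are made up for
the example and are not those of the student's files; the code is in
Python for illustration, but the same reshaping is routine in R.

```python
def wide_to_long(rows, id_column):
    """Turn sparse wide rows into one record per non-empty cell."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            # Skip the identifier itself and any empty cell.
            if column == id_column or value in ("", None):
                continue
            long_rows.append(
                {id_column: row[id_column], "variable": column, "value": value}
            )
    return long_rows


# Hypothetical wide table in which most cells are empty.
wide = [
    {"speaker": "s1", "word1": "SIp", "word2": "", "word3": ""},
    {"speaker": "s2", "word1": "", "word2": "", "word3": "TIN"},
]

for record in wide_to_long(wide, "speaker"):
    print(record)
```

Six cells, four of them empty, become two records that are
straightforward to filter and summarize.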

Sorry to be of no help here, but I really see no point in trying to
solve issues when obsolete solutions are the main reason for these
issues...

Of course someone on the list may be more optimistic than I am.

Anyway, once the student has come back with either SAMPA or Unicode
encoding, I would happily provide advice on working with IPA
characters within R.

Yours sincerely.
Olivier.

-- 
  Olivier Crouzet, PhD
  Laboratoire de Linguistique -- EA3827
  Université de Nantes
  Chemin de la Censive du Tertre - BP 81227
  44312 Nantes cedex 3
  France

     phone:        (+33) 02 40 14 14 05 (lab.)
                   (+33) 02 40 14 14 36 (office)
     fax:          (+33) 02 40 14 13 27
     e-mail:       olivier.crouzet at univ-nantes.fr

  http://www.lling.univ-nantes.fr/