Determine language of body of text?

Mark B · Oct 23, 2008

Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English. I'm
just trying to get the main language used.

Arto Viitanen · Oct 23, 2008

Mark said:
Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English.
I'm just trying to get the main language used.

I thought of the following: take a dictionary of the common words of each
language you are interested in. Then, for each word in the text, count how
many times it occurs. But this needs several forms of each word, for
example "word" and "words". In some languages this is not practical,
since there can be so many variations of a single word.

But check the article "Language Trees and Zipping" by Dario Benedetto,
Emanuele Caglioti and Vittorio Loreto, downloadable from
http://xxx.uni-augsburg.de/format/cond-mat/0108530 . It seems there is
also a Perl implementation of the algorithm:
code.activestate.com/recipes/355807 . If I understood it right, the zip
archiver learns the sequences it sees, and the more it learns (i.e. the
bigger the text), the better it compresses. So when you teach zip with
English text and then give it two texts A and B, A is compressed better
if A is English and B is Italian.
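
A minimal sketch of that compression idea, in Python for brevity (the same approach should port to C#). It is not the code from the article or the recipe: it assumes zlib as the compressor, and the two reference texts are toy placeholders that a real detector would replace with much larger samples per language. The names `REFERENCE`, `extra_bytes` and `guess_language` are made up for this example.

```python
import zlib

# Toy reference texts (assumptions for illustration only).
REFERENCE = {
    "en": ("the quick brown fox jumps over the lazy dog and then it runs "
           "away from the house because it is afraid of the people who "
           "live there"),
    "it": ("la volpe veloce salta sopra il cane pigro e poi scappa via "
           "dalla casa perché ha paura delle persone che vivono lì"),
}


def extra_bytes(reference: str, sample: str) -> int:
    """Bytes the sample adds when compressed together with the reference."""
    ref_only = zlib.compress(reference.encode("utf-8"), 9)
    combined = zlib.compress((reference + " " + sample).encode("utf-8"), 9)
    return len(combined) - len(ref_only)


def guess_language(sample: str, references=REFERENCE) -> str:
    """Pick the language whose reference text makes the sample cheapest."""
    return min(references,
               key=lambda lang: extra_bytes(references[lang], sample))
```

The smaller the increase in compressed size, the more the sample looks like text the compressor has already "learned" from that reference.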

Mark B · Oct 23, 2008

http://code.google.com/apis/ajaxlanguage/documentation/#Detect

MC · Oct 23, 2008

Each language has a few words that are extremely common, such as "the", "a", "an", "of", "for" in English. You could look for the 5 or 10 most common words in each language, and see which language wins.

To find the most common words, use a word-frequency-table program to analyze some text samples.
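
A rough sketch of this in Python (chosen for brevity; it is not the original poster's code). The word lists below are small hand-picked assumptions rather than measured frequency tables, and `top_words`, `score_languages` and `guess_language` are hypothetical helper names.

```python
import re
from collections import Counter

# Hand-picked common words per language (assumed, not measured).
COMMON_WORDS = {
    "en": {"the", "a", "an", "of", "for", "and", "to", "in", "is", "that"},
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"},
}


def words_of(text):
    """Lower-cased runs of letters (Unicode-aware, digits excluded)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_words(sample_text, n=10):
    """Word-frequency table: the n most common words in a text sample."""
    return [word for word, _ in Counter(words_of(sample_text)).most_common(n)]


def score_languages(text):
    """Count how many words of the text hit each language's common-word list."""
    counts = Counter(words_of(text))
    return {lang: sum(counts[w] for w in vocab)
            for lang, vocab in COMMON_WORDS.items()}


def guess_language(text):
    """Return the best-scoring language, or None if nothing matched."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Running `top_words` over a few sample texts per language is one way to build the `COMMON_WORDS` lists instead of typing them by hand.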

rossum · Oct 23, 2008

Mark said:
Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English. I'm
just trying to get the main language used.

As well as the other suggestions, have a look at the characters used:
umlauts or the ß (eszett) character appear in German, and acute or grave
accents in French.

Digram and trigram frequencies can also be a good indicator.

Essentially you are going to have to use some sort of statistical
method.
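
One possible way to use trigram frequencies, sketched in Python. This is only an illustration under assumptions: the training texts are tiny made-up samples, and cosine similarity is just one reasonable way to compare trigram counts, not something prescribed in this thread.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping 3-character sequences, with whitespace normalised."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count tables."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Tiny illustrative training texts (assumptions); use real corpora in practice.
TRAINING = {
    "en": ("this is a short sample of english text that is used only to "
           "build the profile for the example and nothing more"),
    "de": ("dies ist ein kurzer deutscher beispieltext der nur dazu dient "
           "das profil für das beispiel zu erstellen und nicht mehr"),
}
PROFILES = {lang: trigrams(text) for lang, text in TRAINING.items()}


def guess_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    sample = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(sample, PROFILES[lang]))
```

Digrams work the same way with a window of 2 instead of 3.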

rossum

JP · Oct 23, 2008

There's some code here that writes the internet headers of an email to
a text file, which you could then parse for the language. For example,
some email headers include a "Content-Type" line which indicates the
character set used.

e.g.: Content-Type: text/plain; charset=US-ASCII

http://blogs.technet.com/kclemson/a...internet-headers-of-a-message-in-outlook.aspx
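
Once the headers are in a text file, parsing them could look roughly like the Python sketch below, which uses the standard `email` module. The charset-to-language table is an assumption for illustration and far from complete, checking a Content-Language header first is an addition not mentioned above (many messages will not carry it), and `language_hint` is a hypothetical name. A charset such as US-ASCII or UTF-8 says nothing about the language, so the function returns None there.

```python
from email import message_from_string

# Assumed mapping from a few language-specific charsets to a likely language.
CHARSET_HINTS = {
    "big5": "zh-TW",
    "gb2312": "zh-CN",
    "iso-2022-jp": "ja",
    "shift_jis": "ja",
    "euc-kr": "ko",
    "koi8-r": "ru",
}


def language_hint(raw_headers):
    """Return a language hint from the headers, or None if there is none."""
    msg = message_from_string(raw_headers)
    explicit = msg.get("Content-Language")
    if explicit:
        return explicit.strip()
    # get_content_charset() returns the charset lower-cased, or None.
    charset = msg.get_content_charset()
    return CHARSET_HINTS.get(charset) if charset else None
```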

HTH,
JP

Dmitry Streblechenko · Oct 23, 2008

Look at the MailItem.InternetCodepage property (corresponds to the
PR_INTERNET_CPID MAPI property).

--
Dmitry Streblechenko (MVP)
http://www.dimastr.com/
OutlookSpy - Outlook, CDO
and MAPI Developer Tool
-

Mark B · Oct 23, 2008

Do all incoming emails have this? Even if they are not originating from an
Outlook client?

(It's for an Outlook 2007 add-in written in C#, BTW.)

Dmitry Streblechenko · Oct 23, 2008

Most of them. But that really tells you more about the default code page of
the sender.


Arne Vajhøj · Oct 26, 2008

Mark said:
Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English.
I'm just trying to get the main language used.

If you are willing to write some code, then you can detect
the language (but probably not the regional dialect).

* dictionary with common words
* special letters (forward and backward accents, umlauts etc.)
* distribution of letters
* distribution of pairs of letters

Arne
