Determine language of body of text?

Mark B

Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English. I'm
just trying to get the main language used.
 
Arto Viitanen

I thought of the following: take a dictionary of the common words of each
language you are interested in. Then, for each word in the text, count how
many times it occurs. But this needs every form of each word, for
example "word" and "words". In some languages this is not practical,
since a single word can have so many variations.

But check the article "Language Trees and Zipping" by Dario Benedetto,
Emanuele Caglioti and Vittorio Loreto, downloadable from
http://xxx.uni-augsburg.de/format/cond-mat/0108530 . There also seems to
be a Perl implementation of the algorithm:
code.activestate.com/recipes/355807 . If I understood it right, the zip
archiver learns the sequences in its input, and the more it has learned
(i.e. the bigger the text), the better it compresses. So you teach zip
with English text and then give it two texts A and B: if A is English, it
compresses better than B, which is Italian.
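
A minimal Python sketch of that idea, as I understand it. This is not the code from the article or the recipe: it uses zlib (the same deflate method zip uses), and the function names and the two tiny reference texts are made up for illustration. Real reference texts should be much longer, although deflate only looks back about 32 KB, so very large ones will not help with this particular compressor.

```python
import zlib

def extra_bytes(reference, text):
    """Compressed bytes added by appending `text` to `reference`."""
    ref = reference.encode("utf-8")
    both = ref + b" " + text.encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))

def guess_by_compression(text, references):
    """Pick the language whose reference text makes `text` cheapest to add."""
    return min(references, key=lambda lang: extra_bytes(references[lang], text))

# Illustrative reference texts only (assumption): use real corpora in practice.
REFERENCES = {
    "en": ("the quick brown fox jumps over the lazy dog and then the dog "
           "runs after the fox through the old town while the people of "
           "the town watch from their windows"),
    "it": ("la volpe veloce salta sopra il cane pigro e poi il cane corre "
           "dietro alla volpe per la vecchia città mentre la gente della "
           "città guarda dalle finestre"),
}

if __name__ == "__main__":
    print(guess_by_compression(
        "the people watch the dog and the fox from their windows", REFERENCES))
```

The smaller the increase in compressed size, the more the new text looks like the reference, so the language with the smallest increase wins.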
 
MC

Each language has a few words that are extremely common, such as "the", "a", "an", "of", "for" in English. You could look for the 5 or 10 most common words in each language, and see which language wins.

To find the most common words, use a word-frequency-table program to analyze some text samples.
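
Something like this rough Python sketch would do both steps. The word lists here are short examples I picked by hand (an assumption, not measured frequencies), and the function names are made up; you would build the real lists by running most_common_words over sample text for each language.

```python
import re
from collections import Counter

def words(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def most_common_words(sample, n=10):
    """Word-frequency table: the n most frequent words in a sample text."""
    return [word for word, _ in Counter(words(sample)).most_common(n)]

# Hand-picked example lists (assumption); replace with measured ones.
COMMON_WORDS = {
    "en": {"the", "a", "an", "of", "for", "and", "to", "in", "is", "that"},
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "ein", "eine", "nicht", "zu", "von"},
}

def guess_by_common_words(text):
    """Return the language with the most common-word hits, or None if no hits."""
    tokens = words(text)
    scores = {lang: sum(token in common for token in tokens)
              for lang, common in COMMON_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(guess_by_common_words("The cat sat on the mat for an hour."))
```

Because only the language with the most hits is reported, a little English inside a mostly French email should not change the result.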
 
rossum

As well as the other suggestions, have a look at the characters used:
umlauts or the ß (eszett) character appear in German, and acute or grave
accents in French.

Digram and trigram frequencies can also be a good indicator.

Essentially you are going to have to use some sort of statistical
method.
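
For the trigram idea, a minimal Python sketch could look like this. It compares trigram counts with cosine similarity, which is just one reasonable choice of statistic, and the sample texts and function names are placeholders made up for illustration. Real profiles need far more text per language.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping 3-character sequences, with whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count tables."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def guess_by_trigrams(text, samples):
    """Return the language whose sample profile is closest to the text."""
    target = trigram_profile(text)
    profiles = {lang: trigram_profile(s) for lang, s in samples.items()}
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))

# Placeholder training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "this is the house where the children are playing with their friends in the garden",
    "de": "das ist das haus in dem die kinder mit ihren freunden im garten spielen und lachen",
}

if __name__ == "__main__":
    print(guess_by_trigrams("the children are with their friends", SAMPLES))
```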

rossum
 
Dmitry Streblechenko

Look at the MailItem.InternetCodepage property (corresponds to the
PR_INTERNET_CPID MAPI property).

--
Dmitry Streblechenko (MVP)
http://www.dimastr.com/
OutlookSpy - Outlook, CDO
and MAPI Developer Tool
 
Mark B

Do all incoming emails have this? Even if they are not originating from an
Outlook client?

(It's for an Outlook 2007 Add-in C# BTW).
 
Dmitry Streblechenko

Most of them. But that really tells you more about the default code page of
the sender.
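
To show what the code page can and cannot tell you, here is a small Python sketch that maps a few common Windows code page numbers to a coarse script or region. The table and the function name are illustrative assumptions, the list is far from complete, and reading the value from Outlook is left out. Note that 1252 covers English, French, German and more, and 65001 (UTF-8) says nothing about language at all.

```python
# A few common Windows code page numbers and what they suggest.
CODEPAGE_HINTS = {
    932: "Japanese (Shift-JIS)",
    936: "Simplified Chinese (GBK)",
    949: "Korean",
    950: "Traditional Chinese (Big5)",
    1250: "Central European",
    1251: "Cyrillic",
    1252: "Western European",
    1253: "Greek",
    1254: "Turkish",
    1255: "Hebrew",
    1256: "Arabic",
    65001: "Unicode (UTF-8), no language hint",
}

def codepage_hint(codepage):
    """Return a coarse hint for a code page number, or None if unknown."""
    return CODEPAGE_HINTS.get(codepage)

if __name__ == "__main__":
    print(codepage_hint(950))
```

So it may help separate zh-TW from fr-FR, but not fr-FR from en-US.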

 
Arne Vajhøj

If you are willing to write some code, then you can detect
the language (but probably not the regional dialect).

* dictionary with common words
* special letters (acute and grave accents, umlauts etc.)
* distribution of letters
* distribution of pairs of letters
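
As an example of the special-letters check, a rough Python sketch. The character sets are small hand-picked assumptions, and several of these letters occur in more than one language, so treat the counts as a hint to combine with the other methods rather than as an answer.

```python
import unicodedata

# Hand-picked characteristic letters (assumption); sets can overlap in reality.
HINT_CHARS = {
    "de": set("äöüß"),
    "fr": set("éèêëàâçùûîïôœ"),
    "es": set("ñáíóú¿¡"),
}

def special_letter_hits(text):
    """Count, per language, how many characters of the text are in its hint set."""
    # NFC turns "a" + combining diaeresis into the single character "ä".
    lowered = unicodedata.normalize("NFC", text).lower()
    return {lang: sum(ch in chars for ch in lowered)
            for lang, chars in HINT_CHARS.items()}

if __name__ == "__main__":
    print(special_letter_hits("Größe und Länge"))
```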

Arne
 