How to list Hebrew words by order of frequency in a document

Gadi Spero

Hi all,

I'm looking for a way to list all the words in a document by order of
their frequency of appearance, so the most frequently used word gets
listed at the top, and the rest in descending order below.

The macro at http://tinyurl.com/2upbp4e does the job nicely (after
correcting some carriage returns that I assume got added when pasting
the code), however it only works with English words - words in Hebrew
are not counted.

Is there any easy way to set this up so it works for Hebrew words as
well? I don't know VBA, so I can't implement a solution on my own at
this point, but was hoping perhaps it would be simple enough for
someone to point me in the right direction.

MTIA,
Gadi
 
Jay Freedman

I'm not completely certain of this, but I think you need to change only one line of the code.

This line appears to be the one that prevents Hebrew words (or any "words" that sort outside the Unicode range from "A" to "z") from being counted:

If SingleWord < "A" Or SingleWord > "z" Then SingleWord = ""

To change it, I need to make some assumptions. Looking at the Arial Unicode font's Hebrew range, the letters run from alef (character number 05D0) to tav (character number 05EA). However, there are many vowel marks between character numbers 0591 and 05C4, plus a few characters from 05F0 to 05F4, and I'm not sure how these affect the way words are sorted. You can try replacing the line above with one or the other of these variations to see whether it makes any difference:

If SingleWord < ChrW(&H5D0) Or SingleWord > ChrW(&H5EA) Then SingleWord = ""
or
If SingleWord < ChrW(&H591) Or SingleWord > ChrW(&H5F4) Then SingleWord = ""

The first one retains only words that sort from alef to tav, while the second one retains words that sort anywhere within the larger range.
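
One caveat, assuming the macro uses VBA's default binary string comparison: both variations compare the whole word against a single character, so a word of two or more letters that begins with the last character of the range sorts after it and is discarded (in the same way that "zoo" sorts after "z" in the original line). A minimal sketch of an alternative is to test only the code of the word's first character. The function name IsHebrewWord is hypothetical and not part of the original macro, and it assumes SingleWord holds a single word with no leading spaces:

```vba
' Hypothetical helper, not part of the original macro.
' Returns True when the first character of the word is a Hebrew letter
' (alef, U+05D0, through tav, U+05EA).
Function IsHebrewWord(ByVal SingleWord As String) As Boolean
    Dim code As Long

    ' An empty string has no first character to inspect;
    ' the function then returns its default value, False.
    If Len(SingleWord) = 0 Then Exit Function

    ' AscW returns the code of the first character as a signed Integer;
    ' masking with &HFFFF& converts it to a value from 0 to 65535.
    code = AscW(SingleWord) And &HFFFF&

    IsHebrewWord = (code >= &H5D0 And code <= &H5EA)
End Function
```

If that assumption fits your copy of the macro, the filtering line could then read: If Not IsHebrewWord(SingleWord) Then SingleWord = ""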

Also, Hebrew has no upper and lower case, so the alef-to-tav range covers every letter, including the final forms, in Unicode fonts such as Arial Unicode and Lucida Sans Unicode. If you're using a non-Unicode font, you'll have to determine the corresponding numeric values (the '&H' means 'hexadecimal', or you can drop the '&H' and write the same values in decimal).
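
If you do need to look up a character's numeric value, here is a small sketch of one way to do it. The macro name is made up for illustration. It assumes you have selected the character in the document (or placed the cursor just before it), and it prints to the VBA editor's Immediate window (Ctrl+G):

```vba
' Hypothetical helper for finding the values to put inside ChrW(...).
Sub ShowCodeOfSelectedCharacter()
    Dim ch As String

    ' Take the first character of the current selection.
    ch = Left$(Selection.Text, 1)
    If Len(ch) = 0 Then Exit Sub

    ' Print the code in decimal and in hexadecimal.
    ' For alef this prints 1488 and 5D0, so ChrW(&H5D0)
    ' and ChrW(1488) refer to the same character.
    Debug.Print "Decimal: " & AscW(ch) & "  Hex: " & Hex(AscW(ch))
End Sub
```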
 
Gadi Spero

Nice - works perfectly. I ended up using the first version, because I
don't need to work with the vowel marks, etc.
Thanks a lot!
Gadi
 
Jay Freedman

Excellent! Thanks for letting me know what worked.
 
