Collating sequence: feature or bug?

Fernando Cabral

My documents have accented characters. I found an inconsistency that
completely destroyed my performance. On my machine the letter
"a" (lowercase 'a') comes before "à" (lowercase 'a' with a grave accent)
when sorted. That's how it should work. Fine.

Nevertheless, when I sort "a tarde" against "à tarde" the positions
are reversed! For those of you who are not familiar with accents
(or perhaps can't see them on your display), it is like having

"a"
"b"

but

"b c"
"a c"

That is, the collating sequence for "à" changes place depending on
the character that follows it!


Now for the practical problem.

I have two lists. One consists of single words. It is usually huge,
like hundreds of elements. The other may have either words or
sentences. This last one may be as small as a single word or as big
as several thousand words/sentences.

My mission is to find in the first list every occurrence of words and
sentences that are in the second list.

Example: if the first list has a, b, c, d, e and the second list contains
b, d, then I'll have to find them (b and d).

Now, if the first list has 100000 elements and the second one has 10000,
this entails 100000 x 10000 = 1,000,000,000 comparisons. For strings
sometimes as long as 50 characters, this takes a long time to complete.

Now, the simplest way to improve this is "shortening" the second list
at each pass. Since both lists are sorted, I should be able to say: well,
next word from the first list begins with letter "d". This means from now
on I can skip all the elements in the second list that are less than "d".

Additionally I can stop comparing as soon as the word in the first
list is bigger than the last word in the second list.

So, instead of having to do 1,000,000,000 comparisons, I may be
able to make do with 100,000 or even fewer. More realistically,
perhaps 500,000.
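
The skip-ahead idea can be sketched as a single pass over both lists.
This is only a minimal sketch: FindCommon and Demo are made-up names,
and it assumes both arrays were sorted in the same binary order that
StrComp(..., vbBinaryCompare) uses. That sidesteps the locale collating
sequence rather than explaining it.

```vba
Option Explicit

' Hypothetical helper: returns the entries of listA that also occur in listB.
' Assumes both arrays are sorted ascending in vbBinaryCompare order.
Function FindCommon(listA() As String, listB() As String) As Collection
    Dim found As New Collection
    Dim i As Long, j As Long

    i = LBound(listA)
    j = LBound(listB)

    ' Each pass advances i or j, so the loop stops as soon as either list ends.
    Do While i <= UBound(listA) And j <= UBound(listB)
        Select Case StrComp(listA(i), listB(j), vbBinaryCompare)
            Case -1     ' listA(i) sorts earlier: move on in the first list
                i = i + 1
            Case 1      ' listB(j) sorts earlier: it cannot match any later entry
                j = j + 1
            Case Else   ' equal: record the match, keep j in case listA repeats it
                found.Add listA(i)
                i = i + 1
        End Select
    Loop

    Set FindCommon = found
End Function

Sub Demo()
    Dim a() As String, b() As String
    a = Split("a,b,c,d,e", ",")
    b = Split("b,d", ",")
    Debug.Print FindCommon(a, b).Count   ' prints 2 (b and d)
End Sub
```

Because every pass moves one of the two indexes forward, lists of 100,000
and 10,000 elements need at most 110,000 comparisons.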

It works fine, as long as I don't have words that begin either
with an "a" or with an "à". Alas! I can't expect that to happen in any
real text.

For the time being I am stuck with the inefficient solution.

Question: is there a way to change that behaviour, that is,
change the collating sequence from VBA, Word or perhaps
wearing the Windows XP administrator's hat?

- fernando
 
Jezebel

I can't replicate the problem. On my machine, the sort sequence is
consistently in this order, whatever text follows:
a
á
à
â
ä
ã
å
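
To compare results across machines, a small check along these lines
might help. It is only a sketch, and CompareAccents is a made-up name.
It assumes the lists are ordered with VBA's text comparison, which may
not be what actually sorts them (Word's own sort could behave
differently). ChrW(224) stands in for "à" so the code does not depend
on the editor's code page.

```vba
Option Explicit

' Hypothetical check: prints how "a" and "à" compare, alone and before " tarde".
Sub CompareAccents()
    Dim plain As String, grave As String
    plain = "a"
    grave = ChrW(224)   ' lowercase a with grave accent

    ' Text comparison follows the locale, so these results may vary by machine.
    Debug.Print "text, alone:     "; StrComp(plain, grave, vbTextCompare)
    Debug.Print "text, + tarde:   "; StrComp(plain & " tarde", grave & " tarde", vbTextCompare)

    ' Binary comparison uses character codes only (97 < 224), so both give -1.
    Debug.Print "binary, alone:   "; StrComp(plain, grave, vbBinaryCompare)
    Debug.Print "binary, + tarde: "; StrComp(plain & " tarde", grave & " tarde", vbBinaryCompare)
End Sub
```

If the two text results differ in sign, the ordering really does depend
on what follows the accented letter. The binary results do not depend on
the locale.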
 
Helmut Weber

Hi Fernando,

what sorting algorithm are you using?

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
John Nurick

Hi Fernando,

It sounds as if you are sorting your list and then using a sequential
search. ISTM it would be much faster to put the first list into a
Dictionary object and then just look up the words in the second list.
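Air code, assuming both lists are arrays of strings:

Function FindWords(firstList As Variant, secondList As Variant) As Object
    Dim WordList As Object
    Dim WordsFound As Object
    Dim word As Variant
    Dim item As Variant

    'A Dictionary compares keys in binary mode by default, so lookups
    'are case-sensitive and do not depend on any sort order
    Set WordList = CreateObject("Scripting.Dictionary")
    Set WordsFound = CreateObject("Scripting.Dictionary")

    'Build dictionary of words in first list
    '(assigning by key tolerates duplicates; .Add would raise an error)
    For Each word In firstList
        WordList(word) = True
    Next word

    'Compare words in second list with dictionary
    For Each item In secondList
        For Each word In Split(item, " ")
            If WordList.Exists(word) Then
                WordsFound(word) = word
            End If
        Next word
    Next item

    'WordsFound now contains the words in the second list that
    'exist in the first list
    Set FindWords = WordsFound
End Function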

On Sat, 15 Jul 2006 18:42:01 -0700, Fernando Cabral

[snip]
 
