Encoding

Hans L

I am trying to learn about encoding, and have read a lot about it. However,
what I can't find much on is some very practical issues that would explain a
lot (at least to me).

I figured that the programmers' forum would be the place to ask this simple
question. It may seem naive, but I do not think it is.

I have Word 2000 (Office 2000). When I open a new document, what is it that
determines the encoding of that document?

And, a follow-up. I can open rtf files with a text editor and see the
encoding (or code page, a term I understand some people do not like). But
how do I see the same for a .doc file?

Believe me, this means a lot to me. Thanks for your serious answer.

Hans L
 
Klaus Linke

Hi Hans,

Word docs use Unicode, which can encode about every character from every
language (plus lots of symbols...).

Encoding issues used to cause trouble about ten years ago (Word 6...), when
each character was encoded by one byte.
Since one byte allows only 256 different characters, you had to work with
different code pages for different languages.
If you used, say, Greek or Cyrillic characters, the programs (Word in the
old days) needed to switch back and forth between different code pages.
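
To make the one-byte limitation concrete, here is a minimal Python sketch. It is not anything Word itself does; the byte values and the `decode_all` helper are chosen purely for illustration. It shows the same three bytes being read as Swedish, Greek, or Cyrillic letters depending on which code page is assumed:

```python
# The same bytes mean different characters under different code pages.
RAW = bytes([0xE5, 0xE4, 0xF6])  # "åäö" as stored under Windows-1252


def decode_all(raw, codepages=("cp1252", "cp1253", "cp1251")):
    """Decode the same bytes under each code page (hypothetical helper)."""
    return {cp: raw.decode(cp) for cp in codepages}


if __name__ == "__main__":
    for codepage, text in decode_all(RAW).items():
        print(codepage, text)
```

In Unicode each of these letters has its own code point, so no such switching is needed.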

The only cases where you run into encoding issues today are when you save in
another format (HTML, text files...).

The RTF that Word 2000 saves also uses code pages, to ensure that the RTF
files can be read by older programs.
But usually you don't need to worry about that, unless you want to edit the
RTF by hand.
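
If you do look at the RTF by hand, non-ASCII characters typically show up as `\'hh` hex escapes, with the code page named by `\ansicpgN` in the header. The following is only a rough sketch, assuming a single-byte code page and a made-up sample string; the function name `decode_rtf_escapes` and the fallback to cp1252 are my own assumptions, and a real RTF reader does far more than this:

```python
import re


def decode_rtf_escapes(rtf):
    r"""Replace \'hh escapes using the code page named by \ansicpgN.

    Sketch only: handles single-byte code pages, one escape at a time.
    """
    match = re.search(r"\\ansicpg(\d+)", rtf)
    # Falling back to cp1252 when no \ansicpg is present is an assumption.
    codepage = "cp" + match.group(1) if match else "cp1252"

    def replace(m):
        byte = bytes([int(m.group(1), 16)])
        return byte.decode(codepage, errors="replace")

    return re.sub(r"\\'([0-9a-fA-F]{2})", replace, rtf)


# Hypothetical RTF fragment, not taken from a real file.
SAMPLE = r"{\rtf1\ansi\ansicpg1252 Sm\'f6rg\'e5sbord}"

if __name__ == "__main__":
    print(decode_rtf_escapes(SAMPLE))
```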

Regards,
Klaus
 
Hans L

Hello Klaus, and thank you for your response.

Okay, so MS Word is accounted for :)

Now, I have read quite a lot about Unicode, encoding, languages, fonts, code
pages, etc., and I, and a lot of translators that I know (my professional
peers), are still stumped when we export a file (usually rtf or doc) from our
translation program and we get squares instead of, for instance, the Swedish
letters åäöÅÄÖ, and when someone changes the font to something else, they
see, instead of squares, Asian-looking letters.

And in spite of having read all these things about Unicode, code pages, etc.,
we ask ourselves the only question that is relevant in that situation:
what do I do to get my Swedish letters back?

I had a case the other day with an rtf file, and by experimenting (talk
about fumbling in the dark), and opening the file with a text editor, I
figured out that "deff0" had been changed to "deff30". By changing it back
to "deff0", I had my Swedish letters back. (Now, I do not know if this is the
only thing that can screw up characters in rtf.)

However, if the same happened in a doc file, I still do not know what to do,
and that is why I have started to ask what I call "practical" questions about
encoding, for instance

- what is it that determines the encoding of a file when I create it with an
application (not only MS Word)?

- can I change the encoding of any file or is it impossible for some files
(I know that I can now do it with rtf files)?

- when I get a file that is out of whack (so to speak), what do I do?

I know, this is a mouthful, but for the millions of us rather sophisticated
computer users, these are the questions that count, and I personally think
that these are the questions one should begin with when learning about
Unicode and code pages and such.

Best regards,

Hans L
 
Jonathan West

> However, if the same happened in a doc file, I still do not know what to
> do, and that is why I have started to ask what I call "practical"
> questions about encoding, for instance
>
> - what is it that determines the encoding of a file when I create it with
> an application (not only MS Word)?

That is a matter determined by each individual application, and no useful
generalisation can be offered here.

> - can I change the encoding of any file or is it impossible for some
> files (I know that I can now do it with rtf files)?

Again, that depends on the individual application.

> - when I get a file that is out of whack (so to speak), what do I do?

If it is a doc file, make sure that you have the Arial Unicode MS font
installed, and format the entire document with that font. Although Word
documents are Unicode, not all fonts offer glyphs for all the defined
Unicode code points, and so the characters that don't have glyphs defined
for them will be displayed as squares. Arial Unicode MS does include glyphs
for most of the defined characters.

For other applications and file formats, you need to check the way the
individual application works.
 
Hans L

> If it is a doc file, make sure that you have the Arial Unicode MS font
> installed, and format the entire document with that font. Although Word
> documents are Unicode, not all fonts offer glyphs for all the defined
> Unicode code points, and so the characters that don't have glyphs defined
> for them will be displayed as squares. Arial Unicode MS does include
> glyphs for most of the defined characters.
>
> For other applications and file formats, you need to check the way the
> individual application works.

Thank you, Jonathan!

Hans L
 
Klaus Linke

Hi Hans,

It sounds as if your translation software writes buggy docs/rtfs. Are there
other formats you could try?
Or maybe there are options settings in the software you can play with.

The Swedish letters should be no problem at all, since they are in the old
Western Windows code page 1252.
The \deff you are seeing problems with defines the default font when the
font hasn't been specified.
The reader software (in your case Word) is then supposed to find a font that
has the necessary characters (say \deff161 for Greek: Word looks for a font
that has Greek characters).

I haven't found anything about the character set 30...
130 is "Korean Johab", which might explain the Asian characters you see.
0 should be ANSI (Windows code page 1252).

Maybe there's a bug fix available for your software? Or maybe you can get it
fixed by whoever wrote it?

Regards,
Klaus
 
Klaus Linke

> - when I get a file that is out of whack (so to speak), what do I do?

There's a tool from Microsoft to repair text that has been imported using
the wrong code page, but it's mostly for Eastern European code pages
(eefonts.dot).
It adds "Fix broken text" to the Tools menu. I think in some versions
(2002/2003?), it came as an optional install which you can choose in the
Office setup (add/remove features).
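
The general idea behind this kind of repair can be sketched in a few lines of Python. This is only an illustration of the principle, not how eefonts.dot is implemented; the `repair` helper and the Polish sample word are my own choices. The trick is to turn the wrongly decoded text back into its original bytes and decode those with the code page that should have been used:

```python
def repair(garbled, wrong="cp1252", right="cp1250"):
    """Undo a wrong-code-page import (hypothetical helper).

    Re-encode with the code page that was wrongly assumed to recover the
    original bytes, then decode them with the intended code page.
    """
    return garbled.encode(wrong).decode(right)


ORIGINAL = "Łódź"                          # Polish text, Central European
STORED = ORIGINAL.encode("cp1250")         # bytes as written to the file
GARBLED = STORED.decode("cp1252")          # imported with the wrong code page

if __name__ == "__main__":
    print(GARBLED)
    print(repair(GARBLED))
```

This only works when the wrong decoding did not lose any bytes; if characters were already replaced by question marks or squares in the data itself, the original cannot be recovered this way.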

I've just installed it, and it doesn't list Korean or other Asian languages.

Klaus
 
Klaus Linke

Sorry for the slow dribble of info, but another thing just occurred to me:
If it's an issue of font substitution, maybe it could help to go into "Tools
> Options > Compatibility > Font substitution" and choose a sensible font?
Or to install the missing font?

I've looked at the RTF specs, and the number following the \deff seems to
specify the font in the font list (...look for {\f30 in the RTF file), not
the character set as I thought.
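
To check which font a given `\deff` points at without reading the whole file by eye, something like the following Python sketch could help. It assumes simple font table entries of the form `{\fN... Name;}` with no nested groups, and the sample RTF string (including the font names) is made up for illustration; `default_font` is a hypothetical helper, not part of any RTF library:

```python
import re


def default_font(rtf):
    r"""Look up the font table entry that \deffN refers to (sketch only)."""
    deff = re.search(r"\\deff(\d+)", rtf)
    if deff is None:
        return None
    index = deff.group(1)
    # Find {\fN ...;} and make sure \f3 does not match \f30.
    entry = re.search(r"\{\\f" + index + r"(?!\d)([^{}]*);\}", rtf)
    if entry is None:
        return {"index": int(index), "name": None, "charset": None}
    body = entry.group(1)
    charset = re.search(r"\\fcharset(\d+)", body)
    name = re.search(r" ([^\\]+)$", body)  # text after the last control word
    return {
        "index": int(index),
        "name": name.group(1) if name else None,
        "charset": int(charset.group(1)) if charset else None,
    }


# Hypothetical RTF header, not taken from a real file.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff30{\fonttbl"
    r"{\f0\froman\fcharset0 Times New Roman;}"
    r"{\f30\fnil\fcharset129 Batang;}}}"
)

if __name__ == "__main__":
    print(default_font(SAMPLE))
    print(default_font(SAMPLE.replace(r"\deff30", r"\deff0")))
```

In this made-up sample, changing `\deff30` back to `\deff0` makes the default font a Western one again, which is the same kind of fix Hans found by experiment.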

Regards,
Klaus
 
Hans L

Nah, that is okay.

Did you find any explanation anywhere for the 0, 1, ... 30 and so on for
\deff? I have not found one yet.

As for doc files, someone reminded me to open the script editor, whereupon I
can see that Word documents that I create have the code page Windows-1252.
Which brings me back to my original question: what is it that gives my doc
documents code page 1252?

Hans L
 
Klaus Linke

> Did you find any explanation anywhere for the 0, 1, ... 30 and so on for
> \deff? I have not found one yet.

As I said, it's the index of the font in the font list ... look for {\f30 in
the RTF file.
If that is some weird font, or one you don't have installed, maybe you can
avoid it in the software that creates the docs/rtfs.

Is the translation software that produces these docs/rtfs something from a
big company? If they created their own fonts, maybe the Panose definition
(which helps Word find a good substitute font) is buggy. Is the font \f30
installed on your machine?

> As for doc files, someone reminded me to open the script editor, whereupon
> I can see that Word documents that I create have the code page
> Windows-1252.
> Which brings me back to my original question: what is it that gives my
> doc documents code page 1252?

The script editor saves the document in HTML, and then opens that.
The code page for HTML can be set in the Save dialog under Tools > Web
options > Encoding.
But I'm pretty sure it has nothing to do with whatever you see in the RTF.
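
To illustrate why the code page shown there belongs to the HTML export rather than to the doc file, here is a small Python sketch. The `export_html` function is a hypothetical stand-in for a "save as HTML" step, not Word's actual exporter. The same text produces different bytes, and a different `charset` declaration, depending on the encoding picked at save time:

```python
def export_html(text, charset):
    """Write text as a tiny HTML page in the given charset (hypothetical)."""
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        f"</head><body>{text}</body></html>"
    )
    # The encoding is chosen here, at export time, not by the source text.
    return page.encode(charset)


if __name__ == "__main__":
    for charset in ("windows-1252", "utf-8"):
        print(charset, export_html("åäö", charset))
```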

It almost certainly isn't Word's problem, and I'd try to fix the issue at
the source.


Regards,
Klaus
 
Hans L

Okay, I will follow your advice, Klaus. (Sorry for not responding quicker,
but my e-mail notification works for a while, then conks out. Things are
falling apart here :)

Hans L
 
