Word 2007 byte value representation of Unicode characters

wjm

I've inserted the character U+0419 (hex 0419) into my Word 2007 document using
Insert > Symbol with the Arial Unicode MS font.

When I append ".zip" to the document's file name, extract the archive into a
separate folder, and open the "document.xml" file, I see the following markup:

<w:r>
  <w:rPr>
    <w:rFonts w:ascii="Arial Unicode MS" w:eastAsia="Arial Unicode MS"
              w:hAnsi="Arial Unicode MS" w:cs="Arial Unicode MS" w:hint="eastAsia"/>
  </w:rPr>
  <w:t>Й</w:t>
</w:r>

When I put my cursor on the Unicode character and view its hexadecimal byte
representation, it appears as "D0 99".

What is the logic used to translate "0419" as "D0 99"?
 
Pesach Shelnitz

Hi,

You didn't say what software you used to reveal the hex codes of the
characters, so I couldn't reproduce exactly what you observed. Instead, I
copied and pasted the markup from your posting into a blank Word document,
placed the cursor after the Cyrillic capital letter short I (i kratkoye, Й),
and pressed Alt+X to reveal the hexadecimal value of this character. The
result is 0419. Thus, the character that you inserted (Unicode hex 0419) is
the same character that appears in the markup in your posting, and no
translation has been performed. Am I missing something in your question?

Thanks,
Pesach Shelnitz
 
Tony Jollans

I can't reproduce this either. Where are you looking at the (D099) hex code
and how are you exposing it?
 
Tony Jollans

OK. I understand now. What you are seeing is the actual encoded data as
stored. The file is stored in UTF-8 (as declared at the beginning of the
file), and U+0419 encodes as the two bytes D0 99 in UTF-8.
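
To spell out the arithmetic behind that conversion, here is a minimal Python
sketch. The function name utf8_two_byte is made up for illustration, and it
only handles code points from U+0080 to U+07FF, which UTF-8 stores as two
bytes of the form 110xxxxx 10xxxxxx. It is a sketch of the bit layout, not
what Word itself runs.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration only; real code should simply
    call str.encode("utf-8").
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    # Lead byte: the marker bits 110 followed by the upper 5 of the 11 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: the marker bits 10 followed by the lower 6 payload bits.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])


# U+0419 = 0b100_0001_1001, which splits into 10000 and 011001.
encoded = utf8_two_byte(0x0419)
print(encoded.hex(" ").upper())  # D0 99

# Cross-check against Python's built-in encoder.
assert encoded == "\u0419".encode("utf-8")
```

So 0419 is the code point (the character's number), while D0 99 is how that
number is packed into bytes on disk. Alt+X in Word shows the former; a hex
viewer on document.xml shows the latter.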
 
wjm

Great. Thanks.



 