Visually Breaking Notepad
posted on Wednesday, July 05, 2006 11:12 AM
by ericc
Here's an interesting post I stumbled upon made by Michael Kaplan:
http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
After seeing his post, I decided to try it out. Here are screenshots of Notepad before and after:
Before:

After:

Interesting. What I did was create
the file using Windows Explorer (Right Click, New, Text Document) and
then dragged/dropped the file onto Notepad, added the magic text, saved
the file, then closed Notepad. Next I reopened Notepad and then
dragged/dropped the file back into Notepad. The After screenshot above
is what Notepad displayed the second time.
With the Windows XP version of Notepad, Microsoft included an Encoding drop-down box in the Save As dialog.

Wondering
if this was the problem, I recreated the file with the magic text, but
this time with Notepad's Encoding explicitly set to ANSI. Nope, no good. The Open
dialog showed the file with Unicode encoding, and once again the file
opened up with the Asian text.


As Michael Kaplan explained in his blog, this is just another case of a problem with the IsTextUnicode API. Given a raw buffer of bytes, a program doesn't know what kind of text it holds, so it has to call a function like this one to
guess the encoding. Since a standard char* string
doesn't have any space for metadata (i.e. the type of
encoding the characters use), functions such as
IsTextUnicode can only "intelligently" (see the documentation for the API) scan the contents
and guess whether the string is Unicode or not. As you can see from
these screenshots, the guess isn't always right. Here the ANSI bytes
were interpreted as Unicode (UTF-16), so Notepad displayed Asian
characters rather than the Western alphabet characters.
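To see why a wrong guess produces Asian characters, here is a minimal Python sketch of the misreading itself. It does not call IsTextUnicode; it only takes ASCII bytes and decodes them as UTF-16 little-endian, which is what happens once the guess goes wrong. The sample string is an assumption (the post only refers to "the magic text"), and the helper names are made up for illustration.

```python
# Sample text is an assumption; the post above only calls it "the magic text".
SAMPLE = "this app can break"


def misread_as_utf16le(text: str) -> str:
    """Encode text as plain ASCII bytes, then decode those bytes as UTF-16 LE."""
    raw = text.encode("ascii")       # one byte per character, as in an ANSI save
    return raw.decode("utf-16-le")   # every two bytes become one UTF-16 code unit


def is_cjk_ideograph(ch: str) -> bool:
    """True if ch falls in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


garbled = misread_as_utf16le(SAMPLE)

# 18 ASCII bytes collapse into 9 two-byte characters.
print(len(SAMPLE), "bytes ->", len(garbled), "characters")
print([f"U+{ord(ch):04X}" for ch in garbled])
print("all CJK ideographs:", all(is_cjk_ideograph(ch) for ch in garbled))
```

Pairs of lowercase letters and spaces happen to form 16-bit values that land in the CJK range, which is why the misread file looks like plausible Asian text rather than obvious garbage.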
Now, if you explicitly save the text as Unicode (Notepad's name for UTF-16 little-endian), Notepad writes a byte order mark (BOM) at the beginning of the file to indicate that the file is Unicode. In this case, there is no confusion over which encoding the file uses, and it opens properly in Notepad. Here's what the files look like in hex:
ANSI file:

Unicode file:

Notice the 0xFF 0xFE bytes at the start of the Unicode file; they are the byte order mark that indicates UTF-16 encoding. Their order (FF before FE) also means that the file is stored in little-endian format.
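The difference between the two files can be reproduced with a short Python sketch. It builds the bytes of an ANSI-style save and a Unicode-style save of the same text and then checks for a byte order mark at the start. The sample text "abc" and the function name sniff_bom are illustrative assumptions, not anything Notepad itself uses, and UTF-32 marks are ignored for brevity.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading byte order mark, or None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None                                # no mark: the reader has to guess


text = "abc"  # hypothetical file contents

ansi_file = text.encode("ascii")
unicode_file = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print("ANSI:   ", ansi_file.hex(" "), "->", sniff_bom(ansi_file))
print("Unicode:", unicode_file.hex(" "), "->", sniff_bom(unicode_file))
```

With the mark present, the reader never needs to guess; without it, the only option left is a heuristic scan of the contents.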
Curious as to what other programs might be affected, I tried a few more. Here are the screenshots.
A favorite amongst many is Notepad2:
Before:

After:

Oops, looks like Notepad2 also has this problem. Another editor I like to use is PSPad:
Before:

After:

Looks like PSPad escapes unscathed.
Luckily Visual Studio doesn't have this problem:
Before:

After:

Nor does Word Beta 2:
Before:

After:
