&nbsp; in XML files: where it comes from and how to solve the problem (oyaji21's column)


nbsp usage, the definitive, full answer (and you thought it was 42?)

Jeni Tennison.

> Could somebody explain to me WHY &amp; translates to &
> but &amp;nbsp; doesn't change at all?

Let's consider this simple stylesheet:

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <xsl:stylesheet version="1.0"
                   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:template match="/">
     <html>
       <head><title>Test</title></head>
       <body>
         <p>Non-breaking &amp;nbsp;space</p>
         <p>Non-breaking&#160;space</p>
       </body>
     </html>
   </xsl:template>

   </xsl:stylesheet>

This stylesheet is stored on the hard disk as a series of bytes. The bytes match characters according to the ISO-8859-1 encoding (see the encoding pseudo-attribute on the XML declaration?).

When the XML parser reads this in as an XML document, it decodes the bytes into Unicode characters. It also parses the document, recognising things like start tags (e.g. <p>), built-in entity references (e.g. &amp;) and character references (e.g. &#160;).

The parser knows that &amp; stands for an & character (because it knows XML) and knows that &#160; stands for a non-breaking space character (because it knows XML and Unicode).

The parser reports to the XSLT processor when elements occur and what characters text is made up of, but doesn't report whether a particular character was originally serialized as the plain character (an actual space character), an entity reference or a character reference.

As far as an XSLT processor is concerned, therefore, the following elements in the stylesheet (or in an XML source document) would all be reported as *exactly* the same (a p element containing a text node whose string value is a double-quote character):

     <p>"</p>
     <p>&#34;</p>
     <p>&#x22;</p>
     <p>&quot;</p>
     <p><![CDATA["]]></p>
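This equivalence can be checked with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the choice of Python and the variable names are assumptions made for illustration, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Five different serializations of the same one-character text node.
variants = [
    '<p>"</p>',
    '<p>&#34;</p>',
    '<p>&#x22;</p>',
    '<p>&quot;</p>',
    '<p><![CDATA["]]></p>',
]

# The parser reports only the resulting characters,
# not how they were written in the file.
texts = [ET.fromstring(v).text for v in variants]
print(texts)
```

Every entry in `texts` is the same one-character string, which is all an application sitting behind the parser gets to see.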

The two p elements, as serialized in the stylesheet, look like:

<p>Non-breaking &amp;nbsp;space</p>
<p>Non-breaking&#160;space</p>

For the first p element, the XML parser reports the string (here containing no escaping of any kind - every character is a literal character):

Non-breaking &nbsp;space

For the second p element, the XML parser reports the string (here containing an underscore character as a stand-in for a non-breaking space, since you can't see non-breaking spaces in emails):

Non-breaking_space
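As a rough illustration of what the parser reports for the two p elements, here is a sketch with Python's standard `xml.etree.ElementTree` (an assumed stand-in for the parser inside an XSLT processor; the variable names are hypothetical):

```python
import xml.etree.ElementTree as ET

first = ET.fromstring("<p>Non-breaking &amp;nbsp;space</p>").text
second = ET.fromstring("<p>Non-breaking&#160;space</p>").text

# first holds the six literal characters & n b s p ;
print(repr(first))
# second holds the single character U+00A0 (shown by repr as \xa0)
print(repr(second))
```

Only the second string contains a real non-breaking space; the first contains an ampersand followed by ordinary letters.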

The XSLT processor builds a result tree from the stylesheet, which contains these text nodes and looks something like:

    /
     +- html
         +- head
         |   +- title
         |      +- text: Test
         +- body
            +- p
            |   +- text: Non-breaking &nbsp;space
            +- p
                +- text: Non-breaking_space

This tree exists in memory. All the characters are Unicode characters.

Once the XSLT processor has finished its transformation, it serializes this result tree. There are three methods that it could use to serialize the result tree: xml, html and text, which is controlled by the method attribute of xsl:output. It could also use any encoding - any mapping of characters to bytes - which is controlled by the encoding attribute of xsl:output.

The most straightforward output method is the XML output method. In the XML output method, element nodes are serialized as a start tag, followed by content, followed by an end tag. Any characters in the element content that have to be escaped due to XML rules are escaped. So if you have a less-than sign in your text node, then it is automatically escaped to &lt;. If you have an ampersand in your text node then it is automatically escaped to &amp;. If you have a character that can't be represented by the encoding that you're using, then it is escaped using character references (e.g. &#160;).

Let's use a really really basic encoding, ASCII, which only covers 128 characters (and doesn't include non-breaking spaces). You can usually make your stylesheet generate ASCII with:

<xsl:output encoding="ASCII" />

The non-breaking space character isn't covered by ASCII, so the non-breaking space character has to be escaped in the serialization using a character reference. So the serialization of the output tree will look like:

   <html>
     <head><title>Test</title></head>
     <body>
       <p>Non-breaking &amp;nbsp;space</p>
       <p>Non-breaking&#160;space</p>
     </body>
   </html>

If you used an encoding that covers the non-breaking space character, such as ISO-8859-1 or UTF-8 or UTF-16, then the non-breaking space character would be output as a literal non-breaking space character, and you'd get (substituting _ for non-breaking space characters again):

   <html>
     <head><title>Test</title></head>
     <body>
       <p>Non-breaking &amp;nbsp;space</p>
       <p>Non-breaking_space</p>
     </body>
   </html>
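The effect of the output encoding on the XML serialization can be sketched with Python's `xml.etree.ElementTree` serializer. It is not an XSLT processor, but it follows the same rule of falling back to character references for characters the encoding cannot represent. The tree below is a cut-down version of the result tree, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A small result tree: one literal "&nbsp;" string, one real U+00A0.
body = ET.fromstring(
    "<body><p>Non-breaking &amp;nbsp;space</p>"
    "<p>Non-breaking&#160;space</p></body>"
)

# ASCII cannot represent U+00A0, so it is written as a character reference.
as_ascii = ET.tostring(body, encoding="us-ascii")
# UTF-8 can represent it, so it is written as raw bytes.
as_utf8 = ET.tostring(body, encoding="utf-8")

print(as_ascii)
print(as_utf8)
```

In both cases the ampersand of the first paragraph is escaped again as `&amp;`, which is why `&amp;nbsp;` "doesn't change at all".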

Trouble arises, however, when you try to view a document that's been saved using UTF-8 in an editor that doesn't support UTF-8. The editor always tries to interpret the sequence of bytes that it reads from the file as ISO-8859-1 characters. It's a bit like taking an English document and trying to read it as if it were written in German. Some of the words might make sense, but most of the time you get gobbledygook.

Specifically, because UTF-8 uses two bytes for characters such as the non-breaking space (anything from U+0080 to U+07FF) whereas ISO-8859-1 uses one, when you try to read a UTF-8 document as if it were ISO-8859-1, you see two characters for every one such character that you expect. For characters in the range U+0080 to U+00BF, the first byte of the UTF-8 sequence is the same as the byte that is used in ISO-8859-1 to mean the Â character, while the second byte is the one that actually contains the information. So you tend to see Â_ rather than just _, for example.
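The "Â_" effect is easy to reproduce by encoding a string as UTF-8 and then deliberately decoding the bytes as ISO-8859-1. A minimal Python sketch (variable names are hypothetical):

```python
text = "Non-breaking\u00a0space"

# Serialize as UTF-8, then misread the bytes as ISO-8859-1,
# as an editor or browser with the wrong encoding setting would.
misread = text.encode("utf-8").decode("iso-8859-1")

print(repr(text))
print(repr(misread))
```

The misread string is one character longer: the single non-breaking space has turned into Â followed by a non-breaking space.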

Let's return to looking at the possible serializations of the result tree. The next possible serialization is HTML. HTML is serialized more-or-less the same as XML, with a few differences. The difference that is pertinent here is that when you use the html output method, XSLT processors are allowed to use the entities defined in HTML rather than a native character (if the character can be represented in the encoding) or a character reference (if it can't). In our case, XSLT processors are allowed to serialize the non-breaking space character as the HTML character entity reference &nbsp;. So serializing as HTML, you may get:

   <html>
     <head><title>Test</title></head>
     <body>
       <p>Non-breaking &amp;nbsp;space</p>
       <p>Non-breaking&nbsp;space</p>
     </body>
   </html>
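What an HTML serializer may do with text nodes can be sketched in a few lines of Python. The function below is a hypothetical helper, not the behaviour of any particular XSLT processor: it escapes markup characters and then substitutes HTML entity names, taken from the standard `html.entities` table, for non-ASCII characters that have one.

```python
from html.entities import codepoint2name


def escape_html_text(text: str) -> str:
    """Hypothetical sketch of HTML-style text escaping."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ord(ch) > 127 and ord(ch) in codepoint2name:
            # Non-ASCII character with an HTML entity name, e.g. U+00A0 -> nbsp
            out.append("&" + codepoint2name[ord(ch)] + ";")
        else:
            # Everything else is passed through unchanged in this sketch.
            out.append(ch)
    return "".join(out)


print(escape_html_text("Non-breaking &nbsp;space"))
print(escape_html_text("Non-breaking\u00a0space"))
```

The literal string `&nbsp;` still comes out as `&amp;nbsp;`, while the real non-breaking space comes out as `&nbsp;`.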

Finally, let's consider the text output method. In the text output method, everything aside from text nodes is ignored, and the text is output without any automatic escaping. If a character can be represented in the encoding that you use, then it will be serialized as a native character. If it can't be, then the XSLT processor gives you an error. In our case, assuming that we're using an encoding that supports the non-breaking space character, we'd get something like (again with _ representing the non-breaking space):

Non-breaking &nbsp;spaceNon-breaking_space
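The text method can be approximated in Python by concatenating the text nodes and encoding them strictly, so that an unrepresentable character raises an error instead of being escaped. This is a sketch over a cut-down body element, with hypothetical variable names:

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body><p>Non-breaking &amp;nbsp;space</p>"
    "<p>Non-breaking&#160;space</p></body>"
)

# Text method: only the text nodes, with no escaping at all.
plain = "".join(body.itertext())

# An encoding that covers U+00A0 works fine.
utf8_bytes = plain.encode("utf-8")

# An encoding that does not cover it is an error, not a character reference.
try:
    plain.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(plain), ascii_ok)
```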

> And, how would you suggest someone actually get &nbsp; into the
> output in order to avoid the issue which started this thread in the
> first place? (browsers assuming a different encoding type than is
> sent, and therefore mistranslating character 160 as Â instead of
> " ")? I have yet to see a browser which misunderstands &nbsp;.

Hopefully, what I've explained above makes it clear that a browser that sees a non-breaking space character as an Â followed by a non-breaking space character is making that error because it is reading the result of the transformation as if it is in one encoding (e.g. ISO-8859-1) when in fact it is in another encoding (e.g. UTF-8).

There are several solutions:

- change the browser so that it auto-detects the actual encoding that's being used in the HTML/XML document (and make sure that you're reporting the correct encoding in the HTTP headers)
- change the serialization process so that you use an encoding that the browser is expecting, by adding encoding="ISO-8859-1" to the xsl:output element
- change the serialization process so that you use an encoding that doesn't include the non-breaking space character, so that the processor uses a character reference for it, for example using ASCII as the encoding
- use the HTML output method with an XSLT processor that serializes non-breaking spaces as &nbsp;

Cheers, Jeni

P.S. There is another solution that will work with some processors, but not all - disabling output escaping for the text node that contains the relevant characters. But since you can solve the problem a lot more elegantly with one of the methods above, there's no reason to use it. Jeni Tennison

nbsp doesn't work

Mike Brown

> How can I insert a tab and/or space
> characters into my html output from the xsl?
> &nbsp;, etc aren't legal in the xsl document....

This is the all-time #1 FAQ.

Regardless, just pick one:

1. &#160;

2. &#xA0;

3. &nbsp; after putting <!DOCTYPE xsl:stylesheet [<!ENTITY nbsp "&#160;">]> at the top of your stylesheet, after the XML declaration but before anything else (or reference any DTD containing that entity declaration);

4. type the character directly, if your keyboard and/or OS provide a way for you to do so, and your editor can be counted on to save the document in an encoding that supports that character, and you've made the encoding declaration match your editor's output.
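That all four options put the same character into the parsed tree can be checked with a small Python sketch. It uses the standard `xml.etree.ElementTree` parser, which reads entity declarations in the internal DTD subset; the sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Options 1-4 in one element: decimal reference, hex reference,
# an entity declared in the internal subset, and the literal character.
doc = (
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
    "<p>a&#160;b&#xA0;c&nbsp;d\u00a0e</p>"
)

text = ET.fromstring(doc).text
print(repr(text))
```

After parsing, the four separators are indistinguishable: each is the single character U+00A0.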

nbsp in output

Mike Brown

> I'm generating HTML from XML
> The output HTML needs to contain some &nbsp;. But until now I could not
> find a way to implement that.

&nbsp; is, by definition, &#160;

Just put &#160; (or &#xA0;, the hex equivalent) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

Use &#160; for a non-breaking space. Your XML parser does not pick up the named entity &nbsp; because it hasn't been declared. But a numbered character reference (which is what &#160; is) will be recognized -- character 160 is a non-breaking space.

You can even declare nbsp in an internal subset of your stylesheet if you want a friendlier representation of the character.

> There is some code before this that generates a table.
> If the value of blah is blank, and I was outputting this to html, then
> netscape would not handle blank <td/> fields in an elegant manner
> because it would shift the next column over one to replace the blank
> column. Normally, I would insert an &nbsp; between each td tag so that
> netscape would render a space and not ignore the cell, but as you
> know, & is reserved in xml. I tried &amp;, but that doesn't render a
> space but rather the real & symbol. So my question is what is the best
> way to solve this problem?

Another explanation

Trevor Nash

In an attempt to reduce the number of "how do I get &nbsp;" questions, I have tried to update Dave Pawson's FAQ on the subject: text follows. I also sent a message to the list owners to see if we can get the search mechanism tweaked to make it easier to find &nbsp;

I actually found it quite hard to locate definitive answers on the subject which cover all the angles, partly because it has been discussed so many times, and partly because some need to be edited for language ;-)

I have paraphrased my recollections of what has been said about dealing with badly configured / old browsers. I would welcome pointers to actual messages off the list which I could quote instead, and any improvements on the ones I have chosen.

How to output nbsp in HTML

[ existing text from the nbsp topic ]
Mike Brown:

> I'm generating HTML from XML
> The output HTML needs to contain some &nbsp;. But until now I could not
> find a way to implement that.

&nbsp; is by definition &#160;. Just put &#160; (or &#xA0;) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

> I thought the &nbsp; entity was predefined in xml.

It is not predefined. Only &lt;, &gt;, &amp;, &quot; and &apos; are predefined. You can either use &#160; or &#xA0;, or you can define an entity like nbsp for the same.

Try:

 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE xsl:stylesheet [<!ENTITY nbsp "&#160;">]>
 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0">

Apparently one motivation for trying to get &nbsp; into the output is to cope with browsers that either cannot handle the encoding being used or have been set up incorrectly (the advice is to set to auto detect if this option is available).

Mike Brown:
(http://www.biglist.com/lists/xsl-list/archives/200001/msg00255.html)

Another part of my problem was that a literal character #160 was
mysteriously coming through not as a non-breaking space, but as a Â
character, which is ANSI #194.

&#160; in an XML document always refers to UCS character code U+00A0. This character must be encoded upon output in a document. If your document is encoded as ISO-8859-1, the character will manifest as the single byte A0 (in hex, or 160 in decimal). US-ASCII is a 7-bit encoding and cannot represent it at all, so there the serializer has to fall back to a character reference. If your document is encoded with UTF-8, it will be the pair of bytes C2 A0.

If you are looking at the UTF-8 encoded document in an editor or shell/terminal window that doesn't know to interpret hex C2 A0 as a UTF-8 sequence, then you'll probably see Â (the character in many character sets/fonts at position hex C2, aka decimal 194) followed by an invisible character (A0, which, interpreted as an ISO-8859-1 character, is itself the non-breaking space).
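The byte values for U+00A0 in each encoding can be checked directly. A minimal Python sketch (variable names are hypothetical):

```python
nbsp = "\u00a0"

latin1 = nbsp.encode("iso-8859-1")  # one byte
utf8 = nbsp.encode("utf-8")         # two bytes
print(latin1.hex(), utf8.hex())

# US-ASCII has no code for this character at all.
try:
    nbsp.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

# The usual fallback for XML output is a numeric character reference.
ascii_fallback = nbsp.encode("ascii", errors="xmlcharrefreplace")
print(ascii_can_encode, ascii_fallback)

# The first UTF-8 byte, misread as ISO-8859-1, is the stray accented A.
first_byte_as_latin1 = utf8[:1].decode("iso-8859-1")
print(first_byte_as_latin1, ord(first_byte_as_latin1))
```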

If you don't like the encoding your XSLT processor gives you normally, you can use the encoding attribute on the xsl:output element to specify a particular encoding (provided your processor knows how to deal with it).

Ref: http://www.w3.org/TR/xslt#output

If you are having to deal with old browsers and/or misconfigured clients which you do not have the power to change, then you might be left with no choice other than getting &nbsp; into the output. There is no nice way to do this (as I hope we have already established, the standards are constructed such that it should not be necessary). But if it has to be done, here are the choices, and their caveats:

Choose a processor that gives you additional control over the serialisation, Saxon for example. Caveat: ties you to one processor.

Use <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>, possibly with the DTD subset trick described above to keep the stylesheet readable. Caveat: disable-output-escaping doesn't have to be honoured by the processor. Even if it seems to work, it can be fragile because it may be ignored if you later decide to send the output via a DOM, or you use variables and node-set() to store part of your output. See also DOE

Use an element or processing instruction to represent the non-breaking space, and substitute it with a custom serialiser. Caveat: hard work, and ties you to a specific processor or class of processors.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

Some references:

- On the finer points of encodings and character references: List archive
- Mike Brown on browser character encodings: List archive

nbsp, why doesn't it work

Ragulf Pickaxe, David Carlisle

> By googling I found a suggestion to use &#160; instead.
>
> Is there a reason why &nbsp; is not working?

&nbsp; is an HTML entity. XML only predefines five entities: &lt;, &gt;, &amp;, &quot; and &apos;.

Therefore all other characters that you need must be written with their character code, as you have found with &#160;.

Because XSLT files have to be well-formed XML, and in XML (and HTML) entities must be defined before use. Most HTML browsers implicitly use a catalogue that defines the entities in the HTML DTD, including nbsp, but in general it's just an undefined reference unless you define it.
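The difference between the predefined entities and an undeclared one shows up as soon as a document is parsed. A short Python sketch with the standard `xml.etree.ElementTree` parser (used here as an assumed stand-in for whatever parser your XSLT processor relies on):

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&quot;&apos;</p>").text
print(predefined)

# &nbsp; is not declared anywhere, so the document is not well-formed.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    nbsp_error = None
except ET.ParseError as exc:
    nbsp_error = str(exc)

print(nbsp_error)
```

The second parse fails before any stylesheet logic could run, which is why the reference has to be declared or replaced with &#160; in the stylesheet itself.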

As many of you may have noticed, the PHP DOM parser gives errors if the &nbsp; entity is present. The E_WARNING message looks like:

Warning: DOMDocument::load() [function.load]: Entity 'nbsp' not defined in ...

There are many ways to solve this:
a) The hard way
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>

b) Defining &nbsp;
At the top of the document, after the <?xml ?> declaration, add:
   <!DOCTYPE xsl:stylesheet [
    <!ENTITY nbsp "&#160;">
    ]>

c) External Doctype
Just in case you need other HTML entities, you can call an external doctype with the proper definitions

<!DOCTYPE page SYSTEM "http://gv.ca/dtd/character-entities.dtd">

Of course, you can download the file and place it on your server.