entities: text or rawText?

Lukas Theussl-4

Vincent,

I'm trying to understand some of the issues we have with entities in the
XmlParser. Is there a special reason why entities are emitted as rawText and not text?

I think they should be emitted as text:

First, custom entities can be used to simply define some replacement text inside
documents (eg <!ENTITY version "1.0">).

Second, the resulting events should be consumable by all sinks, not just
X(HT)ML-based ones. Consider for instance the text "&amp;&AElig;" (where AElig is
defined as <!ENTITY AElig  "&#198;">). Currently the XhtmlBaseParser emits it as
one text event "&" and one rawText event "&#198;". This means that, e.g., the LaTeX
sink will produce wrong output (the AElig should be converted to "\AE" in LaTeX).

IMO the resolved entity should be emitted in a format-independent way, eg as one
(unicode?) character, just like &amp; is emitted as one character above. The
consuming sink then has to transform that into a format-specific representation.
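
To make the difference concrete, here is a minimal sketch of the two event sequences arriving at a LaTeX-style sink. Everything in it is hypothetical: the Sink interface, LatexLikeSink and the "\AE{}" mapping are stand-ins invented for illustration and are not the Doxia classes. It only assumes that rawText() passes its argument through untouched.

```java
final class SinkSketch {

    /** Hypothetical two-method sink; the real Doxia Sink API is much larger. */
    interface Sink {
        void text(String s);
        void rawText(String s);
    }

    /** Hypothetical LaTeX-style sink: text() translates characters, rawText() does not. */
    static final class LatexLikeSink implements Sink {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void text(String s) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&': out.append("\\&"); break;          // LaTeX needs an escaped ampersand
                    case '\u00C6': out.append("\\AE{}"); break;  // capital AE ligature
                    default: out.append(c);
                }
            }
        }

        @Override
        public void rawText(String s) {
            out.append(s); // written verbatim, whatever format it happens to be in
        }

        String result() {
            return out.toString();
        }
    }

    /** Events as described above: text("&") followed by rawText("&#198;"). */
    static String entityAsRawText() {
        LatexLikeSink sink = new LatexLikeSink();
        sink.text("&");
        sink.rawText("&#198;");
        return sink.result();
    }

    /** Proposed events: the resolved entity arrives as one plain character. */
    static String entityAsText() {
        LatexLikeSink sink = new LatexLikeSink();
        sink.text("&");
        sink.text("\u00C6");
        return sink.result();
    }

    public static void main(String[] args) {
        System.out.println(entityAsRawText()); // \&&#198;
        System.out.println(entityAsText());    // \&\AE{}
    }
}
```

In the first case the XML character reference leaks into the LaTeX output; in the second the sink gets a chance to pick its own representation.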

WDYT?
-Lukas


Re: entities: text or rawText?

Vincent Siveton-2
Hi Lukas,

2009/5/4 Lukas Theussl <[hidden email]>:
>
> Vincent,
>
> I'm trying to understand some of the issues we have with entities in the
> XmlParser. Is there a special reason why entities are emitted as rawText and
> not text?

The text passed to XhtmlBaseParser#handleEntity() can contain
predefined entities [1] and numeric character references (i.e. &AElig;
has already been turned into &#198; by XmlPullParser).
XhtmlBaseSink#text() escapes characters, whereas
XhtmlBaseSink#rawText() does not.

So rawText() is used to make sure that text containing entities is not
escaped a second time.
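
A small sketch of the double-escaping problem described here. SketchXhtmlSink is a hypothetical stand-in, not the real XhtmlBaseSink, and its escaping rules (only &, < and >) are an assumption made to keep the example short.

```java
final class EscapingSketch {

    /** Hypothetical stand-in for an XHTML-style sink; not the real XhtmlBaseSink. */
    static final class SketchXhtmlSink {
        private final StringBuilder out = new StringBuilder();

        /** Escapes markup characters before writing. */
        void text(String s) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&': out.append("&amp;"); break;
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    default: out.append(c);
                }
            }
        }

        /** Writes the argument verbatim, without any escaping. */
        void rawText(String s) {
            out.append(s);
        }

        String result() {
            return out.toString();
        }
    }

    /** Sends the string through text() and returns what the sink wrote. */
    static String viaText(String s) {
        SketchXhtmlSink sink = new SketchXhtmlSink();
        sink.text(s);
        return sink.result();
    }

    /** Sends the string through rawText() and returns what the sink wrote. */
    static String viaRawText(String s) {
        SketchXhtmlSink sink = new SketchXhtmlSink();
        sink.rawText(s);
        return sink.result();
    }

    public static void main(String[] args) {
        System.out.println(viaText("&#198;"));    // &amp;#198;
        System.out.println(viaRawText("&#198;")); // &#198;
    }
}
```

Passing the still-escaped "&#198;" to text() escapes its ampersand again, which is the situation rawText() avoids.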

> I think they should be emitted as text:
>
> First, custom entities can be used to simply define some replacement text
> inside documents (eg <!ENTITY version "1.0">).
>
> Second, the resulting events should be consumable by all sinks, not just
> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
> AElig is defined as <!ENTITY AElig  "&#198;">). Currently it is emitted by
> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
> This means that eg the Latex Sink will produce wrong output (the AElig
> should be converted to "\AE" in latex).
>
> IMO the resolved entity should be emitted in a format-independent way, eg as
> one (unicode?) character, just like &amp; is emitted as one character above.
> The consuming sink then has to transform that into a format-specific
> representation.

That could be an alternative implementation:
XhtmlBaseParser#handleEntity() could unescape the XML and call only
sink.text().
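
A rough sketch of what that unescaping step might look like. The class and method names (UnescapeSketch, unescapeXml, decode) are made up for this example and are not existing Doxia or plexus-utils methods. It handles only the five predefined entities and decimal/hexadecimal character references, and leaves anything it does not recognize untouched.

```java
final class UnescapeSketch {

    /**
     * Replaces predefined entities and numeric character references with the
     * characters they stand for; unrecognized references are copied as-is.
     */
    static String unescapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int semi = (c == '&') ? s.indexOf(';', i) : -1;
            String decoded = (semi > i) ? decode(s.substring(i + 1, semi)) : null;
            if (decoded != null) {
                sb.append(decoded);
                i = semi + 1;      // skip past the whole reference
            } else {
                sb.append(c);      // ordinary character or unknown reference
                i++;
            }
        }
        return sb.toString();
    }

    /** Decodes the part between '&' and ';', or returns null if it is unknown. */
    private static String decode(String name) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default: break;
        }
        if (name.startsWith("#")) {
            try {
                int codePoint = name.startsWith("#x")
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                return null;       // malformed number or invalid code point
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // In the idea above, the result would be handed to sink.text(...).
        System.out.println(unescapeXml("&#198;").equals("\u00C6")); // true
        System.out.println(unescapeXml("1.0"));                     // 1.0
    }
}
```

With something like this in place, plain replacement text such as "1.0" passes through unchanged and "&#198;" becomes a single character, so each sink can escape or translate it in its own way.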

Cheers,

Vincent

[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent
Re: entities: text or rawText?

Lukas Theussl-4

For reference: the XhtmlBaseParser in Doxia 1.1.1 emits entities as text, unless
they are not recognized (i.e. have not been declared), in which case they are
emitted as unknown events.

-Lukas

