Character encoding for APT files

Trevor Harmon
When Doxia generates HTML from APT, it appears to force the HTML file  
to use ISO-8859-1, regardless of the original APT's encoding. I don't  
really understand why, since the Maven Doxia Converter supposedly  
generates all files in UTF-8:

   http://maven.apache.org/doxia/doxia-tools/doxia-converter/index.html

I found another user who's having a similar problem:

   http://www.mailinglistarchive.com/users@.../msg21983.html

He demonstrated a technique that appears to tell Doxia which encoding  
to use:

   <plugin>
     <artifactId>maven-site-plugin</artifactId>
     <configuration>
       <outputEncoding>UTF-8</outputEncoding>
     </configuration>
   </plugin>

But this has no effect for me. Is there any way to force Doxia to  
produce UTF-8 HTML for my APT files? Thanks,

Trevor

Re: Character encoding for APT files

Lukas Theussl-4

The doxia-converter is not used by the site plugin; it is meant to be a
stand-alone tool.

There has been a lot of work regarding encoding issues since I last worked on
Doxia and I'm not up-to-date with the exact status. Maybe Herve or Vincent can
clarify?

Cheers,
-Lukas


Trevor Harmon wrote:

> When Doxia generates HTML from APT, it appears to force the HTML file  
> to use ISO-8859-1, regardless of the original APT's encoding. I don't  
> really understand why, since the Maven Doxia Converter supposedly  
> generates all files in UTF-8:
>
>   http://maven.apache.org/doxia/doxia-tools/doxia-converter/index.html
>
> I found another user who's having a similar problem:
>
>   http://www.mailinglistarchive.com/users@.../msg21983.html
>
> He demonstrated a technique that appears to tell Doxia which encoding  
> to use:
>
>   <plugin>
>     <artifactId>maven-site-plugin</artifactId>
>       <configuration>
>         <outputEncoding>UTF-8</outputEncoding>
>       </configuration>
>   </plugin>
>
> But this has no effect for me. Is there any way to force Doxia to  
> produce UTF-8 HTML for my APT files? Thanks,
>
> Trevor
>
>

Re: Character encoding for APT files

Trevor Harmon
On Jan 13, 2009, at 5:27 AM, Lukas Theussl wrote:

> There has been a lot of work regarding encoding issues since I last  
> worked on Doxia and I'm not up-to-date with the exact status. Maybe  
> Herve or Vincent can clarify?

I haven't heard from them. Should I file a bug on this issue?

Trevor

Re: Character encoding for APT files

Lukas Theussl-4

Yes please, a minimal test project that illustrates the problem will
certainly help.

Thanks!
-Lukas


Trevor Harmon wrote:

> On Jan 13, 2009, at 5:27 AM, Lukas Theussl wrote:
>
>> There has been a lot of work regarding encoding issues since I last  
>> worked on Doxia and I'm not up-to-date with the exact status. Maybe  
>> Herve or Vincent can clarify?
>
>
> I haven't heard from them. Should I file a bug on this issue?
>
> Trevor
>
>

Re: Character encoding for APT files

Trevor Harmon
On Jan 22, 2009, at 10:16 AM, Lukas Theussl wrote:

> yes please, a minimalistic test project that illustrates the problem  
> will certainly help.

http://jira.codehaus.org/browse/DOXIA-278

Trevor

Re: Character encoding for APT files

Hervé BOUTEMY
On Thursday, January 22, 2009, Trevor Harmon wrote:
> On Jan 22, 2009, at 10:16 AM, Lukas Theussl wrote:
> > yes please, a minimalistic test project that illustrates the problem
> > will certainly help.
>
> http://jira.codehaus.org/browse/DOXIA-278
>
> Trevor

Sorry, I was working on other things and missed this discussion.
I just commented on the issue (and closed it as "Not A Bug" :) ).

Regards,

Hervé

Re: Character encoding for APT files

Trevor Harmon
On Jan 22, 2009, at 4:50 PM, Hervé BOUTEMY wrote:

> Sorry, I was working on other things and missed this discussion.
> I just commented (and closed as "Not A Bug" :) ) the issue.

I agree that autodetecting is not a bullet-proof feature, but an  
absolute guarantee is not required in this case. I share Jason van  
Zyl's view: "If it's right most of the time, and it saves the user  
from having to know or worry about it then yes I would use it." [1]

Another issue is that without autodetection, supporting more than one  
type of character encoding for the APT files in a Maven project is  
impossible.

That said, if autodetection is simply out of the question, let me  
suggest a different tack. Doxia appears to require ISO-8859-1 for APT  
files by default. This is a Western-centric encoding that lacks  
support for Asian languages. It is also deprecated. According to  
Wikipedia:

"The ISO/IEC working group responsible for maintaining eight-bit coded  
character sets disbanded and ceased all maintenance of ISO 8859,  
including ISO 8859-1, in order to concentrate on the Universal  
Character Set and Unicode." [2]

I would also say that with the increasing popularity of UTF-8, the  
number of encoding problems encountered by users due to Doxia favoring  
ISO-8859-1 is already larger than any problems that might occur due to  
bad autodetection. In other words, autodetection might be wrong some  
of the time, but for many users, ISO-8859-1 is wrong all of the time.

In light of this, I suggest changing Doxia's APT handling so that it  
defaults to UTF-8 rather than ISO-8859-1. Not only will this help  
UTF-8 users (who may be a majority), it will also help increase  
Maven's acceptance in the Asian world, a trend that is already  
happening [3].

I can work on a patch for this, if there's a chance it will be accepted.

Trevor

[1] http://www.nabble.com/Re%3A--VOTE--POM-Element-for-Source-File-Encoding-p16566779.html
[2] http://en.wikipedia.org/wiki/ISO_8859-1
[3] http://blogs.sonatype.com/people/2008/07/apache-maven-the-definitive-chinese-guide/

Re: Character encoding for APT files

Hervé BOUTEMY
I knew this would cause another discussion: encoding choices are always like
this :)

On Friday, January 23, 2009, Trevor Harmon wrote:
> On Jan 22, 2009, at 4:50 PM, Hervé BOUTEMY wrote:
> > Sorry, I was working on other things and missed this discussion.
> > I just commented (and closed as "Not A Bug" :) ) the issue.
>
> I agree that autodetecting is not a bullet-proof feature, but an
> absolute guarantee is not required in this case. I share Jason van
> Zyl's view: "If it's right most of the time, and it saves the user
> from having to know or worry about it then yes I would use it." [1]
The problem with such auto-detection in a tool like Doxia, used by
maven-site-plugin, is that if the guessed encoding is wrong, you can't do
anything about it (or you have to configure it, which is what you wanted to
avoid). That is not the case in a GUI such as a web browser, where a user can
change the encoding in a couple of clicks if there is a problem.

>
> Another issue is that without autodetection, supporting more than one
> type of character encoding for the APT files in a Maven project is
> impossible.
Same remark as before: what if the encoding guessed from a file is wrong?

>
> That said, if autodetection is simply out of the question, let me
> suggest a different tack. Doxia appears to require ISO-8859-1 for APT
> files by default. This is a Western-centric encoding that lacks
> support for Asian languages. It is also deprecated. According to
> Wikipedia:
>
> "The ISO/IEC working group responsible for maintaining eight-bit coded
> character sets disbanded and ceased all maintenance of ISO 8859,
> including ISO 8859-1, in order to concentrate on the Universal
> Character Set and Unicode." [2]
>
> I would also say that with the increasing popularity of UTF-8, the
> number of encoding problems encountered by users due to Doxia favoring
> ISO-8859-1 is already larger than any problems that might occur due to
> bad autodetection. In other words, autodetection might be wrong some
> of the time, but for many users, ISO-8859-1 is wrong all of the time.
Yes, I understand this one: the historic default encoding is ISO-8859-1, which
is problematic for a lot of people.
There was a proposal, implemented in a lot of Maven plugins, to make encoding
easily configurable: see [4].
When the question of the default encoding came up, there was a large poll
(you'll find links in the proposal), which concluded that the default source
encoding should be the platform encoding.

The configuration part of the proposal was implemented in
maven-site-plugin 2.0-beta-7 on 03 Jul 2008 (see MSITE-314), but the default
encoding wasn't changed: MSITE-326 tracks it so people can vote on whether they
want platform encoding (the full proposal, which is platform dependent)
instead of ISO-8859-1. There doesn't seem to be much traction...
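
As an illustration of the configuration discussed here, the site plugin can be given both encodings explicitly. This is a minimal sketch, assuming a maven-site-plugin version that exposes both the inputEncoding parameter (how source documents such as APT files are read) and the outputEncoding parameter (how the generated HTML is written); the UTF-8 values are only an example:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-site-plugin</artifactId>
      <configuration>
        <!-- Assumption: the APT sources are saved as UTF-8 -->
        <inputEncoding>UTF-8</inputEncoding>
        <!-- Encoding of the generated HTML pages -->
        <outputEncoding>UTF-8</outputEncoding>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Setting only outputEncoding leaves the input side at its default, which may explain output that looks wrong even though the HTML is declared as UTF-8.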

There are a lot of Maven plugins today that complain if you don't configure a
default encoding: it is a simple property to add to your POM. Doesn't that meet
your needs?
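
For reference, the property described in the proposal [4] is project.build.sourceEncoding. A minimal sketch of declaring it, assuming the project's sources are UTF-8; project.reporting.outputEncoding is the companion property for generated reports, included here on the assumption that a site build would want both:

```xml
<properties>
  <!-- Encoding used to read source files, including site documents -->
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <!-- Encoding used to write generated reports and site output -->
  <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
```

Plugins that follow the proposal should pick these values up as their defaults, so no per-plugin configuration is needed.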

>
> In light of this, I suggest changing Doxia's APT handling so that it
> defaults to UTF-8 rather than ISO-8859-1. Not only will this help
> UTF-8 users (who may be a majority),
Do you have figures, or is it a guess? AFAIK, the Windows default encoding is
still CP-1252 for Western European languages. I don't know if this has changed
with Vista.
So I doubt everybody has switched to UTF-8.
There is no truly ideal default encoding: only configuration fixes the issue.

> it will also help increase
> Maven's acceptance in the Asian world, a trend that is already
> happening [3].
>
> I can work on a patch for this, if there's a chance it will be accepted.
>
> Trevor
>
> [1] http://www.nabble.com/Re%3A--VOTE--POM-Element-for-Source-File-Encoding-p16566779.html
> [2] http://en.wikipedia.org/wiki/ISO_8859-1
> [3] http://blogs.sonatype.com/people/2008/07/apache-maven-the-definitive-chinese-guide/
[4] http://docs.codehaus.org/display/MAVENUSER/POM+Element+for+Source+File+Encoding

Re: Character encoding for APT files

Trevor Harmon
On Jan 23, 2009, at 3:24 PM, Hervé BOUTEMY wrote:

> the problem with such an auto-detection in a tool like Doxia used by
> maven-site-plugin is that if the guessed encoding is not right, you
> can't do anything

I was thinking that manually specifying a particular encoding would  
override the autodetection feature.

> (or you have to configure it, which is what you wanted to avoid)

If autodetection guesses wrong (and I maintain that it would seldom  
guess wrong), having to configure it those few times would be better  
than having to configure it all the time, which is what UTF-8 users  
have to do now.

>> Another issue is that without autodetection, supporting more than one
>> type of character encoding for the APT files in a Maven project is
>> impossible.
> same remark as before: and what if the guessed encoding from a file
> is wrong?

The error rate would go from all the time to some of the time, which  
is still a win. Again, I'm assuming that autodetection is optional and  
enabled by default; if it causes problems it could be disabled,  
reverting to the same behavior as before.

> There are a lot of Maven plugins today that complain if you don't
> configure default encoding: it is a simple property to add in your POM.
> Doesn't it meet your needs?

The problem is that I have many dozens of POMs, and I have to declare  
the encoding in all of them. Is there some way of configuring the  
encoding globally, perhaps in settings.xml?
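
One way to avoid repeating the declaration, sketched here as an assumption rather than an answer given in this thread, is to put the property in a shared parent POM that every module inherits from. The com.example coordinates below are hypothetical:

```xml
<!-- Parent POM (hypothetical coordinates), declared once -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>example-parent</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>

  <properties>
    <!-- Inherited by every child module -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>
```

Each child POM then only needs a parent element pointing at these coordinates, since properties defined in a parent POM are inherited by its children.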

>> In light of this, I suggest changing Doxia's APT handling so that it
>> defaults to UTF-8 rather than ISO-8859-1. Not only will this help
>> UTF-8 users (who may be a majority),
> do you have figures, or is it a guess?

It's a guess, though there's circumstantial evidence pointing to the  
rise of UTF-8. It's definitely growing on the web [1], and text  
editors I've used, such as Eclipse on Linux and TextMate on Mac OS X,  
default to UTF-8. I'm actually surprised UTF-8 hasn't been adopted  
more quickly because it solves so many issues. But I worry that we're
never going to get there if modern applications continue to require
native file encodings by default.

Trevor

[1] http://www.w3.org/QA/2008/05/utf8-web-growth.html
