Allowed characters in GAV and how/where to sanitize?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Allowed characters in GAV and how/where to sanitize?

Andreas Sewe-2
Hi Maven developers,

doing a large-scale analysis of Maven Central, I've come across a couple
of "weird" GAVs like this one: groupId=com.knappsack,
artifactId=swagger4spring-web, version=mvn+release:perform [1].

The colon in the version raises the question as to the allowed
characters in the different components of a GAV. AFAICT, a colon in the
version is at least rejected by the
org.eclipse.aether.artifact.DefaultArtifact(String) constructor, so that
seems to be illegal, but DefaultModelValidator doesn't complain. Also,
querying the index of Central returns an
org.apache.maven.index.ArtifactInfo with a version of
"mvn+release:perform" just fine.

What's the best way to handle this?

Should every plug-in that consumes, say, a Maven Index sanitize the results?

Or should this be handled upstream in the repository manager? (Note that
the POM of [1] has a <version> of "mvn release:perform", but the
ArtifactInfo's version is "mvn+release:perform", so some sanitation has
already happened somewhere, probably in Nexus.)

Best wishes,

Andreas

[1]
<http://search.maven.org/#artifactdetails%7Ccom.knappsack%7Cswagger4spring-web%7Cmvn%2Brelease%3Aperform%7Cjar>


signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Allowed characters in GAV and how/where to sanitize?

stephenconnolly
In an url path segment space is mapped to +

The repo manager should be blocking those... likely not doing it’s job.

We probably should also barf on : in a version. There is validation on
artifactId and groupId when last I checked

On Tue 9 Jan 2018 at 09:59, Andreas Sewe <[hidden email]>
wrote:

> Hi Maven developers,
>
> doing a large-scale analysis of Maven Central, I've come across a couple
> of "weird" GAVs like this one: groupId=com.knappsack,
> artifactId=swagger4spring-web, version=mvn+release:perform [1].
>
> The colon in the version raises the question as to the allowed
> characters in the different components of a GAV. AFAICT, a colon in the
> version is at least rejected by the
> org.eclipse.aether.artifact.DefaultArtifact(String) constructor, so that
> seems to be illegal, but DefaultModelValidator doesn't complain. Also,
> querying the index of Central returns an
> org.apache.maven.index.ArtifactInfo with a version of
> "mvn+release:perform" just fine.
>
> What's the best way to handle this?
>
> Should every plug-in that consumes, say, a Maven Index sanitize the
> results?
>
> Or should this be handled upstream in the repository manager? (Note that
> the POM of [1] has a <version> of "mvn release:perform", but the
> ArtifactInfo's version is "mvn+release:perform", so some sanitation has
> already happened somewhere, probably in Nexus.)
>
> Best wishes,
>
> Andreas
>
> [1]
> <
> http://search.maven.org/#artifactdetails%7Ccom.knappsack%7Cswagger4spring-web%7Cmvn%2Brelease%3Aperform%7Cjar
> >
>
> --
Sent from my phone
Reply | Threaded
Open this post in threaded view
|

Re: Allowed characters in GAV and how/where to sanitize?

Hervé BOUTEMY
notice that Central contains artifacts produced by Maven but also by other
tools: I did some analysis myself and found strange things also that are
clearly not produced by Maven. Scala for example produces some artifacts that
I doubt could be referenced by Maven.

Then: what do we call "broken"?
Something that seems "clearly" related to a typo?
Something that can't be consumed by Maven?
Something that people who produced the release (with any tooling) won't
consume for syntactic reasons on the result? Something that they won't consume
for other reasons? (like for example because it's continuous deployment and
it's the 4th version of the day)

And what can we do?
On the past artifacts, removing anything is not really an option: IMHO, the
issue does not deserve the effort and to break our base rule about
inalterability.
On the future, perhaps we can do something:
- at Maven level, sure we can and we should improve controls as much as
possible
- on other build tools: perhaps we should try not only to implement checks in
Maven but also document rules for other tools to implement same rules
- on repo managers used by the publishers: same rules documentation
prerequisite, but other tools target
- on sync to central: this is the only location where some rules can be
checked for absolutely any new artifact then really interesting at a first
glance. But making rules evolve at this level is really hard since there is no
real feedback process I know of when base Central publication rules are not
met. Base Central publication rules were defined from the beginning (signature,
...), then are implemented by publishers' repo managers. I suppose failed
controls done by sync to central (then sync blocked) are rare: I'm not sure
there is a strong process/tooling. And adding it would cost some management:
not easy. IMHO, we should start by first detecting if there are really issues
on new artifacts these days before trying to take actions at this level.

Regards,

Hervé

Le mardi 9 janvier 2018, 15:42:41 CET Andreas Sewe a écrit :

> Hi Stephen,
>
> > In an url path segment space is mapped to +
>
> not quite. This holds only for the query part of an URI which is
> typically encoding according to the application/x-www-form-urlencoded
> scheme. Elsewhere in a URI, e.g., the path component, a space is simply
> percent-encoded a %20.
>
> > The repo manager should be blocking those... likely not doing it’s job.
>
> I agree. IMHO the repo manager should block (if only as a last resort
> for people using something to deploy that doesn't do the check earlier).
>
> That being said, the situation on Maven Central is not that dire; there
> are very few versions in the wild that I consider broken:
>
> Additional quotes:
>
> - "1.0.0
> - '1.0'
>
> CLI trouble:
>
> - mvn+release:perform
> - version=1.6.2.1
>
> Commas instead of dots as separator
>
> - 1,0
>
> Expressions or expression-like constructs:
>
> - ${env.VERSION}
> - ${parent.version}
> - @metro.version@
> - $%7Bcucumber-jvm.version%7D
>
> If you are interested, I have a more complete list (about 30 entries
> overall), together with a histogram of characters used in versions.
> Interestingly, no non-ASCII characters are used, not even in qualifiers.
>
> > We probably should also barf on : in a version. There is validation on
> > artifactId and groupId when last I checked
>
> Different validators barf on different things. The
> org.apache.maven.*project*.validation.DefaultModelValidator used by
> deploy:deploy-file is happy with *any* non-empty version, whereas
> org.apache.maven.*model*.validation.DefaultModelValidator does a bit
> more; in particular, it checks for certain filesystem-unsafe characters,
> including the colon: \ / : " < > | ? *
>
> I don't really know why deploy:deploy-file prefers one ModelValidator
> over the other, though. Is this a bug in the the maven-deploy-plugin?
>
> Best wishes,
>
> Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Allowed characters in GAV and how/where to sanitize?

Andreas Sewe-2
Hervé BOUTEMY wrote:
> notice that Central contains artifacts produced by Maven but also by other
> tools: I did some analysis myself and found strange things also that are
> clearly not produced by Maven. Scala for example produces some artifacts that
> I doubt could be referenced by Maven.

Yes, Maven is not the only tool that can deploy artifacts to Central --
and that's a good thing.

> Then: what do we call "broken"?
> Something that seems "clearly" related to a typo?
> Something that can't be consumed by Maven?
> Something that people who produced the release (with any tooling) won't
> consume for syntactic reasons on the result? Something that they won't consume
> for other reasons? (like for example because it's continuous deployment and
> it's the 4th version of the day)

I wouldn't go so far to treat version=1.6.2.1 as an illegal version
(after all, I can image someone using legitimately using a qualifier
scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
consider broken:

- Spaces in any of the components of a GAV
- A colon in any of the components of a GAV

Spaces are just likely to cause trouble for some tool further down the line.

And for colons we know that they will cause trouble, being the default
separator for GAVs when written as a single string.

Aside from those characters, I would probably just ban a few characters
(non-printable control characters). A bit similar to what XML did with a
its NCName (non-colon name) production [1].

However, for groudId and artifactId we already have much stricter rules
(A-Z, a-z, 0-9, ., -, _), so the argument can be made that
versions/classifiers/extensions should also be made up of a more limited
character set as well.

In particular, care should to be taken that the path component can still
be parsed unambiguously, so allowing '.' in a classifier is probably
asking for trouble.

> And what can we do?
> On the past artifacts, removing anything is not really an option: IMHO, the
> issue does not deserve the effort and to break our base rule about
> inalterability.
> On the future, perhaps we can do something:
> - at Maven level, sure we can and we should improve controls as much as
> possible

Yes, if only that at this level we can provide the best error messages,
as the error is recognized closest to the user.

> - on other build tools: perhaps we should try not only to implement checks in
> Maven but also document rules for other tools to implement same rules

The Maven Resolver is a great place to enforce some rules in
DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
using the Maven Resolver, but its *the* place that knows about all the
intricacies of the repository layout already.

> - on repo managers used by the publishers: same rules documentation
> prerequisite, but other tools target

Well, Nexus already has some checks in place, to avoid versions like
"1/../../other-artifact/2". However, groupIds like "org...example" are
still accepted (deployed under org/example).

> - on sync to central: this is the only location where some rules can be
> checked for absolutely any new artifact then really interesting at a first
> glance. But making rules evolve at this level is really hard since there is no
> real feedback process I know of when base Central publication rules are not
> met. Base Central publication rules were defined from the beginning (signature,
> ...), then are implemented by publishers' repo managers. I suppose failed
> controls done by sync to central (then sync blocked) are rare: I'm not sure
> there is a strong process/tooling. And adding it would cost some management:
> not easy. IMHO, we should start by first detecting if there are really issues
> on new artifacts these days before trying to take actions at this level.
That being said, I think the first step is to document the syntax for
GAVs somewhere (e.g., at [2] or [3]).

Best wishes,

Andreas

[1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
[2] <http://maven.apache.org/pom.html#Maven_Coordinates>
[3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>



signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Allowed characters in GAV and how/where to sanitize?

rfscholte
On Wed, 10 Jan 2018 15:07:08 +0100, Andreas Sewe  
<[hidden email]> wrote:

> Fred Cooke wrote:
>> Re versions, I know the background on it, but it annoys me that maven  
>> can't
>> handle 4 part versions, 1.2.3.4 as sometimes it's handy to do a patch  
>> level
>> that deep. Lots of messed up software in the world :-)

The number of dots is not limited to 2 for a version. A version has some  
banned characters, but that's it.
However, an ArtifactVersion[1] has some explicit/named segments (which I  
personally consider a bad choice in the past). This means there can be  
unexpected results when *comparing* versions.
In general you won't hit such issues as long as you don't use version  
ranges.
Assuming most won't use ranges, I can only think of possible issues with  
enforcer rules like requireUpperBoundDeps[2], but only when having 4 or  
more segments.

thanks,
Robert

[1]  
https://maven.apache.org/ref/3.5.2/maven-artifact/apidocs/org/apache/maven/artifact/versioning/ArtifactVersion.html
[2]  
https://maven.apache.org/enforcer/enforcer-rules/requireUpperBoundDeps.html

>
> Are you sure that's still the case (the parts-restriction, not the
> messiness of software ;-)?
>
> At least the Maven Resolver uses a versioning scheme that's quite
> flexible [1]. Not sure if the flexibility at this low level bubbles up
> all the way to the top, though. Maybe one the Maven developers can chime  
> in.
>
>> Format should be N[.N as many times as needed][optional hyphen and
>> qualifier of some sort] or something like that. Not hard limited to 1 2  
>> or
>> 3 parts.
>
> AFAICT, that's what GenericVersionScheme does.
>
> Hope this helps,
>
> Andreas
>
> [1]
> <https://github.com/apache/maven-resolver/blob/3fc53c052f538169cb7dc6aa9ed9052514b569ca/maven-resolver-util/src/main/java/org/eclipse/aether/util/version/GenericVersionScheme.java#L31>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]