What is Percent encoding

Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.

Types of URI characters

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not. In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally not considered equivalent (that is, they are not assumed to denote the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.

Percent-encoding unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers shouldn't treat %41 differently from A or %7E differently from ~, but some do. For maximum interoperability, URI producers are discouraged from percent-encoding unreserved characters.
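
As a rough illustration of this equivalence, the sketch below normalizes a URI by decoding only those percent-encoded octets that stand for unreserved characters, and uppercasing the hex digits of everything it leaves encoded. The name normalize_unreserved is made up for this example and is not part of any library.

```python
import re
import string

# The RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_unreserved(uri: str) -> str:
    """Decode %XX only when it encodes an unreserved character (hypothetical helper)."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Unreserved characters are written literally; anything else stays
        # encoded, with its hex digits uppercased.
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)

normalized = normalize_unreserved("http://example.com/%7Euser/%41%2fb")
```

Here %7E and %41 become ~ and A, while %2f only becomes %2F, because / is reserved and decoding it could change the meaning of the path.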

Percent-encoding the percent character

Because the percent character ( % ) serves as the indicator for percent-encoded octets, it must be percent-encoded as %25 for that octet to be used as data within a URI.

Percent-encoding arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often don't, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994 it has been specified[1] that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above. Byte value 0F (hexadecimal), for example, should be represented by %0F, but byte value 41 (hexadecimal) can be represented by A, or %41. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred as it results in shorter URLs.
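
A minimal sketch of the byte-wise rule, using Python's standard urllib.parse module (the variable names are only for illustration):

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Two bytes: 0x0F has no unreserved character, 0x41 is the letter "A".
data = bytes([0x0F, 0x41])

# The encoder escapes 0x0F and leaves 0x41 as the shorter literal "A".
encoded = quote_from_bytes(data)

# Decoding accepts both spellings of the second byte.
decoded = unquote_to_bytes("%0F%41")
```

Both "%0FA" and "%0F%41" decode to the same two bytes; the encoder simply prefers the shorter form.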

Character data

The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.

For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all, and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
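
The ambiguity is easy to reproduce. The sketch below, which assumes Python's urllib.parse, percent-encodes the same character under two different character encodings and then decodes one of the results with the wrong assumption:

```python
from urllib.parse import quote, unquote

# The same character yields different percent-encoded bytes
# depending on the character encoding chosen by the producer.
as_utf8 = quote("é", encoding="utf-8")
as_latin1 = quote("é", encoding="latin-1")

# A consumer that guesses the wrong encoding cannot recover the character;
# unquote() substitutes U+FFFD for the undecodable byte by default.
misread = unquote(as_latin1, encoding="utf-8")
recovered = unquote(as_latin1, encoding="latin-1")
```

Nothing in the URI itself says which of the two encodings was used, which is exactly the problem described above.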


URL encoding the space character: + or %20?

From Wikipedia (emphasis and link added):

When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20". The MIME type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined (still in a very outdated manner) in the HTML and XForms specifications.

So, the real percent encoding uses %20, while form data in URLs is in a modified form that uses +. So you're most likely to see + only in the query string of a URL, after a ?.


A space may be encoded as "+" only in the query part of a URL that carries "application/x-www-form-urlencoded" key-value pairs. This is a MAY, not a MUST. In the rest of the URL, it is encoded as %20.

In my opinion, it's better to always encode spaces as %20, not as "+", even in the query part of a URL, because it is the HTML specification (RFC-1866) that specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs (see paragraph 8.2.1, subparagraph 1). This way of encoding form data is also given in later HTML specifications; for example, look for the relevant paragraphs about application/x-www-form-urlencoded in the HTML 4.01 Specification, and so on.

Here is a sample string in URL where the HTML specification allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses, according to the HTML specification. In other cases, spaces should be encoded to %20. But since it's hard to correctly determine the context, it's the best practice to never encode spaces as "+".
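
For comparison, here is how the two styles look in Python (a sketch using the standard urllib.parse module; the parameter name is taken from the sample URL above):

```python
from urllib.parse import quote, urlencode

params = {"name": "foo bar"}

# Default: form encoding, spaces become "+".
form_style = urlencode(params)

# Passing quote_via=quote switches to plain percent-encoding, spaces become "%20".
percent_style = urlencode(params, quote_via=quote)
```

The second form is the one that stays valid if the same string is later reused outside the query part.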

I would recommend percent-encoding all characters except the "unreserved" ones defined in RFC 3986, section 2.3:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The implementation depends on the programming language that you chose.

If your URL contains national characters, first encode them to UTF-8 and then percent-encode the result.
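
In Python, for instance, that recommendation could look roughly like the sketch below. It assumes Python 3.7 or later, where urllib.parse.quote treats "~" as unreserved; encode_component is a made-up helper name.

```python
from urllib.parse import quote

def encode_component(text: str) -> str:
    """Percent-encode everything except the RFC 3986 unreserved set (hypothetical helper)."""
    # safe="" removes "/" from the default safe set; the text is converted
    # to UTF-8 bytes first and each remaining byte becomes %XX.
    return quote(text, safe="", encoding="utf-8")

encoded = encode_component("São Paulo/1+1")
untouched = encode_component("a-b.c_d~e")
```

The "ã" becomes the two UTF-8 bytes %C3%A3, the space becomes %20, and the reserved "/" and "+" are escaped as well.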

How to urlencode a querystring in Python?

You need to pass your parameters into urlencode() as either a mapping (dict), or a sequence of 2-tuples, like:

 >>> import urllib
 >>> f = { 'eventName' : 'myEvent', 'eventDescription' : 'cool event'}
 >>> urllib.urlencode(f)
 'eventName=myEvent&eventDescription=cool+event'


Context

Python (version 2.7.2)

Problem

You want to generate a urlencoded query string.
You have a dictionary or object containing the name-value pairs.
You want to be able to control the output ordering of the name-value pairs.

Solution

urllib.urlencode
urllib.quote_plus

Pitfalls

dictionary output has arbitrary ordering of name-value pairs
(see also: Why is python ordering my dictionary like so?)
(see also: Why is the order in dictionaries and sets arbitrary?)
handling cases when you DO NOT care about the ordering of the name-value pairs
handling cases when you DO care about the ordering of the name-value pairs
handling cases where a single name needs to appear more than once in the set of all name-value pairs
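
The pitfalls above can be sketched in Python 3, where urlencode lives in urllib.parse rather than urllib. This is only an illustration of the ordering and repeated-name cases, not the original Python 2.7 solution.

```python
from urllib.parse import urlencode

# A sequence of 2-tuples keeps exactly the order you give it.
ordered = urlencode([("eventName", "myEvent"), ("eventDescription", "cool event")])

# If you want a deterministic order regardless of how the dict was built, sort the items.
sorted_query = urlencode(sorted({"b": "2", "a": "1"}.items()))

# doseq=True expands a list value into one name=value pair per element.
repeated = urlencode({"tag": ["x", "y"]}, doseq=True)
```

When the order does not matter, passing the dict directly is enough.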

urlencode vs rawurlencode?

It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).

rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see http://us2.php.net/manual/en/function.rawurlencode.php)

Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).

Note on RFC 3986 vs 1738: rawurlencode prior to PHP 5.3 encoded the tilde character (~) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986, which does not require encoding tilde characters.

urlencode encodes spaces as plus signs (not as %20 as done in rawurlencode) (see http://us2.php.net/manual/en/function.urlencode.php)

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.

Additional Reading:

You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.

Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:

Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.

As you can see, the + is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).

One quick bit of knowledge before I move forward: EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. Basically, this means that byte 0x4c in EBCDIC isn't the L it would be in ASCII; it's actually a <. I'm sure you see the confusion here.

Both of these functions manage EBCDIC if the web server has defined it.

Also, they both use an array of chars (think string type) hexchars look-up to get some values, the array is described as such:

/* rfc1738:
 ...The characters ";",
 "/", "?", ":", "@", "=" and "&" are the characters which may be
 reserved for special meaning within a scheme...
 ...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
 reserved characters used for their reserved purposes may be used
 unencoded within a URL...
 For added safety, we only leave -_. unencoded.
 */
static unsigned char hexchars[] = "0123456789ABCDEF";

Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.

Differences in ASCII:

URLENCODE:

RAWURLENCODE:

Note: Many programmers have probably never seen a for loop iterate this way. It's somewhat hackish and not the standard convention used with most for-loops, so pay attention: it assigns x and y, checks for exit on len reaching 0, and increments both x and y. I know, it's not what you'd expect, but it's valid code.

Differences:

They basically iterate differently; urlencode assigns a + sign when it meets ASCII 0x20 (a space).

Differences in EBCDIC:

URLENCODE:

RAWURLENCODE:

Grand Summary

Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.

Suggested implementations

Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, whereas urlencode does things the old school way, where + meant "space."

If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/%20/+ issue.

If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode. However, I believe %20 will actually be backwards compatible, since %20 worked under the old standard; it just wasn't preferred. Give it a shot if you're up for playing around, and let us know how it worked out for you.
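
One way to get a feel for the space/%20/+ issue is to feed both spellings to a form decoder. The sketch below uses Python's urllib.parse.parse_qs as a stand-in; it only shows how that one parser behaves, not what any particular legacy system does.

```python
from urllib.parse import parse_qs

# A form decoder accepts both spellings of a space.
plus_space = parse_qs("name=foo+bar")
percent_space = parse_qs("name=foo%20bar")

# A literal plus sign must be sent as %2B, otherwise it is read as a space.
lost_plus = parse_qs("q=1+1")
kept_plus = parse_qs("q=1%2B1")
```

So, at least for this decoder, emitting %20 is safe, while emitting a raw + for a literal plus sign is not.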

Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).

What is %2C in a URL?

Check out http://www.asciitable.com/

Look in the Hx (hex) column: 2C maps to a comma (,).

Any unusual encoding can be checked this way.

+----+-----+----+-----+----+-----+----+-----+
| Hx | Chr | Hx | Chr | Hx | Chr | Hx | Chr |
+----+-----+----+-----+----+-----+----+-----+
| 00 | NUL | 20 | SPC | 40 | @ | 60 | ` |
| 01 | SOH | 21 | ! | 41 | A | 61 | a |
| 02 | STX | 22 | " | 42 | B | 62 | b |
| 03 | ETX | 23 | # | 43 | C | 63 | c |
| 04 | EOT | 24 | $ | 44 | D | 64 | d |
| 05 | ENQ | 25 | % | 45 | E | 65 | e |
| 06 | ACK | 26 | & | 46 | F | 66 | f |
| 07 | BEL | 27 | ' | 47 | G | 67 | g |
| 08 | BS | 28 | ( | 48 | H | 68 | h |
| 09 | TAB | 29 | ) | 49 | I | 69 | i |
| 0A | LF | 2A | * | 4A | J | 6A | j |
| 0B | VT | 2B | + | 4B | K | 6B | k |
| 0C | FF | 2C | , | 4C | L | 6C | l |
| 0D | CR | 2D | - | 4D | M | 6D | m |
| 0E | SO | 2E | . | 4E | N | 6E | n |
| 0F | SI | 2F | / | 4F | O | 6F | o |
| 10 | DLE | 30 | 0 | 50 | P | 70 | p |
| 11 | DC1 | 31 | 1 | 51 | Q | 71 | q |
| 12 | DC2 | 32 | 2 | 52 | R | 72 | r |
| 13 | DC3 | 33 | 3 | 53 | S | 73 | s |
| 14 | DC4 | 34 | 4 | 54 | T | 74 | t |
| 15 | NAK | 35 | 5 | 55 | U | 75 | u |
| 16 | SYN | 36 | 6 | 56 | V | 76 | v |
| 17 | ETB | 37 | 7 | 57 | W | 77 | w |
| 18 | CAN | 38 | 8 | 58 | X | 78 | x |
| 19 | EM | 39 | 9 | 59 | Y | 79 | y |
| 1A | SUB | 3A | : | 5A | Z | 7A | z |
| 1B | ESC | 3B | ; | 5B | [ | 7B | { |
| 1C | FS | 3C | < | 5C | \ | 7C | | |
| 1D | GS | 3D | = | 5D | ] | 7D | } |
| 1E | RS | 3E | > | 5E | ^ | 7E | ~ |
| 1F | US | 3F | ? | 5F | _ | 7F | DEL |
+----+-----+----+-----+----+-----+----+-----+
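
If you'd rather not read the table by hand, the same lookup can be done in code. A small sketch in Python:

```python
from urllib.parse import unquote

# Decode the escape directly...
comma = unquote("%2C")

# ...or treat the two hex digits as a character code, as in the table.
from_table = chr(0x2C)

# Going the other way: which hex code does a character have?
hex_code = format(ord(","), "02X")
```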

How to do URL decoding in Java?

This does not have anything to do with character encodings such as UTF-8 or ASCII. The string you have there is URL encoded. This kind of encoding is something entirely different than character encoding.

Try something like this:

String result = java.net.URLDecoder.decode(url, "UTF-8");

Note that a character encoding (such as UTF-8 or ASCII) is what determines the mapping of characters to raw bytes. For a good intro to character encodings, see this article.

Should I URL-encode POST data?

The general answer to your question is that it depends. And you get to decide by specifying what your "Content-Type" is in the HTTP headers.

A value of "application/x-www-form-urlencoded" means that your POST body will need to be URL encoded just like a GET parameter string. A value of "multipart/form-data" means that you'll be using content delimiters and NOT url encoding the content.
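
As an illustration of the first case, here is roughly what building an application/x-www-form-urlencoded body looks like in Python. The field names are borrowed from the PHP documentation quoted further down; the headers dict is just a plain mapping, not tied to any HTTP client.

```python
from urllib.parse import urlencode

fields = {"para1": "val 1", "para2": "a&b"}

# The body is URL-encoded exactly like a GET parameter string.
body = urlencode(fields).encode("ascii")

# The Content-Type header tells the server how to interpret the body.
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),
}
```

With multipart/form-data you would skip the urlencode step entirely and let the HTTP library build the delimited body.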

This answer has a much more thorough explanation if you'd like more information.

Specific Answer

For an answer specific to the PHP libraries you're using (CURL), you should read the documentation here.

Here's the relevant information:

CURLOPT_POST

TRUE to do a regular HTTP POST. This POST is the normal application/x-www-form-urlencoded kind, most commonly used by HTML forms.

CURLOPT_POSTFIELDS

The full data to post in a HTTP "POST" operation. To post a file, prepend a filename with @ and use the full path. The filetype can be explicitly specified by following the filename with the type in the format ';type=mimetype'. This parameter can either be passed as a urlencoded string like 'para1=val1&para2=val2&...' or as an array with the field name as key and field data as value. If value is an array, the Content-Type header will be set to multipart/form-data. As of PHP 5.2.0, value must be an array if files are passed to this option with the @ prefix.

A html space is showing as %2520 instead of %20

The common space character is encoded as %20 as you noted yourself. The % character is encoded as %25.

The way you get %2520 is when your url already has a %20 in it, and gets urlencoded again, which transforms the %20 to %2520.
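
The effect is easy to reproduce. A short sketch in Python, with the file name taken from the example path below:

```python
from urllib.parse import quote, unquote

once = quote("my file.html")   # the space becomes %20
twice = quote(once)            # the % of %20 becomes %25, giving %2520

# One round of decoding only undoes the second encoding.
decoded_once = unquote(twice)
decoded_twice = unquote(decoded_once)
```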

Are you (or any framework you might be using) double encoding characters?

Edit: Expanding a bit on this, especially for LOCAL links. Assuming you want to link to the resource C:\my path\my file.html:

NOTES:

Sharing a URL with a query string on Twitter

This can be solved by using https://twitter.com/intent/tweet instead of http://www.twitter.com/share. Using the intent/tweet function, you simply URL encode your entire URL and it works like a charm.
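
A sketch of what "URL encode your entire URL" means in practice, written in Python. The shared URL is invented for the example, and the url parameter name is an assumption based on Twitter's Web Intents documentation.

```python
from urllib.parse import urlencode

# The URL to share, including its own query string (made up for this example).
shared = "http://example.com/page?id=1&lang=en"

# urlencode escapes ":", "/", "?", "=" and "&", so the shared URL's
# query string cannot be confused with the intent URL's own parameters.
intent = "https://twitter.com/intent/tweet?" + urlencode({"url": shared})
```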

https://dev.twitter.com/web/intents

urlencoded Forward slash is breaking URL

Apache denies all URLs with %2F in the path part, for security reasons: scripts can't normally (i.e. without rewriting) tell the difference between %2F and / due to the PATH_INFO environment variable being automatically URL-decoded (which is stupid, but a long-standing part of the CGI specification, so there's nothing that can be done about it).

You can turn this feature off using the AllowEncodedSlashes directive, but note that other web servers will still disallow it (with no option to turn that off), and that other characters may also be taboo (eg. %5C), and that %00 in particular will always be blocked by both Apache and IIS. So if your application relied on being able to have %2F or other characters in a path part you'd be limiting your compatibility/deployment options.

I am using urlencode() while preparing the search URL

You should use rawurlencode(), not urlencode() for escaping path parts. urlencode() is misnamed, it is actually for application/x-www-form-urlencoded data such as in the query string or the body of a POST request, and not for other parts of the URL.

The difference is that + doesn't mean space in path parts. rawurlencode() will correctly produce %20 instead, which will work both in form-encoded data and other parts of the URL.
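
The same distinction exists in other languages. In Python, for example, quote plays roughly the role of rawurlencode() and quote_plus the role of urlencode(); the sketch below is an analogy, not a description of PHP's behavior.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

segment = "rock & roll"

path_style = quote(segment, safe="")   # spaces become %20
form_style = quote_plus(segment)       # spaces become +

# A path decoder leaves "+" alone, so the form-style string comes back wrong.
path_decoded_form = unquote(form_style)

# A form decoder understands %20 as well as +, so the path-style string is fine.
form_decoded_path = unquote_plus(path_style)
```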

NameValueCollection to URL Query?

Simply calling ToString() on the NameValueCollection will return the name value pairs in a name1=value1&name2=value2 querystring ready format. Note that NameValueCollection types don't actually support this and it's misleading to suggest this, but the behavior works here due to the internal type that's actually returned, as explained below.

Thanks to @mjwills for pointing out that the HttpUtility.ParseQueryString method actually returns an internal HttpValueCollection object rather than a regular NameValueCollection (despite the documentation specifying NameValueCollection). The HttpValueCollection automatically encodes the querystring when using ToString(), so there's no need to write a routine that loops through the collection and uses the UrlEncode method. The desired result is already returned.

With the result in hand, you can then append it to the URL and redirect:

var nameValues = HttpUtility.ParseQueryString(Request.QueryString.ToString());
string url = Request.Url.AbsolutePath + "?" + nameValues.ToString();
Response.Redirect(url);

Currently the only way to use a HttpValueCollection is by using the ParseQueryString method shown above (other than reflection, of course). It looks like this won't change since the Connect issue requesting this class be made public has been closed with a status of "won't fix."

As an aside, you can call the Add, Set, and Remove methods on nameValues to modify any of the querystring items before appending it. If you're interested in that, see my response to another question.

How to find out if string has already been URL encoded?

Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.

I hope one can't write a quine in urlencode, or this algorithm would get stuck.
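
A minimal sketch of this idea in Python, written as a loop rather than recursion. Both function names are invented for the example, and the check is only a heuristic: a string such as "50%25 off" may be plain data that merely looks encoded.

```python
from urllib.parse import unquote

def looks_encoded(text: str) -> bool:
    """Heuristic: the text changes when decoded, so it contained escapes."""
    return unquote(text) != text

def decode_fully(text: str, max_rounds: int = 10) -> str:
    """Decode repeatedly until the text stops changing (or max_rounds is hit)."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            return text
        text = decoded
    return text

triple_encoded = "a%252520b"
plain = decode_fully(triple_encoded)
```

Every round that changes the string also shortens it, since each escape takes three characters and decodes to at most one, so the loop should not get stuck; max_rounds is just a safety net.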

How to encode URL in Groovy?

You could use java.net.URLEncoder.

In your example above, the brackets must be encoded too:

def toEncode = "dehydrogenase (NADP+)"
assert java.net.URLEncoder.encode(toEncode, "UTF-8") == "dehydrogenase+%28NADP%2B%29"

You could also add a method to string's metaclass:

String.metaClass.encodeURL = {
 java.net.URLEncoder.encode(delegate, "UTF-8")
}

And simply call encodeURL() on any string:

def toEncode = "dehydrogenase (NADP+)"
assert toEncode.encodeURL() == "dehydrogenase+%28NADP%2B%29" 

This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.

URLDecoder.decode(url, "UTF-8");

This will give you the correct text. The result of decoding the link you provided is this:

http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3

The %20 is an escaped space character. To get the above I used the URLDecoder object.

GETting a URL with an url-encoded slash

I want to send a HTTP GET to http://example.com/%2F. My first guess would be something like this:

using (WebClient webClient = new WebClient())
{
 webClient.DownloadData("http://example.com/%2F");
}

Unfortunately, I can see that what is actually sent on the wire is:

GET // HTTP/1.1
Host: example.com
Connection: Keep-Alive

So http://example.com/%2F gets translated into http://example.com// before transmitting it.

Is there a way to actually send this GET-request?

The OCSP-protocol mandates sending the url-encoding of a base-64-encoding when using OCSP over HTTP/GET, so it is necessary to send an actual %2F rather than an '/' to be compliant.

EDIT:

Here is the relevant part of the OCSP protocol standard (RFC 2560 Appendix A.1.1):

An OCSP request using the GET method is constructed as follows:

GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}

I am very open to other readings of this, but I cannot see what else could be meant.

By default, the Uri class will not allow an escaped / character (%2f) in a URI (even though this appears to be legal in my reading of RFC 3986).

Uri uri = new Uri("http://example.com/%2F");
Console.WriteLine(uri.AbsoluteUri); // prints: http://example.com//

(Note: don't use Uri.ToString to print URIs.)

According to the bug report for this issue on Microsoft Connect, this behaviour is by design, but you can work around it by adding the following to your app.config or web.config file:

<uri>
 <schemeSettings>
 <add name="http" genericUriParserOptions="DontUnescapePathDotsAndSlashes" />
 </schemeSettings>
</uri>

(Reposted from https://stackoverflow.com/a/10415482 because this is the "official" way to avoid this bug without using reflection to modify private fields.)

Edit: The Connect bug report is no longer visible, but the documentation for <schemeSettings> recommends this approach to allow escaped / characters in URIs. Note (as per that article) that there may be security implications for components that don't handle escaped slashes correctly.

iOS : How to do proper URL encoding?

I did some tests and I think the problem is not really with the UIWebView but instead that NSURL won't accept the URL because the é in "Témp" is not encoded properly. This will cause +[NSURLRequest requestWithURL:] and +[NSURL URLWithString:] to return nil as the string contains a malformed URL. I guess that you then end up using a nil request with -[UIWebView loadRequest:] which is no good.

Example:

NSLog(@"URL with é: %@", [NSURL URLWithString:@"http://host/Témp"]);
NSLog(@"URL with encoded é: %@", [NSURL URLWithString:@"http://host/T%C3%A9mp"]);

Output:

2012-10-02 12:02:56.366 test[73164:c07] URL with é: (null)
2012-10-02 12:02:56.368 test[73164:c07] URL with encoded é: http://host/T%C3%A9mp

If you really really want to borrow the graceful handling of malformed URLs that WebKit has and don't want to implement it yourself you can do something like this but it is very ugly:

UIWebView *webView = [[[UIWebView alloc]
 initWithFrame:self.view.frame]
 autorelease];
NSString *url = @"http://www.httpdump.com/texis/browserinfo/Témp.html";
[webView loadHTMLString:[NSString stringWithFormat:
 @"<script>window.location=%@;</script>",
 [[[NSString alloc]
 initWithData:[NSJSONSerialization
 dataWithJSONObject:url
 options:NSJSONReadingAllowFragments
 error:NULL]
 encoding:NSUTF8StringEncoding]
 autorelease]]
 baseURL:nil];

The answer @Dhaval Vaishnani provided is only partially correct. This method treats the ?, = and & characters as not to be encoded, since they're valid in a URL. Thus, to encode an arbitrary string to be safely used as a part of a URL, you can't use this method. Instead you have to fall back to using CoreFoundation and CFURLRef:

NSString *unsafeString = @"this &string= confuses ? the InTeRwEbZ";
CFStringRef safeString = CFURLCreateStringByAddingPercentEscapes (
 NULL,
 (CFStringRef)unsafeString,
 NULL,
 CFSTR("/%&=?$#+-~@<>|\\*,.()[]{}^!"),
 kCFStringEncodingUTF8
);

Don't forget to dispose of the ownership of the resulting string using CFRelease(safeString);.

Also, it seems that despite the title, OP is looking for decoding and not encoding a string. CFURLRef has another, similar function call to be used for that:

NSString *escapedString = @"%32%65BCDEFGH";
CFStringRef unescapedString = CFURLCreateStringByReplacingPercentEscapesUsingEncoding (
 NULL,
 (CFStringRef)escapedString,
 CFSTR(""),
 kCFStringEncodingUTF8
);

Again, don't forget proper memory management.

how to encode URL to avoid special characters in java

URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".

RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).

In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.

Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
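
The component-wise approach is not specific to Java. As a rough analogue, here is the same idea sketched in Python with urllib.parse; the host, path segments and parameters are invented for the example.

```python
from urllib.parse import quote, urlencode, urlunsplit

# Encode each path segment on its own, so a space becomes %20.
segments = ["docs", "my file"]
path = "/" + "/".join(quote(segment, safe="") for segment in segments)

# Encode the query with form rules, so a space becomes + and a literal + becomes %2B.
query = urlencode({"q": "a b", "sign": "+"})

# Only now combine the already-encoded components into one URL string.
url = urlunsplit(("http", "example.com", path, query, ""))
```

Each component is encoded according to its own rules before assembly, which avoids the space and plus sign mix-ups described above.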

When should an asterisk be encoded in an HTTP URL?

As Major pointed out, the RFC that HTTP 1.1 references for URL syntax has been obsoleted by RFC3986, which isn't as black and white about the use of asterisks as the originally referenced RFC was.

RFC2396 (URL spec before January 2005 - original answer)

An asterisk never needs to be encoded in HTTP 1.1 URLs as * is listed as an "unreserved character" in RFC2396, which is used to define URI syntax in HTTP 1.1. Unreserved characters are allowed in the path component of a URL.

2.3. Unreserved Characters

Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.

 unreserved = alphanum | mark
 mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

RFC3986 (current URL syntax for HTTP)

RFC3986 modifies RFC2396 to make the asterisk a reserved character, with the reason that it is "typically unsafe to decode". My understanding of this RFC is that the unencoded asterisk character is allowed in the path, query, and fragment components of a URL, as these components do not specify the asterisk as a delimiter (2.2. Reserved Characters):

These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax... If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

Additionally, 3.3 Path confirms that a subset of reserved characters (sub-delims) can be used unencoded in path segments (parts of the path component broken up by /):

Aside from dot-segments ("." and "..") in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment. ... For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same.
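
In practice, whether an asterisk or another sub-delim ends up encoded often depends on the library's defaults. As an illustration, Python's urllib.parse.quote escapes everything outside the unreserved set (and "/") unless told otherwise; the segment values below reuse the examples from the quoted RFC text.

```python
from urllib.parse import quote

# By default the asterisk is escaped, because it is not in the unreserved set.
escaped = quote("name*1.1")

# Listing it in `safe` keeps it literal, which RFC 3986 permits in a path segment.
literal = quote("name*1.1", safe="*")

# The same applies to the other sub-delims mentioned above.
with_params = quote("name;v=1.1", safe=";=")
```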

HTTP 1.0 references RFC1738 to define URL syntax, which through a series of updates and obsoletes means it uses the same RFC as HTTP 1.1 for URL syntax.

As far as backwards compatibility goes, RFC1738 specifies the asterisk as a reserved character, though as HTTP 1.0 doesn't actually define any special meaning for an unencoded asterisk in the path component of a URL, it shouldn't break anything if you use one. This should mean you're still safe putting asterisks in the URLs pointing to the oldest of systems.


As a side note, the asterisk character does have a special meaning in a Request-URI in both HTTP specs, but it's not possible to represent it with an HTTP URL:

The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be

 OPTIONS * HTTP/1.1