java.lang.Object
  com.norconex.commons.lang.url.URLNormalizer
public class URLNormalizer
The general idea behind URL normalization is to make different URLs "equivalent" (i.e. eliminate URL variations pointing to the same resource). To achieve this, URLNormalizer takes a URL and modifies it to its most basic or standard form (for the context in which it is used). Of course, URLNormalizer can also be used simply as a generic URL manipulation tool for your needs.
You would typically "build" your normalized URL by invoking each method of interest, in the relevant order, using a similar approach:
    String url = "Http://Example.com:80//foo/index.html";
    URL normalizedURL = new URLNormalizer(url)
            .lowerCaseSchemeHost()
            .removeDefaultPort()
            .removeDuplicateSlashes()
            .removeDirectoryIndex()
            .addWWW()
            .toURL();
    System.out.println(normalizedURL.toString());
    // Output: http://www.example.com/foo/
Several of the normalization methods implemented come from the RFC 3986 standard. This standard and several more normalization techniques are well summarized in the Wikipedia article titled URL Normalization. This class implements most of the normalizations described in that article and borrows several of its examples, along with a few additional ones.
The normalization methods available can be broken down into three categories:
The following normalizations are part of the RFC 3986 standard and should result in equivalent URLs (ones that identify the same resource):
Convert scheme and host to lower case
Convert escape sequence to uppercase
Decode percent-encoded unreserved characters
Removing default ports
The following techniques will generate a semantically equivalent URL for the majority of use cases but are not enforced as a standard:
Add trailing slash
Remove dot-segments
These normalizations will fail to produce semantically equivalent URLs in many cases. They usually work best when you have a good understanding of the website behind the supplied URL and know which normalizations can be considered to produce semantically equivalent URLs for that site.
Remove directory index
Remove fragment (#)
Replace IP with domain name
Unsecure scheme (https → http)
Secure scheme (http → https)
Remove duplicate slashes
Remove "www."
Add "www."
Sort query parameters
Remove empty query parameters
Remove trailing question mark (?)
Remove session IDs
Refer to each method below for a description and examples.
Constructor Summary | |
---|---|
URLNormalizer(String url) | Create a new URLNormalizer instance. |
URLNormalizer(URL url) | Create a new URLNormalizer instance. |
Method Summary | |
---|---|
URLNormalizer | addTrailingSlash(): Adds a trailing slash (/) to a URL ending with a directory. |
URLNormalizer | addWWW(): Adds "www." domain name prefix. |
URLNormalizer | decodeUnreservedCharacters(): Decodes percent-encoded unreserved characters. |
URLNormalizer | lowerCaseSchemeHost(): Converts the scheme and host to lower case. |
URLNormalizer | removeDefaultPort(): Removes the default port (80 for http, and 443 for https). |
URLNormalizer | removeDirectoryIndex(): Removes directory index files. |
URLNormalizer | removeDotSegments(): Removes the unnecessary "." and ".." segments from the URL path. |
URLNormalizer | removeDuplicateSlashes(): Removes duplicate slashes. |
URLNormalizer | removeEmptyParameters(): Removes empty parameters. |
URLNormalizer | removeFragment(): Removes the URL fragment (from the "#" character until the end). |
URLNormalizer | removeSessionIds(): Removes a URL-based session id. |
URLNormalizer | removeTrailingQuestionMark(): Removes trailing question mark ("?"). |
URLNormalizer | removeWWW(): Removes "www." domain name prefix. |
URLNormalizer | replaceIPWithDomainName(): Replaces IP address with domain name. |
URLNormalizer | secureScheme(): Converts http scheme to https. |
URLNormalizer | sortQueryParameters(): Sorts query parameters. |
String | toString(): Returns the normalized URL as string. |
URI | toURI(): Returns the normalized URL as URI. |
URL | toURL(): Returns the normalized URL as URL. |
URLNormalizer | unsecureScheme(): Converts https scheme to http. |
URLNormalizer | upperCaseEscapeSequence(): Converts letters in URL-encoded escape sequences to upper case. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public URLNormalizer(URL url)
Create a new URLNormalizer instance.
Parameters:
url - the url to normalize
public URLNormalizer(String url)
Create a new URLNormalizer instance.
Parameters:
url - the url to normalize
Method Detail |
---|
public URLNormalizer lowerCaseSchemeHost()
Converts the scheme and host to lower case.
HTTP://www.Example.com/ → http://www.example.com/
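As a rough illustration of this normalization, the sketch below lower-cases only the scheme and the host (plus port) and leaves any user info, path, query, and fragment untouched. It is not the library's implementation; the class name LowerCaseSchemeHostSketch and the regular expression are assumptions made for this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class LowerCaseSchemeHostSketch {

    // group 1: scheme, group 2: optional "userinfo@", group 3: host[:port], group 4: the rest
    private static final Pattern URL_PARTS = Pattern.compile(
            "^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#@]*@)?([^/?#]*)(.*)$");

    static String lowerCaseSchemeHost(String url) {
        Matcher m = URL_PARTS.matcher(url);
        if (!m.matches()) {
            return url; // not of the form scheme://authority...: leave untouched
        }
        String userInfo = m.group(2) == null ? "" : m.group(2);
        return m.group(1).toLowerCase(Locale.ROOT) + "://" + userInfo
                + m.group(3).toLowerCase(Locale.ROOT) + m.group(4);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/
        System.out.println(lowerCaseSchemeHost("HTTP://www.Example.com/"));
    }
}
```

The path is deliberately left alone because paths are case-sensitive on many servers.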
public URLNormalizer upperCaseEscapeSequence()
Converts letters in URL-encoded escape sequences to upper case.
http://www.example.com/a%c2%b1b →
http://www.example.com/a%C2%B1b
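One way to picture this normalization is a regular expression that finds each percent-encoded triplet and upper-cases it. The following is a minimal sketch under that assumption; UpperCaseEscapeSketch is a hypothetical name and the code is not taken from the library.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class UpperCaseEscapeSketch {

    // A percent sign followed by exactly two hexadecimal digits
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    static String upperCaseEscapes(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // The replacement only contains '%' and hex digits, so no quoting is needed
            m.appendReplacement(sb, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a%C2%B1b
        System.out.println(upperCaseEscapes("http://www.example.com/a%c2%b1b"));
    }
}
```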
public URLNormalizer decodeUnreservedCharacters()
Decodes percent-encoded unreserved characters.
http://www.example.com/%7Eusername/ →
http://www.example.com/~username/
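RFC 3986 defines the unreserved characters as letters, digits, "-", ".", "_", and "~". The sketch below decodes an escape sequence only when it maps to one of those characters and keeps every other sequence as is. It is an illustration only; DecodeUnreservedSketch and its methods are hypothetical names, and the library may proceed differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class DecodeUnreservedSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");

    // Unreserved set from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String decodeUnreserved(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int c = Integer.parseInt(m.group(1), 16);
            // Reserved or non-ASCII bytes keep their percent-encoded form
            String replacement = isUnreserved(c) ? String.valueOf((char) c) : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/~username/
        System.out.println(decodeUnreserved("http://www.example.com/%7Eusername/"));
    }
}
```

Reserved characters such as an encoded slash (%2F) stay encoded, since decoding them could change how the URL is parsed.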
public URLNormalizer removeDefaultPort()
Removes the default port (80 for http, and 443 for https).
http://www.example.com:80/bar.html →
http://www.example.com/bar.html
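The behavior can be approximated with two anchored regular expressions, one per scheme. This is a sketch under the assumption that only ":80" on http and ":443" on https count as default ports; RemoveDefaultPortSketch is a hypothetical name, not the library's code.

```java
// Hypothetical helper, not part of URLNormalizer.
final class RemoveDefaultPortSketch {

    static String removeDefaultPort(String url) {
        // The lookahead ensures the port ends the authority, so ":8080" is not matched
        return url
                .replaceFirst("(?i)^(http://[^/?#]*?):80(?=[/?#]|$)", "$1")
                .replaceFirst("(?i)^(https://[^/?#]*?):443(?=[/?#]|$)", "$1");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/bar.html
        System.out.println(removeDefaultPort("http://www.example.com:80/bar.html"));
    }
}
```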
public URLNormalizer addTrailingSlash()
Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before any fragment (#) or query string (?), does not contain a dot (which typically introduces a file extension).
Please Note: URLs do not always denote a directory structure, and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.
http://www.example.com/alice →
http://www.example.com/alice/
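The "no dot in the last path segment" heuristic described above can be sketched as follows. The class name AddTrailingSlashSketch is hypothetical, and leaving URLs without any path untouched is a choice made for this example rather than documented library behavior.

```java
// Hypothetical helper, not part of URLNormalizer.
final class AddTrailingSlashSketch {

    static String addTrailingSlash(String url) {
        // Split off the query string and/or fragment, whichever comes first
        int end = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        String base = url.substring(0, end);
        String rest = url.substring(end);

        int schemeEnd = base.indexOf("://");
        int pathStart = base.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        if (pathStart < 0) {
            return url; // no path at all: left untouched in this sketch
        }
        String lastSegment = base.substring(base.lastIndexOf('/') + 1);
        if (lastSegment.isEmpty() || lastSegment.contains(".")) {
            return url; // already ends with a slash, or looks like a file
        }
        return base + "/" + rest;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/alice/
        System.out.println(addTrailingSlash("http://www.example.com/alice"));
    }
}
```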
public URLNormalizer removeDotSegments()
Removes the unnecessary "." and ".." segments from the URL path. URI.normalize() is invoked to perform this normalization. Refer to it for exact behavior.
http://www.example.com/../a/b/../c/./d.html →
http://www.example.com/a/c/d.html
Please Note: URLs do not always represent a clean hierarchy structure, and the dots/double-dots may have a different meaning on some sites. Removing them from a URL could potentially break its semantic equivalence.
See Also: URI.normalize()
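Since the description points to URI.normalize(), here is a small standalone sketch of that call on its own, outside of URLNormalizer. The wrapper class RemoveDotSegmentsSketch is hypothetical. Note that the Java documentation for URI.normalize() states that a normalized path still begins with ".." segments when there are not enough preceding segments to cancel them, so the sample input below avoids a leading "..".

```java
import java.net.URI;

// Hypothetical helper, not part of URLNormalizer.
final class RemoveDotSegmentsSketch {

    static String removeDotSegments(String url) {
        // URI.create throws IllegalArgumentException on syntactically invalid input
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a/c/d.html
        System.out.println(removeDotSegments("http://www.example.com/a/b/../c/./d.html"));
    }
}
```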
public URLNormalizer removeDirectoryIndex()
Removes directory index files. They are often not needed in URLs.
http://www.example.com/a/index.html →
http://www.example.com/a/
Index files must be the last URL path segment to be considered. The following are considered index files:
Please Note: There are no guarantees a URL without its index files will be semantically equivalent, or even be valid.
public URLNormalizer removeFragment()
Removes the URL fragment (from the "#" character until the end).
http://www.example.com/bar.html#section1 →
http://www.example.com/bar.html
public URLNormalizer replaceIPWithDomainName()
Replaces IP address with domain name. This is often not reliable due to virtual domain names and can be slow, as it has to access the network.
http://208.77.188.166/ → http://www.example.com/
public URLNormalizer unsecureScheme()
Converts https scheme to http.
https://www.example.com/ → http://www.example.com/
public URLNormalizer secureScheme()
Converts http scheme to https.
http://www.example.com/ → https://www.example.com/
public URLNormalizer removeDuplicateSlashes()
Removes duplicate slashes. Two or more adjacent slash ("/") characters will be converted into one.
http://www.example.com/foo//bar.html
→ http://www.example.com/foo/bar.html
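A naive replacement of "//" would also damage the "://" after the scheme, so any implementation has to skip it. The sketch below collapses slashes only between the end of the scheme separator and the start of the query string or fragment. Leaving the query and fragment untouched is an assumption of this example, and RemoveDuplicateSlashesSketch is a hypothetical name.

```java
// Hypothetical helper, not part of URLNormalizer.
final class RemoveDuplicateSlashesSketch {

    static String removeDuplicateSlashes(String url) {
        // Skip the "://" that follows the scheme, if present
        int schemeEnd = url.indexOf("://");
        int start = schemeEnd < 0 ? 0 : schemeEnd + 3;

        // Stop at the query string or fragment, whichever comes first
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        return url.substring(0, start)
                + url.substring(start, end).replaceAll("/{2,}", "/")
                + url.substring(end);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/foo/bar.html
        System.out.println(removeDuplicateSlashes("http://www.example.com/foo//bar.html"));
    }
}
```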
public URLNormalizer removeWWW()
Removes "www." domain name prefix.
http://www.example.com/ → http://example.com/
public URLNormalizer addWWW()
Adds "www." domain name prefix.
http://example.com/ → http://www.example.com/
public URLNormalizer sortQueryParameters()
Sorts query parameters.
http://www.example.com/?z=bb&y=cc&z=aa →
http://www.example.com/?y=cc&z=bb&z=aa
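In the example above, the two "z" parameters keep their original relative order (z=bb before z=aa), which is what a stable sort on parameter names produces. The sketch below reproduces that result under this assumption; it is not the library's code, and SortQueryParametersSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of URLNormalizer.
final class SortQueryParametersSketch {

    static String sortQueryParameters(String url) {
        // Set the fragment aside so a '?' inside it is not mistaken for a query
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String main = hash < 0 ? url : url.substring(0, hash);

        int q = main.indexOf('?');
        if (q < 0 || q == main.length() - 1) {
            return url; // no query string, or an empty one
        }
        List<String> params = new ArrayList<>(Arrays.asList(main.substring(q + 1).split("&")));
        // List.sort is stable: parameters with equal names keep their relative order
        params.sort(Comparator.comparing(SortQueryParametersSketch::name));
        return main.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    // The parameter name is everything before the first '='
    private static String name(String param) {
        int eq = param.indexOf('=');
        return eq < 0 ? param : param.substring(0, eq);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/?y=cc&z=bb&z=aa
        System.out.println(sortQueryParameters("http://www.example.com/?z=bb&y=cc&z=aa"));
    }
}
```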
public URLNormalizer removeEmptyParameters()
Removes empty parameters.
http://www.example.com/display?a=b&a=&c=d&e=&f=g →
http://www.example.com/display?a=b&c=d&f=g
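A minimal sketch of this filtering step follows. It assumes that a parameter counts as empty when it has no "=" or nothing after the "=", which matches the example above but is otherwise an assumption. RemoveEmptyParametersSketch is a hypothetical name, and the "?" is kept even if every parameter is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of URLNormalizer.
final class RemoveEmptyParametersSketch {

    static String removeEmptyParameters(String url) {
        // Set the fragment aside so it is carried over unchanged
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String main = hash < 0 ? url : url.substring(0, hash);

        int q = main.indexOf('?');
        if (q < 0) {
            return url; // no query string
        }
        List<String> kept = new ArrayList<>();
        for (String param : main.substring(q + 1).split("&")) {
            int eq = param.indexOf('=');
            // Keep only parameters that have at least one character after '='
            if (eq >= 0 && eq < param.length() - 1) {
                kept.add(param);
            }
        }
        return main.substring(0, q + 1) + String.join("&", kept) + fragment;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/display?a=b&c=d&f=g
        System.out.println(removeEmptyParameters("http://www.example.com/display?a=b&a=&c=d&e=&f=g"));
    }
}
```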
public URLNormalizer removeTrailingQuestionMark()
Removes trailing question mark ("?").
http://www.example.com/display? →
http://www.example.com/display
public URLNormalizer removeSessionIds()
Removes a URL-based session id. It removes PHP (PHPSESSID), ASP (ASPSESSIONID), and Java EE (jsessionid) session ids.
http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b
→ http://www.example.com/servlet?a=b
Please Note: Removing session IDs from URLs is often a good way to have the URL return an error once invoked.
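To make the three session id styles concrete, the sketch below strips a ";jsessionid=..." path parameter and PHPSESSID / ASPSESSIONID query parameters with regular expressions. The exact patterns the library uses are not shown in this document, so the expressions here (including the optional letter suffix after ASPSESSIONID) are assumptions, and RemoveSessionIdsSketch is a hypothetical name.

```java
// Hypothetical helper, not part of URLNormalizer.
final class RemoveSessionIdsSketch {

    private static final String QUERY_IDS = "(?:PHPSESSID|ASPSESSIONID[A-Z]*)";

    static String removeSessionIds(String url) {
        // Java EE style: ";jsessionid=..." appended to the path
        String result = url.replaceAll("(?i);jsessionid=[^?#&;]*", "");
        // Session parameter followed by another parameter: keep the leading '?' or '&'
        result = result.replaceAll("(?i)([?&])" + QUERY_IDS + "=[^&#]*&", "$1");
        // Session parameter at the end of the query: drop it with its separator
        result = result.replaceAll("(?i)[?&]" + QUERY_IDS + "=[^&#]*", "");
        return result;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/servlet?a=b
        System.out.println(removeSessionIds(
                "http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b"));
    }
}
```

This sketch does not handle several session parameters in a row; a fuller version would repeat the substitution until the URL stops changing.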
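public String toString()
Returns the normalized URL as string.
Overrides: toString in class Object
public URI toURI()
Returns the normalized URL as URI.
public URL toURL()
Returns the normalized URL as URL.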