Java Protocol Handlers

Roger Orr demonstrates the use of Java's URL handling to make code independent of the source of data.

Introduction

In today's programming environment data can be sourced from a variety of locations, using a range of protocols. In many cases the actual source of the data is irrelevant to the application; abstracting the details of location away from the code then lets us process data from a variety of different places simply by changing a configuration string.

So, for example, the configuration for the common Java logging suite log4j [ log4j ] can be provided as easily from a local file on the hard disk as from an Internet web site without any changes being required to the application code.

As another example, an overnight batch process might take data via ftp from a remote server, but be more easily tested by running against a sample disk file containing a known dataset.

The standard method of describing such abstract locations for data is through a URL (Uniform Resource Locator) - the most common examples of these being web site addresses such as http://www.accu.org.

Java comes with built-in support for URLs, most obviously through the java.net.URL class.

A simple example of using the URL class

Listing 1 is a trivial Java program which uses the URL class to read data in a location-agnostic manner.

    package howzatt;

    public class Example {
      public static void main( String[] args ) {
        for ( String uri : args ) {
          read( uri );
        }
      }

      public static void read( String uri ) {
        try {
          java.net.URL url = new java.net.URL( uri );
          java.io.InputStream is = url.openStream();
          int ch;
          while ( ( ch = is.read() ) != -1 ) {
            System.out.print( (char)ch );
          }
          is.close();
        }
        catch ( java.io.IOException ex ) {
          System.err.println( "Error reading " +
            uri + ": " + ex );
        }
      }
    }

Listing 1

This program can be run against:

a local file using an argument like file:example.txt
a remote file using an argument like file://server/path/file
a Web site using an argument like http://www.accu.org/ (subject to any restrictions by network firewalls or proxies).
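The same idea carries over to testing, as in the batch-process example from the introduction. The following is a minimal sketch rather than part of the original listings: the class name LocationAgnosticReader and the method readAll are made up for illustration, and it assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: the class and method names are illustrative only.
public class LocationAgnosticReader {

  // Reads everything the location refers to, whatever its scheme.
  public static String readAll(String location) throws IOException {
    URL url = URI.create(location).toURL();
    try (InputStream in = url.openStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for production data: a temporary file with known content.
    Path sample = Files.createTempFile("sample", ".txt");
    Files.write(sample, "known dataset\n".getBytes(StandardCharsets.UTF_8));
    System.out.print(readAll(sample.toUri().toString()));
    Files.delete(sample);
  }
}
```

Only the string passed to readAll changes when the data moves from a sample file to a remote server; the reading code stays the same.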

URLs - or 'what's in a name?'

The syntax of a URL is defined by RFC 2396 [ RFC2396 ] which, confusingly perhaps, uses the term URI (Uniform Resource Identifier) rather than URL. The difference between the two is that a URI is more general; it can describe resources that aren't locations, for example urn:isbn:978-0-470-84674-2 which is the ISBN of a book. In practice, however, the distinction between the two terms is often blurred. There is a fuller discussion of this issue on the W3C site [ W3C-URI ].

The identifier before the first colon in a URI names its 'scheme'; schemes such as http and ftp are globally recognised as standard. The full list of official schemes is held by the Internet Assigned Numbers Authority [ IANA ].
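As a small illustration of these terms, the sketch below uses java.net.URI to pick apart the two example identifiers from the text. The class name UriParts and the helper schemeOf are hypothetical names chosen for this example.

```java
import java.net.URI;

// Hypothetical demo class; only java.net.URI itself is a real API.
public class UriParts {

  // Returns the scheme: the identifier before the first colon.
  public static String schemeOf(String text) {
    return URI.create(text).getScheme();
  }

  public static void main(String[] args) {
    URI isbn = URI.create("urn:isbn:978-0-470-84674-2");
    URI web = URI.create("http://www.accu.org/");

    // A URN names a resource without saying where it lives, so it has
    // no host or path; java.net.URI calls such identifiers "opaque".
    System.out.println(isbn.getScheme() + " opaque=" + isbn.isOpaque());
    System.out.println(isbn.getSchemeSpecificPart());

    // A URL-style identifier breaks down into location components.
    System.out.println(web.getScheme() + " host=" + web.getHost()
        + " path=" + web.getPath());
  }
}
```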

Java has support for URIs and URLs through the java.net.URI and java.net.URL classes. Additionally, Java is supplied with inbuilt support for a number of different schemes; as a minimum, support for the following is guaranteed: http, https, ftp, file, and jar.

Note that Java refers to these schemes as 'protocols' although, for example, processing an http URL involves two protocols - DNS to resolve the host name and HTTP to access the data.

Although the five standard protocols are often adequate, there are sometimes cases where access is required to other data sources. Often the location of this data can be described using the URI syntax but it may not be an 'official' URI scheme.

For example, data might be obtainable using scp (secure file copy) and the obvious URI of scp://user@host/path/file could be used to represent the location of a file on some remote host. Or again, data may be supplied in a zip file or some other compressed format and you want to be able to access the data uncompressed from within your program.
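The gap between 'valid URI syntax' and 'supported by the URL class' can be seen directly. The sketch below is an illustration only (SchemeCheck and hasHandler are made-up names): it relies on the fact that converting a URI to a URL fails with MalformedURLException when no handler is available for the scheme.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical demo class, not part of the article's listings.
public class SchemeCheck {

  // True if this JVM can currently build a URL for the given location.
  public static boolean hasHandler(String location) {
    try {
      URI.create(location).toURL();
      return true;
    } catch (MalformedURLException ex) {
      return false; // typically reported as an unknown protocol
    }
  }

  public static void main(String[] args) {
    String[] locations = {
      "http://www.accu.org/",
      "file:example.txt",
      "scp://user@host/path/file"
    };
    for (String location : locations) {
      System.out.println(location + " -> " + hasHandler(location));
    }
  }
}
```

On a JVM with no extra handlers registered, the scp location parses as a URI but is rejected as a URL.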

Fortunately Java allows us to supply our own protocol handlers to extend the set of supported schemes.

There are existing extensions to the Java protocol handlers provided by various sites on the Internet and supporting various protocols; one such example is Hansa [ Hansa ]. If your requirements are for support of a well-known protocol you may be able to find a pre-written protocol handler.

However, there may be times when you want to implement a protocol handler yourself - whether for an unsupported official scheme or for a proprietary one.

Extending Java's URL handling

Java supplies a standard mechanism for extending the supported protocol schemes, which is described in brief in the documentation for the URL class. The process consists of three parts:

writing a class, derived from java.net.URLStreamHandler, that knows how to open a connection to URLs of the new scheme
writing a class, derived from java.net.URLConnection, to access data from these connections
associating the stream handler class with the protocol name

The first two steps will obviously depend heavily on the specifics of the protocol being supported, and may involve such actions as opening network connections or invoking external programs. I'll illustrate the process with a very simple example that uses the 'quote of the day' service to make the general principle clear without requiring too many protocol-specific details.

The final step involves plugging your new classes into the processing the URL class uses when it comes across a protocol for the first time. The URL class attempts to create an instance of the correct URLStreamHandler class in the following order:

If a factory has been registered with the URL class then the createURLStreamHandler method of the factory is called with the protocol name.
If there is no factory, or the factory does not recognise the protocol, then Java looks for the system property java.protocol.handler.pkgs, which is a '|'-delimited list of packages. For each package it tries to load the class <package>.<protocol>.Handler, which, if present, must be the URLStreamHandler for the given protocol.
Failing this the system default package is searched for a handler in the same way.
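The naming convention in the second step can be sketched as follows. This is an illustration of the documented convention only, not the JDK's actual lookup code; the class HandlerNames and the method candidates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how handler class names are derived.
public class HandlerNames {

  // Builds <package>.<protocol>.Handler for each '|'-separated package.
  public static List<String> candidates(String pkgsProperty, String protocol) {
    List<String> names = new ArrayList<>();
    if (pkgsProperty == null) {
      return names; // property not set: nothing to search
    }
    for (String pkg : pkgsProperty.split("\\|")) {
      pkg = pkg.trim();
      if (!pkg.isEmpty()) {
        names.add(pkg + "." + protocol + ".Handler");
      }
    }
    return names;
  }

  public static void main(String[] args) {
    String pkgs = System.getProperty("java.protocol.handler.pkgs");
    System.out.println(candidates(pkgs, "qotd"));
  }
}
```

So a property value of howzatt and a protocol of qotd leads to the class name howzatt.qotd.Handler, which is why the handler class has to live in a package named after the protocol.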

For stand-alone applications the easiest way to register your new protocol is to define the system property used by the URL class; so let's see how this might work.

A 'quote of the day' handler

A standard Internet service is supported, on many operating systems, on port 17. This service simply returns a random quote whenever a TCP/IP connection is made to it. If this service is available on your machine, you can see it at work using telnet. Here is a Windows example:

      C:> telnet localhost qotd
      "We want a few mad people now. See where the sane ones have landed us!"
      George Bernard Shaw (1856-1950)
      Connection to host lost.

If this attempt fails, you might need to start the service (or connect to another machine that does offer the qotd service). On Windows it is one of the 'Simple TCP/IP Services'.

In order to access this service from my example program at the start of the article I need a URL syntax, so I've picked the simple format:

qotd://hostname

Since we are using an unofficial scheme there are several alternative ways of encoding the data as a URI.

Listing 2 contains example code for a simple stream handler for the qotd protocol and the actual connection handling code itself is in Listing 3.

    package howzatt.qotd;

    public class Handler
    extends java.net.URLStreamHandler {
      protected java.net.URLConnection
      openConnection(java.net.URL u)
      throws java.io.IOException {
        return new QotdConnection( u );
      }
    }

Listing 2

    package howzatt.qotd;

    public class QotdConnection
    extends java.net.URLConnection {

      private static final int QOTD = 17;
      private java.net.Socket socket;

      public QotdConnection( java.net.URL u ) {
        super( u );
      }

      public void connect()
      throws java.io.IOException {
        final String host = getURL().getHost();
        socket = new java.net.Socket( host, QOTD );
        connected = true;
      }

      public java.io.InputStream getInputStream()
      throws java.io.IOException {
        if ( ! connected )
          connect();
        return socket.getInputStream();
      }
    }

Listing 3

Now if we compile these two additional classes, we can use the qotd protocol with the example program shown earlier like this:

    java -Djava.protocol.handler.pkgs=howzatt howzatt.Example qotd://localhost

If all is well we get a quote displayed - we have transparently extended our simple application to acquire data from a different source.
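Not every machine offers a qotd service, so it can be handy to exercise the connection code against a throwaway local server. The following is a sketch under several assumptions that are not in the listings above: the class names QotdLocalDemo and PortAwareHandler are made up, the handler honours an optional port in the URL (falling back to 17), and the handler is passed directly to the three-argument java.net.URL constructor rather than being found through the system property. It assumes Java 9 or later for readAllBytes, and recent JDKs may report a deprecation warning for the URL constructor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

// Hypothetical test harness; not part of the article's listings.
public class QotdLocalDemo {

  // Like Listing 2, but honouring an optional port in the URL.
  static class PortAwareHandler extends URLStreamHandler {
    @Override
    protected int getDefaultPort() {
      return 17; // the well-known qotd port
    }

    @Override
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        private Socket socket;

        @Override
        public void connect() throws IOException {
          if (connected) {
            return;
          }
          int port = url.getPort() != -1
              ? url.getPort() : url.getDefaultPort();
          socket = new Socket(url.getHost(), port);
          connected = true;
        }

        @Override
        public InputStream getInputStream() throws IOException {
          connect();
          // Closing this stream also closes the socket.
          return socket.getInputStream();
        }
      };
    }
  }

  // Serves one quote on an ephemeral local port and reads it via a URL.
  public static String fetchFromLocalServer(String quote) throws Exception {
    try (ServerSocket server = new ServerSocket(0)) {
      Thread responder = new Thread(() -> {
        try (Socket client = server.accept();
             OutputStream out = client.getOutputStream()) {
          out.write(quote.getBytes(StandardCharsets.UTF_8));
        } catch (IOException ignored) {
          // the reading side will see the failure
        }
      });
      responder.start();

      // The explicit handler bypasses the normal registration lookup.
      URL url = new URL(null,
          "qotd://localhost:" + server.getLocalPort(),
          new PortAwareHandler());
      try (InputStream in = url.openStream()) {
        String received =
            new String(in.readAllBytes(), StandardCharsets.UTF_8);
        responder.join();
        return received;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(fetchFromLocalServer("A sample quote"));
  }
}
```

Supplying the handler explicitly is convenient for tests, but it gives up the transparency of the system property approach: code that builds its URLs from a plain string will not see the handler.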

Problems with protocol handlers

In my experience the biggest problem with extending Java's protocol handlers is with the registration process. Writing the code to handle the specific protocol is a fairly clear task: it requires a decision about the URI syntax to be used and code written for the particular connection type.

The registration problem is harder because of two design issues.

The factory registration is inextensible
The class loader used by the URL class cannot be changed

As I mentioned earlier, one way of registering your URLStreamHandler class with the URL class is to provide a factory object. Unfortunately this mechanism is somewhat inflexible; specifically the setURLStreamHandlerFactory method can be called at most once in a given Java Virtual Machine.

This may be a valid restriction for a small Java application but it becomes hard to manage when two different parts of the application, possibly written by unrelated teams, each wish to register a factory for their own protocol with the URL class.
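The once-only restriction is easy to observe. In the sketch below (FactoryOnce and tryInstall are hypothetical names) the second attempt to install a factory is rejected; the JDK signals this with a java.lang.Error, which the helper catches purely for the purposes of the demonstration.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

// Hypothetical demo class, not part of the article's listings.
public class FactoryOnce {

  // Attempts to install a factory; reports whether the JVM accepted it.
  public static boolean tryInstall(URLStreamHandlerFactory factory) {
    try {
      URL.setURLStreamHandlerFactory(factory);
      return true;
    } catch (Error alreadyDefined) {
      // Thrown when a factory has already been set in this JVM.
      // Catching Error this broadly is for demonstration only.
      return false;
    }
  }

  public static void main(String[] args) {
    // Returning null tells the URL class to fall back to its defaults.
    URLStreamHandlerFactory noOp = protocol -> null;
    System.out.println("first attempt:  " + tryInstall(noOp));
    System.out.println("second attempt: " + tryInstall(noOp));
  }
}
```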

However, even leaving this problem aside, the factory approach requires the application code to register the factory explicitly, which makes it hard to add new protocols to existing programs. Adding a protocol without touching the application is exactly what we did earlier with the example program, and it is one of the most powerful aspects of Java's protocol handler support.

On the other hand, using the protocol.Handler convention can be problematic because of the way Java class loaders work.

When a new protocol is detected by the URL class it tries to load the appropriate handler class, but using the class loader that was used to load the URL class itself.

For a stand-alone application this does not usually present a problem, but where the Java code is running inside a web service or as an applet it is normal for user-supplied code to be loaded by a different class loader than the core Java classes.

In these cases, any protocol handler class supplied in the user code will not be found by the system class loader used to load the java.net.URL class.

In these cases it may also be less simple to configure the system property used by the URL class externally; instead, the System.setProperty method can be used at runtime to add additional packages. Note, however, that this approach might be barred by the security manager, and care must be taken to ensure that any existing packages defined by this system property are retained. See Listing 4.

    public static void register() {
      final String packageName =
         Handler.class.getPackage().getName();
      final String pkg = packageName.substring(
         0, packageName.lastIndexOf(  '.' ) );
      final String protocolPathProp =
         "java.protocol.handler.pkgs";

      String uriHandlers = System.getProperty(
         protocolPathProp, "" );
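      // Compare whole entries: a substring test would wrongly treat
      // "howzatt" as present when only "com.howzatt" is listed.
      if ( ! java.util.Arrays.asList(
             uriHandlers.split( "\\|" ) ).contains( pkg ) ) {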
        if ( uriHandlers.length() != 0 )
          uriHandlers += "|";
        uriHandlers += pkg;
        System.setProperty( protocolPathProp,
           uriHandlers );
      }
    }

Listing 4

Alternative approaches

Given the problems with registration, other approaches can be taken. One is to jettison the Java URL and provide a different abstraction; this seems to be the approach favoured by the Apache 'Commons Virtual File System' [ VFS ], which retains the use of the URI syntax but provides an alternative method of accessing the data using a FileSystemManager class.

The weakness with such an approach is that it does not of itself support handling of additional protocols when using existing code that uses the java.net.URL class internally to connect to a URL.

Another approach is to use the factory registration, but to provide a factory class that itself supports registration of multiple different stream handlers using different names.

This approach supports code using the java.net.URL class, but it does require a registration call for each protocol and hence changes are needed to an application before it can make use of the new URLs. However, the approach gets around the problems discussed above with multiple class loaders, since the factory is loaded by the user code class loader rather than by the class loader for the URL class.
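A factory of this kind might look like the following minimal sketch. The class name MultiProtocolFactory and its register method are assumptions made for illustration, not an existing library API.

```java
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical factory that accepts registrations for many protocols.
public class MultiProtocolFactory implements URLStreamHandlerFactory {

  private final Map<String, URLStreamHandler> handlers =
      new ConcurrentHashMap<>();

  // Associates a handler with a protocol name (compared in lower case).
  public void register(String protocol, URLStreamHandler handler) {
    handlers.put(protocol.toLowerCase(Locale.ROOT), handler);
  }

  // Returning null for unknown protocols lets the URL class carry on
  // with its normal lookup, so built-in schemes keep working.
  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    return handlers.get(protocol.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    MultiProtocolFactory factory = new MultiProtocolFactory();
    // Placeholder handler; a real one would open a connection.
    factory.register("qotd", new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) {
        throw new UnsupportedOperationException("demo handler only");
      }
    });
    System.out.println("qotd known: "
        + (factory.createURLStreamHandler("qotd") != null));
    System.out.println("http known: "
        + (factory.createURLStreamHandler("http") != null));
  }
}
```

An application would install a single instance with URL.setURLStreamHandlerFactory early in start-up and share it, so that each component can call register for its own protocol.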

Restrictions

The Java protocol handlers are not suitable for every situation. There are two main reasons for this.

The URL abstraction may hide too much detail of the underlying data representation. For example, processing might require file-system specific methods, or be intolerant of network latency.
Not all resources are easily described by a URI, and not all protocols fit into the URLConnection model. Security can be a particular problem here since the usual way of including a username/password into a URL uses plain text which is obviously rather insecure.
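The security point is easy to demonstrate: any credentials embedded in a URI are available as plain text to whatever code, log file or error message handles the string. The sketch below uses a made-up host and credentials, and the class name UserInfoDemo is hypothetical.

```java
import java.net.URI;

// Hypothetical demo class; the location below is invented.
public class UserInfoDemo {

  // Returns the user-information part of a URI, or null if absent.
  public static String userInfoOf(String location) {
    return URI.create(location).getUserInfo();
  }

  public static void main(String[] args) {
    String location = "ftp://user:secret@host.example/path/file";
    // The password is recovered without any decoding step.
    System.out.println(userInfoOf(location));
  }
}
```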

Conclusion

The location abstraction provided by the URL notation makes it possible to write programs that can transparently access data from a wide variety of different places.

There is a parallel with the way that Unix treats 'everything like a file' - even access to system information. This common view of data means that simple tools may have wide applicability. The same principle applies with the use of URLs in Java - the abstraction can make programs able to process a wide range of data from a variety of sources without needing explicit coding.

Java provides a relatively simple mechanism to add new protocols to your applications and hence widen the range of locations for sourcing data.

There is a great deal of power in this approach; sadly the specific details of registering with the URL class are not very flexible but in most cases there are various techniques to work around the limitations.

References

[ log4j ] 'Apache log4j', http://logging.apache.org/log4j/

[ RFC2396 ] 'Uniform Resource Identifiers (URI): Generic Syntax', http://www.ietf.org/rfc/rfc2396.txt

[ W3C-URI ] 'URIs, URLs, and URNs: Clarifications and Recommendations', http://www.w3.org/TR/uri-clarification/

[ IANA ] 'Uniform Resource Identifier (URI) Schemes', http://www.iana.org/assignments/uri-schemes.html

[ Hansa ] 'Project Hansa', http://wiki.ops4j.org/dokuwiki/doku.php?id=hansa:hansa

[ VFS ] 'Commons Virtual File System', http://commons.apache.org/vfs/