OBriens tower
Musings on software development, Linux and business

Archive for the 'java' Category

Quantifying simplicity in code

Wednesday, May 14th, 2008

One of the key values here at Applepie is simplicity. As a member of Applepie I strive whenever possible to deliver lean, readable, manageable code that meets its requirements and is a pleasure for other developers to use. It’s not easy. To help ‘keep it simple’ we follow a number of practices, such as code peer reviews, which at their heart have the question: “Is this simple?”. These practices lead to better code that we have qualified as manageable and elegant. Code we feel is simple.

However, qualifying or feeling something to be simple is often not enough. We want hard facts; we want to quantify how well a body of code exhibits simplicity. One answer is to analyse the code base and generate metrics that can support us in our quest to simplify. One of the best metrics I’ve found for this is Cyclomatic Complexity.

Cyclomatic Complexity (CC) measures the complexity of your code by counting the number of independent paths through it. The number of paths is the CC number. The bigger the CC number, the more likely the code is difficult to conceptualize and the less likely you can unit test it effectively. It’s a bit like navigating a road: a Y junction fine, a cross roads OK, a four-way interchange is (well for me anyway) reaching my cognitive limit.
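
As a rough illustration, CC for a single method can be counted by hand as the number of decision points plus one. The class and method below are made up for the example, and individual tools may count slightly differently (some also count boolean operators), so treat the number in the comment as a sketch rather than what any particular tool will report.

```java
class ShippingCalculator {
    // Counting by hand: 1 for the method itself
    // plus 3 decision points (if, for, if) gives a CC of 4.
    static int shippingCost(int[] itemWeights, boolean express) {
        int cost = 0;
        if (express) {                      // decision point 1
            cost += 10;
        }
        for (int weight : itemWeights) {    // decision point 2
            if (weight > 5) {               // decision point 3
                cost += 3;
            } else {
                cost += 1;
            }
        }
        return cost;
    }
}
```

A CC of 4 also hints at how many test cases are needed to exercise each independent path through the method.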

As a rule of thumb, Cyclomatic Complexity Numbers are as follows:

  • Simple - 10 or less is optimal.
  • Manageable - 11-20 may be problematic and will require large unit tests.
  • Complex - 21-50 will be problematic and certainly will not be easy to test.
  • Forget it - over 50 cannot be tested and requires a self-actualized yogi chess grandmaster to understand.

So how do you profile your code for CC? I’m going to focus on Java, though similar tools exist for C#, PHP, Ruby and Python. The tool I’ve used for Java is PMD (works with the IDE of your choice). PMD generates many static metrics on your code base, including Cyclomatic Complexity.

To install PMD follow the instructions on this PMD onJava article, or if you only have 5 minutes here’s a quick install guide for Eclipse. Pop the exploded zip downloaded from PMD into your plugins folder. After you have installed the plugin you need to activate the metrics for your project. Go to the project properties, select the Metrics option and then select enable metrics. That’s it! The metrics are then calculated and a metrics view is presented.

McCabe Cyclomatic Complexity (CC) is the PMD metric we are interested in. PMD generates these CC statistics for the entire project. It allows you to drill down a tree of statistics and quickly ascertain the highest CC for each package, class and method. It even highlights items over a threshold in red (defaults to 11, but is configurable). Now at last I have the CC count for every piece of code I work on.

These generated statistics give a great overview of where code needs to be simplified and (hopefully) help a bit more in answering “Is this simple?”, allowing me to back up my gut feel with some quantifiable statistics.

If you want to know more on Cyclomatic Complexity see here. If you’re interested in other metrics supporting simplicity see Operands on an operation.

Java job vacancies in Galway, Ireland

Saturday, May 10th, 2008

Galway is a great place to live and work.

When I arrived in Galway from British Columbia in 1993, both Linux and Java were in their infancy. I had heard of neither - and why should I have - I was a carpenter.
Engaged to a Galway girl (there ain’t nothin’ like them, lads!), I was at that time entitled to a work permit, and went down to the Mill Street Garda station to get one. The conversation with the garda went something like this:

me: Hi, I am engaged to a Galway girl and I’d like to get a work permit.
garda: Where are you from?
me: Canada.
garda: What do you do?
me: I’m a carpenter.
garda: Go back to Canada, there’s no work here.

The following year I traded hardwood and softwood for hardware and software. Wow, that was almost fifteen years ago.

The job situation is different here now. There are jobs for Java developers in Galway, and yes, foreigners are welcome (permission to work in Europe is required).

Some of the companies I am aware of who have recently been hiring Java developers in Galway include:
Applepie Solutions (us)
ATFM Solutions
Celtrak
Cisco Systems
Duolog
Fisc Ireland (Fidelity Investments)
Nortel Networks

Know of other companies looking for Java developers in Galway? Let me know and I’ll add a link here.

What’s wrong with this java code?

Wednesday, April 16th, 2008

Can you spot the bug in this code?

01   Connection conn=null;
02   Statement st = null;
03   ResultSet rs = null;
04   try{
05    conn = getConnection();
06    st = conn.createStatement();
07    rs = st.executeQuery("select foo from bar");
08    ...
09  }finally{
10    if(rs!=null) rs.close();
11    if(st!=null) st.close();
12    if(conn!=null) conn.close();
13  }

Answer: If an SQLException is thrown at line 10 or 11, line 12 will not be executed. If line 12 is not executed, the connection may be leaked.

Yes, the likelihood of an exception being thrown at line 10 or 11 is low, but good Java programmers will avoid leaking resources by defensive use of try-finally blocks. Use the pattern of starting the try block immediately after allocating a resource. Here’s a better way to write the same block of code:

  Connection conn = getConnection();
  try{
    Statement st = conn.createStatement();
    try{
      ResultSet rs = st.executeQuery("select foo from bar");
      try{
        ...
      }finally{
        rs.close();
      }
    }finally{
      st.close();
    }
  }finally{
    conn.close();
  }
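
On Java 7 or later, try-with-resources gives the same guarantee with less nesting. The sketch below does not touch a database: Resource is a hypothetical stand-in for Connection, Statement and ResultSet that only records when it is closed, so you can see that every resource is still closed even when one close() throws.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderDemo {
    // Hypothetical stand-in for Connection, Statement and ResultSet:
    // it only records that close() was called, and can be told to fail.
    static class Resource implements AutoCloseable {
        private final String name;
        private final boolean failOnClose;
        private final List<String> log;

        Resource(String name, boolean failOnClose, List<String> log) {
            this.name = name;
            this.failOnClose = failOnClose;
            this.log = log;
        }

        @Override
        public void close() throws Exception {
            log.add(name);
            if (failOnClose) {
                throw new Exception("close failed: " + name);
            }
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Resources are closed in reverse order of declaration,
        // and a failing close() does not stop the others being closed.
        try (Resource conn = new Resource("conn", false, log);
             Resource st = new Resource("st", false, log);
             Resource rs = new Resource("rs", true, log)) {
            // ... work with rs here ...
        } catch (Exception e) {
            log.add("caught: " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With real JDBC objects the header would declare the Connection, the Statement and the ResultSet in that order, since all three are AutoCloseable from Java 7 onwards.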

Getting a thread dump from Tomcat running as a Windows service

Monday, April 14th, 2008

We’re supporting a Java application deployed on Tomcat, which is running as a Windows service. On Friday the logs showed that parts of the application were frequently timing out while trying to acquire a DB connection from a pool. We wanted to get a thread dump to see if any threads holding connections were deadlocked. If Tomcat had been started from a console this would be straightforward. Unfortunately it wasn’t, and we didn’t have the option to restart it on the production server.

One useful tool is the free web start version of stack trace. We had no joy with this either though: our remote desktop session was not running under the account from which the service was started. Stack trace helpfully suggests using Start->Run->“mstsc /console” to start the remote desktop session in this case, but this would have terminated other sessions that were open to the server, and therefore wasn’t an option for us.

Cue a moment of inspiration from Rob, which resulted in a simple jsp that will output a thread dump. Note that your application must be running on at least Java 5.0 for this to work. Just make a simple jsp with the following snippet as the body of the page, and drop it in the web root of your application. Then fire up a browser, navigate to the jsp and view the dump without even having to restart Tomcat!


<%@ page import="java.util.Map" %>
<body>
<center><h1>Thread Dump</h1></center>
<pre>
<%
  // Build a thread dump from inside the running JVM (needs Java 5.0 or later).
  StringBuffer sb = new StringBuffer();
  Map<Thread, StackTraceElement[]> st = Thread.getAllStackTraces();
  for (Map.Entry<Thread, StackTraceElement[]> e : st.entrySet()) {
    StackTraceElement[] el = e.getValue();
    Thread t = e.getKey();
    // Header line: name, daemon flag, priority, id and state of the thread.
    sb.append("\"").append(t.getName()).append("\" ");
    sb.append(t.isDaemon() ? "daemon" : "").append(" prio=").append(t.getPriority());
    sb.append(" Thread id=").append(t.getId()).append(" ").append(t.getState());
    sb.append("\n");
    // One line per stack frame.
    for (StackTraceElement line : el) {
      sb.append("\t" + line + "\n");
    }
    sb.append("\n");
  }
%>
<%=sb.toString() %>
</pre>
</body>
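
If you would rather write the dump to a log, or test it outside a servlet container, the same loop can live in a plain class. This is only a sketch of that idea; the class name ThreadDump is made up here and is not part of the jsp above.

```java
import java.util.Map;

class ThreadDump {
    // Returns one block per live thread: a header line followed by its stack frames.
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" ");
            sb.append(t.isDaemon() ? "daemon" : "").append(" prio=").append(t.getPriority());
            sb.append(" Thread id=").append(t.getId()).append(' ').append(t.getState()).append('\n');
            for (StackTraceElement line : e.getValue()) {
                sb.append('\t').append(line).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```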

Feeling better after a good stack dump

Thursday, March 27th, 2008

Earlier this week we encountered performance problems on one of the production systems we developed and help support. After having migrated several hundred clients onto the high-throughput Java-based system, the users began to notice some strange slowness appearing.

Log files are great. By studying the production logs we determined that the output side of the application was no longer keeping up with input. Having already developed some performance enhancements for a future release, we backported some of these and generated a patch which we tested then deployed into the production system. The system was fast again.

Or so we thought. In fact it was much faster for two days until the strange slowness suddenly reappeared. The logs revealed that things had slowed down, but gave no indication why. I generated a stack dump (Java on Linux) using kill -3. I created a three-column spreadsheet, then, skipping idle threads belonging to the web application container, created one row for each application thread. The columns are:

  1. Thread name
  2. Kind of thread (input, output, etc.)
  3. what the thread is doing

This took a little time as the application has more than 100 threads, but as I did it a pattern began to emerge… I could see many threads waiting to lock a statically synchronized method of a date parsing utility component. I investigated and found that the method had been statically synchronized because it relies on java.text.SimpleDateFormat, a class which is not synchronized and relatively expensive to create.

Studying what others have written about this problem, the development team is now reworking the implementation to use ThreadLocal instances of the SimpleDateFormat rather than statically shared instances. The stack dump was very useful in helping to find the blockage. I hope the fix resolves the problem!
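
For anyone facing the same blockage, here is a minimal sketch of the ThreadLocal approach. It is not the team's actual code: the class name DateUtil, the date pattern and the UTC time zone are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class DateUtil {
    // Each thread lazily gets its own SimpleDateFormat, so no locking is needed
    // and the cost of creating the formatter is paid once per thread.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    // Pattern and time zone are assumptions for this example.
                    SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
                    f.setTimeZone(TimeZone.getTimeZone("UTC"));
                    return f;
                }
            };

    static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    static String format(Date date) {
        return FORMAT.get().format(date);
    }
}
```

The trade-off is one formatter instance per thread instead of one shared instance behind a lock, which is usually a fair price when more than 100 threads are queuing for the same method.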

Cheap and cheerful java object persistence using Lucene

Tuesday, March 18th, 2008

I took advantage of the St. Patrick’s long weekend to experiment with using Lucene as a simple Java object store. The context of the research was to determine whether it is feasible to create with Lucene a simple persistence layer to be used in a project currently holding an increasing number of disconnected Java objects in an in-memory map.

I came to considering Lucene as an object store having already investigated using persistent maps and caching components such as jcs and ehcache. One of the main issues I encountered with these was that searching for objects based on some criteria other than the key required either indexing the sought objects at an application level, or putting up with a lot of I/O when iterating through a large volume of stored objects. I deemed hibernate to be an option, but avoided it primarily due to concerns about increasing the complexity of an already-complex-enough project.

While the practice of indexing Java objects with Lucene has been around for a while, the option of easily persisting the objects themselves in Lucene is newer. A recently added feature provides the ability to store fields containing binary content - perhaps a suitable place for storing Java objects? Grant Ingersoll, one of the committers on the Lucene project, recently blogged:

I even use it in things that 5 years ago I would never have thought I would use it for (object stores, etc.)

There are several features about my java objects which make them suitable for indexing and storing in lucene:

  • They already implement java.io.Serializable.
  • They are essentially data holders.
  • They are disconnected - they do not hold references to other objects which will also be in the repository.
  • They have get* methods which can be used for accessing most anything I will want to search on.
  • Each object already has a unique identifier

The result of the weekend’s work was a single java class which implements persistence in lucene. I called it Lucos - Lucene object store. It is available for download here.

The basic functionality is to put/get an object in/out of the store in a manner similar to how an object is stored in a map. Here is an example:

Person fred = new Person("Fred Flinstone");
Lucos lucos = new Lucos();
lucos.put("fflinstone",fred);
Person x = (Person) lucos.get("fflinstone");
//NB: x is a COPY of fred
assertEquals(fred,x);
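
Why is x a copy? Assuming the stored value travels through standard Java serialization on its way into and out of a binary field (a guess based on the Serializable requirement, not a listing of the Lucos source), a round trip looks roughly like this. The nested Person class is a hypothetical data holder for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationCopy {
    // Hypothetical data holder standing in for Person.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        Person(String name) { this.name = name; }
        String getName() { return name; }
        @Override public boolean equals(Object o) {
            return o instanceof Person && ((Person) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    // Turns the object into the bytes that a binary field could hold.
    static byte[] toBytes(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds a new, equal object from the stored bytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The object that comes back is equal to the original but is not the same instance, so changes made to it are not seen by the store until it is put again.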

Putting an object in the store using the put(String id, Object value) method creates indexed fields for all of the no-arg get* methods on the value class. It also indexes the value class itself and all the classes it extends or interfaces it implements. Put changes are committed immediately to the index. Subsequent gets (or searches) reload the index (if necessary) to retrieve the latest changes.
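
To give a feel for the get* part, here is a small reflection sketch that turns the no-arg getters of an object into field name/value pairs, so getName() becomes the field "name". It illustrates the idea only: the class GetterFields and its Person are made up for the example, and the real Lucos code may well differ.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class GetterFields {
    // Collects field name -> value for every public no-arg get* method.
    static Map<String, String> fieldsOf(Object value) {
        Map<String, String> fields = new TreeMap<String, String>();
        for (Method m : value.getClass().getMethods()) {
            String n = m.getName();
            boolean getter = n.startsWith("get") && n.length() > 3
                    && m.getParameterTypes().length == 0
                    && m.getReturnType() != void.class
                    && !n.equals("getClass");
            if (!getter) {
                continue;
            }
            try {
                Object result = m.invoke(value);
                if (result != null) {
                    // getName -> name
                    String field = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                    fields.put(field, result.toString());
                }
            } catch (Exception e) {
                // Skip getters that cannot be invoked.
            }
        }
        return fields;
    }

    // Hypothetical data holder used only for this example.
    public static class Person {
        private final String name;
        public Person(String name) { this.name = name; }
        public String getName() { return name; }
    }
}
```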

To find all the instances of person in the repository:

EntryIterator it = lucos.findInstances(Person.class);
System.out.println("Found "+it.length+" persons");
while(it.hasNext()){
  String id = it.getKey();
  Person person = (Person) it.getValue();
  ...
}

Providing search functionality was one of the features I required in order to overcome the issues already identified with searching a persistent map. One of the difficulties I encountered in doing this was that where fields were stored tokenized an exact match did not seem possible, and where stored untokenized, a partial match did not. To overcome this difficulty, I indexed fields in both tokenized and untokenized format, appending ‘.exact’ to the name of the untokenized field. Given that my Person has method String getName(), I can search my objects with any of these:

// find all persons named fred using a TermQuery
lucos.findInstances(Person.class, "name", "fred");
// find all persons named fred using lucene syntax query and the installed Analyzer
lucos.findInstances(Person.class, "name:fred");


// find all persons named Fred Flinstone using a TermQuery
lucos.findInstances(Person.class, "name.exact", "Fred Flinstone");

If you want to use a query not parsed using the Lucos analyzer, parse the query first, then pass it to findInstances:

QueryParser parser =
new QueryParser("name.exact", new KeywordAnalyzer());
Query query = parser.parse("\"Fred Flinstone\"");
it = lucos.findInstances(Person.class,query);

Here’s how to create a Lucos instance which uses file persistent storage:

String folder = "{path to folder}";
Directory directory = FSDirectory.getDirectory(folder);
Lucos lucos = new Lucos(directory);

Finally, don’t forget to close() lucos when finished with it. This will release the lucene write lock:

lucos.close();

I still need to do volume and load testing with some production data to verify that the solution provides an acceptable memory/performance trade-off in reducing the size of my in-memory map. For the moment I’m satisfied that it is feasible to use Lucene as a Java object store. The solution adds minimal complexity to the project, introducing only one additional (lucene) jar file. For a future iteration it might be worth considering adding a dependency on xstream, removing the requirement that objects placed into the repository implement the Serializable interface, and also possibly making them more generally searchable.

If you would like to add cheap and cheerful Java object persistence into your project, I hope that Lucos might provide you with some code for thought and perhaps the basis for a solution. The code and a test class for Lucos are available for download here.

Comments are welcome!

Toolbox for a Java craftsman

Thursday, July 26th, 2007

Back in the 80’s, the “olden days”, before I was a software guy I was a builder guy. I drove around in a rusty Ford Econoline van with scaffolds and ladders on the roof; tools and materials in the back of the van, techniques in the back of the head. Tools and materials and techniques. Most jobs required all three, and like all tradesmen I accumulated some of each over the years. In those days I would arrive on a project ready to hit the ground running - I brought the basics with me.

Now with more than a decade of industrial software engineering behind me, I’m pleased with the collection of tools and techniques I carry with me. I don’t mean the phone and the laptop. I mean tools like putty and firefox and scite and cvs, mstsc, thunderbird, skype, gaim, open office, gimp, password safe, igal and vi to name a few.

The contents of the toolbox change over time. I no longer use cygwin, and though emacs is still in there it doesn’t come out so often. Subversion is in there now, though I haven’t used it enough to make it my own.

Some tools are best not left behind. To any java project I always bring eclipse, ant and junit, and usually log4j.

Re-usable suitably-licensed open-source software components are a big part of my toolbox. Here are some of the most tried and trusted components that I have included in various java projects over the years and am happy to recommend:

  • Logging: log4j (Notice I’m mentioning it for the second time)
  • Working with xml: dom4j provides useful functionality.
  • Text indexing: lucene is an extremely well done component.
  • Templating: velocity is proven.
  • Scheduling: quartz is very robust and reliable.
  • Working with pdf: PDFbox is worthwhile. I’d like to try using pdfbox with velocity sometime, for templating pdf documents.
  • File identification: ffident provides mime type identification. I added some (what I thought were useful) new features which I sent to the author, but he doesn’t seem to have folded them in.
  • Http requests: http client is reliable.
  • Http file uploads: Commons file upload is helpful for handling files uploaded to your servlet or jsp page.
  • cleaning up html: nekohtml is very useful if you want to take html pages from the wild and convert them to xml in order to run xpath expressions against them.
  • Working with excel spreadsheets: poi is handy for reading and writing spreadsheets.
  • Charting: JFreeChart is very useful.
  • Embedded sql: hsqldb is a useful sql database written in java which can be embedded into your application to run in-memory or persisting to the file system.
  • Scripting: BeanShell, Rhino, and BSF if you can’t decide between them ;-)
  • JNDI: Commons naming provides jndi setup for your application using tomcat 4 style configuration. (I don’t know why this is not more readily available, if you do, please let me know!)

Maybe some of these tools will be useful to you. If you have some more tried and trusted components for java applications, maybe post a comment.

When I moved from the West of Canada to the West of Ireland in the early 90’s I left most of the tools behind me, the saws and the ladders and the compressors and the welder. I still have my old hammer. I’m a bit rusty on some of the techniques, but I’ll probably never forget how to dry a paint brush, taught to me by my mentor Lorenzo Quarenghi, son of Walter: after cleaning the brush in water or thinner (as appropriate), go outside wearing your old shoes or boots. With your heel on the ground and your toe pointing up, hold the handle of the brush and tap the metal edge on the top of your toe. The spray goes onto the ground and onto the sole of your shoe, the brush becomes clean and dry!

Naive language detection using ICU4J and Classifier4J

Thursday, May 10th, 2007

Consumer products in Canada are labeled in both English and Français. Because of this, many otherwise ‘en’ exclusive kids (like me) have enjoyed the advantage of acquiring useful ‘fr’ vocabulary like pain, miel, beurre, and beurre d’arachides. I never picked up a lot of French vocabulary,* but am well able to visibly distinguish between the French and English sides of a label.

Now I live in Europe where many consumer products are labeled in several languages. I’m able to pick out which one is French, English and German. I’m embarrassed to admit it, but if you gave me texts in each of Portuguese, Spanish, Italian and Latin I’d be hard pressed to identify them correctly.

So I was in a bit of a pickle last week when we began receiving IPTC7901 news feeds in Latvian, Lithuanian, Estonian, Russian and English, with the goal of converting them to NewsML, loading them into Profium’s Metadata Server and presenting them in the editorial system. I knew neither the language nor the character encoding. News represented as a series of ??? ?????? ??? ?????? ?? marks is not a big seller.

The news provider advised us that three character encodings were used for the feeds: English in ISO-8859-1, Russian in KOI8-R and the three Baltic languages in ISO-8859-13. Examining the files using a text editor, it was obvious from the text which articles were written in English and which in Russian, but I was unable to distinguish between the three Baltic languages. Apart from the text of the articles I could find no distinguishing features between the feeds. (I found no relationship between IPTC subject and the article language or encoding).

My investigation led me to use ICU4J CharsetDetector to detect the character set. This worked fine for the Russian and English texts, but failed miserably for the Baltic languages, identifying them mostly as Spanish or Portuguese and wrongly detecting the character encoding for these files as ISO-8859-1 rather than ISO-8859-13.

I resolved the problem by teaching my application the difference between Latvian, Lithuanian and Estonian. To do this, I retrieved some text in each of those languages from the internet and saved these to separate UTF-8 encoded xml files (lv.xml, lt.xml, et.xml). Our application loads the text from the xml files into a Classifier4J VectorClassifier and is now able to distinguish between the three Baltic languages.

We now use the following logic to determine the language of the incoming feed:
If ICU4J gives a high confidence that the file is English or Russian, go with that; otherwise read the text using ISO-8859-13 and select the Baltic language which is given the highest degree of confidence by Classifier4J.
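
Stripped of the library calls, that decision is only a few lines. In the sketch below the detector result and the classifier scores are passed in as plain values, so no ICU4J or Classifier4J API is shown. The class name, the language codes and the cut-off of 80 for "high confidence" are assumptions for illustration; the threshold we actually used is not given here.

```java
import java.util.Map;

class LanguageChooser {
    // Assumed cut-off for "high confidence" on a 0-100 scale.
    static final int HIGH_CONFIDENCE = 80;

    // detectedLanguage: language code reported by the charset detector, e.g. "en", "ru", "es"
    // confidence:       detector confidence, 0-100
    // balticScores:     classifier score per Baltic language code ("lv", "lt", "et")
    static String chooseLanguage(String detectedLanguage, int confidence,
                                 Map<String, Double> balticScores) {
        boolean englishOrRussian = "en".equals(detectedLanguage) || "ru".equals(detectedLanguage);
        if (englishOrRussian && confidence >= HIGH_CONFIDENCE) {
            return detectedLanguage;
        }
        // Otherwise pick the Baltic language with the highest classifier score.
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : balticScores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```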

Apparently the process is working well. The incoming IPTC7901 documents are being picked up, converted to NewsML and are appearing correctly in the multimedia desk. I say apparently because the Baltic languages are, as they say, Greek to me!

Next week I’m off to Riga. I will be going there without the illusion I had travelling to France a number of years ago. That time I arrived in Antibes with nowhere to stay and the naive belief that using my Canadian French label vocabulary I could parler un petit peut. Quelle horreur!

*Some will say I’ve never picked up much English vocabulary either.

Extracting metadata from html to excel

Monday, February 19th, 2007

We recently completed a small but important project involving extracting marketing data from a set of html pages into an excel spreadsheet. This is quite a simple process, and here I will describe why businesses sometimes need to do this, the approach we use, and some of the issues to expect if you do this.

Government bodies often publish information which can be very useful to businesses. But frequently, although the information is useful, the format is not. Manual extraction is possible though tedious and time-consuming.

A building supplier wants the contact details extracted from publicly available County Council Planning Applications. They use these for a targeted marketing campaign with great effect: people who are spending money building or renovating their homes will be spending a big part of that money on building supplies. The building supplier wants these customers. The planning applications including contact details are published on a website, but navigating through the website to determine which pages have been recently added or changed is difficult. Automation makes extracting the data feasible.

A service or product supplier wishes to target businesses in a particular sector. A government body maintains a database of companies operating in that sector, but again, this information is provided in a format where each company’s details are on a separate web page containing an overview and contact details. If the number of web pages extends beyond several dozen, then an automated extraction process will likely be cheaper and more accurate than manual effort.

In general a set of web pages such as the planning list or the company list is generated by an application. All the pages contain the same look/feel, and if you look into the ’source’ of the web page from your web browser you may also notice that the various pages of the list have patterns. These patterns can be used to extract the data. For example the contact name may come after “Name”, and the telephone number after “Tel.”. While it may be possible to use regular expressions to extract the data, we’ve found using XPath to be easier and more accurate.

Here is a process which works to extract data from a set of related web pages. This is sometimes called web scraping.

  1. Download the full set of web pages using wget
  2. Use NekoHTML to convert the set of web pages into well-formed xml documents.
  3. Using a small subset of the web pages, create a named set of XPath expressions which uniquely identify each piece of data. The Firefox XPath Checker is useful here.
  4. Run the XPath expressions over the full set of documents, using POI to place the results into a Microsoft Excel (or OpenOffice) spreadsheet, typically one row per document.
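
Steps 3 and 4 can be sketched with nothing more than the XPath support built into the JDK. The example assumes the page has already been converted to well-formed xml, and it stops at producing one row of named values, leaving the spreadsheet writing out. The class name, the sample page and the two XPath expressions are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PageExtractor {
    // Evaluates each named XPath expression against one well-formed page
    // and returns the values in order, ready to be written as one row.
    static Map<String, String> extractRow(String xml, Map<String, String> namedXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> row = new LinkedHashMap<String, String>();
            for (Map.Entry<String, String> e : namedXPaths.entrySet()) {
                row.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return row;
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract data", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page and expressions, for illustration only.
        String page = "<html><body><table>"
                + "<tr><td>Name</td><td>Mary Murphy</td></tr>"
                + "<tr><td>Tel.</td><td>091 555 0100</td></tr>"
                + "</table></body></html>";
        Map<String, String> xpaths = new LinkedHashMap<String, String>();
        xpaths.put("name", "//td[text()='Name']/following-sibling::td[1]");
        xpaths.put("tel", "//td[text()='Tel.']/following-sibling::td[1]");
        System.out.println(extractRow(page, xpaths));
    }
}
```

Keeping the expressions in a named map means that when the site changes its layout, only the map needs updating, not the extraction loop.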

There are several issues and difficulties you can expect if you are extracting data from a website belonging to someone else.

A first issue of concern is whether you have a right to extract and use the data. Often, if the information is published by a public body, you may have that right, or you may need to look at any terms of use published on the website. Determining your legal rights and obligations isn’t an area we can help you with, sorry.

A second issue you may encounter is that once you’ve extracted the information, some data cleansing may be necessary, for example correcting formatting or removing duplicates. In some cases this can be done easily enough using the features of the spreadsheet. In more difficult cases it may be better done in software prior to creating the spreadsheet.

Finally, if you intend on continuing to extract data from a set of web pages on an on-going basis then you are likely to eventually run into difficulties when the owner of those web pages moves them or changes their format.

I hope this brief overview of extracting metadata from web pages into a spreadsheet may have been useful to you. If you’ve got some comments from your own experience of having done this, please do post them.

Won’t it all be so much easier when everybody is using microformats?

Avoiding the pain of the myfaces tabbed pane

Wednesday, June 28th, 2006

When designing my recent JSF application I tried to use JSF controls instead of pure html or javascript wherever possible to keep the code clean and concise. The application required a tabbed pane of some form for the main content and this led me to use the Apache MyFaces JSF tabbed pane component.

Unfortunately, this led to problems as it implicitly wraps your pane in a form. You can’t nest forms and in my app I needed two tabs each containing a form. I had to get the one form to act like two. As you’d imagine this leads to some nastiness regarding jsf validation because it’ll try to validate all elements in the entire form when you’ll only want it to worry about the ones on the particular tab submitted. I resorted to manual validation in my application.

Faced with extending my previous application to have more tabs, with more forms, I went in search of a better solution. After failing to find a solution to the jsf tabbed pane component problems, or an alternative component, I was spared from rolling my own by luckily finding this lovely javascript solution.

Just one simple import of the js script and a stylesheet and you can make a dynamic tabbed pane with html as simple as below:


<div class="tab-pane">
   <div class="tab-page">
      <h2 class="tab">my tab 1</h2>
      Tab 1 content goes here
   </div>
   <div class="tab-page">
      <h2 class="tab">my tab 2</h2>
      Tab 2 content goes here
   </div>
</div>

<t:panelTabbedPane>
    <t:panelTab label="my tab 1">
       Tab 1 content goes here
    </t:panelTab>
    <t:panelTab label="my tab 2">
       Tab 2 content goes here
    </t:panelTab>
</t:panelTabbedPane>

When you compare the pure html solution with the equivalent JSF mark-up it’s clear that it’s just as clean, but of course with the html solution you’ve the added bonus of being able to use a separate form (or jsf form) in each pane:


   <div class="tab-page">
      <h2 class="tab">my tab 1</h2>
        <h:form>
        ...

        ...
        </h:form>
   </div>