OBriens tower
Musings on software development, Linux and business

Archive for the 'useful tools' Category

Quantifying simplicity in code

Wednesday, May 14th, 2008

One of the key values here at Applepie is simplicity. As a member of Applepie I strive whenever possible to deliver lean, readable, manageable code that meets its requirements and is a pleasure for other developers to use. It’s not easy. To help ‘keep it simple’ we follow a number of practices, such as code peer reviews, which at their heart have the question: “Is this simple?” These practices lead to better code that we have qualified as manageable and elegant: code we feel is simple.

However, qualifying or feeling something to be simple is often not enough. We want hard facts; we want to quantify how well a body of code exhibits simplicity. One answer is to analyse the code base and generate metrics that can support us in our quest to simplify. One of the best metrics I’ve found for this is Cyclomatic Complexity.

Cyclomatic Complexity (CC) measures the complexity of your code by counting the number of independent paths through it. The number of paths is the CC number. The bigger the CC number, the more likely it is that the code is difficult to conceptualize, and the less likely it is that you can unit test it effectively. It’s a bit like navigating a road: a Y junction is fine, a crossroads is OK, but a four-way interchange is (well, for me anyway) reaching my cognitive limit.

As a rule of thumb, Cyclomatic Complexity Numbers are as follows:

  • Simple - 11 or less is optimal.
  • Manageable - 12-21 may be problematic and will require large unit tests.
  • Complex - 22-50 will be problematic and certainly will not be easy to test.
  • Forget it - over 50 cannot be tested and requires a self-actualized yogi chess grandmaster to understand.
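
To make the counting concrete, here is a minimal sketch. The class and method are made up for illustration, and it uses the common shortcut that a method’s CC is its number of decision points plus one. Tools differ on the details (some do not count `&&` and `||`), so treat the number as indicative rather than exact.

```java
class ShippingCost {

    // Decision points: if, &&, else-if, for, inner if = 5.
    // CC = 5 decision points + 1 = 6 (a tool that ignores && would say 5).
    static int cost(int weightKg, boolean express, int[] surcharges) {
        int cost = 5;
        if (weightKg > 20 && express) {   // +1 for the if, +1 for the &&
            cost += 30;
        } else if (weightKg > 20) {       // +1
            cost += 15;
        }
        for (int surcharge : surcharges) { // +1
            if (surcharge > 0) {           // +1
                cost += surcharge;
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        System.out.println(cost(25, true, new int[] {2, -1, 3}));
    }
}
```

Each extra branch adds one more path that a unit test has to cover, which is why the number tracks testing effort so closely.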

So how do you profile your code for CC? I’m going to focus on Java, though similar tools exist for C#, PHP, Ruby and Python. The tool I’ve used for Java is PMD (it works with the IDE of your choice). PMD generates many static metrics on your code base, including Cyclomatic Complexity.

To install PMD, follow the instructions in this PMD onJava article or, if you only have 5 minutes, here’s a quick install guide for Eclipse. Pop the exploded zip downloaded from PMD into your plugins folder. After you have installed the plugin, you need to activate the metrics for your project. Go to the project properties, select the Metrics option and then select the ‘enable metrics’ setting. That’s it! The metrics are then calculated and a metrics view is presented.

McCabe Cyclomatic Complexity (CC) is the PMD metric we are interested in. PMD generates these CC statistics for the entire project. It allows you to drill down a tree of statistics and quickly ascertain the highest CC for each package, class and method. It even highlights items over a threshold in red (the threshold defaults to 11, but is configurable). Now at last I have the CC count for every piece of code I work on.

These generated statistics give a great overview of where code needs to be simplified and (hopefully) help a bit more in answering “Is this simple?”, allowing me to back up my gut feel with some quantifiable statistics.

If you want to know more about Cyclomatic Complexity see here. If you’re interested in other metrics supporting simplicity see Operands on an operation.

Getting a thread dump from Tomcat running as a Windows service

Monday, April 14th, 2008

We’re supporting a Java application deployed on Tomcat, which is running as a Windows service. On Friday the logs showed that parts of the application were frequently timing out while trying to acquire a DB connection from a pool. We wanted to get a thread dump to see if any threads holding connections were deadlocked. If Tomcat had been started from a console this would have been straightforward. Unfortunately it wasn’t, and we didn’t have the option to restart it on the production server.

One useful tool is the free web start version of stack trace. We had no joy with this either, though: our remote desktop session was not under the account from which the service was started. Stack trace helpfully suggests using Start->Run->“mstsc /console” to start the remote desktop session in this case, but this would have terminated other sessions that were open to the server, and therefore wasn’t an option for us.

Cue a moment of inspiration from Rob, which resulted in a simple jsp that will output a thread dump. Note that your application must be running on at least Java 5.0 for this to work. Just make a simple jsp with the following snippet as the body of the page, and drop it in the web root of your application. Then fire up a browser, navigate to the jsp and view the dump without even having to restart Tomcat!


<%@ page import="java.util.Map" %>
<body>
<center><h1>Thread Dump</h1></center>
<pre>
<%
  // Build a thread dump from the stack trace of every live thread (needs Java 5+).
  StringBuffer sb = new StringBuffer();
  Map<Thread, StackTraceElement[]> st = Thread.getAllStackTraces();
  for (Map.Entry<Thread, StackTraceElement[]> e : st.entrySet()) {
    StackTraceElement[] el = e.getValue();
    Thread t = e.getKey();
    // Header line: "name" daemon prio=N Thread id=N STATE
    sb.append("\"").append(t.getName()).append("\" ");
    sb.append(t.isDaemon() ? "daemon" : "").append(" prio=").append(t.getPriority());
    sb.append(" Thread id=").append(t.getId()).append(" ").append(t.getState());
    sb.append("\n");
    // One indented line per stack frame
    for (StackTraceElement line : el) {
      sb.append("\t" + line + "\n");
    }
    sb.append("\n");
  }
%>
<%=sb.toString() %>
</pre>
</body>
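
Since the aim was to spot deadlocked threads, a companion sketch may be useful. It is not part of Rob’s jsp; the class name is made up, and it relies only on the JDK’s `ThreadMXBean`, which (like `Thread.getAllStackTraces()`) has been available since Java 5. It reports threads deadlocked on object monitors only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockReport {

    // Returns a text report of threads deadlocked on object monitors.
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when there are none
        if (ids == null) {
            return "No monitor deadlocks detected\n";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append('"').append(info.getThreadName()).append("\" waiting on ")
              .append(info.getLockName()).append(" held by \"")
              .append(info.getLockOwnerName()).append("\"\n");
            for (StackTraceElement line : info.getStackTrace()) {
                sb.append('\t').append(line).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

The same method body could be called from a jsp scriptlet, so the report is available from a browser just like the dump.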

Cheap and cheerful java object persistence using Lucene

Tuesday, March 18th, 2008

I took advantage of the St. Patrick’s long weekend to experiment with using Lucene as a simple Java object store. The context of the research was to determine whether it is feasible to create with Lucene a simple persistence layer to be used in a project currently holding an increasing number of disconnected Java objects in an in-memory map.

I came to consider Lucene as an object store having already investigated persistent maps and caching components such as jcs and ehcache. One of the main issues I encountered with these was that searching for objects based on some criteria other than the key required either indexing the sought objects at an application level, or putting up with a lot of I/O when iterating through a large volume of stored objects. I deemed hibernate to be an option, but avoided it primarily due to concerns about increasing the complexity of an already-complex-enough project.

While the practice of indexing Java objects with Lucene has been around for a while, the option of easily persisting the objects themselves in Lucene is newer. A recently added feature provides the ability to store fields containing binary content - perhaps a suitable place for storing Java objects? Grant Ingersoll, one of the committers on the Lucene project, recently blogged:

I even use it in things that 5 years ago I would never have thought I would use it for (object stores, etc.)

There are several features about my java objects which make them suitable for indexing and storing in lucene:

  • They already implement java.io.Serializable.
  • They are essentially data holders.
  • They are disconnected - they do not hold references to other objects which will also be in the repository.
  • They have get* methods which can be used for accessing most anything I will want to search on.
  • Each object already has a unique identifier
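
Because the objects are Serializable, each one can be reduced to a byte array, which is the kind of content a binary field can hold. Here is a minimal sketch of that round trip using only the JDK. It is an illustration, not the Lucos code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class BinaryFieldSketch {

    // Serialize an object into bytes that could be stored as binary content.
    static byte[] toBytes(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException("Could not serialize " + value, e);
        }
        return buffer.toByteArray();
    }

    // Rebuild an object from stored bytes; the result is a new copy.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not deserialize", e);
        }
    }
}
```

Anything read back this way is a fresh copy rather than the original instance, so equality has to come from `equals()` and not from object identity.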

The result of the weekend’s work was a single java class which implements persistence in lucene. I called it Lucos - Lucene object store. It is available for download here.

The basic functionality is to put/get an object in/out of the store in a manner similar to how an object is stored in a map. Here is an example:

Person fred = new Person("Fred Flinstone");
Lucos lucos = new Lucos();
lucos.put("fflinstone",fred);
Person x = (Person) lucos.get("fflinstone");
//NB: x is a COPY of fred
assertEquals(fred,x);

Putting an object in the store using the put(String id, Object value) method creates indexed fields for all of the no-arg get* methods on the value class. It also indexes the value class itself and all the classes it extends or interfaces it implements. Put changes are committed immediately to the index. Subsequent gets (or searches) reload the index (if necessary) to retrieve the latest changes.
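
The getter discovery described above might look roughly like the following sketch. It uses plain JDK reflection and is not taken from Lucos: the class name, the nested `Person` and the rule for deriving field names (getName() becomes "name") are assumptions made for illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

class GetterFields {

    // Hypothetical data holder, standing in for the Person used in the post.
    static class Person {
        private final String name;
        private final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public int getAge() { return age; }
    }

    // Collect field name -> value for every public no-arg get* method.
    static Map<String, String> fieldsOf(Object value) {
        Map<String, String> fields = new TreeMap<String, String>();
        for (Method m : value.getClass().getMethods()) {
            String name = m.getName();
            boolean isGetter = name.startsWith("get") && name.length() > 3
                    && !name.equals("getClass")
                    && m.getParameterTypes().length == 0
                    && m.getReturnType() != void.class
                    && !Modifier.isStatic(m.getModifiers());
            if (!isGetter) {
                continue;
            }
            try {
                Object result = m.invoke(value);
                if (result != null) {
                    // getName -> "name"
                    String field = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                    fields.put(field, result.toString());
                }
            } catch (ReflectiveOperationException e) {
                // A getter that fails simply contributes no field in this sketch.
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fieldsOf(new Person("Fred Flinstone", 42)));
    }
}
```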

To find all the instances of person in the repository:

EntryIterator it = lucos.findInstances(Person.class);
System.out.println("Found " + it.length + " persons");
while (it.hasNext()) {
    String id = it.getKey();
    Person person = (Person) it.getValue();
    ...
}

Providing search functionality was one of the features I required in order to overcome the issues already identified with searching a persistent map. One of the difficulties I encountered in doing this was that where fields were stored tokenized an exact match did not seem possible, and where stored untokenized, a partial match did not. To overcome this difficulty, I indexed fields in both tokenized and untokenized format, appending ‘.exact’ to the name of the untokenized field. Given that my Person has method String getName(), I can search my objects with any of these:

// find all persons named fred using a TermQuery
lucos.findInstances(Person.class, "name", "fred");
// find all persons named fred using lucene syntax query and the installed Analyzer
lucos.findInstances(Person.class, "name:fred");


// find all persons named Fred Flinstone using a TermQuery
lucos.findInstances(Person.class, "name.exact", "Fred Flinstone");
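
To show why indexing each value twice helps, here is a toy in-memory index written with plain JDK collections. It is only a sketch of the idea: it does not use Lucene, the class is hypothetical, and splitting on whitespace and lower-casing merely stands in for whatever a real Analyzer would do.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DualFieldIndex {

    // field name -> term -> ids of the objects containing that term
    private final Map<String, Map<String, Set<String>>> index =
            new HashMap<String, Map<String, Set<String>>>();

    // Index a value twice: tokenized under `field`, whole under `field.exact`.
    void add(String id, String field, String value) {
        for (String token : value.toLowerCase().split("\\s+")) {
            post(field, token, id);
        }
        post(field + ".exact", value, id);
    }

    private void post(String field, String term, String id) {
        index.computeIfAbsent(field, f -> new HashMap<String, Set<String>>())
             .computeIfAbsent(term, t -> new TreeSet<String>())
             .add(id);
    }

    // Look up a single term in a single field, like a term query would.
    Set<String> find(String field, String term) {
        Map<String, Set<String>> terms = index.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return Collections.emptySet();
        }
        return terms.get(term);
    }
}
```

With "Fred Flinstone" indexed under "name", a lookup of "fred" in "name" finds it, while the full string only matches in "name.exact". That is the trade-off the two fields are there to cover.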

If you want to use a query not parsed using the Lucos analyzer, parse the query first, then pass it to findInstances:

QueryParser parser =
new QueryParser("name.exact", new KeywordAnalyzer());
Query query = parser.parse("\"Fred Flinstone\"");
it = lucos.findInstances(Person.class,query);

Here’s how to create a Lucos instance which uses file persistent storage:

String folder = "{path to folder}";
Directory directory = FSDirectory.getDirectory(folder);
Lucos lucos = new Lucos(directory);

Finally, don’t forget to close() lucos when finished with it. This will release the lucene write lock:

lucos.close();

I still need to do volume and load testing with some production data to verify that the solution provides an acceptable memory/performance trade-off in reducing the size of my in-memory map. For the moment I’m satisfied that it is feasible to use Lucene as a Java object store. The solution adds minimal complexity to the project, introducing only one additional (Lucene) jar file. For a future iteration it might be worth considering adding a dependency on xstream, removing the requirement that objects placed into the repository implement the Serializable interface, and also possibly making them more generally searchable.

If you would like to add cheap and cheerful Java object persistence to your project, I hope that Lucos might provide you with some code for thought and perhaps the basis for a solution. The code and a test class for Lucos are available for download here.

Comments are welcome!

Toolbox for a Java craftsman

Thursday, July 26th, 2007

Back in the 80’s, the “olden days”, before I was a software guy I was a builder guy. I drove around in a rusty Ford Econoline van with scaffolds and ladders on the roof; tools and materials in the back of the van, techniques in the back of the head. Tools and materials and techniques. Most jobs required all three, and like all tradesmen I accumulated some of each over the years. In those days I would arrive on a project ready to hit the ground running - I brought the basics with me.

Now with more than a decade of industrial software engineering behind me, I’m pleased with the collection of tools and techniques I carry with me. I don’t mean the phone and the laptop. I mean tools like putty and firefox and scite and cvs, mstsc, thunderbird, skype, gaim, open office, gimp, password safe, igal and vi to name a few.

The contents of the toolbox change over time. I no longer use cygwin, and though emacs is still in there it doesn’t come out so often. Subversion is in there now, though I haven’t used it enough to make it my own.

Some tools are best not left behind. To any java project I always bring eclipse, ant and junit, and usually log4j.

Re-usable suitably-licensed open-source software components are a big part of my toolbox. Here are some of the most tried and trusted components that I have included in various java projects over the years and am happy to recommend:

  • Logging: log4j (Notice I’m mentioning it for the second time)
  • Working with xml: dom4j provides useful functionality.
  • Text indexing: lucene is an extremely well done component.
  • Templating: velocity is proven.
  • Scheduling: quartz is very robust and reliable.
  • Working with pdf: PDFbox is worthwhile. I’d like to try using pdfbox with velocity sometime, for templating pdf documents.
  • File identification: ffident provides mime type identification. I added some (what I thought were useful) new features which I sent to the author, but he doesn’t seem to have folded them in.
  • Http requests: http client is reliable.
  • Http file uploads: Commons file upload is helpful for handling files uploaded to your servlet or jsp page.
  • Cleaning up html: nekohtml is very useful if you want to take html pages from the wild and convert them to xml in order to run xpath expressions against them.
  • Working with excel spreadsheets: poi is handy for reading and writing them.
  • Charting: JFreeChart is very useful.
  • Embedded sql: hsqldb is a useful sql database written in java which can be embedded into your application to run in-memory or persisting to the file system.
  • Scripting: BeanShell, Rhino, and BSF if you can’t decide between them ;-)
  • JNDI: Commons naming provides jndi setup for your application using tomcat 4 style configuration. (I don’t know why this is not more readily available, if you do, please let me know!)

Maybe some of these tools will be useful to you. If you have some more tried and trusted components for java applications, maybe post a comment.

When I moved from the West of Canada to the West of Ireland in the early 90’s I left most of the tools behind me, the saws and the ladders and the compressors and the welder. I still have my old hammer. I’m a bit rusty on some of the techniques, but I’ll probably never forget how to dry a paint brush, taught to me by my mentor Lorenzo Quarenghi, son of Walter: after cleaning the brush in water or thinner (as appropriate), go outside wearing your old shoes or boots. With your heel on the ground and your toe pointing up, hold the handle of the brush and tap the metal edge on the top of your toe. The spray goes onto the ground and onto the sole of your shoe, the brush becomes clean and dry!

Avaeon Topoix - rapid and available .NET applications

Tuesday, July 24th, 2007

Galway is a hub of excitement this time of year. The Arts Festival is in full tilt in Galway now, and next week the races begin. Something for everyone!

And there is some excitement brewing in the software development community as well! Today I went to Avaeon’s office near the race track, where I had the privilege of a personal presentation of their newly released Topoix product, which provides a development framework and deployment model for forms-based web applications. Wow! This is really a piece of software to pique the attention of IT Managers who want better, stronger, faster development and deployment of .NET applications.

Topoix is a .NET based framework that provides a structure on which applications can be constructed and deployed. It encapsulates complex actions and presents the developer with simplified methods to perform them; this allows complex .NET applications to be developed without the developer requiring a full understanding of all the associated technical complexities. The main goal of Topoix is to reduce the time and effort involved in application development.

Unlike some other software components, Topoix is not simply a development tool or a deployment model. Rather it is an extension to the .NET framework with aspects covering both development and deployment in order to produce software which, when compared to standard .NET applications, is faster to develop, easier to maintain, and more highly available.

The development framework allows for a complete(!) separation of form design and business logic. Form validation rules are defined as metadata separately from the html forms to which they are applied. Unlike standard .NET applications, developers need not edit aspx pages. Instead they concentrate on the business logic of the application leaving it to the framework to apply this logic into the html forms which may have been provided by a separate design team.

The framework also provides a transactionable persistence model to bind the strongly typed form data with the SQL Server database. The application binds values to/from the database using the appropriate connection and the sql mappings provided by the developer.

But wait there’s more! The framework provides inherent support for referential integrity, allowing dropdown lists to be populated directly from the referenced tables. I was most impressed with the dynamic way the declarative business rules are applied. Using simple declarations, the developer can specify that a change to one field on a form can cause other fields to be required or hidden or pre-populated!

Two features are particularly interesting when deploying your Topoix application: automatic versioning and distributed session management.

Firstly the business rules encapsulated in metadata are automatically versioned. Updated versions can be released onto the production systems without first removing the previous versions. Without interrupting existing sessions, newer business rules can be rolled out. New sessions will receive the updated business rules while existing sessions will be able to complete using the older version of the rules.

Secondly, the application framework is of particular value when deploying web applications into environments where there is zero tolerance for workflow interruption. Topoix provides distributed session management such that if a node in the server farm goes down, the failover to another server is transparent with zero interruption to users, even those who were previously connected to the failed node.

Avaeon was established in 2001 and has a wealth of experience developing and deploying highly available web applications in regulated environments. Topoix has evolved over several years and the current product has been in production use for over a year, for applications available to literally thousands of online users. If you want to know more about it visit the website.

Congratulations and best of luck to Anthony and the team at Avaeon! I look forward to building an application with Topoix later this year.

Extracting metadata from html to excel

Monday, February 19th, 2007

We recently completed a small but important project involving extracting marketing data from a set of html pages into an excel spreadsheet. This is quite a simple process, and here I will describe why businesses sometimes need to do this, the approach we use, and some of the issues to expect if you do this.

Government bodies often publish information which can be very useful to businesses. But frequently, although the information is useful, the format is not. Manual extraction is possible, though tedious and time-consuming. Here are two examples.

A building supplier wants the contact details extracted from publicly available County Council Planning Applications. They use these for a targeted marketing campaign to great effect: people who are spending money building or renovating their homes will be spending a big part of that money on building supplies. The building supplier wants these customers. The planning applications, including contact details, are published on a website, but navigating through the website to determine which pages have been recently added or changed is difficult. Automation makes extracting the data feasible.

A service or product supplier wishes to target businesses in a particular sector. A government body maintains a database of companies operating in that sector, but again, this information is provided in a format where each company’s details are on a separate web page containing an overview and contact details. If the number of web pages extends beyond several dozen, then an automated extraction process is likely to be cheaper and more accurate than manual effort.

In general a set of web pages such as the planning list or the company list is generated by an application. All the pages share the same look and feel, and if you look into the ‘source’ of the web page from your web browser you may also notice that the various pages of the list follow patterns. These patterns can be used to extract the data. For example the contact name may come after “Name”, and the telephone number after “Tel.”. While it may be possible to use regular expressions to extract the data, we’ve found using XPath to be easier and more accurate.

Here is a process which works to extract data from a set of related web pages. This is sometimes called web scraping.

  1. Download the full set of web pages using wget
  2. Use NekoHTML to convert the set of web pages into well-formed xml documents.
  3. Using a small subset of the web pages, create a named set of XPath expressions which uniquely identify each piece of data. The Firefox XPath Checker is useful here.
  4. Run the XPath expressions over the full set of documents, using POI to place the results into a Microsoft Excel (or OpenOffice) spreadsheet, typically one row per document.
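
Steps 3 and 4 can be sketched with the XPath support built into the JDK. This is an illustration under assumptions rather than the code we used: the class name, the sample page, the contact details and the two XPath expressions are all made up, the input is taken to be already well-formed xml (the output of step 2), and the row is returned as a list of strings rather than written out with POI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathRowExtractor {

    // Evaluate each named XPath against one document, giving one spreadsheet row.
    static List<String> extractRow(String xml, Map<String, String> namedXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            List<String> row = new ArrayList<String>();
            for (String expression : namedXPaths.values()) {
                row.add(xpath.evaluate(expression, doc).trim());
            }
            return row;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract data", e);
        }
    }

    public static void main(String[] args) {
        // Made-up page: the value follows the cell holding its label.
        String xml = "<html><body><table>"
                + "<tr><td>Name</td><td>Mary Murphy</td></tr>"
                + "<tr><td>Tel.</td><td>091 555 0100</td></tr>"
                + "</table></body></html>";
        Map<String, String> xpaths = new LinkedHashMap<String, String>();
        xpaths.put("name", "//td[normalize-space(.)='Name']/following-sibling::td[1]");
        xpaths.put("tel", "//td[normalize-space(.)='Tel.']/following-sibling::td[1]");
        System.out.println(extractRow(xml, xpaths));
    }
}
```

Keeping the expressions in an ordered map means the column order of the spreadsheet is decided in one place, and a changed page layout only requires editing the expressions.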

There are several issues and difficulties you can expect if you are extracting data from a website belonging to someone else.

A first issue of concern is whether you have a right to extract and use the data. Often, if the information is published by a public body, you may have that right, but you may also need to look at any terms of use published on the website. Determining your legal rights and obligations isn’t an area we can help you with, sorry.

A second issue you may encounter is that once you’ve extracted the information, some data cleansing may be necessary, for example correcting formatting or removing duplicates. In some cases this can be done easily enough using the features of the spreadsheet. In more difficult cases it may be better done in software prior to creating the spreadsheet.
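
As a small illustration of cleansing in software, the sketch below tidies whitespace in each cell and drops rows that become identical as a result. The class name and the choice of rules are assumptions; real data usually needs rules specific to the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RowCleaner {

    // Trim and collapse whitespace in every cell, then drop duplicate rows,
    // keeping the first occurrence and the original order.
    static List<List<String>> clean(List<List<String>> rows) {
        Set<List<String>> unique = new LinkedHashSet<List<String>>();
        for (List<String> row : rows) {
            List<String> tidy = new ArrayList<String>();
            for (String cell : row) {
                tidy.add(cell.trim().replaceAll("\\s+", " "));
            }
            unique.add(tidy);
        }
        return new ArrayList<List<String>>(unique);
    }
}
```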

Finally, if you intend on continuing to extract data from a set of web pages on an on-going basis then you are likely to eventually run into difficulties when the owner of those web pages moves them or changes their format.

I hope this brief overview of extracting metadata from web pages into a spreadsheet may have been useful to you. If you’ve got some comments from your own experience of having done this, please do post them.

Won’t it all be so much easier when everybody is using microformats?