Cheap and cheerful java object persistence using Lucene
by Robert Fuller
I took advantage of the St. Patrick’s long weekend to experiment with using Lucene as a simple java object store. The aim was to determine whether Lucene could serve as a simple persistence layer for a project that currently holds a growing number of disconnected java objects in an in-memory map.
I came to consider Lucene as an object store after investigating persistent maps and caching components such as jcs and ehcache. One of the main issues I encountered with these was that searching for objects by any criterion other than the key required either indexing the sought objects at the application level, or putting up with a lot of I/O when iterating through a large volume of stored objects. I considered hibernate, but avoided it primarily due to concerns about increasing the complexity of an already-complex-enough project.
While the practice of indexing java objects with lucene has been around for a while, the option of easily persisting the objects themselves in lucene is newer. A recently added feature provides the ability to store fields containing binary content - perhaps a suitable place for storing java objects? Grant Ingersoll, one of the committers on the Lucene project, recently blogged:
I even use it in things that 5 years ago I would never have thought I would use it for (object stores, etc.)
There are several features about my java objects which make them suitable for indexing and storing in lucene:
- They already implement java.io.Serializable.
- They are essentially data holders.
- They are disconnected - they do not hold references to other objects which will also be in the repository.
- They have get* methods which can be used for accessing most anything I will want to search on.
- Each object already has a unique identifier.
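To make the list above concrete, here is a minimal sketch of a class with those properties. The two-argument constructor and the `id` field are assumptions made for illustration. The `Person` class used in the examples in this post is not shown and may look different.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical data holder: serializable, no references to other stored
// objects, public getters for everything worth searching on, and its own id.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String id;   // unique identifier, usable as the store key
    private final String name;

    Person(String id, String name) {
        this.id = id;
        this.name = name;
    }

    public String getId() { return id; }

    public String getName() { return name; }

    // Value-based equality, so a deserialized copy compares equal to the original.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Person)) return false;
        Person that = (Person) other;
        return Objects.equals(id, that.id) && Objects.equals(name, that.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name);
    }
}
```

Value-based `equals` matters here because anything read back from a store is a copy, so identity comparison would always fail.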
The result of the weekend’s work was a single java class which implements persistence in lucene. I called it Lucos - Lucene object store. It is available for download here.
The basic functionality is to put/get an object in/out of the store in a manner similar to how an object is stored in a map. Here is an example:
Person fred = new Person("Fred Flinstone");
Lucos lucos = new Lucos();
lucos.put("fflinstone",fred);
Person x = (Person) lucos.get("fflinstone");
//NB: x is a COPY of fred
assertEquals(fred,x);
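Why is x a copy rather than the same instance? Assuming the value is written with standard Java serialization into a binary field, every get rebuilds a new object from bytes. The sketch below shows only that round trip in plain Java. The class and method names are made up for illustration and are not taken from Lucos.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

class SerializationSketch {

    // Serialize a value to bytes, e.g. for storage in a binary field.
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Rebuild an object from bytes previously produced by toBytes.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> original = new ArrayList<>();
        original.add("Fred Flinstone");

        Object copy = fromBytes(toBytes(original));

        System.out.println(copy.equals(original)); // true: same content
        System.out.println(copy == original);      // false: a different instance
    }
}
```

This is also why the assertEquals above only passes if the stored class implements a value-based equals().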
Putting an object in the store using the put(String id, Object value) method creates indexed fields for all of the no-arg get* methods on the value class. It also indexes the value class itself and all the classes and interfaces it extends or implements. Put changes are committed immediately to the index. Subsequent gets (or searches) reload the index (if necessary) to retrieve the latest changes.
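For readers curious how get* methods and the class hierarchy can be discovered, here is a rough sketch using plain reflection. It is not the Lucos source: the class name, method names and the getName-to-"name" convention are assumptions, and the results land in ordinary collections rather than in Lucene fields.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class GetterFieldsSketch {

    // Hypothetical value class used only to exercise the sketch.
    static class Person implements java.io.Serializable {
        private final String name;
        Person(String name) { this.name = name; }
        public String getName() { return name; }
    }

    // Map each public no-arg get* method to "fieldName -> value as text".
    static Map<String, String> getterFields(Object value) throws ReflectiveOperationException {
        Map<String, String> fields = new TreeMap<>();
        for (Method m : value.getClass().getMethods()) {
            String n = m.getName();
            boolean isGetter = n.startsWith("get") && n.length() > 3
                    && m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && !n.equals("getClass");
            if (!isGetter) continue;
            Object result = m.invoke(value);
            if (result != null) {
                // getName -> name
                String field = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                fields.put(field, result.toString());
            }
        }
        return fields;
    }

    // Names of the class, its superclasses and every interface they implement.
    static Set<String> typeNames(Class<?> type) {
        Set<String> names = new LinkedHashSet<>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            names.add(c.getName());
            collectInterfaces(c, names);
        }
        return names;
    }

    private static void collectInterfaces(Class<?> c, Set<String> names) {
        for (Class<?> i : c.getInterfaces()) {
            // Recurse only into interfaces not seen before.
            if (names.add(i.getName())) collectInterfaces(i, names);
        }
    }
}
```

Indexing every name returned by typeNames is what would let a lookup by a superclass or interface find instances of its subtypes as well.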
To find all the instances of person in the repository:
EntryIterator it =
lucos.findInstances(Person.class);
System.out.println("Found "+it.length+" persons");
while(it.hasNext()){
String id = it.getKey();
Person person = (Person) it.getValue();
...
}
Providing search functionality was one of the features I required in order to overcome the issues already identified with searching a persistent map. One difficulty was that an exact match did not seem possible on fields indexed tokenized, while a partial match did not seem possible on fields indexed untokenized. To overcome this, I indexed each field in both tokenized and untokenized form, appending ‘.exact’ to the name of the untokenized field. Given that my Person has method String getName(), I can search my objects with any of these:
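The effect of indexing a value twice can be illustrated without Lucene at all. The toy in-memory index below is only a stand-in: it splits on whitespace and lower-cases tokens, which is a much cruder analysis than a real Lucene Analyzer performs, and every name in it is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class DualFieldIndexSketch {

    // "field:term" -> ids of the objects indexed under that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    void index(String id, String field, String value) {
        // Tokenized copy: one lower-cased term per whitespace-separated word.
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) add(field, token, id);
        }
        // Untokenized copy: the whole value as a single term.
        add(field + ".exact", value, id);
    }

    private void add(String field, String term, String id) {
        postings.computeIfAbsent(field + ":" + term, k -> new LinkedHashSet<>()).add(id);
    }

    // Single-term lookup: the term must match an indexed term exactly.
    Set<String> find(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }
}
```

After index("fflinstone", "name", "Fred Flinstone"), a lookup of "fred" in "name" finds the entry and so does "Fred Flinstone" in "name.exact", while "Fred Flinstone" in "name" and "fred" in "name.exact" both come back empty.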
// find all persons named fred using a TermQuery
lucos.findInstances(Person.class, "name", "fred");
// find all persons named fred using lucene syntax query and the installed Analyzer
lucos.findInstances(Person.class, "name:fred");
// find all persons named Fred Flinstone using a TermQuery
lucos.findInstances(Person.class, "name.exact", "Fred Flinstone");
If you want to use a query not parsed using the Lucos analyzer, parse the query first, then pass it to findInstances:
QueryParser parser =
new QueryParser("name.exact", new KeywordAnalyzer());
Query query = parser.parse("\"Fred Flinstone\"");
it = lucos.findInstances(Person.class,query);
Here’s how to create a Lucos instance which uses file persistent storage:
String folder = "{path to folder}";
Directory directory = FSDirectory.getDirectory(folder);
Lucos lucos = new Lucos(directory);
Finally, don’t forget to close() lucos when finished with it. This will release the lucene write lock:
lucos.close();
I still need to do volume and load testing with some production data to verify that the solution provides an acceptable memory/performance trade-off when reducing the size of my in-memory map. For the moment I’m satisfied that it is feasible to use Lucene as a java object store. The solution adds minimal complexity to the project, introducing only one additional (lucene) jar file. For a future iteration it might be worth considering adding a dependency on xstream, removing the requirement that objects placed into the repository implement the serializable interface, and also possibly making them more generally searchable.
If you would like to add cheap and cheerful java object persistence into your project, I hope that Lucos might provide you with some code for thought and perhaps the basis for a solution. The code and a test class for Lucos are available for download here.
Comments are welcome!
March 18th, 2008 at 16:25
Great posting! I was exploring the possibility of using Lucene as a simple data persistence engine instead of a file system/database today and hit your blog by coincidence.
A few questions:
1. how do you handle modifications? - e.g. what if I rename “Fred Flinstone” to “Fred FlinTstone”? In that case, I would want the index to replace all references to the old name with the new name.
2. Since the primary aim of Lucos is to be a caching engine, there are some fundamental differences with my aim (persistence). Nevertheless, if your get/set objects were *not* initially serializable, a more generic approach could have been to use introspection to convert all ‘bean’ objects into “Document” objects (using the ‘getters’) and vice-versa, using the ‘setters’? I suppose this approach may have crossed your mind? Any thoughts/comments on that?
March 18th, 2008 at 16:57
Thanks David,
In answer to your questions
Q. how do you handle modifications?
A. When the java object is changed, it is again put() into lucos which re-indexes that object replacing the original object. To perform a rename of all “Fred Flinstone” to “Fred FlinTstone” I would use findInstances, then for each entry found, set the name correctly and use lucos.put() to update the entry for that object.
Q. did I consider using introspection
A. In fact, since my objects implement serializable already, I did what was easiest. If the objects weren’t already serializable then I would consider using xstream to handle the serialization/deserialization.
I considered using apache beanutils to extract the field properties, which would no doubt produce a better result than simply taking the results of the get* methods, but didn’t add that to the first pass, which was sufficient to meet my requirements. Yes, in some cases beanutils could also be used to do what you suggest in recreating the object from the fields on the lucene document. For some kinds of java beans this might obviate the need to store the binary content.
March 19th, 2008 at 9:13
Thanks for your reply… indeed, I forgot about beanutils
While doing some more research, I found the Compass project which supports Object to Search Engine Mapping - http://www.compass-project.org/docs/2.0.0M2/reference/html/core-osem.html
There’s some overhead to it, but it supports marshalling and un-marshalling on Resource objects, which are essentially a wrapper around Lucene Documents. There are a lot of other things that it can do, but the core seems to have sufficient functionality for what we’re trying to do.