Rabbits down holes

By Andrew Pollack on 05/17/2005 at 12:35 PM EDT

Sometimes you chase the rabbit too far down the hole. In my never-ending quest for high performance, I built an in-memory index that works very much like a sorted view column: it's very quick to find an entry, and it even supports partial matches. In fact, the constructor accepts a regular expression used to split each entry into parts, which sets the granularity of partial matches. We tend to think in terms of matching on any letter, but in my case splitting on periods was good enough and produces a far smaller index.
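
For readers who want something concrete, here is a minimal sketch of how such an index might look in Java. It is not the code described above, just one plausible shape: a tree with one node per segment, sorted children at each level, and a constructor that takes the splitting regular expression. The class name `SegmentIndex`, its method names, and the sample host names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical sketch of a segment-granular in-memory index; not the author's code. */
final class SegmentIndex {
    /** One node per segment; a TreeMap keeps the children in sorted order. */
    private static final class Node {
        final Map<String, Node> children = new TreeMap<>();
        String entry; // non-null when a complete entry ends at this node
    }

    private final Pattern splitter;
    private final Node root = new Node();

    /** @param splitRegex regular expression that splits an entry into segments, e.g. "\\." */
    SegmentIndex(String splitRegex) {
        this.splitter = Pattern.compile(splitRegex);
    }

    /** Adds an entry, creating one node per segment along the way. */
    void add(String entry) {
        Node node = root;
        for (String segment : splitter.split(entry)) {
            node = node.children.computeIfAbsent(segment, k -> new Node());
        }
        node.entry = entry;
    }

    /** Exact lookup: true only if a complete entry ends where the key ends. */
    boolean contains(String entry) {
        Node node = descend(entry);
        return node != null && node.entry != null;
    }

    /** Partial lookup: every entry whose leading segments equal those of the key. */
    List<String> findByPrefix(String partial) {
        List<String> results = new ArrayList<>();
        Node node = descend(partial);
        if (node != null) {
            collect(node, results);
        }
        return results;
    }

    /** Walks down one node per segment; returns null if any segment is missing. */
    private Node descend(String key) {
        Node node = root;
        for (String segment : splitter.split(key)) {
            node = node.children.get(segment);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    /** Depth-first walk that emits entries in sorted segment order. */
    private void collect(Node node, List<String> out) {
        if (node.entry != null) {
            out.add(node.entry);
        }
        for (Node child : node.children.values()) {
            collect(child, out);
        }
    }
}
```

With a splitter of `"\\."` and the entries `mail.example.com`, `mail.example.org` and `web.example.com`, a call to `findByPrefix("mail.example")` returns the two `mail` entries, while `findByPrefix("mail.ex")` returns nothing, because matching happens on whole segments rather than on individual letters. That is the trade-off in this sketch: coarser partial matches in exchange for far fewer nodes.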

As you can guess, it won't scale. Oh, sure, it scales enough for anything I can currently see putting in it, but go not too much further out and you run out of heap space. So it's got to go off to disk. Now I find myself halfway through the paper process of creating a fast on-disk structure to match the index, and I realize it's just not worth it. It's a great tool, but it's tech that needs to be shelved so I can finish this project. Maybe it will reappear, reincarnated as my own full-text index or something.

Jakarta Commons? By Stephan H. Wissel on 05/18/2005 at 08:57 AM EDT
Hi Andrew,
Did you have a look at Apache's Jakarta Commons? They have all kinds of stuff
on caching, collection and things we tend to reinvent.
:-) stw
Unfortunately, the license prohibits it. By Andrew Pollack on 05/18/2005 at 07:15 PM EDT
Most of that stuff is released under GNU, which means that if you use it in
anything you do, anything related must also be given away. As much of what I
do is commercial, I can't use that stuff.
Typically it's Apache 2.0 By Stephan H. Wissel on 05/19/2005 at 07:48 AM EDT
Hi Andrew,
On Apache projects you have an Apache 2.0 license most of the time, the same
as in Tomcat. So it should be possible to reuse it. Even M$ uses a .NET port of
Lucene in their desktop search.
Have a second look. E.g. the collections are under the Apache license. Which
parts that you liked were GPL?
;-) stw
I meant ALL of the commons is Apache 2.0 lic... By Stephan H. Wissel on 05/19/2005 at 07:49 AM EDT

