This document started out a quick blog entry, about two and a half hours ago. Its become one of the most detailed performance explanations I’ve ever put down in writing. If you have the patience to sit and read it, I’d appreciate any feedback you have. Its not really very language specific, though Java is used in cases where an example makes sense.
The thing about performance programming work is that the better you do the more you find that needs doing.
Today when I went to do a big initial data load, the results didn’t match what I’d seen in testing. To find out what was going on, I started the server in Eclipse using the built in debugger, and when it got good and bogged down I paused it -- all of it, all the threads. One by one I walked through where each thread was, and what it was waiting for. The stopping point was code that must synchronize on a specific object – an object which is static across all instances of the cache.
Multi-threading processing is like a bunch of people trying to talk.
Whether you use a cache or not, at some point all your individual threads need to interact with the back end data. Just like a panel of experts all competing for the microphone at say, Gurupalooza. In this case, the microphone is Domino, and the experts are server threads. Since you have dozens of threads using the same data you have to have some way to prevent more than one thread from working on the same back end document at the same time. The problem is exacerbated when the index is new, when doing a data pre-load of data, or someone asks a good question, since nearly every interaction will generate an addition or update to the index.
There’s only one ‘Talking Stick’
To avoid these kinds of conflicts as each thread tries to populate the index with similar data, you create some object which is “static” – which means a single instance of the object is the same no matter how many instantiations of its containing class are in use. You use the object as a semaphore. In Java, this is done with the ‘synchronized’ keyword. Essentially, you’re saying “here’s a block of code, but you can’t execute this block until you have sole ownership of the object you’re synchronized on. It’s like a talking stick. Everyone may have something to say, but only the one with the stick gets to talk. What makes it really useful, is that you can have many blocks of code all wrapped in “synchronized” blocks, which all synchronize on the same object. That means you can protect the resource (the microphone) by making sure that only one block of code (expert) can access it at a time.
How you do stop someone from hogging the microphone?
What I mean is, how do you prevent this choking point from bringing the whole mess to a halt? There are several answers, of course. The first is to make sure you’re not making each process wait for things it doesn’t really need to wait for. To continue the metaphor, get a separate talking stick for each conversation. In this case, that means using a different synchronization object for each document in the database you need to protect. That’s easily said, by not so easily done. First, because the very act of creating or locking onto the right synchronization object takes long enough that more than one thread can try to do it at the same time. Back to the starting block, right? That brings us to the next thing we have to do. We have to reduce the length of time these blocking events actually take to do the critical part of their work.
Getting the right microphones to the right experts
So there does have to be one line, but we’ll make it as short as possible. We’re going to create a collection of objects to synchronize on which represent the documents, but which aren’t actually the documents themselves. That’s because getting the documents is expensive. It takes a lot of time.
Enough with the metaphor, what does the code do?
When a thread needs access to a document from the index, it calls a method to get that document’s “lock” first. We’ll call that method “getDocLock()”. That method grabs the main lock for all the document locks so that it can only pass back one document lock at a time. Once it has the main lock, it checks its cache to see if a lock already exists for the specific document in question. If one exists, it is returned by the method. If not, one is created – even though there is no back end document associated with it yet. We’re creating an object which is the key to the document, not the document itself. Once that key is created – a very fast thing – and returned by the method the main lock can be released, and the rest of the threads can come in and get their own locks. We’ve protected the index because only one thread has “permission” to create or update that particular back end document, but we’ve reduced the amount of time the entire cache is unavailable.
There’s a lot more complexity to do this practically of course
There are some things to consider when you go to do this, though. First, you need to make sure that any call for the same document will always produce the same lock, even though that document doesn’t yet exist. The index key to the lock must exactly match the one and only index key you use to access the document itself then. Second, you have to remember that you’re synchronizing on objects, not variable names or the values that object contain. That means the lock object becomes invalid if it goes away from the cache. You need either a check-in/check-out process for lock objects, or your need to never remove a lock object from the cache once it’s created. That means your cache can grow to be quite large. Better to create a method called “releaseDocLock()” that tells the cache it can destroy the lock object.
Even more performance gains can be had
Particularly if you’ll be hitting the same back end documents frequently and making a lot of changes – remember, you only need the lock when you’re doing an update, not a read – you’ll start to notice that having more than one thread waiting for the same document lock can be slow. To avoid this, do as little as possible while you need the lock. One way of accomplishing this is to virtualize the document. Once read into memory, you stuff the data it has on it into a cache of its own. From then on, once you have a document’s lock you don’t go get the document from the database, you get it from memory. If it doesn’t exist, then get it from the database and put it in memory. Work on the data there, still under lock, and FLAG the in memory representation of the document as “dirty” so it gets written to the database – then drop the lock. Let a background process write the document back to the disk. It won’t be changing now, and if another thread gets the same lock, it will still get the most updated data because it gets it from memory, not the database itself. You’ve taken the most expensive part of the process – the “document write” – out of the locked part of the process. Another process which runs as a background cleanup task will see the flag. When it does, it will first load the back end document from the database (or create it if need be) and THEN it will attempt to get a lock on the in memory version of the data. When it gets that lock, it quickly updates the back end document, resets the dirty flag to ‘clean’ and immediately releases the lock BEFORE then writing the back end document to disk. Neither the read nor the write took place under a locked condition, and if an update does happen while the back end document is being written, it will just wait until the next time the cleanup thread comes around and does it again.
The same techniques are used for lots of things
Almost any time you need multithreaded access to a single data set, some kind of system like this is in place. It’s a fairly standard data management technique, but normally you don’t see it. It’s behind the scenes being handled by the ODBC software, or the server itself depending on what you’re doing. By emulating the process in our code, we’ve created a more specific and more granular locking mechanism. We could do this with the database tools – for example with Domino’s new check-in check-out feature for documents, or if we’re talking about a relation database we could lock records and the database would handle things to make sure only one lock at a time was issued. But this is faster, because we’re doing all of it in memory. We’ve moved the disk reads and writes, the network traffic, and the servers themselves outside our locking mechanism. We can even get fancy and start managing how long we keep in memory data around when its not being used.
So what kind of performance am I getting?
Well, my process is to accept a document unid and database replica id from the client to the server, then get the document analyze the mail header including reverse lookups, decode the text mime body parts and find any url targets, email addresses, or U.S. style phone numbers and analyze them to determine where we’ve seen them before and how frequently. This includes DNS resolution on each discovered ‘target’ and recursive comparison to other sites in the same address, the same subnet, or with the same domain.
So far, I’ve achieved a performance level of more than one thousand documents processed per minute in worst case tests on a single processor machine running at only 26% processor capacity, sustained for hours at a time. To get there, I’m using every trick I know, and quite a few I’ve figured out for this project. Self adjusting thread pool and queue sizes, built in lookup caches, network socket timeout adjustments, and burning cinnamon incense. I’ve done everything that doesn’t require a live chicken and a bowl.
And you thought I was just an ADMIN guy....
PS: Special thanks to Kathy Sierra & Bert Bates who wrote the book "Head First Java". I read this a little over 18 months ago on the suggestion of Tom Duff, and much of my understanding of Java's thread management comes directly from those pages. I still use it to refresh my thinking on core Java concepts like when objects equal other objects, simply share data that looks the same, or are in fact the same object with different things pointing to them. Kathy's and Bert's book is, to this day, just about the only "textbook" I've ever read front to back -- more than half in the first sitting.
High Performance Threadsafe Access to Data