Update: For those following this work and using the "ReadViewEntries" method described at the end: by moving to faster processors (this article was originally written in 2003) and doing all comparisons and calculations in memory, we've been able to reduce the total runtime to right around 30 minutes, even as the number of rows of Oracle data has grown to well over 6 million.
Have you ever had to compare thousands, even millions of records to see which need to be updated? That was just the case for me when reviewing a process that's been in place for a very long time. I found a way to save HOURS of processing. The example talks about relational data, but it works really nicely for Notes -> Notes comparisons as well.
In this case, we get a dump of 2.6 million records to compare against the 60,000 or so Notes documents, and we have to check for changes and make updates as necessary. The server is a two-processor 300 MHz machine with an old RAID system. No race car.
The data we get is extremely denormalized SQL result set data. For example, the user "X" had 360 rows of data in the Oracle view that we have to read, and he's just one person. Why so many? It has to do with the way that view is created for us. Normally when you join tables, the number of rows is equal to the number of rows in the table contributing the most rows to the result set. In this case, that would be 5 rows. Something about the way the view is created for us, however, causes it to be a MULTIPLE of the number of values in some columns times the number of values in others. In this user's case, where he has multiple values in multiple columns, it comes to 360 total rows. As I understand it, they join many tables together from their system to give us this data, but the way they do the joins, each table join multiplies the values. If I had to guess, they're doing it sequentially, one table at a time joined to the existing result, rather than as a single big join. That may be legitimate given the total size of the job -- a single big multi-table join may be beyond the capacity of their system. In any case, it isn't an option for me to go and examine the code that creates the view.
That said, we need to check EVERY ROW (2.6 million of them) to see if there has been a data change compared to what we have stored on our documents for that user. Checking 2.6 million rows times 21 columns is more than 54 million comparisons! It was taking us 12+ hours at one point, using code that was considered very fast by other industry experts.
It turns out, most of the processing we were doing was spent loading each Notes document into memory from the database, loading its values into working variables, and then doing the comparisons. Most of that is wasted -- if we could only know which documents we didn't need to read, we could skip them. Now, one thing to always keep in mind is that a view, in Notes, is also a note. That's always caused us trouble, but it can also be a good thing. What if we could just read that one note, and know which of the documents referenced on it had the same data as we were getting from the Oracle side?
What I've done is take all the data we get for a single user in Oracle and combine it into a single, really long string of text. If I could add that string to the view, I could just check it -- BUT -- the string is too long. Storing it all like that would do no good: the view would be so large that reading it would take a very long time, and indexing on that string would be a nightmare. The answer is a HASH algorithm.
A hash is a mathematical function that, when applied to a string of text, produces a short fixed-length value which doesn't contain the actual text. The "uniqueness" (or rather, the unlikelihood that two different strings produce the same result) is a function of the length of the hash key the algorithm produces. Well, built into Notes is a really good 32-byte hash algorithm -- the one used to store passwords in an unreadable way. So I decided to try reducing all the data for a single user to a short (32-byte) hash key, so I could check that to determine whether the data had changed.
For simplicity, instead of writing my own -- or looking one up and copying it from the web -- I opted to use the one Lotus builds in. That's cool because it's well tested, and I know that any problems I have aren't the result of my own faulty math. They call this magic hash function "@Password". Yep, it's that simple. Take your long string, feed it to @Password() and get back a short -- but powerfully unique -- string. In LotusScript you can do this with the Evaluate() function. (TIP: Make sure you strip any " quotation marks " that may be in your string when you create the formula, so that you don't inadvertently terminate the string and make the formula invalid.)
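Here's a minimal LotusScript sketch of that trick. The function name and the "~" separator are mine, not from the production agent, and the Replace() call assumes Notes/Domino 6 or later:

```lotusscript
' Reduce an array of column values for one user to a single @Password hash.
Function HashOfValues(values As Variant) As String
	Dim longString As String
	Dim result As Variant
	Dim i As Integer
	For i = LBound(values) To UBound(values)
		longString = longString & "~" & CStr(values(i))
	Next
	' Strip quotation marks so they can't terminate the formula string
	' we're about to build.
	longString = Replace(longString, |"|, "")
	' Evaluate() runs a formula; @Password returns the one-way hash.
	result = Evaluate(|@Password("| & longString & |")|)
	HashOfValues = result(0)
End Function
```

On a pre-6 release you'd replace the Replace() call with a simple character loop, but the idea is the same: one short string stands in for all of a user's column data.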
Now, instead of a really long string of concatenated text to compare, I have only this hash key. It's only 32 bytes (letters) long, but it represents an effectively unique value for all the data for each user in the system. The likelihood of two different strings having the same 32-byte hash is vanishingly small. (Rich Schwartz may jump in here and give us the actual probability.) I take that value and store it with the user record. Next time through, instead of loading the user record, I just find that user's view entry (the view is indexed on the user's unique key) and grab the hashed value from the view without having to load the document. I compare the stored hash value with the new one generated from the Oracle data, and if they are the same, I know nothing has changed -- even though I never compared the actual values themselves! That means I can ignore that data and go on to the next set of records.
To take things a step further, I now pre-read the view by way of NotesViewEntry objects into a LotusScript List, where the list tag is the user's unique key and the data is the hash. As a result, once I've read the view once, I never have to go back to it during the entire run of the agent.
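A sketch of that pre-read, assuming a hidden lookup view sorted on the user key, with the key in the first column and the stored hash in the second (view name and column positions are illustrative):

```lotusscript
' Pre-read the lookup view once into a LotusScript List so the rest of
' the run never touches the view (or any document) again.
Dim session As New NotesSession
Dim db As NotesDatabase
Dim view As NotesView
Dim vc As NotesViewEntryCollection
Dim entry As NotesViewEntry
Dim storedHash List As String

Set db = session.CurrentDatabase
Set view = db.GetView("(LookupByUserKey)")
view.AutoUpdate = False                ' don't re-index while we read
Set vc = view.AllEntries
Set entry = vc.GetFirstEntry()
Do Until entry Is Nothing
	' List tag = user's unique key, value = stored hash.
	storedHash(entry.ColumnValues(0)) = entry.ColumnValues(1)
	Set entry = vc.GetNextEntry(entry)
Loop

' Later, for each user key and freshly computed hash from the Oracle data:
If IsElement(storedHash(userKey)) Then
	If storedHash(userKey) = newHash Then
		' Nothing changed -- skip this user's document entirely.
	End If
End If
```

The one-time cost of walking the view is paid back on every user after that: each check is an in-memory List lookup instead of a document load.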
We've cut over to the new method, and cut processing time down from 12 hours to around 3.5 hours.
This can work even better when you're comparing two sets of Notes documents. Store the hashed key for each document on that document, and also store the hashed key of its counterpart. Now each document has all the data necessary to compare with the other, and it can all be done by reading view entries, never having to load a document into memory for no reason. Have to check or change only certain fields? Fine -- just build your hash key out of those fields.
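For the Notes-to-Notes case, the stamping side might look like this. The field names and the "DataHash" item are examples I've made up; hash only the fields you actually care about:

```lotusscript
' Stamp a document with a hash built from just the fields that matter.
' Put DataHash in a view column and both sides can be compared entirely
' from view entries.
Sub StampHash(doc As NotesDocument)
	Dim s As String
	Dim v As Variant
	Dim result As Variant
	Dim fieldNames(2) As String
	Dim i As Integer
	fieldNames(0) = "CustomerName"
	fieldNames(1) = "Address"
	fieldNames(2) = "Phone"
	For i = 0 To 2
		v = doc.GetItemValue(fieldNames(i))
		s = s & "~" & CStr(v(0))
	Next
	s = Replace(s, |"|, "")   ' quotes would break the formula below
	result = Evaluate(|@Password("| & s & |")|)
	Call doc.ReplaceItemValue("DataHash", result(0))
	Call doc.Save(True, False)
End Sub
```

A document whose DataHash matches its counterpart's stored hash needs no further attention; only mismatches ever get opened.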