Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

A faster way to update data!

By Andrew Pollack on 08/22/2003 at 04:07 PM EDT

Update: For those following this work: by using the "ReadViewEntries" approach described at the end, by moving to faster processors (this article was originally written in 2003), and by doing all comparisons and calculations in memory, we've been able to reduce the total runtime to right around 30 minutes, even as the number of rows of Oracle data has increased to well over 6 million.

Have you ever had to compare thousands, even millions of records to see which need to be updated? That was just the case for me when reviewing a process that's been in place for a very long time. I found a way to save HOURS of processing. The example talks about relational data, but it works really nicely for Notes -> Notes comparisons as well.

In this case, we get a dump of 2.6 million records to compare with the 60,000 or so Notes documents, and have to check for changes and updates as necessary. The server is a two-processor 300MHz machine with an old RAID system. No race car.

The data we get is extremely denormalized SQL result set data. For example, the user "X" had 360 rows of data in the Oracle view that we have to read, and he's just one person. Why so many? It has to do with the way that view is created for us. Normally when you join tables, the number of rows is equal to the number of rows in the table contributing the most rows to the result set. In this case, that would be 5 rows. Something about the way the view is created for us, however, causes it to be a MULTIPLE of the number of values in some columns times the number of values in others. In this user's case, where he has multiple values in multiple columns, it comes to 360 total rows. As I understand it, they join many tables together from their system to give us this data, but the way they do the joins, each table join multiplies the values. If I had to guess, they're doing it sequentially, one table at a time joined to the existing result rather than as a single big join. This may be legitimate due to the total size of the job -- a single big multi-table join may be outside the capacity of their system. In any case, it isn't an option for me to go and examine the code that creates the table.
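(To put made-up numbers on it: if one joined table contributes 5 rows for this user, a second column has 8 distinct values, and a third has 9, a multiplicative join hands you 5 x 8 x 9 = 360 rows for that one user instead of just 5.)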

That said, we need to check EVERY ROW (2.6 million of them) to see if there has been a data change compared to what we have stored on our documents for that user. Checking 2.6 million rows times 21 columns is nearly 55 million comparisons! It was taking us 12+ hours at one point, using code that was considered very fast by other industry experts.

It turns out, most of the processing time was taken up loading the Notes documents into memory from the database, loading the values on them into working variables, and then doing the comparisons. Most of that is wasted -- if we could only know which ones we didn't need to read, we could skip them. Now, one of the things to always keep in mind is that a view, in Notes, is also a note. That's always caused us trouble, but it can also be a good thing. What if we could just read that one note, and know which documents referenced on it had the same data as we were getting from the Oracle side?

What I've done is take all the data we get for a single user in Oracle, and combine it into a single really long string of text. If I could add that string to the view, I could just check it -- BUT -- the string is too long. You couldn't store it all like that and do any good. The view would be so large, reading it would take a very long time, and indexing on that string would be a nightmare. The answer is a HASH algorithm.
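Before getting to the hash itself: building that long string is nothing fancy. However the rows arrive (ODBC, a flat file dump, whatever), you just append each row's column values onto a running string for that user. A rough sketch -- the sub name and the "~" separator are just illustrative:

    ' Accumulate one long string per user as the Oracle rows stream in.
    ' rowValues is a Variant holding the column values of a single row.
    Sub AppendRowToUserString(rawStrings List As String, userKey As String, rowValues As Variant)
        If Not IsElement(rawStrings(userKey)) Then rawStrings(userKey) = ""
        ForAll colValue In rowValues
            ' A separator keeps adjacent values from running together
            rawStrings(userKey) = rawStrings(userKey) & CStr(colValue) & "~"
        End ForAll
    End Sub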

A hash is a mathematical formula that, when applied to a string of text, produces an effectively unique value which doesn't contain the actual text. The "uniqueness" (or likelihood that two strings could produce the same result) is a function of the length of the hash key the algorithm produces. Well, built into Notes is a really great 32-byte hash algorithm -- the one used to store passwords in an unreadable way. So I decided to try reducing all the data for a single user to a short (32-byte) hash key, so I could check that to determine if the data had changed.

For simplicity, instead of writing my own -- or looking one up and copying it from the web -- I opted to use the one Lotus builds in. That's cool because it's well tested and I know that any problems I have aren't a result of my own faulty math. They call this magic hash function "@Password". Yep, it's that simple. Take your long string, feed it to @Password() and get back a short -- but powerfully unique -- string. In LotusScript you can do this with the Evaluate() function. (TIP: Make sure you strip any quotation marks that may be in your string when you create the formula so that you don't inadvertently terminate the string and cause the formula to be invalid.)
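Here's roughly what that looks like in LotusScript. This is a minimal sketch: the function name is mine, and the Replace() call assumes Notes/Domino 6 or later (on R5 you'd strip the quotes with an InStr loop instead).

    Function HashOfUserData(longText As String) As String
        ' Strip any double quotes so they can't terminate the string
        ' literal inside the formula and make it invalid
        Dim safeText As String
        safeText = Replace(longText, |"|, "")
        ' Feed the cleaned-up string to @Password and get back the short hash
        Dim result As Variant
        result = Evaluate(|@Password("| & safeText & |")|)
        HashOfUserData = result(0)
    End Function

The one rule is to build the input string exactly the same way on every run -- same fields, same order, same separator -- so that unchanged data always produces the same hash.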

Now, instead of a really long string of concatenated text to compare, I have only this hash key. It's only 32 bytes (letters) long, but it represents a unique value for all the data for each user in the system. The likelihood of two different strings having the same 32-byte hash is vanishingly small. (Rich Schwartz may jump in here and give us the actual probability.) I take that value and store it with the user record. Now, next time through, instead of loading the user record, I just find that user's view entry (which is indexed on the user's unique key) and grab that hashed value from the view without having to load the document. I compare the stored hash value with the new one generated from the Oracle data, and if they are the same, I know nothing has changed -- even though I never compared the actual values themselves! It means I can ignore that data and go on to the next set of records.
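In code, the per-user check ends up looking something like this sketch. It assumes a lookup view sorted on the user's unique key with the stored hash in its second column, and the "DataHash" field name is made up:

    Sub CheckOneUser(luView As NotesView, userKey As String, newHash As String)
        Dim entry As NotesViewEntry
        Dim doc As NotesDocument
        ' Find the user's view entry by key -- no document is loaded yet
        Set entry = luView.GetEntryByKey(userKey, True)
        If entry Is Nothing Then
            ' No document for this user yet -- create one here
        ElseIf entry.ColumnValues(1) <> newHash Then
            ' Stored hash differs, so only now do we open the actual document
            Set doc = entry.Document
            ' ... update the changed fields ...
            Call doc.ReplaceItemValue("DataHash", newHash)
            Call doc.Save(True, False)
        End If
        ' If the hashes match, we never touch the document at all
    End Sub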

To take things a step further, I now pre-read the view by way of the NotesViewEntry objects into a LotusScript "List" where the list tag is the user's unique key and the data is the hash. As a result, once I've read the view once, I never have to go back to it during the entire run of the agent.
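The pre-read itself is just one walk down the view entries. Roughly like this -- the view name and the key-in-column-one, hash-in-column-two layout are assumptions:

    Dim session As New NotesSession
    Dim db As NotesDatabase
    Dim luView As NotesView
    Dim col As NotesViewEntryCollection
    Dim entry As NotesViewEntry
    Dim hashCache List As String

    Set db = session.CurrentDatabase
    Set luView = db.GetView("(UserHashLookup)")    ' illustrative view name

    ' One pass through the view entries -- no documents are opened
    Set col = luView.AllEntries
    Set entry = col.GetFirstEntry
    Do Until entry Is Nothing
        hashCache(CStr(entry.ColumnValues(0))) = CStr(entry.ColumnValues(1))
        Set entry = col.GetNextEntry(entry)
    Loop

    ' From here on, each user coming out of the Oracle data is checked
    ' against the in-memory List instead of the view:
    '   If Not IsElement(hashCache(userKey)) Then ... new user
    '   ElseIf hashCache(userKey) <> newHash Then ... changed user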

We've cut over to the new method, and cut processing time down from 12 hours to around 3.5 hours.

This can work even better when you're comparing two sets of Notes documents. Store the hashed key for each document on that document, and also store the hashed key of its duplicate. Now each document has all the data necessary to compare with the other, and it can all be done by reading view entries, without ever having to load a document into memory unnecessarily. Have to check or change only certain fields? Fine, just make your hash key out of those fields.
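One way to sketch that two-sided comparison (the view names and the column layout -- shared key in the first column, the document's stored hash in the second -- are made up):

    Dim session As New NotesSession
    Dim db As NotesDatabase
    Dim sourceView As NotesView
    Dim targetView As NotesView
    Dim col As NotesViewEntryCollection
    Dim entry As NotesViewEntry
    Dim sourceHashes List As String
    Dim key As String

    Set db = session.CurrentDatabase
    Set sourceView = db.GetView("(SourceHashLookup)")    ' illustrative names
    Set targetView = db.GetView("(TargetHashLookup)")

    ' Pass 1: cache key -> hash for every document on the source side
    Set col = sourceView.AllEntries
    Set entry = col.GetFirstEntry
    Do Until entry Is Nothing
        sourceHashes(CStr(entry.ColumnValues(0))) = CStr(entry.ColumnValues(1))
        Set entry = col.GetNextEntry(entry)
    Loop

    ' Pass 2: walk the other side's entries and compare hashes only
    Set col = targetView.AllEntries
    Set entry = col.GetFirstEntry
    Do Until entry Is Nothing
        key = CStr(entry.ColumnValues(0))
        If Not IsElement(sourceHashes(key)) Then
            ' this document exists only on the target side
        ElseIf sourceHashes(key) <> CStr(entry.ColumnValues(1)) Then
            ' hashes differ -- only now open the two documents to reconcile them
        End If
        Set entry = col.GetNextEntry(entry)
    Loop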

Cool?


Comments:

brilliant! By Anonymous on 06/23/2004 at 11:21 AM EDT
re: A faster way to update data! By Devin Olson on 06/23/2004 at 11:21 AM EDT
"Sometimes the spark of brilliance can only be found in the mines of
misery".....Sorry, my muse was waxing poetic again.

This is a GREAT solution you have come up with. I'm thinking of implementing
something similar at work for a pretty nasty project. Ever hear of a product
called "Training Server"? We are integrating it with our NAB -- 10K plus users --
promises to be a fun one ;->

-Devin.
re: re: A faster way to update data! By Andrew Pollack on 06/23/2004 at 11:21 AM EDT
Cool -- I hope it works for you. I have code laying around to do most of that
work. Take a look at the "Tips N' Tricks for Admins" presentation on my
"business" home page. http://www.thenorth.com

some comments By Bernard Devlin on 07/16/2004 at 04:51 AM EDT
Hi Andrew, a great idea.

Just a couple of comments on your blog though. Do you know on IE on the Mac
the background colour is dark blue, and the text is black? I could only read
the text by using 'select all' to highlight it in inverse video. Also,
clicking on the links to the comments results in 'special database object
cannot be located'. Missing subform?
Hash for a whole doc? By harkpabst_meliantrop on 12/06/2005 at 05:28 PM EST
"Store the hashed key for each document on that document ..."

O.K., I admit, I still don't get it. How do you compute a hash key for a whole
document with this technique? From my quick testing, @Password won't accept
any input except for plain text!?! Do I have to manually build a string from all
fields in the doc then? What about RT data like file attachments? Or am I just
overlooking the obvious?

