Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

A faster way to update data!

By Andrew Pollack on 08/22/2003 at 04:07 PM EDT

Update: For those following this work, but using the "ReadViewEntries" method described at the end, by moving to faster processors (this article was originally written in 2003), and by doing all comparisons and calculations in memory, we've been able to reduce the total runtime to right around 30 minutes, even as the number of rows of Oracle data has increased to well over 6 million.

Have you ever had to compare thousands, even millions of records to see which need to be updated? That was just the case for me when reviewing a process that's been in place for a very long time. I found a way to save HOURS of processing. The example talks about relational data, but it works really nicely for Notes -> Notes comparisons as well.

In this case, we get a dump of 2.6 million records to compare with the 60,000 or so notes documents, and have to check for changes and updates as necessary. The server is a 2 processor 300mhz machine with an old RAID system. No race car.

The data we get is extremely denormalized SQL Result set data. For example, the user "X" had 360 rows of data in the Oracle view that we have to read, and he's just one person. Why so many? It has to do with the way that view is created for us. Normally when you join tables, the number of rows is equal to the number of rows in the table contributing the most rows to the result set. In this case, that would be 5 rows. Something about the way the view is created for us however, causes it to be a MULTIPLE of the number of values in some columns times the number of values in others. In this users case, where he has multiple values in multiples columns, it comes to 360 total rows. As I understand it, they join many tables together from their system to give us this data, but the way they do the joins, each table join multiplies the values. If I had to guess, they're doing it sequentially, one table at a time joined to the existing result rather than as a single big join. This may be legitimate due to total size of the job -- a single big multi-table join may be outside the capacity of their system. In any case, it isn't an option for me to go and examine the code that creates the table.

That said, we need to check EVERY ROW (2.6 million of them) to see if there has been a data change compared to what we have stored on our documents for that user. Checking 2.6 million rows times 21 columns is more than 56 million comparisons! It was taking us 12+ hours at one point, using code that was considered very fast by other industry experts.

It turns out, most of the processing that we were doing was taken up loading the notes document into memory from the database or loading the values that are on it into working variables then doing the comparisons. Most of that is wasted -- if we could only know which ones we didn't need to read, we could skip them. Now, one of the things to always keep in mind, is that a view, in Notes, is also a note. That's always caused us trouble, but it can also be a good thing. What if we could just read that one note, and know which documents referenced on it had the same data as we were getting from the Oracle side?

What I've done is take all the data we get for a single user in Oracle, and combine it into a single really long string of text. If I could add that string to the view, I could just check it -- BUT -- the string is too long. You couldn't store it all like that and do any good. The view would be so large, reading it would take a very long time, and indexing on that string would be a nightmare. The answer is a HASH algorithm.

A hash is a mathematical formula, that when applied to a string of text produces a unique value, which doesn't contain the actual text. The "uniqueness" (or likelihood that two strings could produce the same result) is a function of the length of the hash key the algorithm produces. Well, built into Notes is a really great 32 byte hash algorithm -- the one used to store passwords in an unreadable way. So I decided to try reducing all the data for a single user to a short (32 byte) hash key, so I could check that to determine if the data had changed.

For simplicity, instead of writing my own -- or looking one up and copying it from the web -- I opted to use the one Lotus builds in. That's cool because is well tested and I know that any problems I have aren't a result of my own faulty math. They call this magic hash function "@Password". Yep, its that simple. Take your long string, feed it to @Password() and get back a short -- but powerfully unique -- string. In LotusScript you can do this with the evaluate() function. (TIP: Make sure you strip any " quotation marks " that may be in your string when you create the formula so that you don't inadvertently terminate the string and cause the formula to be invalid).

Now, instead of a really long string of concatenated text to compare, I have only this hash key. Its only 32 bytes (letters) long, but it represents a unique value for all the data for each user in the system. The likelihood of two different strings having the same 32 byte hash is vanishingly small. (
Rich Schwartz may jump in here and give us the actual probability). I take that value and store it with the user record. Now, next time through, Instead of loading the user record, I just find that user's view entry (which is indexed on the user's unique key) and grab that hashed value from the view without having to load the document. I compare the stored hash value with the new one generated from the oracle data, and if they are the same, I know nothing has changed -- even though I never compared the actual values themselves! It means I can ignore that data and go on to the next set of records.

To take things a step further, I now pre-read the view by way of the notesViewentry objects, into a LotusScript "List" where the listtag is the user's unique key, and the data is the hash. As a result, once I've read the view once, I never have to go back to it during the entire run of the agent.

We've cut over to the new method, and cut processing time down from 12 hours to around 3.5 hours.

This can work even better, when you're comparing two sets of Notes documents. Store the hashed key for each document on that document, and also store the hashed key of its duplicate. Now, each document has all the data necessary to compare with the other, and it can all be done by reading view entries, never actually having to load a document into memory for no reason. Have to check or change only certain fields? Fine, just make your hash key out of those fields.

Cool?


There are  - loading -  comments....

brilliant!By Anonymous on 06/23/2004 at 11:21 AM EDT
------------------------------------------------------------------------------
in response to
------------------------------------------------------------------------------
Have [SNIP] Edited by AJP to remove entire quoted story [/snip] . Cool?
re: A faster way to update data!By Devin Olson on 06/23/2004 at 11:21 AM EDT
"Sometimes the spark of brilliance can only be found in the mines of
misery".....Sorry, my muse was waxing poetic again.

This is a GREAT solution you have come up with. I'm thinking on implementing
something similar at work for a pretty nasty project. Ever hear of a product
called "Training Server"? We are integrating it with our NAB -10K plus users,
promises to be a fun one ;->

-Devin.
re: re: A faster way to update data!By Andrew Pollack on 06/23/2004 at 11:21 AM EDT
:Cool -- I hope it works for you. I have code laying around to do most of that
work. Take a look at the "Tips N' Tricks for Admins" presentation on my
"business" home page. http://www.thenorth.com

------------------------------------------------------------------------------
in response to
------------------------------------------------------------------------------
"Sometimes the spark of brilliance can only be found in the mines of
misery".....Sorry, my muse was waxing poetic again. This is a GREAT solution
you have come up with. I'm thinking on implementing something similar at work
for a pretty nasty project. Ever hear of a product called "Training Server"? We
are integrating it with our NAB -10K plus users, promises to be a fun one ;->
-Devin.
some commentsBy Bernard Devlin on 07/16/2004 at 04:51 AM EDT
Hi Andrew, a great idea.

Just a couple of comments on your blog though. Do you know on IE on the Mac
the background colour is dark blue, and the text is black? I could only read
the text by using 'select all' to highlight it in inverse video. Also,
clicking on the links to the comments results in 'special database object
cannot be located' Missing subform?
Hash for a whole doc?By harkpabst_meliantrop on 12/06/2005 at 05:28 PM EST
"Store the hashed key for each document on that document ..."

O.K., I admitt, I still don't get it. How do you compute a hash key for a whole
document with this technique? From my quick testings, @Password wont't accept
any input exept for plain text!?! Do I have to manually build a string from all
fields in the doc then? What about RT data like file attachments? Or am I just
overseeing the obvious?


Other Recent Stories...

  1. 05/05/2016Is the growing social-sourced economy the modern back door into socialism?Is the growing social-sourced economy the modern back door into socialism? I read a really insightful post a couple of days ago that suggested the use of social network funding sites like “Go Fund Me” and “Kickstarter” have come about and gained popularity in part because the existing economy in no longer serving its purpose for anyone who isn’t already wealthy. Have the traditional ways to get new ventures funded become closed to all but a few who aren’t already connected to them and so onerous as to make ...... 
  2. 04/20/2016Want to be whitelisted? Here are some sensible rules for web site advertisingAn increasing number of websites are now detecting when users have ad-blocking enabled, and refuse to show content unless you "whitelist" their site (disable your ad-blocking for them). I think that is a fair decision on their part, it's how they pay for the site. However, if you want me (and many others) to white list your site, there are some rules you should follow. If you violate these rules, I won't whitelist your site, I'll just find content elsewhere. 1. The total space taken up by advertisements ...... 
  3. 12/30/2015Fantastic new series on Syfy called “The Expanse” – for people who love traditional science fiction[] “The Expanse” is a new science fiction series being broadcast onthe Syfy channelthis winter. It’s closely based on a series of books by author James S. A. Corey beginning with “Leviathan Wakes”. There are 5 books in the “Expanse” series so far. If you’re a fan of the novels you’ll appreciate how closely the books are followed.TIP: The first five episodes are already available on Syfy.com. If you’re having trouble getting into the characters and plot, use those to get up to speed.The worlds created for ...... 
  4. 10/20/2015My suggestion is to stay away from PayAnywhere(dot)com  
  5. 08/07/2015Here is one for you VMWARE gurus - particularly if you run ESXi without fancy drive arrays 
  6. 08/06/2015The Killer of Orphans (Orphan Documents) 
  7. 06/02/2015Homeopathic Marketing: Traveler on my Android is now calling itself VERSE. Allow me to translate that for the IBM Notes community... 
  8. 03/17/2015A review of British Airways Premium Economy Service – How to destroy customer goodwill all at once 
  9. 02/26/2015There's a bug in how @TextToTime() and @ToTime() process date strings related to international standards and browser settings. 
  10. 01/21/2015Delivering two new presentations at Developer Camp (EntwicklerCamp) 2015 in Germany 
Click here for more articles.....


pen icon Comment Entry
Subject
Your Name
Homepage
*Your Email
* Your email address is required, but not displayed.
 
Your thoughts....
 
Remember Me  

Please wait while your document is saved.