Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

SNTT (seriously geeky) - Some helpful @Formulas - clean up your XML, and convert from hex value strings to decimal numbers

By Andrew Pollack on 11/01/2006 at 04:40 PM EDT

In working to improve this blogsite, I found some challenges in creating good XML when the source data can be variable. The problem is that in formula language lacks easy tools for doing this. There is no simple XML transformation process that works the way @URLENCODE works. It sure would have been easier.

First off, as you put together your XML string, you want to make sure to replace the "reserved" characters that are valid ascii values but still can't be in your content. By these I mean the greater than and less than brackets, ampersand, quote, and apostrophe. You want to do this to each of the fields you'll build your string out of, so that these characters are transformed before take the next step of scrubbing any high order characters from the data. I prefer to make the replacement lists in one place, as a constant at the top of the formula, as then may get used quite a bit.

xmlfrom := "<":">":"'":"&" : "\"" ;
xmlto := "&lt;" : "&gt;" : "&apos;" : "&amp;" : "&quot;";
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );

That was the easy part. What's harder is scrubbing your XML for any characters not allowed under the UTF-8 encoding schema. Basically, that means anything with unicode value above 127. The one that was messing up this blogsite was "é" from the name of a person in another story. That character comes out to a unicode value of 0xE9 (or 233 in decimal).

Formula language presents us with some challenges here. In LotusScript this would be handled fairly simply. Formula language however does not have easy functions for deal with conversion to hexadecimal values from strings, and strings are not stored as unicode values. There is also no simple function to get the unicode value of a character. Solving these problems leads me to these other useful formulae.

First, here's one to convert a string of hexadecimal digits into a numeric value. This could have been simpler for my needs, as I'm only dealing with two digit values, but I wanted something that could handle any reasonable length string of hex characters. The initial value is in the field "sourcetext" and the result will be in the numeric field "result".

result := 0;
sourcetext := "1A3";
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues :=  "10": "11" : "12" : "13" : "14" : "15";
@For(z := 1 ; z <= @Length(sourcetext) ; z := z + 1 ;  
 result :=  @If(@IsNumber(result) ; result ; 0) +
 @Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
 @Right(sourcetext; z);1) ; HexChars; DecimalValues)));

* Notice that I'm setting "result" to zero here. If you do this inside a loop of any kind, you'll find the value keeps increasing. Make sure to set it back to zero for each new iteration, as in the final formula shown below.

Finally we're going to put the whole thing together and produce an XML string that we can be sure is valid. For this example, lets assume we're using the "Subject" field of a document in a view column formula to produce this XML.

The key to this whole thing is the function "@URLEncode" which takes as one of its possible options "ISO-8859-1". The formula looks at each character in the string, and encodes it in that format. For any character above decimal value 127 along with any values that are not valid in that format (like spaces) the encoding will change it to a % sign followed by a two digit hexadecimal value. That two digit value represents the character code. Since any other character will be unchanged, we can tell which are 'special' characters according to that format by checking the length of the string returned. For each "special" character, we check to see if it is in fact above value 127. To do this, we convert the hexadecimal value to a number value first. If it turns out that we're dealing with a space character or some other value which is valid in UTF-8 encoding but not in ISO-8859-1, we just convert it back again. For those characters which are above 127, we turn them into the XML representation allowed by UTF-8 encoding, which is "&#x" followed by the two digit hex value, and then by a closing semicolon. Thus, the "é" character from earlier encodes as "&#xE9;"

Notice also that we do the transformation for brackets and ampersands and so forth on the subject string itself before we apply the XML tags around it. We don't want to change the brackets on the XML tags of course. We do the high order character transformation on the entire XML string however, because those characters are not valid anywhere on the XML document.

SourceText := Subject;
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues :=  "10" : "11" : "12" : "13" : "14" : "15";
xmlfrom := "<":">":"'":"&" : "\"" ;
xmlto := "&lt;" : "&gt;" : "&apos;" : "&amp;" : "&quot;";
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );
SourceText := "<subject>" + SourceText + "</subject>";
@For( x := 0 ; x < @Length(SourceText); x := x + 1 ;
 tcharenc := @URLEncode("ISO-8859-1" ;@Middle(SourceText; x ; 1));  
  @If( @Length(tcharenc) = 1; ResultText := ResultText + tcharenc ;
    @Transform( @Explode(tcharenc;"%") ; "chval" ; @Do(
      result := 0;
      @For(z := 1 ; z <= @Length(chval) ; z := z + 1 ;
        result :=  @If(@IsNumber(result) ; result ; 0) +
        @Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
        @Right(chval; z);1) ; HexChars; DecimalValues)));      
      ResultText := ResultText + @If( result > 127 ; "&#x" + chval + ";"  ; @Char(result))))));


There are  - loading -  comments....

My own thoughts on this are...By Richard Schwartz on 11/01/2006 at 11:27 PM EDT
Not clear to me why you use 8859-1.

The UTF-8 value for a character can be 1, 2, or three bytes long (2, 4, or 6
hex digits). For example, each of these three requires two bytes: אבג They
also go right to left when displayed ;-)

And each of these requires three bytes: ①②③

Without following your code line-for-line, I can't be sure; but it seems to me
that you're in danger of losing data if you do a transformation to through
8859-1
Because I haven't found anything better.By Andrew Pollack on 11/02/2006 at 12:17 AM EDT
Using the options "Domino" or "UTF-8" produce a set of values that browsers do
not interpret correctly. In this case, it produces "%C3%A9" which are not a
valid representation of the character without also specifying another code page.

Interestingly enough, using the value "Platform" seems to work, although all
three fail when attempting the character data you entered.

Something is still missing.
My own thoughts on this are...By Kerr on 11/03/2006 at 09:20 AM EDT
Something is very much missing. Part of that is that URL encoding is a specific
task that has different requirements to discovering the Unicode code point of a
given character. The xml entity reference is just the Unicode code point, in
either a hex or decimal format and has nothing directly to do with any charset.

@URLEncode does the following:
For every char in the input string, if the character is not valid for a url
query as defined by the url spec (basically [a-zA-Z0-9] plus a few others) then
return the encoded character as defined by the specified charset, preceding
each byte with a % char.

Using this method to try and extract the Unicode code point is going to involve
knowledge of how that charset encoding relates to Unicode, which may be
non-trivial.

There is really no reason why you shouldn't be able to just put the "é"
directly on the document, as long as the charset is utf-8 and xml document
specifies that it is utf-8 (although that is the default). Unless you are
working around a bug that is causing a problem or you have characters that your
font's don't have a glyph for you should not need to use character entity
references at all.

A couple of points of clarification:
UTF-8 can encode any Unicode code point, using as many bytes as it needs using
surrogate pairs.
The only characters that are not valid in an xml document are control
characters or half of surrogate pairs. There are also unassigned code points
that shouldn’t be used.
As long as the charset that encodes (serialises) the xml document is the one
that is used to unencode it, that this is the one specified on the document,
and the charset is capable of encoding all characters on the document, then any
problems will be caused by a bug, not a limitation in Unicode, the charset or
XML.
I hear you.... but in practice it's not working as wellBy Andrew Pollack on 11/04/2006 at 12:01 AM EDT
The issue seems to be that the rss data isn't be generated by domino, so what
domino is encoding doesn't match the charset I've coded for the xml to be.

I have an idea or two to try that may help. By the way, I hate character set
issues.

*Tested -- initially anyway, I've found a VERY simple and happy way to resolve
this issue. Fascinatingly simple really. I'll try to post it tomorrow.
My own thoughts on this are...By Richard Schwartz on 11/02/2006 at 01:19 AM EDT
The problem is that XML character refs are expressed in UTF-16, not UTF-8. The
correct encoding for that character in UTF-16 is this

é

That can be shortened to é. In fact, for any character that falls within
the 8859-1 set, the UTF-16 value will have the two leading zeros, which can
simply be dropped. But if you want to support all Unicode characters, you have
to be prepared to deal with

There does not seem to be a way to use @UrlEncode to get UTF-16. One could
write a translator from UTF-8 to UTF-16 in formula language, though. Shouldn't
be too too hard. Maybe I'll take it on as a challenge.


Other Recent Stories...

  1. 05/05/2016Is the growing social-sourced economy the modern back door into socialism?Is the growing social-sourced economy the modern back door into socialism? I read a really insightful post a couple of days ago that suggested the use of social network funding sites like “Go Fund Me” and “Kickstarter” have come about and gained popularity in part because the existing economy in no longer serving its purpose for anyone who isn’t already wealthy. Have the traditional ways to get new ventures funded become closed to all but a few who aren’t already connected to them and so onerous as to make ...... 
  2. 04/20/2016Want to be whitelisted? Here are some sensible rules for web site advertisingAn increasing number of websites are now detecting when users have ad-blocking enabled, and refuse to show content unless you "whitelist" their site (disable your ad-blocking for them). I think that is a fair decision on their part, it's how they pay for the site. However, if you want me (and many others) to white list your site, there are some rules you should follow. If you violate these rules, I won't whitelist your site, I'll just find content elsewhere. 1. The total space taken up by advertisements ...... 
  3. 12/30/2015Fantastic new series on Syfy called “The Expanse” – for people who love traditional science fiction[] “The Expanse” is a new science fiction series being broadcast onthe Syfy channelthis winter. It’s closely based on a series of books by author James S. A. Corey beginning with “Leviathan Wakes”. There are 5 books in the “Expanse” series so far. If you’re a fan of the novels you’ll appreciate how closely the books are followed.TIP: The first five episodes are already available on Syfy.com. If you’re having trouble getting into the characters and plot, use those to get up to speed.The worlds created for ...... 
  4. 10/20/2015My suggestion is to stay away from PayAnywhere(dot)com  
  5. 08/07/2015Here is one for you VMWARE gurus - particularly if you run ESXi without fancy drive arrays 
  6. 08/06/2015The Killer of Orphans (Orphan Documents) 
  7. 06/02/2015Homeopathic Marketing: Traveler on my Android is now calling itself VERSE. Allow me to translate that for the IBM Notes community... 
  8. 03/17/2015A review of British Airways Premium Economy Service – How to destroy customer goodwill all at once 
  9. 02/26/2015There's a bug in how @TextToTime() and @ToTime() process date strings related to international standards and browser settings. 
  10. 01/21/2015Delivering two new presentations at Developer Camp (EntwicklerCamp) 2015 in Germany 
Click here for more articles.....


pen icon Comment Entry
Subject
Your Name
Homepage
*Your Email
* Your email address is required, but not displayed.
 
Your thoughts....
 
Remember Me  

Please wait while your document is saved.