Friday, December 15, 2006

UTF-8, Rails validations and all that jazz

To be honest, it never occurred to me that Ruby did not natively support Unicode, given its roots. Alas, Ruby really has no clue about multi-byte character encodings, except that Regexp can handle UTF-8 when the 'u' specifier is used. But wait, given that UTF-8 is the default encoding for XHTML and friends, what effect does this have on Rails applications? A lot and not much at all at the same time. Let me explain the little I know about this.
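
To see what the 'u' specifier changes, here is a quick sketch of Ruby 1.8-era behavior (the string is simply an 'é' given as its two UTF-8 bytes, and the MatchData address is elided):

irb(main):001:0> e_acute = "\303\251"
=> "\303\251"
irb(main):002:0> /^.$/.match(e_acute)
=> nil
irb(main):003:0> /^.$/u.match(e_acute)
=> #<MatchData:0x...>

Without 'u' the string is treated as two unrelated bytes, so a single-character match fails; with 'u' it is treated as one character.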

I ran into this issue on the first Rails app I deployed. Even though I had pretty tight validation regexes on my models, every once in a while user-provided content would clearly contain Unicode and not display correctly. Damn, that is scary. My validations were not stopping this data from making it into the model. The app served out XHTML 1.0 Strict pages with the charset set to UTF-8. By default, browsers POST form data in the same encoding as the page containing the form; in this case, UTF-8. All it takes is one cut-and-paste from MS Office into a form field (think bullet points etc.) to end up with Unicode in the form parameter values. The reason I was ending up with multi-byte values that were not displaying correctly is two-part.

The more trivial side of the issue is that it turns out Safari has a bug when handling response data from XmlHttpRequest that contains multi-byte encoded characters. This is resolved by explicitly setting the encoding on AJAX responses (a sketch of one way to do that appears after the summary below). The more troublesome issue is how the multi-byte characters made it past the validations on the ActiveRecord models. After a little experimentation I came up with this:

irb(main):036:0> utf8 = "\342\200\271script\342\200\272"
=> "\342\200\271script\342\200\272"
irb(main):037:0> test_regex = /^[\w]+$/u
=> /^[\w]+$/u
irb(main):038:0> m = test_regex.match utf8
=> #<MatchData:0x...>
irb(main):039:0> m[0]
=> "\342\200\271script\342\200\272"
irb(main):040:0> test_regex = /^[-a-zA-Z0-9_.]+$/u
=> /^[-a-zA-Z0-9_.]+$/u
irb(main):041:0> m = test_regex.match utf8
=> nil

Argh, \w in a UTF-8 enabled Regexp matches most, if not all, of the UTF-8 character set beyond the single-byte ASCII subset. Fortunately, it appears that Ruby's UTF-8 Regexp handling does not honor overlong-encoded characters, so if your validations exclude the characters generally needed to inject HTML or JavaScript, it is unlikely they can be bypassed simply by using overlong UTF-8 encodings of those characters. That is good news. However, if multi-byte encoded characters do make it into your models and you rely on methods like sanitize() or h() to do safety filtering at presentation time, don't expect them to do their jobs well.
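
In practice that means spelling out the allowed characters rather than reaching for \w. Here is a minimal sketch of what that looks like on a model (the model and attribute names are hypothetical; the character class is the second regex from the irb session above):

class Comment < ActiveRecord::Base
  # Explicit ASCII whitelist: multi-byte sequences never match, so they are
  # rejected before they ever reach the database.
  validates_format_of :username,
                      :with    => /^[-a-zA-Z0-9_.]+$/,
                      :message => "may only contain letters, digits, dashes, underscores and dots"
end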

In summary, there is potential for security issues to arise from accepting multi-byte encoded data, but the most straightforward ones, like direct tag and script injection, will generally not slip past your input validations. As I research this more, I will keep you posted.
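
For reference, the Safari fix mentioned earlier amounts to forcing an explicit charset on responses. A minimal sketch in the Rails 1.x style of the day (the filter name is mine; you may want to limit it to the actions that answer XmlHttpRequest):

class ApplicationController < ActionController::Base
  after_filter :force_utf8_charset

  private

  # Append charset=utf-8 to the Content-Type unless a charset is already
  # present, so Safari is not left guessing the encoding of AJAX responses.
  def force_utf8_charset
    content_type = @headers["Content-Type"] || "text/html"
    @headers["Content-Type"] = "#{content_type}; charset=utf-8" unless content_type =~ /charset/i
  end
end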

2 comments:

Pablos said...

Um, input filtering isn't going to solve the problem. You need to do output filtering when rendering data, because the multi-byte encodings are going to get past the input filters (as you pointed out). You need to do HTML encoding before rendering the data.

Dominique Brezinski said...

You are correct. You may have also caught that I mentioned sanitize() and h() won't work correctly on multi-byte data. For those not that familiar with Rails, h() does HTML encoding, and sanitize() tries its best to turn form and script tags into text and stomp on onXXX attributes and hrefs that start with javascript:. So one might classify those as output filtering. That might also be why I classified this issue as both a big deal and not such a big deal. Multi-byte encoded data can be handled safely, but one must understand how Ruby can and cannot handle it.
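
For anyone who has not looked under the hood, Rails' h() is essentially ERB::Util's html_escape, so a quick irb sketch shows the idea:

irb(main):001:0> require 'erb'
=> true
irb(main):002:0> include ERB::Util
=> Object
irb(main):003:0> html_escape("<script>alert('hi')</script>")
=> "&lt;script&gt;alert('hi')&lt;/script&gt;"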