Recent Posts

  • 24 days ago

    Thing 1:

    Opitter, a Twitter client for Opera. I learnt a lot about Ajax in the process of making it, and decided that jQuery is cool.

    Thing 2:

    Hot sauce. I haven’t tried it yet, but it’s about 50% scotch bonnet peppers. It hopefully has the potential to explode 100% of my head.

  • 2 months ago

    Hurrah, RevLob lives again!

    Tags: lobster
  • 3 months ago

    On Saturday Sam and I went to London to see the Levellers at Brixton Academy. It was an event called Beautiful Nights, one of their 20th anniversary gigs, and it was awesome.

    That’s the band exploding in some kind of plasma ray feature. Supporting the Levellers were 3 Daft Monkeys (who are very good), and Alabama 3. The Wikipedia article on Alabama 3 features this amusing quote:

    The band was formed when Jake Black met Rob Spragg at an acid house party in Peckham and they decided that a fusion of country music with acid house was a musical possibility.

    I mean, what the hell were they doing at this party? Playing Musical Twister?

    Rob: (spins spinner) “Ok Jake, left hand country and western”
    Jake: “Okay”
    Rob: “aaaaaaand, right foot… acid house”
    Jake: “Hmm… That could work”

    Just because something is possible, doesn’t mean you should go ahead and do it. Regardless of how they decided to embark on such a musical journey, the end result was pretty good. Worlds away from Dolly Parton at any rate.

    The other guest band was Dreadzone, who continued with the theme of musical fusion. Their flavour is a blend of reggae, dub, techno, and dance. Reggae has never really been my thing, but just as Alabama 3 have made me look at country music in a new light, Dreadzone I think have opened doors for me. Musical doors. Musical doors in a house of sound, in a city of noise, on a planet of shoes.

    We also went up the London Eye! Only three things can fly higher than the Eye in London: property prices, birds, and helicopters.

    Last night I finished reading the final volume of Neal Stephenson’s The Baroque Cycle, The System of the World. That’s nigh-on three thousand pages of late renaissance adventure, romance, war, politics, and Natural Philosophy. I started reading the first book over three and a half years ago, soon after I was given it for my 21st birthday by Daf. I did manage to fit a few books in between volumes, some Gibson and some Banks, and I did read Quicksilver twice, but I’ve been following the story of Daniel Waterhouse and Jack Shaftoe for such a long time now that coming to the end of such an epic tale has left me somewhat stunned. There are no more pages to turn. I’ve already read Cryptonomicon.

    What the hell do I read now?!

    I had better find something soon, otherwise I may be tempted to read the whole thing again. I may be tempted to buy the whole thing again.

  • 5 months ago

    The Japanese are a crazy bunch. They write their names back-to-front, their addresses upside-down, and they wear their shoes on their hands and their gloves on their feet. Okay, so I made that last bit up, but there are certain aspects of their language and conventions which can give developers a headache.

    One problem which had me digging around the net, peering at verbose specifications and underdeveloped Wikipedia pages was that of spaces. To most of the Latin character-(ab)using world, a space is a space is a space. That behemoth of a key at the bottom of your keyboard can usually be counted on to conjure that familiar interword separator; the humble hexadecimal value 20, AKA space. However, when you’re using an input method for an East-Asian language such as Japanese, there’s no knowing what might be summoned.

    There’s a good deal of background when it comes to Japanese and input methods, but I’ll try and summarise why things are a bit of a mess here.

    Back when computers were made of grit and powered by coal furnaces, some guys said “Hey, wouldn’t it be useful if we could type some words onto this thing?” Thus the keyboard was born. However, several sets of guys invented the keyboard at the same time. This meant that there now existed more than one means of storing and projecting the characters entered by their keyboards onto their candlelit zoetropes.

    So a bunch of these guys got together and said “I would like to send you a file (files back then were chiseled into slate), but your runes are written in German, whereas mine are Reverse Polish.” Together they decided that they should probably agree on some sort of standard encoding to save everybody a lot of time and hassle, and they called it ASCII. Problem is, ASCII wasn’t just developed by a bunch of guys, but a bunch of American guys. I guess in the days before the Internet, the Americans couldn’t have imagined the possibility that you’d be exchanging files with non-Americans. Therefore, ASCII was great if you spoke God’s English, but not so great if your heathen tongue supported mystic symbols like Ümlauts or Jamㅇ.

    Here’s where the gaps in my knowledge start to show. I’m not as clued up on the history of Japanese computing as I’d like to be, but they must have managed to get from Samurai to Gundam via some amount of technological progress.

    It was someone’s dream that one day, everything in the IT world would be encoded using an internationally recognised standard. An encoding which could represent any character, from any language in the whole world. Including Spain. That dream was Unicode. But the first revision of the Unicode standard wasn’t unleashed until 1991, by which time the Japanese had grown weary of performing calculations on abaci, and had decided they’d join in with this computing malarkey and invent their own system. Or two. Pulling down the “Encoding” option from my Firefox View menu allows me to choose from no fewer than three available options, and I’m sure there are more. Here’s a “map” of where some of these popular Japanese encoding methods lie in relation to ASCII:

    Is that clear? Good.

    Now this poses all sorts of issues for the web developer. When a user stumbles onto your website, if you have any sense at all you will have sent the page with a valid encoding type, either inside the HTTP header or via a META tag. Or through the XML declaration, if you’re using one. This tells your user’s client how to handle the deluge of characters the server has just sent it. Sometimes, websites have forms, which may be used to sign your guestbook, or to answer questions to determine your chances of surviving a zombie apocalypse (mine is a fair 63%).

    Forms are where users have decided that they’ve had enough of your data, and want to send you some back. Unfortunately, this can include text, and since this is the Internet, you can expect all kinds of drivel in all kinds of languages, and it won’t be just Good English and Bad English either. When a user hits “Submit”, the contents of the form fields will be shot back to your application, and it’s up to you to guess how they are encoded. “Surely it should be fair to assume that the form data uses the same encoding as the form page?” I hear you ask. Yes, it should be. But this is the Internet, where Sod’s Law is just another protocol in the TCP/IP stack.
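
    If you would rather check than guess, one option is to test whether the submitted bytes are at least well-formed UTF-8 before doing anything else with them. Here’s a minimal sketch, assuming PHP 7.1 or later. The helper name looksLikeUtf8() is made up for illustration, and the trick relies on preg_match() returning false when a /u pattern is run against a subject that isn’t valid UTF-8.

```php
<?php
// Hypothetical helper: true if $raw is a well-formed UTF-8 byte sequence.
function looksLikeUtf8(string $raw): bool
{
    // An empty pattern with the /u modifier matches any valid UTF-8 string;
    // on invalid UTF-8, preg_match() returns false instead of 1.
    return preg_match('//u', $raw) === 1;
}

$good = "山田\u{3000}太郎";   // valid UTF-8
$bad  = "\x8E\x52\x93\x63";  // not valid UTF-8 (the sort of bytes a Shift_JIS submission might contain)

var_dump(looksLikeUtf8($good)); // bool(true)
var_dump(looksLikeUtf8($bad));  // bool(false)
```

    This only tells you the bytes are legal UTF-8, not that the user meant them that way, so treat it as a sanity check rather than encoding detection.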

    Whether or not you choose to detect and translate other encodings is up to you and/or your spec, but let’s just assume for a moment that we have built a simple form, and both the page and the submitted data are UTF-8 (just one of the several Unicode encodings). Let us also assume that you have a single field in which you are asking the user to input their full name. We want to extract the first name and last name from this single field. A simple way to capture these elements would be to use a regular expression to detect the whitespace in the field, and assume everything on the left is the first (family name in Japan), and everything on the right is the second (given, or ‘first’ name).

    We have made a whole bunch of assumptions right there, such as the existence of a single set of whitespace characters, and that the user hasn’t given themselves a title (e.g. ‘Mr’ or ’-さん’). Suspend your disbelief for just a moment, and assume we have some nice, clean, valid data. Now, how do we detect that whitespace again?

    PHP has access to Perl Compatible Regular Expressions (PCRE) via a set of functions prefixed with ‘preg_’. PCREs are special in that they are Quite Good, especially when compared to those of the POSIX Extended-variety, which are Not So Good. Using preg_match(), we can test for certain patterns within text, and it allows us a means for easily defining tests such as ‘does this variable contain any whitespace?’

    preg_match('/\s/', $foo);

    The interesting bit of the above line is in between the forward-slashes, ’\s’. This is an escape sequence, shorthand for a series of characters. In this example, \s is shorthand for whitespace. Testing for whitespace as opposed to just a normal space could be considered overkill, seeing how it is unlikely someone is going to separate their names with a vertical tab, or any of the other characters in the whitespace set, but we’re just being careful here.
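
    To make the name-splitting idea a bit more concrete, here’s a rough sketch of how it might look, assuming PHP 7.1 or later and the nice clean data we agreed to pretend we have. The function name splitOnWhitespace() and the ‘family’ and ‘given’ keys are my own inventions for the example.

```php
<?php
// Hypothetical helper: splits a "Family Given" string on its single run of whitespace.
function splitOnWhitespace(string $fullName): ?array
{
    // (\S+) grabs the non-whitespace on each side; \s+ is the separator between them.
    if (preg_match('/^(\S+)\s+(\S+)$/', $fullName, $m) === 1) {
        return ['family' => $m[1], 'given' => $m[2]];
    }
    return null; // no whitespace found, or more than two parts
}

var_dump(splitOnWhitespace('Yamada Taro'));
// array with 'family' => 'Yamada' and 'given' => 'Taro'

var_dump(splitOnWhitespace('Yamada')); // NULL
```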

    Now, if you’re still with me, great, because this is where I finally begin to highlight the issue I am ranting about. What caused me to write all of this is that when a user types via a Japanese input method (and this may be true for other languages too; I’m not sure), the space character they type might not be the same as a normal space. It depends on whether they are typing in hiragana, katakana full-width, katakana half-width, or romaji, but your application should be prepared to receive an “Ideographic Space”.

    This is important because ideographic spaces are not included in the PCRE standard set of whitespace characters, meaning our regular expression above will fail, even though the user has entered what appears to them to be a space character. However, all is not lost, and you won’t have to resort to detecting long strings of hexadecimal in order to filter these invisible terrors.
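
    You can see the failure for yourself with a couple of lines. This is only a sketch, assuming PHP 7 or later (for the \u{...} string syntax) and a made-up sample name; it runs the same pattern as above, without any Unicode modifier, against a name separated by U+3000.

```php
<?php
// Sample data (hypothetical): family and given name separated by an ideographic space.
$name = "山田\u{3000}太郎";

// The pattern from earlier in the post, unchanged: no whitespace is found.
var_dump(preg_match('/\s/', $name)); // int(0)

// The two "spaces" are quite different on the wire.
var_dump(bin2hex(' '));         // string(2) "20"
var_dump(bin2hex("\u{3000}"));  // string(6) "e38080"
```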

    If you’re lucky enough to be using an environment whose PCREs have been built with Unicode support, then you have access to an additional set of escape sequences, particular to Unicode characters. These sequences detect properties of Unicode characters, and are defined with \p. There’s a large list of properties that can be detected, but the one we are interested in is Space Separator, \p{Zs}. Here’s a comparison of how the two sequences behave against various whitespace characters:

    Character               Unicode   /\s/   /\p{Zs}/u
    Space                   U+0020    ✓      ✓
    Newline/Linefeed (\n)   U+000A    ✓      ✗
    Carriage Return (\r)    U+000D    ✓      ✗
    Tab (\t)                U+0009    ✓      ✗
    Ideographic Space       U+3000    ✗      ✓

    The table above makes use of some nice little Unicode ticks and crosses, which I hear don’t work in IE6. If you happen to be reading this in IE6: sucks to be you.
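
    If you don’t trust my ticks and crosses, a comparison like this can be regenerated with a short loop. This is a sketch, assuming PHP 7 or later and a PCRE build with Unicode property support; the $chars and $results names are just for the example.

```php
<?php
// The characters under test, keyed by a human-readable label.
$chars = [
    'Space'             => "\u{0020}",
    'Newline'           => "\n",
    'Carriage Return'   => "\r",
    'Tab'               => "\t",
    'Ideographic Space' => "\u{3000}",
];

$results = [];
foreach ($chars as $label => $char) {
    $results[$label] = [
        's'  => preg_match('/\s/', $char) === 1,       // plain whitespace shorthand, no modifier
        'Zs' => preg_match('/\p{Zs}/u', $char) === 1,  // Unicode "Space Separator" property
    ];
    printf(
        '%-18s \s: %-3s \p{Zs}: %s' . PHP_EOL,
        $label,
        $results[$label]['s'] ? 'yes' : 'no',
        $results[$label]['Zs'] ? 'yes' : 'no'
    );
}
```

    Only the plain space should come out as a “yes” in both columns.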

    So neither sequence matches the full range of whitespace, which may be important if you plan on handling Unicode, and interpreting whitespace. Luckily, regular expressions allow us to combine strings of characters and sequences together, so we can glue \s and \p{Zs} together to get everything, if we need to.
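
    Gluing the two together might look something like the sketch below, which puts both sequences in one character class and splits the name on it. It assumes PHP 7.1 or later and Unicode-enabled PCRE; splitFullName() and its array keys are hypothetical names of my own.

```php
<?php
// Hypothetical helper: splits a full name on ASCII whitespace or any Unicode space separator.
function splitFullName(string $fullName): ?array
{
    // [\s\p{Zs}]+ matches a run of either kind of space; /u makes PCRE read the input as UTF-8.
    $parts = preg_split('/[\s\p{Zs}]+/u', $fullName, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error (for example, invalid UTF-8 input).
    if ($parts === false || count($parts) !== 2) {
        return null;
    }
    return ['family' => $parts[0], 'given' => $parts[1]];
}

var_dump(splitFullName("山田\u{3000}太郎"));
// array with 'family' => '山田' and 'given' => '太郎'

var_dump(splitFullName('Yamada Taro'));
// array with 'family' => 'Yamada' and 'given' => 'Taro'
```

    Anything that doesn’t split into exactly two parts comes back as null, which keeps the earlier “nice, clean data” assumption explicit rather than silently guessing.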

    PHP’s handling of Unicode is far from perfect, but it’s a lot better than it used to be, and I hear version 6 will bring it closer to the likes of Java, doing automagical things with strings so that we don’t have to. Until then, where written language meets binary, I’m going to keep on running into little surprises like these. I dread to think what developing an application handling a right-to-left writing system like Aramaic is like…

    All this fuss over a character you can’t even see!

  • 8 months ago