Friday, December 1, 2006

UTF-8 in MySQL 5.0

To set the server default character set, use the --character-set-server=utf8 option. Or, put character-set-server=utf8 in the [mysqld] block of /etc/my.cnf.

To check the character set variables, run SHOW VARIABLES LIKE 'character_set%'; in the mysql client. A plain 'show variables' works too; just look for the variables starting with character_set.

To find out what character set a table's columns are using, do 'show full columns in my_table_name'. Look at the Collation column.

To convert all columns in a table to use utf8, do

ALTER TABLE my_table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

You can also export the data (for example, with mysqldump) and re-import it into tables that use utf8.

Sunday, November 19, 2006

hCard Search Engine

What?

  • hCards are a microformat version of vCard (contact info)
  • Microformats are "semantic web"
  • Microformats have geek chic

Why?

  • Familiarize me with AWSP / get fresh pair of eyes on it
  • Java sample app for Developer's Corner
  • Showcase uniqueness of AWSP: searching on tag contents

Demo
中国


Sunday, November 12, 2006

Saturday, November 11, 2006

Handling UTF-8 Form Posts

Problem: non-European characters aren't handled correctly in form posts.

Solution: make sure everything is UTF-8 encoded.

  1. If you're using JSP, add a charset parameter to your page directives:

    <%@ page contentType="text/html; charset=UTF-8" %>

    This sets the encoding that the server will use to encode the page, and adds a Content-Type header line to tell the browser how to decode the bytes it gets from the server. It's probably not strictly required if all you're concerned about is posting. But you want the results of your posts to display correctly, right?

  2. For POST requests:

    • In your form tags, add an accept-charset parameter:

      <form action="/foo" method="post" accept-charset="utf-8">

      This tells the browser to encode the user's form input as utf-8. Works with the Struts html:form tag, too.

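    • Add a request filter that sets the request's character encoding to utf-8 by calling request.setCharacterEncoding("UTF-8") before any parameters are read. This tells the server how to decode the form parameters correctly. Otherwise, it will try to decode them as Latin-1 (ISO-8859-1).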

      There is a sample class called SetCharacterEncodingFilter in the Apache Tomcat distribution that will work fine.

  3. For GET requests: non-ASCII characters in request parameters should be URL-encoded (percent-encoded) by your browser as UTF-8 bytes. In Tomcat, you can set URIEncoding="UTF-8" on the Connector element in conf/server.xml to make sure it decodes these bytes correctly.
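
To see why the decoding side matters, here is a minimal sketch using only the JDK (no servlet container involved). The class name Utf8ParamDemo and its helper methods are made up for illustration; they are not part of Tomcat or the servlet API. It percent-encodes a value as UTF-8 bytes, roughly what a browser sends, and then decodes it once as UTF-8 and once as Latin-1, which is roughly what happens when the server is left on its default.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Tomcat or the servlet API.
class Utf8ParamDemo {

    // Percent-encode a parameter value as UTF-8 bytes, as a browser would.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Undo the percent-encoding, interpreting the bytes in the given charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding doesn't matter.
        String typed = "\u4e2d\u56fd";

        String onTheWire = encode(typed);
        System.out.println(onTheWire); // %E4%B8%AD%E5%9B%BD

        // Decoding with the charset the browser used gives the input back.
        System.out.println(decode(onTheWire, "UTF-8").equals(typed)); // true

        // Decoding as Latin-1 turns each of the 6 bytes into its own character.
        System.out.println(decode(onTheWire, "ISO-8859-1").length()); // 6
    }
}
```

The last line is the garbage you see when the filter or the URIEncoding setting is missing: two characters went in, six unrelated ones came out.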


I just found this page, which goes into more detail.