Unicode support in websolr

heroku-4381c9a88c

26 Jan, 2012 08:55 PM via web

Is there any special configuration required to support Unicode? I'm not getting any results for Unicode queries.

Locally against the solr that comes with sunspot:

curl -d "qf=producer_text&defType=dismax&wt=ruby&q=Château" http://localhost:8982/solr/select             
{'responseHeader'=>{'status'=>0,'QTime'=>6,'params'=>{'qf'=>'producer_text','wt'=>'ruby','q'=>'Château','defType'=>'dismax'}}

Against my websolr index (note the Château):

curl -d "qf=producer_text&defType=dismax&wt=ruby&q=Château" http://index.websolr.com/solr/<api-key>/select
{'responseHeader'=>{'status'=>0,'QTime'=>3,'params'=>{'qf'=>'producer_text','wt'=>'ruby','defType'=>'dismax','q'=>'Château'}}
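
The garbled `q` value echoed above is consistent with the UTF-8 bytes of the query being decoded as ISO-8859-1 somewhere along the way. A minimal Ruby sketch of that effect follows; it only illustrates the byte-level mismatch and is not taken from websolr or Solr code.

```ruby
# The query term as typed; Ruby source files default to UTF-8.
query = "Château"

# Copy of the raw bytes a client would send for the UTF-8 encoded term.
utf8_bytes = query.b

# Relabel those same bytes as ISO-8859-1, then transcode to UTF-8 for display.
misread = utf8_bytes.force_encoding("ISO-8859-1").encode("UTF-8")

puts misread  # prints "Château"
```

The two bytes that encode "â" in UTF-8 (0xC3 0xA2) are each read as a separate ISO-8859-1 character, which gives "Ã¢".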
  1. Support Staff 2 Posted by Nick Zadrozny on 26 Jan, 2012 10:16 PM

    Interesting, I'll look into this.

    It seems like this is happening for the body of POST requests. If you pass the query with the Unicode in the URL, it gets through correctly. It may be that we just need another test case to check this.

    curl "http://index.websolr.com/solr/abcdef/select?defType=dismax&qf=producer_text&q=Château"
    
  2. 3 Posted by heroku-4381c9a88c on 26 Jan, 2012 10:24 PM

    Yep, it works for me in GET requests (I actually went and tested it in Safari and it was fine there), and your curl command works.

    In production I'm using sunspot to query the index, and I believe it's using POST there, although I'm not sure how to see the requests it's sending.

  3. Support Staff 4 Posted by Kyle Maxwell on 30 Jan, 2012 12:48 AM

    I believe we already answered this in more detail in Heroku's support forum, but the key is to set the charset in the Content-Type header of your POST to UTF-8.

  4. Kyle Maxwell closed this discussion on 30 Jan, 2012 12:48 AM.

  5. Nick Zadrozny re-opened this discussion on 30 Jan, 2012 01:11 AM

  6. Support Staff 5 Posted by Nick Zadrozny on 30 Jan, 2012 01:11 AM

    I think it was a different question that Kyle refers to, so here are those details for you as well:

    We had previously made a modification so that GET requests are interpreted as UTF-8. POST requests are interpreted using whatever charset you set explicitly in the Content-Type header, e.g. Content-Type: application/x-www-form-urlencoded; charset=UTF-8. Our default charset for POSTs is ISO-8859-1, per the servlet spec. This ends up being overridden by XML's UTF-8 assumption if your POST is XML.

    It turns out that this has been addressed in more recent versions of RSolr (~>1.0.3) included in more recent versions of Sunspot (~>1.3.0).

    https://github.com/sunspot/sunspot/issues/167
    https://github.com/sunspot/sunspot/pull/119

    Here's another example POST for you with an explicit Content-type and encoding.

    curl -d "qf=producer_text&defType=dismax&wt=ruby&q=Château" -H 'Content-type: application/x-www-form-urlencoded; charset=utf8' http://index.websolr.com/solr/abcdef/select
    {'responseHeader'=>{'status'=>0,'QTime'=>0,'params'=>{'qf'=>'producer_text','wt'=>'ruby','defType'=>'dismax','q'=>'Château'}},'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]}}
    
  7. Nick Zadrozny closed this discussion on 30 Jan, 2012 01:11 AM.
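
For readers issuing the POST from Ruby rather than curl, the explicit-charset request described in the thread could be sketched with the standard library as below. This is only an illustration: `build_select_request` is a hypothetical helper, `abcdef` is the placeholder key used in the thread, and this is not how Sunspot or RSolr build their requests internally.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a form POST whose
# Content-Type header declares UTF-8 explicitly.
def build_select_request(url, params)
  request = Net::HTTP::Post.new(URI(url))
  request["Content-Type"] = "application/x-www-form-urlencoded; charset=UTF-8"
  # encode_www_form percent-encodes the UTF-8 bytes of each value.
  request.body = URI.encode_www_form(params)
  request
end

request = build_select_request(
  "http://index.websolr.com/solr/abcdef/select",
  { qf: "producer_text", defType: "dismax", wt: "ruby", q: "Château" }
)

puts request["Content-Type"]
puts request.body
```

Unlike `curl -d`, which sends the data as given, this sketch percent-encodes the body, so the query term travels as `Ch%C3%A2teau`. The request object can then be passed to `Net::HTTP#request` to send it.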
