Tuesday, October 25, 2011

Django with Solr - Now you are a true web-dev



Apache Solr is an extremely powerful, enterprise-level search engine that can index billions of records. Anyone with experience in MySQL will know how query times can start to degrade once a table reaches around 1,000,000 rows. After doing tons of research to find an alternative for a quick and reliable search database, I stumbled upon the Apache Solr project. The general consensus about Apache Solr is that it’s lightning fast, and after using it for a recent project I definitely agree.

So this should be great news for a web developer who is looking for such a solution. Just go to the Apache Solr website, download and install the software, and you’re set, right? Wrong! Fair warning: integrating Apache Solr with your web server is a complete project in itself. It’s difficult mainly because of the lack of quality information online, so I’d like to share my knowledge with all you Apache Solr noobs so you don’t have to rip your hair out of your skull. If this guide saves just one hair follicle, then I’ve done my job. For women with mustaches, you may not want to continue reading as you may want to lose some facial hair.




The Downloads

Before I get started, I should let everyone know that this guide is mainly for Windows XP users, although there is a slight variation to the steps for anyone using the new Windows 7.

1. Download Xampp For Windows, Basic Package. http://www.apachefriends.org/en/xampp-windows.html

2. Download the Tomcat Add-on. Tomcat is a Java servlet container, and because Solr runs as a Java web application, this guide uses Tomcat to host it.

3. Download Java JDK http://java.sun.com/javase/downloads/index.jsp

4. Download Apache Solr from one of the mirrors. I got version 1.4.0, but I believe any version will do (the file names in the steps below assume 1.4.0). http://www.proxytracker.com/apache/lucene/solr/

5. Download the Solr PHP Client. http://code.google.com/p/solr-php-client/




The Installation

1. Install Xampp, and follow the instructions.

2. Install Tomcat, and follow the instructions.

3. Install the latest Java JDK.

4. There should now be a folder called /xampp on your C: drive. Open the xampp folder, find the ‘xampp-control’ application, and start it.



5. Check the ‘Svc’ box for Apache, MySQL, and Tomcat. This installs these applications as Windows services.



6. Click the ‘SCM’ button; the Windows Services window should open.



7. Find the Apache Tomcat service, then right-click it and go to ‘Properties’. Set the Startup Type to Automatic, and close the Properties window. We want Tomcat to start every time Windows boots up.




8. Now highlight Apache Tomcat in the Services window, and click the option to stop the service if it’s not already stopped. Tomcat has to be stopped for the next few steps.



9. Extract Apache Solr, then go into the /dist folder. There should be a file called apache-solr-1.4.0.war. Copy this file.



10. Now open the folder C:/xampp/tomcat/webapps/ and paste the apache-solr-1.4.0.war file into it. Rename apache-solr-1.4.0.war to solr.war.



11. Go back to the extracted Apache Solr folder, go to /example/solr/, and copy everything inside it.



12. Create a new directory in C:/xampp/ called /solr/, and paste the /example/solr/ files into it.



13. Now run C:/xampp/tomcat/bin/tomcat6w, click on the Java tab, and add the option “-Dsolr.solr.home=C:\xampp\solr” to the Java Options section. This tells Solr where to find the home directory you created in step 12.



14. Now go back to the Windows Services Window, and start Apache Tomcat.

15. Open a browser and go to “http://localhost:8080/solr/admin/” to confirm that Apache Solr installed successfully. You should see the Apache Solr administration screen. If you see a bunch of error codes instead, something went wrong; you might want to uninstall everything, start over, and follow the directions more carefully.
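
The browser check in step 15 can also be scripted. What follows is a minimal sketch in Python 3 (the snippets later in this post are Python 2), assuming Solr is served by Tomcat at the URL from step 15. The function name `solr_admin_is_up` and its injectable `opener` parameter are illustrative and not part of Solr.

```python
from urllib.error import URLError
from urllib.request import urlopen

# URL taken from step 15; adjust host and port if your Tomcat differs.
ADMIN_URL = "http://localhost:8080/solr/admin/"


def solr_admin_is_up(url=ADMIN_URL, opener=urlopen, timeout=5):
    """Return True if the Solr admin page answers with HTTP 200.

    `opener` is injectable (a hypothetical convenience) so the check
    can be exercised without a running server.
    """
    try:
        response = opener(url, timeout=timeout)
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()
```

Calling `solr_admin_is_up()` after step 14 should return `True` once Tomcat has finished deploying solr.war; a `False` result usually means Tomcat is not running or the Java option from step 13 is wrong.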






Python libraries to access Solr

Python API

There is a simple client API as part of the Solr repository: http://svn.apache.org/viewvc/lucene/solr/tags/release-1.2.0/client/python/

Note: As of version 1.3, Solr no longer comes bundled with a Python client. The existing client was not sufficiently maintained or tested as development of Solr progressed, and committers felt that the code was not up to our usual high standards of release.

solrpy

solrpy is available at The Python Package Index so you should be able to:

easy_install solrpy

Or you can check out the source code and:

python setup.py install

PySolr

There is an independent "pysolr" project available ... http://code.google.com/p/pysolr/

There is also Python Solr, an enhanced version of pysolr that supports pagination and batch operations.

insol

Another independent Solr API, focused on ease of use in large-scale production environments. It aims to be clean and fast, and is still in development.

http://github.com/mdomans/insol

sunburnt

Sunburnt is an actively-developed Solr library, both for inserting and querying documents. Its development has aimed particularly at making the Solr API accessible in a Pythonic style. Sunburnt is in active use on several internet-scale sites.

http://pypi.python.org/pypi/sunburnt

http://github.com/tow/sunburnt

Using Solr's Python output

Solr has an optional Python response format that extends its JSON output in the following ways to allow the response to be safely eval'd by Python's interpreter:

  • true and false changed to True and False
  • Python unicode strings used where needed
  • ASCII output (with unicode escapes) for less error-prone interoperability
  • newlines escaped
  • null changed to None

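Here is a simple example of how one may query Solr using the Python response format. The URL uses port 8983, the default for the Jetty example server bundled with Solr; with the Tomcat setup above, use port 8080 instead: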

from urllib2 import urlopen

# Ask Solr for the Python response format (wt=python)
conn = urlopen('http://localhost:8983/solr/select?q=iPod&wt=python')
# eval trusts the server completely; only use it against a server you control
rsp = eval( conn.read() )

print "number of matches=", rsp['response']['numFound']

#print out the name field for each returned document
for doc in rsp['response']['docs']:
  print 'name field =', doc['name']

With Python 2.6 or later you can use the ast.literal_eval function instead of eval. It only evaluates literal syntax for the built-in data types and refuses any executable code:

import ast
rsp = ast.literal_eval(conn.read())
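
To make the differences from JSON concrete, here is a small Python 3 sketch that parses a hand-written sample of a `wt=python` response with `ast.literal_eval`. The sample text and its field values are made up for illustration; a real response would come from Solr. The helper name `parse_python_response` is also just an illustrative choice.

```python
import ast

# Hand-written sample shaped like a wt=python response (illustrative data only).
SAMPLE_PYTHON_RESPONSE = """{
 'responseHeader': {'status': 0, 'QTime': 1},
 'response': {'numFound': 2, 'start': 0, 'docs': [
   {'name': 'iPod nano', 'inStock': True, 'manu': None},
   {'name': 'iPod \\u00e9dition', 'inStock': False, 'manu': 'Apple'}]}
}"""


def parse_python_response(text):
    """Parse wt=python output without executing arbitrary code."""
    return ast.literal_eval(text)


rsp = parse_python_response(SAMPLE_PYTHON_RESPONSE)
print("number of matches =", rsp['response']['numFound'])
for doc in rsp['response']['docs']:
    print("name field =", doc['name'], "| in stock:", doc['inStock'])
```

Note how `True`, `False`, `None`, and the unicode escape are all plain Python literals, which is why `literal_eval` can read them; anything that is not a literal, such as a function call, is rejected with a `ValueError`.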

Using normal JSON

Using eval is generally considered bad form and dangerous in Python. In theory if you trust the remote server it is okay, but if something goes wrong it means someone can run arbitrary code on your server (attacking eval is very easy).

It would be better to use a Python JSON library like simplejson. It would look like:

from urllib2 import urlopen
import simplejson

# Request the standard JSON response format (wt=json)
conn = urlopen('http://localhost:8983/solr/select?q=iPod&wt=json')
rsp = simplejson.load(conn)
...

Safer, and as you can see, easy.
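
Since Python 2.6 the standard library's `json` module offers the same `load`/`loads` interface as simplejson. Below is a minimal Python 3 sketch built on that module. The helper names `build_select_url` and `fetch_json` are illustrative, and the base URL is simply the one used in the snippets in this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, base=SOLR_SELECT, **params):
    """Build a select URL with the query safely URL-encoded."""
    params.update({"q": query, "wt": "json"})
    # Sorting keeps the parameter order stable from run to run.
    return base + "?" + urlencode(sorted(params.items()))


def fetch_json(url, opener=urlopen):
    """Fetch a URL and decode the JSON body into Python objects."""
    response = opener(url)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


print(build_select_url("iPod nano", rows=5))
# prints http://localhost:8983/solr/select?q=iPod+nano&rows=5&wt=json
```

Building the URL with `urlencode` avoids broken requests when the search term contains spaces or characters such as `&`. With a running Solr, `fetch_json(build_select_url("iPod"))['response']['numFound']` would give the number of matches.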


For Django developers: use a Django application

Haystack for Django

Haystack provides modular search for Django. It features a unified, familiar API that allows you to plug in different search backends (such as Solr, Whoosh, Xapian, etc.) without having to modify your code.

http://docs.haystacksearch.org/dev/toc.html
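
As a rough sketch of what pointing Haystack at Solr looks like, the settings fragment below uses the `HAYSTACK_CONNECTIONS` format from Haystack 2.x. Older 1.x releases used different setting names, so check the docs for the version you install. The Solr URL is an assumption that matches the query examples earlier in this post.

```python
# settings.py (fragment) - Haystack 2.x style configuration.
# Remember to also add "haystack" to INSTALLED_APPS.
HAYSTACK_CONNECTIONS = {
    "default": {
        # Solr backend shipped with Haystack; it relies on pysolr being installed.
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        # Assumed URL; use the address where your Solr instance is served.
        "URL": "http://localhost:8983/solr",
    },
}
```

Because the backend is selected only by this setting, switching to Whoosh or Xapian later should mean changing the `ENGINE` entry rather than your search code.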

Ubuntu users can follow this link to install Solr and access it through Haystack:

http://yuji.wordpress.com/2011/08/18/installing-solr-and-django-haystack-on-ubuntu-with-openjdk/

4 comments:

  1. Hi Priyu,

    Was interested in your Django experience for my startup. What is your email so I can contact you directly/privately?

    Thanks,
    Stavan (sshah [at] travtar [dot] com)

  2. Hi SShah .....

    You can contact me from button 'contact me' above "About Me" in this page.......

    Thanks,
    priyu.

  3. Hi priyu,

    I have started to learn python-django development. I am getting problems with configuration of djano with mysql.
    Can u help me Please?

    Thanks!
    Sarvan kumar

  4. Of course @Sarvan ......what type of problems u r getting??
