Friday, January 4, 2008

Pugoob Image Search Tool

A few weeks back I wrote an image search engine as a toy project to test Amazon's EC2. I have been busy so I didn't have time to bundle the AMI (Amazon Machine Image) until now. The AMI I have published which runs the search engine is running 32-Bit Ubuntu (Dapper).

The AMI ID is: ami-112fca78


Introduction

This image search engine uses Hausdorff distance to compare a given image with a database of existing images. The Hausdorff comparison done is a model search within an image. This implementation assumes the image being searched for is the model and the database consist of images that might contain that model. An analogy would be searching for an object in a scene, the image you submit is the object and the search engine database consist of scenes that might contain that object.


All the Haursdorff stuff comes from this paper and the implementation based on it. The only changes I made were small modifications to correct some issues with compilation due to the fact that the code is about 15 yrs old and some of the C syntax is outdated. Besides that, I made some mild changes to the make files so I can compile it on my Linux box, luckily for me everything seems to work great! Please read the readme files for usage instructions.


Breifly, how it works

Basically, you specify the error thresholds, such as I want a comparison to be considered a match if a certain fraction (should be specified) of pixels in the model are within a certain Hausdorff distance (should be specified) from some pixels in the image. There are two programs, one that does the comparison based on translations of models within the image and one that does comparison based on scaling (and translation simultaneous) of models within the image. The key to the paper /implementation above is that it can efficiently find the transformations (translation and scaling) that are likely to produce matches, once those transformations are found, it tests each of them to see if they actually produce matches.


What all this means is that if you were to supply the following two images:















The Hausdorff comparison is able to guess by how many pixels the model must be translated within the image, in order that it will overlap the circle within that image. Like wise it can guess by how much to shrink the model in order that it will be similar to the image circle. The implementation above can return false positives especially if the model image is small because it becomes easy to satisfy the thresholds. The important thing is that it doesn't miss actual matches. In other words you might get back a lot of garbage but if there is a gem it will not miss it.


Note: my implementation only does search for models that have been translated, models that have been scaled would not be found. So if you try to search for a picture of someone, if that picture is scaled somewhere else on the web, the translation program wont find it, but the scaling program would. Finding both scaled and translated transformations is computationally expensive, so I only implemented translation.


Unfortunately rotation isn't supported by the above paper, that would have made it complete and more useful. If I am bored enough one of these days I'll examine the paper closely to see if it is trivial to add rotation.

The search engine at work


I have a domain that can forward to the search engine whenever it is up. I'll only put this up occasionally. If it isn't up and running, I'll leave a schedule for when I might start it up again. Visit this test page.

How the image search engine works

How images are indexed:

  1. Start Nutch
  2. Fetch page
  3. Extract all src attribute values from img html tags.
  4. For each img tag, fetch image.
  5. For each fetched image, convert to intensity edge and save to disk.
  6. For each img, insert into database: image url, image source page url and intensity file name.

How image searching works:

  1. User submits an image to search for.
  2. Convert submitted image to intensity image and save in temp location.
  3. For each entry in database, do Hausdorff comparison between submitted image and the intensity image for the current database row. if it is a match, save in result set.

Using the AMI to search

The AMI I provided already has a small flickr.com image index on it. It doesn't have the whole of flickr!!! That would require massive compute resources. As a result, you can't go to flickr and grab some random image to search, it most likely wont work since our index is very small and doesn't include most of the images on flickr.

The way to search is to use a seed image, which would be a very small image maybe 20X20 pixels, make sure the seed image isn't uniformly black or white, otherwise the resultant intensity image wont have any black pixels. Use the seed image to run your first search which would return a lot of false positives, then you use the false positives to search for models. For example, if a false positive has a lot of people or objects in it, you can crop the image around someone or some object, then search for that person or object, the original image should be returned with a green flashing around the point where the model is located.

Ideally if the computing resources are available, the whole web could be indexed in this way and you can grab any random image on the web and try to find out where else on the web it is being used, you could be surprised by what you might find!!

Note: For the AMI I provide, the results would have a link to the flickr page on which you're supposed to find the image, however flickr is very dynamic. So most likely the image would not be on that page anymore. You can try to search for static images on the flickr homepage, maybe you'll be lucky.


For Amazon EC2 AMI users


Nutch and Hausdorff stuff is in /home/bitlooter

The code that does the search is simple php code that invokes Hausdorff using the Linux shell, that code can be found in /var/www/apache/pugoob.


The current AMI is using images index from flickr.com

To index a different site:

  1. modify: /home/bitlooter/nutch_live/urls/nutch
  2. modify: /home/bitlooter/nutch_live/conf/crawl-urlfilter.txt
  3. on command line type: export NUTCH_JAVA_HOME=/usr/local/java/jre1.5.0_14;
  4. on command line type: rm /home/bitlooter/pugoob_image_index/*.pbm
  5. on command line type: rm -R /home/bitlooter/nutch_live/crawl
  6. on command line type: mysql
  7. on the mysql prompt type: delete from pugoob.image_index;exit;
  8. on command line type: cd /home/bitlooter/nutch_live
  9. on command line type: bin/nutch crawl urls -dir crawl -depth 3 -topN 50


It will take a while to finish indexing, afterwards you should be able to search as usual.

For more information on using nutch: visit nutch home page


For those who want to hack the version on the AMI


Nutch is slightly modified to extract the img tags. The modification is done in the html-parser plugin. You may need to familiarize yourself with nutch codebase a bit. Specifically, the following files in nutch were modified:

org.apache.nutch.parse.html.HtmlParser.java

org.apache.nutch.parse.ParseData.java

The above class is a data structure that is widely used within nutch, so all references to the class were also modified to accommodate some small changes I made to the initialization constructors.

I also added a new indexing filter:

org.apache.nutch.indexer.kokoii.KokoiiIndexingFilter

The above filter does the actual image indexing, it does the intensity generation and the database insert.

send questions to:

Sunday, December 2, 2007

Let Pugoob begin!

So this is where I'll be musing about Pugoob. Pugoob is another toy project of mine. It a search tool where the query is an image instead of text. There are major performance issues involved as can be expected from any image processing task. The algorithm used for doing the actual image search isn't trivial, whether that means it can overcome the performance obstacle is yet to be determined. I am playing with Amazon's EC2 and S3 as my lab for testing.