I'm currently writing a bag-of-visual-words image retrieval system, which is similar to the Vector Space Model in text retrieval. Under this framework, each image is represented by a vector (sometimes called a histogram in the literature). Each component of the vector counts the number of times the corresponding "visual word" occurs in that image. If two images have vectors that are "close" together, they have many image features in common and are hence similar.
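To make "close" concrete, here is a minimal Java sketch that compares two visual-word histograms. It assumes cosine similarity as the measure, which is only one common choice and is not fixed anywhere in this post; the class name `BovwSimilarity` and the sample histograms are made up for illustration.

```java
// Hypothetical helper for illustration; not part of Lucene or any other library.
final class BovwSimilarity {

    /**
     * Cosine similarity between two visual-word histograms.
     * Assumption: both histograms use the same vocabulary, so they have equal length.
     */
    static double cosine(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("histograms must share one vocabulary");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // An empty histogram has no direction, so treat it as dissimilar to everything.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        int[] query     = {3, 0, 5, 1};
        int[] similar   = {2, 0, 6, 1}; // shares most visual words with the query
        int[] different = {0, 7, 0, 0}; // shares no visual words with the query
        System.out.println(cosine(query, similar));   // close to 1
        System.out.println(cosine(query, different)); // prints 0.0
    }
}
```

A score near 1 means the two images use the same visual words in similar proportions; a score of 0 means they share none.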
I'm trying to create the inverted file index for a set of such vectors. I want something that can scale from thousands of images (during the trial stage) to hundreds of thousands or a million-plus images, so a homemade data structure hack will not work.
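For reference, the structure I mean looks roughly like the toy Java sketch below: each visual word maps to a list of (image, count) postings, and a query only touches the lists of the words it contains. This is an in-memory illustration only, not something that would scale to the sizes above. All names (`ToyInvertedIndex`, `Posting`) are hypothetical, and the unnormalised dot-product scoring is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory inverted file; all names here are hypothetical.
final class ToyInvertedIndex {

    /** One posting: an image that contains the visual word, and how often. */
    static final class Posting {
        final int imageId;
        final int count;

        Posting(int imageId, int count) {
            this.imageId = imageId;
            this.count = count;
        }
    }

    // visual-word index -> postings of the images that contain that word
    private final Map<Integer, List<Posting>> postings = new HashMap<>();

    /** Adds one image; only non-zero histogram components are stored. */
    void add(int imageId, int[] histogram) {
        for (int word = 0; word < histogram.length; word++) {
            if (histogram[word] > 0) {
                postings.computeIfAbsent(word, k -> new ArrayList<>())
                        .add(new Posting(imageId, histogram[word]));
            }
        }
    }

    /**
     * Scores every image that shares at least one visual word with the query.
     * Assumption: the score is the plain dot product of the two histograms.
     */
    Map<Integer, Long> score(int[] query) {
        Map<Integer, Long> scores = new HashMap<>();
        for (int word = 0; word < query.length; word++) {
            if (query[word] == 0) {
                continue; // word absent from the query, nothing to look up
            }
            List<Posting> list = postings.get(word);
            if (list == null) {
                continue; // no indexed image contains this word
            }
            for (Posting p : list) {
                scores.merge(p.imageId, (long) query[word] * p.count, Long::sum);
            }
        }
        return scores;
    }
}
```

Images that share no visual word with the query never appear in the result map, which is the property that makes an inverted file attractive for sparse histograms.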
I've looked at Lucene but apparently it only indexes text (correct me if I'm wrong) whereas in my case I want it to index numbers (i.e. the vectors themselves). I've seen cases where people convert the vector to a text document in the following way:
<3, 6, ..., 5> --> "w1 w2... wn". Basically any component that is non-zero is replaced by a textual word "w[n]" where n is the index of that number. This "document" is then passed to Lucene to index.
The problem with this method is that the text representation of the vector does not encode how frequently each "word" occurs, so the ranking of the retrieved images would not be good.
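A small Java sketch of the conversion described above makes the loss visible. It follows the "w[n]" scheme with 1-based indices as in the example; the class and method names are hypothetical, and nothing here calls Lucene.

```java
import java.util.StringJoiner;

// Hypothetical illustration of the presence-only encoding; no Lucene calls involved.
final class PresenceEncoding {

    /** Emits "w<n>" (1-based) once for every non-zero component; the count itself is dropped. */
    static String encode(int[] histogram) {
        StringJoiner doc = new StringJoiner(" ");
        for (int i = 0; i < histogram.length; i++) {
            if (histogram[i] != 0) {
                doc.add("w" + (i + 1));
            }
        }
        return doc.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[]{3, 6, 5})); // prints "w1 w2 w3"
        System.out.println(encode(new int[]{1, 1, 1})); // prints "w1 w2 w3" as well
        System.out.println(encode(new int[]{3, 0, 5})); // prints "w1 w3"
    }
}
```

The first two histograms are quite different, yet they produce the identical "document", so an index built on this text cannot rank one above the other.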
Does anyone know of a mature indexing API that can handle vectors, or can anyone suggest a different encoding scheme for my vectors so that I can continue to use Lucene? I've also looked at the Lucene Image Retrieval (LIRE) project and tried the demo that came with it, but the number of exceptions generated when I ran the demo makes me unsure about using it.
As for the language of the API, I'm open to C++ or Java.
Thanks in advance for any replies.