Roundup Wiki

Rewriting the full-text indexer

The current RDBMS indexer is problematic if you want to import a lot of issues. I've almost finished doing an import from SF to a Roundup PostgreSQL database and the words table now has 2,237,103 rows. Richard already added a new index in 0.7.7 to speed up adding text, but hey, it's more than 2 million rows! That's problematic, new index or no new index.

So, what's my alternative? I suggest we use MySQL's and PostgreSQL's built-in full-text indexers, and fall back to our current scheme if those aren't available. I have many ambitions in life, but writing full-text indexers is not one of them. Also, I don't think we could get ours anywhere near as fast as the MySQL/PostgreSQL hackers can get theirs.

That would mean we should explicitly state the full text search won't be the same as the rest of the searches. This is already the case, as this issue illustrates, but I'd like it to be explicit.
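The fallback suggested above could be sketched roughly as follows. This is only an illustration: `NativeIndexer`, `GenericIndexer`, `choose_indexer` and the `native_available` flag are hypothetical names, not existing Roundup APIs, and how a backend would actually detect native full-text support is left open.

```python
# Hypothetical sketch: none of these names exist in Roundup.

class GenericIndexer:
    """Stand-in for the current words-table scheme."""
    name = "generic"


class NativeIndexer:
    """Stand-in for an indexer that delegates to the database's own
    full-text engine (MySQL or PostgreSQL)."""
    name = "native"


def choose_indexer(backend_name, native_available):
    """Pick the native indexer when the backend offers one,
    otherwise fall back to the generic scheme."""
    if backend_name in ("mysql", "postgresql") and native_available:
        return NativeIndexer()
    return GenericIndexer()


# A PostgreSQL install without full-text support falls back.
print(choose_indexer("postgresql", False).name)
```

The point of keeping the choice in one place is that the rest of the tracker only ever talks to "the indexer" and does not need to know which implementation it got.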

From: richard
Date: Wed, 13 Oct 2004 10:07:23 +1000
Subject: full-text indexing
Message-ID: <20041013100723+1000@www.mechanicalcat.net>

The indexer is abstracted out and we could replace it wholly in the backends that provide their own full-text indexing. I'd be perfectly happy to do so, BTW; I just never found the time to look into it.

As for the searches being slightly different - I really don't care - it's not like we're going to have people comparing different backends' full-text searching :)

Indexer goals and non-goals:

Indexer internal design issues:

    class Indexer:
        def __init__(self, db):
            self.db = db
            self.should_reindex = 0

        # Formerly add_text: I removed the `text` argument, because some indexers
        # (e.g. tsearch2) can only index data already in the database, so passing
        # the text again would be clutter.
        def indexProperty(self, identifier):
            """Index the property identified by `identifier`.

            `identifier` is (classname, itemid, property)
            """
            raise NotImplementedError

        def search(self, search_terms, klass):
            """Display search results looking for the words in the iterable
            `search_terms` associated with the hyperdb Class `klass`.
            """
            raise NotImplementedError

        # Moved from the different backends to remove duplication.
        def reindex(self):
            for klass in self.db.classes.values():
                for nodeid in klass.list():
                    klass.index(nodeid)
            # The backends called self.indexer.save_index(); inside the indexer
            # itself that is self.save_index(), which subclasses are assumed
            # to provide.
            self.save_index()

        # See this a lot in the various backends.
        def reindexIfNecessary(self):
            if self.should_reindex:
                self.reindex()
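To make the interface above more concrete, here is a toy in-memory indexer that follows the same two methods. It is a sketch under stated assumptions, not a proposed implementation: `InMemoryIndexer` and the `texts` mapping are made up for illustration, `texts` stands in for the database (since `indexProperty` only receives an identifier), and `search` takes a class name string rather than a hyperdb Class to keep the example self-contained.

```python
class InMemoryIndexer:
    """Toy indexer following the sketched interface (hypothetical)."""

    def __init__(self, texts):
        # `texts` maps (classname, itemid, property) -> stored text and
        # plays the role of the database in this example.
        self.texts = texts
        self.words = {}  # word -> set of identifiers containing it

    def indexProperty(self, identifier):
        # Drop stale entries for this identifier, then index the stored text.
        for ids in self.words.values():
            ids.discard(identifier)
        for word in self.texts[identifier].lower().split():
            self.words.setdefault(word, set()).add(identifier)

    def search(self, search_terms, classname):
        # Return identifiers of `classname` whose text contains every term.
        result = None
        for term in search_terms:
            ids = {i for i in self.words.get(term.lower(), set())
                   if i[0] == classname}
            result = ids if result is None else result & ids
        return result or set()


texts = {
    ("issue", "1", "title"): "Indexer is slow",
    ("issue", "2", "title"): "Slow import from SF",
    ("msg", "1", "content"): "slow",
}
indexer = InMemoryIndexer(texts)
for identifier in texts:
    indexer.indexProperty(identifier)

print(sorted(indexer.search(["slow"], "issue")))
```

Note that `indexProperty` never sees the text itself; it looks it up by identifier, which is the property that lets a database-side indexer such as tsearch2 fit behind the same method.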

Modifications to standard indexer