Rewriting the full-text indexer
The current rdbms indexer is problematic if you want to import a lot of issues. I've almost finished doing an import from SF to a Roundup postgresql database and the words table now has 2.237.103 rows. Richard already added a new index to 0.7.7 to speed up adding text, but hey, it's more than 2 million rows! That's problematic, new index or no new index.
So, what's my alternative? I suggest we use MySQL's and PostgreSQL's built in full text indexers, and fall back to our current scheme if those aren't available. I have many ambitions in life, but writing full-text indexers is not one of them. Also, I don't think we could get it anywhere near as fast as the MySQL/PostgreSQL hackers can.
That would mean we should explicitly state the full text search won't be the same as the rest of the searches. This is already the case, as this issue illustrates, but I'd like it to be explicit.
From richard Wed Oct 13 10:07:23 +1000 2004 From: richard Date: Wed, 13 Oct 2004 10:07:23 +1000 Subject: full-text indexing Message-ID: <firstname.lastname@example.org>
The indexer is abstracted out and we could replaced it wholly in the backends that provide their own full-text indexing. I'd be perfectly happy to do so, BTW, I just never found the time to look into it.
As for the searches being slightly different - I really don't care - it's not like we're going to have people comparing different backends' full-text searching
Indexer goals and non-goals:
- Simple to install: users should only have to install tsearch2 in PostgreSQL and set a config option to tell Roundup to use tsearch2.
richard: I'm leery of asking users to install more software...
johannes: That's why I don't want to get rid of the old-style full-text indexer, but only provide tsearch2 as an option. That's the way Roundup usually does it: installation is very simple, but if you need more scalability, you can add a new backend/mod_python/tsearch2. Note, by the way, that tsearch2 is part of the postgresql distribution, it just isn't enabled by default. If we choose to use tsearch2, I'll provide installation instructions for source distributions and Debian (apt-get install postgresql-contrib ;).
- If a new indexer requires installing new software (e.g. tsearch2), we should also keep providing an indexer that works without the extra software.
- Switching between indexers inside backends should only require a reindex.
The full-text indexer doesn't have to give the exact same results as the normal search, nor do the results have to be the same across backends. Some similarity is nice, though.
Indexer internal design issues:
- tsearch2 can't index non-RDBMS data. That means we'd need to either move all message text (and other text file attachments) into the database, or figure how to make tsearch2 index on-disk data.
johannes: I would choose the first option, if only because I have a clue as to how to move text into the database, and don't have a clue how to extend tsearch2.
- The implementation would be drastically simplified by getting rid of the 'property' level of indexing. With property level indexing, every indexed column would have to get its own index column (or maybe we would have to add another table), just so we can keep the properties apart. If we only keep the 'item' level, we'd only need one column per table. The only place we're using the property level indexing right now is the 'ignore' argument to Indexer.search(). Fortunately, no Roundup code uses that argument. So I would like us to get rid of property level indexing, unless someone can think of some better use cases.
richard: hmmm, I can't recall why it was added.
- The indexer API isn't well defined. I would like to see an indexer_common module with a base Indexer class, which defines the (very) minimal set of methods every Indexer should have. However, an indexer is closely tied to a particular database, interchangeability is not much of an issue, so we should allow backend-specific public functions. A suggested minimal API with some helper methods:
1 class Indexer: 2 def __init__(self, db): 3 self.db = db 4 self.should_reindex = 0 5 6 # Formerly add_text: I removed the `text` argument, because some indexers 7 # (e.g. tsearch2) can only index data already in the database, so passing 8 # the text again would be clutter. 9 def indexProperty(self, identifier): 10 """Index the property identified by `identifier`. 11 12 `identifier` is (classname, itemid, property) 13 """ 14 raise NotImplementedError 15 16 def search(self, search_terms, klass): 17 """Display search results looking for the words in the iterable 18 `search_terms` associated with the hyperdb Class `klass`. 19 """ 20 raise NotImplementedError 21 22 # Moved from the different backends to remove duplication. 23 def reindex(self): 24 for klass in self.db.classes.values(): 25 for nodeid in klass.list(): 26 klass.index(nodeid) 27 self.indexer.save_index() 28 29 # See this a lot in the various backends. 30 def reindexIfNecessary(self): 31 if self.should_reindex: 32 self.reindex()
richard: yes, the current way indexer_rdbms extends indexer_dbms is wrong.
johannes: I think this is the full proposal. I'll implement it if you want to see it in Roundup 0.8.
richard: I think it's worthwhile.
Modifications to standard indexer
These may not be needed if we can hook in the built-in indexing in mysql and postgresql, but it might be worth investigating an adaptive stop-words list (anything over 1000 or so entries) and using a stemmer though the latter will only help English, and may screw up other languages.