some mnogosearch thoughts:
- the documentation is out of date and generally unclear. (at both ends of the scale—it's missing some things that would help a new user get going, and the information to stretch it in more interesting ways is hard to find.)
- the order in which Disallow and Allow configuration options are processed is not documented. (the first rule matched from indexer.conf will be used.)
- the minimal configuration file is so minimal, it lacks the bits that cause page content to actually get indexed. (you need the various Section lines that you can grab from etc/indexer.conf.)
- the indexer doesn't have a mode that corresponds to "index the whole site right now." it only indexes pages that are new or expired when the indexer is started, and records the addresses of new pages it finds so they will be indexed later.
- while it checks robots.txt files, it doesn't use that information to avoid storing a url in its table of urls to be visited. (it just deletes the url when it goes to index that page and realizes that robots.txt disallows it.)
- oh yeah, the robots.txt support is broken in the most recent version. (here is a patch to fix this.)
- there's no simple command-line search tool. you have to run the cgi version and deal with the html output.
all that said, the results seem to be pretty good, and searching is fast (using mysql for the backend, of course). once you've figured out how the disallow and allow rules work, they appear to allow for more flexibility than htdig does.