Solr/Lucene based
on-premise Search Service 2

latest edit Jan 2022


    This project contains the reinvented on-premise Search Service running on Linux, using the Solr/Lucene search engine. It replaces an earlier project which used the Apache Nutch crawler. A description of the facility as a whole and installation/configuration instructions are in the downloadable bundle offered below.

    The material offers a web-based crawler configuration editor, a compact command-line crawler (crawls both web and local files), a web-based user query program, and a schema bundle for Solr. The query program is designed to work smoothly with both desktop and mobile devices, use minimal network bandwidth, offer multiple indices and be easy to use.

    The suite is small and efficient (Nutch was neither), uses few resources, has a flexible file selection filter, can replace or add incrementally to a Solr index, paces material into Solr to keep resource consumption low, and scales to hundreds of thousands of documents in an individual crawl.

    Components (crawler config web program plus a crawler, Solr/Lucene engine(s), many copies of query web programs) may be located together or on different machines, may use one or more Solr indexers and schemas, and can run without interference on ordinary Linux production servers.

    Attributes include indexing both local and web-based remote file stores, with access authentication if required. An index can store a replacement access pathway for each result rather than the crawler's own pathway, and each query program can show users a chosen file access pathway which may differ from that stored in the index. Complex queries are allowed using Solr's dismax parser (permits regular expressions, selecting any or all components, specifying must-have or must-ignore terms, etc). Each query program can be configured to deal with only certain indices, and that configuration cannot be overridden by query users.
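    As a rough illustration of the last two points, here is a minimal Python sketch (not the project's PHP code) of a query front end which asks Solr for its dismax parser and refuses any index outside a fixed list. The function name, the index names and the field names are hypothetical; defType, q, qf, rows, start and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_dismax_url(solr_base, core, user_query, fields, allowed_cores,
                     rows=10, start=0):
    """Return a Solr select URL for a dismax query against a permitted index."""
    # The permitted list is fixed by whoever installs the query program,
    # so a query user cannot reach any other index (hypothetical names).
    if core not in allowed_cores:
        raise ValueError(f"index {core!r} is not offered by this query program")
    params = {
        "defType": "dismax",     # select Solr's dismax query parser
        "q": user_query,         # e.g. '+solr -nutch' (must-have, must-ignore)
        "qf": " ".join(fields),  # fields to search
        "rows": rows,            # number of results per page
        "start": start,          # offset of the first result shown
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"
```

    For example, build_dismax_url("http://localhost:8983/solr", "docs", "+solr -nutch", ["title", "content"], {"docs"}) yields a URL for the "docs" index, while any index name outside the set raises an error before Solr is contacted.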

    File kinds include txt, html, PDF and both MS Office and LibreOffice suites. For each index, a manager can choose which files are accepted and which are ignored. The crawler can be instructed to add only recent files to an existing index, thus performing incremental updates.
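    The selection idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the accept list, the function name and the use of file modification time to mean "recent" are all hypothetical, and the real crawler's rules may differ.

```python
import os

# Hypothetical accept list; a manager would choose this per index.
ACCEPTED_EXTENSIONS = {".txt", ".html", ".pdf", ".docx", ".odt"}

def is_selected(path, mtime, accepted=ACCEPTED_EXTENSIONS, newer_than=None):
    """Decide whether a file should be sent to the index.

    path       -- file name or URL path
    mtime      -- file modification time, seconds since the epoch
    newer_than -- if given, accept only files modified after this time
                  (an incremental crawl); None means a full crawl
    """
    extension = os.path.splitext(path)[1].lower()
    if extension not in accepted:
        return False                 # file kind not accepted for this index
    if newer_than is not None and mtime <= newer_than:
        return False                 # unchanged since the previous crawl
    return True
```

    A full crawl calls is_selected(path, mtime) for every file; an incremental crawl passes the time of the previous crawl as newer_than so that only newer files are added.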

    The code is written in PHP v5.3 and is compatible with PHP v7.x. Solr itself requires Java JDK v1.8 or later. If using Oracle's JDK v1.8 (not the JRE) then this project can isolate that from the rest of the machine (thus avoiding Java version conflicts). IBM Java JRE v1.8 has worked well in my tests; OpenJDK is also accepted.

    The schema bundle, myconf.tar.gz, was updated on 29 Sept 2017 to include a revised bundle appropriate to current Solr v7 and v8. The Installation presentation has been enhanced to show how to recreate such a bundle in the future. Crawler.php was tweaked to accommodate both PHP v5.3 and newest v7.x.

    Further small enhancements, as version 2.1, were made to both crawler.php and the Solr schema bundle in Feb 2017. Both conserve disk space and ease installation. On 28 May 2018 two features were introduced: a logging configuration option (yes or no), and automatic index optimization when the number of deleted files in an index exceeds the value MAXDELETEDDOCS (default 100, settable in crawler.php). The latter compensates for the removal of the Optimize button from Solr's admin web page.
Update: Solr v8.1 has removed even this capability: the team has a blind spot about cleanups.
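The MAXDELETEDDOCS test amounts to a simple comparison. The Python sketch below is an illustration only, not the code in crawler.php; it assumes the two counts have already been read from Solr's index statistics, where maxDoc counts all document slots including deleted ones and numDocs counts live documents, so their difference is the number of deleted documents.

```python
MAXDELETEDDOCS = 100  # default value named above; settable in crawler.php

def deleted_docs(max_doc, num_docs):
    """Deleted documents still occupying space in the index."""
    return max_doc - num_docs

def needs_optimize(max_doc, num_docs, threshold=MAXDELETEDDOCS):
    """True when the deleted count exceeds the threshold (strictly greater)."""
    return deleted_docs(max_doc, num_docs) > threshold
```

With the default threshold, an index holding 1000 live documents is left alone at maxDoc 1100 and becomes a candidate for optimization at maxDoc 1101.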
June 2019: add a query results start value slider control (works on modern browsers).
5 Sept 2019, small adjustments to config.php and query.php to support reaching Solr itself via a web proxy which may require credentials, and expose document length.
11 Dec - fix bug with highlighted text.
July 2020, remove rubbish characters by replacing non-printable characters with a dot.
Sept 2020, better cleaning up of content snippet and correct bug in file crawling.
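
The July 2020 cleanup can be pictured with the following Python sketch. It is an assumption about the approach, not the project's PHP code: every character which Python regards as non-printable becomes a dot, and the choice to keep newlines and tabs is mine.

```python
def clean_text(text, keep="\n\t"):
    """Replace non-printable characters with a dot.

    Characters listed in 'keep' (newline and tab by default, an assumption)
    pass through unchanged even though they are not printable.
    """
    return "".join(
        ch if ch.isprintable() or ch in keep else "."
        for ch in text
    )
```

Ordinary text, including accented letters, passes through untouched; control bytes such as NUL or BEL left over from binary file formats turn into dots.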

IMPORTANT: currently Solr 8.6.0 has significant internal problems, apparently within its Jetty material. Thus for now I recommend staying with Solr 8.5.0 until matters are resolved.
Update Nov 2020: Solr 8.7.0 has cured these problems and does work as we wish.
Update Aug 2021: Add configuration control extentionless (yes/no) to control indexing of files which lack a filename extension (web pages themselves are examples, since their names are often just paths). Default is no. Plus minor fixups here and there.
Sept 2021, add user control for length of shown text fragment.
Dec 2021, add to the installation instructions a fix (one added line in file solr.in.sh) for the Log4j2 security blunder CVE-2021-44228. Please see https://logging.apache.org/log4j/2.x/security.html for current full details.

    Open Horizons has published an article about this work. See
    https://ohmag.net/an-on-premise-search-service or a local copy.

    This material is open source, uses the Apache license, with no warranty. Joe Doupnik
jrd@netlab1.net

© Joe Doupnik 2016