Saturday, November 14, 2015

Indexing and Searching through Lucene

Why Lucene in WSO2 Data Analytics Server?

A common use case for Lucene indexing in Data Analytics Server (DAS) is full-text search over the data of one or more persisted event streams. DAS provides interactive data analysis, meaning that a stored dataset can be queried in an ad hoc manner to find useful information quickly and accurately, and it lets you search for persisted events using the Data Explorer.

What is Lucene?

Lucene is a rich and powerful full-text search (information retrieval) library written in Java. You can use it to provide full-text indexing across both database objects and documents in various formats. Lucene searches over documents, where a document is essentially a collection of fields and each field is a name-value pair.

The basic concept behind Lucene is to take a dataset and place it in fields that are stored, indexed, or both. Indexed means you can search against the field. Stored means the field's original value is kept in the index so that you can retrieve its contents from a search result; a field that is stored but not indexed can be retrieved but not searched. Such stored-only fields are primarily used for metadata that is returned with results but never queried.
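
To make the indexed/stored distinction concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene's API: the classes ToyField, ToyDocument and ToyFieldDemo, and the sample field names and values, are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not part of Lucene.
final class ToyField {
    final String name;
    final String value;
    final boolean indexed; // can be searched
    final boolean stored;  // can be read back from a search result

    ToyField(String name, String value, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
    }
}

final class ToyDocument {
    private final List<ToyField> fields = new ArrayList<>();

    ToyDocument add(ToyField field) {
        fields.add(field);
        return this;
    }

    // A search can only match fields that were indexed.
    boolean matches(String fieldName, String value) {
        for (ToyField f : fields) {
            if (f.indexed && f.name.equals(fieldName) && f.value.equalsIgnoreCase(value)) {
                return true;
            }
        }
        return false;
    }

    // Only stored fields can be retrieved; returns null otherwise.
    String storedValue(String fieldName) {
        for (ToyField f : fields) {
            if (f.stored && f.name.equals(fieldName)) {
                return f.value;
            }
        }
        return null;
    }
}

final class ToyFieldDemo {
    static ToyDocument sample() {
        return new ToyDocument()
                .add(new ToyField("title", "Modern Operating Systems", true, true)) // indexed and stored
                .add(new ToyField("body", "processes", true, false))                // indexed only
                .add(new ToyField("shelf", "B-12", false, true));                   // stored only
    }

    public static void main(String[] args) {
        ToyDocument doc = sample();
        System.out.println(doc.matches("body", "processes")); // indexed, so searchable
        System.out.println(doc.storedValue("body"));          // not stored, so nothing to retrieve
        System.out.println(doc.matches("shelf", "B-12"));     // not indexed, so never matches
        System.out.println(doc.storedValue("shelf"));         // stored, so retrievable
    }
}
```

The real library makes the same decision per field when the document is built, so it is worth deciding up front which fields users will search and which they only need to see in results.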

You can retrieve the dataset stored in the database, put it into fields (as name-value pairs), put those fields into a "document", and then add the document to the index. The index is a set of files on disk or in memory. An index consists of multiple files, and these files are platform independent.

Searching and Indexing through Lucene

Lucene is able to retrieve information quickly and efficiently because it searches an index rather than the text itself. This is the equivalent of finding the pages of a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

What actually gets indexed is a set of terms. A term (e.g. title:"Modern") combines a field name with a token that may be used for search. For instance, a title field like Modern Operating Systems, 2nd Edition might yield tokens such as modern, operat, 2, and edition after case normalization, stemming and stoplisting. The index structure provides the reverse mapping from terms, consisting of field names and tokens, back to documents. This type of index is called an inverted index, because it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages).
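
The idea of terms and the inverted index can be sketched in a few lines of plain Java. This is a toy illustration under stated assumptions, not how Lucene is implemented: the class ToyInvertedIndex, its stop-word list and the sample titles are hypothetical, and stemming is left out to keep the sketch short (so "operating" is not reduced to "operat" here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; hypothetical names, not Lucene's implementation.
final class ToyInvertedIndex {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // term ("field:token") -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Case normalization, splitting on non-alphanumerics, stop-word removal.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Record every term of the field value against the document id.
    void add(int docId, String field, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look the term up directly instead of scanning every document.
    Set<Integer> search(String field, String token) {
        String term = field + ":" + token.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(term, new TreeSet<>());
    }

    static ToyInvertedIndex sample() {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "title", "Modern Operating Systems, 2nd Edition");
        index.add(2, "title", "The Design of the UNIX Operating System");
        index.add(3, "title", "Modern Compiler Design");
        return index;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = sample();
        System.out.println(analyze("Modern Operating Systems, 2nd Edition"));
        System.out.println(index.search("title", "Modern"));    // documents 1 and 3
        System.out.println(index.search("title", "operating")); // documents 1 and 2
    }
}
```

Note that the query token goes through the same case normalization as the indexed text; if the two sides are analyzed differently, terms will not line up and searches will silently miss.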

The following diagram shows how the indexing process happens in Lucene.
In WSO2 DAS, events published by data agents through event receivers can be persisted in an RDBMS such as MySQL. When Lucene indexing is performed, the rows of those RDBMS tables are denormalized into Lucene Documents.

The pseudo code will look something like this:

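import java.sql.ResultSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Fragment: assumes an open java.sql.Statement (stmt) and a Lucene IndexWriter (writer); field classes are from Lucene 4.0 or later
// The SQL query to be performed
String sql = "SELECT DISTINCT processInstanceId, duration FROM PROCESS_USAGE_SUMMARY";
// ResultSet to hold the data retrieved from the database
ResultSet rs = stmt.executeQuery(sql);
while (rs.next()) {
    Document doc = new Document();
    // Tokenized (analyzed), indexed and stored
    doc.add(new TextField("processInstanceId", rs.getString("processInstanceId"), Field.Store.YES));
    // Indexed as a single un-tokenized term and stored; the value must be passed as a String
    doc.add(new StringField("duration", String.valueOf(rs.getLong("duration")), Field.Store.YES));
    // ... repeat for each column in result set
    writer.addDocument(doc);
}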

A search operation involves creating a Query (usually via a QueryParser) and handing it to an IndexSearcher, which returns the matching hits. The hits identify the documents that satisfy the query; you then extract the stored information from those documents and display the results. In WSO2 DAS, you can build the query in the format DAS expects (a JSON string) and pass it to the DAS REST API, which returns the results in JSON format.
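
As a rough sketch of the last step, the snippet below only builds the JSON body for a search request and does not send it. The class SearchRequestSketch and the JSON field names (tableName, query, start, count) are assumptions made for illustration; check the DAS REST API documentation for the exact request format and endpoint before relying on them.

```java
// Hypothetical helper: builds the JSON body of a search request as a String.
// The field names (tableName, query, start, count) are assumptions, not a confirmed DAS contract.
final class SearchRequestSketch {
    // Escape backslashes and double quotes so the value is a valid JSON string.
    // Control characters are not handled; a JSON library would be the safer choice in real code.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String searchBody(String table, String luceneQuery, int start, int count) {
        return "{\"tableName\":" + jsonString(table)
                + ",\"query\":" + jsonString(luceneQuery)
                + ",\"start\":" + start
                + ",\"count\":" + count + "}";
    }

    public static void main(String[] args) {
        // A Lucene query string restricted to one field of the table used earlier.
        System.out.println(searchBody("PROCESS_USAGE_SUMMARY", "duration:42", 0, 10));
    }
}
```

The Lucene query itself travels as an ordinary string inside the JSON, which is why any quotes in it (for phrase queries, for example) have to be escaped.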

The Lucene query language lets the user specify which field or fields to search, which fields to give more weight (boosting), and how to combine clauses with boolean operators (AND, OR, NOT), among other features. For more details, see the Lucene query parser syntax documentation.
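
One way to picture the boolean operators is as set operations on the lists of document ids that match each term. The sketch below is plain Java with hypothetical names (ToyBooleanQuery and the sample id lists); it illustrates the idea only and says nothing about how Lucene evaluates or scores boolean queries internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy boolean operators over sets of matching document ids; not Lucene code.
final class ToyBooleanQuery {
    // a AND b: documents that match both terms
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a OR b: documents that match either term
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // a NOT b: documents that match a but not b
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    static Set<Integer> ids(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        Set<Integer> modern = ids(1, 3);    // documents assumed to match title:modern
        Set<Integer> operating = ids(1, 2); // documents assumed to match title:operating
        System.out.println(and(modern, operating)); // [1]
        System.out.println(or(modern, operating));  // [1, 2, 3]
        System.out.println(not(modern, operating)); // [3]
    }
}
```

In query syntax these correspond to title:modern AND title:operating, title:modern OR title:operating, and title:modern NOT title:operating.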

References