Architecture of Apache Lucene


The architecture of Lucene consists of the following building blocks:

Document: It is the main data carrier used during indexing and search. It contains one or more fields, which hold the data we put into and get from Lucene.

Field: It is a section of the document which is built of two parts: the name and the value.

Term: It is a unit of search representing a word from the text.

Token: It is an occurrence of a term in the text of a field. It consists of the term text, start and end offsets, and a type.
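
The relationship between these four concepts can be sketched in a few lines of Python. This is only a toy model, not Lucene's API: the Token class, the tokenize function, the "word" type label, and the sample document are all assumptions made for illustration, and the regular expression stands in for a real analyzer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One occurrence of a term inside a field value (hypothetical model)."""
    term: str   # normalized term text
    start: int  # offset of the first character in the field value
    end: int    # offset one past the last character
    type: str   # token type label (assumed to be "word" here)


def tokenize(text):
    """Split on runs of word characters and lowercase each match."""
    return [
        Token(m.group().lower(), m.start(), m.end(), "word")
        for m in re.finditer(r"\w+", text)
    ]


# A document is modeled as a mapping from field name to field value.
doc = {
    "title": "Lucene in Action",
    "body": "Lucene indexes text. Lucene searches text.",
}

body_tokens = tokenize(doc["body"])

# Several tokens can share the same term: the term is the unit of search,
# the token is one concrete occurrence of it.
unique_terms = sorted({t.term for t in body_tokens})
```

In this sketch the body field yields six tokens but only four distinct terms, because "lucene" and "text" each occur twice at different offsets.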

Apache Lucene writes all of this information to a structure called an inverted index. It is a data structure that maps the terms in the index to the documents that contain them, not the other way round as a relational database does. You can think of an inverted index as a data structure in which the data is term oriented rather than document oriented.
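
To make the term-oriented layout concrete, here is a minimal sketch of an inverted index built with plain Python dictionaries. It assumes whitespace tokenization and lowercasing, and the function names build_inverted_index and search are hypothetical; Lucene's real index format stores much more (positions, frequencies, norms) and is far more compact.

```python
from collections import defaultdict

# Sample documents, keyed by a document identifier (illustrative data).
docs = {
    1: "Lucene is a search library",
    2: "An inverted index maps terms to documents",
    3: "Lucene builds an inverted index",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Look a term up directly; no document has to be scanned."""
    return sorted(index.get(term.lower(), ()))


index = build_inverted_index(docs)
lucene_hits = search(index, "Lucene")
inverted_hits = search(index, "inverted")
```

A lookup goes straight from the term to its document list, which is why the structure is described as term oriented.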


Each index is divided into multiple segments that are written once and read many times. Once a segment has been written to disk, it cannot be updated. For example, information about deleted documents is stored in a separate file, while the segment itself stays unchanged. However, multiple segments can be merged together in a process called segment merging. Merging happens either when it is forced explicitly or when Lucene decides that it is time to merge; in both cases Lucene combines smaller segments into larger ones. Merging can be I/O demanding, but it is also the point at which information that is no longer needed, such as deleted documents, is actually removed. In addition, searching one larger segment is faster than searching multiple smaller ones that hold the same data. Still, because segment merging is an I/O demanding operation, you shouldn't force it; configure your merge policy carefully instead.
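
The segment life cycle described above can be sketched with a toy model. Everything here is an assumption for illustration: the Segment class, the merge and search functions, and the in-memory "deleted" set that stands in for the separate file of deletions. It is not how Lucene implements segments, only a picture of the write-once idea.

```python
class Segment:
    """Write-once segment: its documents are fixed after construction."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> text, never modified afterwards
        self.deleted = set()    # kept apart from the data, like a deletes file

    def delete(self, doc_id):
        # Deleting only records the id; the stored data is untouched.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def live_docs(self):
        return {d: t for d, t in self.docs.items() if d not in self.deleted}


def merge(segments):
    """Write a new, larger segment that contains only live documents."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return Segment(merged)


def search(segments, term):
    """Every segment must be consulted, so fewer segments means less work."""
    hits = []
    for seg in segments:
        for doc_id, text in seg.live_docs().items():
            if term in text.lower().split():
                hits.append(doc_id)
    return sorted(hits)


s1 = Segment({1: "lucene segments", 2: "old document"})
s2 = Segment({3: "lucene merge"})
s1.delete(2)  # s1 still physically holds document 2

hits_before = search([s1, s2], "lucene")
merged = merge([s1, s2])
hits_after = search([merged], "lucene")
```

The search results are the same before and after the merge, but the merged segment no longer carries the deleted document, and a query has one segment to visit instead of two.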