So in Part I, I promised some detail about how to index and display line numbers in the search results. The code examples I provide may be a bit messy and are not very modular at all, but that's okay because they're short, to the point, and therefore should make fairly decent examples. I hope. I don't expect anyone to copy my code verbatim, but rather I hope that anyone who needs this functionality out of Lucene.Net will use them as the guide I didn't have when figuring this out for myself.
First, you will need to ensure that offset data is stored as part of the indexed content. When creating the Lucene.Net Field, include the Field.TermVector.WITH_POSITIONS_OFFSETS flag as a parameter:
return new Field("content", new StreamReader(filePath), Field.TermVector.WITH_POSITIONS_OFFSETS);
Originally, for generating previews, I used the Highlighter.Net contrib (which is distributed as part of Lucene.Net) to turn the search results into HTML fragments that I could assemble into a document for display. This works fine for basic display of the search results, but it doesn’t provide any means of getting the offset information as line/column data, so there is no way to display line numbers as part of the results. Therefore, I had to build a means of formatting the results from scratch.
First, I created a class that “explodes” the original text into an array of lines. I won’t go into too much detail about how this class is built (it should be easy enough to figure out), but here’s the method that does the splitting, which should be fairly straightforward.
public static string[] Explode(string text)
{
    string[] explodedText = text.Replace("\r\n", "\n").Split('\n');
    // Remove the trailing empty line that occurs when the text ends with a newline.
    if (explodedText[explodedText.Length - 1].Length == 0)
        Array.Resize<string>(ref explodedText, explodedText.Length - 1);
    return explodedText;
}
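For anyone who wants a starting point for the rest of the class, here is a minimal sketch of what the wrapper around Explode might look like. The member names Lines, LineCount, the indexer, and _lineLength are the ones the later snippets rely on; everything else (the _lines field, the constructor shape) is my assumption for illustration and not necessarily how the original class was built.

```csharp
using System;

// Hypothetical sketch of the ExplodedText wrapper. Only Lines, LineCount,
// the indexer and _lineLength are names used elsewhere in this article.
public class ExplodedText
{
    private readonly string[] _lines;
    private readonly int[] _lineLength; // length of each line, without its line break

    public ExplodedText(string text)
    {
        _lines = Explode(text);
        _lineLength = new int[_lines.Length];
        for (int i = 0; i < _lines.Length; i++)
            _lineLength[i] = _lines[i].Length;
    }

    public string[] Lines { get { return _lines; } }
    public int LineCount { get { return _lines.Length; } }
    public string this[int index] { get { return _lines[index]; } }

    public static string[] Explode(string text)
    {
        string[] explodedText = text.Replace("\r\n", "\n").Split('\n');
        // Drop the last element only when it is the empty string left by a trailing newline.
        if (explodedText[explodedText.Length - 1].Length == 0)
            Array.Resize(ref explodedText, explodedText.Length - 1);
        return explodedText;
    }
}
```

Keeping the line lengths in their own array means the offset lookup never has to touch the strings themselves.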
And here’s the logic that gets the line and column positions of the specified offset (which is based on the original text). There is probably a more graceful way of doing this, but this was the quick and dirty method I wrote to get it working:
public void GetPosition(int offset, out int line, out int column)
{
    int charpos = offset;
    line = column = -1;
    for (int i = 0; i < _lineLength.Length; i++)
    {
        if (charpos < _lineLength[i])
        {
            line = i;
            column = charpos;
            break;
        }
        else
            charpos -= (_lineLength[i] + 2); // +2 for the missing \r\n
    }
}
The method loops through each line and checks if the offset falls within that line. If it does, set the line and column out parameters and break out of the loop; otherwise, keep looking.
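One possible "more graceful" variant, sketched here as an alternative rather than as what the original code does: record the offset at which each line starts, then binary-search that table. Because the table is built from the raw text, it does not need the +2 adjustment and works for both \r\n and \n line endings. The LineIndex class name and its _lineStarts field are hypothetical names of my own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original code): maps a character offset
// to a zero-based line and column using a table of line start offsets.
public class LineIndex
{
    private readonly int[] _lineStarts;

    public LineIndex(string text)
    {
        List<int> starts = new List<int>();
        starts.Add(0); // the first line always starts at offset 0
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\n')
                starts.Add(i + 1); // the next line begins right after the newline
        }
        _lineStarts = starts.ToArray();
    }

    public void GetPosition(int offset, out int line, out int column)
    {
        if (offset < 0)
            throw new ArgumentOutOfRangeException("offset");

        int index = Array.BinarySearch(_lineStarts, offset);
        // A negative result is the bitwise complement of the index of the first
        // larger element, so the containing line is the one just before it.
        line = index >= 0 ? index : ~index - 1;
        column = offset - _lineStarts[line];
    }
}
```

The lookup is O(log n) per offset instead of O(n), which only starts to matter for large files with many hits.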
The Exploder gets called once we have our Hits object from the Lucene searcher, within a loop that gets the document for each hit. The original document is read in and exploded.
Next, we get a list of the search hit tokens, which we will use to locate each hit in the document, both for formatting the fragment and for adding line data:
List<PositionedToken> tokenPositions = GetTokenPositions(parser.GetAnalyzer().TokenStream("content", new StreamReader(filePath)), explodedText);
parser is of course the original QueryParser.
Here’s what GetTokenPositions looks like:
private List<PositionedToken> GetTokenPositions(TokenStream tokenstream, ExplodedText explodedtext)
{
    List<PositionedToken> tokenPositions = new List<PositionedToken>();
    Token token;
    while ((token = tokenstream.Next()) != null)
    {
        int line, column;
        explodedtext.GetPosition(token.StartOffset(), out line, out column);
        tokenPositions.Add(new PositionedToken(line, column, token));
    }
    return tokenPositions;
}
PositionedToken is a lightweight class that simply stores the line and column position of the start of the token, the token length, and a reference to the original Token object.
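Since the class itself isn't shown, here is a rough sketch of how it could look. To keep the sketch compilable without a Lucene.Net reference, it uses a stand-in StubToken type of my own invention; in the real project that would be Lucene.Net's Token, and I am assuming its EndOffset() accessor is available alongside the StartOffset() call used above.

```csharp
using System;

// Stand-in for Lucene.Net's Token so this sketch compiles on its own.
// In the real project, use Lucene.Net.Analysis.Token instead.
public class StubToken
{
    private readonly int _start;
    private readonly int _end;

    public StubToken(int start, int end)
    {
        _start = start;
        _end = end;
    }

    public int StartOffset() { return _start; }
    public int EndOffset() { return _end; }
}

// Hypothetical sketch: stores where a token starts (line and column),
// how long it is, and a reference back to the token it came from.
public class PositionedToken
{
    public readonly int Line;
    public readonly int Column;
    public readonly int Length;
    public readonly StubToken Token;

    public PositionedToken(int line, int column, StubToken token)
    {
        Line = line;
        Column = column;
        // Length in characters of the original text covered by the token.
        Length = token.EndOffset() - token.StartOffset();
        Token = token;
    }
}
```

Computing Length from the offsets rather than from the term text matters when the analyzer changes the term (stemming, for example), because the highlight has to cover the original characters.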
Based on this, it should be fairly clear that the next step will be to build some kind of preview using all the PositionedTokens to get the lines on which tokens appear and format those lines for display. My solution was to build HTMLPreviewBuilder:
public class HTMLPreviewBuilder
{
    private List<FragmentLines> _fragments;
    private ExplodedText _explodedText;

    public HTMLPreviewBuilder(List<PositionedToken> tokens, ExplodedText explodedtext)
    {
        _fragments = new List<FragmentLines>();
        _explodedText = explodedtext;
        foreach (PositionedToken token in tokens)
            _fragments.Add(new FragmentLines(token, explodedtext));
        // If for whatever reason we have no fragments, return the original text.
        if (_fragments.Count == 0)
            _fragments.Add(new FragmentLines(explodedtext));
        FormatLinesAndTokens();
    }
}
FragmentLines is a class that builds an array of lines which includes the line on which the token resides from the exploded text, and a buffer of preview lines before and after; in the example below I have simply hard-coded it to grab 2 lines before and 2 lines after:
public class FragmentLines
{
    public string[] Lines;
    public int StartLineNumber;
    public int EndLineNumber;
    public PositionedToken[] Tokens;

    // Fallback fragment: covers the whole text and has no tokens.
    public FragmentLines(ExplodedText explodedtext)
    {
        Lines = explodedtext.Lines;
        // Line numbers are zero-based here, like token.Line below.
        StartLineNumber = 0;
        EndLineNumber = explodedtext.LineCount - 1;
    }

    public FragmentLines(PositionedToken token, ExplodedText explodedtext)
    {
        Tokens = new PositionedToken[1] { token };
        StartLineNumber = Math.Max(0, token.Line - 2); // 2 lines prior
        EndLineNumber = Math.Min(explodedtext.LineCount - 1, token.Line + 2); // 2 lines after
        int numLines = (EndLineNumber - StartLineNumber) + 1;
        Lines = new string[numLines];
        for (int i = 0; i < numLines; i++)
            Lines[i] = explodedtext[StartLineNumber + i];
    }
}
So now our preview has an array of these, each containing a preview fragment for each token. What if other tokens are within the 2-line preview, or even on the same line, you ask? We can merge those fragments together, and I will address that in Part III. Note that Tokens in the above class is in fact an array; this is set up for this reason. For now, we'll only have one PositionedToken element in there.
Here is where the line numbers are added and the tokens formatted for HTML display:
private void FormatLinesAndTokens()
{
    // Gets the width of the string representation of the largest line number, so we can pad the line numbers appropriately.
    int maxLineNumWidth = _explodedText.LineCount.ToString().Length;
    foreach (FragmentLines frag in _fragments)
    {
        // Get the original Token; the fallback fragment (whole text, no hits) has none.
        PositionedToken token = (frag.Tokens != null && frag.Tokens.Length > 0) ? frag.Tokens[0] : null;
        // Inserts the line number on each line, and formats any tokens
        for (int lineNum = frag.StartLineNumber; lineNum <= frag.EndLineNumber; lineNum++)
        {
            string line = frag.Lines[lineNum - frag.StartLineNumber];
            int lineNumDisplay = lineNum + 1; // File line numbers start at 1.
            string lineNumPrefix = lineNumDisplay.ToString().PadLeft(maxLineNumWidth) + ": ";
            if (token != null && token.Line == lineNum)
            {
                StringBuilder lineBuilder = new StringBuilder();
                int startPos = 0;
                int endPos = line.Length;
                // Get key positions in the line so we can insert HTML
                int startPosToTokenLen = token.Column - startPos;
                int tokenEndPos = token.Column + token.Length;
                int tokenEndPosToEndPos = endPos - tokenEndPos;
                lineBuilder.Append(lineNumPrefix);
                lineBuilder.Append(EncodeForHTML(line.Substring(startPos, startPosToTokenLen)));
                lineBuilder.Append("<span style=\"background-color:#FFFF00;font-weight:bold\">");
                lineBuilder.Append(EncodeForHTML(line.Substring(token.Column, token.Length)));
                lineBuilder.Append("</span>");
                lineBuilder.Append(EncodeForHTML(line.Substring(tokenEndPos, tokenEndPosToEndPos)));
                frag.Lines[lineNum - frag.StartLineNumber] = lineBuilder.ToString();
            }
            else
                // Lines without a token still get their line number.
                frag.Lines[lineNum - frag.StartLineNumber] = lineNumPrefix + EncodeForHTML(line);
        }
    }
}
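The method above leans on an EncodeForHTML helper that isn't shown. A minimal sketch of one way to write it follows, together with a small worked example of the split-and-wrap step on a single line. The HtmlPreviewHelpers class and the HighlightLine method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only.
public static class HtmlPreviewHelpers
{
    public static string EncodeForHTML(string text)
    {
        // Ampersands go first so the entities added afterwards are not escaped again.
        return text.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Splits a line into before/token/after, escapes each part separately,
    // and wraps the token in the same highlight span used above.
    public static string HighlightLine(string line, int column, int length)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append(EncodeForHTML(line.Substring(0, column)));
        sb.Append("<span style=\"background-color:#FFFF00;font-weight:bold\">");
        sb.Append(EncodeForHTML(line.Substring(column, length)));
        sb.Append("</span>");
        sb.Append(EncodeForHTML(line.Substring(column + length)));
        return sb.ToString();
    }
}
```

Escaping the three pieces separately, rather than escaping the whole line first, keeps the token's column valid: the column is measured against the raw line, and escaping a `<` earlier in the line would shift everything after it.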
EncodeForHTML escapes any angle-brackets and ampersands for the HTML. Now that our lines are formatted with line numbers and the search terms highlighted using more hard-coded stuff (please feel free to do one better than that), we can wrap it in an HTML document for returning to the user:
public override string ToString()
{
    StringBuilder preview = new StringBuilder("<html>");
    preview.Append("<body style=\"font-family: Courier New; font-size: 8pt; background-color: #FFFFE1\">");
    // Assumes FragmentLines overrides ToString() to return its Lines joined by newlines.
    foreach (FragmentLines fragment in _fragments)
        preview.Append("<pre>" + fragment.ToString() + "</pre><hr />");
    preview.Append("</body></html>");
    return preview.ToString();
}
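The builder's ToString calls ToString on each fragment, which only produces useful output if FragmentLines overrides it. That override isn't shown in the class above, so the following is an assumption about what it would need to do: join the formatted lines with newlines so the surrounding pre element keeps them on separate rows. FragmentLinesSketch is a cut-down, hypothetical stand-in for the real class.

```csharp
using System;

// Cut-down, hypothetical stand-in for FragmentLines showing only the
// ToString override that the preview builder depends on.
public class FragmentLinesSketch
{
    public string[] Lines;

    public FragmentLinesSketch(string[] lines)
    {
        Lines = lines;
    }

    public override string ToString()
    {
        // Inside <pre>, a plain newline is enough to break the lines.
        return string.Join("\n", Lines);
    }
}
```

Without an override like this, the default Object.ToString would just return the type name, and the preview would contain no text at all.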
Et voila! HTML preview complete with line numbers. As mentioned earlier, this will display a preview fragment for each token, regardless of overlap. In the next segment, Part III, I will show you how to merge the fragments that overlap and how to format the merged segments.
