In the last entry, I showed what a concordance looks like. Today, I'm going to talk about what I'll need to change in the system we already have and what I'll need to add to it.
The biggest underlying change is that tokens will need to know their position in the input document. The system that we're building can only handle plain text documents, so I get to pretend that position is a simple concept. But it's really not. For example, in an XML document, the position could be an XPath expression.
Even for a text document, position isn't entirely clear. Is the position of a token the byte it started on in the original raw file? Or is it the character index of the token in the text of the file, after Unicode decoding?
For our case, I'm going to say that the position of a token is its line number plus its beginning and ending character indices in the decoded Unicode text of the document. This is what my example yesterday had.
To keep track of that, instead of slurping in a file's contents all at once, I'll now need to read in each line separately and keep track of the line numbers. That shouldn't be difficult, though.
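Here's a rough sketch of what reading line by line might look like. The name `numbered-lines` is just a placeholder I'm using for illustration, and numbering lines from 1 is my assumption; nothing here is settled yet.

```clojure
(require '[clojure.java.io :as io])

(defn numbered-lines
  "Returns a vector of [line-number line-text] pairs, numbering from 1.
  `source` is anything io/reader accepts: a file name, a File, a Reader.
  (Hypothetical helper; the name and the 1-based numbering are assumptions.)"
  [source]
  (with-open [rdr (io/reader source)]
    ;; vec forces the lazy sequence, so every line is read
    ;; before with-open closes the reader.
    (vec (map-indexed (fn [i line] [(inc i) line])
                      (line-seq rdr)))))

;; Try it on an in-memory reader instead of a real file.
(numbered-lines (java.io.StringReader. "first line\nsecond line"))
```

This still holds all the lines in memory, just as separate strings. For very large files, the pairs would have to be processed inside the `with-open` instead of being returned.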
Tokens will also need to be more than just plain strings. They'll now need to be structures that keep track of their location: file name, line number, and starting and ending character indices.
As an added bonus, tokens will also be able to keep track of the original form of the word, as well as a normalized or a stemmed form.
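To make that concrete, here's a minimal sketch of what a token might look like as a plain Clojure map, along with a function that pulls tokens out of one line. The key names, the `tokenize-line` function, the regex, and the use of lower-casing as the "normalized" form are all my assumptions for illustration. A `defrecord` would work just as well as a map.

```clojure
(require '[clojure.string :as str])

(defn tokenize-line
  "Splits one line into token maps that remember where they came from.
  (Hypothetical sketch: the keys and the regex are assumptions.)"
  [file-name line-no line]
  ;; Treat a run of Unicode letters or digits as one token.
  (let [m (re-matcher #"[\p{L}\p{N}]+" line)]
    (loop [tokens []]
      (if (.find m)
        (recur (conj tokens
                     {:text  (.group m)                  ; original form
                      :norm  (str/lower-case (.group m)) ; normalized form
                      :file  file-name
                      :line  line-no
                      :start (.start m)    ; index of the first character
                      :end   (.end m)}))   ; index just past the last one
        tokens))))

(tokenize-line "doc.txt" 1 "The cat sat")
```

One caveat: the indices that Java's `Matcher` reports count UTF-16 code units, so characters outside the Basic Multilingual Plane take up two positions each. For most text that won't matter.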
To display the concordance in alphabetical order and to pull all the occurrences of each word together easily, I'll need to index the tokens by word. The index won't be industrial-strength, really. It will probably bog down on a document that's too large. Other options would be to use Lucene or store the index in a database of some form. For now, though, I'll just keep things simple.
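The simple version of that index can be sketched in a couple of lines. I'm assuming here that each token is a map with a `:norm` key holding the normalized word; `index-tokens` is a placeholder name.

```clojure
(defn index-tokens
  "Groups tokens by their normalized word and returns a sorted map,
  so the keys come back in alphabetical order.
  (Hypothetical sketch: assumes each token is a map with a :norm key.)"
  [tokens]
  (into (sorted-map) (group-by :norm tokens)))

(def sample-index
  (index-tokens [{:norm "the" :line 1}
                 {:norm "cat" :line 1}
                 {:norm "the" :line 2}]))

(keys sample-index)          ; the words, alphabetically
(get sample-index "the")     ; every occurrence of "the", in document order
```

Everything lives in memory, which is exactly why this won't scale to a big corpus.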
I can imagine displaying a concordance a number of different ways: text files, HTML, a GUI form. To keep the system flexible, I'm going to defer that decision until later, and the core concordance generator will just return some basic Clojure data types. I'll also add a function that prints them to the screen.
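As a rough idea of how the data and the display can stay separate, here's a sketch of a printer. It assumes the index is a map from each word to a vector of token maps with `:file`, `:line`, `:start`, and `:end` keys. The function names and the output layout are placeholders, not the final format.

```clojure
(defn format-entry
  "Turns one [word hits] index entry into a sequence of display lines:
  the word itself, then one indented line per occurrence.
  (Hypothetical sketch: the layout is an assumption.)"
  [[word hits]]
  (cons word
        (for [{:keys [file line start end]} hits]
          (format "  %s:%d [%d-%d]" file line start end))))

(defn print-concordance
  "Prints every entry of the index to the screen."
  [index]
  (doseq [entry index
          text  (format-entry entry)]
    (println text)))

(print-concordance
  (sorted-map "cat" [{:file "doc.txt" :line 1 :start 4 :end 7}]
              "the" [{:file "doc.txt" :line 1 :start 0 :end 3}
                     {:file "doc.txt" :line 2 :start 0 :end 3}]))
```

Because `format-entry` only returns strings, an HTML or GUI front end could consume the same index later without touching the generator.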
That's it. This should be a lot simpler than the Stemmer we just worked on, but it should give us a good idea of how various words are being used in the documents.