In the last posting, we added the ability to filter a list of input tokens
with a list of stop words, usually words that are so common that they aren’t
interesting for many types of analysis. But because we don’t want to use a
stop list every time, we didn’t add such filtering to the tokenize-str
function itself. Today, we’ll do that.
Clojure allows what can be called function overloading. Basically, you can have several different versions of the same function all assigned to the same variable name. Each version has to have a different number of arguments. (The number of arguments is sometimes referred to as the function’s arity, so each different version of the function has to have a different arity.) Since the overloaded functions have different argument lists, Clojure can tell which version to use when you call that function.
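For instance, the general shape looks like this (a hypothetical greet function for illustration, not part of our word-counting code):

```clojure
;; Hypothetical example: one function name, two arities.
(defn greet
  ;; Zero-argument version: supplies a default and calls the
  ;; one-argument version.
  ([] (greet "world"))
  ;; One-argument version.
  ([name] (str "Hello, " name)))

(greet)        ; => "Hello, world"
(greet "Ada")  ; => "Hello, Ada"
```

When you call greet with no arguments, Clojure dispatches to the first body; with one argument, to the second.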
(Actually, calling this function overloading is misleading. That makes it sound as if we’re assigning multiple function objects to the same variable. The overloading really happens inside a function object, and a single function object is assigned to a variable.)
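To make that concrete: defn is shorthand for binding a single fn object to a var, and all the arities live inside that one fn form. A sketch, again with a hypothetical name:

```clojure
;; One fn object holding both arities, bound to a single var.
(def describe
  (fn
    ([x]   (str "one arg: " x))
    ([x y] (str "two args: " x " " y))))
```

There is only one value here, the fn object; it decides internally which body to run based on how many arguments it receives.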
For reference, here is the current version of
tokenize-str and an example of
using it and filtering with a stop words list:
(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(filter (complement stop-words)
        (tokenize-str "Input string here."))
To define tokenize-str as an overloaded function, wrap each set of arguments
and the function body that belongs with it in parentheses:
(defn tokenize-str
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string))))
Next, add another set of arguments plus function body at the same parentheses level as the one already there:
(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overloaded version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (map to-lower-case (re-seq token-regex input-string)))))
There are a couple of things that are new here. Let’s break them down.
First, I’ve included a couple of comments. When Clojure reads a semicolon
(";"), it ignores everything to the end of the line. Comments are there for
you to read.
Second, I’ve included a new overload of the function that takes two arguments
(arity 2, also written tokenize-str/2). The second argument is a function
that returns true if a token is a stop word. filter then applies the
complement of this function to what is essentially the body of the original
version, keeping only the tokens that are not stop words.
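As an aside, a Clojure set can serve directly as such a predicate: called as a function, a set returns the element if it is a member and nil otherwise. Assuming stop-words is defined as a set (the definition below is just for illustration), the filtering works like this:

```clojure
;; A set acts as a membership test when called as a function.
(def stop-words #{"the" "in" "a" "of"})  ; assumed definition for illustration

(stop-words "the")  ; => "the" (truthy)
(stop-words "cat")  ; => nil   (falsy)

;; complement flips truthy and falsy, so filter keeps non-stop-words.
(filter (complement stop-words) ["the" "cat" "in" "the" "hat"])
;; => ("cat" "hat")
```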
(Hint: It can be difficult to read functional programming. Sometimes it’s
easier to read from the inside out, right to left. First, look at
(re-seq ...), then read left to (map ...), and finally to (filter ...).)
So how do we use this? Call
(load-file "word.clj") to load the most recent
code, and let’s run the example from yesterday with this function.
user=> (word/tokenize-str "The cat in the hat.")
("the" "cat" "in" "the" "hat")
user=> (word/tokenize-str "The cat in the hat." word/stop-words)
("cat" "hat")
This works, but it’s a little messy. What if we want to change how we tokenize at the most basic level? We’ll need to change our code in two places, and that’s twice as many places to make an error.
It would be useful if we could call the first version of the function from inside the second version, and we can. Just call it the way we would from any other place in the code. It would look like this:
(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overloaded version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (tokenize-str input-string))))
Now, if we want to modify the way we tokenize, we only need to do it once, in
tokenize-str/1 (the original version of the function). tokenize-str/2 just
implements stop-word filtering on top of it.
Of course, a function calling itself could be dangerous: if you’re not careful, it could end up calling itself forever. Whenever we have a function call itself, we need to make sure that there is a path through the function that will eventually be taken and that does not call itself. In this case, the first version acts as such a path: if we call the second version, it calls the first version, which simply returns a list of tokens.
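A tiny, self-contained sketch of that base-case pattern (a hypothetical count-down function, not part of our tokenizer):

```clojure
(defn count-down [n]
  (if (zero? n)
    '()                              ; base case: no recursive call
    (cons n (count-down (dec n))))) ; recursive case: a smaller problem

(count-down 3)  ; => (3 2 1)
```

Each call either stops (when n is zero) or recurses on a strictly smaller n, so the recursion always terminates.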
Recursion can be a little mind-bending. Don’t worry about that at this point. We’ll see recursion again many times, and soon it will become an old friend.
A little strange, but an old friend nevertheless.
So far, we’ve only been tokenizing strings that we type in. Of course, in practice we’ll want to read data in from a file. Next time, I’ll talk about how to do that.