Naive Bayes Classifier in Haskell

June 7th 2012

I just started learning Haskell. So far, this language has consistently blown my mind, at every turn. I feel like every new feature is a revelation. But damn, is it hard. Till now, I’ve kept to trying simple exercises from the book I’m using, Real World Haskell (A great book). Yesterday and today, I tried my first not-totally-trivial program: I made a Naive Bayes Classifier. My classifier is in two parts: A module for the classifier and a command-line interface. The main classifier is only 34 lines, including comments! Anyway,I thought I’d put it online so that anyone can give me any feedback or comments, and if it can help anyone else, all the better. DISCLAIMER: I’m just learning Haskell, so don’t consider this a tutorial, or assume it uses any ‘best practices’, or is even remotely idiomatic.

To learn more about Naive Bayes Classifiers, see it on wikipedia, read this great tutorial in Ruby, learn about the basics of Bayes Theorem, and, if you try making a classifier and keep getting zero probability, see the Laplacian Correction section of this PDF.

Here’s My Code:

 
module BayesClassifier where
-- Text Classifier Using Bayes Formula
import Data.List
import Data.Char
type Category = String
newtype Classifier = Classifier { training :: [(Category, [String])] } deriving (Eq, Show)
-- Get a new classifer with no training
classifier :: Classifier
classifier = Classifier []
-- classifier probabilities
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfCategory :: Classifier -> Category -> Double
-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory (Classifier training) word category = let allInCategory = filter (\(cat, _) -> cat == category) training
                                                                      allInCategoryContainingWord = filter (\(_, text) -> word `elem` text) allInCategory
                                                                  in (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)
probabilityOfCategory (Classifier training) category =  let allInCategory = filter (\(cat, _) -> cat == category) training 
                                                        in (fromIntegral $ length allInCategory) / (fromIntegral $ length training)
-- Train a classifier
train :: Classifier -> String -> Category -> Classifier 
train (Classifier training ) text category = Classifier $ (category, cleanInput $ text):training
-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify classifier text = fst $ head $ sortBy (\(_, a) (_, b)  -> b `compare` a) $ probabilities classifier text
-- Get Probability for each Category
probabilities :: Classifier ->  String -> [(Category, Double)] 
probabilities classifier@(Classifier training) text =  map (\cat -> (cat, probabilityForCategory classifier text cat)) $ nub $ map (\(cat, _) -> cat) training
-- Get Probability for a passage in a certain category
probabilityForCategory :: Classifier -> String -> Category -> Double
probabilityForCategory classifier text category = (+) (log $ probabilityOfCategory classifier category)  (sum $ map (\word -> log $ probabilityOfWordInCategory classifier word category) $ cleanInput text) 
-- Lowercase, Remove Punctuation
cleanInput :: String -> [String]
cleanInput text = filter (\w -> not (w `elem` stopWords)) $ words $ filter (`elem` ' ':['a'..'z']) $ map toLower text 
                  where stopWords = ["a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours ","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"]
 

import System.IO
import BayesClassifier

interactionLoop myClassifier function = case function of 
                                          "start" ->  
                                            do
                                              putStrLn "Enter an action [train|classify]"
                                              action <- getLine
                                              interactionLoop myClassifier action
                                          "train" -> 
                                            do
                                              putStr "Category: "
                                              category <- getLine
                                              putStr "Material: "
                                              material <- getLine
                                              interactionLoop (train myClassifier material category) "start"
                                          "classify" ->
                                            do
                                              putStr "Material: "
                                              material <- getLine
                                              putStrLn $ classify myClassifier material
                                              putStrLn . show $ probabilities myClassifier material
                                              putStrLn "\n\n\n\n"
                                              interactionLoop myClassifier "start"
                                          _ ->
                                              interactionLoop myClassifier "start"
main = do
        hSetBuffering stdout NoBuffering
        hSetBuffering stdin NoBuffering
        interactionLoop classifier "start" 

`

blog comments powered by Disqus