November 25, 2018

1101 words 6 mins read

Using Scalpel to build a simple web scraper with Haskell

Lately I was just curious about what functional programming language is considered to be the best suited to the area of web development, I started looking the web for answers, suddenly this post on HackerNews got my attention and it came to mind: why not to build a scraper to see which programming language has more mentions?, of course it is not a guarantee that the comments are praising the language but instead they might be more criticized. As to why I picked Haskell for the task? well just for fun really!

Scalpel

I wanted to solve the task quickly so I just looked around and Scalpel seemed to be a popular choice for Haskell scraping, it works fairly easy one just have to use the scrapeURL function from the Text.HTML.Scalpel package and supply the website URL and a Scraper value that can be constructed by using Selectors, for example comments in HN are in a <span> tag with the class commtext so by using the text selector we can build the Scraper we need like this:

comments :: Scraper Text [Text]
comments = texts $ "span" @: [hasClass "commtext"]

A Scraper data type expects a string like type (in this case Text) and a type describing the values returned by the scraper, here we used a [Text] because we want to capture a list of comments, if you have a custom data type then use that here and build it inside the Scraper definition.

Notice we’re using Text instead of String for our text representation so enabling the OverloadedStrings extension is recommended. For simplicity we can include the scrapper definition inside a generalized function hoping to get a list of comments (and thus the Maybe [Text]). Since this performs an HTTP request it needs to return an IO action.

scrapper :: String -> IO (Maybe [Text])
scrapper url = scrapeURL url comments
where
    comments :: Scraper Text [Text]
    comments = texts $ "span" @: [hasClass "commtext"]

That’s the scraper part! quick and easy! But let’s go further and get the most popular languages mentioned in the comments.

Functional programming design is mainly about creating a pipe of data transformations, the pipe we need to create here is one that takes the result of the scraper (Maybe [Text]) and output a [(Text, Integer)] were the first component of the tuple is the programming language name and the second component is how many mentions it has. We will follow a map-reduce approach (minus parallelism), I’ve written about this technique before but using MongoDB’s map-reduce facilities feel free to read that if you’re not familiarized with the concept.

We will apply the pipe function to the result of the scraper, however note that it lives inside a Maybe which in turn lives inside the IO monad. So we lift our piping function over both monads.

fmap pipe <$> scrapper url

What this means is that the pipe function should have the type [Text] -> [(Text, Integer)] and this transformation will happen inside both the Maybe and IO monad. The first function in the data transformation pipeline is to break the long comments in word tokens, I’ll call this the tokenizer function.

tokenizer :: [Text] -> [Text]
tokenizer = concatMap (concatMap processor . Text.words)
where
    cleaner   = Text.filter (`notElem` (",.><?!-:;\"\'()" :: String))
    processor = Text.splitOn "/" . Text.toLower . cleaner

Inside this function not only token words are extracted but also “pruned” by removing special characters of them (any of ,.><?!-:;\"\'()), also they are transformed to lowercase and split by /. The output of this function will be ready to serve as input to the map-reduce process, we define both in a single function:

mapReduce :: [Text] -> [(Text, Integer)]
mapReduce = reducer . mapper
where
    mapper :: [Text] -> [(Text, Integer)]
    mapper = map (, 1)

    reducer :: [(Text, Integer)] -> [(Text, Integer)]
    reducer = Map.toList . foldr (uncurry $ Map.insertWith (+)) Map.empty

As we already know (you’ve read previous posts right?) the map-reduce is a two-step process, it involves a transformation function and a reducer function, the transformation step will turn a Text word to a (Text, Integer) pair, the count is set to 1 because each word occurrence will count by one. The definition uses the TupleSections extension that should be enabled, otherwise (, 1) is a syntax error. The reducer takes multiple outputs of the mapper and somehow combine the elements, here we fold the inputs into a Data.Map combining elements with the same key (multiple word occurrence) by summing the Integer component of the tuple (that’s what Map.insertWith (+) does), finally is transformed back into a list to comply with the type signature.

Finally we sort the tuples by the occurrence count (the second component):

sorted :: [(Text, Integer)] -> [(Text, Integer)]
sorted = sortBy $ \(_, x) (_, y) -> compare y x

As this is a simple function we include it right in the pipe function definition, this function is just the composition of the transformations we’ve been talking about:

pipe = sorted . mapReduce . tokenizer
where
    sorted :: [(Text, Integer)] -> [(Text, Integer)]
    sorted = sortBy $ \(_, x) (_, y) -> compare y x

That’s it, at this point we have our result (inside both the Maybe and IO monads), we can extract and pattern match to inspect the elements:

result <- fmap pipe <$> scrapper url

case result of
  Nothing     -> putStrLn "Failed to scrap or parse"
  Just tokens -> mapM_ print $ filter byRelevancy tokens
    where
      byRelevancy :: (Text, Integer) -> Bool
      byRelevancy (token, count) = popular && relevant
        where
          popular  = count > 3
          relevant = token `elem` languages

Here I added a function to filter only those tokens where the occurrence is popular (has more than 3 occurrences) and relevant (I’m interested in that token), for the later I defined a list of tokens I’m interested in:

languages :: [Text]
languages = [ "scala", "haskell", "clojure", "f#", "fsharp", "java", "erlang"
            , "javascript", "python", "scheme", "elixir", "lisp", "racket"
            , "elm", "purescript", "ml", "ocaml", "react", "reason", "reasonml"
            , "net", "jvm", "beam", "llvm", "scalajs", "clojurescript", "ghcjs"
            ]

If you’ve been following why don’t you try (as an exercise) to add this filtering step to the pipe function definition?

Conclusion

We’ve used Scalpel an easy to use library to define scrapers, it also makes the request to the web page.

So? what are the results?

("f#",34)
("elixir",21)
("scala",18)
("net",17)
("clojure",14)
("java",13)
("jvm",9)
("lisp",9)
("erlang",6)
("javascript",6)
("python",6)
("clojurescript",4)
("haskell",4)

So is F# 8-ish times more suitable for web development than Haskell? I don’t know, let’s read what people say.