Using Scalpel to build a simple web scraper with Haskell
Lately I was curious about which functional programming language is considered best suited for web development. I started searching the web for answers, and suddenly this post on HackerNews got my attention. It came to mind: why not build a scraper to see which programming language gets the most mentions? Of course, a mention is no guarantee that a comment is praising the language; it might well be criticizing it instead. As for why I picked Haskell for the task? Well, just for fun really!
Scalpel
I wanted to solve the task quickly, so I looked around and Scalpel seemed to be a popular choice for scraping in Haskell. It is fairly easy to use: one just has to call the scrapeURL function from the Text.HTML.Scalpel module, supplying the website URL and a Scraper value that can be constructed from Selectors. For example, comments on HN live in a <span> tag with the class commtext, so by using the texts selector we can build the Scraper we need like this:
comments :: Scraper Text [Text]
comments = texts $ "span" @: [hasClass "commtext"]
A Scraper data type expects a string-like type (in this case Text) and a type describing the values returned by the scraper. Here we use [Text] because we want to capture a list of comments; if you have a custom data type, use it here and build it inside the Scraper definition.
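For instance, a scraper returning a custom data type could look like the sketch below. This is purely illustrative: the Comment record and the "comment"/"hnuser" selectors are my own guesses at HN's markup, not code from this post.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch: Comment and the "comment"/"hnuser" selectors
-- are illustrative assumptions, not taken from this post.
import Data.Text (Text)
import Text.HTML.Scalpel

data Comment = Comment
  { author :: Text
  , body   :: Text
  } deriving (Show, Eq)

hnComments :: Scraper Text [Comment]
hnComments = chroots ("div" @: [hasClass "comment"]) $ do
  -- chroots runs the inner Scraper once per matching element,
  -- so the overall result is a [Comment]
  a <- text $ "a" @: [hasClass "hnuser"]
  b <- text $ "span" @: [hasClass "commtext"]
  return (Comment a b)
```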
Notice we’re using Text instead of String for our text representation, so enabling the OverloadedStrings extension is recommended. For simplicity we can wrap the scraper definition in a generalized function that hopefully yields a list of comments (thus the Maybe [Text]). Since this performs an HTTP request, it needs to return an IO action.
scrapper :: String -> IO (Maybe [Text])
scrapper url = scrapeURL url comments
where
comments :: Scraper Text [Text]
comments = texts $ "span" @: [hasClass "commtext"]
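As a side note, Scalpel also provides scrapeStringLike, which runs the same kind of Scraper over an in-memory document. A minimal sketch (the HTML fixture is made up) for checking the selector without any HTTP request:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.HTML.Scalpel

comments :: Scraper Text [Text]
comments = texts $ "span" @: [hasClass "commtext"]

-- scrapeStringLike runs the Scraper over an in-memory document,
-- so the selector can be checked offline against a fixture.
sample :: Maybe [Text]
sample = scrapeStringLike
  "<span class=\"commtext\">I like Haskell</span><span class=\"commtext\">F# is nice</span>"
  comments
-- sample == Just ["I like Haskell","F# is nice"]
```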
That’s the scraper part: quick and easy! But let’s go further and find the most mentioned languages in the comments.
Most popular language in the comments
Functional programming design is mainly about creating a pipeline of data transformations. The pipeline we need here takes the result of the scraper (Maybe [Text]) and outputs a [(Text, Integer)], where the first component of each tuple is a programming language name and the second is how many mentions it has. We will follow a map-reduce approach (minus the parallelism); I’ve written about this technique before using MongoDB’s map-reduce facilities, so feel free to read that if you’re not familiar with the concept.
We will apply the pipe function to the result of the scraper; however, note that the result lives inside a Maybe, which in turn lives inside the IO monad. So we lift our piping function over both monads.
fmap pipe <$> scrapper url
This means the pipe function should have the type [Text] -> [(Text, Integer)], and the transformation will happen inside both the Maybe and the IO monad. The first step in the data transformation pipeline is to break the long comments into word tokens; I’ll call this the tokenizer function.
tokenizer :: [Text] -> [Text]
tokenizer = concatMap (concatMap processor . Text.words)
where
cleaner = Text.filter (`notElem` (",.><?!-:;\"\'()" :: String))
processor = Text.splitOn "/" . Text.toLower . cleaner
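As a quick sanity check, here is the tokenizer exercised in isolation (a self-contained sketch repeating the definition, with a made-up input):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as Text

tokenizer :: [Text] -> [Text]
tokenizer = concatMap (concatMap processor . Text.words)
  where
    cleaner = Text.filter (`notElem` (",.><?!-:;\"\'()" :: String))
    processor = Text.splitOn "/" . Text.toLower . cleaner

-- The "!" is stripped, everything is lowercased, and "/" splits the token:
-- tokenizer ["I love F#/OCaml!"] == ["i", "love", "f#", "ocaml"]
```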
Inside this function, words are not only extracted as tokens but also “pruned” by removing special characters from them (any of ,.><?!-:;\"\'()); they are also lowercased and split on /. The output of this function is ready to serve as input to the map-reduce process, which we define in a single function:
mapReduce :: [Text] -> [(Text, Integer)]
mapReduce = reducer . mapper
where
mapper :: [Text] -> [(Text, Integer)]
mapper = map (, 1)
reducer :: [(Text, Integer)] -> [(Text, Integer)]
reducer = Map.toList . foldr (uncurry $ Map.insertWith (+)) Map.empty
As we already know (you’ve read the previous posts, right?), map-reduce is a two-step process involving a transformation function and a reducer function. The transformation step turns a Text word into a (Text, Integer) pair; the count is set to 1 because each word occurrence counts as one. The definition uses the TupleSections extension, which must be enabled, otherwise (, 1) is a syntax error. The reducer takes multiple outputs of the mapper and combines the elements: here we fold the inputs into a Data.Map, merging elements with the same key (multiple occurrences of a word) by summing the Integer component of the tuple (that’s what Map.insertWith (+) does). Finally the map is turned back into a list to comply with the type signature.
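A tiny worked example (self-contained, repeating the definition, with a made-up input) shows the fold in action:

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
import Data.Text (Text)
import qualified Data.Map as Map

mapReduce :: [Text] -> [(Text, Integer)]
mapReduce = reducer . mapper
  where
    mapper :: [Text] -> [(Text, Integer)]
    mapper = map (, 1)
    reducer :: [(Text, Integer)] -> [(Text, Integer)]
    reducer = Map.toList . foldr (uncurry $ Map.insertWith (+)) Map.empty

-- Map.toList yields pairs ordered by key, hence "f#" before "haskell":
-- mapReduce ["haskell", "f#", "haskell"] == [("f#",1), ("haskell",2)]
```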
Finally we sort the tuples by the occurrence count (the second component):
sorted :: [(Text, Integer)] -> [(Text, Integer)]
sorted = sortBy $ \(_, x) (_, y) -> compare y x
Since this is a simple function, we include it right in the pipe definition, which is just the composition of the transformations we’ve been talking about:
pipe = sorted . mapReduce . tokenizer
where
sorted :: [(Text, Integer)] -> [(Text, Integer)]
sorted = sortBy $ \(_, x) (_, y) -> compare y x
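As an aside, the same descending ordering can be spelled without a lambda using sortOn and Down from Data.Ord; a purely stylistic alternative:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (sortOn)
import Data.Ord (Down (..))
import Data.Text (Text)

-- Equivalent to sortBy (\(_, x) (_, y) -> compare y x): wrapping the
-- count in Down reverses the comparison, giving descending order.
sorted :: [(Text, Integer)] -> [(Text, Integer)]
sorted = sortOn (Down . snd)
```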
That’s it! At this point we have our result (inside both the Maybe and IO monads); we can extract it and pattern match to inspect the elements:
result <- fmap pipe <$> scrapper url
case result of
Nothing -> putStrLn "Failed to scrape or parse"
Just tokens -> mapM_ print $ filter byRelevancy tokens
where
byRelevancy :: (Text, Integer) -> Bool
byRelevancy (token, count) = popular && relevant
where
popular = count > 3
relevant = token `elem` languages
Here I added a function to keep only those tokens that are popular (more than 3 occurrences) and relevant (a token I’m interested in). For the latter I defined a list of tokens of interest:
languages :: [Text]
languages = [ "scala", "haskell", "clojure", "f#", "fsharp", "java", "erlang"
, "javascript", "python", "scheme", "elixir", "lisp", "racket"
, "elm", "purescript", "ml", "ocaml", "react", "reason", "reasonml"
, "net", "jvm", "beam", "llvm", "scalajs", "clojurescript", "ghcjs"
]
If you’ve been following along, why don’t you try (as an exercise) adding this filtering step to the pipe function definition?
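Putting the pure pieces together, here is a self-contained end-to-end sketch of the pipeline running on two made-up comments. The fixture, the shortened languages list, and the lowered popularity threshold are mine, purely for illustration (the filter stays outside pipe, so the exercise is still yours):

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
import Data.List (sortBy)
import qualified Data.Map as Map
import Data.Text (Text)
import qualified Data.Text as Text

-- Shortened languages list, enough for the fixture below.
languages :: [Text]
languages = ["scala", "haskell", "f#", "elixir"]

tokenizer :: [Text] -> [Text]
tokenizer = concatMap (concatMap processor . Text.words)
  where
    cleaner = Text.filter (`notElem` (",.><?!-:;\"\'()" :: String))
    processor = Text.splitOn "/" . Text.toLower . cleaner

mapReduce :: [Text] -> [(Text, Integer)]
mapReduce = reducer . mapper
  where
    mapper = map (, 1)
    reducer = Map.toList . foldr (uncurry $ Map.insertWith (+)) Map.empty

pipe :: [Text] -> [(Text, Integer)]
pipe = sorted . mapReduce . tokenizer
  where
    sorted = sortBy $ \(_, x) (_, y) -> compare y x

-- Threshold lowered to 1 because the fixture is tiny.
byRelevancy :: (Text, Integer) -> Bool
byRelevancy (token, count) = count > 1 && token `elem` languages

result :: [(Text, Integer)]
result = filter byRelevancy (pipe fixture)
  where
    fixture = ["I like Haskell and F#", "Haskell? F# is nice too"]
-- result == [("f#",2), ("haskell",2)]
```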
Conclusion
We’ve used Scalpel, an easy-to-use library for defining scrapers, which also takes care of requesting the web page.
So, what are the results?
("f#",34)
("elixir",21)
("scala",18)
("net",17)
("clojure",14)
("java",13)
("jvm",9)
("lisp",9)
("erlang",6)
("javascript",6)
("python",6)
("clojurescript",4)
("haskell",4)
So, is F# roughly 8 times more suitable for web development than Haskell? I don’t know; let’s read what people say.