
Web scraping with Node.js

Web scraping is often considered a fairly well-understood craft these days; however, modern web sites definitely bring some complexities of their own to the table. AJAX long polling, XMLHttpRequest, WebSockets, Flash sockets and the like make things a little more difficult than your average crawler can handle.

Let’s start with the basics of what we needed at Hubdoc – we crawl bank, utility, and credit card company web sites looking for bill amounts, due dates, account numbers, and most importantly, PDFs of the most recent bills. My initial take on this problem (aside from using an expensive commercial product we were evaluating) was that it would be simple – I had done a crawling project in Perl at MessageLabs/Symantec before and it was relatively straightforward. However, it turns out that spammers build web pages that are a LOT easier to crawl than banks and utility companies do.

How would we build this? Basically: start with mikeal’s excellent request library, perform the login in a real browser while watching the Network window to see exactly which request headers are sent, and copy those into code. Simple. Just follow it through the login process to the point of downloading the PDF and make all the same requests. To make things easier, and more reasonable for web developers to write these crawlers, I pushed the HTML results through the lightweight cheerio library, which exposes a jQuery-style API and allows relatively simple selection of page elements using CSS selectors. The whole thing is wrapped in a framework that does all the additional work of fetching credentials from the database, loading individual robots, and communicating with the UI via socket.io.
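
A rough sketch of that approach looks like the following – the URL, form fields, headers, and selectors here are placeholders, not any real site’s:

var request = require('request');
var cheerio = require('cheerio');

// A cookie jar keeps the session alive across the login and the pages after it.
var jar = request.jar();

request.post({
    url: 'https://bank.example.com/login',   // placeholder URL
    jar: jar,
    form: { username: 'user', password: 'secret' },
    headers: { 'User-Agent': 'Mozilla/5.0' } // copied from the browser's Network window
}, function (err, res, body) {
    if (err) throw err;
    // Load the returned HTML into cheerio and query it jQuery-style with CSS selectors.
    var $ = cheerio.load(body);
    var amountDue = $('.bill-summary .amount-due').text().trim(); // selector is hypothetical
    console.log('Amount due:', amountDue);
});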

For some web sites this worked. But the problem is simply Javascript – not my node.js code – the code that these companies put on their sites. They have layered complexity on top of complexity because of legacy issues, making it extremely hard to figure out exactly what you have to do to get to the point of being logged in. With some sites I tried for days in vain to get it working with the request() library.

At that point, after much frustration, I reached for node-phantomjs, which allowed me to control the phantomjs headless WebKit browser from node. This seemed like an easy solution, except for a few problems with phantomjs that required work-arounds:

  • It only tells you when a page has finished loading, but you have no idea whether it is about to redirect via some JavaScript or a meta tag – especially if that JavaScript executes on a setTimeout() rather than immediately.
  • It provides you with a pageLoadStarted hook, which lets you mitigate some of that problem, but only if you keep a count of how many pages have started loading, decrement that count when a page finishes loading, allow some timeout leeway in between (because these things don’t happen instantly), and then call your callback when the count reaches zero. This works, but it feels a little hacky.
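In code, that counting trick looks roughly like this – how the onLoadStarted / onLoadFinished hooks get wired up depends on the phantomjs bridge you use, and pageSettled() is just an illustrative stand-in for “the page has really finished”:
var pagesLoading = 0;
var settleTimer = null;
var SETTLE_DELAY = 500; // ms of leeway for a setTimeout()-driven redirect to kick in

function pageSettled() { /* illustrative callback, not a phantomjs API */ }

function onLoadStarted() {
    pagesLoading++;
    clearTimeout(settleTimer);
}

function onLoadFinished() {
    pagesLoading--;
    clearTimeout(settleTimer);
    if (pagesLoading <= 0) {
        // Don't call back immediately – give any pending JS or meta-tag redirect
        // a moment to start another page load first.
        settleTimer = setTimeout(function () {
            if (pagesLoading <= 0) {
                pageSettled();
            }
        }, SETTLE_DELAY);
    }
}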
  • It requires an entirely separate process per crawl because it has no way to separate your cookies from one “page” to the next. If you use the same phantomjs process, the cookies from one login session get sent along with the other pages you are browsing.
  • There’s no way to download resources with phantomjs – the only thing you can do is create a snapshot of the page as a PNG or PDF. That’s useful, but it meant we had to fall back to request() for the PDF download.
  • Because of the above point, I had to work out a way to pass the cookies from the phantomjs session into the request() library’s cookie jar. This is done by passing out the document.cookie string and parsing it, then injecting those cookies into request()’s cookie jar.
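A minimal sketch of that hand-off, assuming the document.cookie string has already been evaluated out of the page (the URL is a placeholder):
var request = require('request');

// Split the "name=value; name2=value2" string from document.cookie and
// push each pair into a request cookie jar scoped to the site we logged in to.
function cookiesToJar(cookieString, url) {
    var jar = request.jar();
    cookieString.split(/;\s*/).forEach(function (pair) {
        if (pair) {
            jar.setCookie(request.cookie(pair), url);
        }
    });
    return jar;
}

// var jar = cookiesToJar(documentCookieString, 'https://bank.example.com/');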
  • There’s no easy way to inject variables into the browser session. To do that I had to create a properly escaped string to build up a Javascript function:
Robot.prototype.add_page_data = function (page, name, data) {
    page.evaluate(
        "function () { var " + name + " = window." + name + " = " + JSON.stringify(data) + "; }"
    );
};
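A hypothetical call then looks like this, making the data visible to scripts inside the page as a global:
// Hypothetical usage – the variable name and payload are made up
robot.add_page_data(page, "accountCredentials", { username: "user", password: "secret" });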
  • Web sites dick around with things like console.log(), redefining them to their own whim. To get around that I had to do this ridiculousness:
if (!console.log) {
    // Grab a working console object from a new iframe, since the page may
    // have clobbered the real one
    var iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    console = window.frames[0].console;
}
  • There’s no easy way to tell the browser to click a normal <a> link, so I had to add in this code:
var clickElement = window.clickElement = function (id){
    var a = document.getElementById(id);
    var e = document.createEvent("MouseEvents");
    e.initMouseEvent("click", true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
    a.dispatchEvent(e);
};
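From the node side, the injected helper can then be driven through page.evaluate(), built up as a string in the same way as above (the element id is a placeholder):
// Click the "download statement" link by its id from outside the page context
page.evaluate("function () { window.clickElement('downloadStatementLink'); }");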
  • I had to implement a limit on the maximum number of concurrent browser sessions so we wouldn’t nuke the server. Having said that, this limit can be set significantly higher than the expensive commercial solution would allow us to go.
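The limit itself is nothing fancy – a minimal sketch (the cap of 20 and the shape of the crawl jobs are arbitrary) looks something like this:
var MAX_CONCURRENT_BROWSERS = 20; // arbitrary cap, tuned to the host's memory
var activeBrowsers = 0;
var pendingCrawls = [];

// Each crawl function must call done() exactly once when its phantomjs
// process has exited, so the next queued crawl can start.
function runCrawl(crawl) {
    if (activeBrowsers >= MAX_CONCURRENT_BROWSERS) {
        pendingCrawls.push(crawl);
        return;
    }
    activeBrowsers++;
    crawl(function done() {
        activeBrowsers--;
        var next = pendingCrawls.shift();
        if (next) {
            runCrawl(next);
        }
    });
}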

At the end of all that work, I have a mostly decent phantomjs + request based crawler solution in place. Once logged in with phantomjs, you can switch back to request(), which uses the cookies set in phantomjs to carry on with the logged-in session. This is a huge win, as we can then use request()’s stream support to stream PDFs straight into our systems.
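
For example, once the cookie jar has been populated from the phantomjs session, the download becomes a straightforward pipe (the URL and path are placeholders):

var fs = require('fs');
var request = require('request');

// jar is the request cookie jar populated from the phantomjs session
request.get({ url: 'https://bank.example.com/statements/latest.pdf', jar: jar })
    .on('error', function (err) { console.error('PDF download failed:', err); })
    .pipe(fs.createWriteStream('/tmp/latest-statement.pdf'));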

The overall plan is to make it relatively simple for web developers who know jQuery and CSS selectors to create crawlers for the different web sites we need to crawl. I haven’t proved the success of that yet, but hopefully will soon.


14 thoughts on “Web scraping with Node.js”

    • CasperJS solves some (most) of the problems with phantomjs, but it doesn’t address the reason I wanted to go with straight-up Node in the first place – scalability and access to socket.io. With Node I can (as long as the scraper can get out of PhantomJS mode quickly, or doesn’t even need to use it) scrape tens of thousands of web pages concurrently from a single host without breaking a sweat. There’s no way I can spawn tens of thousands of webkit instances concurrently.

      • paddysrailsportal says:

        Well, I just implemented a scraping solution and I used Phantom along with Node. To scale, it’s quite easy really. There is no need to spawn tens of thousands of webkit instances. It’s a simple hack. You just need to use a WebSocket client library (something like “ws”) and connect it to a phantom instance. Here is a gist:

        http://pastebin.com/Sm2WEfiZ

        Some of the scenarios that you pointed out are explained with solutions in the gist as comments.
        I have included the HTML in raw form. You can actually use Phantom’s FileSystem API (filesystem.read) to read the HTML. You can also inject scripts (like jQuery etc.) via the same FileSystem API.

      • Shripad K says:

        You have to, of course, connect from Node to the Phantom process via WebSocket, so you would need only one Phantom process. You can reuse the same process for every message sent to it from Node. :)

      • That doesn’t solve the problem of concurrency.

        Here’s a scenario it breaks down under: you’re processing a bank login for two people concurrently. If you use one browser instance, it shares the cookies set by one user with the other’s session, so you can’t log both in simultaneously.

        Trust me I’ve tried this.

      • Shripad K says:

        Umm, I don’t know how it affects concurrency. You just load another iframe. For instance, if you have 2 concurrent users, you just load 2 iframes (each iframe communicates with the parent frame via postMessage). Once the session is over, tear down the iframe. Trust me, I have been using it for quite some time now. It works well.

      • Sorry I guess this is hard to explain.

        The point of using the browser instead of just doing it all in Node is that some sites redirect with complicated JS. Take for example the RBC web site. It requires JS to work.

        In your solution, if I had two iframes in one browser instance, loaded up the RBC login page in each, and attempted two simultaneous logins, it just wouldn’t work. The server sends back Set-Cookie values upon login, which are stored in that browser instance’s memory. But the second login will stomp over the first’s cookies. Thus when you get pages back, the first user’s bank statements won’t be returned – you’ll only get the second user’s statements.

        The only way to solve this would be to hack webkit to allow per-iframe cookies. And that’s beyond what I have time for, or possibly even impossible.

        That’s the concurrency issue I’m referring to.

    • Thanks – I still think it’s a great project – these are just things you have to work around. Ultimately a hybrid solution worked best, and I wanted to share it.

  1. I don’t understand why you would use iframes with phantomjs. Instead, you can use multiple WebPage objects, or multiple instances of phantomjs as a fleet of workers reading from a message queue.

  2. Nice work. I’m doing a side project that requires scraping authenticated sites, and I was wondering if you had any plans to open source your Robot class?

    Also, I assume you implemented some form of PhantomJS instance pooling? Could you expand on that? Thanks again for an informative scraping article.
