Interesting web scraping problem

Since PhantomJS has no API for accessing cookies, I showed in my last post how we extract them by reading document.cookie so that we can transfer them to Node/request.
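As a rough illustration of the Node side of that hand-off, here is a minimal sketch that turns a document.cookie string into something you can put in a Cookie header. The helper names parseCookieString and toCookieHeader are hypothetical, not part of PhantomJS or request, and the parsing is deliberately simplified.

```javascript
// Hypothetical helper: parse a document.cookie string ("a=1; b=2") into an
// object of name/value pairs. Simplification: entries without "=" are skipped.
function parseCookieString(cookieString) {
  const cookies = {};
  cookieString.split(';').forEach(function (pair) {
    const trimmed = pair.trim();
    const eq = trimmed.indexOf('=');
    if (eq === -1) return; // nothing usable in this entry
    // Split on the first "=" only, since values may themselves contain "=".
    cookies[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  });
  return cookies;
}

// Hypothetical helper: rebuild a Cookie header value from the parsed pairs.
function toCookieHeader(cookies) {
  return Object.keys(cookies)
    .map(function (name) { return name + '=' + cookies[name]; })
    .join('; ');
}
```

The resulting string could then be sent as the Cookie header of whatever HTTP request you make from Node.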

One problem we recently found with this scheme is that document.cookie doesn’t include cookies that have the HttpOnly flag set, since that flag exists precisely to hide them from page scripts. Awesomely, in anticipation of this problem, a new cookies API has been added to the git source: https://github.com/ariya/phantomjs/pull/268
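To make the gap concrete, here is a small sketch that takes the Set-Cookie headers a server sent and works out which cookies a page script would be able to see. The function names and sample headers are made up for illustration, and the check only looks at the HttpOnly attribute; it ignores path, domain, expiry and Secure, which a real browser also takes into account.

```javascript
// Simplified check: does this Set-Cookie header carry the HttpOnly attribute?
// Attribute names are case-insensitive, so compare in lower case.
function isHttpOnly(setCookieHeader) {
  return setCookieHeader.split(';').slice(1).some(function (attr) {
    return attr.trim().toLowerCase() === 'httponly';
  });
}

// Return the "name=value" part of every cookie that is NOT HttpOnly, which is
// roughly what document.cookie would expose (other attributes are ignored).
function visibleToScript(setCookieHeaders) {
  return setCookieHeaders
    .filter(function (header) { return !isHttpOnly(header); })
    .map(function (header) { return header.split(';')[0].trim(); });
}

// Made-up example headers: a session cookie marked HttpOnly and a plain one.
const sampleHeaders = [
  'sid=abc123; Path=/; HttpOnly',
  'theme=dark; Path=/'
];
const visible = visibleToScript(sampleHeaders); // only the theme cookie is left
```

In a login flow the session cookie is often exactly the one marked HttpOnly, so it is the one that goes missing when you copy cookies out through document.cookie.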

Just something to be aware of: document.cookie doesn’t tell you the whole truth.


3 thoughts on “Interesting web scraping problem”

  1. In case anyone comes to this later with the same issue, I solved it eventually by compiling my own version of phantomjs, with the code for checking the HttpOnly flag commented out in CookieJarQt.cpp

    • Yes. It uses JSDOM which is too strict at HTML parsing, and is also very slow.

      I like the look of the API a bit more than what I’m doing though, so I may steal some ideas from it in the future.

