Interesting web scraping problem

Since PhantomJS has no API for accessing cookies, my last post showed how we extract them by reading document.cookie so that we can transfer them to Node/request.

One problem we recently found with this scheme is that the browser doesn’t expose cookies that have the HttpOnly flag set to scripts running in the page, so they never show up in document.cookie. Awesomely, in anticipation of this problem, a new cookies API has been added to the git source: https://github.com/ariya/phantomjs/pull/268
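To make the gap concrete, here is a minimal Node sketch of which cookies page script can see compared with what the server actually set. It is not PhantomJS code: `parseSetCookie`, `scriptVisibleCookieString`, `fullCookieHeader` and the sample header values are all made up for illustration, and the parsing ignores domain, path and expiry matching.

```javascript
// Hypothetical helper: parse one Set-Cookie header value into a small object.
// Only the name, value and HttpOnly attribute are kept; everything else is ignored.
function parseSetCookie(header) {
  const parts = header.split(";").map((p) => p.trim());
  const pair = parts[0];
  const eq = pair.indexOf("=");
  const name = eq === -1 ? "" : pair.slice(0, eq);
  const value = eq === -1 ? pair : pair.slice(eq + 1);
  // Attribute names are case-insensitive, so compare in lower case.
  const attrs = parts.slice(1).map((a) => a.toLowerCase());
  return { name, value, httpOnly: attrs.includes("httponly") };
}

// Roughly what document.cookie would return: HttpOnly cookies are left out.
function scriptVisibleCookieString(cookies) {
  return cookies
    .filter((c) => !c.httpOnly)
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
}

// What a Cookie request header needs in order to replay the session: every cookie.
function fullCookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join("; ");
}

// Made-up example headers: a session cookie marked HttpOnly and a plain one.
const headers = [
  "sessionid=abc123; Path=/; HttpOnly",
  "theme=dark; Path=/",
];
const jar = headers.map(parseSetCookie);

console.log(scriptVisibleCookieString(jar)); // theme=dark
console.log(fullCookieHeader(jar));          // sessionid=abc123; theme=dark
```

In this example the session cookie is exactly the one that goes missing, which is why requests replayed from Node with only the script-visible string can appear logged out.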

Just something to be aware of: document.cookie doesn’t tell you the whole truth.

3 thoughts on “Interesting web scraping problem”

  1. In case anyone comes to this later with the same issue, I eventually solved it by compiling my own version of phantomjs, with the code that checks the HttpOnly flag commented out in CookieJarQt.cpp.

    • Yes. It uses JSDOM which is too strict at HTML parsing, and is also very slow.

      I like the look of the API a bit more than what I’m doing though, so I may steal some ideas from it in the future.
