noodle

Web service

Noodle can be used as both a web service and a node module. In each case, key/value objects are used as queries to fetch and extract data from web documents.

noodle currently supports multiple web document types (html, json, feeds, xml) with an almost uniform query syntax for grabbing data from each of them.

noodle is ready to run as a web service from bin/noodle-server.

Run the server

$ cd noodle
# or `cd node_modules/noodlejs` if installed via npm
$ bin/noodle-server
  Server running on port 8888

GET or POST

The server supports queries via both GET and POST.

GET

The query itself can be sent in the q parameter either as a url encoded JSON blob or as a querystring serialised representation (jQuery.param()).

noodle supports JSONP if a callback parameter is supplied.

GET http://example.noodlejs.com?q={JSONBLOB}&callback=foo
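As a rough illustration, a GET URL of this shape could be assembled as below. The helper name buildNoodleUrl is hypothetical and not part of noodle; the sketch only assumes what is stated above, namely that the query goes into q as a url encoded JSON blob and that callback is optional.

```javascript
// Hypothetical helper (not part of noodle): serialise a query into a GET URL.
function buildNoodleUrl(base, query, callback) {
  // The query travels as a URL-encoded JSON blob in the q parameter.
  var url = base + '?q=' + encodeURIComponent(JSON.stringify(query));
  // Supplying a callback name asks the server for a JSONP response.
  if (callback) {
    url += '&callback=' + encodeURIComponent(callback);
  }
  return url;
}

var exampleUrl = buildNoodleUrl('http://example.noodlejs.com', {
  url: 'http://chrisnewtn.com',
  selector: 'ul.social li a',
  extract: 'href'
}, 'foo');

console.log(exampleUrl);
```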

POST

noodle also supports a query sent as JSON in the POST body.

POST http://example.noodlejs.com

Rate limiting

The web service also provides rate limiting out of the box with connect-ratelimit.

Configuration

Server port

To specify which port the noodle web service listens on, pass it as the first argument to the binary.

$ bin/noodle-server 9000
 Server running on port 9000

Behaviour settings

Various noodle settings like cache and ratelimit settings are exposed and can be edited in lib/config.json.

{
  // Setting to true will log out information to the 
  // terminal

  "debug":                 true,

  "resultsCacheMaxTime":   3600000,
  "resultsCachePurgeTime": 60480000, // -1 will turn purging off
  "resultsCacheMaxSize":   124,

  "pageCacheMaxTime":      3600000,
  "pageCachePurgeTime":    60480000, // -1 will turn purging off
  "pageCacheMaxSize":      32,

  // If no query type option is supplied then 
  // what should noodle assume

  "defaultDocumentType":   "html",

  // How the noodle scraper identifies itself 
  // to scrape targets

  "userAgent":             "",

  // Rate limit settings
  // https://npmjs.org/package/connect-ratelimit#readme

  "rateLimit": {
    "whitelist": ["127.0.0.1", "localhost"],
    "blacklist": [],
    "categories": {
      "normal": {
        "totalRequests": 1000,
        "every":         3600000000
      },
      "whitelist": {
        "totalRequests": 10000,
        "every":         60000000
      },
      "blacklist": {
        "totalRequests": 0,
        "every":         0
      }
    }
  }
}

Query syntax

A simple query looks like this:

{
  "url": "http://chrisnewtn.com",
  "type": "html",
  "selector": "ul.social li a",
  "extract": "href",
}

This tells noodle to go to a friend’s website and to expect a html document, then to select the anchor elements in a list and extract the value of each one’s href attribute.

The type property tells noodle whether you want to scrape a html page, a json document and so on. If no type is specified then a html page is assumed by default.

A similar query can be constructed to extract information from a JSON document. JSONSelect is used as the underlying library to do this. It supports common CSS3 selector functionality. You can familiarize yourself with it in the JSONSelect documentation.

{
  "url": "https://search.twitter.com/search.json?q=friendship",
  "selector": ".results .from_user",
  "type": "json"
}

An extract property is not needed for a query on JSON documents, as json properties have no metadata and just a single value, whereas a html element can have text, inner html or an attribute like href.

Different types (html, json, feed & xml)

html

Note: Some xml documents can be parsed by noodle under the html type!

The html type is the only type to have the extract property. This is because the other types are converted to JSON.

The extract property (optional) could be the HTML element’s attribute but it is not required.

Having "html" or "innerHTML" as the extract value will return the containing HTML within that element.

Having "text" as the extract value will return only the text. noodle will strip out any new line characters found in the text.

Return data looks like this:

[
  {
      "results": [
        "http://twitter.com/chrisnewtn",
        "http://plus.google.com/u/0/111845796843095584341"
      ],
      "created": "2012-08-01T16:22:14.705Z"
  }
]

If no extract rule is given, noodle defaults to extracting "text" from the selected elements.

It is also possible to request multiple properties to extract in one query if one uses an array.

Query:

{
  "url": "http://chrisnewtn.com",
  "selector": "ul.social li a",
  "extract": ["href", "text"]
}

Response:

[
  {
    "results": [
      {
          "href": "http://twitter.com/chrisnewtn",
          "text": "Twitter"
      },
      {
          "href": "http://plus.google.com/u/0/111845796843095584341",
          "text": "Google+"
      }
    ],
    "created": "2012-08-01T16:23:41.913Z"
  }
]

In the query’s selector property use the standard CSS DOM selectors.

json and xml

The same rules apply from html to the json and xml types, except that the extract property should be omitted from queries, as the JSON node value(s) targeted by the selector are always assumed.

In the query’s selector property use JSONSelect style selectors.

feeds

The same rules apply as for the json and xml types: the extract property should be omitted from queries, as the JSON node value(s) targeted by the selector are always assumed.

In the query’s selector property use JSONSelect style selectors.

The feed type is based upon node-feedparser, so it supports the RSS, Atom, and RDF standards.

Familiarize yourself with its normalisation format before you use JSONSelect style selectors.

Getting the entire web document

If no selector is specified then the entire document is returned. This rule applies to all types of documents. The extract rule will be ignored if included.

Query:

{
  "url": "https://search.twitter.com/search.json?q=friendship"
}

Response:

[
  {
    "results": ["<full document contents>"],
    "created": "2012-10-24T15:37:29.796Z"
  }
]

Mapping a query to familiar properties

Queries can also be written in noodle’s map notation. The map notation allows for the results to be accessible by your own more helpful property names.

In the example below map is used to create a result object of a person and their repos.

{
    "url": "https://github.com/chrisnewtn",
    "type": "html",
    "map": {
        "person": {
            "selector": "span[itemprop=name]",
            "extract": "text"
        },
        "repos": {
            "selector": "li span.repo",
            "extract": "text"
        }
    }
}

With results looking like this:

[
    {
        "results": {
            "person": [
                "Chris Newton"
            ],
            "repos": [
                "cmd.js",
                "simplechat",
                "sitestatus",
                "jquery-async-uploader",
                "cmd-async-slides",
                "elsewhere",
                "pablo",
                "jsonpatch.js",
                "jquery.promises",
                "llamarama"
            ]
        },
        "created": "2013-03-25T15:38:01.918Z"
    }
]
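A short sketch of reading such a mapped response follows. The response literal is an abridged copy of the example result above, and the variable names are only illustrative.

```javascript
// Abridged copy of the mapped response shown above.
var response = [
  {
    results: {
      person: ['Chris Newton'],
      repos: ['cmd.js', 'simplechat', 'sitestatus']
    },
    created: '2013-03-25T15:38:01.918Z'
  }
];

// There is one result object per query; the names chosen in "map"
// become keys on its results object, each holding an array of matches.
var mapped = response[0].results;
var summary = mapped.person[0] + ' has ' + mapped.repos.length + ' repos listed';

console.log(summary);
```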

Getting hold of page headers

Within a query include the headers property with an array value listing the headers you wish to receive back as an object structure. 'all' may also be used as a value to return all of the server headers.

Header names are matched case-insensitively, and the returned property names match exactly the strings you requested.

Query:

{
  "url": "http://github.com",
  "headers": ["connection", "content-TYPE"]
}

Result:

[
  {
    "results": [...],
    "headers": {
      "connection": "keep-alive",
      "content-TYPE": "text/html"
    },
    "created": "2012-11-14T13:06:02.521Z"
  }
]
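The matching behaviour described above can be sketched as follows. pickHeaders is a hypothetical helper written for illustration, not noodle's own code, and the server headers passed in are made-up sample values.

```javascript
// Hypothetical helper (not noodle's implementation): select headers
// case-insensitively but return them under the names the caller used.
function pickHeaders(serverHeaders, requested) {
  // Index the server headers by lower-cased name.
  var byLowerName = {};
  Object.keys(serverHeaders).forEach(function (name) {
    byLowerName[name.toLowerCase()] = serverHeaders[name];
  });

  var picked = {};
  requested.forEach(function (name) {
    var value = byLowerName[name.toLowerCase()];
    if (value !== undefined) {
      picked[name] = value; // keep the caller's spelling
    }
  });
  return picked;
}

var picked = pickHeaders(
  { 'Connection': 'keep-alive', 'Content-Type': 'text/html', 'Server': 'example' },
  ['connection', 'content-TYPE']
);

console.log(picked);
```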

noodle provides a shortcut to the server Link header with the query linkHeader property set to true. Link headers are useful as some web APIs use them to expose their pagination.

The Link header will be parsed to an object structure. If you wish to have the Link header in its usual formatting then include it in the headers array instead.

Query:

{
  "url": "https://api.github.com/users/premasagar/starred",
  "type": "json",
  "selector": ".language",
  "headers": ["connection"],
  "linkHeader": true
}

Result:

[
  {
    "results": [
      "JavaScript",
      "Ruby",
      "JavaScript",
    ],
    "headers": {
      "connection": "keep-alive",
      "link": {
        "next": "https://api.github.com/users/premasagar/starred?page=2",
        "last": "https://api.github.com/users/premasagar/starred?page=21"
      }
    },
    "created": "2012-11-16T15:48:33.866Z"
  }
]
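For reference, a raw Link header has the form <url>; rel="name", with entries separated by commas. The sketch below shows one way such a header could be turned into the object structure above. It is an illustration only, not noodle's parser, and it assumes the URLs themselves contain no commas.

```javascript
// Illustrative parser (not noodle's own): raw Link header -> { rel: url }.
function parseLinkHeader(header) {
  var links = {};
  header.split(',').forEach(function (part) {
    // Each part looks like: <https://example.com/?page=2>; rel="next"
    var match = part.match(/<([^>]+)>\s*;\s*rel="?([^";]+)"?/);
    if (match) {
      links[match[2]] = match[1];
    }
  });
  return links;
}

var rawLink =
  '<https://api.github.com/users/premasagar/starred?page=2>; rel="next", ' +
  '<https://api.github.com/users/premasagar/starred?page=21>; rel="last"';

var links = parseLinkHeader(rawLink);
console.log(links.next);
```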

Querying to a POST url

noodle allows for post data to be passed along to the target web server specified in the url. This can be optionally done with the post property which takes an object map of the post data key/values.

{
  "url": "http://example.com/login.php",
  "post": {
    "username": "john",
    "password": "123"
  },
  "selector": "h1.username",
  "type": "html"
}

Take note, however, that queries with the post property will not be cached.

Querying without caching

If cache is set to false in your query then noodle will not cache the results or associated page and it will get the data fresh. This is useful for debugging.

{
  "url": "http://example.com",
  "selector": "h1",
  "cache": "false"
}

Query errors

noodle aims to give errors for the possible cases where a query does not yield any results.

Each error is specific to one result object and is contained in its error property as a string message.

Response:

[
  {
    "results": [],
    "error": "Document not found"
  }
]
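Because each error sits on its own result object, a caller can separate failed queries from successful ones. A minimal sketch, using a made-up two-item response in the shape shown above:

```javascript
// Sample response: the first query succeeded, the second failed.
var response = [
  { results: ['http://twitter.com/chrisnewtn'], created: '2012-08-01T16:22:14.705Z' },
  { results: [], error: 'Document not found' }
];

// A result object counts as failed when it carries an error message.
var succeeded = response.filter(function (item) { return !item.error; });
var failed = response.filter(function (item) { return Boolean(item.error); });

failed.forEach(function (item) {
  console.log('Query failed:', item.error);
});
```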

noodle also fails silently with the 'extract' property by omitting any missing extract results from the results object.

Consider the following JSON response to a partially incorrect query.

Query:

{
  "url": "http://chrisnewtn.com",
  "selector": "ul.social li a",
  "extract": ["href", "nonexistent"]
}

Response:

The “nonexistent” extract property is left out because it was not found on the element.

[
  {
    "results": [
      {
        "href": "http://twitter.com/chrisnewtn"
      },
      {
        "href": "http://plus.google.com/u/0/111845796843095584341"
      }
    ],
    "created": "2012-08-01T16:28:19.167Z"
  }
]
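The omission behaviour can be pictured with the sketch below. Plain objects stand in for the matched anchor elements, and extractProperties is a hypothetical helper, not noodle's implementation.

```javascript
// Hypothetical helper: copy only the requested properties that exist.
function extractProperties(element, wanted) {
  var result = {};
  wanted.forEach(function (name) {
    if (Object.prototype.hasOwnProperty.call(element, name)) {
      result[name] = element[name];
    }
    // Missing properties are skipped silently rather than reported.
  });
  return result;
}

// Plain objects standing in for the matched anchor elements.
var elements = [
  { href: 'http://twitter.com/chrisnewtn', text: 'Twitter' },
  { href: 'http://plus.google.com/u/0/111845796843095584341', text: 'Google+' }
];

var extracted = elements.map(function (element) {
  return extractProperties(element, ['href', 'nonexistent']);
});

console.log(extracted);
```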

Multiple queries

Multiple queries can be made per request to the server. You can mix between different types of queries in the same request as well as queries in the map notation.

Query:

[
  {
    "url": "http://chrisnewtn.com",
    "selector": "ul.social li a",
    "extract": ["text", "href"]
  },
  {
    "url": "http://premasagar.com",
    "selector": "#social_networks li a.url",
    "extract": "href"
  }
]

Response:

[
  {
    "results": [
      {
        "href": "http://twitter.com/chrisnewtn",
        "text": "Twitter"
      },
      {
        "href": "http://plus.google.com/u/0/111845796843095584341",
        "text": "Google+"
      }
    ],
    "created": "2012-08-01T16:23:41.913Z"
  },
  {
    "results": [
      "http://dharmafly.com/blog",
      "http://twitter.com/premasagar",
      "https://github.com/premasagar"
    ],
    "created": "2012-08-01T16:22:13.339Z"
  }
]

Proxy Support

When calling a page multiple times, some sites can and will ban your IP address. Proxy support allows IP addresses to be rotated between queries.

Query:

{
  "url": "http://chrisnewtn.com",
  "selector": "ul.social li a",
  "extract": ["text", "href"],
  "proxy": "XXX.XXX.XXX.XXX"
}
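One way to rotate addresses is to assign proxies to queries round-robin before sending them. The sketch below is an assumption about usage rather than a noodle feature: withRotatingProxy is a hypothetical helper and the addresses are placeholders from the documentation range 192.0.2.0/24.

```javascript
// Placeholder proxy addresses (documentation range, not real proxies).
var proxies = ['192.0.2.10', '192.0.2.11', '192.0.2.12'];

// Hypothetical helper: give each query the next proxy, wrapping around.
function withRotatingProxy(queries, proxyList) {
  return queries.map(function (query, index) {
    // Copy the query so the caller's objects are left untouched.
    return Object.assign({}, query, {
      proxy: proxyList[index % proxyList.length]
    });
  });
}

var baseQuery = { url: 'http://chrisnewtn.com', selector: 'ul.social li a' };
var rotated = withRotatingProxy(
  [baseQuery, baseQuery, baseQuery, baseQuery],
  proxies
);

console.log(rotated.map(function (query) { return query.proxy; }));
```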

Noodle as node module

Note: Since noodle’s internal cache uses an interval, this will keep the related node process running indefinitely. Be sure to run noodle.stopCache() in your code when you’re finished with noodle.

Methods

noodle.query

The main entry point to noodle’s functionality is the query method. This method accepts a query or an array of queries as its only parameter and returns a promise.

var noodle = require('noodlejs');
noodle.query(queries).then(function (results) {
  console.log(results);
});

The makeup of a query (or queries) is analogous to using noodle as a web service, as described above. The exception is that you supply a proper object and not JSON.

noodle.fetch

This method returns a promise which, upon resolution, hands over the requested web document.

noodle.fetch(url).then(function (page) {
  console.log(page);
});

noodle.html.select

For applying one query to a html string and retrieving the results.

noodle.html.select(html, {selector: 'title', extract: 'innerHTML'})
.then(function (result) {
  console.log(result);
});

noodle.json.select

For applying one query to a parsed JSON representation (object).

var parsed = JSON.parse(json);
noodle.json.select(parsed, {selector: '.name'})
.then(function (result) {
  console.log(result);
});

noodle.feed.select

Normalises an RSS, ATOM or RDF string with node-feedparser then proxies that normalised object to noodle.json.select.

noodle.xml.select

Proxies to noodle.json.select.

noodle events

noodle’s noodle.events namespace allows one to listen for emitted cache related events. Noodle inherits from node’s EventEmitter.

// Called when a page is cached
noodle.events.on('cache/page', function (obj) {
  //obj is the page cache object detailing the page, its headers
  //and when it was first cached
});

// Called when a result is cached
noodle.events.on('cache/result', function (obj) {
  //obj is the result cache object detailing the result and when
  //it was first cached
});

// Called when the cache is purged
noodle.events.on('cache/purge', function (arg1, arg2) {
  //arg1 is a javascript date representing when the cache was purged
  //arg2 is the time in milliseconds until the next cache purge
});

// Called when a cached item has expired from the cache
noodle.events.on('cache/expire', function (obj) {
  //obj is the cache item
});

Configuration

Configuration is possible programmatically via noodle.configure(obj).

This accepts a config object which can partly or fully represent the config options.

This object is applied over the existing config found in the config.json.

Example of changing just two settings:

var noodle = require('noodlejs');

// Do not display messages to the terminal and set
// the default document type to json

noodle.configure({
  debug: false,
  defaultDocumentType: "json"
});
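The idea of applying a partial object over an existing config can be sketched as below. This is an illustration only: applyConfig is a hypothetical helper, the defaults are a small subset of the settings in lib/config.json, and the merge is assumed to be shallow, which may not match how noodle treats nested settings such as rateLimit.

```javascript
// A small subset of the settings shown in lib/config.json.
var defaults = {
  debug: true,
  defaultDocumentType: 'html',
  resultsCacheMaxSize: 124
};

// Hypothetical helper: overlay the overrides on a copy of the existing
// config (shallow merge, assumed for this sketch).
function applyConfig(existing, overrides) {
  return Object.assign({}, existing, overrides);
}

var effective = applyConfig(defaults, {
  debug: false,
  defaultDocumentType: 'json'
});

// Settings that were not overridden keep their existing values.
console.log(effective);
```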

Error handling

Noodle will fire various errors which one can listen for with the fail() handler.

noodle.html.fetch(query)
.then(function (results) {
  console.log('The results are', results);
})
.fail(function (error) {
  console.log('Uh oh', error.message);
});

Possible errors

The noodle module itself emits only one error:

  • "Document not found" when a targeted url is not found.

The specific document type modules emit their own errors, which should bubble up to the main noodle.query method:

  • 'Could not parse XML to JSON'
  • 'Could not parse JSON document'
  • 'Could not match with that selector'
  • 'Could not match with that selector or extract value'

Caching

noodle includes an in memory cache for both queried pages and the query results to help with the speed of requests.

This cache can be configured in the noodlejs/lib/config.json file.

This cache is part of the noodle library core, not its web service.

Caching is done per individual query and not per set of queries in a request.

By default, individual items in the page cache and the results cache have a lifetime of an hour. The results cache holds at most 124 items in memory at one time (the page cache 32, going by the config shown earlier). A cache is also cleared entirely on a weekly basis.

These values can all be changed from noodle’s json config.

HTTP caching headers

The noodle web service includes an Expires header. This is always set to the oldest to expire query result in a result set.

Take note, however, that the browser may not cache POST requests to the noodle server.
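As a rough sketch of how such a value could be derived, the code below picks the expiry that comes first across a result set. It rests on two assumptions not confirmed by the text: that "oldest to expire" means the result that expires soonest, and that a result expires resultsCacheMaxTime milliseconds after its created time. earliestExpiry is a hypothetical helper.

```javascript
// One hour in milliseconds, as in the sample config.
var RESULTS_CACHE_MAX_TIME = 3600000;

// Hypothetical helper: the expiry date of the result that expires first,
// assuming expiry = created + maxTimeMs.
function earliestExpiry(results, maxTimeMs) {
  var expiries = results.map(function (result) {
    return Date.parse(result.created) + maxTimeMs;
  });
  return new Date(Math.min.apply(null, expiries));
}

var expires = earliestExpiry([
  { created: '2012-08-01T16:23:41.913Z' },
  { created: '2012-08-01T16:22:13.339Z' }
], RESULTS_CACHE_MAX_TIME);

// toUTCString() gives the HTTP date format used by the Expires header.
console.log(expires.toUTCString());
```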

Adding to noodle

noodle is an open-source project maintained on GitHub, so raising issues and forking is encouraged.

Supporting different web documents

By default noodle supports html, json, standard feeds and xml web documents. noodle also provides a concise environment for developers to write their own type modules; the only prior knowledge needed is of promises.

To add a type, create the script for that type in noodlejs/lib/types, with the file name being what one would type in a query.

$ touch noodlejs/lib/types/csv.js

As for the content of the script, a developer must expose at least 2 methods (_init & fetch) and is recommended to expose a select method. These methods must be written with a promise interface interoperable with the q library. It is recommended you just use q.

Required methods

exports._init = function (noodle) {}

This function is passed the main noodle library. You should keep hold of this reference so you can make use of some important noodle methods covered in a bit.

exports.fetch = function (url, query) {}

This method is the entry point to your module by noodle and possibly other developers. This is the function which leads to all of your processing.

Make use of noodle.cache.get to resolve your promise early with cached results without the need to fetch the page and process the query.

It is highly recommended you do not fetch the page yourself but use the core noodle.fetch since this handles page caching for you.

When you have the document, pass it and the query to your select function for processing.

function fetch (url, query) {
  var deferred = q.defer();
  if (noodle.cache.check(query)) {
    deferred.resolve(noodle.cache.get(query).value);
    return deferred.promise;
  } else {
    return noodle.fetch(url, query).then(function (page) {
      return select(page, query);
    });
  }
}

Recommended methods

exports.select = function (document, query) {}

This method is where you do your actual selecting of the data using the web document given from your fetch method via noodle.fetch.

In your algorithm do not account for multiple queries. This is done at a higher level by noodle which iterates over your type module.

It is also highly recommended that you cache your result. This is done simply by wrapping it in the noodle._wrapResults method.

deferred.resolve(noodle._wrapResults(results, query));

What defines query properties like extract or selector is what your own select function expects to find in the query object passed in. For example:

// Query
{
  "url": "http://example.com/data.csv",
  "type": "csv",
  "from": "row1",
  "to": "row10"
}

// Your interpretation
function select (document, query) {
  ...
  csvparser.slice(query.from, query.to);
  ...
}
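To make the csv idea more concrete, here is a self-contained sketch of what such a select function might do. Everything in it is hypothetical: rowIndex and sliceRows are invented helpers, the "rowN" labels are read as 1-based and inclusive, fields are split naively on commas, and a native Promise is used only so the sketch runs without dependencies. Inside a real type module you would use q and wrap the results with noodle._wrapResults as described above.

```javascript
// Hypothetical helper: turn a label such as "row3" into zero-based index 2.
function rowIndex(label) {
  return parseInt(String(label).replace(/^row/, ''), 10) - 1;
}

// Split the CSV text into rows and keep rows from..to (inclusive).
// Naive comma splitting; quoted fields are not handled in this sketch.
function sliceRows(csvText, fromLabel, toLabel) {
  var rows = csvText.trim().split('\n').map(function (line) {
    return line.split(',');
  });
  var from = rowIndex(fromLabel);
  var to = rowIndex(toLabel);
  if (isNaN(from) || isNaN(to) || from < 0 || from > to) {
    throw new Error('Could not match with that selector');
  }
  return rows.slice(from, to + 1);
}

// Promise-returning select in the shape a type module would expose.
// An error thrown by sliceRows rejects the promise.
function select(document, query) {
  return new Promise(function (resolve) {
    resolve(sliceRows(document, query.from, query.to));
  });
}

var csv = 'a,b\nc,d\ne,f\n';
select(csv, { from: 'row2', to: 'row3' }).then(function (rows) {
  console.log(rows);
});
```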

Example script

An example implementation could look like this:

var q      = require('q'),
    noodle = null;

exports._init = function (n) {
  noodle = n;
}

exports.fetch = function (url, query) {
  var deferred = q.defer();
  if (noodle.cache.check(query)) {
    deferred.resolve(noodle.cache.get(query).value);
    return deferred.promise;
  } else {
    return noodle.fetch(url).then(function (page) {
      return exports.select(page, query);
    });
  }
}

exports.select = function (page, query) {
  var deferred  = q.defer(),
      myResults = [];

  /* 
    your algorithm here, don't forget to
    deferred.resolve(noodle._wrapResults(myResults, query))
    or
    deferred.reject(new Error("Selector was bad or something like that"))
  */

  return deferred.promise;
}

Tests

The noodle tests create a temporary server on port 8889 which the automated tests tell noodle to query against.

To run tests you can use the provided binary from the noodle package root directory:

$ bin/tests