A Microdata parser and extractor library for Ruby. It is based on the latest published version of the Microdata Specification, dated 5th April 2011.
Mida keeps RubyGems up-to-date with its latest version, so installing is as easy as:
gem install mida
Mida requires Nokogiri.
To use the command-line tool, supply it with the URLs or filenames that you would like parsed (by default each item is output as YAML):
mida http://lawrencewoodman.github.io/mida/news/
If you want to search for specific types, use the -t switch followed by a regular expression:
mida -t /person/i http://lawrencewoodman.github.io/mida/news/
For more information, see mida’s help:
mida -h
The following examples assume that you have required mida and open-uri.
All the Microdata is extracted from a page when a new Mida::Document instance is created.
To extract all the Microdata from a webpage:
url = 'http://example.com'
# Assign the block's result so doc stays in scope (on Ruby 3.0+, use URI.open).
doc = open(url) { |f| Mida::Document.new(f, url) }
The top-level Items will be held in an array accessible via doc.items.
To simply list all the top-level Items that have been found:
puts doc.items
If you want to search for an Item that has a specific itemtype/vocabulary, this can be done with the search method.
To return all the Items that use one of Google’s Review vocabularies:
doc.search(%r{http://data-vocabulary\.org.*?review.*?}i)
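To get a feel for what this pattern selects, the sketch below applies the same regular expression to a few itemtype strings in plain Ruby, without Mida. The URLs in ITEMTYPES and the helper matching_itemtypes are illustrative assumptions; this mirrors only the matching step, not how search is implemented.

```ruby
# Plain-Ruby sketch: which itemtype strings does the pattern select?
# The itemtype values below are made up for illustration.
REVIEW_PATTERN = %r{http://data-vocabulary\.org.*?review.*?}i

ITEMTYPES = [
  'http://data-vocabulary.org/Review',
  'http://data-vocabulary.org/Review-aggregate',
  'http://data-vocabulary.org/Person',
  'http://example.com/vocab/review'
].freeze

# Hypothetical helper: returns the itemtypes that match the pattern.
def matching_itemtypes(itemtypes, pattern)
  itemtypes.grep(pattern)
end

puts matching_itemtypes(ITEMTYPES, REVIEW_PATTERN)
```

The trailing i flag makes the match case-insensitive, so both Review and review would be selected, while itemtypes from other hosts are left out.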
Each Item is a Mida::Item instance and has four main methods of interest: type, vocabulary, properties and id.
To find out the itemtype of the Item:
puts doc.items.first.type
To find out the itemid of the Item:
puts doc.items.first.id
Properties are returned as a hash containing name/value pairs. The values will be an array of either String or Mida::Item instances.
To see the properties of the Item:
puts doc.items.first.properties
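Because values can themselves be Items, printing a whole tree takes a small recursive walk. The following is a minimal sketch in plain Ruby: SketchItem is a hypothetical stand-in that only mimics the type and properties readers described above, and the sample data is invented. With Mida loaded, the same walk would presumably apply to real Mida::Item instances.

```ruby
# Hypothetical stand-in for Mida::Item, exposing only #type and #properties.
SketchItem = Struct.new(:type, :properties)

# Returns indented lines describing an item and any nested items.
def describe_item(item, depth = 0)
  indent = '  ' * depth
  lines = ["#{indent}#{item.type}"]
  item.properties.each do |name, values|
    values.each do |value|
      if value.respond_to?(:properties)
        # Nested item: print the property name, then recurse one level deeper.
        lines << "#{indent}  #{name}:"
        lines.concat(describe_item(value, depth + 2))
      else
        lines << "#{indent}  #{name}: #{value}"
      end
    end
  end
  lines
end

rating = SketchItem.new('http://data-vocabulary.org/Rating', { 'value' => ['4'] })
review = SketchItem.new(
  'http://data-vocabulary.org/Review',
  { 'itemreviewed' => ['Romeo Pizza'], 'rating' => [rating] }
)

puts describe_item(review)
```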
Mida allows you to define vocabularies, so that input data can be constrained to match expected patterns. By default a generic vocabulary (Mida::GenericVocabulary) is registered, which will match against any itemtype with any number of properties.
If you want to specify a vocabulary, create a class derived from Mida::Vocabulary. As an example, the following describes a subset of Google’s Review vocabulary:
class Rating < Mida::Vocabulary
  itemtype %r{http://data-vocabulary.org/rating}i
  has_one 'best'
  has_one 'worst'
  has_one 'value'
end

class Review < Mida::Vocabulary
  itemtype %r{http://data-vocabulary.org/review}i
  has_one 'itemreviewed'
  has_one 'rating' do
    extract Rating, Mida::DataType::Text
  end
end
Creating a subclass of Mida::Vocabulary automatically registers the vocabulary.
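Ruby makes this kind of automatic registration possible through the inherited hook, which a class receives whenever it is subclassed. The sketch below shows the general technique with a hypothetical SketchVocabulary base class and registry; it is not Mida's source, and all names in it are assumptions.

```ruby
# Hypothetical base class showing registration via the Class#inherited hook.
class SketchVocabulary
  @registered = []

  class << self
    # Only meaningful on the base class, which owns the list.
    attr_reader :registered

    # Ruby calls this hook each time a subclass is defined.
    def inherited(subclass)
      super
      SketchVocabulary.registered << subclass
    end
  end
end

class SketchRating < SketchVocabulary; end
class SketchReview < SketchVocabulary; end

p SketchVocabulary.registered
```

No explicit register call is needed in the subclasses: defining the class is enough to add it to the list.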
Now if Mida is parsing some input and manages to match against the Review itemtype, it will only allow the specified properties and will reject any that don’t occur the correct number of times. It will also set Item#vocabulary accordingly, for example:
doc.items.first.vocabulary # => Review
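The effect of has_one and has_many can be pictured as a filter over the extracted properties. The sketch below is a plain-Ruby illustration under stated assumptions: the REVIEW_SPEC hash, the :one and :many symbols, the 'tag' property and filter_properties are all hypothetical, and :many is assumed to mean at least one value. It does not reflect how Mida implements validation internally.

```ruby
# Hypothetical cardinality spec; 'tag' is invented to show a :many property.
REVIEW_SPEC = { 'itemreviewed' => :one, 'rating' => :one, 'tag' => :many }.freeze

# Keeps only properties named in the spec whose number of values fits.
def filter_properties(properties, spec)
  properties.select do |name, values|
    case spec[name]
    when :one  then values.length == 1
    when :many then !values.empty?
    else false # property is not part of the vocabulary
    end
  end
end

raw = {
  'itemreviewed' => ['Romeo Pizza'],
  'rating' => ['4', '5'],         # two values where only one is allowed
  'tag' => ['pizza', 'italian'],
  'colour' => ['red']             # not in the spec
}

p filter_properties(raw, REVIEW_SPEC)
```

Here 'rating' is dropped for having too many values and 'colour' for not being part of the spec.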
If you want to include the properties of another vocabulary you can use include_vocabulary:
class Thing < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/thing}i
  has_one 'name', 'description'
end

class Book < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/book}i
  include_vocabulary Thing
  has_one 'title', 'author'
end

class Collection < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/collection}i
  has_many 'item' do
    extract Thing
  end
end
In the above, if you gave a Book as an item of Collection, this would be accepted because it includes the Thing vocabulary.
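One way to picture this acceptance rule: a vocabulary satisfies extract Thing if it is Thing itself or lists Thing among the vocabularies it includes. The sketch below models only that check in plain Ruby. SketchVocab, includes and acceptable? are hypothetical names rather than part of Mida, and following indirect inclusion is an assumption.

```ruby
# Hypothetical model of a vocabulary: a name plus the vocabularies it includes.
SketchVocab = Struct.new(:name, :includes) do
  # True if this vocabulary is `wanted` or includes it, directly or indirectly.
  def acceptable?(wanted)
    equal?(wanted) || includes.any? { |vocab| vocab.acceptable?(wanted) }
  end
end

THING = SketchVocab.new('Thing', [])
BOOK  = SketchVocab.new('Book', [THING])
FILM  = SketchVocab.new('Film', [])

puts BOOK.acceptable?(THING) # Book includes Thing
puts FILM.acceptable?(THING) # Film does not
```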
If you find a bug or want to make a feature request, please report it at the Mida project’s issue tracker on GitHub.
Copyright © 2011-2013 Lawrence Woodman <[email protected]>. This software is licensed under the MIT Licence. Please see the file, LICENCE.rdoc, for details.