ceteri edited this page Jun 30, 2012 · 28 revisions

Cascading for the Impatient

Welcome to Cascading for the Impatient, a series of blog posts and Cascading 2.0 code examples to get you started. Quickly. Like, yesterday.

Part 1
  • Implements the simplest Cascading app possible
  • Copies each TSV line from source tap to sink tap
  • Takes roughly a dozen lines of code
  • Physical plan: 1 Mapper
  • https://gist.github.com/2911686

Part 2
  • Implements a simple example of WordCount
  • Uses a regex to split the input text lines into a token stream
  • Generates a DOT file, to show the Cascading flow graphically
  • Physical plan: 1 Mapper, 1 Reducer
  • https://gist.github.com/3020297

Part 3
  • Uses a custom Function to scrub the token stream
  • Shows how to sort the output (ascending, based on token counts)
  • Physical plan: 1 Mapper, 1 Reducer
  • https://gist.github.com/3021655

Part 4
  • Shows how to use a HashJoin on two pipes
  • Filters a list of stop words out of the token stream
  • Physical plan: 2 Mappers, 2 Reducers

Part 5
  • Calculates TF-IDF using a custom Function
  • Shows how to use a SumBy and a CoGroup
  • Physical plan: 10 Mappers, 10 Reducers

Part 6
  • Includes unit tests in the build
  • Shows how to use other TDD features: checkpoints, assertions, traps, debug
  • Implements a switch to run the example in local mode (without Apache Hadoop)
  • Uses an R script to analyze/visualize the results
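
To make the token-stream steps above more concrete, here is a minimal sketch in plain Java of the logic that the WordCount examples describe: split lines with a regex, scrub each token, drop stop words, count, and sort ascending by count. It deliberately does not use the Cascading API; the class name `WordCountSketch`, the delimiter pattern, and the scrub rule (trim plus lowercase) are assumptions chosen for illustration, not the code from the gists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class WordCountSketch {
    // Assumed delimiter pattern: runs of spaces, brackets, parentheses, commas, periods.
    private static final String DELIMITERS = "[ \\[\\](),.]+";

    // Stands in for the "scrub" step: trim whitespace and lowercase the token.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Splits each line into tokens, scrubs them, skips empty tokens and
    // stop words, and counts the occurrences of every remaining token.
    static Map<String, Integer> count(List<String> lines, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String raw : line.split(DELIMITERS)) {
                String token = scrub(raw);
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Orders the (token, count) pairs ascending by count, breaking ties by token.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the series' data files.
        List<String> lines = Arrays.asList("Rain, rain (the shadow)", "The rain.");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        for (Map.Entry<String, Integer> e : sortedByCount(count(lines, stopWords))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the Cascading versions, the tokenizing and scrubbing would run on the map side and the counting on the reduce side, which is roughly what the "1 Mapper, 1 Reducer" physical plans reflect; this sketch collapses all of that into a single in-memory pass.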

For more detail about the Cascading API classes used in these examples, see the Cascading 2.0 User Guide and JavaDoc.

For more discussion, see the cascading-user email forum.
