High Level Problem Statement
WWW, for over 20 years now, has catered to the literati. Although the promise of the Web is its multimedia content, which can enable anyone - literate or not - to publish on the Web, it remains to be seen how effectively someone who is not literate can make sense of what is written on a Web page, or how someone who does not understand the language of the author/page can make sense of the material on the page.
We call this Alipi, or in full, "Web Accessibility for Print/Text Impaired". Alipi, pronounced A-lipi, is the name of this collaborative project and of a model for a Web framework.
Let G be a directed graph where the nodes are documents that exist on the web. There is an edge from d1 to d2 with a label L, iff d2 is related to d1 in the sense described by label L. Strictly speaking, d2 and d1 could reference the same URI-accessible document, but d2 could be a transformation of d1.
For example, d2 could be a re-rendering of d1 that is WAI-accessible to someone with color-blindness, or d2 could be accessible to vision-impaired people.
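The labelled graph G above can be sketched as a small data structure. This is a minimal illustration only, assuming string URIs for the document nodes and free-form strings (like the label below) for L:

```python
# A minimal sketch of the graph G, assuming string URIs for documents and
# free-form string labels for L. The URIs and label below are illustrative.
from collections import defaultdict

class RenarrationGraph:
    def __init__(self):
        # edges[d1] holds (d2, label) pairs: an edge d1 -> d2 labelled L
        self.edges = defaultdict(list)

    def add_relation(self, d1, d2, label):
        """Record that d2 is related to d1 in the sense described by label."""
        self.edges[d1].append((d2, label))

    def related(self, d1, label=None):
        """Documents related to d1, optionally restricted to one label L."""
        return [d2 for d2, lbl in self.edges[d1] if label is None or lbl == label]

g = RenarrationGraph()
g.add_relation("http://example.org/d1",
               "http://example.org/d2",
               "colorblind-accessible")
```

Everything that follows - discovery, crawling, search - is, at bottom, about populating and querying a structure of this shape.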
WAI concerns itself with generating relatedness, not with identifying it; i.e. the standard effectively makes it possible to generate d2 given d1. This kind of relatedness is primarily presentational (and thus implicitly semantically related in a somewhat obvious way).
Alipi concerns itself with more generic semantic relatedness of documents, and it concerns itself with identifying relatedness as well as generating it: given a document d1, it is interested in finding (either by identifying an existing one, or by generating one) a d2 that is related to d1 in the sense of L.
This is a really hard problem to solve efficiently for different notions of L-relatedness. Given a document d1, how will the set of L-related documents be discovered? Will they be generated (ex: machine translation across languages)? Or will they be fetched based on existing semantic markup on d1? Or will they be fetched based on existing semantic markup on the d2's? Or will a document repository (ex: the web) be crawled to identify the set of L-related documents? If so, given a candidate document d2, what metrics will be used to determine if d1 and d2 are sufficiently closely L-related? Clearly, different domains and applications will require different standards of L-relationship distance between d1 and d2.
In light of the previous discussion, and to avoid getting lost in an overgeneralized problem, Alipi is going to focus on a set of different projects in specific subdomains where L is well-defined, and it is going to specify standards that enable either the identification or the generation of L-related documents. For example, it could specify semantic markup that the publisher of a document d1 can add, enabling the development of a browser extension that generates an L-related document d2.
The rest of this document is going to develop a couple such proposals.
This project is going to address documents that live on the WWW.
Given a document d1, we are going to focus on discovering a set of documents {d2} that are essentially re-narrations of d1. The notion of what constitutes a renarration (i.e. the L-relation above) is deliberately left unspecified, because these relationships are not generated algorithmically. They are left to humans to generate based on their subjective interpretations.
NOTE: The 'Google approach' subtitle is used rather loosely only to make it easier to see the similarities and to juxtapose it with the other approach below.
Given d1, the discovery of d2's is going to be based on markup on the target documents. Put another way, if we pick two documents d1 and d2 and draw an arrow between them that establishes their L-relatedness, the arrow will be directed from d2 to d1 to signify that d2 is responsible for specifying all documents d1 that it is a renarration of. This could be done via semantic tags in the header of d2. Towards this end, the project will provide a tagging specification that page authors can use to mark up the renarration web. This is somewhat similar to the HTML anchor tag (<a>), which specifies a directed relationship between web documents.
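The tagging specification does not exist yet, so the shape of these header tags is an open design question. As one hypothetical possibility, a renarration d2 could declare its target via a <link rel="renarration"> element (an invented tag for illustration, not part of any standard), which a client could extract with Python's standard-library HTML parser:

```python
# Hypothetical header markup: a <link rel="renarration"> element (invented
# for illustration; not part of any standard) by which d2 names the d1 it
# renarrates. The "type" attribute is likewise an assumption.
from html.parser import HTMLParser

class RenarrationLinkParser(HTMLParser):
    """Collect (href, type) pairs from renarration <link> tags in a page."""
    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "renarration":
            self.targets.append((a.get("href"), a.get("type")))

d2_html = """<html><head>
<link rel="renarration" href="http://example.org/d1" type="kn-translation">
</head><body>A Kannada renarration of d1.</body></html>"""

parser = RenarrationLinkParser()
parser.feed(d2_html)
```

Note that, like the <a> tag, the declaration lives entirely on d2's side; d1 need not know its renarrations exist.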
So, given d1, discovery of d2's requires the construction of a web crawler that crawls WWW documents and builds a directed graph of the renarration web based on this semantic markup. At this generic level, you can conceive of building a renarration search engine based on the renarration document graph built by the crawler. We can now imagine a host of related problems that would need to be solved to make this a reality. It is possible to imagine cases where there are multiple available renarrations of a document. For example, given an English document, there could be a Kannada renarration and a Hindi renarration; so the renarration semantic markup tags should be rich enough to specify the type of renarration that a document provides. As another example, given the judgment for a popular legal case, there could be several summaries and explanatory articles that explain in lay language, and in far fewer words, what the judgment is and what its various implications are. All these articles could be renarrations of the legal judgment, and this requires a different kind of renarration markup compared to the language translations above.
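To make the crawler idea concrete, here is a toy sketch that crawls an in-memory "web" instead of fetching real pages, inverting the hypothetical rel="renarration" declarations into a graph from each d1 to its known renarrations. The tag format, URLs, and corpus are all illustrative assumptions:

```python
# A toy crawler over an in-memory "web": each page may declare, via the
# hypothetical rel="renarration" link tag, which document it renarrates.
# The crawler inverts those declarations into a graph from each d1 to its
# known renarrations. URLs, corpus, and tag format are all illustrative.
import re

LINK_RE = re.compile(
    r'<link rel="renarration" href="([^"]+)" type="([^"]+)">')

corpus = {
    "http://ex.org/d2-kn":
        '<head><link rel="renarration" href="http://ex.org/d1" '
        'type="kn-translation"></head>',
    "http://ex.org/d2-sum":
        '<head><link rel="renarration" href="http://ex.org/d1" '
        'type="lay-summary"></head>',
    "http://ex.org/d1": "<head></head>",  # d1 itself carries no markup
}

def crawl(corpus):
    """Build the renarration graph: d1 -> [(d2, renarration type), ...]."""
    graph = {}
    for url, html in corpus.items():
        for d1, rtype in LINK_RE.findall(html):
            graph.setdefault(d1, []).append((url, rtype))
    return graph

renarration_graph = crawl(corpus)
```

The essential step is the inversion: markup points from d2 back to d1, but searchers start from d1, so the crawler must reverse every arrow it finds.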
Here again, multiple approaches exist. The search engine responsible for serving up renarrations could provide a set of search options as rich as the semantic complexity (in the sense of possibilities) of the renarration markup allows; i.e. if the semantic markup lets me specify that this is a language-translation renarration, and specifically a Kannada language-translation renarration, then the search engine could provide an option where the searcher looks for a Kannada language-translation renarration.
Alternatively, the renarration web could be semantically flat like the HTML <a> tag, which only specifies a directional link between documents without conveying any additional information about the kind of link it is. This makes markup simple, but makes search engines complex -- Google is based on discovering and building implicit and explicit semantic information out of these flat hyperlinks.
Similarly, the renarration markup could be a single, semantically-flat meta-tag, which would then spawn a whole bunch of projects that approach the search problem differently. At one level, someone searching for renarrations of the legal judgment would be presented with every possible renarration that exists for it: multiple English summaries, Kannada translations of these summaries, original Kannada interpretations of the judgment, and so on. The burden of sifting through these results would then be on the searcher. Alternatively, the search engine would have to concern itself with identifying trust, relevance, importance, and all those good things, and suitably ordering search results. It might even require the searcher to encode his/her renarration preferences somewhere that the search engine is somehow privy to, which it could then use to filter and order search results. But this doesn't relieve the search engine of the burden of crawling, analyzing, and organizing the renarration web.
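The flat-markup trade-off can be sketched in a few lines. Here the graph records only that d2 renarrates d1, with no type information, so all filtering and ordering falls to the search side; the URLs and the suffix-based preference check are stand-ins for real language detection and a real preference-encoding mechanism:

```python
# Under the flat-markup scenario, the graph records only that d2 renarrates
# d1, with no type information; any filtering or ordering happens on the
# search side. The URLs and the suffix-based preference below are stand-ins
# for real language detection and user preference encoding.
flat_graph = {
    "http://ex.org/judgment": [
        "http://ex.org/summary-en",
        "http://ex.org/summary-kn",
        "http://ex.org/commentary-kn",
    ],
}

def search(graph, d1, score=lambda d2: 0):
    """Return every known renarration of d1, best-scoring first."""
    return sorted(graph.get(d1, []), key=score, reverse=True)

# A searcher who happens to prefer Kannada renarrations of the judgment
prefer_kn = lambda url: 1 if url.endswith("-kn") else 0
results = search(flat_graph, "http://ex.org/judgment", prefer_kn)
```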
NOTE: The 'social media approach' subtitle is used rather loosely only to make it easier to see the similarities and to juxtapose it with the other approach above.
Here, the renarration web is not embedded within the document web via markup. Instead, it sits as a layer on top of the document web. This layer could itself be a part of the web (or not), making its elements candidates for additional renarrations (similar to the scenario above). For example, this information is available via Facebook status updates, tweets, or Buzz updates; i.e. I might post a status update on Facebook that says d2 is a Kannada translation of d1. So, there is no markup on either d1 or d2. In this scenario, the renarration web search engine relies on analyzing status updates, tweets, etc., and building the renarration graph.
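A rough sketch of what that analysis might look like: renarration claims are mined from free-text status updates rather than from markup. The update phrasing and the pattern below are illustrative assumptions, not a real Facebook or Twitter format:

```python
# A rough sketch of the social-media route: renarration claims are mined
# from free-text status updates rather than from markup. The update
# phrasing and the pattern below are illustrative assumptions, not a
# real Facebook/Twitter format.
import re

CLAIM_RE = re.compile(
    r"(\S+) is a (\w[\w-]*) (?:translation|renarration|summary) of (\S+)")

def extract_claims(update):
    """Return (d2, kind, d1) triples asserted in one status update."""
    return CLAIM_RE.findall(update)

claims = extract_claims(
    "http://ex.org/d2 is a Kannada translation of http://ex.org/d1")
```

Real updates would of course be far messier than this pattern allows, which is exactly why the analysis burden lands on the search engine in this scenario.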
In an even more structured approach, you could conceive of a web application where users register renarrations along with annotations about those renarrations. This is somewhat like Wikipedia: registered users with different roles (editors, curators, users, etc.) are responsible for building a knowledge base of the renarration web.
Various other approaches are conceivable: a Mechanical Turk approach, a NewsRack approach, a Google Translate approach, etc.
In all these scenarios, there are two independent problems, each of which would have to be solved in its own right.
(a) Building (implicitly or explicitly) the renarration document graph -- via crawling, social-media analysis, Wikipedia-style curation, NewsRack-style filtering, Mechanical Turk, or however else.
(b) A search interface for discovering renarrations -- this can be done via a web application, browser plugins, etc. But this crucially relies on the availability of the renarration web from (a) above.