Discussion
Suggestions, feature requests and discussion go here.
Things have been going well. WhatWeb 0.4.5 is a good, stable tool and has earned community recognition.
We've been tearing webpages apart and fingerprinting them piece by piece. We've built plugins for many web applications, client side libraries and HTML elements, but now we have a few important issues to consider regarding WhatWeb's direction.
Design Philosophy
- Always use an intuitive interface. Never force a user to choose an option when a default is better. The following command must always work:
./whatweb slashdot.org
- Never take choices away from the user. Each automatic decision should be a default for a configurable option. Example: following redirects.
- Avoid premature over-engineering. Do not implement core code to handle types of information that few plugins currently return. Allow plugins to return the information in generic formats such as :string instead. Wait until many plugins are returning the same type of information, such as operating system, filepaths, versions, or modules, before considering how to solve this problem in the core. Premature over-engineering is the type of error that kills a project.
- When a solution to a problem is inelegant, do not implement it in WhatWeb. Instead, continue to meditate on the problem for as long as required. If you need a fast solution, hack up your own version of WhatWeb and do not introduce the patch into the core. I have done this many times.
- WhatWeb must grow horizontally and vertically together. WhatWeb must be good at solving a type of problem before entering a new area. For example, WhatWeb must be competent at identifying a system before it starts becoming good at identifying versions of systems. If WhatWeb is known to be patchy in its coverage, this could kill the project. This is the rationale behind not implementing security checks yet. This also fits the Unix philosophy of doing one thing and doing it really well.
- Breaking backwards compatibility is OK.
Multi-App plugins
We're at a fork in the road on this one. On one side, we can fingerprint each application individually and write a plugin for each one. On the other, we can incorporate many different applications of the same type into one plugin, for example all third-party JavaScript libraries.
bcoles: This isn't really a design issue so much. I'm in favor of categorizing plugins rather than combining multiple applications into a single plugin. I'd rather see output Google-Analytics[713526426] than Third-Party-Library[Google-Analytics[713526426]]
Exceptions:
An exception would be fingerprinting generic web apps: admin panels or web backdoors, for example. These are applications which can only be fingerprinted generically using subtle clues, such as "/admin/", "/login/" or "?cmd=" in the URL. Such a clue doesn't necessarily mean that an admin panel or backdoor is present, but it's a good indication.
It is also acceptable to write plugins which return different models for hardware. It is not feasible to write a different plugin for every model.
Output becomes a wall of text
We now have numerous plugins which return a file path from the source of an HTML element. For example,
- Meta-Refresh
- Meta-Author
- Meta-Generator
- Redirect-Location
- Frame
- RSS-Feed
- Mailto
- Title
- Script
- Shortcut Icon
These types of plugins are great for plugin development, data mining or noticing patterns across networks.
The problem is that the WhatWeb output becomes a massive wall of text, even in --log-brief mode.
One way around this is by putting these types of plugins in a "plugin development" category and allowing the user to enable/disable certain categories.
For now most of these plugins are in the "plugins-disabled" directory.
One solution is a new output format combined with plugin categories (see below).
Should plugins be categorized? If so, should they be layered?
Aung Khant: It would be great if WhatWeb supported scanning by categories in the future.
- Server
- Language
- Program
- Third Party Library
or
- HTML Elements
- Program
- Vendor
- Server
- Development
- Config/Log files
or
- HTTP Server. Apache, Nginx
- Language. PHP, ASP, ASP.NET, ColdFusion
- Framework. Cake, Zend, Ruby on Rails (can you tell this from the language and CMS?)
- CMS/Blog. WordPress, Joomla, Drupal
- JS Library. Scriptaculous, Prototype, jQuery, Google Analytics
- Hardware devices. Xerox Printers, Cisco routers, D-link cameras
- Common. Title, Subdomains, Uncommon-headers, X-Powered-By, Mailto
- Hashes. Header-hash, footer-hash
I (Andrew) like the above categories best, but they are far from complete. The first categories break down nicely into an OSI-like set of layers. The 'hardware devices' category should be considered as covering all layers from the server to the JS library. The common category defines plugins that apply to all types of websites, not necessarily commonly found plugins. The hashes are kept separate from the common plugins because hashes are primarily used to discover common content after a scan, and a user may wish to disable these.
Here is a set of categories from builtwith.com:
- Ads
- Analytics
- Blog
- CDN
- CMS
- DocInfo
- Ecommerce
- Encoding (utf-8, big5)
- Feeds (feed types and feed providers)
- Framework (includes languages and frameworks)
- JS (javascript libraries, not including analytics)
- Media (Media provider such as youtube)
- Server
- Software (operating systems)
- Widgets
Here is a set of categories from Wappalyzer:
- CMS
- Message Boards
- Database managers
- Documentation tools
- Widgets
- Web shops
- Photo galleries
- Wikis
- Hosting panels
- Analytics
- Blogs
- JavaScript frameworks
- Issue trackers
- Video Players
- Comment Systems
- CAPTCHAs
- Font scripts
- Web frameworks
- Miscellaneous
- Editors
- LMS
- Web servers
- Cache tools
Some problems are:
- Encoding should be a plugin value, not a plugin
- Ecommerce has a lot of CMSs
- Blogs and CMSs have crossover, such as WordPress
- Client-Side fits into a lot of categories, but should probably be kept separate
Some notes are: The Analytics category could be included in JS but it's better for it to have its own category.
How should categories for plugins be defined?
- option 1) define the category within the plugin's file
- option 2) define the category by the directory it is within
- option 3) a list of tags within the plugin file
Option 2 works well for multiple categories when used with symlinks.
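As a rough illustration of option 2, the category can be read straight off the directory that holds the plugin file. This is only a sketch: the helper names (`category_for`, `select_categories`) and the plugin file names are hypothetical, and the directory names are borrowed from the categorization trial on this page.

```ruby
# Hypothetical helper: the category is the name of the directory that
# directly contains the plugin file (option 2).
def category_for(plugin_path)
  File.basename(File.dirname(plugin_path))
end

# Keep only the plugins whose category the user asked for.
# A plugin symlinked into several category directories is listed once
# per directory, so it is selected if any of its categories is wanted.
def select_categories(plugin_paths, wanted)
  plugin_paths.select { |path| wanted.include?(category_for(path)) }
end

# Example file names are made up for illustration.
paths = [
  'plugins/server/apache.rb',
  'plugins/web-app/wordpress.rb',
  'plugins/client-side/jquery.rb'
]

puts select_categories(paths, %w[server client-side])
```

With this approach no plugin file needs to change when it is re-categorized; moving or symlinking the file is enough.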
Categorization Trial
bcoles: I've trialled categorizing the plugins to determine the easiest way to implement directory-based categorization. I know I've dropped a few plugins in the wrong directories. If you want to meticulously re-categorize 500+ plugins then be my guest. Here's what I came up with: http://whatweb.net/plugins-categorized.zip
Categories:
client-side
framework
hardware
host
language
misc
server
vendor
web-app
Issues I faced while categorizing:
- About 50 plugins (~10%) are in ./plugins/misc category
- Should proxy servers go under ./plugins/misc or ./plugins/server? Should HTTP and Proxy servers be in separate directories?
- ./plugins/vendor could perhaps be split into ./plugins/hosting and ./plugins/third-party, however ./plugins/client-side libraries are often third-party resources as well.
- ./plugins/web-app is rather vague as a category. About 400 (~60%) plugins are in this category. Splitting this category results in duplication of content which is probably best solved with symlinks?
- Do we need sub-categories? Do we need more categories? Is that over specializing?
- Indecision - For example, where does WebDAV belong?
Suppress 404s
Users can just grep for 200 or -v 404
Follow frames
Many websites still use frames on intro pages. A --follow-frames option would allow WhatWeb to grab these URLs instead of being stuck trying to fingerprint an HTML frameset.
Should frames be followed by default? Should off-site frames be ignored, or should following them be a configurable option?
Andrew: this could be configured with --follow-frames off,on(on-site only),always
Is using on for on-site a bad choice? The alternative is onsite instead of on.
--follow-frames never,frame-only,iframe-only,same-site,same-domain,always (default: same-domain)
bcoles: It should function like the --follow-redirect option; that is:
--follow-frames=WHEN Control when to follow frames. WHEN may be `never',
`frame-only', `iframe-only', `same-site', `same-domain'
or `always'. Default: never
I'm undecided on whether never or same-site is the best default.
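To make the proposed WHEN values concrete, here is a minimal sketch of the decision they imply. The discussion does not define the modes precisely, so the semantics below are assumptions: `same-site` is taken to mean an identical hostname, `same-domain` is approximated by comparing the last two hostname labels (which is wrong for suffixes such as co.uk), and `frame-only`/`iframe-only` look only at the tag name. The function names are hypothetical.

```ruby
require 'uri'

FOLLOW_FRAMES_MODES = %w[never frame-only iframe-only same-site same-domain always].freeze

# Crude stand-in for a registered domain: the last two hostname labels.
# Assumption for illustration only; it mishandles suffixes like co.uk.
def naive_domain(host)
  host.to_s.downcase.split('.').last(2).join('.')
end

# Decide whether a frame should be followed.
# mode      - one of FOLLOW_FRAMES_MODES
# page_url  - URL of the page containing the frame
# frame_src - the frame's src attribute (may be relative)
# tag       - 'frame' or 'iframe'
def follow_frame?(mode, page_url, frame_src, tag)
  raise ArgumentError, "unknown mode: #{mode}" unless FOLLOW_FRAMES_MODES.include?(mode)

  page   = URI.parse(page_url)
  target = URI.join(page_url, frame_src) # resolves relative src values

  case mode
  when 'never'       then false
  when 'always'      then true
  when 'frame-only'  then tag == 'frame'
  when 'iframe-only' then tag == 'iframe'
  when 'same-site'   then page.host.to_s.downcase == target.host.to_s.downcase
  when 'same-domain' then naive_domain(page.host) == naive_domain(target.host)
  end
end

puts follow_frame?('same-site', 'http://www.example.com/', 'intro.html', 'frame')
```

Whichever default is chosen, keeping the decision in one function like this would leave every mode available as a user-selectable option.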
Types of authentication to potentially support:
- HTTP Basic Authentication - currently supported by --header
- HTTP Digest Authentication - currently supported by --header
- URL parameter with session token
- HTTP Cookies - currently supported by --header
- SSL Certificate Support
- HTTP Forms with passwords
Curl supports these and it might make sense for WhatWeb to copy curl's command line syntax.
One method, not necessarily a good one, is to load WhatWeb with username and password combinations which it will try whenever it discovers a password prompt.
Using HTTP authorization would be nice for fingerprinting devices with default credentials. This belongs in aggression level 5 which has not yet been implemented.
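For the HTTP Basic Authentication case listed above, the header itself is simple to build by hand: it is the Base64 encoding of `username:password`. The sketch below only produces the header string; the helper name `basic_auth_header` and the `admin`/`admin` credentials are illustrative assumptions.

```ruby
# Build an HTTP Basic Authentication header line.
# pack('m0') Base64-encodes the string without inserting line breaks.
def basic_auth_header(username, password)
  token = ["#{username}:#{password}"].pack('m0')
  "Authorization: Basic #{token}"
end

# Example default credentials, chosen only for illustration.
puts basic_auth_header('admin', 'admin')
# prints "Authorization: Basic YWRtaW46YWRtaW4="
```

The resulting string is the kind of value that would be supplied through --header.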
Aung Khant: Some frameworks issue a unique error response when they receive an invalid POST request:
:url_post=>'/', :post_data=>'null=null'
bcoles: POST can be achieved with custom Ruby, but POST request support would be worth adding. Support for OPTIONS requests may also be useful, for example for WebDAV.
Andrew: No. Not yet at least. I want good coverage of plugins to identify systems first including aggressive plugins to detect exact version numbers.
Plugins that test for vulnerabilities, if or when introduced, should be at a different aggression level, maybe 5. Exploiting full path disclosure, default credentials and weak access controls fit into this category.
According to the WhatWeb design philosophy: avoid premature over-engineering. Do not implement core code to handle types of information that few plugins currently return.
The following are candidates as data types for plugins to return (such as :version, :string, :firmware, etc.), as it may be useful to separate them from results in :string=>:
- :hostname=> - Internal host name - not widely used
- :ip=> - Used for internal IP addresses and the IP plugin - not widely used
- :mac=> - MAC address - not widely used
- :year=> - The age of an installation can often be roughly determined by the year(s) in copyright messages. Several plugins report the year.
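As an illustration of the :year case, the years in a copyright message can be pulled out with a single regular expression. This is a sketch in plain Ruby rather than plugin code; the pattern and the helper name `copyright_years` are assumptions, and real pages will contain formats it misses.

```ruby
# Matches "copyright", "&copy;" or the copyright sign, followed within a
# few non-digit characters by a four-digit year and an optional
# "-YYYY" range end. Illustrative pattern only.
COPYRIGHT_YEARS = /(?:&copy;|\u00A9|copyright)\D{0,20}?((?:19|20)\d{2})(?:\s*-\s*((?:19|20)\d{2}))?/i

# Return the sorted, de-duplicated years found in copyright messages.
def copyright_years(body)
  body.scan(COPYRIGHT_YEARS).flatten.compact.map(&:to_i).uniq.sort
end

p copyright_years('<p>Copyright &copy; 2008-2011 Example Corp</p>')
# prints [2008, 2011]
```

The earliest and latest years give a rough lower and upper bound on when the installation was set up or last updated.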
Add an option to save HTML files and headers, with an optional folder. How should the files be saved?
- option 1 (hostnames backwards by TLD, IPs forwards by octet): for login.yahoo.com and 208.51.4.1 you get com/yahoo/login/head and download/com/yahoo/login/body, 208/51/4/1/head and 208/51/4/1/body
- option 2 (URL, dots become -, every special character not allowed in a filename is converted to something?): login-yahoo-com_index-html.head and login-yahoo-com_index-html.body
- option 3 (MD5 hash of the URL, this is kind of brutal): 9e107d9d372bb6826bd81d3542a419d6.head and 9e107d9d372bb6826bd81d3542a419d6.body
- option 4 (URL encode every special character after the hostname. Should dots remain dots?): login.yahoo.com%2findex.html.head and login.yahoo.com%2findex.html.body
Thoughts:
- large sets - split the hostnames across directories (option 1)
- small sets - one directory for all hosts (keep the dots)
- URL encode every special character for the path
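The four naming options can be compared side by side with a short sketch. The function names are hypothetical, only IPv4 addresses are recognised, and for option 2 the replacement character for special characters is assumed to be `-` since the discussion leaves it open. Note that Ruby's encoder emits uppercase hex (`%2F`) where the example above shows `%2f`.

```ruby
require 'digest/md5'
require 'uri'

IPV4 = /\A\d{1,3}(?:\.\d{1,3}){3}\z/

# Option 1: hostname labels reversed (TLD first); IPv4 octets kept in order.
def option1_dir(host)
  labels = host.split('.')
  labels.reverse! unless host.match?(IPV4)
  labels.join('/')
end

# Option 2: dots and other special characters become '-' (assumed),
# host and path are joined with '_'.
def option2_name(url)
  uri  = URI.parse(url)
  host = uri.host.tr('.', '-')
  path = uri.path.sub(%r{\A/}, '').gsub(/[^A-Za-z0-9_-]/, '-')
  "#{host}_#{path}"
end

# Option 3: MD5 hash of the whole URL.
def option3_name(url)
  Digest::MD5.hexdigest(url)
end

# Option 4: hostname kept as-is, path percent-encoded.
def option4_name(url)
  uri = URI.parse(url)
  uri.host + URI.encode_www_form_component(uri.path)
end

url = 'http://login.yahoo.com/index.html'
puts option1_dir('login.yahoo.com') # com/yahoo/login
puts option1_dir('208.51.4.1')      # 208/51/4/1
puts option2_name(url)              # login-yahoo-com_index-html
puts option3_name(url)
puts option4_name(url)              # login.yahoo.com%2Findex.html
```

Options 2 and 3 lose information (different URLs can collide in option 2, and option 3 cannot be reversed), while options 1 and 4 keep the hostname readable on disk.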
There should also be options for saving to databases such as GridFS, etc.
This feature should provide a gentle introduction into custom usage of WhatWeb and eventually lead into plugin writing.
Aims of the feature:
Reduce the barrier to entry for custom searching with WhatWeb and remove the need for anyone to write this:
echo "\n\n" | netcat whatweb.net 80 | grep -Eo "<title>([^<]+)<\/title>"
For example:
$ ./whatweb --custom-plugin "{:string=>/<title>([^<]+)<\/title>/i}" whatweb.net
This option allows WhatWeb to act as a powerful, threaded, grep-powered platform for HTTP(S).
Unfortunately the --custom-plugin option needs to be escaped and in some cases, such as :regexp=>//, needs to be double-escaped as it is parsed directly from the command line. This results in a complicated and unintuitive command line argument.
Splitting each match method up into its own command line argument would help reduce the complexity:
option 1
--custom-plugin-text, --custom-plugin-regex
option 2
--find-text, --find-regex, --find-md5
option 3
--match-text, --match-regex, --match-md5
option 4
--grep-text, --grep-regex, --grep-md5
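Whichever naming is chosen, the three match types behave quite differently, and splitting them means each argument can be a plain string instead of an escaped Ruby hash. The sketch below shows one way the text, regex and MD5 matches could be evaluated against a response body. It is not WhatWeb's matching code; `custom_match` and its keyword arguments are hypothetical.

```ruby
require 'digest/md5'

# Evaluate up to three kinds of custom match against a response body
# and return the results as an array of strings.
#   text  - literal substring to look for
#   regex - Regexp; capture groups (or whole matches) are returned
#   md5   - hex MD5 digest the whole body must equal
def custom_match(body, text: nil, regex: nil, md5: nil)
  results = []
  results << text if text && body.include?(text)
  results.concat(body.scan(regex).flatten) if regex
  results << md5 if md5 && Digest::MD5.hexdigest(body) == md5
  results
end

body = '<html><head><title>WhatWeb</title></head></html>'

# Same pattern as the --custom-plugin example, with no shell escaping involved.
p custom_match(body, regex: /<title>([^<]+)<\/title>/i)
# prints ["WhatWeb"]
```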
Andre Gironda: I would love to see WhatWeb identify candidate insertion points for testing, especially marking insertion points that are user-controllable HTML element attributes
bcoles: any suggestions on how the results for candidates for insertion should be formatted?
Andre Gironda: ProxMon and Casaba Watcher tools do it right - they are open-source
bcoles: This could be achieved with a plugin. Something like:
- GET params: split base_uri by ? then &
  - Extract params from /base_uri[^'"]+\?([^=]+)=([^&]+)/
- POST params: The ./plugins-disabled/POST-Parameters.rb plugin exists for this purpose
- Elements: grep for the GET param values and extract the relevant HTML element type
  - Will most likely result in false positives unless non-default GET parameter values are sent
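The GET params step in bcoles' outline (split by ? then &) could look roughly like the sketch below. It uses plain string splitting as described, does no percent-decoding, and does not strip URL fragments; the helper name `get_params` and the example URL are hypothetical.

```ruby
# Split a URI into GET parameter name/value pairs using plain string
# splitting: first on '?', then on '&', then on '='.
def get_params(base_uri)
  _path, query = base_uri.split('?', 2)
  return {} if query.nil? || query.empty?

  query.split('&').each_with_object({}) do |pair, params|
    name, value = pair.split('=', 2)
    # Skip empty pairs such as a trailing '&'; a missing value becomes ''.
    params[name] = value.to_s unless name.nil? || name.empty?
  end
end

p get_params('http://example.com/page.php?id=1&cat=news')
# prints {"id"=>"1", "cat"=>"news"} (Ruby 3.4+ prints spaces around =>)
```

The returned values are what the Elements step would then grep for in the response body to find candidate insertion points.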