Create a classifier that can take snippets of code and guess the programming language of the code.
After completing this assignment, you should understand:
- Feature extraction
- Classification
- The varied syntax of programming languages
After completing this assignment, you should be able to:
- Build a robust classifier
- A Git repo called programming-language-classifier containing at least:
README.md
file explaining how to run your project- a
requirements.txt
file - a suite of tests for your project
- Passing unit tests
- No PEP8 or Pyflakes warnings or errors
Option 1: Get code from the Computer Language Benchmarks Game. You can download their code directly. In the downloaded archive under benchmarksgame/bench
, you'll find many directories with short programs in them. Using the file extensions of these files, you should be able to find out what programming language they are.
Option 2: Scrape code from Rosetta Code. You will need to figure out how to scrape HTML and parse it. BeautifulSoup is your best bet for doing that.
Option 3: Get code from GitHub somehow. The specifics of this are left up to you.
You are allowed to use other code samples as well.
For your sanity, you only have to worry about the following languages:
- C (.gcc, .c)
- C#
- Common Lisp (.sbcl)
- Clojure
- Haskell
- Java
- JavaScript
- OCaml
- Perl
- PHP (.hack, .php)
- Python
- Ruby (.jruby, .yarv)
- Scala
- Scheme (.racket)
Feel more than free to add others!
Using your corpus, you should extract features for your classifier. Use whatever classifier engine that works best for you and that you can explain how it works.
Your initial classifier should be able to take a string containing code and return a guessed language for it. It is recommended you also have a method that returns the snippet's percentage chance for each language in a dict.
The test/
directory contains code snippets. The file test.csv
contains a list of the file names in the test
directory and the language of each snippet. Use this set of snippets to test your classifier. Do not use the test snippets for training your classifier.
This project should be laid out in accordance with the project layout from The Hacker's Guide to Python. It should have tests for things which can be tested. Your classifier should be able to be run with a small controlled corpus for testing.
Your project should also contain an IPython notebook that demonstrates use of your classifier.
In addition to the requirements from Normal Mode:
Create a runnable Python file that can classify a snippet in a text file, run like this:
guess_lang.py code-snippet.txt
where guess_lang.py
is whatever you name your program and code-snippet.txt
is any snippet. Your program should print out the language it thinks the snippet is.
To do this, you will likely want to either pre-parse your corpus and output it as features to load or save out your classifier for later use. Otherwise, you'll have to read your entire corpus every time you run the program. That's acceptable, but slow.
You may want to add some command-line flags to your program. You could allow people to choose the corpus, for example, or to get percentage chances instead of one language. To understand how to write a command-line program with arguments and flags, see the argparse module in the standard library.