GATD: Get All The Data

GATD is a cloud-based system for managing and storing data streams. It was born out of a need to handle data generated by disparate sensors with varying data types, transmission protocols, and end-use goals.

GATD has three major design goals:

  1. Modularity. GATD is a loose collection of modules connected by unbounded queues and a database layer. Each module belongs to one block of the system, and a block can contain many modules. For example, the receiver block has one module that listens for UDP packets and another that listens for HTTP requests. This makes GATD easy to extend as requirements change and new sensors come online.

  2. Flexibility. GATD makes virtually no assumptions about the format, type, or content of incoming data. The only requirement is that a sensor must identify its data stream to the system so the data can be processed properly. Each data stream has a custom parser that knows how to interpret its data. The parser simply returns key-value pairs, with no restrictions on key names or value types. GATD is designed to adapt to the sensors, not vice versa.

  3. Timeliness. GATD is specifically designed to support real-time streaming applications where data comes in as it is generated and is sent out to interested clients immediately. Every component is optimized for this workflow. Additionally, all data is stored and can be retrieved and processed later if necessary.
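
To make the flexibility goal concrete, a parser for one data stream might look like the minimal sketch below. The payload layout (a 2-byte sensor ID followed by a 4-byte float), the function name `parse_temperature_packet`, and the key names are all hypothetical, and GATD's actual parser interface may differ.

```python
import struct


def parse_temperature_packet(raw):
    """Hypothetical parser: turn one raw payload into key-value pairs.

    Assumed layout (not taken from GATD): a big-endian unsigned 16-bit
    sensor ID followed by a 32-bit float temperature in degrees Celsius.
    """
    sensor_id, temperature = struct.unpack("!Hf", raw[:6])
    # Key names and value types are unrestricted, so a plain dict is enough.
    return {"sensor_id": sensor_id, "temperature_c": temperature}


if __name__ == "__main__":
    payload = struct.pack("!Hf", 7, 21.5)
    print(parse_temperature_packet(payload))
```

A different stream would get its own parser with its own keys; nothing else in the pipeline needs to know the raw format.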

Structure

diagram

The major blocks of GATD are as follows:

  • Receiver. The receiver accepts data from sensors and records all relevant metadata with the data before passing everything to the formatter.

  • Formatter. The formatter is a stateless block that converts raw data from sensors into key-value pairs. It calls the appropriate parser to interpret the raw data, then stores the resulting pairs in the database and passes them on to any streamers.

  • Streamer. The streamer block sends data to any interested clients. Clients register a query with a streamer and any matching packets are sent to the client.
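
The flow through the three blocks can be sketched in a few lines of Python 3. This is an illustration only, not GATD's code: in-process `queue.Queue` objects stand in for the RabbitMQ queues, a list stands in for the database, and the function names, the `_source` and `_time` metadata keys, and the equality-based query matching are all assumptions.

```python
import json
import queue
import time


def receive(raw, source_addr, out_queue):
    """Receiver: attach metadata to the raw data and pass it on."""
    out_queue.put({"raw": raw, "source": source_addr, "time": time.time()})


def parse_json(raw):
    """Hypothetical parser for a stream that sends JSON objects."""
    return json.loads(raw)


def format_next(in_queue, parser, storage, out_queue):
    """Formatter: parse one raw packet, store it, and forward it."""
    item = in_queue.get()
    packet = parser(item["raw"])
    packet["_source"] = item["source"]
    packet["_time"] = item["time"]
    storage.append(packet)  # stand-in for the database layer
    out_queue.put(packet)


def matches(query, packet):
    """True if every key in the query is present with an equal value."""
    return all(k in packet and packet[k] == v for k, v in query.items())


def stream_next(in_queue, clients):
    """Streamer: send one packet to every client whose query matches."""
    packet = in_queue.get()
    for query, deliver in clients:
        if matches(query, packet):
            deliver(packet)


if __name__ == "__main__":
    to_formatter, to_streamer = queue.Queue(), queue.Queue()  # unbounded
    storage, inbox = [], []
    clients = [({"room": "lab"}, inbox.append)]
    receive('{"room": "lab", "watts": 12.5}', "10.0.0.5", to_formatter)
    receive('{"room": "hall", "watts": 3.0}', "10.0.0.6", to_formatter)
    for _ in range(2):
        format_next(to_formatter, parse_json, storage, to_streamer)
        stream_next(to_streamer, clients)
    print(len(storage), "stored;", len(inbox), "streamed")
```

Every packet is stored, but only packets matching a registered query reach that client, which mirrors the split between the database path and the streaming path.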

Implementation

The current version of GATD is a research-oriented implementation designed for speed of development and ease of experimentation rather than performance. Most modules are written in Python, although the loose, modular approach means some are written in Node.js and C as well.

GATD uses RabbitMQ for the inter-module queues and MongoDB for data storage.

Requirements

  • Python 2.7.*
  • MongoDB
  • RabbitMQ
  • Node.js
  • tup

Installation

Ubuntu / RHEL

  1. Install MongoDB and RabbitMQ Server.

  2. Install dependencies.

    # Ubuntu
    sudo apt-get install python-pip git python-dev screen
    # RHEL
    sudo yum install python-pip git python-devel screen
    # Mac (MacPorts)
    sudo port install py27-pip git-core
    
  3. Set up the gatd user and check out the repository. You will also want to add yourself to the gatd group, then log out and back in for the change to take effect. You can probably skip this step on Mac.

    sudo adduser gatd
    cd /opt
    sudo git clone https://github.com/lab11/gatd.git
    sudo chown -R gatd:gatd gatd
    sudo chmod -R g+w gatd
    sudo usermod -a -G gatd <username>
    
  4. Copy the example GATD config file and set the necessary values. You will want to make sure any passwords set in the next steps are reflected in this file.

    cd /opt/gatd/config
    cp gatd.config.example gatd.config
    
  5. Configure MongoDB using the template config file in the mongo folder.

  6. Copy the config file to /etc/mongodb.conf.

    sudo cp /opt/gatd/mongo/mongodb.conf /etc/mongodb.conf
    
  7. Edit the config file with the port you want to use.

  8. Create a directory for the database.

    sudo mkdir -p /data/mongodb
    sudo chown mongodb:mongodb /data/mongodb
    
  9. Restart the MongoDB daemon.

    sudo service mongod restart
    
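  10. Add a GATD user to the Mongo database. The user name and password below are examples; choose your own and make sure they match the values in gatd.config.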

    mongo --port <mongo db port>
    use getallthedata
    db.createUser({
        user: "reportsUser",
        pwd: "12345678",
        roles: [
            { role: "dbAdmin", db: "getallthedata" }
        ]
    })
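
Once the user exists, a client needs a connection string that carries the same credentials. The sketch below builds one in the standard `mongodb://` URI form, percent-encoding the user name and password so that characters such as `@` or `/` do not break the URI. The helper name `mongo_uri` is hypothetical, and the code targets Python 3 (on Python 2.7, `quote_plus` lives in `urllib` instead).

```python
from urllib.parse import quote_plus


def mongo_uri(user, password, host, port, database="getallthedata"):
    """Build a MongoDB connection string with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


if __name__ == "__main__":
    # Example values from the step above; substitute your own port.
    print(mongo_uri("reportsUser", "12345678", "localhost", 27017))
```

The resulting string can be passed to a client such as `pymongo.MongoClient` to confirm that the credentials work.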
    
  11. Configure RabbitMQ using the config files in the rabbitmq folder.

  12. Copy the config files to /etc/rabbitmq.

    sudo cp /opt/gatd/rabbitmq/rabbitmq* /etc/rabbitmq
    
  13. Edit rabbitmq-gatd.config with the port you want to use.

  14. Restart the RabbitMQ server.

    sudo rabbitmqctl stop
    sudo service rabbitmq-server start
    
  15. Delete the default RabbitMQ user, create a GATD user, and set permissions.

    sudo rabbitmqctl delete_user guest
    sudo rabbitmqctl add_user gatd <password>
    sudo rabbitmqctl set_user_tags gatd administrator
    sudo rabbitmqctl set_permissions -p / gatd ".*" ".*" ".*"
    
  16. Set up the Python environment.

    sudo pip2 install virtualenv
    cd /opt/gatd
    virtualenv .
    source ./bin/activate
    pip2 install -r requirements.pip
    
  17. Set up the database in MongoDB.

    cd /opt/gatd/mongo
    ./init_mongo.py
    
  18. Run GATD by starting the receivers, the formatter, and the streamers as described in the following steps.

  19. Start the receivers.

    cd /opt/gatd/receiver
    ./run_receiver.sh
    
  20. Run the formatter.

    cd /opt/gatd/formatter
    ./run_formatter.sh
    
  21. Run the streamers.

    cd /opt/gatd/streamer
    ./run_streamer.sh
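
If a module fails to start, one quick check is whether MongoDB and RabbitMQ are accepting connections on the ports you configured. The following Python 3 sketch is not part of GATD; the helper name `port_open` is hypothetical, and the ports shown are the usual defaults for each service, which you should replace with the values from your own config files.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Default ports; substitute the ports set during installation.
    services = {"MongoDB": 27017, "RabbitMQ": 5672}
    for name, port in services.items():
        state = "reachable" if port_open("localhost", port) else "NOT reachable"
        print("{} on port {}: {}".format(name, port, state))
```

A reachable port only shows that something is listening; it does not verify the user names or passwords set earlier.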