This file describes how to deploy the analysis framework in production. Each analysis is covered in detail, including how to run, monitor, and terminate it, along with related requirements such as restrictions on machines, network, etc.
All analyses except install/dynamic analysis run Celery workers inside Docker containers. Because install/dynamic analysis needs sandboxing and system-call isolation, each package is analyzed in a separate container, and the Celery workers run natively, outside Docker containers. This adds significant overhead to the filesystem and the Docker daemon.
-
Dependency analysis
- input: list of packages from package managers (generated by crawl command)
- output: the metadata and dependency information for all packages
- desc: this analysis queries registries to get package metadata and installs packages to get their lists of dependencies. Do the following to deploy this analysis.
- pull the image
sudo docker pull malossscan/maloss
- for all the machines, create `maloss/main/config` from `maloss/main/config.tmpl` and customize it
  - comment out TRACING
  - customize CELERY_BROKER_URL and METADATA_DIR
- for all the machines, create and customize `maloss/main/.env`
  - PARALLELISM can be set to 2 * num_of_cpus
  - METADATA_DIR should be the same as in config
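For illustration, a worker's `maloss/main/.env` might look like the following; the values are placeholders for an 8-core machine, not project defaults:

```ini
# maloss/main/.env -- example values, adjust per machine
PARALLELISM=16                      # 2 * num_of_cpus on an 8-core machine
METADATA_DIR=/data/maloss/metadata  # must match METADATA_DIR in maloss/main/config
```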
- run rabbitmq on master node
cd main && sudo docker-compose --compatibility -f docker-compose-master.yml up -d && cd ..
- master node should have ports 5672, 15672, 25671 open for rabbitmq
- run maloss on worker nodes
cd main && sudo docker-compose --compatibility up -d && cd ..
- add jobs to rabbitmq on master node
cd main && python3 detector.py get_dep -i ../data/npmjs.with_stats.csv --native
-
Install analysis
- input: list of packages from package managers (generated by crawl command)
- output: the sysdig tracing files captured during the installation process
- desc: this analysis installs packages and uses sysdig to capture invoked system calls.
- pull the image
sudo docker pull malossscan/maloss
- for all the machines, create `maloss/main/config` from `maloss/main/config.tmpl` and customize it
  - customize CELERY_BROKER_URL and METADATA_DIR
- for all the machines, customize TRACEPATH in `maloss/sysdig/.env`, using a different path for each package manager
- run rabbitmq on master node
cd main && sudo docker-compose --compatibility -f docker-compose-master.yml up -d && cd ..
- master node should have ports 5672, 15672, 25671 open for rabbitmq
- run scheduler.py to start install jobs on worker nodes
sudo python3 scheduler.py start -p 7 -i 30 -s -u $USER
- add jobs to rabbitmq on master node
cd main && python3 detector.py install -i ../data/npmjs.with_stats.csv
- stop scheduler.py and clean up when needed
sudo python3 scheduler.py stop -s -u $USER
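The `TRACEPATH` setting referenced above lives in `maloss/sysdig/.env`; a machine dedicated to one package manager might use something like the following (the path is illustrative, not a project default):

```ini
# maloss/sysdig/.env -- example; use a distinct TRACEPATH per package manager
TRACEPATH=/data/maloss/sysdig/npmjs
```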
-
Dynamic analysis
- similar to Install analysis, except the command is
python3 detector.py dynamic -i ../data/npmjs.with_stats.csv
-
AstfilterLocal analysis
- similar to Dependency analysis, except that RESULT_DIR should be set and the command is
python3 detector.py astfilter_local -i ../data/npmjs.with_stats.csv
-
TaintLocal analysis
- similar to Dependency analysis, except that RESULT_DIR should be set and the command is
python3 detector.py taint_local
-
Astfilter analysis
- input: dependency graph of packages from package managers
- desc: this analysis runs astfilter jobs over the dependency graph on a Docker swarm cluster, orchestrated by Airflow. Do the following to deploy this analysis.
- init docker swarm cluster on master node and copy/log the swarm join command
sudo docker swarm init
- master node should have port 5555 open for flower, 8080 open for webserver
- master node should have TCP port 2377, TCP and UDP port 7946, UDP port 4789 open for docker swarm
- join the docker swarm cluster from all other nodes
docker swarm join --token $token $ip:$port
- worker nodes should have TCP port 2377, TCP and UDP port 7946, UDP port 4789 open for docker swarm
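The swarm ports listed above can be sanity-checked from a worker before joining. Below is a minimal sketch; the port list comes from the steps above, but the checking helper itself is an illustration, not part of the framework:

```python
import socket

# Ports Docker swarm needs open between master and worker nodes,
# as listed in the deployment steps above.
SWARM_PORTS = [
    ("tcp", 2377),  # cluster management communication
    ("tcp", 7946),  # node-to-node communication
    ("udp", 7946),  # node-to-node communication
    ("udp", 4789),  # overlay network (VXLAN) traffic
]

def tcp_port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    UDP ports cannot be probed reliably this way, so only the TCP
    entries of SWARM_PORTS can be checked with this helper.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `tcp_port_reachable(master_ip, 2377)` from a worker quickly confirms whether the swarm manager port is open before attempting `docker swarm join`.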
- create and customize
maloss/airflow/.env
- AIRFLOW__WEBSERVER__BASE_URL should point to webserver on master node
- customize AIRFLOW_DAGS, METADATA_FOLDER and RESULT_FOLDER
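For illustration, `maloss/airflow/.env` might contain values like these (host name and paths are placeholders):

```ini
# maloss/airflow/.env -- example values
AIRFLOW__WEBSERVER__BASE_URL=http://master-node:8080
AIRFLOW_DAGS=/data/maloss/airflow/dags
METADATA_FOLDER=/data/maloss/metadata
RESULT_FOLDER=/data/maloss/result
```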
- deploy the analysis by the following command
sudo bash -c "docker stack deploy --with-registry-auth -c <(docker-compose --compatibility -f docker-compose-CeleryExecutor.yml config) rubygems_astfilter"
-
Compare Ast analysis
- similar to Dependency analysis, except that the command to add jobs is
python3 detector.py compare -i ../data/rubygems.with_stats.popular.csv
-
Static analysis
- similar to Astfilter analysis, except for the environment variables and the service name
- Add analysis for a new language
  - Add a file `$LANG_analyzer.py` in `src/static_proxy`; inherit and implement `static_base.StaticAnalyzer`
  - Add an analyzer script for this language in the `astgen`, `astfilter`, `taint`, and `static` jobs
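To make the file layout concrete, a new-language analyzer could be sketched as below. Only the base-class name `static_base.StaticAnalyzer` comes from the text above; the stand-in base class and the `analyze` method name are assumptions, so check the real interface before implementing:

```python
# Sketch of src/static_proxy/$LANG_analyzer.py -- the stand-in base class
# and method names below are assumptions, not the project's real interface.

class StaticAnalyzer:
    """Stand-in for static_base.StaticAnalyzer."""
    def analyze(self, pkg_path):
        raise NotImplementedError

class LangAnalyzer(StaticAnalyzer):
    """Hypothetical analyzer for a new language $LANG."""
    def analyze(self, pkg_path):
        # Parse the sources under pkg_path, generate ASTs, and feed them
        # to the astgen/astfilter/taint/static jobs for this language.
        return {"pkg": pkg_path, "findings": []}
```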
- Add analysis for a package manager
  - Add a file `$PACKAGE_MANAGER.py` in `src/pm_proxy`; inherit and implement `pm_base.PackageManagerProxy`
  - Add a crawler for this package manager in the `crawl` job
  - Add an analyzer script for this package manager in the `get_dep`, `build_dep`, `split_graph`, `install`, and `dynamic` jobs
  - Steps
    - Run the package crawler `crawl` to collect all packages of this package manager
    - Run the dependency analysis `get_dep` to get dependencies for all packages
    - Build the dep graph for packages using `build_dep` and `split_graph`
    - Run the necessary analyses, such as `install`, `dynamic`, `astfilter`, and `static`
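Similarly, a new package manager proxy could be sketched as below. Only the base-class name `pm_base.PackageManagerProxy` comes from the text above; the stand-in base class and the `crawl`/`get_dep` method names are assumptions chosen to mirror the job names, so check the real interface before implementing:

```python
# Sketch of src/pm_proxy/$PACKAGE_MANAGER.py -- the stand-in base class
# and method names below are assumptions, not the project's real interface.

class PackageManagerProxy:
    """Stand-in for pm_base.PackageManagerProxy."""
    def crawl(self):
        raise NotImplementedError
    def get_dep(self, pkg_name):
        raise NotImplementedError

class NewPackageManagerProxy(PackageManagerProxy):
    """Hypothetical proxy for a new package manager."""
    def crawl(self):
        # Query the registry index and return the list of package names.
        return []
    def get_dep(self, pkg_name):
        # Query registry metadata (or install the package) to resolve
        # its dependency list.
        return {"pkg": pkg_name, "deps": []}
```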
- Add analysis for popular packages in a package manager
  - Add selection of popular packages in `select_pkg`
  - Steps
    - Build the dep graph for popular packages using `split_graph` with the flag `seedfile` to separate the subgraph from the whole dep graph
    - Run the necessary analyses, such as `install`, `dynamic`, `astfilter`, and `static`
- Add analysis for versions of popular packages in a package manager
  - Add collection of versions in `get_versions`
  - Steps
    - Run `get_versions` to get the major versions of popular packages
    - Run the dependency analysis `get_dep` to get dependencies for all versions of packages
    - Build the dep graph for popular packages using `build_dep` and `split_graph`
    - Run the necessary analyses, such as `install`, `dynamic`, `astfilter`, `static`, and `compare`
- Add metadata analysis for a package manager
  - Add author information retrieval in `get_author`
  - Add author package graph building in `build_author`
  - Optionally add hash comparison of the same packages across different package managers in `compare_hash`
  - Steps
    - Run `edit_dist` to get packages that typosquat popular packages
    - Run `get_author` to fetch author information
    - Run `build_author` to get author package relationships
    - Run `compare_hash` to identify packages with different API usage among different package managers