
1. Installation and Quick Start


Installation

Saffron requires a Java JDK (up to Java 11) and Apache Maven to run.

Both need to be installed before trying to run Saffron.
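You can confirm both are available on your PATH with the following commands (the exact version output depends on your installation):

# Print the Java version (Saffron supports up to Java 11)
java -version

# Print the Maven version and the JDK it uses
mvn -version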

Optionally, MongoDB can also be used to store the data. Without MongoDB, Saffron generates the results as JSON files. If MongoDB is used, it must be running when Saffron is run.

  • Install the latest version of MongoDB (use the default settings)
  • Open a terminal and start MongoDB: sudo systemctl start mongod (you can check that it is running with sudo systemctl status mongod)
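If you also want MongoDB to start automatically at boot, and to verify that the server accepts connections, a minimal check (assuming a systemd-based Linux and the mongosh shell that ships with recent MongoDB versions) is:

# Enable the MongoDB service at boot (optional)
sudo systemctl enable mongod

# Ping the local server to confirm it accepts connections
mongosh --eval "db.runCommand({ ping: 1 })"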
  1. Clone the latest Saffron repository using the command line: git clone https://github.com/insight-centre/saffron.git

Or download the saffron-master.zip and unzip it in the folder of your choice

  2. Open a terminal. To build Saffron and the dependencies it requires, use the following command:
mvn clean install
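If the build fails on unit tests in your environment, a standard Maven option (not specific to Saffron) is to skip them:

mvn clean install -DskipTests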

Running

Note 1: Saffron uses deep learning models for some of its modules, and these files can be quite big. You will need about 3 GB of free disk space to have Saffron fully installed with its models.

Note 2: Running the pipeline for the first time will download all the models Saffron needs, so the first run will take longer.

Using the Command Line

All steps of Saffron can be executed by running the saffron.sh script from your SAFFRON_HOME folder. The script takes three arguments:

  1. The corpus, which may be

    1. A folder containing files in TXT, DOC or PDF
    2. A zip, tar.gz or .tgz file containing files in TXT, DOC or PDF
    3. A JSON metadata file describing the corpus (see Saffron Formats for more details on the format of the file)
    4. A URL (to crawl the corpus from)
  2. The output folder to which the results are written

  3. The configuration file (as described in Saffron Formats).

    $ cd SAFFRON_HOME

    $ ./saffron.sh PATH_TO_CORPUS ~/PATH_TO_EXPERIMENT_OUTPUT/ ~/PATH_TO_CONFIG_FILE/config.json

In addition, some optional arguments can be specified:

-c <RunConfiguration$CorpusMethod>: The type of corpus to be used. One of CRAWL, JSON, ZIP (for a corpus given as a zip, tar.gz or .tgz file containing files in TXT, DOC or PDF). Defaults to JSON

-i <File> : The inclusion list of terms and relations (in JSON)
-k <RunConfiguration$KGMethod> : The method for knowledge graph construction, i.e. whether to generate a taxonomy or a knowledge graph. Choose between TAXO and KG. Defaults to KG
--domain : Limit the crawl to the domain of the seed URL (if using the CRAWL option for the corpus)

--max-pages <Integer> : The maximum number of pages to extract when crawling (if using the CRAWL option for the corpus)
--name <String> : The name of the run
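Combining several of these options, an invocation over a zipped corpus could look like the following (the corpus path, output folder and run name here are purely illustrative):

./saffron.sh ./my_corpus.zip ./web/data/output_zip ./examples/config.json -c ZIP --name my-first-run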

For example, try this test command:

./saffron.sh ./examples/presidential_speech_dataset/corpus_with_authors.json ./web/data/output_KG ./examples/config.json -k TAXO

and verify that you obtain the output JSON files in the ./web/data/output_KG folder
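From a terminal, this can be checked by listing the output folder:

# Expect files such as terms.json, doc-terms.json and taxonomy.json (see Results below)
ls ./web/data/output_KG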

More detail on Saffron, i.e. how to install it, how to configure the different features, and the approaches it is based on, can be found in the Wiki (https://github.com/insight-centre/saffron/wiki).

Note: After the last step, you may see the following text in the logs. This can be ignored and does not impact the analysis.

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Using the Web Interface (only until the taxonomy step)

  1. (optional) If you choose to use Mongo, install MongoDB (use the default settings)

    Start a session by typing mongod in a terminal. MongoDB has to be running while Saffron is used.

    The file ./saffron-web.sh contains some settings, such as the name given to the database and the host and port it will run on. If using Mongo and you want to change the database name (defaults to saffron_test), edit the file saffron-web.sh and change the line: export MONGO_DB_NAME=saffron_test

    To change the Mongo HOST and PORT, edit the same file on the following:

     export MONGO_URL=localhost
     export MONGO_PORT=27017
    
  2. To start the Saffron Web server, simply choose a directory for Saffron to create the models in and run the following command:

    ./saffron-web.sh

  3. Then open the following URL in a browser to access the Web Interface. All results from this pipeline (output JSON files) will be generated in the ./web/data/ folder.

    http://localhost:8080/
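    If you prefer to check from the terminal that the server is up before opening a browser, one simple request (assuming the default port above) is:

    # Should print an HTTP status code such as 200 once the server is ready
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/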

See 2.3. Using the Web Interface for more details on how to use the Web Interface.

Results

If the Web Interface is used and STORE_LOCAL_COPY was set to true, or Saffron was used with the command line, the following files are generated and stored in /web/data/ (see Saffron Formats for more details on each file and the different scores):

  • terms.json: The terms extracted
  • doc-terms.json: The document-term map (terms associated with each document)
  • author-terms.json: The connection between authors and terms
  • author-sim.json: The connection between each pair of authors
  • term-sim.json: The connection between each pair of terms
  • taxonomy.json: The final taxonomy extracted from the corpus (if the taxonomy extraction line is uncommented in the saffron.sh)
  • kg.json : The knowledge graph extracted from the corpus, as a JSON file
  • kg.rdf : The knowledge graph extracted from the corpus, as an RDF-XML file

To create a .dot file for the generated taxonomy, you can use the following command:

python taxonomy-to-dot.py taxonomy.json > taxonomy.dot
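The resulting .dot file can then be rendered to an image with Graphviz (assuming the dot tool is installed), for example:

# Render the taxonomy as a PNG image
dot -Tpng taxonomy.dot -o taxonomy.png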

Upgrading from version 3.3 to 3.4

If you have results from using Saffron version 3.3, you will need to do the following to make them compatible with version 3.4.

Before starting Saffron, edit the following file:

upgrade3.3To3.4.sh

and change the following configurations to reflect the database you want to upgrade:

export MONGO_URL=localhost
export MONGO_PORT=27017
export MONGO_DB_NAME=saffron_test

Run the script by executing:

./upgrade3.3To3.4.sh