This is the deliverable with the material and configuration needed to generate the complete platform with docker-compose. No external tools are required.
The images have to be built because the Airflow base image was extended to include Java, which is required by the spark_submit command. Other images, such as the HDFS ones, have also been extended and configured.
```bash
docker-compose up --build
```
Once all the services are up (check with `docker-compose ps`), or at least the Hadoop FS ones (the name node and the data node), run the script at the root of the project:
```bash
./upload_person_input.sh
```
That script removes the old data and re-uploads the input file to HDFS.
To use the graph generator, you have to add the input files `person_inputs.json` (the data) and `sdg_template.json` (the metadata) inside the `input_files` directory.
The graph generator is the script `graph_generator.py`, and it requires `dag_template.py`, which contains the common transformation operations (TaskFlow `@task` tasks). The generator takes the required fields from the template and, combined with the content of `dag_template.py`, generates a new file in the `airflow/dags/` folder with a dynamic name as provided in `sdg_template.json`.
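As a rough, minimal sketch of this idea (not the project's actual code; the field name `dag_name`, the `{{dag_name}}` placeholder, and the file locations are assumptions), the generation step can be pictured like this:

```python
import json
from pathlib import Path

# Hypothetical sketch of the template-driven DAG generation.
# The "dag_name" field and the "{{dag_name}}" placeholder are assumptions.

def generate_dag(template_path="input_files/sdg_template.json",
                 dag_template_path="dag_template.py",
                 output_dir="airflow/dags"):
    # Read the metadata that drives the generation.
    metadata = json.loads(Path(template_path).read_text())
    dag_name = metadata["dag_name"]  # dynamic name used for the output file

    # Read the common TaskFlow code and fill in the template fields.
    dag_code = Path(dag_template_path).read_text()
    dag_code = dag_code.replace("{{dag_name}}", dag_name)

    # Write the generated DAG where the Airflow scheduler will pick it up.
    output_file = Path(output_dir) / f"{dag_name}.py"
    output_file.write_text(dag_code)
    return output_file

if __name__ == "__main__":
    print(f"Generated {generate_dag()}")
```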
At Airflow runtime, the generated file requires the `spark_template.py` file located in the `airflow/include/` folder.
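For orientation, this is one possible shape of a generated DAG, assuming the Spark step is launched with the `SparkSubmitOperator` from the Apache Spark provider. The `dag_id`, the paths, the connection id, and the way `spark_template.py` is used (here, submitted as the Spark application) are assumptions, not the project's actual code:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


# Hypothetical shape of a generated DAG; dag_id, paths, and the connection id
# are placeholders for illustration only.
@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def person_graph_example():

    @task
    def prepare_input() -> str:
        # A common transformation step of the kind defined in dag_template.py.
        return "/user/airflow/person_inputs.json"  # assumed HDFS path

    spark_job = SparkSubmitOperator(
        task_id="spark_transform",
        # Assumed: spark_template.py is submitted as the Spark application,
        # which is why the Airflow image needs Java for spark_submit.
        application="/opt/airflow/include/spark_template.py",
        conn_id="spark_default",
    )

    prepare_input() >> spark_job


person_graph_example()
```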
Screenshots: Airflow autogenerated DAG, details 1 and 2.
To clean up afterwards, the following commands remove all Docker images and prune the stopped containers, dangling images, and unused volumes:

```bash
docker rmi -f $(docker images -a -q)
docker container prune -f && docker image prune -f && docker volume prune -f
```