School project from subject Information Retrieval
Planned architecture:
Spravit si zakladny prehlad extraktivnej sumarizacie a porozmyslat ako by sa tam dal zapojit anchor. Pripadne pozriet ako pouzit ine explanatory
how to run pyspark shell with databricks spark-xml lib
pyspark --packages com.databricks:spark-xml_2.12:0.10.0
works when environment variables are set correctly: https://phoenixnap.com/kb/install-spark-on-ubuntu
to submit app to spark with spark-xml run:
spark-submit --packages com.databricks:spark-xml_2.12:0.10.0 example.py
mind the order, first type in packages, then file
Run master process
start-master.sh
Run random worker process
start-slave.sh spark://tomasmizera:7078 -c 4 -m 4G -d raw-data/spark-logs
Submit parser:
spark-submit --packages com.databricks:spark-xml_2.12:0.10.0 sections-parser.py <input-file> <output-folder> --master-addr <master-address> --spark <spark executor uri> --app-name <application name>
Navigate to index/elastickibana and run docker-compose up
if error
Exit 78
occurs, setsudo sysctl -w vm.max_map_count=262144