BDA: A Java repository from stefanhagiu

# Final project for Big Data Analytics 2017

P8: Publication tone (author confidence) detection. By tone we mean whether the authors describe their research results as outstanding, average, or “good-enough”.

Application:
Infer a publication's contribution from the confidence level with which its results are described in the paper. This is useful when correlating scientific impact (e.g. via bibliometric metrics such as times cited) with author self-confidence.

What is inside:

Some files are archived to conserve space - unzip them before running the project!

To run the project, first build the jar: run mvn package, which creates 2 jar files in the target directory.

The Pig script extractData.pig processes the initial XML files and retrieves from them the publication title, the publication content, etc. To run it, use:
```
pig -x mapreduce [path to] extractData.pig
```
To see the output, open src/main/resources/pmc_oa_processed_pig_example
The map-reduce job in src/main/java:
- Main class: AnlSemanticText, starts the map-reduce task.
- Mapper class: SemanticMapper, reads the files produced by Pig one line at a time, strips non-word characters from the input, and converts it to lower case. It outputs a key (holds the name of the publication, the owner, etc.) and a value (holds the sentiment score).
- Reducer class: AnlSemanticText, a simple reducer that just takes the value and pushes it to the output.
- SentiWordNet holds a simple implementation of the SentiWordNet library: for each word it checks the score for each part of speech of that word. If the word is not in the dictionary, 0 is returned, so the word has no influence on the score.
- FileProcessor takes the output of the map-reduce job and merges the data into JSON files. To use it, download the map-reduce result files from Hadoop and add the path to the files in FINAL_FINAL, file Test.java, line 18.
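The cleaning and scoring steps described for SemanticMapper and SentiWordNet can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class name ToneScoreSketch, its method names, the three-word lexicon, and the choice to sum word scores per line are all assumptions made for the example, and the Hadoop Mapper wiring is left out so the snippet runs on its own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the cleaning + lookup logic; not the project's classes.
final class ToneScoreSketch {
    // Assumed miniature lexicon; the real project reads scores from SentiWordNet.
    private static final Map<String, Double> LEXICON = new HashMap<>();
    static {
        LEXICON.put("outstanding", 0.75);
        LEXICON.put("good", 0.5);
        LEXICON.put("poor", -0.5);
    }

    /** Lower-cases the line and replaces every run of non-letters with one space. */
    static String clean(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
    }

    /** Returns the lexicon score, or 0 for unknown words so they do not shift the total. */
    static double wordScore(String word) {
        return LEXICON.getOrDefault(word, 0.0);
    }

    /** Sums the word scores of one input line (summing is an assumption of this sketch). */
    static double lineScore(String line) {
        String cleaned = clean(line);
        if (cleaned.isEmpty()) {
            return 0.0;
        }
        return Arrays.stream(cleaned.split(" "))
                .mapToDouble(ToneScoreSketch::wordScore)
                .sum();
    }

    public static void main(String[] args) {
        // Only "outstanding" and "poor" are in the lexicon: 0.75 + (-0.5)
        System.out.println(lineScore("Our results are OUTSTANDING, not poor!")); // prints 0.25
    }
}
```

In an actual mapper, a method like lineScore would be called from map() and its result written out together with the publication key.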
To run the map-reduce job, use:
```
hadoop jar MapReduce-1.0-SNAPSHOT.jar AnlSemanticText [pig_processed_files] [output file]
```
To see the full output of the map-reduce job, look at src/main/resources/pmc_oa_processed_mapreduce.
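The merge step that FileProcessor performs on the map-reduce results might look something like the sketch below. It assumes the job writes one record per line as key, tab, value (Hadoop's default text output layout); the class name ResultsToJsonSketch and the JSON field names publication and score are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning "key<TAB>score" lines into a JSON array.
final class ResultsToJsonSketch {
    /** Escapes backslashes, double quotes, and tabs so the key is a valid JSON string body. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\t", "\\t");
    }

    /** Converts result lines to a JSON array, skipping lines without a tab or a numeric score. */
    static String toJson(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // no separator: not a key/value record
            }
            String key = line.substring(0, tab);
            double score;
            try {
                score = Double.parseDouble(line.substring(tab + 1).trim());
            } catch (NumberFormatException e) {
                continue; // value is not a number: skip the record
            }
            entries.add("{\"publication\":\"" + escape(key) + "\",\"score\":" + score + "}");
        }
        return "[" + String.join(",", entries) + "]";
    }
}
```

Splitting on the last tab keeps keys that themselves contain tabs intact; a real implementation would more likely use a JSON library than hand-built strings.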
The data visualization can be viewed by opening d3-visualization.html. To change the input data, edit the path on line 104 of that file to point to a new file.
The presentation can be found at: https://slides.com/stefanhagiu/hadoop
