dbpedia/GSoC

Fusing the List Extractor and the Table Extractor

Closed this issue · 8 comments

mgns commented

Description

Currently, there are two different projects for extracting triples from lists and from tables. Both projects aim to extract data from Wikipedia pages and to create a dictionary for mapping elements found in those pages. The student has to study how these projects work (how they create dictionaries, how they call services, etc.) and merge them in order to create a unified extractor. The student has to restructure both projects so that they use a common dictionary, which makes it easier to integrate them into one.

The student can also add a GUI so that users with little or no knowledge of the project can add triples more easily. The GUI should have a tool that can look up existing classes and properties from the latest DBpedia ontology. Other user-facing facilities should also be implemented (more comments, a demo that shows all the steps, etc.).

In addition, the student should add support for different languages, so that the extractor can extract triples from different language editions of Wikipedia. This should include support for languages that do not use the Latin alphabet (such as Greek or Hebrew). Multithreading is another possible improvement: running the extractors in multiple threads in order to make them faster.
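As a rough illustration of the ontology lookup tool mentioned above (not part of the current projects), the GUI could query the public DBpedia SPARQL endpoint with SPARQLWrapper along these lines; the helper name and the query itself are only assumptions:

    # Sketch: look up classes in the DBpedia ontology via the public SPARQL endpoint.
    # The endpoint, query and helper name are illustrative, not an existing project API.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def search_ontology_classes(keyword, limit=20):
        """Return (URI, label) pairs of ontology classes whose label contains keyword."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX owl:  <http://www.w3.org/2002/07/owl#>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?class ?label WHERE {
                ?class a owl:Class ;
                       rdfs:label ?label .
                FILTER (lang(?label) = "en" && CONTAINS(LCASE(?label), LCASE("%s")))
            } LIMIT %d
        """ % (keyword, limit))
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [(b["class"]["value"], b["label"]["value"])
                for b in results["results"]["bindings"]]

    for uri, label in search_ontology_classes("writer"):
        print(uri, "-", label)

A lookup like this could back an autocomplete field in the GUI, so users pick existing ontology terms instead of typing them by hand.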

Goals

There are two main goals to achieve:

  1. Merge the two projects in order to get a unified way to analyze Wikipedia structures (lists and tables).
  2. Create a GUI to help users. Furthermore, it would be helpful to add more comments and tips.

Another aspect that could be studied is how to speed up the analysis process. The entire work could be reorganized into different threads (this is an additional goal, not essential).
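As a rough illustration of this optional goal, the per-resource extraction could be fanned out over a thread pool. extract_resource below is a hypothetical stand-in for whatever the unified extractor would expose per Wikipedia page:

    # Sketch of running the extractor over several resources in parallel threads.
    # extract_resource() is a hypothetical placeholder, not an existing function.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def extract_resource(resource, language):
        # ... call the unified list/table extractor for one Wikipedia page ...
        return resource

    resources = ["William_Gibson", "Isaac_Asimov", "Philip_K._Dick"]

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(extract_resource, r, "en"): r for r in resources}
        for future in as_completed(futures):
            print("finished:", futures[future])

Since most of the extractors' time is spent waiting on Wikipedia and JSONpedia requests, threads (rather than processes) should already give a noticeable speedup.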

Impact

DBpedia will have only one program to extract data from Wikipedia article pages.
Furthermore, users will have new facilities, such as a GUI and tips on how they can work better with this application.

Warm up tasks

Study the parsers' code and explain a possible dictionary structure that can be used for both projects.
Create a mockup of the GUI that will organize the user's work (e.g. how users add new rules or how they can view statistics of the domain analysis).

Mentors

Luca Virgili, Krishanu Konar

Keywords

Python, RDF, Java

Hi,
I am facing an issue when I run the list extractor project on my Mac. When I ran the command "python listExtractor.py s William_Gibson en" I got the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: com/machinelinking/main/JSONpediaException
	at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
	at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3139)
	at java.base/java.lang.Class.getMethodsRecursive(Class.java:3280)
	at java.base/java.lang.Class.getMethod0(Class.java:3266)
	at java.base/java.lang.Class.getMethod(Class.java:2063)
	at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:57)
Caused by: java.lang.ClassNotFoundException: com.machinelinking.main.JSONpediaException
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:563)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:496)
	... 6 more

I think the main problem is in the execution of jsonpedia_wrapper.jar.
I am running it on macOS 10.13.2 and my Java version is 9.0.1.
I tried googling the issue but didn't get any satisfactory results (I tried setting the classpath to the jar location).
Please help!

mgns commented

Please report this issue directly at https://github.com/dbpedia/list-extractor, as the developers will not necessarily see this comment here.

Hey everyone,
I have worked on the warm-up tasks and I want to show them.

  1. A possible dictionary structure could be the following:
    Class : {
        Headers/Sections : {
            lang1 : [ ],          # list of headers to explore
            lang2 : [ ],
            ..
        },
        Ontology : {
            lang1 : {
                " " : OntologyProperty,   # ontology mappings
                " " : OntologyProperty,
                ..
            },
            lang2 : { .. }
        }
    }

Class represents the type/domain of the resource.
The list of headers to explore may not be useful for the table-extractor, but it is necessary for the list-extractor.
Both projects should be restructured to make use of the above dictionary.
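For illustration, here is a minimal Python sketch of how a shared mapping module could expose this structure to both extractors. The "Writer" domain, the example headers and the notableWork mapping are only placeholders, not actual project settings:

    # shared_mappings.py - sketch of a common dictionary module for both extractors.
    # The "Writer" domain, its headers and the notableWork mapping are only examples.
    MAPPING_RULES = {
        "Writer": {
            "headers": {                       # section/table headers to explore
                "en": ["Bibliography", "Works"],
                "it": ["Opere"],
            },
            "ontology": {                      # header -> DBpedia ontology property
                "en": {"Bibliography": "notableWork", "Works": "notableWork"},
                "it": {"Opere": "notableWork"},
            },
        },
    }

    def headers_for(domain, lang):
        """Headers the list-extractor should explore for a domain and language."""
        return MAPPING_RULES.get(domain, {}).get("headers", {}).get(lang, [])

    def property_for(domain, lang, header):
        """Ontology property mapped to a header, or None if unmapped."""
        return MAPPING_RULES.get(domain, {}).get("ontology", {}).get(lang, {}).get(header)

The table-extractor would mostly use property_for(), while the list-extractor would also use headers_for() to decide which sections to parse.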

  2. For the GUI, I did some research and found Django to be helpful, so I learned it on the fly and developed a mockup GUI with some working functionality. This GUI is built on the table-extractor project, and I want to implement the same flow for the unified extractor.

Step 1: The user enters the resource, the language and other details, and clicks the "explore" button.

[screenshot: step_1]

Step 2: The GUI then shows all the headers/sections found in the tables, together with the mappings, if present (basically the contents of the domain_settings.py file). Clicking "edit mappings" lets the user add or edit the mappings (not implemented as of now). It also shows example mappings already present in the dictionary.

[screenshot: step_2]

Step 3: Finally, clicking "Extract Triples" generates the corresponding .ttl file.
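For illustration, the .ttl output could be produced with rdflib roughly like this (the triple below is just a placeholder, not what the extractor actually emits):

    # Sketch of serializing extracted triples to Turtle with rdflib.
    # The William_Gibson / notableWork / Neuromancer triple is only an example.
    from rdflib import Graph, Namespace

    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    g = Graph()
    g.bind("dbr", DBR)
    g.bind("dbo", DBO)
    g.add((DBR["William_Gibson"], DBO["notableWork"], DBR["Neuromancer"]))
    g.serialize(destination="output.ttl", format="turtle")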

Here is the link to my work: https://github.com/sachinmalepati/table-extractor/tree/master/gui/gui_app
Please provide feedback.

Thanks,
Sachin Malepati.

Continuing last year's project, we plan on adding the following things to this year's project. These are the initial requirements:

  • Create a UI for the extractor that is intuitive, user-friendly and easy to use (we plan to use Django/Flask; a minimal sketch follows after this list).
  • Build a scalable, cross-platform application that can use the existing codebase to generate more triples for DBpedia's dataset (Dockerizing the existing application).
  • Find a solution for integrating the existing JSONpedia library with our application (Docker can help here again).
  • Generate more datasets for DBpedia from different domains.
  • Bonus: improve the existing extractors.
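As a very rough sketch of such a UI, using Flask here only as one of the two candidates (run_extractor is a hypothetical placeholder for the real extractor entry point):

    # Minimal Flask sketch for the planned extractor UI.
    # run_extractor() is a hypothetical stand-in for the real entry point.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def run_extractor(resource, language):
        # ... call the unified list/table extractor here ...
        return {"resource": resource, "language": language, "triples": 0}

    @app.route("/extract", methods=["POST"])
    def extract():
        resource = request.form["resource"]
        language = request.form.get("language", "en")
        return jsonify(run_extractor(resource, language))

    if __name__ == "__main__":
        app.run(debug=True)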

@lucav48 and I will mentor this project.

Did you find a solution for the JSONpediaException error reported above? I am facing the same issue. I have a MacBook Air.

This is due to a mismatch in the Java version.
This is still an open issue, which needs to be resolved by isolating the underlying infrastructure. The plan for this year's GSoC is to isolate it, probably by containerising it.

In the meantime, you can install Java 8 alongside your current version and run this using that.

No issues.
It would be more useful right now if you looked into the actual list-extractor and table-extractor repos; that should give you a better idea of how the extractors really work. The plan for this project includes a decent amount of code restructuring, and it would be helpful for you to understand the core logic behind the extractors.