Official paper link:
https://link.springer.com/article/10.1007/s10994-017-5640-x
Pre-print from KU Leuven library:
Learning constraints from tabular data takes three steps. For a demonstration, have a look at the gif demonstrating the process (on fruit.csv) or at the separate steps on the main demo example:
- Upload the file via the web interface: image
- Select tables one by one: selection of one table, all tables are selected
- Push Learn Constraints: output
The same steps for a different CSV file, department.csv, demonstrate the SUMIF constraint: step 1, step 2 and step 3.
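To make concrete what a learned SUMIF constraint asserts, here is a minimal Python sketch. It is not the system's implementation: the helper names `sumif` and `holds_sumif` and the sample data are hypothetical (the contents of department.csv are not reproduced here). The idea is that a summary column holds, for each key, the sum of the values in another table whose key matches.

```python
def sumif(keys, values, match):
    """Sum the values whose key equals `match`, in the spirit of Excel's SUMIF."""
    return sum(v for k, v in zip(keys, values) if k == match)


def holds_sumif(keys, values, lookup_keys, totals, tol=1e-9):
    """Check that every total equals the conditional sum for its lookup key."""
    return all(
        abs(sumif(keys, values, k) - t) <= tol
        for k, t in zip(lookup_keys, totals)
    )


# Hypothetical data: one table of employees, one summary table per department.
departments = ["sales", "it", "sales", "hr", "it"]
salaries = [100.0, 200.0, 150.0, 90.0, 50.0]
summary_keys = ["sales", "it", "hr"]
summary_totals = [250.0, 250.0, 90.0]

print(holds_sumif(departments, salaries, summary_keys, summary_totals))  # True
```

A constraint learner would search for column combinations for which such a check holds; the sketch only verifies one given combination.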
Let us also present some other constraints that can be learned by the system:
- Fuzzy-lookup as learned from BMI.csv
- Rank and series as learned from score.csv
- Arithmetic expressions as learned from belgium.csv
- Row and column sums as learned from baltimore.csv
- If-aggregates (MINIF, COUNTIF, AVGIF, etc.) as learned from rides.csv
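As an illustration of what some of these constraints state, the following Python sketch checks a series, a rank and a row-sum relation on made-up data. The function names are hypothetical, and the definitions are assumptions: a series is taken to be 1, 2, 3, ..., and rank follows Excel's default descending RANK, where ties share a rank. The definitions used by the system itself may differ in detail.

```python
def is_series(column):
    """True if the column is exactly 1, 2, ..., n (assumed meaning of 'series')."""
    return list(column) == list(range(1, len(column) + 1))


def rank(values):
    """Descending rank: 1 plus the number of strictly greater values."""
    return [1 + sum(other > v for other in values) for v in values]


def is_row_sum(table, totals, tol=1e-9):
    """True if each total equals the sum of the corresponding row."""
    return all(abs(sum(row) - t) <= tol for row, t in zip(table, totals))


# Hypothetical data, not taken from score.csv or baltimore.csv.
ids = [1, 2, 3, 4]
scores = [70, 90, 90, 50]
quarters = [[1, 2, 3], [4, 5, 6]]
totals = [6, 15]

print(is_series(ids))                # True
print(rank(scores))                  # [3, 1, 1, 4]
print(is_row_sum(quarters, totals))  # True
```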
links.txt contains the list of links to the Excel documents collected from the Internet for experiments
data.txt describes the split of documents into three categories (used in the paper in the experimental section)
csv contained the original CSV files obtained by converting the Excel documents into CSV (removed for legal reasons related to GDPR)
groups has the description of the groups for the corresponding CSV files in the csv folder
truth contains the ground truth for the experiments, i.e., the constraints in the Excel documents
We collected spreadsheets from three main sources.
- After identifying popular Excel functions (the MS Office web page has an overview of them), we searched for online tutorials about these functions and collected them under the category Tutorials. A link to each webpage found and used is provided in the links.txt file.
- We collected the exercises under the category Exercises from an introductory Excel book (MS Excel 2010 by Eddy Van den Broeck and Erik Cuypers), selecting exercises that focused 1) on popular Excel functions and 2) on datatypes supported by our system.
- Under the category Data, we collected spreadsheets reporting data, more specifically economic data, crime reporting data and data from runtime experiments. The spreadsheets on economic data and crime reporting are publicly available and originate from the U.S. Bureau of Economic Analysis (BEA), the RWE annual report 2014 and the U.S. FBI Uniform Crime Reporting (UCR) Program.
The technical description of the dataset is summarized in two tables: the occurrences of constraints per category and the fraction of spreadsheets in each category that contain these constraints.
The number of constraint occurrences in spreadsheets:
constraint | Data | Tutorials | Exercises |
---|---|---|---|
average (col) | 0 | 2 | 0 |
average (row) | 1 | 0 | 0 |
average-if | 0 | 0 | 1 |
count (col) | 0 | 1 | 0 |
count-if | 0 | 0 | 1 |
difference | 1 | 0 | 2 |
equal | 0 | 10 | 0 |
foreign-product | 0 | 0 | 1 |
fuzzy-lookup | 0 | 1 | 2 |
lookup | 0 | 0 | 1 |
max (col) | 0 | 2 | 0 |
max-if | 0 | 0 | 1 |
min (col) | 0 | 3 | 0 |
min-if | 0 | 0 | 1 |
percentual-diff | 1 | 0 | 5 |
product | 0 | 3 | 2 |
project | 0 | 0 | 1 |
rank | 0 | 0 | 1 |
series | 1 | 0 | 2 |
sum (col) | 0 | 10 | 5 |
sum (row) | 2 | 3 | 2 |
sum-if | 0 | 9 | 2 |
sum-product | 0 | 2 | 0 |
The fraction of spreadsheets containing specified constraints:
constraint | Data | Tutorials | Exercises |
---|---|---|---|
average (col) | 0.00 | 0.10 | 0.00 |
average (row) | 0.25 | 0.00 | 0.00 |
average-if | 0.00 | 0.00 | 0.11 |
count (col) | 0.00 | 0.05 | 0.00 |
count-if | 0.00 | 0.00 | 0.11 |
difference | 0.25 | 0.00 | 0.22 |
equal | 0.00 | 0.24 | 0.00 |
foreign-product | 0.00 | 0.00 | 0.11 |
fuzzy-lookup | 0.00 | 0.05 | 0.22 |
lookup | 0.00 | 0.00 | 0.11 |
max (col) | 0.00 | 0.10 | 0.00 |
max-if | 0.00 | 0.00 | 0.11 |
min (col) | 0.00 | 0.14 | 0.00 |
min-if | 0.00 | 0.00 | 0.11 |
percentual-diff | 0.25 | 0.00 | 0.11 |
product | 0.00 | 0.14 | 0.11 |
project | 0.00 | 0.00 | 0.11 |
rank | 0.00 | 0.00 | 0.11 |
series | 0.25 | 0.00 | 0.22 |
sum (col) | 0.00 | 0.33 | 0.56 |
sum (row) | 0.25 | 0.14 | 0.22 |
sum-if | 0.00 | 0.38 | 0.22 |
sum-product | 0.00 | 0.10 | 0.00 |
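The two tables measure different things: the first counts every occurrence of a constraint, while the second counts each spreadsheet at most once per constraint. The Python sketch below shows the distinction on a hypothetical category of four spreadsheets. The data and function names are made up for illustration and do not reproduce the contents of the truth folder.

```python
from collections import Counter


def occurrences(spreadsheets):
    """Total number of times each constraint occurs across all spreadsheets."""
    counts = Counter()
    for constraints in spreadsheets:
        counts.update(constraints)
    return counts


def fraction_containing(spreadsheets):
    """Fraction of spreadsheets that contain each constraint at least once."""
    present = Counter()
    for constraints in spreadsheets:
        present.update(set(constraints))  # count a spreadsheet once per constraint
    return {name: round(c / len(spreadsheets), 2) for name, c in present.items()}


# Hypothetical category: each inner list holds the constraints of one spreadsheet.
category = [
    ["sum (col)", "sum (col)", "rank"],
    ["sum (col)"],
    ["lookup"],
    [],
]

print(occurrences(category)["sum (col)"])          # 3
print(fraction_containing(category)["sum (col)"])  # 0.5
```

This is why a constraint can have several occurrences in a category while its fraction stays small: all occurrences may come from the same spreadsheet.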