openml/OpenML

Trouble uploading datasets to the test server

sebffischer opened this issue · 5 comments

I have tried with the API as well as through the website.
When trying to upload a dataset to the test server, I encounter the following error:

A PHP Error was encountered

Severity: Warning

Message: simplexml_load_string(): Entity: line 70: parser error : Extra content at the end of the document

Filename: new/post.php

Line Number: 155

Backtrace:

File: /var/www/openml/OpenML/openml_OS/views/pages/frontend/new/post.php
Line: 155
Function: simplexml_load_string

File: /var/www/openml/OpenML/openml_OS/helpers/cms_helper.php
Line: 19
Function: view

File: /var/www/openml/OpenML/openml_OS/controllers/Frontend.php
Line: 89
Function: loadpage

File: /var/www/openml/OpenML/index.php
Line: 334
Function: require_once

This is just a warning: the test server is configured to display warnings, while the production server is not. Did you actually have a problem uploading the dataset?

Thanks for the clarification!
However, the following does not work (for me):

import numpy as np
import openml
from openml.datasets import create_dataset
from sklearn import datasets

openml.config.apikey = "API_TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"

diabetes = datasets.load_diabetes()
name = "Diabetes(scikit-learn)"
X = diabetes.data
y = diabetes.target
attribute_names = diabetes.feature_names
description = diabetes.DESCR


data = np.concatenate((X, y.reshape((-1, 1))), axis=1)
attribute_names = list(attribute_names)
attributes = [(attribute_name, "REAL") for attribute_name in attribute_names] + [
    ("class", "INTEGER")
]
citation = (
    "Bradley Efron, Trevor Hastie, Iain Johnstone and "
    "Robert Tibshirani (2004) (Least Angle Regression) "
    "Annals of Statistics (with discussion), 407-499"
)
paper_url = "https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf"


diabetes_dataset = create_dataset(
    # The name of the dataset (needs to be unique).
    # Must not be longer than 128 characters and only contain
    # a-z, A-Z, 0-9 and the following special characters: _\-\.(),
    name=name,
    # Textual description of the dataset.
    description=description,
    # The person who created the dataset.
    creator="Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani",
    # People who contributed to the current version of the dataset.
    contributor=None,
    # The date the data was originally collected, given by the uploader.
    collection_date="09-01-2012",
    # Language in which the data is represented.
    # Starts with 1 upper case letter, rest lower case, e.g. 'English'.
    language="English",
    # License under which the data is/will be distributed.
    licence="BSD (from scikit-learn)",
    # Name of the target. Can also have multiple values (comma-separated).
    default_target_attribute="class",
    # The attribute that represents the row-id column, if present in the
    # dataset.
    row_id_attribute=None,
    # Attribute or list of attributes that should be excluded in modelling, such as
    # identifiers and indexes. E.g. "feat1" or ["feat1","feat2"]
    ignore_attribute=None,
    # How to cite the paper.
    citation=citation,
    # Attributes of the data
    attributes=attributes,
    data=data,
    # A version label which is provided by the user.
    version_label="test",
    original_data_url="https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html",
    paper_url=paper_url,
)

diabetes_dataset.publish()
print(f"URL for dataset: {diabetes_dataset.openml_url}")

gives me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 70, column 0
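For what it's worth, this ExpatError is exactly what happens when extra content (such as a rendered PHP warning) is appended after an otherwise well-formed XML document. A minimal sketch, using a hypothetical response body (the XML element and id below are made up for illustration):

```python
import xml.parsers.expat

# Hypothetical upload response: well-formed XML followed by a PHP
# warning rendered as HTML, as the test server appears to emit.
response_text = (
    "<oml:upload_data_set xmlns:oml='http://openml.org/openml'>"
    "<oml:id>42</oml:id>"
    "</oml:upload_data_set>\n"
    "<div>A PHP Error was encountered</div>"
)

parser = xml.parsers.expat.ParserCreate()
error_message = ""
try:
    parser.Parse(response_text, True)
except xml.parsers.expat.ExpatError as err:
    error_message = str(err)

print(error_message)  # e.g. "junk after document element: line 2, column 0"
```

The actual server response is much longer, but any trailing markup after the closing root tag triggers this same parser error, which would explain why hiding the warnings on production makes uploads appear to work there.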

I get a similar error with this:

import openml
from sklearn import compose, neighbors, preprocessing, tree

openml.config.apikey = "TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"
# NOTE: We are using dataset 68 from the test server: https://test.openml.org/d/68
dataset = openml.datasets.get_dataset(68)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

dataset = openml.datasets.get_dataset(17)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
print(f"Categorical features: {categorical_indicator}")
transformer = compose.ColumnTransformer(
    [("one_hot_encoder", preprocessing.OneHotEncoder(categories="auto"), categorical_indicator)]
)
X = transformer.fit_transform(X)
clf.fit(X, y)


# Get a task
task = openml.tasks.get_task(403)

# Build any classifier or pipeline
clf = tree.DecisionTreeClassifier()

# Run the flow
run = openml.runs.run_model_on_task(clf, task)

print(run)


myrun = run.publish()
# For this tutorial, our configuration publishes to the test server
# so as not to pollute the main server.
print(f"Uploaded to {myrun.openml_url}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 61, column 6

Hi Seb,

I recently fixed a number of issues with the test server.
Can you please check if this issue is now resolved?

Thanks!

It seems that listing datasets from the test server does not work. I have not yet checked the other things (e.g. uploading), but I will report back when I have.

> list_oml_data(test_server = TRUE)
INFO  [15:21:27.606] Retrieving JSON {url: `https://test.openml.org/api/v1/json/data/list/limit/1000`, authenticated: `TRUE`}
Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          t":"ARFF",    "md5_checksum":" <div style="border:1px solid
                     (right here) ------^
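The lexical error above looks like the same root cause as the XML one: HTML from a rendered warning ends up inside the JSON body, here in the middle of the md5_checksum string. A minimal sketch with a hypothetical fragment modeled on the output above:

```python
import json

# Hypothetical listing fragment: an HTML <div> injected into the middle
# of the md5_checksum string value breaks the JSON parser, because the
# quote in style="..." terminates the string early.
response_text = (
    '{"data_format": "ARFF", '
    '"md5_checksum": " <div style="border:1px solid"}'
)

error_message = ""
try:
    json.loads(response_text)
except json.JSONDecodeError as err:
    error_message = str(err)

print(error_message)  # e.g. "Expecting ',' delimiter: ..."
```

In R, jsonlite's parser surfaces the same malformed body as the "lexical error: invalid character inside string" shown above.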