/random-items

Random items extracted from Wikipedia


Random Items: data for randomitems.io

This repository contains random items extracted from Wikipedia. These items can be used to fill in a mock database or web site, or for testing purposes. These data are used to feed my website randomitems.io.

The data are available as CSV files.
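Since every table is a plain CSV file, it can be read with any standard CSV library. The sketch below uses Python's `csv.DictReader` on an inline sample that mimics currencies.csv; the sample rows and exact column names are assumptions for illustration, not taken from the actual files.

```python
import csv
import io

# Hypothetical sample mimicking currencies.csv (the real file may use
# different column names -- check data_list_short.md for the actual fields).
sample = io.StringIO("name,code\nEuro,EUR\nYen,JPY\n")

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(rows[0]["code"])  # EUR
```

To read one of the real files, replace the `StringIO` object with `open("currencies.csv", newline="", encoding="utf-8")`.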

Description

The data are organized into tables that model real-world objects, like a user profile (with name, address, e-mail, gender, etc.), a company (name, logo, country, web site URL, number of employees, etc.), a battle (name, date, description, etc.), and so on.

Fields contain either short pieces of information (name, date, address, author, latitude, longitude, ...), URLs to pictures hosted on Wikimedia servers (photos, maps, flags, ...), or longer texts such as whole Wikipedia pages in plain text (wiki syntax) or HTML format.

The data are extracted from Wikipedia, so they are (mostly) consistent:

  • Some data are entirely extracted from the Wikipedia data, like companies, countries, cities, lakes, etc. This means that the entire row contains real data: the name, the picture (photo, logo, flag), the location (latitude, longitude), the population, etc.
  • In other cases, some data are consistent and real (the name of a person will match with their gender, country of birth, picture, etc.) while others are faked (e-mails, addresses, phone numbers, etc.).

For some tables, a Wikipedia ID is provided, so you can link to the original Wikipedia page.
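One way to turn such an ID into a link is sketched below. It assumes the wiki ID is the English Wikipedia page title (underscores instead of spaces); if the files store a different kind of identifier, the URL scheme would need to change accordingly.

```python
from urllib.parse import quote

def wikipedia_url(wiki_id: str) -> str:
    """Build a link to the original page, assuming the wiki ID
    is an English Wikipedia page title (e.g. "Battle_of_Hastings")."""
    # quote() percent-encodes characters that are unsafe in a URL path;
    # letters, digits and underscores pass through unchanged.
    return "https://en.wikipedia.org/wiki/" + quote(wiki_id)

print(wikipedia_url("Battle_of_Hastings"))
# https://en.wikipedia.org/wiki/Battle_of_Hastings
```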

List of tables

Available files:

  • addresses.csv: street, city, zip code, state, country, continent...
  • languages.csv: name, ISO code (2 and 3 characters), number of speakers...
  • battles.csv: name, wiki ID, date, outcome, summary (text and html), complete Wikipedia page (text and html)...
  • buildings.csv: name, wiki ID, country, latitude and longitude, photo, description
  • cars.csv: model, wiki ID, date, photo, dimensions, description...
  • cities.csv: name, wiki ID, zip code, country, state/region, number of people, area, latitude and longitude, web site, mayor, description...
  • companies.csv: name, wiki ID, country, number of employees, industry, web site, logo, description...
  • countries.csv: name, wiki ID, capital, latitude and longitude, currency, location map, flag, description
  • currencies.csv: name and code
  • fables.csv: title, wiki ID, author, summary (text and html), complete Wikipedia page (text and html)
  • family_names.csv: name, language (Simpson sounds more English and Dupont more French...), frequency (the number of people with that name who have a Wikipedia entry; for English, only a subset of Wikipedia was considered)
  • given_names.csv: name, gender, language, frequency
  • hobbies.csv: name and type (sport...)
  • lakes.csv: name, wiki ID, latitude and longitude, photo, description, continent...
  • western_languages.csv: name and ISO code
  • occupations.csv: name of the occupation
  • persons.csv: first and last name, wiki ID, gender, picture, email, occupation, birth date, company, address, etc.
  • products.csv: title, wiki ID, type (book, album, video game, movie...), date, country, author, language, cover picture, description
  • unique_emails.csv: first and last name, email, language (of the name), gender (of the first name)
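A typical use of these tables is to draw random rows when seeding a mock database. The sketch below picks a random entry from an inline sample that mimics given_names.csv; the sample rows are invented for illustration, and the column names follow the field list above.

```python
import csv
import io
import random

# Hypothetical sample mimicking given_names.csv (rows invented for the demo;
# the real file is much larger).
sample = io.StringIO(
    "name,gender,language,frequency\n"
    "Alice,f,en,120\n"
    "Bruno,m,fr,85\n"
    "Chen,m,zh,200\n"
)
names = list(csv.DictReader(sample))

random.seed(0)  # seed only to make the demo reproducible
picked = random.choice(names)
print(picked["name"], picked["language"])
```

For more realistic output, the frequency column could be passed as `weights` to `random.choices()` so that common names are drawn more often.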

More documentation

Please refer to the files:

  • data_list_short.md for a list of tables and all their fields,
  • data_list_detailed.md for a list with additional details (number of rows, number of unique values, number of null values, type (string, integer, float), ranges, etc.).

Disclaimer

These data are designed to be used as fill-in data, for mock databases and websites, or to test a setup with example data. Even if they are extracted from Wikipedia and so represent some state of the real world, they are not meant as a source of information. Some of these data are plainly incorrect, either because they have been faked (addresses, e-mails, etc.) or because the information found on Wikipedia was incorrect or was badly retrieved.

Acknowledgement

The data are extracted from Wikipedia via Yago 3 (see also the older site), which is distributed under the terms of the Creative Commons Attribution 3.0 License.

License

The data from this repository are licensed under the terms of the Creative Commons Attribution 3.0 License. You can share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material for any purpose, even commercially), as long as you give appropriate credit, provide a link to the license, and indicate if changes were made, without adding any other restrictions.

Author

I'm Bruno Oberle. You can contact me via Github or by email via my web site at boberle.com.