Automatic validation and extraction of data from publications in Chemical and Materials Sciences

Workshop at the Department of Chemistry, University of Cambridge

Register [here](http://www.eventbrite.co.uk/e/contentmine-chemistry-hack-tickets-18534620549) (registration is free; places are limited to 25).


Location: U202, Department of Chemistry, Lensfield Road CB2 1EW

Dates: 18-19 September 2015

Please bring a laptop and [pre-load the software](https://github.com/ContentMine/vms/blob/master/installation_intructions.md).

| 18 September 2015 | 19 September 2015 |
| --- | --- |
| Training Workshop & Publisher Panel Session | Hackday |
| 9:00 - 18:00 | 10:00 - 17:00 |

[@chemcambridge](https://twitter.com/chemcambridge)

Trainers:

- Peter Murray-Rust @petermurrayrust
- Judith Rommel [@jbr_science](https://twitter.com/jbr_science)
- Jenny Molloy @jenny_molloy
- The ContentMine Team [@TheContentMine](https://twitter.com/TheContentMine)

Please read the [Pre-workshop Installation Instructions](https://github.com/ContentMine/vms/blob/master/installation_intructions.md).

We would also appreciate your feedback.

Workshop Purpose

Ever found that the key data you want is published in a text-based PDF journal?

- ...found yourself manually downloading 100 papers click by click?
- ...redrawing structures, spectra or graphs so you can recompute or analyze them?
- ...retyping data from tables?
- ...wishing that a computer could do the really boring discovery and retrieval of data in the literature?

We all have. But new approaches are solving these problems. That's why content mining (also known as text and data mining, TDM) is one of the most exciting areas in scientific data. It has even been debated intensively in the European Parliament and Commission. The UK is leading the way with new copyright exceptions, which makes universities like Cambridge ideal places to learn and develop the new techniques.

The workshop will bring together:

- scientists who need to discover data, especially in chemistry, materials and molecular bioscience, both experimental and computational
- scientific publishers
- library staff
- technology developers

We'll show how Open software can be used to:

- crawl the literature effectively using search APIs
- scrape all the content from publisher web pages (supplemental data, structures)
- normalize PDFs into semantic HTML
- run search plugins to discover particular.
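
To give a feel for the last step, here is a minimal sketch of what a search "plugin" might do once a paper has been normalized to text: scan it for strings that look like simple chemical formulas and count them. This is an illustration only, not ContentMine's AMI code or API. The names `ELEMENTS`, `CANDIDATE`, `TOKEN` and `find_formulas` are hypothetical, and the element list is a deliberately tiny subset.

```python
import re
from collections import Counter

# Assumption: a small subset of element symbols, enough for the demo.
# A real plugin would use the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "Na", "Cl", "S", "Fe", "Cu", "Ti"}

# One element-like token: a capital, an optional lowercase letter, optional digits.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
# A candidate formula: a whole word made of two or more such tokens.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")


def find_formulas(text):
    """Count candidate formulas whose symbols are all in ELEMENTS."""
    hits = Counter()
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        symbols = [symbol for symbol, _count in TOKEN.findall(candidate)]
        # Reject look-alikes such as "PDF" whose letters are not in ELEMENTS.
        if all(symbol in ELEMENTS for symbol in symbols):
            hits[candidate] += 1
    return hits


sample = (
    "The TiO2 film was rinsed with H2O and NaCl solution. "
    "The PDF reports that the TiO2 layer remained intact."
)
print(find_formulas(sample))
```

A pattern this naive will miss many formulas and accept some false ones, which is exactly the kind of trade-off the workshop's precision and recall session looks at.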

The first day will include overviews, installation of the technology [1], a hands-on introduction, and a panel of experts drawn from the participants on policy and practice. The second day will be a project-based hack where small groups will tackle problems they have in common. The event is sponsored by the EPSRC-IAA Knowledge Transfer Fund of the Chemistry Department. Facilitators are from Chemistry and Plant Sciences. Coffee, lunches and a Friday dinner are provided.

[1] All essential technology is Open and comes from contentmine.org, an Open project funded by the Shuttleworth Foundation.

Training Workshop and Publisher Panel Session Agenda

| Times | Session |
| --- | --- |
| 9:00 | Introductions |
| 9:15 | What is content mining? Overview presentation from ContentMine staff. |
| 9:30 | Think like a content miner: hands-on activity facilitated by ContentMine staff, introducing entity extraction techniques, precision and recall. |
| | Scraping and the anatomy of scrapers: hands-on activity facilitated by ContentMine staff, including use of quickscrape and custom scraper development. |
| 11:00 | Preparations for panel discussion with publishers |
| 12:30 | Lunch |
| 13:30 | Publishers Q&A |
| 15:30 | Tea time |
| 16:00 | Entity recognition using AMI: hands-on activity facilitated by ContentMine staff, including extracting species names from OA papers using AMI-species. |
| 18:00 onwards | Informal social event (dinner): move as a group to a nearby pub or late-opening cafe to discuss hackday projects. Reservation to be confirmed at Browns from 18:00 onwards. |
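
If precision and recall (from the 9:30 session) are new to you, the sketch below shows the arithmetic on a toy example. Precision is the fraction of extracted items that are correct; recall is the fraction of the correct items that were found. The function name `precision_recall` and both example sets are made up for illustration and do not come from any ContentMine tool.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items
    compared against a hand-labelled set of relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical output of a species extractor: two real names, one false hit.
extracted = {"Escherichia coli", "Homo sapiens", "Figure legend"}
# Hypothetical hand-labelled answer for the same paper.
hand_labelled = {"Escherichia coli", "Homo sapiens", "Mus musculus", "Danio rerio"}

precision, recall = precision_recall(extracted, hand_labelled)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the three extracted names are correct (precision 2/3) and two of the four labelled names were found (recall 1/2).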

Workshop Hackday Agenda

| Times | Session |
| --- | --- |
| 10:00 | Hacking in teams working on AMICHEM, Chemical tagger,... |
| 12:30 | Lunch |
| 13:30 | Hacking in teams working on AMICHEM, Chemical tagger,... |
| 15:30 | Coffee Break |
| 16:00 | Presentation of hackday projects: presentations delivered by participants, including future scope for development of their projects. |
| 16:30 | Panel discussion on accelerating uptake of content mining: panel and Q&A with audience, including workshop participants. |
| 17:00 | Event close |

Intended Audience

This two-day event is intended for researchers or research-related staff who are not currently heavily involved in text and data mining but have at least some computational skills. At a minimum, we expect familiarity with a command-line interface and basic coding ability in some language.
