RedHenLab/NLP

Multiword expressions tagger

Liontooth opened this issue · 3 comments

Python tagger for multiword expression lexicon

See the description in Spanish.

This is a task related to research on language and gesture with the NewsScape Library of International Television News. NewsScape is hosted by the University of California Los Angeles Library and developed by the Red Hen consortium for research on multimodal communication. Besides UCLA, Red Hen has capture nodes and research teams at Case Western Reserve University, University of Illinois at Urbana-Champaign, University of Southern Denmark, University of Oxford, University of Osnabrück, Texas Tech, National Institute for Advanced Studies in Bangalore, University of Navarra, University of Murcia, and other places (the consortium is constantly expanding). NewsScape contains more than 200,000 hours of television news in English, Spanish and other European languages, indexed by their subtitles/closed captioning (more than 3 billion words). Among other functionalities, NewsScape is the first audiovisual database that allows for synchronized searches of subtitles and images. Its search results take the user to the exact moment of the show when the words in the subtitles/closed captioning were uttered.

Almost all large linguistic corpora to date are written corpora (Corpus of American English, CREA and CORDE from Spain’s Royal Academy, newspaper archives, etc.). NewsScape opens new horizons for the study of oral communication alongside the great variety of elements that accompany verbal expression: gesture and intonation, along with, in the case of television, music, image and sound effects, graphics, etc. NewsScape also facilitates the study of particular news stories, topics, statements by individuals or institutions, etc. We are developing automatic and manual search and annotation tools for semantic patterns. Besides verbal patterns, we are also developing tools for face recognition, detection of visual patterns, story segmentation, etc. The research groups at Navarra and Murcia are developing the SCHEMOTIME project, which compares language and gesture in the expression of emotions and time, two central concepts for theories of metaphor and cognition. In addition, the collaboration between Navarra and Murcia leads the development of NewsScape in Spanish.

The present task is to write a program that receives an input text in natural language and tags certain phrases. The phrases to be tagged are multiword expressions of time, such as "the years rolled by".

Python is probably the right programming language, given the libraries available (we recommend mwetoolkit).

Part of the job is already done by a preprocessor that tags Parts-of-Speech (prepositions, verbs, nouns, etc.) in the raw text.

For instance, the raw text may be the sentence, "AND SO THE YEARS ROLLED BY."

A tool called MBSP, from the CLiPS research group at the University of Antwerp, tags it like this, separating the fields of each token with a slash and the tokens with the pipe symbol:

"and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|././O/O/."

You are not expected to understand those annotations yet; just know that they exist and that they are what your program will use.
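As a rough illustration of how a program might read this output, the sketch below splits the example string into tokens and fields. The field names (word, pos, chunk, pnp, lemma) are an assumption read off the five slash-separated slots in the example, and `Token` and `parse_tagged` are hypothetical names, not part of MBSP.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One tagged token; the field names are assumed from the example above."""
    word: str
    pos: str
    chunk: str
    pnp: str
    lemma: str


def parse_tagged(tagged: str) -> List[Token]:
    """Split a tagged sentence on '|' (tokens) and '/' (fields)."""
    tokens = []
    for item in tagged.strip().strip('"').split("|"):
        fields = item.split("/")
        if len(fields) != 5:
            raise ValueError("expected 5 fields, got %r" % item)
        tokens.append(Token(*fields))
    return tokens


TAGGED = ("and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|"
          "years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|"
          "././O/O/.")

tokens = parse_tagged(TAGGED)
for token in tokens:
    print(token.word, token.pos, token.lemma)
```

This simple split assumes that no word contains a literal slash or pipe; real input may need extra care on that point.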

The multiword expressions are specified through a combination of lists of words and these prepared Parts of Speech tags. The full set of specifications is called a lexicon.

For instance, an expression may have the structure As + UNIT OF TIME + MOTION VERB + PREPOSITION. Some examples: As centuries float slowly by, As the seconds trickled past, As the holidays slowly snuck up on her. The construction is further specified as follows in the lexicon:

  • A list of words indicating units of time, such as afternoon, age, autumn, century, dawn, decade, evening, and November.
  • A list of motion verbs, including fly, shuffle, sneak up, come tumbling down, and roll past.
  • The PREPOSITION will be available in the parts-of-speech tags.

So the lexicon defines the multiword expression, and the program must locate that expression in the source text. Three steps are needed:

  • Identify the lemmatized form of each word (the lemmas are available in the Parts-of-Speech tags)
  • Match the word list in the lexicon against the candidate word in the source text
  • Match the parts of speech specification in the lexicon against the parts of speech tag of the candidate word in the source text
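The three steps above can be sketched as follows for a simplified UNIT OF TIME + MOTION VERB + PREPOSITION pattern. This is only an illustration under several assumptions: the word lists are short excerpts, the function names are hypothetical, the field order (word, POS, chunk, PNP, lemma) is read off the MBSP example earlier in this description, and treating both IN and RP as filling the preposition slot is a guess motivated by "by" being tagged RP in that example.

```python
from typing import List, Tuple

# Short excerpts standing in for the lexicon word lists (assumed contents).
UNITS_OF_TIME = {"year", "century", "second", "holiday", "decade"}
MOTION_VERBS = {"roll", "float", "trickle", "fly", "shuffle"}
# Assumption: prepositions and particles both fill the PREPOSITION slot.
PREP_TAGS = {"IN", "RP"}


def parse(tagged: str) -> List[Tuple[str, ...]]:
    """Return one (word, pos, chunk, pnp, lemma) tuple per token."""
    return [tuple(item.split("/")) for item in tagged.split("|")]


def find_matches(tokens: List[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans matching the three-slot pattern."""
    spans = []
    for i in range(len(tokens) - 2):
        unit, verb, prep = tokens[i:i + 3]
        if (unit[4] in UNITS_OF_TIME              # steps 1-2: lemma in word list
                and verb[4] in MOTION_VERBS
                and verb[1].startswith("VB")      # step 3: POS tag check
                and prep[1] in PREP_TAGS):
            spans.append((i, i + 3))
    return spans


TAGGED = ("and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|"
          "years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|"
          "././O/O/.")

tokens = parse(TAGGED)
for start, end in find_matches(tokens):
    print(" ".join(token[0] for token in tokens[start:end]))
```

The sketch only handles adjacent single-word slots; intervening adverbs ("float slowly by") and multiword verbs would need a more flexible matcher.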

The final product is a utility to which the user submits a sentence; the utility tags the sentence according to the multiword expression lexicon. The utility should support a socket server mode.
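One possible shape for the socket server mode, using only Python's standard `socketserver` module, is sketched below. The line-based protocol (one sentence per line in, one tagged line out), the `<MWE>` markup, and the names `tag_line`, `TaggerHandler`, `start_server` and `query` are all assumptions made for illustration; `tag_line` is a placeholder for the real tagger.

```python
import socket
import socketserver
import threading


def tag_line(sentence: str) -> str:
    """Hypothetical stand-in for the real tagger: marks one hard-coded phrase."""
    return sentence.replace("years rolled by", "<MWE>years rolled by</MWE>")


class TaggerHandler(socketserver.StreamRequestHandler):
    """Reads newline-terminated sentences and writes back one tagged line each."""

    def handle(self):
        for raw in self.rfile:
            sentence = raw.decode("utf-8").rstrip("\r\n")
            self.wfile.write((tag_line(sentence) + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 lets the OS pick a port."""
    server = socketserver.ThreadingTCPServer((host, port), TaggerHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(address, sentence):
    """Send one sentence to the server and return the tagged reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((sentence + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().rstrip("\n")


if __name__ == "__main__":
    server = start_server()
    print(query(server.server_address, "and so the years rolled by."))
    server.shutdown()
    server.server_close()
```

Keeping the tagging logic in a plain function such as `tag_line` means the same code can serve both a command-line mode and the socket mode.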

The project will be mentored by software developers in the Red Hen Lab, which includes faculty at University of Navarra in Spain and the University of California in Los Angeles.

Sample Lexicon of English Time Expressions

  1. UNIT OF TIME + MANNER-OF-MOTION VERB

    Example sentences:
    -Time flies. -Days shuffle. -Holidays sneak up on.
    -Months come tumbling down. -The years rolled slowly past.
    UNITS OF TIME: afternoon, age, autumn, century, dawn, decade, evening, fall, holiday, holidays, hour, night, midday, midnight, millennium, millisecond, minute, moment, month, morning, morrow, noon, period, second, spring, summer, today, tomorrow, tonight, twilight, week, weekday, weekend, winter, yesterday. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. January, February, March, April, May, June, July, August, September, October, November, December. Time
    Also, some nouns and pronouns referring to processes: movie, stay, course, class, lecture, show, concert, exam, party, meeting, match, war, Christmas, summer, season, it (all), race, project, recording, visit. * This list is expandable.

    MANNER-OF-MOTION VERBS: fly, shuffle, sneak up, come tumbling down, roll slowly/quickly past, run, walk, bounce, drift, drop, float, glide, move, roll, slide, swing, revolve, rotate, spin, turn, twirl, twist, whirl, wind, amble, bolt, bounce, charge, coast, crawl, creep, dart, dash, dodder, drift, flit, float, fly, frolic, gallop, glide, hasten, hike, hobble, hop, hurry, inch, jump, leap, lurch, march, meander, mince, parade, perambulate, plod, promenade, prowl, race, ramble, roam, roll, run, rush, saunter, scurry, scutter, scuttle, shamble, shuffle, skedaddle, skip, slide, slink, slither, slog, slouch, sneak, speed, stagger, stray, streak, stroll, strut, stumble, swagger, sweep, swim, tear, tiptoe, toddle, totter, traipse, travel, troop, trot, vault, walk, wander, whiz, zigzag, zoom. * This list is expandable.

  2. As + UNIT OF TIME + go/pass + PREP.

    -As seconds go by. -As minutes pass on. -As days go on.
    -As centuries go by. -As hours pass by. - As years go by.
    This can already be captured, but we want to tag it automatically as a class of multi-word time expressions

  3. As + UNIT OF TIME + go/pass + VPG

    -As the years go marching on. -As the centuries go passing by.
    -As the weeks go marching on. -As the days go drifting by.

  4. As + UNIT OF TIME + MOTION VERB (as + some type-1 expressions above)

  5. It/that + (all) + take/last/go (on) + TIME EXPRESSION: for ages/a while/a long time/a short time/no time/a day/a month plus time units from type 1.

    -It all took ages. -That lasted a while. -It took a long time.
    -That went on for a while. -It lasted a month. -It took a short time.

  6. Verbs that indicate beginning/end of process in a neutral way: initiate, start, begin, end, finish, complete, open up/close down. OR verbs with a higher emotional value or metaphorical sense: explode, break loose, collapse, die, be born, break up, fade. * This is an expandable list: we prefer to run a pilot with this reduced number of items first.

    -The war started. -The war exploded/arrived/came/burst upon us/erupted/sped up/stopped.
    -Swing began in the thirties. -Swing was born in the thirties.
    -The application period opened up/closed down.

This sample lexicon will be expanded, but it contains the typical construction types the program needs to handle.
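Several MANNER-OF-MOTION VERBS entries in the sample lexicon, such as "sneak up" and "come tumbling down", span more than one token, so a membership test on a single lemma is not enough. One way to handle them is a longest-match lookup over lemma sequences, sketched below. The sketch assumes that lexicon entries are stored in lemmatized form (hence "come tumble down"); the list shown is a short excerpt, and `build_index` and `longest_match` are hypothetical helper names.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Excerpt of the verb list, stored as lemma sequences (an assumption).
MOTION_VERBS = ["fly", "shuffle", "sneak", "sneak up",
                "come tumble down", "roll", "roll past"]

Index = Dict[str, List[Tuple[str, ...]]]


def build_index(entries: Sequence[str]) -> Index:
    """Group entries by first lemma, longest entries first."""
    index: Index = {}
    for entry in entries:
        lemmas = tuple(entry.split())
        index.setdefault(lemmas[0], []).append(lemmas)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def longest_match(lemmas: Sequence[str], start: int, index: Index) -> Optional[int]:
    """Return the end position of the longest entry starting at `start`, or None."""
    for candidate in index.get(lemmas[start], []):
        end = start + len(candidate)
        if tuple(lemmas[start:end]) == candidate:
            return end
    return None


index = build_index(MOTION_VERBS)
sentence = ["holiday", "sneak", "up", "on", "she"]
print(longest_match(sentence, 1, index))
```

Entries with optional material, such as "roll slowly/quickly past", would need a richer pattern notation than the plain sequences used here.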

    Web-based frontend

    The files to be annotated can be assumed to be present in a database, let's say MongoDB, MySQL, or Solr.

    The user input consists instead of semantic categories that act as components of multiword expressions.

    Examples of such semantic categories are included in the backend task description at #1

    For instance, they may include the semantic categories "UNIT OF TIME" and "MANNER-OF-MOTION VERB".

    Do we use parameter files for the contents of these categories? If so, how do these parameter files interact with the mwetoolkit?

    If we can use parameter files, can we have a number that is small enough to fit the options into a user interface?

Hi,
I am Shahnawaz Ahmed from BITS Pilani Goa Campus, India. I am interested in the idea "Python tagger for multiword expression lexicon." I have worked on several projects related to text processing and classification which involved data scraping from web pages and structuring them to extract meaningful information. I will mention two relevant projects :
Automated Case - List SMS service :
This involved tagging cases from a list issued daily http://goo.gl/v7CLV4 using regular expressions and classifying them based on lawyer names and then sending a list of court cases for the next day. It required text splicing, catching specific strings sandwiched between keywords and constructing a database of different cases tagged by the lawyer names involving that case.
Data extraction from 99acres.com :
For an analysis regarding the price fluctuations of real estate in a city, I scraped html pages from 99acres.com to construct a data matrix with 5 year data of housing prices. The data was available only as points on a graph and had to be pruned to convert into a numpy array.
http://goo.gl/FPPA2Z
I like to work in python but I am also proficient in C and Java. I am currently trying out the MBSP library as mentioned in the ideas page. I also have experience in machine learning (Neural Networks, SVM) and classification algorithms. This might be helpful if you would like to implement a learning algorithm for improving tagging. I would love to discuss this further.
Email : shd339@gmail.com
irc : sahmed95

Hi @Liontooth

I am Roque López, a master's student in Computer Science at São Paulo University, Brazil. I have a strong interest in Natural Language Processing; for this reason I am doing my master's on Opinion Summarization at NILC ("Núcleo Interinstitucional de Linguística Computacional", http://www.nilc.icmc.usp.br/nilc/index.php). Over the last few years, I have worked on three projects in Peru and Brazil, and I used Python in all of them.

Of the ideas listed on your GSoC page, I am very interested in this one. I have already exchanged some emails with Professor Steen.

I performed some experiments for this tagger. The implementation is available here: https://github.com/rlopezc27/Multiword_Expression_Tagger In the small sample of the corpus I did not find any multiword expressions, so I added some to verify that the script is correct. Am I on the right track? I would appreciate any suggestions.

There is more about me on my personal page [1] and GitHub account [2].

Best regards,

[1] http://nilc.icmc.usp.br/nilc/pessoas/rlopez/ (in migration of servers) or http://br.linkedin.com/in/roquelopez
[2] https://github.com/rlopezc27

Hey guys I saw this and was confused about what the input will be exactly. We need a pattern to feed the mwetoolkit, so I'm guessing that the pattern (such as: UNIT OF TIME + MANNER-OF-MOTION VERB) will also be needed as input? Also, the "input text" you have specified, will this be a file or will it be a single string? (A single string seems to make little sense to me, but the issue, at one place, assumes the input to be "AND SO THE YEARS ROLLED BY.")