Extracts unique locations from text.

Unique Location Extractor

Overview

Text often contains references to several locations, and we frequently want to extract only the location of the event the text describes. For example, consider this tweet reporting a crash in Nairobi, Kenya, where we are interested in extracting the location of the crash:

crash occurred near garden city on thika road on your way towards roysambu.

The tweet contains three location references: (1) garden city, (2) thika road and (3) roysambu. 'garden city' is the name of multiple locations that are over 20 kilometers apart. Here, we want the garden city location on thika road, which represents the crash site.

The Unique Location Extractor (ULEx) geoparses text to extract the unique location of events. The algorithm takes advantage of contextual information contained within text (references to roads or administrative areas, such as neighborhoods) and determines which location references do not reference the event of interest and should be ignored.

This package was originally developed to extract locations of road traffic crashes from reports of crashes via Twitter, specifically in the context of Nairobi, Kenya using the Twitter feed @Ma3Route.

Installation

ULEx is an R package. Until the package is made available via devtools (coming soon!), the functions can be loaded by running the following script:

source("https://raw.githubusercontent.com/ramarty/Unique-Location-Extractor/master/R/load_ulex.R")

Main functions

The package contains two main functions:

  • augment_gazetteer: The backbone of locating events is looking up location references in a gazetteer, or geographic dictionary. The augment_gazetteer function facilitates cleaning a gazetteer that may have been constructed from sources such as OpenStreetMap, GeoNames or Google Maps.

  • locate_event: takes text as input and returns the location of the relevant event. Key inputs include the text to geoparse, a gazetteer of landmarks, spatial files of roads and areas (e.g., neighborhoods) and a list of event words.

Example

#### Packages
library(sf)      # needed for st_read
library(leaflet) # needed just to display output

#### Load Example Data
landmarks     <- st_read("https://raw.githubusercontent.com/ramarty/Unique-Location-Extractor/master/data/example_landmarks.geojson")
neighborhoods <- st_read("https://raw.githubusercontent.com/ramarty/Unique-Location-Extractor/master/data/example_areas.geojson")
roads         <- st_read("https://raw.githubusercontent.com/ramarty/Unique-Location-Extractor/master/data/example_roads.geojson")

#### Augment Gazetteer
landmarks_aug <- augment_gazetteer(landmarks,
                                   crs_distance = "+init=epsg:21037")

#### Locate Crashes in example tweets
tweets <- c("crash occurred near garden city on thika road on your way towards roysambu",
            "crash at garden city",
            "crash at intersection of juja road and outer ring rd",
            "crash occured near roysambu on thika rd",
            "crash at pangani")

crash_locs <- locate_event(text = tweets,
                           landmark_gazetteer = landmarks_aug,
                           areas = neighborhoods,
                           roads = roads,
                           event_words = c("accident", "crash", "collision", "wreck", "overturn"),
                           crs_distance = "+init=epsg:21037")

#### Display output
leaflet() %>%
  addTiles() %>%
  addCircles(data=crash_locs,
             label = ~text,
             opacity = 1,
             weight=10,
             color = "red")

augment_gazetteer

Description

The augment_gazetteer function adds landmarks to account for different ways of referring to the same landmark. For example, raw gazetteers may contain long, formal names even though shorter versions of the name are used more often. In addition, the function facilitates removing landmark names that are spurious or may confuse the algorithm; these include landmark names that are common words that may be used in different contexts, or frequent and generic landmarks such as hotel. Key components of the function include:

  1. Adds landmarks based on n-grams and skip-grams of landmark names. For example, from the original landmark garden city mall, the following landmarks will be added: garden city, city mall, and garden mall.
  2. Adds landmarks according to a set of rules: for example, if a landmark starts or ends with a certain word, an alternative version of the landmark is added that removes that word. Here, words that name a category of landmark are removed, because a user may not mention the category; for example, a user will more likely say McDonalds than McDonalds restaurant.
  3. Removes landmarks that refer to large geographic areas (e.g., roads). Roads and areas are dealt with separately; this function focuses on cleaning a gazetteer of specific points/landmarks.
  4. Determines whether a landmark should be categorized as specific or general. Specific landmarks are those where the name uniquely identifies a location. General landmarks are those where the name does not uniquely identify a location; however, a general landmark combined with contextual information such as a road can uniquely determine a location. Note that when multiple landmarks have the same name, but >90% of the landmarks are very closely clustered together, the landmarks in the cluster are designated as specific while the remaining landmarks are designated as general. The locate_event function only considers general landmarks when contextual information (roads or areas) is also referenced in the text.
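
The n-gram and skip-gram step in item 1 can be illustrated with a minimal base R sketch. The helper name_grams is hypothetical and is not part of ULEx; it only shows how the three names in the garden city mall example could arise, and it ignores the grams.* parameters.

```r
# Hypothetical helper (not part of ULEx): list the n-word selections of a
# landmark name, split into contiguous n-grams and gapped skip-grams.
name_grams <- function(name, n = 2) {
  words <- strsplit(name, " ", fixed = TRUE)[[1]]
  if (length(words) <= n) return(list(ngrams = character(0), skipgrams = character(0)))
  # Each column of idx is one order-preserving selection of n word positions
  idx <- combn(length(words), n)
  grams <- apply(idx, 2, function(i) paste(words[i], collapse = " "))
  # A selection is a plain n-gram when its positions are consecutive
  contiguous <- apply(idx, 2, function(i) all(diff(i) == 1))
  list(ngrams = grams[contiguous], skipgrams = grams[!contiguous])
}

grams <- name_grams("garden city mall")
grams$ngrams     # "garden city" "city mall"
grams$skipgrams  # "garden mall"
```

Here the only skip-gram keeps the first and last word of the original name, which is the case that grams.skip_gram_first_last_word_match appears to target.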

Parameters

Landmark Gazetteer

Parameters for the raw landmark gazetteer.

  • landmarks: Spatial Points Dataframe (or sf equivalent) of landmarks.
  • landmarks.name_var: Name of variable indicating name of landmark (default: "name")
  • landmarks.type_var: Name of variable indicating type of landmark (default: "type")

Remove Landmark Types

Removing landmarks based on the type of the landmark. 'types_rm' indicates which types of landmarks should be removed, and 'types_rm.except_with_type' and 'types_rm.except_with_name' indicate situations when 'types_rm' should be ignored. Note that a landmark can have more than one type.

  • types_rm: If landmark is one of these types, remove the landmark - unless prevented by 'types_rm.except_with_type' or 'types_rm.except_with_name'. Here, types that do not represent a single location are removed. (default: c("route", "road", "political", "locality", "neighborhood")).
  • types_rm.except_with_type: Always keep a landmark if its types include one of these types; overrides 'types_rm'. Includes types that indicate a specific location, even if another type category suggests it covers a larger area. For example, if a landmark has types 'route' and 'flyover', we want to keep this landmark as flyovers represent specific locations, not longer roads. (default: c("flyover"))
  • types_rm.except_with_name: Always keep a landmark if its name includes one of these words; overrides 'types_rm'. Includes names that indicate a specific location, even if another type category suggests it covers a larger area. For example, if a landmark has type 'route' and includes 'flyover' in its name, we want to keep this landmark as flyovers represent specific locations, not longer roads. (default: c("flyover"))
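
How the three types_rm parameters interact can be sketched as follows. The function drop_by_type is a hypothetical illustration, not the package's code, and it assumes the types of one landmark are supplied as a character vector; how ULEx stores multiple types is not stated here.

```r
# Hypothetical helper (not part of ULEx): TRUE when a landmark would be
# removed by 'types_rm' and neither exception applies.
drop_by_type <- function(name, types,
                         types_rm = c("route", "road", "political", "locality", "neighborhood"),
                         except_with_type = c("flyover"),
                         except_with_name = c("flyover")) {
  flagged   <- any(types %in% types_rm)
  keep_type <- any(types %in% except_with_type)
  # Assumption: the name exception matches whole words of the name
  name_words <- strsplit(tolower(name), "\\s+")[[1]]
  keep_name  <- any(name_words %in% except_with_name)
  flagged && !keep_type && !keep_name
}

drop_by_type("thika road", c("route"))                   # TRUE
drop_by_type("muthaiga flyover", c("route"))             # FALSE, name exception
drop_by_type("globe roundabout", c("route", "flyover"))  # FALSE, type exception
drop_by_type("garden city mall", c("shopping_mall"))     # FALSE, never flagged
```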

N/Skip-Grams

Parameters that determine how N and Skip-Grams should be generated and when they should be added to the gazetteer.

  • grams.min_words: Minimum number of words in name to make n/skip-grams out of name (default: 2)
  • grams.max_words: Maximum number of words in name to make n/skip-grams out of name. Setting a cap helps to reduce spurious landmarks that may come out of really long names. (default: 6)
  • grams.skip_gram_first_last_word_match: For skip-grams, should the first and last word be the same as in the original name? (default: TRUE)
  • grams.add_only_if_name_new: Only add N/skip-grams if these names do not already exist in the gazetteer (default: FALSE)
  • grams.add_only_if_specific: Only add N/skip-grams if the name represents a specific location (i.e., not a 'general' landmark with multiple, far away locations) (default: FALSE)

Parallel Landmarks

Changes the name of a landmark and adds the landmark as a new landmark to the gazetteer. Parameters indicate when and how to change names, and when parallel landmarks should be added to the gazetteer.

  • parallel.sep_slash: If a landmark has a slash, separate the landmark at the slash and add the components as new landmarks. (For example, landmark "a / b / c" will generate three new landmarks: "a", "b" and "c"). (default: TRUE)
  • parallel.rm_begin: If a landmark name begins with one of these words, add a landmark that excludes the word. (default: tm::stopwords("en"))
  • parallel.rm_end: If a landmark name ends with one of these words, add a landmark that excludes the word. (default: c("bar", "shops", "restaurant","sports bar","hotel", "bus station"))
  • parallel.word_diff: Generates parallel landmarks by swapping words with another word in a list. For example, "center" is replaced with "centre". OPTIONS: "none", "default" (accounts for some differences in british and american spelling), list of vectors (e.g., list(c("center", "centre"), c("theater", "theatre"))). (default: "default")
  • parallel.rm_begin_iftype: If a landmark name begins with one of these words, add a landmark that excludes the word if the landmark is a certain type. Input is a list of lists, where each sublist contains a vector of words and a vector of types (e.g., list(list(words = c("a", "b"), type = "t"))) (default: NULL)
  • parallel.rm_end_iftype: If a landmark name ends with one of these words, add a landmark that excludes the word if the landmark is a certain type. Input is a list of lists, where each sublist contains a vector of words and a vector of types (e.g., list(list(words = c("a", "b"), type = "t"))). (default: list(list(words = c("stage", "bus stop", "bus station"), type = "transit_station")))
  • parallel.word_diff_iftype: If the landmark includes one of these words, add landmarks that swap the word for each of the other words. Only done if the landmark is a certain type. (default: list(list(words = c("stage", "bus stop", "bus station"), type = "transit_station")))
  • parallel.add_only_if_name_new: Only add parallel landmarks if the name doesn't already exist in the gazetteer (default: TRUE)
  • parallel.add_only_if_specific: Only add parallel landmarks if the landmark name represents a specific location (i.e., not a 'general' landmark with multiple, far away locations) (default: FALSE)
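
A rough sketch of three of these rules (parallel.sep_slash, parallel.rm_end and parallel.word_diff) is given below. The function parallel_names is hypothetical and written only for illustration; the word_diff pairs are the two from the example above, not the package's "default" list, and the type-dependent rules are left out.

```r
# Hypothetical helper (not part of ULEx): generate alternative names for one
# landmark name using three of the parallel-landmark rules.
parallel_names <- function(name,
                           rm_end = c("bar", "shops", "restaurant", "sports bar",
                                      "hotel", "bus station"),
                           word_diff = list(c("center", "centre"),
                                            c("theater", "theatre"))) {
  out <- character(0)
  # parallel.sep_slash: each slash-separated component becomes a name
  if (grepl("/", name, fixed = TRUE)) {
    out <- c(out, trimws(strsplit(name, "/", fixed = TRUE)[[1]]))
  }
  # parallel.rm_end: drop a trailing category word or phrase
  for (w in rm_end) {
    suffix <- paste0(" ", w)
    if (endsWith(name, suffix)) {
      out <- c(out, substr(name, 1, nchar(name) - nchar(suffix)))
    }
  }
  # parallel.word_diff: swap spelling variants in either direction
  for (pair in word_diff) {
    for (k in 1:2) {
      from <- paste0("\\b", pair[k], "\\b")
      if (grepl(from, name, perl = TRUE)) {
        out <- c(out, sub(from, pair[3 - k], name, perl = TRUE))
      }
    }
  }
  setdiff(unique(out), name)
}

parallel_names("a / b / c")             # "a" "b" "c"
parallel_names("mcdonalds restaurant")  # "mcdonalds"
parallel_names("city center")           # "city centre"
```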

Add Parallel Landmarks: Same name, but add type

Add a parallel landmark that includes an additional type

  • parallel_type.word_begin_addtype: If the landmark begins with one of these words, add the type. (default: NULL)
  • parallel_type.word_end_addtype: If the landmark ends with one of these words, add the type. For example, if landmark is "X stage", this indicates the landmark is a bus stage. Adding the "stage" to landmark ensures that the type is reflected. (default: list(list(words = c("stage", "bus stop", "bus station"), type = "stage")))

Remove Landmarks

Parameters that decide which landmarks to remove based on their names, applied after N/skip-grams and parallel landmarks are added.

  • rm.contains: Remove the landmark if it contains one of these words. Implemented after N/skip-grams and parallel landmarks are added. (default: c("road", "rd"))
  • rm.name_begin: Remove the landmark if it begins with one of these words. Implemented after N/skip-grams and parallel landmarks are added. (default: c(stopwords("en"), c("near","at","the", "towards", "near")))
  • rm.name_end: Remove the landmark if it ends with one of these words. Implemented after N/skip-grams and parallel landmarks are added. (default: c("highway", "road", "rd", "way", "ave", "avenue", "street", "st"))

Other

  • close_dist_thresh: The distance to consider landmarks close together; relevant when generating 'specific' and 'general' landmarks. Distance is in spatial units of 'crs_distance'; if projected, then meters. (default: 500)
  • crs_distance: Coordinate reference system to use for distance calculations.
  • crs_out: Coordinate reference system for output. (default: "+init=epsg:4326")
  • quiet: Show algorithm progress (default: FALSE)
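
To make the role of close_dist_thresh in the specific/general distinction more concrete, here is one possible way to label landmarks that share a name. This is an assumption about how a ">90% closely clustered" rule could be implemented, not the package's actual method; label_specificity and its dominant_share argument are hypothetical, and coordinates are taken to be in a projected CRS with meters as units.

```r
# Hypothetical helper (not part of ULEx): label same-name landmarks.
# 'xy' is a two-column matrix of projected coordinates in meters.
label_specificity <- function(xy, close_dist_thresh = 500, dominant_share = 0.9) {
  d <- as.matrix(dist(xy))
  # near[i, j] is TRUE when landmarks i and j are within the threshold
  near <- d <= close_dist_thresh
  # Use the landmark with the most close neighbours as the cluster centre
  best <- which.max(rowSums(near))
  in_cluster <- unname(near[best, ])
  if (mean(in_cluster) > dominant_share) {
    ifelse(in_cluster, "specific", "general")
  } else {
    rep("general", nrow(xy))
  }
}

# Ten landmarks within 90 m of each other plus one 30 km away
xy <- cbind(x = c(seq(10, 100, by = 10), 30000), y = 0)
label_specificity(xy)
```

In this example ten of the eleven landmarks (about 91%) fall in one cluster, so they are labelled specific and the distant one general. Two landmarks 25 km apart would both be labelled general.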

locate_event

Description

The locate_event function extracts landmarks from text and determines the unique location of events from the text.

To extract location references from text, the function implements the following steps. The steps partly overlap, since several of them may extract the same landmark; however, each step finds landmarks in some circumstances that the others miss.

  1. Determines whether any text matches names in the gazetteer. Both exact and 'fuzzy' matches (allowing a certain Levenshtein distance) are used.
  2. Relies on words after prepositions to find locations. The algorithm starts with a word after a preposition and extracts all landmarks that contain that word. Then, the algorithm takes the next word in the text and further subsets the landmarks. This process is repeated until adding a word removes all landmarks. If a road or area (e.g., neighborhood) is found in the previous step, only landmarks near that road or area are considered. Landmarks with the fewest words are kept (i.e., if this process finds 5 landmarks with 2 words and 7 landmarks with 3 words, only the 5 landmarks with 2 words are kept).
  3. If a road or area is mentioned and a landmark is not near that road or area, longer versions of the landmark that are near the road or area are searched for. For example, if a user says crash near garden on thika road, the algorithm may extract multiple landmarks with the name garden, none of which are near thika road. It will then search for all landmarks that contain garden in them (e.g., garden city mall) that are near thika road.
  4. If two roads are mentioned, the algorithm extracts the intersection of the roads.
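
The preposition-based search in step 2 can be sketched in base R as below. The helper landmarks_after_preposition is hypothetical, the preposition list is illustrative, and the sketch matches whole words only; it leaves out the restriction to landmarks near a mentioned road or area.

```r
# Hypothetical helper (not part of ULEx): starting at the word after each
# preposition, keep adding words while some gazetteer name contains them all.
landmarks_after_preposition <- function(text, gazetteer,
                                        prepositions = c("at", "near", "on", "towards")) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  gaz_words <- strsplit(gazetteer, " ", fixed = TRUE)
  found <- character(0)
  for (p in which(words %in% prepositions)) {
    keep <- rep(TRUE, length(gazetteer))
    i <- p + 1
    while (i <= length(words)) {
      has_word <- vapply(gaz_words, function(w) words[i] %in% w, logical(1))
      # Stop once the next word would remove every remaining landmark
      if (!any(keep & has_word)) break
      keep <- keep & has_word
      i <- i + 1
    }
    if (i > p + 1) {
      # Of the surviving landmarks, keep those with the fewest words
      n_words <- lengths(gaz_words[keep])
      found <- c(found, gazetteer[keep][n_words == min(n_words)])
    }
  }
  unique(found)
}

gazetteer <- c("garden city", "garden city mall", "garden estate", "roysambu", "pangani")
landmarks_after_preposition(
  "crash occurred near garden city on thika road on your way towards roysambu",
  gazetteer
)
# "garden city" "roysambu"
```

After "near", the words garden and city narrow the candidates to garden city and garden city mall, and the shorter name is kept. The two "on" prepositions yield nothing because thika and your are not in this small gazetteer of landmarks.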

After extracting landmarks, the algorithm chooses the correct landmark using a series of steps. These steps consider a defined list of event words (e.g., for road traffic crashes, these could include 'crash', 'accident', 'overturn', etc.), whether the user mentions a junction word (e.g., 'junction' or 'intersection') and a list of prepositions. Certain prepositions are given precedence over others to distinguish between locations indicating the location of an event versus locations further away that provide additional context; for example, at takes higher precedence than towards. The main steps are applied in the following order:

  1. Locations that follow the pattern [event word] [preposition] [location] are extracted.
  2. Locations that follow the pattern [preposition] [location] are extracted. If there are multiple occurrences, the location near the higher order preposition is used. If there is a tie, the location closest to the event word is used. TODO: parameterize which should be prioritized: (1) location to event word or (2) preposition priority. Which one should we default and which should be tie-breaker? Not obvious, for example: accident towards thika mall at garden city.
  3. If a junction word is used, two roads are mentioned, and the two roads intersect once, the intersection point is used.
  4. The location closest to the event word is used.
  5. If the location name has multiple locations, we (1) restrict to locations near any mentioned road or area, (2) check for a dominant cluster of locations and (3) prioritize certain landmark types over others (e.g., a user is more likely to reference a large, well known location type like a stadium).
  6. If a landmark is not found, but a road or area is found, the road or area is returned. If both a road and an area are mentioned, the intersection of the road and area is returned.
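
Step 2 of this list, preposition precedence with distance to the event word as the tie-breaker, can be sketched as follows. The function choose_location and the tiers in prepositions_list are hypothetical; the text above only states that at ranks above towards, so the rest of the ordering is an assumption made for illustration.

```r
# Hypothetical helper (not part of ULEx): choose one location mention.
# 'locs' has one row per mention: name, the preposition directly before it,
# and the word position of the mention in the text.
choose_location <- function(locs, event_pos,
                            prepositions_list = list(c("at"), c("near"),
                                                     c("on"), c("towards"))) {
  # Tier 1 is the highest precedence; NA when the preposition is unknown
  tier_of <- function(p) {
    hit <- which(vapply(prepositions_list, function(tier) p %in% tier, logical(1)))
    if (length(hit) == 0) NA_integer_ else hit[1]
  }
  locs$tier <- vapply(locs$preposition, tier_of, integer(1), USE.NAMES = FALSE)
  locs <- locs[!is.na(locs$tier), ]
  if (nrow(locs) == 0) return(NA_character_)
  best <- locs[locs$tier == min(locs$tier), ]
  # Tie-breaker: the mention closest to the event word
  best$name[which.min(abs(best$position - event_pos))]
}

# "accident towards thika mall at garden city": event word at position 1
locs <- data.frame(name = c("thika mall", "garden city"),
                   preposition = c("towards", "at"),
                   position = c(3, 6),
                   stringsAsFactors = FALSE)
choose_location(locs, event_pos = 1)  # "garden city"
```

With preposition priority applied first, garden city wins even though thika mall is closer to the event word; this is the ordering question raised in the TODO above.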

Parameters

  • landmark_gazetteer: SpatialPointsDataFrame or sf object with points.
  • landmark_gazetteer.name_var: Name of variable indicating name of landmark
  • landmark_gazetteer.type_var: Name of variable indicating type of landmark
  • landmark_gazetteer.gs_var: Name of variable indicating whether landmark is general or specific
  • roads: SpatialLinesDataFrame or sf object with lines.
  • roads.name_var: Name of variable indicating name of road
  • areas: SpatialPolygonsDataFrame or sf object with polygons. Represents administrative areas.
  • areas.name_var: Name of variable indicating name of area.
  • prepositions_list: List of vectors of prepositions. Order of list determines order of preposition precedence.
  • event_words: Vector of event words.
  • junction_words: Vector of junction words.
  • false_positive_phrases: Common phrases found in text that include spurious location references (e.g., githurai bus is the name of a bus, but includes the location githurai). These are common enough that we should look for and remove them.
  • type_list: List of vectors of types. Order of list determines order of type precedence.
  • clost_dist_thresh: Distance (meters) as to what is considered "close" (eg, is the landmark "close" to a road?)
  • fuzzy_match: Whether to implement fuzzy matching of landmarks using Levenshtein distance.
  • TODO: Combine below two into one... eg, a list?
  • fuzzy_match.min_word_length: Minimum word length to use fuzzy/Levenshtein distance for matching.
  • fuzzy_match.dist: Allowable Levenshtein distances. The vector must be the same length as fuzzy_match.min_word_length.
  • fuzzy_match.ngram_max: The maximum number of words in the n-grams extracted from text to calculate a Levenshtein distance against landmarks. For example, if the text is composed of 5 words: w1 w2 w3 w4 w5 and fuzzy_match.ngram_max=3, the function extracts [w1 w2 w3] and compares the Levenshtein distance to all landmarks. Then it checks [w2 w3 w4], etc.
  • fuzzy_match.first_letters_same: When implementing a fuzzy match, should the first letter of the original and found word be the same?
  • fuzzy_match.last_letters_same: When implementing a fuzzy match, should the last letter of the original and found word be the same?
  • crs_distance: Coordinate reference system to calculate distances. Should be projected.
  • crs_out: Coordinate reference system for output.
  • quiet: If TRUE, lets user know how far along the algorithm is.
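
The fuzzy matching parameters can be illustrated with a sketch built on utils::adist, which computes Levenshtein distances in base R. The function fuzzy_landmarks is hypothetical and simplified: it uses a single distance threshold instead of the paired fuzzy_match.min_word_length and fuzzy_match.dist vectors, and it assumes windows of 1 up to ngram_max words are compared, which the description above does not spell out.

```r
# Hypothetical helper (not part of ULEx): return gazetteer names within
# 'max_dist' edits of any window of up to 'ngram_max' words in the text.
fuzzy_landmarks <- function(text, gazetteer, ngram_max = 3, max_dist = 1,
                            first_letters_same = TRUE) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  windows <- character(0)
  for (n in seq_len(min(ngram_max, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    windows <- c(windows, vapply(starts, function(s) {
      paste(words[s:(s + n - 1)], collapse = " ")
    }, character(1)))
  }
  # Rows are text windows, columns are gazetteer names
  ok <- utils::adist(windows, gazetteer) <= max_dist
  if (first_letters_same) {
    ok <- ok & outer(substr(windows, 1, 1), substr(gazetteer, 1, 1), "==")
  }
  gazetteer[colSums(ok) > 0]
}

fuzzy_landmarks("crash near roysambo on thika rd",
                c("roysambu", "pangani", "garden city"))
# "roysambu"
```

The misspelled roysambo is one substitution away from roysambu and shares its first letter, so it matches; no window is within one edit of the other two names.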