Various scripts for downloading and building datasets.
```sh
# Download English Wikimedia dumps
npx -p @tartarus/data td fetch wikimedia --output /tmp/
```
Requires `node` and `wget`. (Tested on macOS with `node==10.14.2` and `wget==1.20.3`.)
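A quick way to confirm both prerequisites are on the `PATH`:

```sh
node --version               # e.g. v10.14.2
wget --version | head -n 1   # e.g. GNU Wget 1.20.3
```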
```sh
# Download the Project Gutenberg library and catalogue
td fetch gutenberg --output /my/download/path
```
```sh
# Download Wikimedia dumps for English- and Spanish-language Wikipedia and Wikiquote
td fetch wikimedia --output /my/download/path \
  --language en \
  --language es \
  --site wiki \
  --site wikiquote
```
```sh
# Crawl an API endpoint
td spider --output /my/download/path --site /my/site/config/file.ts
```
`td spider` collects sequential data from APIs. Both iterative counters (e.g. a page number) and URLs extracted from responses ('next page' links) are supported. A spider requires a JS/TS configuration file to customize its use:
```ts
import { SpiderJsonData, SpiderHttpNavigator, SpiderStore, SpiderHandle } from '@tartarus/data';

export default {
  name: 'myexamplesite',

  // Determine how the received data will be parsed
  // Supported: SpiderJsonData, SpiderYamlData, SpiderXmlData, SpiderCsvData, SpiderHtmlData, SpiderTextData
  data: new SpiderJsonData(),

  // Determine what to query (required)
  navigator: new SpiderHttpNavigator({
    // Base URL
    baseTarget: 'https://api.domain.ext/v1/list',

    // Determine how URLs are formed (return null to stop spidering)
    target: (h: SpiderHandle): string | null => `${h.getBaseTarget()}?page=${h.getIteration()}`,

    // Optional callback to test whether spidering should cease
    isDone: (h: SpiderHandle): boolean => false
  }),

  // Determine how responses are stored
  store: new SpiderStore({
    // Number of sub-directories to be used
    subDirectoryDepth: 3,

    // Filename to be used; return null to stop spidering
    filename: (h: SpiderHandle) => `${h.getIteration()}.json`
  }),

  // Request decorator
  request: {
    headers: {
      'User-Agent': 'Tartarus-Data-Spider/1.0 (me@email.ext)'
    },
    method: 'get',
    responseEncoding: 'utf8'
  },

  // Spider behavior
  behavior: {
    delay: 1500,       // Delay between requests in milliseconds
    retryDelay: 15000, // Delay before retrying failed requests
    maxRetries: 15     // Number of times to retry a failed request
  }
};
```
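The example above pages through the API with an iteration counter. For endpoints that instead return an explicit 'next page' URL, the `target` callback could follow that link. This is a minimal sketch only: `getData()` and the `next` field are hypothetical names used for illustration, not confirmed parts of the `SpiderHandle` API.

```ts
import { SpiderHandle } from '@tartarus/data';

// Sketch only: follow a 'next' link extracted from the previous JSON response.
// `h.getData()` is a hypothetical accessor for the last parsed payload;
// `data.next` assumes the API embeds the follow-up URL under that key.
const target = (h: SpiderHandle): string | null => {
  if (h.getIteration() === 0) {   // Assuming iterations start at zero
    return h.getBaseTarget();     // First request goes to the base URL
  }
  const previous = h.getData();   // Hypothetical: last SpiderParsedData
  return previous?.data?.next ?? null; // Returning null stops the spider
};
```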
The spider exposes its current status and the latest downloaded page to the callback functions by passing them an instance of the `SpiderHandle` class.
Returns an object that describes the data received in response to a successful query. The `data` element contains the parsed (JSON) object of the response; the `raw` element contains a string representation of the response data.
```ts
interface SpiderParsedData {
  raw: string;
  data: any;
}
```
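As an illustration, a consumer of this structure might narrow the parsed payload before using it; the `items` and `id` fields below are assumptions about one particular API's response shape, not part of the library.

```ts
// Sketch: pull string ids out of a payload shaped like SpiderParsedData.
// `raw` keeps the original response text for logging or re-parsing.
function extractIds(parsed: { raw: string; data: any }): string[] {
  const items = Array.isArray(parsed.data?.items) ? parsed.data.items : [];
  return items.map((item: any) => String(item.id));
}
```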
Returns an object that contains a descriptor of a successful query (`rawResponse`) and the raw data received (`rawData`). The contents of the `rawResponse` element depend on the type of `SpiderNavigator` in use: for `SpiderHttpNavigator` it will be an `AxiosResponse<any>`; for `SpiderFileNavigator` it will be set to `null`.
```ts
interface SpiderNavigatorFetchResponse {
  rawData: string;
  rawResponse: any;
}
```
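One plausible use is an `isDone` callback that stops the crawl when the API signals exhaustion. In this sketch, `getResponse()` is a hypothetical accessor name for the structure above; assuming `SpiderHttpNavigator` is in use, `rawResponse` carries the Axios status.

```ts
import { SpiderHandle } from '@tartarus/data';

// Sketch: stop once the API returns a non-200 status or an empty body.
// `h.getResponse()` is a hypothetical accessor, not a confirmed API.
const isDone = (h: SpiderHandle): boolean => {
  const fetched = h.getResponse(); // Hypothetical: SpiderNavigatorFetchResponse
  if (!fetched) return false;
  return fetched.rawResponse?.status !== 200 || fetched.rawData.length === 0;
};
```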
Returns the value passed as the `baseTarget` element to the `SpiderNavigator` instance. Typically a URL.
Returns the current iteration.
Returns an absolute path to `relativeFilename` in the output directory.
Returns the contents of the site configuration as described above.
Returns the `Task` instance of the spider.
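Combined with `getIteration()`, the `filename` callback given to `SpiderStore` can both name the stored files and cap a crawl, since returning `null` stops the spider. A minimal sketch using only the accessors documented above:

```ts
import { SpiderHandle } from '@tartarus/data';

// Sketch: zero-padded filenames, stopping after 100,000 pages.
// With subDirectoryDepth set, the store shards files into nested
// directories; the exact sharding scheme is the library's own.
const filename = (h: SpiderHandle): string | null => {
  const n = h.getIteration();
  return n < 100000 ? `${String(n).padStart(8, '0')}.json` : null;
};
```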