This package implements a variety of data structures, including
- Deque (based on block-list)
- Stack
- Queue
- Accumulators and Counters
- Disjoint Sets
- Binary Heap
- Mutable Binary Heap
- Ordered Dicts and Sets
- Dictionaries with Defaults
- Trie
- Linked List
- Sorted Dict, Sorted Multi-Dict and Sorted Set
- DataStructures.IntSet
The Deque
type implements a double-ended queue using a list of blocks. This data structure supports constant-time insertion/removal of elements at both ends of a sequence.
Usage:
a = Deque{Int}() isempty(a) # test whether the dequeue is empty length(a) # get the number of elements push!(a, 10) # add an element to the back pop!(a) # remove an element from the back unshift!(a, 20) # add an element to the front shift!(a) # remove an element from the front front(a) # get the element at the front back(a) # get the element at the back
Note: Julia's Vector
type also provides this interface, and thus can be used as a deque. However, the Deque
type in this package is implemented as a list of contiguous blocks (default size = 2K). As a deque grows, new blocks may be created and linked to existing blocks. This way avoids the copying when growing a vector.
Benchmark shows that the performance of Deque
is comparable to Vector
on push!
, and is noticeably faster on unshift!
(by about 30% to 40%).
The Stack
and Queue
types are a light-weight wrapper of a deque type, which respectively provide interfaces for FILO and FIFO access.
Usage of Stack:
s = Stack(Int) push!(s, x) x = top(s) x = pop!(s)
Usage of Queue:
q = Queue(Int) enqueue!(q, x) x = front(q) x = back(q) x = dequeue!(q)
A accumulator, as defined below, is a data structure that maintains an accumulated number for each key. This is a counter when the accumulated values reflect the counts:
type Accumulator{K, V<:Number} map::Dict{K, V} end
There are different ways to construct an accumulator/counter:
a = accumulator(K, V) # construct an accumulator with key-type K and # accumulated value type V a = accumulator(dict) # construct an accumulator from a dictionary a = counter(K) # construct a counter, i.e. an accumulator with # key type K and value type Int a = counter(dict) # construct a counter from a dictionary a = counter(seq) # construct a counter by counting keys in a sequence
Usage of an accumulator/counter:
# let a and a2 be accumulators/counters a[x] # get the current value/count for x. # if x was not added to a, it returns zero(V) push!(a, x) # add the value/count for x by 1 push!(a, x, v) # add the value/count for x by v push!(a, a2) # add all counts from a2 to a1 pop!(a, x) # remove a key x from a, and returns its current value merge(a, a2) # return a new accumulator/counter that combines the # values/counts in both a and a2
Some algorithms, such as finding connected components in undirected graph and Kruskal's method of finding minimum spanning tree, require a data structure that can efficiently represent a collection of disjoint subsets. A widely used data structure for this purpose is the Disjoint set forest.
Usage:
a = IntDisjointSets(10) # creates a forest comprised of 10 singletons union!(a, 3, 5) # merges the sets that contain 3 and 5 into one in_same_set(a, x, y) # determines whether x and y are in the same set elem = push!(a) # adds a single element in a new set; returns the new element # (this operation is often called MakeSet)
One may also use other element types:
a = DisjointSets{AbstractString}(["a", "b", "c", "d"]) union!(a, "a", "b") in_same_set(a, "c", "d") push!(a, "f")
Note that the internal implementation of IntDisjointSets
is based on vectors, and is very efficient. DisjointSets{T}
is a wrapper of IntDisjointSets
, which uses a dictionary to map input elements to an internal index.
Heaps are data structures that efficiently maintain the minimum (or maximum) for a set of data that may dynamically change.
All heaps in this package are derived from AbstractHeap
, and provide the following interface:
# Let h be a heap, i be a handle, and v be a value. length(h) # returns the number of elements isempty(h) # returns whether the heap is empty push!(h, v) # add a value to the heap top(h) # return the top value of a heap pop!(h) # removes the top value, and returns it
Mutable heaps (values can be changed after being pushed to a heap) are derived from
AbstractMutableHeap <: AbstractHeap
, and additionally provides the following interface:
i = push!(h, v) # adds a value to the heap and and returns a handle to v update!(h, i, v) # updates the value of an element (referred to by the handle i)
Currently, both min/max versions of binary heap (type BinaryHeap
) and mutable binary heap (type MutableBinaryHeap
) have been implemented.
Examples of constructing a heap:
h = binary_minheap(Int) h = binary_maxheap(Int) # create an empty min/max binary heap of integers h = binary_minheap([1,4,3,2]) h = binary_maxheap([1,4,3,2]) # create a min/max heap from a vector h = mutable_binary_minheap(Int) h = mutable_binary_maxheap(Int) # create an empty mutable min/max heap h = mutable_binary_minheap([1,4,3,2]) h = mutable_binary_maxheap([1,4,3,2]) # create a mutable min/max heap from a vector
Heaps can be used to extract the largest or smallest elements of an array without sorting the entire array first:
nlargest(3, [0,21,-12,68,-25,14]) # => [68,21,14] nsmallest(3, [0,21,-12,68,-25,14]) # => [-25,-12,0]
nlargest(n, a)
is equivalent to sort(a, lt = >)[1:min(n, end)]
, and
nsmallest(n, a)
is equivalent to sort(a, lt = <)[1:min(n, end)]
.
OrderedDicts
are simply dictionaries whose entries have a
particular order. For OrderedDicts
(and OrderedSets
), order
refers to insertion order, which allows deterministic iteration over
the dictionary or set:
d = OrderedDict(Char,Int) for c in 'a':'e' d[c] = c-'a'+1 end collect(d) # => [('a',1),('b',2),('c',3),('d',4),('e',5)] s = OrderedSet(π,e,γ,catalan,φ) collect(s) # => [π = 3.1415926535897..., # e = 2.7182818284590..., # γ = 0.5772156649015..., # catalan = 0.9159655941772..., # φ = 1.6180339887498...]
All standard Associative
and Dict
functions are available for
OrderedDicts
, and all Set
operations are available for
OrderedSets.
Note that to create an OrderedSet of a particular type, you must specify the type in curly-braces:
# create an OrderedSet of Strings strs = OrderedSet{AbstractString}()
A DefaultDict allows specification of a default value to return when a requested key is not in a dictionary.
While the implementation is slightly different, a DefaultDict
can be thought to provide a normal Dict
with a default value. A DefaultOrderedDict
does the same for an OrderedDict
.
Constructors:
DefaultDict(default, kv) # create a DefaultDict with a default value or function, # optionally wrapping an existing dictionary # or array of key-value pairs DefaultDict(KeyType, ValueType, default) # create a DefaultDict with Dict type (KeyType,ValueType) DefaultOrderedDict(default, kv) # create a DefaultOrderedDict with a default value or function, # optionally wrapping an existing dictionary # or array of key-value pairs DefaultOrderedDict(KeyType, ValueType, default) # create a DefaultOrderedDict with Dict type (KeyType,ValueType)
Examples using DefaultDict
:
dd = DefaultDict(1) # create an (Any=>Any) DefaultDict with a default value of 1 dd = DefaultDict(AbstractString, Int, 0) # create a (AbstractString=>Int) DefaultDict with a default value of 0 d = ['a'=>1, 'b'=>2] dd = DefaultDict(0, d) # provide a default value to an existing dictionary dd['c'] == 0 # true #d['c'] == 0 # false dd = DefaultOrderedDict(time) # call time() to provide the default value for an OrderedDict dd = DefaultDict(Dict) # Create a dictionary of dictionaries # Dict() is called to provide the default value dd = DefaultDict(()->myfunc()) # call function myfunc to provide the default value # These all create the same default dict dd = DefaultDict(AbstractString, Vector{Int}, () -> Vector{Int}()) dd = DefaultDict(AbstractString, Vector{Int}, () -> Int[]) # dd = DefaultDict(AbstractString, Vector{Int}, # **Note! Julia v0.4 and later only! # Vector{Int}) # the second Vector{Int} is called as a function push!(dd["A"], 1) push!(dd["B"], 2) julia> dd DefaultDict{AbstractString,Array{Int64,1},Function} with 2 entries: "B" => [2] "A" => [1] # create a Dictionary of type AbstractString=>DefaultDict{AbstractString, Int}, where the default of the # inner set of DefaultDicts is zero dd = DefaultDict(AbstractString, DefaultDict, () -> DefaultDict(AbstractString,Int,0))
Note that in the last example, we need to use a function to create each new DefaultDict
.
If we forget, we will end up using the same DefaultDict
for all default values:
julia> dd = DefaultDict(AbstractString, DefaultDict, DefaultDict(AbstractString,Int,0)); julia> dd["a"] DefaultDict{AbstractString,Int64,Int64,Dict{K,V}}() julia> dd["b"]["a"] = 1 1 julia> dd["a"] ["a"=>1]
An implementation of the Trie data structure. This is an associative structure, with AbstractString keys:
t=Trie{Int}() t["Rob"]=42 t["Roger"]=24 haskey(t,"Rob") #true get(t,"Rob",nothing) #42 keys(t) # "Rob", "Roger"
Constructors:
Trie(keys, values) # construct a Trie with the given keys and values Trie(keys) # construct a Trie{Void} with the given keys and with values = nothing Trie(kvs::AbstractVector{(K, V)}) # construct a Trie from the given vector of (key, value) pairs Trie(kvs::Associative{K, V}) # construct a Trie from the given associative structure
This package also provides an iterator path(t::Trie, str)
for looping over all the nodes
encountered in searching for the given string str
.
This obviates much of the boilerplate code needed in writing many trie algorithms.
For example, to test whether a trie contains any prefix of a given string,
use:
seen_prefix(t::Trie, str) = any(v -> v.is_key, path(t, str))
A list of sequentially linked nodes. This allows efficient insertion of nodes to the front of the list:
julia> l1 = nil() nil() julia> l2 = cons(1, l1) list(1) julia> l3 = list(2, 3) list(2, 3) julia> l4 = cat(l1, l2, l3) list(1, 2, 3) julia> l5 = map((x) -> x*2, l4) list(2, 4, 6) julia> for i in l5; print(i); end 246
DataStructures.IntSet
is a drop-in replacement for the Base IntSet
type. It
efficiently stores dense collections of small non-negative Int
s as a sorted
set. The constructor IntSet([itr])
constructs a sorted set of the integers
generated by the given iterable object, or an empty set if no argument is
given. If the set will be sparse (for example holding a few very large
integers), use Set
or SortedSet
instead.
A complement IntSet may be constructed with complement
or complement!
. The
complement of an empty IntSet
contains typemax(Int)
elements from 0 to
typemax(Int)-1
.
Three sorted containers are provided:
SortedDict, SortedMultiDict and SortedSet.
SortedDict is similar to the built-in Julia type Dict
with the additional feature that the keys are stored in
sorted order and can be efficiently iterated in this order.
SortedDict is a subtype of Associative. It is slower than Dict
because looking up a key requires an O(log n) tree search rather than
an expected O(1) hash-table lookup time as with Dict.
SortedDict is
a parametrized type with three parameters, the key type K
, the
value type V
, and the ordering type O
.
SortedSet has
only keys; it is an alternative to the built-in
Set
container. Internally,
SortedSet is implemented as a SortedDict in which the value type
is Void
.
Finally, SortedMultiDict is similar to SortedDict except that each key
can be associated with multiple values. The key=>value pairs in
a SortedMultiDict are stored according to the sorted order for keys,
and key=>value pairs with the same
key are stored in order of insertion.
The containers internally use a 2-3 tree, which is a kind of balanced tree and is described in many elementary data structure textbooks.
The containers require two functions to compare keys: a less-than and
equals function. With the
default ordering argument, the comparison
functions are isless(key1,key2)
(true when key1 < key2
)
and isequal(key1,key2)
(true when key1 == key2
) where key1
and key2
are keys.
More details are provided below.
The sorted container objects use a special type for indexing
called a token
defined as a two-entry tuple and aliased as
SDToken
, SMDToken
, and SetToken
for SortedDict, SortedMultiDict
and SortedSet respectively.
A token is the
address of a single data item in the container and can be
dereferenced in time O(1).
The first entry of a Token tuple is the container as a whole, and the
second refers to the particular item. The second part is called a
semitoken. The types for a semitoken are SDSemiToken
, SMDSemiToken
,
and SetSemiToken
for the three types of containers SortedDict, SortedMultiDict
and SortedSet. These types are
all aliases of IntSemiToken
.
A restriction for the sorted containers is that
IntSemiToken
or its aliases cannot used as the key-type.
This is because
ambiguity would result between the
two subscripting calls sc[k]
and sc[st]
described below. In the
rare scenario that a sorted container whose key-type is IntSemiToken
is required,
a workaround is to wrap the key inside another immutable
structure.
In the current
version of Julia, it is costly to operate on tuples whose entries are not bits-types
because such tuples are allocated on the heap.
For example, the first entry of a token is a pointer to a container
(a non-bits type), so
a new token is allocated on the heap rather than the
stack. In order to avoid performance loss, the package uses tokens
less frequently than semitokens. For a function
taking a token as an argument like
deref
described below, if it is invoked by explicitly naming
the token like this:
tok = (sc,st) # sc is a sorted container, st is a semitoken k,v = deref(tok)
then there may be a loss of performance compared to:
k,v = deref((sc,st))
because the former needs an extra heap allocation step for tok
.
The notion of token is similar to the concept of iterators used by C++ standard containers. Tokens can be explicitly advanced or regressed through the data in the sorted order; they are implicitly advanced or regressed via iteration loops defined below.
A token may take two special values: the before-start value and the past-end value. These values act as lower and upper bounds on the actual data. The before-start token can be advanced, while the past-end token can be regressed. A dereferencing operation on either leads to an error.
In the current implementation, semitokens are internally stored as integers. However, for the purpose of future compatibility, the user should not extract this internal representation; these integers do not have a documented interpretation in terms of the container.
SortedDict(d)
Argument
d
is an ordinary Julia dict (or any associative type) used to initialize the container:c = SortedDict(Dict("New York" => 1788, "Illinois" => 1818))
In this example the key-type is deduced to be ASCIIString, while the value-type is Int. The default ordering object
Forward
is used. See below for more information about ordering.SortedDict(d,o)
- Argument
d
is an ordinary Julia dict (or any associative type) used to initialize the container ando
is an ordering object used for ordering the keys. SortedDict(k1=>v1, k2=>v2, ...)
- Arguments are key-value pairs for insertion into the dictionary. The keys must be of the same type as one another; the values must also be of one type.
SortedDict(o, k1=>v1, k2=>v2, ...)
- The first argument
o
is an ordering object. The remaining arguments are key-value pairs for insertion into the dictionary. The keys must be of the same type as one another; the values must also be of one type. SortedDict(iter)
- Takes an arbitrary iterable object of key=>value pairs. The default Forward ordering is used.
SortedDict(iter,o)
- Takes an arbitrary iterable object of key=>value pairs.
The ordering object
o
is explicitly given. SortedDict{K,V,Ord}(o)
- Construct an empty SortedDict in which type parameters
are explicitly listed; ordering object is explicitly specified.
(See below for discussion of ordering.) An empty SortedDict
may also be constructed using
SortedDict(Dict{K,V}(),o)
where theo
argument is optional. SortedMultiDict(ks,vs,o)
- Construct a SortedMultiDict using keys given by
ks
, values given byvs
and ordering objecto
. The ordering object defaults toForward
if not specified. The two argumentsks
andvs
are 1-dimensional arrays of the same length in whichks
holds keys andvs
holds the corresponding values. SortedMultiDict(k1=>v1, k2=>v2, ...)
- Arguments are key-value pairs for insertion into the multidict. The keys must be of the same type as one another; the values must also be of one type.
SortedMultiDict(o, k1=>v1, k2=>v2, ...)
- The first argument
o
is an ordering object. The remaining arguments are key-value pairs for insertion into the multidict. The keys must be of the same type as one another; the values must also be of one type. SortedMultiDict(iter)
- Takes an arbitrary iterable object of key=>value pairs. The default Forward ordering is used.
SortedMultiDict(iter,o)
- Takes an arbitrary iterable object of key=>value pairs.
The ordering object
o
is explicitly given. SortedMultiDict{K,V,Ord}(o)
- Construct an empty sorted multidict in which type parameters
are explicitly listed; ordering object is explicitly specified.
(See below for discussion of ordering.) An empty SortedMultiDict
may also be constructed via
SortedMultiDict(K[], V[], o)
where theo
argument is optional. SortedSet(iter,o)
- Construct a SortedSet using keys given by iterable
iter
(e.g., an array) and ordering objecto
. The ordering object defaults toForward
if not specified. SortedSet{K,Ord}(o)
- Construct an empty sorted set in which type parameter
is explicitly listed; ordering object is explicitly specified.
(See below for discussion of ordering.) An alternate way
to create an empty set of type
K
isSortedSet(K[], o)
; again, the order argument defaults toForward
if not specified.
In the list of functions below, the running time of the various operations is provided. In these running times, n denotes the current size (number of items) in the container at the time of the function call, and c denotes the time needed to compare two keys.
sd[k]
- Argument
sd
is a SortedDict andk
is a key. In an expression, this retrieves the value associated with the key (orKeyError
if none). On the left-hand side of an assignment, this assigns or reassigns the value associated with the key. (For assigning and reassigning, see alsoinsert!
below.) Time: O(c log n) find(sd,k)
- Argument
sd
is a SortedDict and argumentk
is a key. This function returns the semitoken that refers to the item whose key isk
, or past-end semitoken ifk
is absent. Time: O(c log n) deref((sc,st))
- Argument
(sc,st)
is a token (i.e.,sc
is a container andst
is a semitoken). Note the double-parentheses in the calling syntax: the argument ofderef
is a token, which is defined to be a 2-tuple. This returns a key=>value pair. pointed to by the token for SortedDict and SortedMultiDict. Note that the syntaxk,v=deref((sc,st))
is valid because Julia automatically iterates over the two entries of the Pair in order to assignk
andv
. For SortedSet this returns a key. Time: O(1) deref_key((sc,st))
- Argument
(sc,st)
is a token for SortedMultiDict or SortedDict. This returns the key (i.e., the first half of a key=>value pair) pointed to by the token. This functionality is available as plainderef
for SortedSet. Time: O(1) deref_value((sc,st))
- Argument
(sc,st)
is a token for SortedMultiDict or SortedDict. This returns the value (i.e., the second half of a key=>value pair) pointed to by the token. Time: O(1) startof(sc)
- Argument
sc
is SortedDict, SortedMultiDict or SortedSet. This function returns the semitoken of the first item according to the sorted order in the container. If the container is empty, it returns the past-end semitoken. Time: O(log n) endof(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet. This function returns the semitoken of the last item according to the sorted order in the container. If the container is empty, it returns the before-start semitoken. Time: O(log n) first(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet This function returns the first item (ak=>v
pair for SortedDict and SortedMultiDict or a key for SortedSet) according to the sorted order in the container. Thus,first(sc)
is equivalent toderef((sc,startof(sc)))
. It is an error to call this function on an empty container. Time: O(log n) last(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet. This function returns the last item (ak=>v
pair for SortedDict and SortedMultiDict or a key for SortedSet) according to the sorted order in the container. Thus,last(sc)
is equivalent toderef((sc,endof(sc)))
. It is an error to call this function on an empty container. Time: O(log n) pastendsemitoken(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet. This function returns the past-end semitoken. Time: O(1) beforestartsemitoken(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet. This function returns the before-start semitoken. Time: O(1) advance((sc,st))
- Argument
(sc,st)
is a token. This function returns the semitoken of the next entry in the container according to the sort order of the keys. After the last item, this routine returns the past-end semitoken. It is an error to invoke this function if(sc,st)
is the past-end token. If(sc,st)
is the before-start token, then this routine returns the semitoken of the first item in the sort order (i.e., the same semitoken returned by thestartof
function). Time: O(log n) regress((sc,st))
- Argument
(sc,st)
is a token. This function returns the semitoken of the previous entry in the container according to the sort order of the keys. If(sc,st)
indexes the first item, this routine returns the before-start semitoken. It is an error to invoke this function if(sc,st)
is the before-start token. If(sc,st)
is the past-end token, then this routine returns the smitoken of the last item in the sort order (i.e., the same semitoken returned by theendof
function). Time: O(log n) searchsortedfirst(sc,k)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet andk
is a key. This routine returns the semitoken of the first item in the container whose key is greater than or equal tok
. If there is no such key, then the past-end semitoken is returned. Time: O(c log n) searchsortedlast(sc,k)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet andk
is a key. This routine returns the semitoken of the last item in the container whose key is less than or equal tok
. If there is no such key, then the before-start semitoken is returned. Time: O(c log n) searchsortedafter(sc,k)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet andk
is an element of the key type. This routine returns the semitoken of the first item in the container whose key is greater thank
. If there is no such key, then the past-end semitoken is returned. Time: O(c log n) searchequalrange(sc,k)
- Argument
sc
is a SortedMultiDict andk
is an element of the key type. This routine returns a pair of semitokens; the first of the pair is the semitoken addressing the first item in the container with keyk
and the second is the semitoken addressing the last item in the container with keyk
. If no item matches the given key, then the pair (past-end-semitoken, before-start-semitoken) is returned. Time: O(c log n)
empty!(sc)
- Argument
sc
is a SortedDict, SortedMultiDict or SortedSet. This empties the container. Time: O(1). insert!(sc,k,v)
- Argument
sc
is a SortedDict or SortedMultiDict,k
is a key andv
is the corresponding value. This inserts the(k,v)
pair into the container. If the key is already present in a SortedDict or SortedSet, this overwrites the old value. In the case of SortedMultiDict, no overwriting takes place (since SortedMultiDict allows the same key to associate with multiple values). In the case of SortedDict, the return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. In the case of SortedMultiDict, a semitoken is returned (but no boolean). Time: O(c log n) insert!(sc,k)
- Argument
sc
is a SortedSet andk
is a key. This inserts the key into the container. If the key is already present in a this overwrites the old value. (This is not necessarily a no-op; see below for remarks about the customizing the sort order.) The return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. Time: O(c log n) push!(sc,k)
- Argument
sc
is a SortedSet andk
is a key. This inserts the key into the container. If the key is already present in a this overwrites the old value. (This is not necessarily a no-op; see below for remarks about the customizing the sort order.) The return value issc
. Time: O(c log n) push!(sc, k=>v)
- Argument
sc
is a SortedDict or SortedMultiDict andk=>v
is a key-value pair. This inserts the key-value pair into the container. If the key is already present in a this overwrites the old value. The return value issc
. Time: O(c log n) delete!((sc,st))
- Argument
(sc,st)
is a token for a SortedDict, SortedMultiDict or SortedSet. This operation deletes the item addressed by(sc,st)
. It is an error to call this on an entry that has already been deleted or on the before-start or past-end tokens. After this operation is complete,(sc,st)
is an invalid token and cannot be used in any further operations. Time: O(log n) delete!(sc,k)
- Argument
sc
is a SortedDict or SortedSet andk
is a key. This operation deletes the item whose key isk
. It is aKeyError
ifk
is not a key of an item in the container. After this operation is complete, any token addressing the deleted item is invalid. Returnssc
. Time: O(c log n) pop!(sc,k)
- Deletes the item with key
k
in SortedDict or SortedSetsc
and returns the value that was associated withk
in the case of SortedDict ork
itself in the case of SortedSet. AKeyError
results ifk
is not insc
. Time: O(c log n) pop!(ss)
- Deletes the item with first key in SortedSet
ss
and returns the key. ABoundsError
results ifss
is empty. Time: O(c log n) sc[st]
- If
st
is a semitoken andsc
is a SortedDict or SortedMultiDict, thensc[st]
refers to the value field of the (key,value) pair that the full token(sc,st)
refers to. This expression may occur on either side of an assignment statement. Time: O(1)
compare(sc,st1,st2)
- Here,
st1
andst2
are semitokens for the same containersc
; this function determines the relative positions of the data items indexed by(sc,st1)
and(sc,st2)
in the sorted order. The return value is -1 if(sc,st1)
precedes(sc,st2)
, 0 if they are equal, and 1 if(sc,st1)
succeeds(sc,st2)
. This function compares the tokens by determining their relative position within the tree without dereferencing them. For SortedDict it is mostly equivalent to comparingderef_key((sc,st1))
toderef_key((sc,st2))
using the ordering of the SortedDict except in the case that either(sc,st1)
or(sc,st2)
is the before-start or past-end token, in which case thederef
operation will fail. Which one is more efficient depends on the time-complexity of comparing two keys. Similarly, for SortedSet it is mostly equivalent to comparingderef((sc,st1))
toderef((sc,st2))
. For SortedMultiDict, this function is not equivalent to a key comparison since two items in a SortedMultiDict with the same key are not necessarily the same item. Time: O(log n) status((sc,st))
- This function returns 0 if the token
(sc,st)
is invalid (e.g., refers to a deleted item), 1 if the token is valid and points to data, 2 if the token is the before-start token and 3 if it is the past-end token. Time: O(1)
As is standard in Julia, iteration over the containers is
implemented via calls to three functions, start
,
next
and done
. It is usual practice, however, to
call these functions implicitly with a for-loop rather than
explicitly, so they are presented here in for-loop notation.
Internally, all of these iterations are implemented with semitokens
that are advanced via the advance
operation.
Each iteration
of these loops requires O(log n) operations to advance the
semitoken. If one loops over an entire container, then the amortized
cost of advancing the semitoken drops to O(1).
The following snippet loops over the entire container sc
, where
sc
is a SortedDict or SortedMultiDict:
for (k,v) in sc < body > end
In this loop, (k,v)
takes on successive (key,value) pairs
according to
the sort order of the key. If one uses:
for p in sc < body > end
where sc
is a SortedDict or SortedMultiDict, then p
is
a k=>v
pair.
For SortedSet one uses:
for k in ss < body > end
There are two ways to iterate over a subrange of a container. The first is the inclusive iteration for SortedDict and SortedMultiDict:
for (k,v) in inclusive(sc,st1,st2) < body > end
Here, st1
and st2
are semitokens that refer to the container sc
.
It is acceptable for (sc,st1)
to be the past-end token
or (sc,st2)
to be the before-start token (in these cases, the body
is not executed).
If compare(sc,st1,st2)==1
then the body is not executed.
A second calling format for inclusive
is
inclusive(sc,(st1,st2))
. One purpose for second format is so that
the return value of searchequalrange
may be used directly
as the second argument to inclusive
.
One can also define a loop that excludes the final item:
for (k,v) in exclusive(sc,st1,st2) < body > end
In this case, all the data addressed by tokens from (sc,st1)
up to but excluding
(sc,st2)
are executed. The body is not executed at all if compare(sc,st1,st2)>=0
.
In this setting, either or both can be the past-end token, and (sc,st2)
can
be the before-start token. For the sake
of consistency, exclusive
also supports the calling format
exclusive(sc,(st1,st2))
. In the previous few snippets, if the loop
object is p
instead of (k,v)
, then p
is a k=>v
pair.
Both the inclusive
and exclusive
functions return objects that can be
saved and used later for iteration.
The validity of the tokens is not checked until the loop initiates.
For SortedSet the usage is:
for k in inclusive(ss,st1,st2) < body > end for k in exclusive(ss,st1,st2) < body > end
If sc
is a SortedDict or SortedMultiDict,
one can iterate over just keys or just values:
for k in keys(sc) < body > end for v in values(sc) < body > end
Finally, one can retrieve semitokens during any of these iterations. In the case of SortedDict and SortedMultiDict, one uses:
for (st,k,v) in semitokens(sc) < body > end for (st,k) in semitokens(keys(sc)) < body > end for (st,v) in semitokens(values(sc)) < body > end
In each of the above three iterations, st
is a
semitoken referring to the
current (k,v)
pair.
In the case of SortedSet, the following iteration may be used:
for (st,k) in semitokens(ss) < body > end
If one wishes to retrieve only semitokens, the following may be used:
for st in onlysemitokens(sc) < body > end
In this case, sc
is a SortedDict, SortedMultiDict, or SortedSet.
To be compatible with standard containers, the package also offers
eachindex
iteration:
for ind in eachindex(sc) < body > end
This iteration function eachindex
is equivalent to keys
in the case
of SortedDict. It is equivalent to onlysemitokens
in the case of SortedMultiDict
and SortedMultiSet.
In place of sc
in the above keys
, values
and
semitokens
, snippets,
one could also use inclusive(sc,st1,st2)
or exclusive(sc,st1,st2)
.
Similarly, for SortedSet, one can iterate over
semitokens(inclusive(ss,st1,st2))
or semitokens(exclusive(ss,st1,st2))
Note that it is acceptable for the loop body in the above
semitokens
code snippets to invoke
delete!((sc,st))
or delete!((ss,st))
.
This is because the for-loop internal state variable
is already advanced to the next token at the beginning of the body, so
st
is not necessarily referred to in the loop body (unless the
user refers to it).
isempty(sc)
- Returns
true
if the container is empty (no items). Time: O(1) length(sc)
- Returns the length, i.e., number of items, in the container. Time: O(1)
in(p,sc)
- Returns true if
p
is insc
. In the case thatsc
is a SortedDict or SortedMultiDict,p
is a key=>value pair. In the case thatsc
is a SortedSet,p
should be a key. Time: O(c log n) for SortedDict and SortedSet. In the case of SortedMultiDict, the time is O(cl log n), where l stands for the number of entries that have the key of the given pair. (So therefore this call is inefficient if the same key addresses a large number of values, and an alternative should be considered.) in(x,iter)
Returns true if
x
is initer
, whereiter
refers to any of the iterable objects described above in the discussion of container loops andx
is of the appropriate type. For all of the iterables except the five listed below, the algorithm used is a linear-time search. For example, the call:(k=>v) in exclusive(sd,st1,st2)
where
sd
is a SortedDict,st1
andst2
are semitokens,k
is a key, andv
is a value, will loop over all entries in the dictionary between the two tokens and a compare for equality usingisequal
between the indexed item andk=>v
.The five exceptions are:
(k=>v) in sd (k=>v) in smd k in ss k in keys(sd) k in keys(smd)
Here,
sd
is a SortedDict,smd
is a SortedMultiDict, andss
is a SortedSet.These five invocations of
in
use the index structure of the sorted container and test equality based on the order object of the keys rather thanisequal
. Therefore, these five are all faster than linear-time looping. The first three were already discussed in the previous entry. The last two are equivalent tohaskey(sd,k)
andhaskey(smd,k)
respectively. To force the use ofisequal
test on the keys rather than the order object (thus slowing the execution from logarithmic to linear time), replace the above five constructs with these:(k=>v) in collect(sd) (k=>v) in collect(smd) k in collect(ss) k in collect(keys(sd)) k in collect(keys(smd))
eltype(sc)
- Returns the (key,value) type (a 2-entry pair, i.e.,
Pair{K,V}
) for SortedDict and SortedMultiDict. Returns the key type for SortedSet. This function may also be applied to the type itself. Time: O(1) keytype(sc)
- Returns the key type for SortedDict, SortedMultiDict and SortedSet. This function may also be applied to the type itself. Time: O(1)
valtype(sc)
- Returns the value type for SortedDict and SortedMultiDict. This function may also be applied to the type itself. Time: O(1)
similar(sc)
- Returns a new SortedDict, SortedMultiDict, or SortedSet
of the same type and with the same ordering
as
sc
but with no entries (i.e., empty). Time: O(1) orderobject(sc)
- Returns the order object used to construct the container. Time: O(1)
haskey(sc,k)
- Returns true if key
k
is present for SortedDict, SortedMultiDict or SortedSetsc
. For SortedSet,haskey(sc,k)
is a synonym forin(k,sc)
. For SortedDict and SortedMultiDict,haskey(sc,k)
is equivalent toin(k,keys(sc))
. Time: O(c log n) get(sd,k,v)
- Returns the value associated with key
k
wheresd
is a SortedDict, or else returnsv
ifk
is not insd
. Time: O(c log n) get!(sd,k,v)
- Returns the value associated with key
k
wheresd
is a SortedDict, or else returnsv
ifk
is not insd
, and in the latter case, inserts(k,v)
intosd
. Time: O(c log n) getkey(sd,k,defaultk)
- Returns key
k
wheresd
is a SortedDict, ifk
is insd
else it returnsdefaultk
. If the container uses in its ordering aneq
method different from isequal (e.g., case-insensitive ASCII strings illustrated below), then the return value is the actual key stored in the SortedDict that is equivalent tok
according to theeq
method, which might not be equal tok
. Similarly, if the user performs an implicit conversion as part of the call (e.g., the container has keys that are floats, but thek
argument togetkey
is an Int), then the returned key is the actual stored key rather thank
. Time: O(c log n) isequal(sc1,sc2)
- Checks if two containers are equal in the sense
that they contain the same items; the keys are compared
using the
eq
method, while the values are compared with theisequal
function. In the case of SortedMultiDict, equality requires that the values associated with a particular key have same order (that is, the same insertion order). Note thatisequal
in this sense does not imply any correspondence between semitokens for items insc1
with those forsc2
. If the equality-testing method associated with the keys and values implies hash-equivalence in the case of SortedDict, thenisequal
of the entire containers implies hash-equivalence of the containers. Time: O(cn + n log n) packcopy(sc)
- This returns a copy of
sc
in which the data is packed. When deletions take place, the previously allocated memory is not returned. This function can be used to reclaim memory after many deletions. Time: O(cn log n) deepcopy(sc)
- This returns a copy of
sc
in which the data is deep-copied, i.e., the keys and values are replicated if they are mutable types. A semitoken for the originalsc
is a valid semitoken for the copy because this operation preserves the relative positions of the data in memory. Time O(maxn), where maxn denotes the maximum size thatsc
has attained in the past. packdeepcopy(sc)
- This returns a packed copy of
sc
in which the keys and values are deep-copied. This function can be used to reclaim memory after many deletions. Time: O(cn log n) merge(sc1, sc2...)
- This returns a SortedDict or SortedMultiDict that results from merging
SortedDicts or SortedMultiDicts
sc1
,sc2
, etc., which all must have the same key-value-ordering types. In the case of keys duplicated among the arguments, the rightmost argument that owns the key gets its value stored for SortedDict. In the case of SortedMultiDict all the key-value pairs are stored, and for keys shared betweensc1
andsc2
the ordering is left-to-right. This function is not available for SortedSet, but theunion
function (see below) provides equivalent functionality. Time: O(cN log N), where N is the total size of all the arguments. merge!(sc, sc1...)
- This updates
sc
by merging SortedDicts or SortedMultiDictssc1
, etc. intosc
. These must all must have the same key-value types. In the case of keys duplicated among the arguments, the rightmost argument that owns the key gets its value stored for SortedDict. In the case of SortedMultiDict all the key-value pairs are stored, and for overlapping keys the ordering is left-to-right. This function is not available for SortedSet, but theunion!
function (see below) provides equivalent functionality. Time: O(cN log N), where N is the total size of all the arguments.
The SortedSet container supports the following set operations. Note that
in the case of intersect, symdiff and setdiff, the two SortedSets should
have the same key and ordering object. If they have different key or ordering
types, no error
message is produced; instead, the built-in default versions of these functions
(that can be applied to Any
iterables and that return arrays) are invoked.
union!(ss, iterable)
- This function inserts each item from the second argument
(which must iterable) into the SortedSet
ss
. The items must be convertible to the key-type ofss
. Time: O(ci log n) where i is the number of items in the iterable argument. union(ss, iterable...)
- This function creates a new SortedSet (the return argument) and
inserts each item from
ss
and each item from each iterable argument into the returned SortedSet. Time: O(cn log n) where n is the total number of items in all the arguments. intersect(ss, others...)
- Each argument is a SortedSet with the same key and order type. The return variable is a new SortedSet that is the intersection of all the sets that are input. Time: O(cn log n), where n is the total number of items in all the arguments.
symdiff(ss1, ss2)
- The two argument are sorted sets with the same key and order type. This operation
computes the symmetric difference, i.e., a sorted set containing
entries that are in one of
ss1
,ss2
but not both. Time: O(cn log n), where n is the total size of the two containers. setdiff(ss1, ss2)
- The two arguments are sorted sets with the same key and order type. This operation
computes the difference, i.e., a sorted set containing entries that in
are in
ss1
but notss2
. Time: O(cn log n), where n is the total size of the two containers. setdiff!(ss, iterable)
- This function deletes items in
ss
that appear in the second argument. The second argument must be iterable and its entries must be convertible to the key type of m1. Time: O(cm log n), where n is the size ofss
and m is the number of items initerable
. issubset(iterable, ss)
- This function checks whether each item of the first argument
is an element of the SortedSet
ss
. The entries must be convertible to the key-type ofss
. Time: O(cm log n), where n is the sizes ofss
and m is the number of items initerable
.
As mentioned earlier, the default ordering of keys uses
isless
and isequal
functions. If the default ordering is used,
it is a requirement of the container that isequal(a,b)
is true if and
only if !isless(a,b)
and !isless(b,a)
are both true. This relationship
between isequal
and isless
holds for common built-in types, but
it may not hold for all types, especially user-defined types.
If it does not hold for a certain type, then a custom ordering
argument must be defined as discussed in the next few paragraphs.
The name for the default ordering (i.e., using isless
and
isequal
) is Forward
. Note: this is the name of the
ordering object; its type is ForwardOrdering.
Another possible
ordering object is Reverse
, which reverses the usual sorted order.
This name must be
imported import Base.Reverse
if it is used.
As an example of a custom ordering, suppose the keys
are of type ASCIIString
, and the user wishes to order the keys ignoring
case: APPLE, berry and Cherry would appear in that
order, and APPLE and aPPlE would be indistinguishable in this
ordering.
The simplest approach is to define an ordering object
of the form Lt(my_isless)
, where Lt
is a built-in type
(see ordering.jl
) and my_isless
is the user's comparison function.
In the above example, the ordering object would be:
Lt((x,y) -> isless(lowercase(x),lowercase(y)))
The ordering object is indicated in the above list of constructors
in the o
position
(see above for constructor syntax).
This approach suffers from a performance hit (10%-50% depending on the
container) because the compiler cannot inline or compute the
correct dispatch for the function in parentheses, so the dispatch
takes place at run-time.
A more complicated but higher-performance method to implement
a custom ordering is as follows.
First, the user creates a singleton type that is a subtype of
Ordering
as follows:
immutable CaseInsensitive <: Ordering end
Next, the user defines a method named lt
for less-than
in this ordering:
lt(::CaseInsensitive, a, b) = isless(lowercase(a), lowercase(b))
The first argument to lt
is an object of the CaseInsensitive
type (there is only one such object since it is a singleton type).
The container also needs an equal-to function; the default is:
eq(o::Ordering, a, b) = !lt(o, a, b) && !lt(o, b, a)
For a further slight performance boost, the user can also customize this function with a more efficient implementation. In the above example, an appropriate customization would be:
eq(::CaseInsensitive, a, b) = isequal(lowercase(a), lowercase(b))
Finally, the user specifies the unique element of CaseInsensitive
, namely
the object CaseInsensitive()
, as the ordering object to the
SortedDict
, SortedMultiDict
or SortedSet
constructor.
For the above code to work, the module must make the following declarations, typically near the beginning:
import Base.Ordering import Base.lt import DataStructures.eq
As with ordinary Dicts, keys for the sorted containers
can be either mutable or immutable. In the
case of mutable keys, it is important that the keys not be mutated
once they are in the container else the indexing structure will be
corrupted. (The same restriction applies to Dict.)
For example, suppose a SortedDict sd
is defined in which the
keys are of type Array{Int,1}.
(For this to be possible, the user
must provide an isless
function or order object for Array{Int,1}
since
none is built into Julia.) Suppose the values of sd
are of type Int
.
Then the following sequence of statements leaves sd
in
a corrupted state:
k = [1,2,3] sd[k] = 19 k[1] = 7
The sorted containers are currently not optimized for cache performance. This will be addressed in the future.
There is a minor performance issue as follows:
the container may hold onto a small number of keys and values even after the
data records containing those keys and values have been deleted. This
may cause a memory drain in the case of large keys and values.
It may also lead to a
delay
in the invocation of finalizers.
All keys and values are released completely by the empty!
function.