PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE) SOAP API (a non-free service), with convenience functions for retrieving Datastream data in particular. The package requires valid credentials for this API.
- This package is mainly meant to access Datastream. However, the basic functionality (the request method) should work for other Dataworks Enterprise sources.
- The package uses the pandas library (GitHub repo), which I have found to be the best Python library for time series manipulation. Together with the IPython notebook, it is the best open-source tool for data analysis. For a quick start with pandas, have a look at the tutorial notebook and the 10-minute introduction.
- Alternatives for other scientific computing languages:
  - MATLAB: MATLAB datafeed toolbox
  - R: RDatastream (in fact, PyDatastream was inspired by RDatastream).
- I am always open to suggestions, critique and bug reports.
First, install the prerequisites: pandas and suds. Both packages can be installed with the pip installer:
pip install pandas
pip install suds
However, please refer to the pandas documentation for its own dependencies.
The latest version of PyDatastream is always available on GitHub, on the master branch. The latest release can also be installed from PyPI using pip:
pip install pydatastream
All methods for working with the DWE are organized in a class, so first you need to create an object with your valid credentials:
from pydatastream import Datastream
DWE = Datastream(username="DS:XXXX000", password="XXX000")
If authentication was successful, you can check the system information (including the DWE version):
DWE.system_info()
and the list of data sources available for your subscription:
DWE.sources()
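The credentials passed to the constructor above do not have to be hard-coded in a script. A minimal sketch, assuming they are kept in two environment variables: the names DWE_USERNAME and DWE_PASSWORD and the helper load_dwe_credentials are hypothetical and are not part of PyDatastream.

```python
import os


def load_dwe_credentials(environ=None):
    """Return keyword arguments for Datastream() read from environment variables.

    DWE_USERNAME and DWE_PASSWORD are hypothetical names chosen for this
    sketch; PyDatastream itself does not read them.
    """
    environ = os.environ if environ is None else environ
    names = ("DWE_USERNAME", "DWE_PASSWORD")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"username": environ["DWE_USERNAME"],
            "password": environ["DWE_PASSWORD"]}


# Intended use (needs pydatastream and a valid subscription):
#   from pydatastream import Datastream
#   DWE = Datastream(**load_dwe_credentials())
```

Passing a plain dictionary instead of os.environ makes the helper easy to try without touching the real environment.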
The basic functionality of the module allows you to fetch Open-High-Low-Close (OHLC) prices and volumes for a given ticker.
The following command requests the daily closing price for Apple (DWE mnemonic "@AAPL") on May 3, 2000:
data = DWE.get_price('@AAPL', date='2000-05-03')
print data
or request daily closing price data for Apple in 2008:
data = DWE.get_price('@AAPL', date_from='2008', date_to='2009')
print data.head()
The data is retrieved as a pandas.DataFrame object, which can be plotted:
data.plot()
aggregated into monthly data:
print data.resample('M', how='last')
or manipulated in a variety of ways. Because resampling data in pandas is so simple (for example, taking the business calendar into account), I recommend requesting daily data (unless the requests are huge or a daily scale is not applicable) and performing all transformations locally. Also note that, thanks to pandas, the format of the date string is extremely flexible.
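Local aggregation can be tried without a DWE connection. The sketch below uses a synthetic business-day series (the values are made up) in place of the frame returned by get_price, and groups by calendar month through the index rather than through resample, so that it does not depend on the how= keyword, which newer pandas releases no longer accept.

```python
import pandas as pd

# Synthetic stand-in for a frame returned by get_price:
# one made-up value per business day in the first quarter of 2008.
index = pd.bdate_range("2008-01-01", "2008-03-31")
data = pd.DataFrame({"P": [float(i) for i in range(len(index))]}, index=index)

# Group the daily rows by calendar month.
by_month = data.groupby(data.index.to_period("M"))

# Last observation of each month (the month-end close).
monthly_last = by_month.last()

# Average over each month.
monthly_mean = by_month.mean()

print(monthly_last)
print(monthly_mean)
```

The result is indexed by monthly periods rather than by month-end timestamps; which of the two is more convenient depends on what the data is joined with afterwards.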
There are two methods for fetching Open-High-Low-Close (OHLC) data: get_OHLC fetches only price data, and get_OHLCV fetches both price and volume data. The separation is needed because volume data is not available for financial indices.
Request daily OHLC and Volume data for Apple in 2008:
data = DWE.get_OHLCV('@AAPL', date_from='2008', date_to='2009')
Request daily OHLC data for the S&P 500 Index from May 6, 2010 until the present date:
data = DWE.get_OHLC('S&PCOMP', date_from='May 6, 2010')
If the Thomson Reuters mnemonics for specific fields are known, then a more general function can be used. The following request
data = DWE.fetch('@AAPL', ['P','MV','VO',], date_from='2000-01-01')
fetches the closing price, market valuation and daily volume for Apple Inc.
fetch can also be used to request data for several tickers at once. In this case a pandas.Panel (instead of a pandas.DataFrame) will be returned.
res = DWE.fetch(['@AAPL','U:MMM'], fields=['P','MV','VO','PH'], date_from='2000-05-03')
print res['MV'].head()
For convenience, the major and minor axes of the panel are swapped, so the result mimics the pandas method for fetching data from Yahoo! Finance. The panel can be sliced to get the data for each ticker:
df = res.minor_xs('@AAPL')
print df.head()
As discussed below, it may be convenient to set the raise_on_error property to False when fetching data for several tickers. In this case, if one or several tickers are misspecified, no error will be raised, and the missing data will be replaced with NaNs:
DWE.raise_on_error = False
res = DWE.fetch(['@AAPL','U:MMM','xxxxxx','S&PCOMP'], fields=['P','MV','VO','PH'], date_from='2000-05-03')
print res['MV'].head()
print res['P'].head()
Please note that in the last example the closing price (P) for the S&PCOMP ticker was not retrieved. Under the Thomson Reuters Datastream mnemonics, field P is not available for indices, and field PI should be used instead.
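When errors are suppressed like this, it helps to check which tickers came back empty for a given field. A small sketch on a hand-made frame that imitates the shape of res['P'] above (the numbers are invented, and the helper name empty_columns is hypothetical):

```python
import pandas as pd

nan = float("nan")

# Hand-made imitation of res['P']: one column per ticker, invented values.
# 'xxxxxx' is a misspecified ticker; 'S&PCOMP' has no P field.
prices = pd.DataFrame({
    "@AAPL": [101.0, 102.5, 103.0],
    "U:MMM": [88.0, 87.5, 89.0],
    "xxxxxx": [nan, nan, nan],
    "S&PCOMP": [nan, nan, nan],
})


def empty_columns(frame):
    """Return the column labels whose values are all NaN."""
    return [label for label in frame.columns if frame[label].isna().all()]


print(empty_columns(prices))
```

A ticker reported here is either misspecified or does not offer the requested field, so the list is a starting point for a second, corrected request.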
PyDatastream also has an interface for retrieving the list of constituents of an index:
res = DWE.get_constituents('S&PCOMP')
print res.ix[0]
Optionally, the list for a specific date can be requested as well:
res = DWE.get_constituents('S&PCOMP', '1-sept-2013')
The list of index constituents considered above is an example of a static request, i.e. a request that retrieves a single snapshot of the data rather than a time series.
At the API level, static requests are marked with the "~REP" suffix. Within PyDatastream, static requests can be made with the same fetch() function as time series requests by passing the argument static=True.
For example, this request retrieves the names, ISINs and identifiers for the primary exchange (ISINID) for various tickers of BASF corporation:
res = DWE.fetch(['D:BAS','D:BASX','HN:BAS','I:BAF','BFA','@BFFAF','S:BAS'],
                ['ISIN', 'ISINID', 'NAME'], static=True)
Another use of static requests is a cross-section of time series. The following example retrieves the current price, market capitalization and daily volume for the same companies:
res = DWE.fetch(['D:BAS','D:BASX','HN:BAS','I:BAF','BFA','@BFFAF','S:BAS'],
                ['P', 'MV', 'VO'], static=True)
The module has a general-purpose function request that can be used to fetch data with custom requests. This function returns raw data in the format of the suds package. The data can be used directly or parsed later with the parse_record method:
raw = DWE.request('@AAPL~=P,MV,VO,PH~2013-01-01~D')
print raw['StatusType']
print raw['StatusCode']
data = DWE.extract_data(raw)
print data['CCY']
print data['MV']
Information about mnemonics and the syntax of the request string can be found on the Thomson Financial Network.
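The time series request strings shown in this README share one layout: ticker, a field list introduced by "=", a start date and a frequency, joined by "~". A minimal sketch of a builder for that layout only: build_request is a hypothetical helper, not part of PyDatastream, and the layout is inferred from the examples here rather than from the official syntax reference.

```python
def build_request(ticker, fields, date_from, freq="D"):
    """Assemble 'TICKER~=F1,F2~DATE~FREQ', the layout seen in this README.

    Hypothetical helper: it covers only this one layout, not the full
    request syntax documented by Thomson Reuters.
    """
    if not fields:
        raise ValueError("at least one field is required")
    return "~".join([ticker, "=" + ",".join(fields), date_from, freq])


request_string = build_request("@AAPL", ["P", "MV", "VO", "PH"], "2013-01-01")
print(request_string)

# Intended use (needs a live connection):
#   raw = DWE.request(request_string)
```

Building the string in one place keeps the separators consistent when many requests are generated in a loop.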
Some of the examples are taken from the Thomson Financial Network and from the description of the rDatastream package.
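# Performance information for a stock
res = DWE.fetch('@AAPL~PERF', date_from='2011-09-01')
print res.head()
# Cross-reference information (identifiers) for IBM
res = DWE.request('U:IBM~XREF')
print DWE.extract_data(res)
# IBM price converted to EUR
res = DWE.fetch('U:IBM(P)~~EUR', date_from='2013-09-01')
print res.head()
# 20-day moving average of the IBM price
res = DWE.fetch('MAV#(U:IBM,20D)', date_from='2013-09-01')
print res.head()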
The Thomson Dataworks Enterprise User Guide suggests optimizing requests: it is very often quicker to make one bigger request than several smaller ones, because of the relatively high transport overhead of web services.
PyDatastream allows several requests to be fetched in a single API call:
r1 = '@AAPL~OHLCV~2013-11-26~D'
r2 = 'U:MMM~=P,MV,PO~2013-11-26~D'
res = DWE.request_many([r1,r2])
print DWE.parse_record(res[0])
print DWE.parse_record(res[1])
Please note that, because requests can differ in their specifics, their results should be parsed separately, as in the example above.
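For a long list of request strings, one way to follow this advice is to split the list into batches and pass each batch to request_many. A sketch under stated assumptions: the helper chunked is hypothetical, and the batch size of 2 is chosen only for illustration, since this README does not state how many requests a single call may carry.

```python
def chunked(requests, size):
    """Split a list of request strings into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [requests[i:i + size] for i in range(0, len(requests), size)]


all_requests = [
    "@AAPL~OHLCV~2013-11-26~D",
    "U:MMM~=P,MV,PO~2013-11-26~D",
    "U:IBM~=P,MV,PO~2013-11-26~D",
]
batches = chunked(all_requests, 2)
print(batches)

# Intended use (needs a live connection):
#   for batch in batches:
#       for record in DWE.request_many(batch):
#           print(DWE.parse_record(record))
```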
If a request contains errors, then normally a DatastreamException is raised and the error message from the DWE is printed. To alter this behavior, use the raise_on_error property of the Datastream class. When set to False, it forces the parser to ignore error messages and return an empty pandas.DataFrame. For instance:
r1 = '@AAPL~OHLCV~2013-11-26~D'
r2 = '902172~OHLCV~wrong_request'
res = DWE.request_many([r1,r2])
DWE.raise_on_error = False
print DWE.parse_record(res[0])
print DWE.parse_record(res[1])
raise_on_error can be useful for requests that contain several tickers. In this case, data fields and/or tickers that cannot be fetched are replaced with NaNs in the resulting pandas.Panel.
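An alternative to silencing errors is to leave raise_on_error at its default and catch the exception record by record, which keeps the error messages. The sketch below shows the pattern with a stand-in parser so that it runs without a connection; parse_each and fake_parse are hypothetical names, and with PyDatastream the parse argument would be DWE.parse_record, assuming it raises DatastreamException for a failed record as described above.

```python
def parse_each(records, parse):
    """Apply `parse` to every record; return (results, errors) keyed by position."""
    results, errors = {}, {}
    for position, record in enumerate(records):
        try:
            results[position] = parse(record)
        except Exception as exc:  # with PyDatastream: DatastreamException
            errors[position] = str(exc)
    return results, errors


def fake_parse(record):
    """Stand-in for DWE.parse_record: rejects records marked as wrong."""
    if "wrong_request" in record:
        raise ValueError("cannot parse " + record)
    return record.split("~")[0]


results, errors = parse_each(
    ["@AAPL~OHLCV~2013-11-26~D", "902172~OHLCV~wrong_request"], fake_parse)
print(results)
print(errors)
```

Keeping errors keyed by position makes it easy to map each failure back to the request string that caused it.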
For debugging purposes, the Datastream class has a show_request property which, if set to True, makes the standard methods output the text string of the request:
DWE.show_request = True
data = DWE.fetch('@AAPL', ['P','MV','VO',], date_from='2000-01-01')
Finally, the status method extracts status information from a raw response record:
print DWE.status(res[1])
and the last_status property always contains the status of the last parsed record:
print DWE.last_status
It is recommended that you read the Thomson Dataworks Enterprise User Guide, especially section 4.1.2 on client design. It gives reasonable guidelines for not overloading the servers with overly intensive requests.
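One simple client-side precaution is to enforce a minimum pause between consecutive calls. The sketch below is not taken from the User Guide: the Throttle class is hypothetical, and the one-second interval is an arbitrary placeholder to be replaced with whatever the guide or your subscription terms suggest.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)  # placeholder interval

# Intended use (needs a live connection):
#   for ticker in tickers:
#       throttle.wait()
#       data = DWE.get_price(ticker, date_from='2008')
```

The clock and sleep functions are injectable so that the behavior can be checked without actually waiting.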
For building custom Datastream requests, useful guidelines are given on this somewhat old Thomson Financial Network webpage.
If you have access codes for the Datastream Extranet, you can use the Datastream Navigator to look up codes and data types. Also if you're a client of Thomson Reuters, you can get support at the official webpage.
Finally, all these links can be printed in your terminal or IPython notebook by calling
DWE.info()
PyDatastream is released under the MIT license.