martinobdl/ITCH

What is the significance of this ITCH data?


Hello, I recently started reading a book on trading using machine learning approaches. I ran into a problem (the code was too slow) when trying to use a Python notebook from the book's code repo to parse a 12GB ITCH data file found at ftp://emi.nasdaq.com/ITCH/ (it takes up to 4 hours on a 4-CPU GCP instance to parse the file).
So I searched for a C++ solution and came across your repo. I have a few questions ...

  • A) What is the significance of this data? I read the technical specification, but I failed to understand what this data is about and how important it is.

  • B) What is the timeline? In other words, is this some sort of historical data that covers a time period from, say, 2000-2020, or is it something else?

  • C) How fast is your code at the same task (parsing a 12GB .bin file on an i5 MacBook Pro, for example)?

  • D) The Python notebook I mentioned earlier parses the file in the following loop-forever manner (simplified description):

    • read 2 bytes (the message length)
    • read 1 byte (the message type)
    • read message length - 1 more bytes (the message body)
    • process further & store the data

I haven't read your code yet, but do you do something similar (sequential processing), or is there some other workaround, e.g. reading with a large buffer? (I'm asking because I thought of re-implementing the solution for fun.) A rough sketch of the loop I mean is below.
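For reference, here is a minimal Python sketch of that framing loop with buffered reads (simplified; `process` is just a placeholder, not the notebook's actual function):

```python
import struct

def parse_itch(path, buffer_size=1 << 20):
    """Sequentially frame ITCH messages: a 2-byte big-endian length,
    a 1-byte message type, then (length - 1) bytes of payload."""
    with open(path, 'rb', buffering=buffer_size) as f:  # buffered reads
        while True:
            header = f.read(2)            # 2 bytes: message length
            if len(header) < 2:           # clean end of file
                break
            (length,) = struct.unpack('>H', header)
            msg_type = f.read(1)          # 1 byte: message type
            payload = f.read(length - 1)  # the rest of the message
            process(msg_type, payload)    # placeholder: decode & store

def process(msg_type, payload):
    pass  # placeholder for the actual decoding/storage step
```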

Thank you for taking the time to read this.

A) To understand what the data is about, you should be familiar with the concepts of the Limit Order Book (LOB) and the matching engine. The ITCH data records the actions that market participants perform on the LOB, and the aim of this repo is to translate that raw data into research-friendly data.
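As a toy illustration (not the actual implementation in this repo), you can picture a LOB as two price-to-shares maps that each ITCH message mutates:

```python
# Toy limit order book: each side maps price -> total resting shares.
book = {'bid': {}, 'ask': {}}

def add(side, price, shares):
    book[side][price] = book[side].get(price, 0) + shares

def remove_shares(side, price, shares):
    book[side][price] -= shares
    if book[side][price] <= 0:
        del book[side][price]

add('bid', 100.10, 300)             # an "add order" action
add('ask', 100.20, 500)
remove_shares('bid', 100.10, 100)   # a cancel or an execution
print(max(book['bid']), min(book['ask']))  # best bid / best ask: 100.1 100.2
```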

B) The raw data isn't part of the repo, and you'll have to source it yourself (the public NASDAQ FTP provides some samples).

C) The code on my i5 MBP parses between 1 and 3 million messages per second per stock. That should be fast enough.

D) Yes: sequential parsing, while keeping track of the state of the order book and modifying it every time a relevant message is read from the binary data.
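Schematically (a simplified sketch, not the exact code in this repo): each add-order message carries an order reference number, and later execute/cancel/delete messages refer back to it, so the parser keeps a map of live orders next to the book:

```python
# Sketch of the state kept while parsing sequentially (simplified; real ITCH
# has more message types and fields than shown here).
orders = {}  # order reference number -> (side, price, shares)

def on_add(ref, side, price, shares):
    orders[ref] = (side, price, shares)
    # ...and add `shares` at `price` on `side` of the book

def on_execute(ref, executed):
    side, price, shares = orders[ref]
    if shares - executed > 0:
        orders[ref] = (side, price, shares - executed)
    else:
        del orders[ref]
    # ...and remove `executed` shares from the book at `price`

def on_delete(ref):
    side, price, shares = orders.pop(ref)
    # ...and remove the remaining `shares` from the book at `price`
```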

The README should be clear enough to clarify any remaining doubts. If not, please tell me and I'll rewrite it more clearly.

Thanks for the explanation. I guess I'll have to google this LOB and matching engine thing to understand what it's about. Regarding the raw data, I understand that I should source it externally; I'm just a bit skeptical about its importance, but that is probably because I don't understand the concepts you mentioned. I asked whether this ITCH data covers a certain time period because, when I examined what should be the output of the slow Python notebook as well as of your code, it doesn't look like it contains (high, low, volume, ...) or any indication of time; it contains information that is ambiguous to me, which is why I asked for clarification about the timeline. Regarding the README, sure, I'll let you know if I have doubts about anything as I go through your code.

The data is quite significant. Everywhere I look, providers want money for this data, and it seems these developers took the data out of thin air and rebuilt it. I will be going through this repo to find a new advantage! @emadboxtorx, have you had luck with the repo?

Also, the devs that compiled the data do include a timestamp right in the data field. The timestamp is in Unix time, so when you're parsing the data you will need to account for that and convert it to a readable time.
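For example, assuming the timestamp is in Unix epoch seconds as described above (the value below is just an illustration):

```python
from datetime import datetime, timezone

ts = 1585747800  # example Unix timestamp, in seconds
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # 2020-04-01 13:30:00+00:00
```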

@teamun1corn Yes, I know that all OHLCV data comes with timestamps rather than human-readable formats, but that's not an issue. I actually think it's an advantage, because timestamps are stored as ints, which use less storage space than datetime strings. As for me trying this code: no, I still haven't tried it, because I'm currently working on different aspects and datasets (mostly OHLCV), but I'm sure it will do the job just fine by the time I'm dealing with TotalView data.