Kismuz/btgym

Cassandra dataframe does not work

fu2re opened this issue · 4 comments

fu2re commented

I have made a custom dataset using Cassandra and overridden the read_csv method.

import pandas as pd

from cassandra.cqlengine import CQLEngineException, connection

import settings


class CassandraDataSet(BTgymDataset2):
    @staticmethod
    def pandas_factory(colnames, rows, index=None):
        # Row factory: turn a page of driver rows into a pandas DataFrame.
        return pd.DataFrame(rows, columns=colnames, index=index)

    @classmethod
    def cassandra_connect(cls):
        try:
            return connection.get_connection()
        except CQLEngineException:
            return connection.register_connection(
                settings.CASSANDRA_CLUSTER_NAME,
                session=cls.cluster.connect(), default=True
            )

    def read_csv(self, data_filename=None, force_reload=False):
        # Pulls the data from Cassandra instead of a CSV file.
        session = self.cassandra_connect().session
        # Make the driver build each result page as a pandas DataFrame:
        session.row_factory = self.pandas_factory
        session.default_fetch_size = None  # fetch the entire result set at once
        # Keyspace name is taken from self.schema:
        query = "SELECT {columns} FROM {schema}.chartdata WHERE name='{name}' ORDER BY ts ASC".format(
            name=self.pair,
            columns='"' + '", "'.join(self.names) + '"',
            schema=self.schema,
        )
        rslt = session.execute(query, timeout=None)
        # With the factory above, _current_rows is already a DataFrame:
        current_dataframe = rslt._current_rows.set_index('ts')
        self.data = current_dataframe
        data_range = pd.to_datetime(self.data.index)
        self.total_num_records = self.data.shape[0]
        self.data_range_delta = (data_range[-1] - data_range[0]).to_pytimedelta()
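A quick way to sanity-check the factory output in isolation is to feed it a few made-up rows and verify the invariants the rest of read_csv relies on (a minimal sketch with synthetic sample data; no Cassandra connection or btgym install needed):

```python
import pandas as pd

def pandas_factory(colnames, rows, index=None):
    # Same row factory as in the snippet above:
    return pd.DataFrame(rows, columns=colnames, index=index)

# Simulated driver output: column names plus row tuples,
# mimicking what session.execute() would hand to the factory:
colnames = ['ts', 'open', 'high', 'low', 'close', 'volume']
rows = [
    ('2018-01-01 00:00:00', 1.20, 1.21, 1.19, 1.20, 100.0),
    ('2018-01-01 00:01:00', 1.20, 1.22, 1.20, 1.21, 150.0),
    ('2018-01-01 00:02:00', 1.21, 1.23, 1.21, 1.22, 120.0),
]

df = pandas_factory(colnames, rows).set_index('ts')
data_range = pd.to_datetime(df.index)

# Invariants the rest of read_csv() relies on:
assert df.shape[0] == 3                      # total_num_records
assert data_range.is_monotonic_increasing    # ORDER BY ts ASC held
delta = (data_range[-1] - data_range[0]).to_pytimedelta()
print(delta)
```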

Cassandra model:

from cassandra.cqlengine.models import Model, columns
from cassandra.cqlengine.management import sync_table
from cassandra.cluster import Cluster
from cassandra.cqlengine import connection
import settings

class ChartData(Model):
    __options__ = {
        'compaction': {'class': 'DateTieredCompactionStrategy',
                       'base_time_seconds': 3600,
                       'tombstone_compaction_interval': 86400},
        'default_time_to_live': 0
    }
    name = columns.Text(primary_key=True, partition_key=True)
    # the list of all sources somewhere. Store current source at the settings file.
    ts = columns.DateTime(primary_key=True)
    open = columns.Double()
    close = columns.Double()
    high = columns.Double()
    low = columns.Double()
    volume = columns.Double()

Then I just open the aac example and replace your dataset with my own. It shows me the plot immediately, then gets stuck with a timeout error and no steps are executed. I have figured out that the problem is somewhere inside:
btgym/server.py:711
btgym/rendering/renderer.py:232
...
backtrader/cerebro.py:996

Your example code works fine and does not get stuck here: the plot is not displayed and learning works. What have I missed? Please help, I have spent a whole month but cannot resolve it by myself.

fu2re commented

Packages you may be interested in:
backtrader==1.9.69.122
cassandra-driver==3.16.0
matplotlib==2.0.2
numpy==1.15.4
pandas==0.23.4
scikit-learn==0.20.0
scipy==1.1.0
tensorboard==1.12.0
tensorflow==1.12.0

Kismuz commented

@fu2re,

Short:

  1. Have you run your environment manually before attempting the distributed training setup?
    See this comment for details: #80 (comment)

  2. Even before running the environment, have you run your dataset-trial-episode cycle manually in a loop several times, in a manner like https://github.com/Kismuz/btgym/blob/master/examples/data_domain_api_intro.ipynb?
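The manual loop in p.2 can be sketched generically as follows. This is a stand-in illustration, not btgym's real API: `StubDataset`, `read_data` and `sample` are placeholder names, and the real method names are in the linked notebook.

```python
import numpy as np
import pandas as pd

class StubDataset:
    """Placeholder standing in for a BTgymDataset-like class."""
    def __init__(self, num_records=1000):
        self.data = None
        self.num_records = num_records

    def read_data(self):
        # The real code would pull from CSV / Cassandra; here we synthesise:
        index = pd.date_range('2018-01-01', periods=self.num_records, freq='1min')
        self.data = pd.DataFrame(
            {'open': np.random.rand(self.num_records)}, index=index)

    def sample(self, episode_len=100):
        # Draw a random contiguous episode from the stored data:
        start = np.random.randint(0, self.num_records - episode_len)
        return self.data.iloc[start:start + episode_len]

dataset = StubDataset()
dataset.read_data()
for i in range(5):
    episode = dataset.sample()
    # Fail loudly here, before any server sub-process hides the traceback:
    assert episode.shape[0] == 100 and not episode.isnull().values.any()
print('dataset sampling loop OK')
```

Running the sampling loop like this several times surfaces data-shape errors directly in the traceback instead of as opaque timeouts inside the server process.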

Expanded:

When developing such a substantial upgrade in functionality, it is good practice to follow a modularised testing approach. For this particular case I would recommend:

  • First, test the dataset iterator as a stand-alone module to ensure it delivers data exactly as expected (see p.2 for a manual test, or something like https://github.com/Kismuz/btgym/blob/master/btgym/datafeed/test_data.py for a more automated setup using the unittest package).
  • After the dataset module tests have passed, create a single-environment test loop as referred to in p.1 and extensively test environment behaviour, possibly with verbosity set to 1 or 2. As btgym itself launches several sub-processes, you can have a really hard time identifying error sources when stdout and stderr are suppressed by the AAC training framework.
  • Only when step 2 has passed is it a good idea to run the distributed training setup.
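A unittest-based version of the first bullet could look like the sketch below. The structure is only loosely modelled on the referenced test_data.py; `StubDataset` is a placeholder to keep the example self-contained, and in practice you would substitute your CassandraDataSet and its real loader method.

```python
import unittest

import numpy as np
import pandas as pd

class StubDataset:
    """Placeholder; substitute your CassandraDataSet here."""
    def read_data(self):
        index = pd.date_range('2018-01-01', periods=500, freq='1min')
        self.data = pd.DataFrame({'close': np.random.rand(500)}, index=index)

class TestDataset(unittest.TestCase):
    def setUp(self):
        self.dataset = StubDataset()
        self.dataset.read_data()

    def test_data_not_empty(self):
        # An empty frame after loading causes IndexErrors downstream:
        self.assertGreater(self.dataset.data.shape[0], 0)

    def test_index_is_datetime_and_sorted(self):
        idx = pd.to_datetime(self.dataset.data.index)
        self.assertTrue(idx.is_monotonic_increasing)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDataset)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```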

@fu2re, do I understand correctly that your proposed functionality is:

  • fetch data once from an outer source and store it in a top-level dataset iterator
  • make multiple samples of trials and episodes from the stored data

Please correct me if I have missed something.

fu2re commented

My CassandraDataSet is completely similar to your BTgymDataset / BTgymDataset2. The only difference is that I get the data from Cassandra instead of a CSV file.
I cannot even pass your own tests in data_test:
IndexError: index -1 is out of bounds for axis 0 with size 0


I had a similar error with my CassandraDataSet tests.
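For reference, that exact IndexError is what an expression like `data_range[-1]` in the read_csv override above raises when the loaded DataFrame is empty, e.g. when a query matches no rows. A minimal reproduction in plain pandas, assuming nothing about btgym itself:

```python
import pandas as pd

# An empty result set reproduces the failure seen in the tests:
df = pd.DataFrame({'ts': [], 'open': []}).set_index('ts')
data_range = pd.to_datetime(df.index)

raised = False
try:
    last = data_range[-1]  # last timestamp of an empty index
except IndexError as err:
    raised = True
    print('empty frame ->', err)
```

So a likely first thing to check is whether the dataset actually holds any records at the point where the episode is sampled.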