numenta/htmresearch

json is not a drop-in replacement for simplejson

SaganBolliger opened this issue · 8 comments

In classification_network.py and several other places json is imported if simplejson fails to import. However, json returns a dict with keys/values that are unicode strings, whereas simplejson returns a dict with standard str key/value pairs. The nupic bindings fail when given unicode strings.

Either json should not be included as a fallback for simplejson, or additional configuration/conversion must be included so that the behavior matches.

Here is the stracktrace when I create a ClassificationModelHTM using json.

In [3]: cm = ClassificationModelHTM(json.load(open("projects/nlp/data/network_configs/tp_knn.json")),"projects/nlp/data/sample_reviews/sample_reviews.csv")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-9e5ff979105a> in <module>()
----> 1 cm = ClassificationModelHTM(json.load(open("projects/nlp/data/network_configs/tp_knn.json")),"projects/nlp/data/sample_reviews/sample_reviews.csv")

/Users/sbolliger/nupic.research/htmresearch/frameworks/nlp/classify_htm.pyc in __init__(self, networkConfig, inputFilePath, retinaScaling, retina, apiKey, verbosity, numLabels, modelDir, prepData, stripCats)
     75       self.networkDataPath = inputFilePath
     76 
---> 77     self.network = self.initModel()
     78     self.learningRegions = self._getLearningRegions()
     79 

/Users/sbolliger/nupic.research/htmresearch/frameworks/nlp/classify_htm.pyc in initModel(self)
    114 
    115     # This encoder specifies the LanguageSensor output width.
--> 116     return configureNetwork(recordStream, self.networkConfig, encoder)
    117 
    118 

/Users/sbolliger/nupic.research/htmresearch/frameworks/classification/classification_network.py in configureNetwork(dataSource, networkParams, encoder)
    245     _setScalarEncoderMinMax(networkParams, dataSource)
    246 
--> 247   network = createNetwork(dataSource, networkParams, encoder)
    248 
    249   # Need to init the network before it can run.

/Users/sbolliger/nupic.research/htmresearch/frameworks/classification/classification_network.py in createNetwork(dataSource, networkConfig, encoder)
    272                                      sensorRegionConfig,
    273                                      dataSource,
--> 274                                      encoder)
    275 
    276   # Keep track of the previous region name and width to validate and link the

/Users/sbolliger/nupic.research/htmresearch/frameworks/classification/classification_network.py in _createSensorRegion(network, regionConfig, dataSource, encoder, moduleName)
    115     encoders = encoder
    116 
--> 117   _addRegisteredRegion(network, regionConfig, moduleName)
    118 
    119   # getSelf() returns the actual region, instead of a region wrapper

/Users/sbolliger/nupic.research/htmresearch/frameworks/classification/classification_network.py in _addRegisteredRegion(network, regionConfig, moduleName)
    149   # regionName = str(regionName)
    150 
--> 151   return network.addRegion(regionName, regionType, json.dumps(regionParams))
    152 
    153 

/Users/sbolliger/nupic/src/nupic/engine/__init__.pyc in addRegion(self, name, nodeType, nodeParams)
    650     @doc:place_holder(Network.addRegion)
    651     """
--> 652     engine.Network.addRegion(self, name, nodeType, nodeParams)
    653     return self._getRegions()[name]
    654 

/Users/sbolliger/local/lib/python2.7/site-packages/nupic.bindings-0.2.2-py2.7.egg/nupic/bindings/engine_internal.pyc in addRegion(self, *args, **kwargs)
   1084     def addRegion(self, *args, **kwargs):
   1085         """addRegion(self, name, nodeType, nodeParams) -> Region"""
-> 1086         return _engine_internal.Network_addRegion(self, *args, **kwargs)
   1087 
   1088     def addRegionFromBundle(self, *args, **kwargs):

TypeError: in method 'Network_addRegion', argument 2 of type 'std::string const &'

Can we just stick with one or the other? (I don't really care which one.) I don't want the complexity of having both.

I'm inclined to choose simplejson. Would this mean adding it to the nupic.research requirements.txt?

I'm inclined to choose simplejson. Would this mean adding it to the nupic.research requirements.txt?

Yes, that's fine. Not a big deal - you can just install it manually too (which is what I usually prefer).

Okay @SaganBolliger will you please change the imports to simplejson, add simplejson to the requirements, and test?

Yes, sure.

It really seems like a bug in bindings. It should be able to work with either type of strings. simplejson vs. json is too much of an implementation-dependent subtlety.

What do you think, @scottpurdy ?

@vitaly-krugl @scottpurdy Relevant information from the SWIG documentation:

At this time, SWIG provides limited support for Unicode and wide-character strings (the C wchar_t type). Some languages provide typemaps for wchar_t, but bear in mind these might not be portable across different operating systems. This is a delicate topic that is poorly understood by many programmers and not implemented in a consistent manner across languages. For those scripting languages that provide Unicode support, Unicode strings are often available in an 8-bit representation such as UTF-8 that can be mapped to the char * type (in which case the SWIG interface will probably work). If the program you are wrapping uses Unicode, there is no guarantee that Unicode characters in the target language will use the same internal representation (e.g., UCS-2 vs. UCS-4). You may need to write some special conversion functions.

If I recall correctly, python is a utf-8 shop, so

such as UTF-8 that can be mapped to the char * type (in which case the SWIG interface will probably work)"

should apply