dClass - Pattern Classification Engine dClass is an indexed pattern classification engine. Unlike a search index which indexes your data and runs queries over it, dClass indexes your queries (patterns) and runs data over it, classifying the data against the patterns. dClass is capable of performing near constant time pattern classification. dClass can quickly and accurately find the best matching pattern for a given input. For an input of size M classified against an index of size N, dClass has worst case O(M) performance, even for large values of N. To accomplish this, dClass uses a dtree, a multi edge aliased network of sub pattern nodes. This structure is heavily optimized for searching, retrieval, and high performance on modern day CPUs (classification runtime is in the range of 5 microseconds on modern day hardware). dClass introduces several classification pattern types: STRONG, CHAIN, WEAK, and NONE. These types can be coupled with regular expressions, absolute positioning, grouping, inheritance, duplication, ranking, and directional proximity. This allows for an expressive index which is capable of handling complex context aware patterns like language as well as simpler pattern classifications like device detection, all under a unified syntax and API. dClass is built in a modular fashion and allows for schema free data modeling. This means that multiple pattern indexes can be combined with their own custom classification language allowing for networked knowledge based classification while retaining near constant time performance. Please see the webClass project to see dClass in action: https://github.com/rezan/webClass PATTERNS dClass can load its patterns from a .dtree file or from a DDR xml directory. Patterns can also be added directly to the index via the dtree C API. The test client allows for the conversion of a DDR xml directory into a .dtree file and an API exists to dump the current index into a .dtree file. Please see the README in the dtree directory for detailed dtree pattern notes, examples, and tips: https://github.com/TheWeatherChannel/dClass/blob/master/dtrees/README AUTHORS Reza Naghibi (reza.naghibi@yahoo.com) Special thanks: OpenDDR team, Anthony Watson, Eric Honer, Joe Pearson, Luke Kolin, Ivan Kozhuharov, Chris Hill, Chris McClellen, and The Weather Channel. DCLASS VS GREP https://github.com/TheWeatherChannel/dClass/wiki/dClass-vs-grep DEVICE MAP (OPENDDR) This project will track DeviceMap updates with its own patches. All DeviceMap updates will be backwards compatible with dClass 2.0 code. HOWTO To compile the test client, run make in the src directory. To build with varnish or nginx, please reference the READMEs in the varnish and nginx servers subdirectories. To integrate with the dClass API: -include the dClass header file: #include "dclass_client.h" -define a dclass_index: dclass_index dci; -populate the index using a dtree file or DeviceMap resource file: dclass_load_file(&dci,"/path/to/file.dtree"); -OR- openddr_load_resources(&dci,"/path/to/devicemap/devicedata"); -classify a string against the index and get the resulting kv data: dclass_keyvalue *kv=dclass_classify(&dci,"this is a string"); char *id=kv->id; char *field_xyz=dclass_get_kvalue(kv,"xyz"); -freeing the index: dclass_free(&dci); ROADMAP Enhancements for 2.4 -Configurable v16 bit, v32 bit, and native addressing per index -Move certain index settings from global to configurable per index -Better support for Unicode [1] Enhancements for 3.0 -Realtime additions, modifications, and deletion -Expanded regex support JAVA dClass supports Java via a native JNI extension. A custom JNI loader is used. It first attempts to load a system shared object (dclassjava). If that fails, it then attempts to load a locally packaged shared object. A pre compiled jar is included at java/dist/dclass.jar which comes packaged with 32bit and 64bit shared objects for Windows, Linux, and OS X. NOTES All US-ASCII alphanumeric characters are pattern searchable. Non alphanumeric pattern searchable characters are defined in DTREE_HASH_SCHARS. These chars are word separators. Indexed US-ASCII print characters (0x20 thru 0x7E) which aren't pattern searchable are replaced with DTREE_PATTERN_ANY and can match on any character. All pattern matching is US-ASCII case insensitive. Extended non-separator character set recognition is supported via DTREE_HASH_TCHARS [1]. Write operations on the index are not thread safe. Read operations are thread safe (with at most one writer). Read operations have the dclass index parameter designated with a 'const'. Memory limits are tightly bounded. Default 16bit configuration allows for 65k search nodes and 64MB of general memory. Adjusting DTREE_DT_PACKED* will allow for more search nodes and increasing DTREE_M_MAX_SLABS will allow for more general use memory. Since the dtree data structure is memory pointer heavy, pointers have the option to be compressed down into 16bit or 32bit values. [1] Unicode can be supported by adding '%' to DTREE_HASH_TCHARS and using percent encoding on your data.