Query directly from raw files

Question

jdegoes opened this issue 11 years ago · 1 comment

To make Precog much more accessible and user-friendly for local installs, and to prepare for work on a distributed version of Precog, we should allow queries to run directly against files stored in any format for which we have an input adapter, similar to how Hive and Pig handle data analysis.

This ticket covers refactoring the query engine so that queries can run directly over JSON data files, CSV files and, of course, NIHDB 'files', in a file system containing a variety of file formats.

To do this, we need to define a suitable input adapter which exposes a Table-oriented view of a file format, and propagate whatever information is necessary to use a particular adapter (e.g. CSV files, and possibly even JSON files, may be ambiguous and require information such as delimiters before they can be interpreted unambiguously as a Table).
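As a rough illustration of the shape such an adapter could take, here is a minimal Scala sketch. Every name in it (`Table`, `FormatOptions`, `InputAdapter`, `CsvAdapter`) is hypothetical and chosen for this example; none of it is taken from the Precog codebase, and the `Table` here is a toy stand-in rather than Precog's real table abstraction. The point is only that format-specific hints such as the delimiter travel alongside the data to the adapter.

```scala
// Hypothetical sketch: these names are illustrative, not Precog APIs.
object AdapterSketch {
  // A deliberately tiny stand-in for a Table: column names plus rows of raw strings.
  final case class Table(columns: Vector[String], rows: Vector[Vector[String]])

  // Format hints that must be propagated with the load request,
  // because the file alone may not say how it should be read.
  final case class FormatOptions(delimiter: Char = ',', hasHeader: Boolean = true)

  // An input adapter exposes a Table-oriented view of one file format.
  trait InputAdapter {
    def read(lines: Iterator[String], options: FormatOptions): Table
  }

  object CsvAdapter extends InputAdapter {
    def read(lines: Iterator[String], options: FormatOptions): Table = {
      // Naive split on the delimiter: no quoting or escaping support,
      // and trailing empty fields are dropped by split.
      val parsed = lines.filter(_.nonEmpty).map(_.split(options.delimiter).toVector).toVector
      if (parsed.isEmpty) Table(Vector.empty, Vector.empty)
      else if (options.hasHeader) Table(parsed.head, parsed.tail)
      else {
        // Without a header row, synthesize positional column names.
        val width = parsed.map(_.length).max
        Table(Vector.tabulate(width)(i => s"col$i"), parsed)
      }
    }
  }
}
```

For example, `AdapterSketch.CsvAdapter.read(Iterator("id|name", "1|ada"), AdapterSketch.FormatOptions(delimiter = '|'))` reads a pipe-delimited file that the default comma setting would misinterpret as a single column.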

Some file "formats" may in fact be directories containing many files; we should think about how to handle these.

Note that as per @nuttycom's comment, we already have JSON-backed and even JDBC-backed table adapters. The exact functionality we lack is the ability to discriminate between alternate representations at runtime based on the actual string paths passed to the table load function, as well as an architecture that makes it easy to add new input adapters and rules for selecting them during runtime loads.

This ticket will be considered complete when it is possible to create a Quirrel script that loads data from a JSON file, a CSV file, and a NIHDB file, and joins them all together; and when the associated architecture allows cleanly adding support and selection criteria for new input adapters (by defining the input adapter and describing the rules that dictate when the input adapter is used for dynamically loaded data -- e.g. when the file extension or MIME type matches a given value).
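One possible way to express the selection rules is a small registry of predicates evaluated at load time. The Scala sketch below is an assumption about how this might look, not a description of existing Precog code: `Source`, `Adapter`, `Rule`, `Registry` and the helper functions are all hypothetical names, the adapter is reduced to a label, and the `.nihdb` extension is invented purely for the example.

```scala
// Hypothetical sketch: these names are illustrative, not Precog APIs.
object AdapterRegistrySketch {
  // What the loader knows about a path at the moment of the load call.
  final case class Source(path: String, mimeType: Option[String] = None) {
    // Lower-cased extension of the last path segment, if it has one.
    def extension: Option[String] = {
      val name = path.substring(path.lastIndexOf('/') + 1)
      val dot = name.lastIndexOf('.')
      if (dot > 0) Some(name.substring(dot + 1).toLowerCase) else None
    }
  }

  // An adapter is reduced to a label here; a real one would produce a Table.
  final case class Adapter(name: String)

  // A rule pairs a predicate over the source with the adapter to use when it matches.
  final case class Rule(matches: Source => Boolean, adapter: Adapter)

  final class Registry(rules: Vector[Rule]) {
    // Returns a new registry, so adding an adapter never touches existing rules.
    def register(rule: Rule): Registry = new Registry(rules :+ rule)

    // First matching rule wins; None means no adapter claims this source.
    def select(source: Source): Option[Adapter] =
      rules.find(_.matches(source)).map(_.adapter)
  }

  def byExtension(ext: String, adapter: Adapter): Rule =
    Rule(_.extension.contains(ext), adapter)

  def byMimeType(mime: String, adapter: Adapter): Rule =
    Rule(_.mimeType.contains(mime), adapter)

  // Example rule set; the ".nihdb" extension is an assumption made for this sketch.
  val defaults: Registry = new Registry(Vector.empty)
    .register(byExtension("json", Adapter("json")))
    .register(byMimeType("application/json", Adapter("json")))
    .register(byExtension("csv", Adapter("csv")))
    .register(byMimeType("text/csv", Adapter("csv")))
    .register(byExtension("nihdb", Adapter("nihdb")))
}
```

With this shape, supporting a new format means defining its adapter and registering one or more rules for it; the load function only ever calls `select` on the path it was given.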

Answer 1 · 2013-10-06T05:24:03.000Z

This already exists, and is very thoroughly supported by the
ColumnarTableModule abstraction. It's used extensively in the tests, and
the same technique is used for MongoDB and JDBC back-ends.

