hrbrmstr/sergeant

column order not respected

espinielli opened this issue · 3 comments

The columns order for my table reference seems not to be respected.

The columns order in the CSV files is

snapshot_id,altitude,heading,latitude,longitude,radar_id,speed,squawk

But I would like to see them as:

snapshot_id longitude latitude altitude heading speed radar_id squawk

I tried the following SELECT in the SQL prompt (it worked):

0: jdbc:drill:> SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv` LIMIT 20;
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| snapshot_id  | longitude  | latitude  | altitude  | heading  | speed  | squawk  | radar_id  |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| 1486340102   | 14.27592   | 50.10523  | 0         | 40       | 0      | 0       | 8520      |
| 1486340568   | 14.27612   | 50.10521  | 0         | 58       | 0      | 0       | 8520      |
| 1486340668   | 14.27592   | 50.10512  | 0         | 247      | 0      | 0       | 8520      |
| 1486340773   | 14.27569   | 50.10515  | 0         | 232      | 0      | 0       | 8520      |
| 1486340828   | 14.27578   | 50.10507  | 0         | 77       | 0      | 0       | 8520      |
| 1486340903   | 14.27590   | 50.10528  | 0         | 41       | 0      | 0       | 8520      |
| 1486341363   | 14.27594   | 50.10523  | 0         | 48       | 0      | 0       | 8520      |
| 1486341828   | 14.27573   | 50.10522  | 0         | 241      | 0      | 0       | 8520      |
| 1486342643   | 14.27581   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486342870   | 14.27571   | 50.10522  | 0         | 255      | 0      | 0       | 8520      |
| 1486343421   | 14.27588   | 50.10524  | 0         | 63       | 0      | 0       | 8520      |
| 1486343966   | 14.27568   | 50.10509  | 0         | 312      | 0      | 0       | 8520      |
| 1486344251   | 14.27590   | 50.10522  | 0         | 119      | 0      | 0       | 8520      |
| 1486345301   | 14.27586   | 50.10524  | 0         | 294      | 0      | 0       | 8520      |
| 1486345875   | 14.27606   | 50.10507  | 0         | 123      | 0      | 0       | 8520      |
| 1486345931   | 14.27627   | 50.10509  | 0         | 132      | 0      | 0       | 8520      |
| 1486345961   | 14.27619   | 50.10529  | 0         | 339      | 0      | 0       | 8520      |
| 1486345981   | 14.27595   | 50.10538  | 0         | 297      | 0      | 0       | 8520      |
| 1486346091   | 14.27594   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486347131   | 14.27592   | 50.10543  | 0         | 12       | 0      | 0       | 8520      |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
20 rows selected (9.943 seconds)

But when I use the same SELECT as definition of the table reference with sergeant/dplyr I only get the column order as in the original CSV files (even if later I try to reorder them with dplyr).

Here is the sequence of commands in R:

> db <- src_drill("localhost")
> fr24pos <- tbl(db, "(SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv`)")
> fr24pos
# Source:   table<(SELECT snapshot_id, longitude, latitude, altitude, heading, speed,
', '') AS int) AS `squawk`, radar_id FROM
#   dfs.fr24.`/*.csv`)> [?? x 8]
# Database: DrillConnection
   altitude squawk snapshot_id heading latitude radar_id speed longitude
      <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
 1        0      0  1486340102      40     50.1     8520     0      14.3
 2        0      0  1486340568      58     50.1     8520     0      14.3
 3        0      0  1486340668     247     50.1     8520     0      14.3
 4        0      0  1486340773     232     50.1     8520     0      14.3
 5        0      0  1486340828      77     50.1     8520     0      14.3
 6        0      0  1486340903      41     50.1     8520     0      14.3
 7        0      0  1486341363      48     50.1     8520     0      14.3
 8        0      0  1486341828     241     50.1     8520     0      14.3
 9        0      0  1486342643     120     50.1     8520     0      14.3
10        0      0  1486342870     255     50.1     8520     0      14.3
# ... with more rows
> fr24pos %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7
> fr24pos %>% select(snapshot_id, longitude, latitude, heading, speed, everything()) %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7 

I file it here, but maybe it could be an issue with dplyr or Apache Drill...

Example CSV file in attachment (extension in .txt to get into the issue):
20170206_207761844.csv.txt

PS: I have to explicitly remove "\r" from squawk (last field in the CSV) to deal with line ending...probably there is a config to do that...
PPS: fr24 is a workspace of dfs with CSV files with headers and I followed "Using Apache Drill with R" recipes to get up an running: THANKS!

So, the REST API returns JSON and Drill's JSON output does not respect column order.

There's a new JDBC dplyr interface and it does respect column order by default.

looks like https://issues.apache.org/jira/browse/DRILL-6847 will make this possible when 1.15.0 is out

Hey! Well 1.15.0 is out and I've made a change to the in-progress 0.8.0 release of sergeant that preserves column order! (this does require 1.15.0+ tho)

9a310da