column order not respected
espinielli opened this issue · 3 comments
The columns order for my table reference seems not to be respected.
The columns order in the CSV files is
snapshot_id,altitude,heading,latitude,longitude,radar_id,speed,squawk
But I would like to see them as:
snapshot_id longitude latitude altitude heading speed radar_id squawk
I tried the following SELECT
in the SQL prompt (it worked):
0: jdbc:drill:> SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv` LIMIT 20;
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| snapshot_id | longitude | latitude | altitude | heading | speed | squawk | radar_id |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| 1486340102 | 14.27592 | 50.10523 | 0 | 40 | 0 | 0 | 8520 |
| 1486340568 | 14.27612 | 50.10521 | 0 | 58 | 0 | 0 | 8520 |
| 1486340668 | 14.27592 | 50.10512 | 0 | 247 | 0 | 0 | 8520 |
| 1486340773 | 14.27569 | 50.10515 | 0 | 232 | 0 | 0 | 8520 |
| 1486340828 | 14.27578 | 50.10507 | 0 | 77 | 0 | 0 | 8520 |
| 1486340903 | 14.27590 | 50.10528 | 0 | 41 | 0 | 0 | 8520 |
| 1486341363 | 14.27594 | 50.10523 | 0 | 48 | 0 | 0 | 8520 |
| 1486341828 | 14.27573 | 50.10522 | 0 | 241 | 0 | 0 | 8520 |
| 1486342643 | 14.27581 | 50.10522 | 0 | 120 | 0 | 0 | 8520 |
| 1486342870 | 14.27571 | 50.10522 | 0 | 255 | 0 | 0 | 8520 |
| 1486343421 | 14.27588 | 50.10524 | 0 | 63 | 0 | 0 | 8520 |
| 1486343966 | 14.27568 | 50.10509 | 0 | 312 | 0 | 0 | 8520 |
| 1486344251 | 14.27590 | 50.10522 | 0 | 119 | 0 | 0 | 8520 |
| 1486345301 | 14.27586 | 50.10524 | 0 | 294 | 0 | 0 | 8520 |
| 1486345875 | 14.27606 | 50.10507 | 0 | 123 | 0 | 0 | 8520 |
| 1486345931 | 14.27627 | 50.10509 | 0 | 132 | 0 | 0 | 8520 |
| 1486345961 | 14.27619 | 50.10529 | 0 | 339 | 0 | 0 | 8520 |
| 1486345981 | 14.27595 | 50.10538 | 0 | 297 | 0 | 0 | 8520 |
| 1486346091 | 14.27594 | 50.10522 | 0 | 120 | 0 | 0 | 8520 |
| 1486347131 | 14.27592 | 50.10543 | 0 | 12 | 0 | 0 | 8520 |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
20 rows selected (9.943 seconds)
But when I use the same SELECT
as definition of the table reference with sergeant
/dplyr
I only get the column order as in the original CSV files (even if later I try to reorder them with dplyr
).
Here is the sequence of commands in R:
> db <- src_drill("localhost")
> fr24pos <- tbl(db, "(SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv`)")
> fr24pos
# Source: table<(SELECT snapshot_id, longitude, latitude, altitude, heading, speed,
', '') AS int) AS `squawk`, radar_id FROM
# dfs.fr24.`/*.csv`)> [?? x 8]
# Database: DrillConnection
altitude squawk snapshot_id heading latitude radar_id speed longitude
<int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 0 0 1486340102 40 50.1 8520 0 14.3
2 0 0 1486340568 58 50.1 8520 0 14.3
3 0 0 1486340668 247 50.1 8520 0 14.3
4 0 0 1486340773 232 50.1 8520 0 14.3
5 0 0 1486340828 77 50.1 8520 0 14.3
6 0 0 1486340903 41 50.1 8520 0 14.3
7 0 0 1486341363 48 50.1 8520 0 14.3
8 0 0 1486341828 241 50.1 8520 0 14.3
9 0 0 1486342643 120 50.1 8520 0 14.3
10 0 0 1486342870 255 50.1 8520 0 14.3
# ... with more rows
> fr24pos %>% filter(altitude > 1500) %>% head()
# Source: lazy query [?? x 8]
# Database: DrillConnection
altitude squawk snapshot_id heading latitude radar_id speed longitude
<int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 39000 30276 1486339252 114 37.9 230 506 44.0
2 39000 30276 1486339315 114 37.8 230 509 44.2
3 38975 20836 1486339327 114 37.8 230 509 44.2
4 39025 20836 1486339361 114 37.8 230 508 44.2
5 39000 20836 1486339452 114 37.7 230 505 44.6
6 39000 20836 1486339513 131 37.6 230 503 44.7
> fr24pos %>% select(snapshot_id, longitude, latitude, heading, speed, everything()) %>% filter(altitude > 1500) %>% head()
# Source: lazy query [?? x 8]
# Database: DrillConnection
altitude squawk snapshot_id heading latitude radar_id speed longitude
<int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 39000 30276 1486339252 114 37.9 230 506 44.0
2 39000 30276 1486339315 114 37.8 230 509 44.2
3 38975 20836 1486339327 114 37.8 230 509 44.2
4 39025 20836 1486339361 114 37.8 230 508 44.2
5 39000 20836 1486339452 114 37.7 230 505 44.6
6 39000 20836 1486339513 131 37.6 230 503 44.7
I file it here, but maybe it could be an issue with dplyr
or Apache Drill...
Example CSV file in attachment (extension in .txt to get into the issue):
20170206_207761844.csv.txt
PS: I have to explicitly remove "\r" from squawk
(last field in the CSV) to deal with line ending...probably there is a config to do that...
PPS: fr24 is a workspace of dfs with CSV files with headers and I followed "Using Apache Drill with R" recipes to get up an running: THANKS!
So, the REST API returns JSON and Drill's JSON output does not respect column order.
There's a new JDBC dplyr
interface and it does respect column order by default.
looks like https://issues.apache.org/jira/browse/DRILL-6847 will make this possible when 1.15.0 is out