- Write a Scala program to read a JSON file with 6 fields {Customer ID, Customer Name, Product, Product ID, Price, Purchase Date}. Read the file with this schema and write the same data in CSV format (comma-delimited).
- Continuing from the question above, create a new file in Parquet format using Scala, including only 5 fields: {Customer ID, Customer Name, Product, Price, Purchase Date}
The source code can be found here: ScalaJSONToCSV.scala
- A JSON file `input.json` is created under the resources directory with the following sample data:
{"Customer ID":"001", "Customer Name":"John Doe", "Product":"Dart Board", "Product ID":"123", "Price":70, "Purchase Date":"2019-04-04"}, {"Customer ID":"002", "Customer Name":"Jane Doe", "Product":"Keyboard", "Product ID":"124", "Price":49.99, "Purchase Date":"2019-04-03"}, {"Customer ID":"003", "Customer Name":"Hercule Poirot", "Product":"Magnifying Glass", "Product ID":"125", "Price":249.99, "Purchase Date":"2019-04-02"}, {"Customer ID":"004", "Customer Name":"Frank Underwood", "Product":"Water Rower", "Product ID":"323", "Price":3999.99, "Purchase Date":"2019-04-02"}, {"Customer ID":"005", "Customer Name":"Pika Achu", "Product":"Duracell Batteries", "Product ID":"130", "Price":15.99, "Purchase Date":"2019-04-01"}
- Using a `SparkSession` instance, read the JSON file into a DataFrame (`inputJSONDF`).
- Display the DataFrame (`inputJSONDF`) for validation (development only).
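The read-and-display steps above can be sketched as follows; the application name, master setting, and resource path are assumptions for this sketch, not taken from the original code:

```scala
import org.apache.spark.sql.SparkSession

object ScalaJSONToCSV {
  def main(args: Array[String]): Unit = {
    // Local SparkSession; app name and master are assumptions for this sketch.
    val spark = SparkSession.builder()
      .appName("ScalaJSONToCSV")
      .master("local[*]")
      .getOrCreate()

    // Spark infers the schema from the JSON Lines input.
    val inputJSONDF = spark.read.json("src/main/resources/input.json")

    // Validation output, intended for development only.
    inputJSONDF.printSchema()
    inputJSONDF.show(truncate = false)
  }
}
```

Running this prints the inferred schema and the five sample rows, which is enough to confirm that all six fields were picked up.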
- Write the DataFrame (`inputJSONDF`) as CSV to any desired location; for this coding challenge, the file is written to the `output_csv` directory.
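The CSV write step might look like the following; the `header` and `overwrite` options are assumptions chosen to make reruns convenient:

```scala
// Write the DataFrame as comma-delimited CSV with a header row.
// "overwrite" is an assumption so the job can be rerun cleanly.
inputJSONDF.write
  .option("header", "true")
  .mode("overwrite")
  .csv("output_csv")
```

Spark writes a directory (`output_csv`) of part files rather than a single CSV file; that is its normal distributed output layout.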
- Using the DataFrame (`inputJSONDF`) created above:
  - Rename every field whose name contains a space, to conform to the Parquet column-naming convention.
  - Select only the required columns and create a new DataFrame (`parquetDF`) with only the required data.
  - Write the newly created DataFrame (`parquetDF`) as a Parquet file; for this coding challenge, the file is written to the `output_parquet` directory.
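The rename-select-write steps above can be sketched as shown below; the renamed column names (`CustomerID`, etc.) are assumptions, since the original only says spaces must be removed:

```scala
// Rename space-containing columns (Parquet column names may not contain
// spaces), drop "Product ID", and write the result as Parquet.
val parquetDF = inputJSONDF
  .withColumnRenamed("Customer ID", "CustomerID")
  .withColumnRenamed("Customer Name", "CustomerName")
  .withColumnRenamed("Purchase Date", "PurchaseDate")
  .select("CustomerID", "CustomerName", "Product", "Price", "PurchaseDate")

parquetDF.write
  .mode("overwrite")   // assumption, for convenient reruns
  .parquet("output_parquet")
```

Selecting after renaming keeps only the 5 required fields, so `Product ID` never reaches the Parquet output.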