Create tables script:
# To find your NanoID (in this example it is 6aqqepemzy):
cat .env
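If your .env holds more than the ID, you can extract just the value instead of reading the whole file. A minimal sketch, assuming the file stores it under a key named NANO_ID (a hypothetical key name; adjust it to match your .env):

```shell
# Hypothetical helper: print the NanoID stored in a .env-style file.
# NANO_ID is an assumed key name; change it to whatever your .env uses.
get_nanoid() {
  grep '^NANO_ID=' "${1:-.env}" | cut -d'=' -f2
}
```

Calling `get_nanoid` with no argument reads `.env` in the current directory.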

code .\emr_pipeline\scripts\create_tables.hql

DROP TABLE IF EXISTS sales_data_raw;
CREATE EXTERNAL TABLE sales_data_raw (
  `region` STRING,
  `country` STRING,
  `item_type` STRING,
  `sales_channel` STRING,
  `order_priority` STRING,
  `order_date` STRING,
  `order_id` STRING,
  `ship_date` STRING,
  `units_sold` STRING,
  `unit_price` STRING,
  `unit_cost` STRING,
  `total_revenue` STRING,
  `total_cost` STRING,
  `total_profit` STRING
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION
  's3://emr-pipeline-6aqqepemzy/emr_pipeline/data/sales_data_raw/' -- change your NanoID here
TBLPROPERTIES (
  "skip.header.line.count"="1"
);
CREATE EXTERNAL TABLE sales_data (
  `region` STRING,
  `country` STRING,
  `item_type` STRING,
  `sales_channel` STRING,
  `order_priority` STRING,
  `order_date` DATE,
  `order_id` STRING,
  `ship_date` DATE,
  `units_sold` INT,
  `unit_price` DECIMAL(10,2),
  `unit_cost` DECIMAL(10,2),
  `total_revenue` DECIMAL(12,2),
  `total_cost` DECIMAL(12,2),
  `total_profit` DECIMAL(12,2)
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://emr-pipeline-6aqqepemzy/emr_pipeline/data/sales_data/' -- change your NanoID here
TBLPROPERTIES (
  'parquet.compression'='SNAPPY'
);
You'll have to replace emr-pipeline-[your NanoID] in each LOCATION clause with your own bucket name.
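Rather than editing both LOCATION lines by hand, you could patch the bucket name in one step. A sketch, assuming the script still contains the example ID 6aqqepemzy from above:

```shell
# Hypothetical helper: swap the example NanoID for your own in an .hql file.
# $1 = your NanoID, $2 = path to the script to patch in place.
set_nanoid() {
  sed -i "s/6aqqepemzy/$1/g" "$2"
}
# Example: set_nanoid abc123xyz0 emr_pipeline/scripts/create_tables.hql
```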
This HiveQL script creates two external tables for managing sales data stored in Amazon S3.
Dropping the sales_data_raw table: the script starts by dropping sales_data_raw if it exists, ensuring no conflict with the new table definition.
Creating the sales_data_raw table: every column is declared as STRING, the data is read as CSV through OpenCSVSerde, and the header row is skipped via skip.header.line.count.
Creating the sales_data table: this table holds the processed data with proper types (DATE, INT, DECIMAL) and is stored as Snappy-compressed Parquet.
In summary, the script sets up two distinct tables: one for raw CSV data and another for typed, Parquet-backed processed data, facilitating efficient data management and analysis.
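Once both tables exist, the raw table is typically converted into the typed Parquet table with a single query. A sketch of that step, assuming the CSV stores dates in a M/d/yyyy pattern (an assumption; verify against your actual data before running it):

```sql
-- Hypothetical conversion step: cast the raw strings into the typed table.
-- The date pattern 'M/d/yyyy' is an assumption about the source CSV.
INSERT OVERWRITE TABLE sales_data
SELECT
  region, country, item_type, sales_channel, order_priority,
  to_date(from_unixtime(unix_timestamp(order_date, 'M/d/yyyy'))),
  order_id,
  to_date(from_unixtime(unix_timestamp(ship_date, 'M/d/yyyy'))),
  CAST(units_sold    AS INT),
  CAST(unit_price    AS DECIMAL(10,2)),
  CAST(unit_cost     AS DECIMAL(10,2)),
  CAST(total_revenue AS DECIMAL(12,2)),
  CAST(total_cost    AS DECIMAL(12,2)),
  CAST(total_profit  AS DECIMAL(12,2))
FROM sales_data_raw;
```

Writing into sales_data produces Parquet files under its LOCATION, so subsequent queries read the compact typed data instead of the raw CSV.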