NYC-Taxi

Big data analytics using Hadoop 🐘

Primary language: R

📌 Summary

Exploratory data analysis (EDA) of the NYC Yellow Taxi Trip Records data, focusing on the pickup_datetime and dropoff_datetime variables, showed that taxi demand is high not only during the day, when people are active, but also at night. During the day there are many public transit options, but at night the choice is rather limited. We want to give passengers a safe means of transport at night by operating a late-night bus. To run and expand this bus business, we use the New York taxi data to select late-night bus stop locations and to predict fares.
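The hourly demand check described above can be sketched as follows. This is a minimal illustration in plain Python, not the repository's own code. The timestamp format `%Y-%m-%d %H:%M:%S` and the helper names `pickups_per_hour` and `night_share` are assumptions, and the sample timestamps are made up. The 23:00 to 07:00 window follows the late-night definition used later in this README.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def pickups_per_hour(pickup_times):
    """Count trips by the hour of day (0-23) of their pickup timestamp."""
    counts = Counter()
    for ts in pickup_times:
        counts[datetime.strptime(ts, FMT).hour] += 1
    return counts


def night_share(counts, start_hour=23, end_hour=7):
    """Fraction of trips picked up in [start_hour, end_hour), wrapping past midnight."""
    night = sum(n for h, n in counts.items() if h >= start_hour or h < end_hour)
    total = sum(counts.values())
    return night / total if total else 0.0


# Made-up sample timestamps, for illustration only.
sample = [
    "2015-01-15 23:05:39",
    "2015-01-16 00:45:10",
    "2015-01-16 02:10:00",
    "2015-01-16 14:30:00",
]
counts = pickups_per_hour(sample)
print(sorted(counts.items()))  # [(0, 1), (2, 1), (14, 1), (23, 1)]
print(night_share(counts))     # 0.75
```

On the full data set the same counting would more likely run as a MapReduce job, with the hour as the key and a count as the value.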


The goals of this project are to:

  1. clean the data for the late-night bus business,
  2. find bus stops with K-Means Clustering in order to select the late-night bus routes, and
  3. set fares with Linear Regression in order to predict late-night bus fares.

To select late-night bus stop locations, the data is first cleaned against a set of criteria:

  1. Remove records that should not exist in theory.
  2. Remove records whose values make no sense in the data.
  3. In line with the project topic, keep only the records from 11 PM to 7 AM.
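The three cleaning steps above could look roughly like this. It is a hedged sketch, not the project's actual rules: the README does not spell out the exact criteria, so the checks below (drop-off after pickup, positive distance and fare, at least one passenger, coordinates inside a rough New York bounding box) are assumptions, as are the dictionary keys and the timestamp format. Only the 11 PM to 7 AM window comes from the text.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Rough bounding box around New York City; an assumption for illustration.
LON_MIN, LON_MAX = -74.3, -73.6
LAT_MIN, LAT_MAX = 40.4, 41.0


def in_box(lon, lat):
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX


def is_valid(trip):
    """Steps 1 and 2: reject impossible or nonsensical records (assumed rules)."""
    pickup = datetime.strptime(trip["pickup_datetime"], FMT)
    dropoff = datetime.strptime(trip["dropoff_datetime"], FMT)
    if dropoff <= pickup:
        return False
    if trip["trip_distance"] <= 0 or trip["fare_amount"] <= 0:
        return False
    if trip["passenger_count"] < 1:
        return False
    return (in_box(trip["pickup_longitude"], trip["pickup_latitude"])
            and in_box(trip["dropoff_longitude"], trip["dropoff_latitude"]))


def is_late_night(trip, start_hour=23, end_hour=7):
    """Step 3: pickup between 11 PM and 7 AM, wrapping past midnight."""
    hour = datetime.strptime(trip["pickup_datetime"], FMT).hour
    return hour >= start_hour or hour < end_hour


def clean(trips):
    return [t for t in trips if is_valid(t) and is_late_night(t)]


def make_trip(**overrides):
    """Build a made-up trip record; overrides replace individual fields."""
    trip = {
        "pickup_datetime": "2015-01-16 00:10:00",
        "dropoff_datetime": "2015-01-16 00:25:00",
        "passenger_count": 1, "trip_distance": 3.2, "fare_amount": 12.0,
        "pickup_longitude": -73.99, "pickup_latitude": 40.73,
        "dropoff_longitude": -73.95, "dropoff_latitude": 40.78,
    }
    trip.update(overrides)
    return trip


trips = [
    make_trip(),                                          # valid late-night trip
    make_trip(pickup_datetime="2015-01-16 14:00:00",
              dropoff_datetime="2015-01-16 14:20:00"),    # daytime
    make_trip(trip_distance=0.0),                         # zero distance
    make_trip(pickup_longitude=0.0, pickup_latitude=0.0), # outside the box
    make_trip(dropoff_datetime="2015-01-15 23:59:00"),    # drop-off before pickup
]
night_trips = clean(trips)
print(len(night_trips))  # 1
```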

๋‹ค์Œ์€ ์‹ฌ์•ผ ๋ฒ„์Šค ๋…ธ์„ ์„ ์„ ์ •ํ•˜๊ธฐ ์œ„ํ•ด K-Means Clusering ์„ ์‚ฌ์šฉํ•œ๋‹ค.

  1. K ๊ฐ’์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด Elbow Method ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ 
  2. K-Means ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ณผ์ •์„ ๊ฑฐ์ณ, ๊ฐ ๊ตฌ์—ญ์˜ ์ค‘์‹ฌ์ ์„ ๋‰ด์š•(๋งจํ•ดํŠผ) ์ „์ฒด๋ฅผ ์ด์–ด์ฃผ๋Š” ๋…ธ์„ ์˜ ๋ฒ„์Šค ์ •๋ฅ˜์žฅ์œผ๋กœ ์ •ํ•œ๋‹ค.
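The two steps above can be sketched with a small pure-Python K-Means. This is an illustration under stated assumptions, not the project's implementation: it treats (longitude, latitude) pairs as flat 2-D points, uses a simple deterministic farthest-point initialization instead of random seeding, and runs on made-up pickup points. The within-cluster sum of squares (WCSS) for each K is what the Elbow Method inspects: K is chosen where adding more clusters stops reducing WCSS by much.

```python
def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iters=100):
    """Return (centroids, wcss) after Lloyd-style iterations."""
    centroids = init_centroids(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]  # keep an empty cluster's centroid in place
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss


# Made-up pickup points (longitude, latitude) forming three tight groups.
points = [
    (-73.990, 40.730), (-73.991, 40.731), (-73.989, 40.729),
    (-73.970, 40.760), (-73.971, 40.761), (-73.969, 40.759),
    (-73.950, 40.780), (-73.951, 40.781), (-73.949, 40.779),
]

# Elbow Method: compute WCSS for a range of K and look for the bend.
wcss_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(k, wcss)

# With K chosen, the centroids are the candidate bus stops.
stops, _ = kmeans(points, 3)
print(sorted(stops))
```

For the sample points the WCSS drops sharply up to K = 3 and barely changes afterwards, so 3 would be the elbow here.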

Linear Regression is also used to predict fares:

  1. Build a route that connects the bus stop coordinates selected by K-Means.
  2. Predict the late-night taxi fare for each route segment using regression analysis.
  3. Predict the late-night bus fare based on the taxi fare.
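A minimal sketch of steps 1 and 2 follows, under several assumptions that are not taken from this README: the regression uses trip distance as its only predictor, segment length is the straight-line (haversine) distance between consecutive stops, the stop order is already given, and the training pairs are synthetic. How the taxi fare is converted into a bus fare in step 3 is not described here, so the sketch stops at the per-segment taxi fare.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def haversine_miles(a, b):
    """Great-circle distance between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def segment_taxi_fares(stops, intercept, slope):
    """Predicted taxi fare for each pair of consecutive stops on the route."""
    return [intercept + slope * haversine_miles(a, b)
            for a, b in zip(stops, stops[1:])]


# Synthetic (trip distance in miles, fare in dollars) pairs, for illustration.
distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.0, 7.5, 10.0, 12.5]
intercept, slope = fit_line(distances, fares)

# Hypothetical stop coordinates (longitude, latitude) in route order.
route = [(-73.99, 40.73), (-73.97, 40.76), (-73.95, 40.78)]
for fare in segment_taxi_fares(route, intercept, slope):
    print(round(fare, 2))
```

Straight-line distance understates the distance actually driven, so a real analysis would more likely use the recorded trip distance of taxi rides between the two stop areas.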

๐Ÿ˜ Framework

  • HDFS
  • MapReduce

🚕 DataSet

| Field Name | Description |
|---|---|
| VendorID | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc. |
| pickup datetime | The date and time when the meter was engaged. |
| dropoff datetime | The date and time when the meter was disengaged. |
| Passenger count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip distance | The elapsed trip distance in miles reported by the taximeter. |
| Pickup longitude | Longitude where the meter was engaged. |
| Pickup latitude | Latitude where the meter was engaged. |
| RateCodeID | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride |
| Store and fwd flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip |
| Dropoff longitude | Longitude where the meter was disengaged. |
| Dropoff latitude | Latitude where the meter was disengaged. |
| Payment type | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip |
| Fare amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip amount | This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls amount | Total amount of all tolls paid in trip. |
| Total amount | The total amount charged to passengers. Does not include cash tips. |
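The coded fields in the data dictionary above can be turned into readable labels with simple lookup tables. This is a small illustrative sketch: the code-to-label mappings are copied from the table, while the function name `describe_codes`, its output keys, and the fallback label for unlisted codes are assumptions.

```python
# Code tables copied from the data dictionary above.
VENDORS = {1: "Creative Mobile Technologies, LLC", 2: "VeriFone Inc."}
RATE_CODES = {
    1: "Standard rate", 2: "JFK", 3: "Newark",
    4: "Nassau or Westchester", 5: "Negotiated fare", 6: "Group ride",
}
PAYMENT_TYPES = {
    1: "Credit card", 2: "Cash", 3: "No charge",
    4: "Dispute", 5: "Unknown", 6: "Voided trip",
}

UNRECOGNIZED = "unrecognized code"  # fallback label chosen for this sketch


def describe_codes(vendor_id, rate_code_id, payment_type):
    """Map the numeric codes of one trip record to their labels."""
    return {
        "vendor": VENDORS.get(vendor_id, UNRECOGNIZED),
        "rate_code": RATE_CODES.get(rate_code_id, UNRECOGNIZED),
        "payment_type": PAYMENT_TYPES.get(payment_type, UNRECOGNIZED),
    }


print(describe_codes(2, 2, 1))
# {'vendor': 'VeriFone Inc.', 'rate_code': 'JFK', 'payment_type': 'Credit card'}
```

Codes that fall outside the dictionary are labeled rather than raising an error, which makes them easy to count and filter during cleaning.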