Cleaning dates and removing outliers with R using lubridate and other R packages using the Kaggle Taxis dataset
From Kaggle - Taxis:
- 1. Shouldn’t this be using pipes i.e.
%>%
?:
train_kaggle$distance = distance(train_kaggle$pickup_latitude, train_kaggle$pickup_longitude,
train_kaggle$dropoff_latitude, train_kaggle$dropoff_longitude)
- 2. Some
lubridate
code:
QUOTE
The hour of day of the pick up is added and a daytime boolean (between 8:00 and 20:00) is added.
# Add time of pickup----
t.lub.train <- ymd_hms(train_kaggle$pickup_datetime)
train_kaggle$pickup_hour <- as.numeric(format(t.lub.train, "%H")) + as.numeric(format(t.lub.train,
"%M"))/60
# Add daytime
train_kaggle$Daytime = (train_kaggle$pickup_hour < 20) & train_kaggle$pickup_hour >
8
test_kaggle$Daytime = (test_kaggle$pickup_hour < 20) & test_kaggle$pickup_hour >
8
END QUOTE
- 3. How to remove outliers:
QUOTE
Outliers are removed as follows:
Trips with distance of 0 km.
Trips over 40 km
Trips over 20000 seconds
Trips with average speed over 50 km/ hour
train_kaggle = train_kaggle[train_kaggle$distance != 0, ]
train_kaggle = train_kaggle[train_kaggle$distance < 40, ]
train_kaggle = train_kaggle[train_kaggle$trip_duration < 20000, ]
train_kaggle = train_kaggle[train_kaggle$speed < 50, ]
END QUOTE