Predicting Hike Times with Machine Learning
Introduction
In Switzerland, signs at most hiking destinations show rough time estimates, so that people know how long it will take them to reach the next peak or mountain hut.
Alpine clubs use a standardised approach to predict these times.
The underlying formulas are intentionally tailored to slower hikers.
This conservative approach is meant to prevent inexperienced hikers from ending up in distress in the mountains.
DIN 33466 standardises the formula used to calculate these times and is the basis for the signage in the German-speaking parts of the Alps and also in Slovenia.
According to DIN 33466, a hiker can cover the following within one hour: 300 m of uphill, 500 m of downhill, or 4 km of horizontal distance.
Times for vertical and horizontal movement are calculated separately, and half of the smaller value is added to the larger one to determine the total walking time required.
The Swiss Alpine Club (SAC) uses a slightly simpler formula in its handbook (Winkler, Brehm, Haltmeier; Bergsport Sommer: Technik, Taktik, Sicherheit (2010); SAC-Verlag), which says that a hiker covers 4 km in distance per hour or 400 m of uphill per hour.
Depending on the terrain an additional hour can be added for each 800 m of downhill.
# Python3
def din33466(uphill, downhill, distance):
    # uphill/downhill in metres, distance in km; returns the estimated walking time in hours
    vertical = downhill / 500.0 + uphill / 300.0
    horizontal = distance / 4.0
    return min(vertical, horizontal) / 2 + max(vertical, horizontal)

def sac(uphill, downhill, distance):
    # the basic SAC formula ignores downhill; see the optional downhill rule above
    return uphill / 400.0 + distance / 4.0

for hike in ((900, 300, 8), (1500, 500, 20)):
    t1 = din33466(*hike)
    t2 = sac(*hike)
    print("Hike time for {} m of uphill, {} m of downhill and a distance of {} km".format(*hike))
    print("according to {0:>10}: {1:>8.2f} h".format("DIN33466", t1))
    print("according to {0:>10}: {1:>8.2f} h".format("SAC", t2))
Hike time for 900 m of uphill, 300 m of downhill and a distance of 8 km
according to DIN33466: 4.60 h
according to SAC: 4.25 h
Hike time for 1500 m of uphill, 500 m of downhill and a distance of 20 km
according to DIN33466: 8.50 h
according to SAC: 8.75 h
More experienced hikers can substitute the values in these formulas to get results which fit their personal fitness level - or use machine learning.
Let's do the latter and show how machine learning can be used to achieve better results.
Structure
Data scientists often find themselves spending much of their time on data acquisition and preparation (in the 2017 Kaggle survey, "dirty data" was ranked as the biggest barrier; see A Glimpse into The Daily Life of a Data Scientist), yet most tutorials start with ready-to-use datasets. This time we start with nothing but a simple problem and gather the data ourselves, to provide insight into the whole process from data gathering to model creation.
The first part demonstrates how data which is publicly available on the Web can be scraped and prepared using Scrapy, a Python library which makes building crawlers quite easy.
In the second part, we use our gathered data to build different machine learning models with sklearn and keras to predict the hike time and evaluate them against a held-out test dataset.
Part 1: Data Gathering
With a small self-built web crawler, we fetch an initial set of GPS recorded tracks as GPX files from hikr.org.
Hikr.org is an online platform where users can upload their tracks and write tour reports.
GPX (GPS Exchange Format) is an XML based format for storing waypoints, tracks and routes.
The crawled GPX files are then processed to extract features such as minimum and maximum altitude, geographic bounds, track length, uphill and downhill.
Building a spider with Scrapy
The steps needed to fetch a GPX file from hikr.org are as follows:
- search for all hiking reports
- for all found reports:
- open a report detail page
- if a GPX file is attached to the report, download it
On hikr.org, authentication is not required to download files and no captchas have to be solved to perform these steps, so automating this was straightforward; the resulting class fits in a few lines of code, as shown below.
import scrapy

class GpxSpider(scrapy.Spider):
    name = 'gpx'
    start_urls = ['http://www.hikr.org/filter.php?act=filter&a=ped&ai=100&aa=630']

    def parse(self, response):
        for href in response.css('.content-list a::attr(href)'):
            yield response.follow(href, self.parse_detail_page)

        # follow pagination links
        for href in response.css('a#NextLink::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_detail_page(self, response):
        gpx_href = [href for href in response.css('#geo_table a::attr(href)').extract() if href.endswith(".gpx")]
        if len(gpx_href) >= 1:
            yield {
                'file_urls': [gpx_href[0]],
                'difficulty': response.xpath(
                    'normalize-space(//td[normalize-space() ="Hiking grading:"]/following-sibling::td/a/text())').extract_first(),
                'user': response.css('.author a.standard::text').extract_first(),
                'name': response.css('h1.title::text').extract_first(),
                'url': response.url
            }
In the class above, the parse method is called with the responses of the requests to all URLs defined in the start_urls list. In this case, this is only one URL which points to the list view with all hiking reports. We are interested in two links on the overview page - the link to a detail page of a report, and the link to the next page of results. By yielding response.follow(href, handler) we instruct Scrapy to send a request to the URL and call the handler with the response of that request.
The detail page is handled by parse_detail_page, which extracts the difficulty rating of the hike, username, report name and URL. By yielding a dictionary, we tell Scrapy to persist these values. The file_urls field only contains the link to the GPX file that we want to fetch. By adding the Files Pipeline in the Scrapy settings, the URLs listed in this field will be fetched and stored automatically, and the file path is added to the output dictionary.
The following configuration entry has to be present in settings.py
to enable the Files Pipeline:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = 'downloads'
After the crawler had been run for several hours, a few thousand entries were stored in a text file in JSON Lines format.
There are some things to keep in mind when using Scrapy:
- You can (and should) test your code in the interactive Scrapy shell. This is especially useful to quickly find good CSS or XPath selectors without running the spider.
- Enable throttling. You don't want to cause a denial of service by bombarding your target with thousands of requests (see the settings sketch after this list).
- Only crawl a small amount in the beginning and verify that the collected data works for your problem. The earlier potential issues are discovered, the better.
- If the API or website has rate limits, consider starting the job with job persistence enabled so you can pause and continue later. Alternatively, consider using a mesh proxy provider to distribute your requests over several IPs.
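As a rough illustration of the throttling and job persistence tips, the following sketch shows the relevant Scrapy settings and the command to run the spider with a job directory; the values are illustrative and not tuned for hikr.org.
# settings.py -- throttle the crawl to stay polite (illustrative values)
AUTOTHROTTLE_ENABLED = True          # adapt the request rate to the server's response times
AUTOTHROTTLE_START_DELAY = 5.0       # initial delay between requests, in seconds
DOWNLOAD_DELAY = 1.0                 # minimum delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # avoid hammering a single host
ROBOTSTXT_OBEY = True                # respect the site's robots.txt

# run with job persistence so the crawl can be paused (Ctrl-C) and resumed later:
#   scrapy crawl gpx -s JOBDIR=crawls/gpx-run-1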
Processing GPX files with gpxpy and writing track information to MongoDB
It was assumed that length, uphill and downhill distance are the most important attributes to be extracted from the GPX files.
The time in movement also needed to be determined, as this is the target that our models should predict.
The library gpxpy was used to parse the downloaded GPX files and calculate these features.
Instead of writing the results back to a text file, they were inserted into MongoDB (mainly for demonstration purposes).
With the given amount of data this was not strictly necessary, however with larger datasets one can benefit from querying and indexing capabilities of databases to train models with batches of data.
A CLI import script was built to do the processing and insertion of result batches into MongoDB.
./mongo_import.py -c tracks -i tracks.jl
The import script fetches a batch of tracks from the JSON lines file, processes the entries with gpxpy and then bulk inserts the extracted track information to MongoDB.
The source code of the import script is available here.
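The core of the processing step could look roughly like the following sketch; the helper function, file names and connection values are illustrative and not taken from the actual script.
import gpxpy
from pymongo import MongoClient

def extract_features(gpx_path):
    # parse one GPX file and compute the track features used later for modelling
    with open(gpx_path) as f:
        gpx = gpxpy.parse(f)
    uphill, downhill = gpx.get_uphill_downhill()
    min_elevation, max_elevation = gpx.get_elevation_extremes()
    moving_data = gpx.get_moving_data()
    return {
        'length_2d': gpx.length_2d(),            # metres
        'length_3d': gpx.length_3d(),            # metres
        'uphill': uphill,                        # metres of ascent
        'downhill': downhill,                    # metres of descent
        'min_elevation': min_elevation,
        'max_elevation': max_elevation,
        'moving_time': moving_data.moving_time,  # seconds
        'max_speed': moving_data.max_speed,      # m/s
    }

# bulk insert a batch of processed tracks into MongoDB
collection = MongoClient("mongodb://localhost:27017")["tracks"]["hikr"]
collection.insert_many([extract_features(p) for p in ("downloads/track1.gpx", "downloads/track2.gpx")])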
The created dataset has also been made available as a CSV file on kaggle.com/roccoli/gpx-hike-tracks.
Part 2: Building Models to predict hike times
We load the dataset that we have previously created and start with some basic analysis to understand the nature of the data.
import json
from pymongo import MongoClient
import gpxpy
import pandas as pd
from matplotlib import pyplot
%matplotlib
mongo_uri = "mongodb://localhost:27017"
mongo_db = "tracks"
mongo_collection = "hikr"
client = MongoClient(mongo_uri)
db = client[mongo_db]
collection = db[mongo_collection]
Loading the dataset into a dataframe
For further processing we load all tracks from the collection into a pandas dataframe.
# fetch a single document
track = collection.find_one(projection={"gpx": 0, "url": 0, "bounds": 0, "name": 0})
values = [track.values() for track in collection.find(projection={"gpx": 0, "url": 0, "bounds": 0, "name": 0})]
# we later use track document's field names to label the columns of the dataframe
df = pd.DataFrame(columns=track.keys(), data=values).set_index("_id")
The difficulty column contains categorical data as strings.
As most models won't work with text values, we can extract the second character of the label, which results in an ordinal encoding.
For categorical data which cannot be transformed that easily, there are built-in helpers like the sklearn CategoricalEncoder or the Keras one-hot-encoding function.
df['avg_speed'] = df['length_3d']/df['moving_time']
df['difficulty_num'] = df['difficulty'].map(lambda x: int(x[1])).astype('int32')
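For illustration, the difficulty labels could instead be one-hot encoded; the sketch below uses pandas' get_dummies rather than the helpers mentioned above, purely to keep the example short, and its result is not used further below.
# one-hot encode the raw difficulty labels (illustrative only)
difficulty_onehot = pd.get_dummies(df['difficulty'], prefix='difficulty')
df_onehot = pd.concat([df, difficulty_onehot], axis=1)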
Removing Outliers
Looking at the min and max values, it is apparent that there are some tracks which we want to exclude from our dataset.
An infinite average speed, or a minimum elevation of more than 30 km below sea level, just doesn't seem right.
We remove the extremes on both sides and drop all rows that contain null values.
df.describe()
| | max_speed | downhill | max_elevation | uphill | length_3d | moving_time | min_elevation |
|---|---|---|---|---|---|---|---|
| count | 12141.000000 | 12141.000000 | 10563.000000 | 12141.000000 | 1.214100e+04 | 12141.000000 | 10563.000000 |
| mean | 1.746356 | 879.145539 | 1934.281708 | 942.184362 | 1.874771e+04 | 12848.445268 | 1003.331150 |
| std | 5.394065 | 1028.618856 | 784.968353 | 1065.498993 | 4.093098e+05 | 11599.792248 | 813.001041 |
| min | 0.000000 | 0.000000 | -1.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | -32768.000000 |
| 25% | 1.078841 | 256.519000 | 1382.275000 | 420.142000 | 8.254129e+03 | 5260.000000 | 560.020000 |
| 50% | 1.367020 | 823.199002 | 1986.700000 | 882.000000 | 1.200577e+04 | 12990.000000 | 960.090000 |
| 75% | 1.604181 | 1266.923000 | 2498.455848 | 1301.005000 | 1.645813e+04 | 18514.000000 | 1389.485000 |
| max | 192.768748 | 52379.200000 | 5633.462891 | 35398.006781 | 3.189180e+07 | 189380.000000 | 4180.000000 |
# drop rows with missing values
df = df.dropna()
df = df[df['avg_speed'] < 2]        # an avg speed above 2 m/s is probably not a hiking activity
df = df[df['min_elevation'] > 0]    # drop tracks with implausible negative elevations
df = df[df['length_2d'] < 100000]   # drop tracks longer than 100 km
Correlations
We expect altitude changes and track distance to be highly correlated with the moving time, as these two features also appear in the formulas of the alpine clubs. However, there might be other features which correlate with moving time and may be good candidates for our models. The chart below shows the Pearson correlation values.
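Such a correlation overview can be computed with a few lines of pandas and matplotlib; a minimal sketch using the dataframe and pyplot from the earlier cells (the column selection is illustrative):
# Pearson correlations of the numeric features with the moving time
features = ['moving_time', 'length_3d', 'uphill', 'downhill',
            'max_elevation', 'min_elevation', 'max_speed', 'difficulty_num']
corr = df[features].corr(method='pearson')
print(corr['moving_time'].sort_values(ascending=False))

# visualise the full correlation matrix as a heatmap
pyplot.matshow(corr)
pyplot.xticks(range(len(features)), features, rotation=90)
pyplot.yticks(range(len(features)), features)
pyplot.colorbar()
pyplot.show()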
As expected, changes in altitude and distance have the highest correlations with moving time.
Max elevation also shows some correlation, as the terrain at higher altitudes can be more challenging than at lower altitudes.
Interestingly, the difficulty score doesn't seem to correlate much with the moving time.
This might be due to several reasons: the difficulty score of a whole tour is based on its most difficult section, and it is set by users and thus varies due to subjectivity.
Also, a difficult track may be exposed and only suitable for experienced hikers, but difficult terrain does not automatically slow one down.
Building the models
Before putting too much time into a sophisticated model, it is important to develop a simple baseline against which any other model can then be benchmarked. For many problems these simple baselines are already hard to beat, and they make it possible to identify and discard unpromising approaches early.
A strong baseline
Given the simple nature of the problem, we will use a linear regression model to predict the moving time based on the most correlated fields (length_3d, uphill, downhill and max_elevation).
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
y = df.reset_index()['moving_time']
x = df.reset_index()[['downhill', 'uphill', 'length_3d', 'max_elevation']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
lr = LinearRegression()
lr.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Scikit-learn provides a set of prebuilt evaluation functions for regression models.
For regression problems, the coefficient of determination ($r^2$) can be used as a measure of how well the regression approximates the real data points.
An $r^2$ of 1 indicates that the regression predictions perfectly fit the data.
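For reference, the coefficient of determination compares the squared prediction errors with the variance of the observed values:
$r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the observed values $y_i$.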
y_pred_lr = lr.predict(x_test)
r2 = r2_score(y_test, y_pred_lr)
mse = mean_squared_error(y_test, y_pred_lr)
print("r2:\t{}\nMSE: \t{}".format(r2, mse))
r2: 0.8277514178527514
MSE: 9947095.059578164
The coefficient of determination score is already quite high and might be hard to beat using a different model.
Let's compare this against the formula presented in the introduction.
y_pred_sac = df.apply(lambda row : sac(row.uphill, row.downhill, row.length_2d), axis=1)
r2 = r2_score(y, y_pred_sac)
mse = mean_squared_error(y, y_pred_sac)
print("r2:\t{}\nMSE: \t{}".format(r2, mse))
r2: -2.0866499679137513
MSE: 201072625.2795519
The formula's parameters have not been fitted to our data, and its output is not even on the same scale as the moving time recorded in seconds, so the raw predictions are far from the true values.
GradientBoostingRegressor
Kagglers have successfully applied variants of Gradient Boosting in competitions. This might also be a good fit for our problem.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
A good explanation of what Gradient Boosting does can be found in this article.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
gbr = GradientBoostingRegressor(n_estimators=50, random_state=9000)
gbr.fit(x_train, y_train)
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=50, presort='auto', random_state=9000,
             subsample=1.0, verbose=0, warm_start=False)
y_pred_gbr = gbr.predict(x_test)
r2 = r2_score(y_test, y_pred_gbr)
mse = mean_squared_error(y_test, y_pred_gbr)
print("r2:\t{}\nMSE: \t{}".format(r2, mse))
r2: 0.8516737813634596
MSE: 8565614.754054507
The GradientBoostingRegressor outperforms our baseline model.
Building a regression model with Keras
Regression models can also be built with neural networks. By not using an activation function in the last layer, the output can be any continuous value (as opposed to a range of 0 to 1 when using e.g. the sigmoid activation).
from keras import Sequential
from keras.optimizers import RMSprop
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
model = Sequential()
model.add(Dense(10, input_shape=(4,)))
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(1))
model.compile(optimizer=RMSprop(lr=0.0001), loss='mse')
model.fit(x_train, y_train, epochs=30, batch_size=20, validation_split=0.15, callbacks=[
    ModelCheckpoint(filepath='./keras-model.h5', save_best_only=True),
    EarlyStopping(patience=2),
    ReduceLROnPlateau()
], verbose=0)
model.load_weights(filepath='./keras-model.h5')
y_pred_keras = model.predict(x_test)
r2 = r2_score(y_test, y_pred_keras)
mse = mean_squared_error(y_test, y_pred_keras)
print("r2:\t{}\nMSE: \t{}".format(r2, mse))
r2: 0.8244549079469348
MSE: 10137463.520027775
This simple Keras model is slightly worse than our baseline model.
Evaluating an ensemble model
The GradientBoostingRegressor is our strongest model, so in a simple ensemble with the Keras model we give it twice the weight. On average, the resulting predictions might be closer to the real values.
import numpy as np
combined = (y_pred_keras[:,0] + y_pred_gbr * 2) / 3.0
r2 = r2_score(y_test, combined)
mse = mean_squared_error(y_test, combined)
print("r2:\t{}\nMSE: \t{}".format(r2, mse))
r2: 0.853322624963087
MSE: 8470396.530372122
The model ensemble gives us a slightly better result.
c = pd.DataFrame([combined, y_test]).transpose()
c.columns = ['combined', 'test']
c['diff_minutes'] = (c['test'] - c['combined']) / 60
c.describe()
| | combined | test | diff_minutes |
|---|---|---|---|
| count | 1824.000000 | 1824.000000 | 1824.000000 |
| mean | 15965.020362 | 16025.400219 | 1.006331 |
| std | 6886.272844 | 7601.326531 | 48.509436 |
| min | 1789.959391 | 239.000000 | -444.008206 |
| max | 86009.021163 | 105431.000000 | 323.699647 |
As can be seen in the table above, the ensemble model performs well for most hikes in our test dataset. However the min and max values indicate that there are a few outliers with predicted times deviating more than 6 hours from the recorded times. These outliers would need special attention if such a model was to be deployed into production.
Limitations and potential improvements
A model is usually only as good as the data it has been trained on. In this case, the dataset we used is likely not representative of the majority of hikers. People using a platform like hikr.org probably hike on a regular basis and thus might be fitter and faster than the average person, so the results might be biased. There might also be a geographic bias, so that results aren't accurate for other geographic areas with different terrain.
While on average the accuracy might be useful, there are still some considerable deviations from the actual values. The results could be improved by grouping people with similar fitness levels and training distinct models for each group. For a new user, a short recorded track could be sufficient to determine which group this user belongs to. Given the group, a more personalised result would be the outcome.
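As a rough sketch of the grouping idea, one could cluster users by their average hiking speed and train one baseline model per fitness group. This assumes the user field collected by the spider was carried through into the dataframe, which is not done in the code above, so the snippet is purely illustrative.
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# cluster users into fitness groups by their mean hiking speed (assumes a 'user' column in df)
user_speed = df.groupby('user')['avg_speed'].mean().to_frame('mean_speed')
user_speed['fitness_group'] = KMeans(n_clusters=3, random_state=0).fit_predict(user_speed[['mean_speed']])
df_grouped = df.join(user_speed['fitness_group'], on='user')

# train one simple baseline regressor per fitness group
group_models = {}
for group, subset in df_grouped.groupby('fitness_group'):
    reg = LinearRegression()
    reg.fit(subset[['downhill', 'uphill', 'length_3d', 'max_elevation']], subset['moving_time'])
    group_models[group] = reg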
Conclusion
With tools like Scrapy, collecting data from public sources can be done in just a few lines of code.
However, be aware of the limitations: such data sources may mainly contain data from a specific type of user.
We showed that a simple linear regression model performs very well for the given task of predicting hike times,
which underlines the importance of creating simple baselines.
There are cases where sophisticated machine learning models are not necessary and only add complexity to your solution.