Making sense of Endomondo's calorie estimation

The other day I got curious about how Endomondo estimates energy expenditure during exercise.

On their website, they mention some paywalled paper, but no specifics, so I figured it'd be interesting to reverse engineer it myself. I've extracted my Endomondo data from their JSON export and plotted a regression.

I'm using a Wahoo TickrX chest strap monitor, so the HR data coming from it is pretty decent.

First, I'm importing the dataframe from the Python package I use to interact with my data (I've mentioned it here).

It's private at the moment, but it's pretty specific to my use cases, and the only interface to it in this post is a Pandas dataframe, so hopefully that won't confuse you.

In [1]:
from my.workouts.dataframes import endomondo
df = endomondo()
WARNING:workout-provider:Unhandled: Cycling
WARNING:workout-provider:Unhandled: Cycling

Some sample data:

In [2]:
display(df[df['dt'].apply(lambda dt: str(dt.date())) == '2019-04-21'])
     dt                         error  heartbeats   kcal   sport
384  2019-04-21 10:11:28+00:00  None   3873.500000  310.0  Rope jumping
385  2019-04-21 10:47:58+00:00  None   2860.666667  248.0  Running

Heartbeats were calculated as average HR multiplied by the duration of exercise.
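
As a minimal sketch of that calculation (the function name and the sample numbers are made up for illustration; they are not taken from my export or from the private package):

```python
def total_heartbeats(avg_hr_bpm: float, duration_seconds: float) -> float:
    """Approximate total heartbeats as average HR (beats/min) times duration in minutes."""
    return avg_hr_bpm * duration_seconds / 60


# Hypothetical workout: 30 minutes at an average of 150 bpm
beats = total_heartbeats(avg_hr_bpm=150, duration_seconds=30 * 60)
print(beats)  # 4500.0
```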

The error column is a neat way of propagating exceptions from the data provider.

E.g. I only have HR data for the last couple of years or so, so the data provider doesn't have any HR points for older Endomondo workouts. While I could filter these entries out in the data provider, they might still be useful for other plots and analysis pipelines (e.g. if I was only interested in kcals and didn't care about heartbeats).

Instead, I'm just being defensive and propagating exceptions up through the dataframe, leaving it up to the user to handle them.
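
Roughly, the pattern could look like the sketch below. Since the real provider is private, `Workout`, `parse_workout`, `workouts`, `to_rows` and the raw record layout are all hypothetical stand-ins; only the idea of turning exceptions into values in an `error` column comes from this post.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Union


@dataclass
class Workout:
    sport: str
    kcal: float
    heartbeats: Optional[float]


def parse_workout(raw: dict) -> Workout:
    # Hypothetical parser: raises on activities it does not know how to handle
    if raw['sport'] == 'Cycling':
        raise RuntimeError(f"Unhandled activity: {raw['sport']}")
    return Workout(sport=raw['sport'], kcal=raw['kcal'], heartbeats=raw.get('heartbeats'))


def workouts(raws: list) -> Iterator[Union[Workout, Exception]]:
    # Yield exceptions as values instead of raising, so one bad entry doesn't abort the rest
    for raw in raws:
        try:
            yield parse_workout(raw)
        except Exception as e:
            yield e


def to_rows(items) -> list:
    # Each row carries either the data or the error; the consumer decides what to do with it
    rows = []
    for item in items:
        if isinstance(item, Exception):
            rows.append({'error': str(item), 'sport': None, 'kcal': None, 'heartbeats': None})
        else:
            error = None if item.heartbeats is not None else 'no hr'
            rows.append({'error': error, 'sport': item.sport, 'kcal': item.kcal, 'heartbeats': item.heartbeats})
    return rows


# Made-up raw records, just to exercise the three cases
raws = [
    {'sport': 'Running', 'kcal': 248.0, 'heartbeats': 2860.0},
    {'sport': 'Table tennis', 'kcal': 127.0},
    {'sport': 'Cycling'},
]
rows = to_rows(workouts(raws))
```

A list of dicts like `rows` can be passed straight to `pandas.DataFrame(rows)`, which gives a frame with an `error` column similar in spirit to the one used here.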

In [3]:
display(df[df['dt'].apply(lambda dt: str(dt.date())).isin(['2015-03-06', '2018-05-28'])])
     dt                         error                        heartbeats  kcal   sport
17   2015-03-06 05:50:38+00:00  no hr                        NaN         397.0  Running
18   2015-03-06 13:20:06+00:00  no hr                        NaN         127.0  Table tennis
297  2018-05-28 10:11:45+00:00  Unhandled activity: Cycling  NaN         NaN    NaN
298  2018-05-28 12:58:33+00:00  Unhandled activity: Cycling  NaN         NaN    NaN

So, first we filter out the entries with errors (and the catch-all 'Other' sport):

In [4]:
df = df[df['error'].isnull() & (df['sport'] != 'Other')]

As well as sports with fewer than 10 entries, which would otherwise end up as outliers:

In [5]:
df = df.groupby(['sport']).filter(lambda grp: len(grp) >= 10) 
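
To make it clearer what that `groupby(...).filter(...)` call keeps, here is a toy plain-Python equivalent on made-up entries (the data and variable names are illustrative assumptions, not part of the notebook):

```python
from collections import Counter

# Made-up entries: only the 'sport' field matters for this filter
entries = [{'sport': 'Running'}] * 12 + [{'sport': 'Skiing'}] * 10 + [{'sport': 'Yoga'}] * 2

counts = Counter(e['sport'] for e in entries)
# Keep only entries whose sport occurs at least 10 times overall
kept = [e for e in entries if counts[e['sport']] >= 10]

print(sorted({e['sport'] for e in kept}))  # ['Running', 'Skiing']
```
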
In [6]:
%matplotlib inline
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

matplotlib.rc('font', size=17, weight='regular')

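# Number of workouts per sport; a plain column name (not a one-element list)
# keeps the group keys as strings rather than 1-tuples on newer pandas
sports = {
    sport: len(grp) for sport, grp in df.groupby('sport')
}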

g = sns.lmplot(
    data=df,
    x='heartbeats',
    y='kcal',
    hue='sport', 
    hue_order=sports.keys(),
    legend_out=False,
    height=15,
    palette='colorblind',
)
ax = g.ax
ax.set_title('Dependency of energy spent during exercise on number of heartbeats')

ax.set_xlim((0, None))
ax.set_xlabel('Total heartbeats, measured by chest strap HR monitor')

ax.set_ylim((0, None))
ax.set_ylabel('Kcal,\nEndomondo\nestimate', rotation=0, y=1.0)

plt.grid(True)
# https://stackoverflow.com/a/55108651/706389
plt.legend(
    title='Sport',
    labels=[f'{s} ({cnt} points)' for s, cnt in sports.items()],
    loc='upper left',
)
pass

Unsurprisingly, it looks like a simple linear model (considering my weight and age barely changed).

What I find interesting is that, for me, running feels way more intense than any other cardio I'm doing, definitely way harder than skiing!

However, the regression coefficient (basically, calories burnt per 'unit of heart activity') is more or less the same. I guess that could potentially be explained by the fact that running involves more muscle activity, which Endomondo can't capture and doesn't try to infer from the exercise type (which you enter manually when you start logging the exercise).
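
One way to check the "more or less the same slope" impression numerically, rather than by eyeballing the plot, would be to fit a separate line per sport. The sketch below uses a closed-form least-squares fit in plain Python; `fit_line` and the data points are made up for illustration and are not my real Endomondo numbers.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (sport, heartbeats, kcal) points, NOT real Endomondo data
points = [
    ('Running', 2000, 190), ('Running', 3000, 290), ('Running', 4000, 390),
    ('Skiing', 5000, 480), ('Skiing', 8000, 780), ('Skiing', 11000, 1080),
]

# Group the points by sport
by_sport = defaultdict(list)
for sport, beats, kcal in points:
    by_sport[sport].append((beats, kcal))

# Fit one line per sport and keep the slopes for comparison
slopes = {}
for sport, pts in by_sport.items():
    xs, ys = zip(*pts)
    slopes[sport], _ = fit_line(xs, ys)
    print(sport, round(slopes[sport], 3))
```

If the per-sport slopes come out close to each other, that supports the idea that the estimate depends on heartbeats alone and not on the sport.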

With regard to the actual regression coefficient: seaborn won't let you display it on the plot (the author has a very strong opinion about that, apparently), so we use sklearn to compute it for us:

In [7]:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['heartbeats']], df['kcal'])

[coef] = reg.coef_
free = reg.intercept_

display(f"Regression coefficient: {coef}")
display(f"Free term: {free}")
'Regression coefficient: 0.0932774194157544'
'Free term: -9.503488864425037'

Basically, that means I get about 0.1 Kcal for each heartbeat during exercise. The free term should ideally be equal to 0 (as a sanity check: having no heartbeats shouldn't result in calorie loss), and -10 is close enough, given that a typical workout here is a few hundred Kcal.

Also, a fun calculation: extrapolating that coefficient to a whole day at 60 bpm:
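
If one wanted to enforce a zero free term rather than just check it, one option would be to fit a line through the origin (in sklearn that corresponds to `LinearRegression(fit_intercept=False)`). That is not what this post does; the sketch below only illustrates the idea in plain Python on made-up points, and `fit_through_origin` is a hypothetical helper.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = slope * x, i.e. with the free term fixed at 0."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Made-up (heartbeats, kcal) points lying on kcal = 0.1 * heartbeats - 10
beats = [2000, 3000, 4000, 5000]
kcals = [190, 290, 390, 490]

slope = fit_through_origin(beats, kcals)
print(round(slope, 4))  # 0.0974
```

With a small negative free term in the data, forcing the line through zero nudges the slope slightly below the unconstrained one, so the two fits tell roughly the same story.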

In [8]:
normal_bpm = 60
minutes_in_day = 24 * 60

coef * normal_bpm * minutes_in_day
Out[8]:
8059.16903752118

8K Kcals per day? A bit too much for an average person. I wouldn't draw any conclusions from that one though :)