In this blog post, we work with a time-series dataset of Bitcoin prices.
Published on January 05, 2021 by Udbhav Pangotra
Bitcoin Historical Data Analysis
Bitcoin (₿) is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. Some interesting facts about Bitcoin (BTC):
#Data Pre-Processing packages:
import numpy as np
import pandas as pd
from datetime import datetime
#Data Visualization Packages:
import seaborn as sns
custom_colors = ["#4e89ae", "#c56183","#ed6663","#ffa372"]
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.image as mpimg
from colorama import Fore, Back, Style # For text colors
y_= Fore.CYAN
m_= Fore.WHITE
#garbage collector - To free up unused space
import gc
import networkx as nx
import plotly.graph_objects as go #To construct network graphs
#To avoid printing unnecessary deprecation and future warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
#Time series analysis packages:
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import kpss
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
#Facebook Prophet packages:
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import add_changepoints_to_plot, plot_cross_validation_metric
#Time -To find how long each cell takes to run
import time
#Importing of Data
Data set Overview & Pre-Processing
print(f"{m_}Total records:{y_}{data.shape}\n")
print(f"{m_}Data types of data columns: \n{y_}{data.dtypes}")
Total records:(4727777, 8)
Data types of data columns:
Timestamp int64
Open float64
High float64
Low float64
Close float64
Volume_(BTC) float64
Volume_(Currency) float64
Weighted_Price float64
dtype: object
Data Pre-processing steps
The data is recorded at intraday granularity (4,727,777 rows over 3,289 days, or roughly one row per minute), so we need to resample it to daily frequency.
data['Timestamp'] = [datetime.fromtimestamp(x) for x in data['Timestamp']]
data = data.set_index('Timestamp')
data = data.resample("24H").mean()
Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
2011-12-31 | 4.465000 | 4.482500 | 4.465000 | 4.482500 | 23.829470 | 106.330084 | 4.471603 |
2012-01-01 | 4.806667 | 4.806667 | 4.806667 | 4.806667 | 7.200667 | 35.259720 | 4.806667 |
2012-01-02 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 19.048000 | 95.240000 | 5.000000 |
2012-01-03 | 5.252500 | 5.252500 | 5.252500 | 5.252500 | 11.004660 | 58.100651 | 5.252500 |
2012-01-04 | 5.200000 | 5.223333 | 5.200000 | 5.223333 | 11.914807 | 63.119577 | 5.208159 |
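One caveat with the conversion above: `datetime.fromtimestamp` interprets each Unix timestamp in the local time zone of the machine running the notebook, so the daily boundaries can shift from one machine to another. A minimal sketch of a time-zone-independent alternative, using `pd.to_datetime(..., unit="s")` on made-up one-minute data (the `raw` frame and its values are assumptions for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw data: one row per minute over two days,
# with Unix timestamps in seconds (prices are made up for illustration).
start = 1325376000  # 2012-01-01 00:00:00 UTC
n_rows = 2 * 24 * 60
raw = pd.DataFrame({
    "Timestamp": start + 60 * np.arange(n_rows),
    "Close": np.linspace(4.0, 6.0, n_rows),
})

# Interpret the integers as seconds since the epoch (UTC), independent of
# the machine's local time zone.
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], unit="s")

# Average all intraday rows that fall on the same calendar day.
daily = raw.set_index("Timestamp").resample("D").mean()
print(daily)
```

Taking the mean of every column is a simplification that the post also makes; a daily candle would normally use the first `Open`, the maximum `High`, the minimum `Low`, and the last `Close`.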
missed = pd.DataFrame()
missed['column'] = data.columns
missed['percent'] = [round(100* data[col].isnull().sum() / len(data), 2) for col in data.columns]
missed = missed.sort_values('percent',ascending=False)
missed = missed[missed['percent']>0]
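fig = sns.barplot(
    x=missed['percent'],  # assumed axis mapping; the original arguments are missing
    y=missed['column'],
).set_title('Missed values percent for every column')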
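The same per-column percentages can be computed without a list comprehension, since the mean of a boolean mask is the fraction of missing entries. A small sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps: 'Volume' misses 2 of 4 values, 'Open' misses 1 of 4.
toy = pd.DataFrame({
    "Open": [1.0, np.nan, 3.0, 4.0],
    "Close": [1.0, 2.0, 3.0, 4.0],
    "Volume": [np.nan, np.nan, 5.0, 6.0],
})

# isnull() gives a boolean frame; its column means are the missing fractions.
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only columns that actually have gaps, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```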
def fill_missing(df):
    """Impute missing values in every column using interpolation."""
    df['Open'] = df['Open'].interpolate()
    df['Close'] = df['Close'].interpolate()
    df['Weighted_Price'] = df['Weighted_Price'].interpolate()
    df['Volume_(BTC)'] = df['Volume_(BTC)'].interpolate()
    df['Volume_(Currency)'] = df['Volume_(Currency)'].interpolate()
    df['High'] = df['High'].interpolate()
    df['Low'] = df['Low'].interpolate()
    print(f'{m_}No. of Missing values after interpolation:\n{y_}{df.isnull().sum()}')

# The columns are modified in place, so no return value is needed
fill_missing(data)
No. of Missing values after interpolation:
Open 0
High 0
Low 0
Close 0
Volume_(BTC) 0
Volume_(Currency) 0
Weighted_Price 0
dtype: int64
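To see what `interpolate()` does with its default settings, here is a minimal sketch on a made-up daily series (the values are assumptions for illustration). By default the method is linear and treats the rows as equally spaced:

```python
import numpy as np
import pandas as pd

# Five daily prices with a two-day gap in the middle and a gap at the end.
prices = pd.Series(
    [10.0, np.nan, np.nan, 16.0, np.nan],
    index=pd.date_range("2012-01-01", periods=5, freq="D"),
)

# Interior gaps are filled along the straight line between their neighbours;
# the trailing gap is filled with the last valid value.
filled = prices.interpolate()
print(filled)
```

With these defaults a gap at the very start of a series is not filled, so it is worth checking `isnull().sum()` afterwards, as the post does.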
Index(['Open', 'High', 'Low', 'Close', 'Volume_(BTC)', 'Volume_(Currency)',
new_df=new_df[['Volume_(BTC)', 'Close','Volume_(Currency)']]
Timestamp | Volume_market_mean | close_mean | volume_curr_mean |
2011-12-31 | 23.829470 | 4.482500 | 106.330084 |
2012-01-01 | 7.200667 | 4.806667 | 35.259720 |
2012-01-02 | 19.048000 | 5.000000 | 95.240000 |
2012-01-03 | 11.004660 | 5.252500 | 58.100651 |
2012-01-04 | 11.914807 | 5.223333 | 63.119577 |
data_df = data.merge(new_df, left_on='Timestamp',
data_df['volume(BTC)/Volume_market_mean'] = data_df['Volume_(BTC)'] / data_df['Volume_market_mean']
data_df['Volume_(Currency)/volume_curr_mean'] = data_df['Volume_(Currency)'] / data_df['volume_curr_mean']
data_df['close/close_market_mean'] = data_df['Close'] / data_df['close_mean']
data_df['open/close'] = data_df['Open'] / data_df['Close']
data_df["gap"] = data_df["High"] - data_df["Low"]
Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | Volume_market_mean | close_mean | volume_curr_mean | volume(BTC)/Volume_market_mean | Volume_(Currency)/volume_curr_mean | close/close_market_mean | open/close | gap |
2011-12-31 | 4.465000 | 4.482500 | 4.465000 | 4.482500 | 23.829470 | 106.330084 | 4.471603 | 23.829470 | 4.482500 | 106.330084 | 1.0 | 1.0 | 1.0 | 0.996096 | 0.017500 |
2012-01-01 | 4.806667 | 4.806667 | 4.806667 | 4.806667 | 7.200667 | 35.259720 | 4.806667 | 7.200667 | 4.806667 | 35.259720 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.000000 |
2012-01-02 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 19.048000 | 95.240000 | 5.000000 | 19.048000 | 5.000000 | 95.240000 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.000000 |
2012-01-03 | 5.252500 | 5.252500 | 5.252500 | 5.252500 | 11.004660 | 58.100651 | 5.252500 | 11.004660 | 5.252500 | 58.100651 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.000000 |
2012-01-04 | 5.200000 | 5.223333 | 5.200000 | 5.223333 | 11.914807 | 63.119577 | 5.208159 | 11.914807 | 5.223333 | 63.119577 | 1.0 | 1.0 | 1.0 | 0.995533 | 0.023333 |
Sometimes a data set is too large to process comfortably as a pandas DataFrame. To make sure we don't hold too much RAM, we can check how much memory the DataFrame uses and shrink it, for example by downcasting numeric columns.
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
print(f'{m_}Memory of the dataframe:\n{y_}{mem_usage(data_df)}')
Memory of the dataframe:
0.40 MB
#All the columns are in float64 format; we can downcast them to float32 to reduce memory usage
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3289 entries, 2011-12-31 to 2020-12-31
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 3289 non-null float64
1 High 3289 non-null float64
2 Low 3289 non-null float64
3 Close 3289 non-null float64
4 Volume_(BTC) 3289 non-null float64
5 Volume_(Currency) 3289 non-null float64
6 Weighted_Price 3289 non-null float64
7 Volume_market_mean 3289 non-null float64
8 close_mean 3289 non-null float64
9 volume_curr_mean 3289 non-null float64
10 volume(BTC)/Volume_market_mean 3289 non-null float64
11 Volume_(Currency)/volume_curr_mean 3289 non-null float64
12 close/close_market_mean 3289 non-null float64
13 open/close 3289 non-null float64
14 gap 3289 non-null float64
dtypes: float64(15)
memory usage: 411.1 KB
gl_float = data_df.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['Before','After']
 | Before | After |
float32 | NaN | 15.0 |
float64 | 15.0 | NaN |
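Downcasting halves the storage per value, but `float32` keeps only about 7 significant decimal digits, so it is worth checking how much rounding error the conversion introduces. A minimal sketch on random data (the `frame` below is an assumption for illustration, not the Bitcoin data):

```python
import numpy as np
import pandas as pd

# 1000 rows x 3 float64 columns of made-up values in [0, 1).
frame = pd.DataFrame(
    np.random.default_rng(0).random((1000, 3)), columns=list("abc")
)
before = frame.memory_usage(deep=True).sum()

# Same call pattern as in the post: downcast every column to the smallest float type.
small = frame.apply(pd.to_numeric, downcast="float")
after = small.memory_usage(deep=True).sum()

# Largest absolute difference between the original and downcast values.
max_abs_err = (frame - small).abs().to_numpy().max()

print(small.dtypes)
print(f"bytes before: {before}, after: {after}, max abs error: {max_abs_err:.2e}")
```

For prices in the tens of thousands the absolute rounding error is larger than for values near 1, so whether `float32` is acceptable depends on the precision the later analysis needs.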
print(f"{m_}Before float conversion:\n{y_}{mem_usage(data_df)}")
data_df[converted_float.columns] = converted_float
print(f"{m_}After float conversion:\n{y_}{mem_usage(data_df)}")
Before float conversion:
0.40 MB
After float conversion:
0.21 MB
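def triple_plot(x, title, c):
    # Histogram with KDE, box plot, and violin plot of one column on a shared x-axis
    fig, ax = plt.subplots(3, 1, figsize=(25, 10), sharex=True)
    # histplot(kde=True) stands in for sns.distplot, which is deprecated since seaborn 0.11
    sns.histplot(x, ax=ax[0], color=c, kde=True, stat='density')
    ax[0].set_title('Histogram + KDE')
    sns.boxplot(x=x, ax=ax[1], color=c)
    ax[1].set_title('Box plot')
    sns.violinplot(x=x, ax=ax[2], color=c)
    ax[2].set_title('Violin plot')
    fig.suptitle(title, fontsize=30)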
triple_plot(data['Open'],'Distribution of Opening price',custom_colors[0])
triple_plot(data['High'],'Distribution of the highest price',custom_colors[1])
triple_plot(data['Low'],'Distribution of Lowest Price',custom_colors[2])
triple_plot(data['Close'],'Distribution of the closing Price',custom_colors[3])
triple_plot(data['Volume_(BTC)'],'Distribution of Volume in BTC ',custom_colors[0])
triple_plot(data['Volume_(Currency)'],'Distribution of Volume',custom_colors[1])
triple_plot(data['Weighted_Price'],'Distribution of Weighted price',custom_colors[2])
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(data_df[data_df.columns[1:]].corr(), mask=mask, cmap='coolwarm', vmax=.3, center=0,
square=True, linewidths=.5,annot=True)
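The mask passed to `sns.heatmap` has to have the same shape as the matrix being plotted, so it is simplest to compute the correlation matrix once and derive the mask from it. A minimal sketch on made-up data, using `DataFrame.mask` instead of the heatmap to show which cells the mask hides (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with three columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# Compute the correlation matrix once and build the mask from that same object.
corr = toy.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # True on and above the diagonal

# Cells where the mask is True are hidden (NaN here, blank in a heatmap).
lower = corr.mask(mask)
print(lower)
```

Only the strictly lower triangle survives, which removes the redundant mirror image and the trivial diagonal of ones.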
Index(['Open', 'High', 'Low', 'Close', 'Volume_(BTC)', 'Volume_(Currency)',
'Weighted_Price', 'Volume_market_mean', 'close_mean',
'volume_curr_mean', 'open/close', 'gap'],
indices = corr.index.values
# nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array accepts a plain ndarray
cor_matrix = corr.to_numpy()
G = nx.from_numpy_array(cor_matrix)
G = nx.relabel_nodes(G, lambda x: indices[x])
def corr_network(G, corr_direction, min_correlation):
    H = G.copy()
    # Drop edges that do not match the requested sign and threshold
    for s1, s2, weight in G.edges(data=True):
        if corr_direction == "positive":
            if weight["weight"] < 0 or weight["weight"] < min_correlation:
                H.remove_edge(s1, s2)
        else:
            if weight["weight"] >= 0 or weight["weight"] > min_correlation:
                H.remove_edge(s1, s2)
    edges, weights = zip(*nx.get_edge_attributes(H, 'weight').items())
    # Scale the edge widths so stronger correlations are drawn thicker
    weights = tuple([(1 + abs(x)) ** 2 for x in weights])
d = dict(
node_size=tuple([x**2 for x in node_sizes]),alpha=0.8)
nx.draw_networkx_labels(H, positions, font_size=13)
if corr_direction == "positive":
edge_colour =
edge_colour =
nx.draw_networkx_edges(H, positions, edgelist=edges,style='solid',
width=weights, edge_color = weights, edge_cmap = edge_colour,
edge_vmin = min(weights), edge_vmax=max(weights))
corr_network(G, corr_direction="positive",min_correlation = 0.5)
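The edge-filtering rule inside `corr_network` can be checked in isolation, without drawing anything. The sketch below applies the same thresholds directly to a correlation matrix; `filter_edges` is a hypothetical helper and the matrix values are made up for illustration. Unlike the graph in the post, it skips self-loops (the diagonal).

```python
import pandas as pd

def filter_edges(corr, direction, min_correlation):
    """Hypothetical helper: list (node_a, node_b, weight) for pairs passing the threshold."""
    cols = list(corr.columns)
    edges = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # each unordered pair once, no self-loops
            w = float(corr.loc[a, b])
            if direction == "positive":
                keep = w >= 0 and w >= min_correlation
            else:
                keep = w < 0 and w <= min_correlation
            if keep:
                edges.append((a, b, w))
    return edges

# Made-up symmetric correlation matrix for three features.
names = ["Open", "Close", "gap"]
corr = pd.DataFrame(
    [[1.0, 0.9, -0.7],
     [0.9, 1.0, 0.2],
     [-0.7, 0.2, 1.0]],
    index=names, columns=names,
)

print(filter_edges(corr, "positive", 0.5))
print(filter_edges(corr, "negative", -0.5))
```

Note that for the negative direction the threshold itself is negative: an edge is kept only when its correlation is at or below `min_correlation`.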
Time series Analysis and Prediction using Prophet
What is Prophet? Prophet is Facebook's open-source library for time series forecasting. It decomposes a time series into trend, seasonality, and holiday components, and it has intuitive hyperparameters that are easy to tune.
Things to note when using Prophet
series = data_df.Weighted_Price
result = seasonal_decompose(series, model='additive',period=1)
# Renaming the column names according to Prophet's requirements
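Prophet expects its input frame to have a datetime column named `ds` and a numeric target column named `y`. The renaming step itself is not shown here, so the following is only a sketch of how it might look, using a made-up stand-in for `series` (the values are assumptions; the name `prophet_df` matches the variable used later in the post):

```python
import pandas as pd

# Made-up stand-in for data_df.Weighted_Price: a daily series indexed by 'Timestamp'.
idx = pd.date_range("2012-01-01", periods=5, freq="D", name="Timestamp")
series = pd.Series([4.8, 5.0, 5.25, 5.2, 5.3], index=idx, name="Weighted_Price")

# Move the index into a column, then rename both columns to what Prophet expects.
prophet_df = series.reset_index().rename(
    columns={"Timestamp": "ds", "Weighted_Price": "y"}
)
print(prophet_df)
```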
<fbprophet.forecaster.Prophet at 0x7ff8c52e7c10>
pro_regressor= Prophet()
train_X= prophet_df[:2500]
test_X= prophet_df[2500:]
#Fitting the data
future_data = pro_regressor.make_future_dataframe(periods=249)
#Forecast the data for Test data
forecast_data = pro_regressor.predict(test_X)
df_cv = cross_validation(pro_regressor, initial='100 days', period='180 days', horizon = '365 days')
pm = performance_metrics(df_cv, rolling_window=0.1)
fig = plot_cross_validation_metric(df_cv, metric='mape', rolling_window=0.1)
 | horizon | mse | rmse | mae | mape | mdape | coverage |
0 | 37 days | 0.413771 | 0.643251 | 0.216705 | 0.001091 | 0.000171 | 0.785388 |
1 | 38 days | 0.469662 | 0.685319 | 0.231994 | 0.001188 | 0.000175 | 0.779680 |
2 | 39 days | 0.539967 | 0.734825 | 0.249433 | 0.001294 | 0.000178 | 0.772831 |
3 | 40 days | 0.615809 | 0.784735 | 0.268262 | 0.001407 | 0.000180 | 0.767123 |
4 | 41 days | 0.688987 | 0.830052 | 0.286306 | 0.001525 | 0.000188 | 0.764840 |
 | horizon | mse | rmse | mae | mape | mdape | coverage |
324 | 361 days | 23.897097 | 4.888466 | 1.934879 | 0.006200 | 0.000758 | 0.849315 |
325 | 362 days | 23.913066 | 4.890099 | 1.935748 | 0.006199 | 0.000381 | 0.851598 |
326 | 363 days | 23.928062 | 4.891632 | 1.936313 | 0.006196 | 0.000713 | 0.853881 |
327 | 364 days | 23.944867 | 4.893349 | 1.935958 | 0.006192 | 0.000586 | 0.856164 |
328 | 365 days | 23.963754 | 4.895279 | 1.936640 | 0.006186 | 0.000557 | 0.858447 |
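The `initial`, `period`, and `horizon` arguments of `cross_validation` describe a rolling-origin evaluation: the model is refit at a series of cutoff dates, each time forecasting `horizon` ahead, with cutoffs spaced `period` apart and at least `initial` of history before the first one. The sketch below illustrates that idea only; `rolling_cutoffs` is a hypothetical helper, the dates are made up, and Prophet's own cutoff selection may differ in its details.

```python
import pandas as pd

def rolling_cutoffs(first, last, initial, period, horizon):
    """Hypothetical helper: cutoff dates for a rolling-origin evaluation."""
    cutoffs = []
    cutoff = last - horizon           # the latest cutoff leaves a full horizon to score
    while cutoff - initial >= first:  # keep at least `initial` of training history
        cutoffs.append(cutoff)
        cutoff -= period              # step back by one period
    return sorted(cutoffs)

# Two years of made-up history, with the same window sizes as in the post.
cutoffs = rolling_cutoffs(
    first=pd.Timestamp("2012-01-01"),
    last=pd.Timestamp("2013-12-31"),
    initial=pd.Timedelta("100 days"),
    period=pd.Timedelta("180 days"),
    horizon=pd.Timedelta("365 days"),
)
for c in cutoffs:
    print(f"train up to {c.date()}, score the following 365 days")
```

Each row of the metrics table then aggregates errors across all cutoffs at a given distance into the forecast, which is why the error grows with the `horizon` column.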
The MAPE (Mean Absolute Percent Error) measures the size of the error in percentage terms. It is calculated as the average of the unsigned percentage errors. Many organizations focus primarily on the MAPE when assessing forecast accuracy. Most people are comfortable thinking in percentage terms, making the MAPE easy to interpret. It can also convey information when you don't know the item's demand volume. For example, telling your manager, "we were off by less than 4%" is more meaningful than saying "we were off by 3,000 cases," if your manager doesn't know an item's typical demand volume.
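As a concrete illustration of that definition, here is a minimal sketch of the calculation on made-up numbers (the `mape` function and the sample values are assumptions for illustration). It returns a fraction rather than a percentage, which appears to be how the `mape` column in the table above is expressed (0.001091 would be about 0.11%).

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error as a fraction; undefined if any actual value is 0."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual))

# Made-up example: errors of 10%, 5%, and 0% average to 5%.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
score = mape(actual, forecast)
print(f"MAPE = {score:.4f} ({score * 100:.2f}%)")
```

Because each error is divided by the actual value, the metric breaks down when actual values are zero or very close to zero.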
What Prophet doesn't do