"This notebook contains a prototype approach for analyzing time series data.\n",
"The main goals of these applications are to:\n",
"1. Provide an interactive view of the data (zoom in/out, show info on hover)\n",
"2. Cluster *windows* (consecutive series of data points) based on similarity\n",
"3. Visualise the clusters in some way\n",
"\n",
"The applications are to be used in a large-scale data context, so clustering and retrieval of similar windows should be fast. Some algorithms compare the raw data of every window to every other window (e.g. using DTW), but this requires O(n^2) comparisons for n windows, which is too expensive for big data.\n",
"\n",
"A faster way to cluster windows is to map each window to a fixed-length symbol. Because the length of the symbol is fixed, we can assign each window to a bucket keyed by its symbol. A cluster is then the set of all windows within the same bucket. When we receive a new window, we can simply compute its symbol and look up all similar windows.\n",
"\n",
"One example of this is Symbolic Aggregate approXimation (SAX). SAX is used to transform a sequence of rational numbers (i.e., a time series) into a sequence of letters (i.e., a string).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start off with a simple dataset containing weather information from 2013-2017. The code below shows a snippet of the dataset."
"Next we cluster the windows using the **saxpy** library. The parameters for the clustering are as follows:\n",
"- *window_size*: The size (number of data points) of one window.\n",
"- *step_size*: The distance (number of data points) between consecutive windows. Unless window_size is very small, the window starting at data point *n* will be almost identical to the one starting at data point *n+1*. Because such near-duplicates are not very interesting for a data analyst, it is better to keep some distance between windows.\n",
"- *paa_size*: The number of letters in one symbol.\n",
"- *cut_size*: The number of distinct letters available (the alphabet size).\n",
"\n",
"The output of this code is a dictionary that maps every symbol (cluster) to a list of start indexes of the windows contained in that cluster."
"Now that we have cluster information, it's time to visualize it. We start off with a simple visualization in which, for each cluster, we can see all the data points of the raw data that belong to it."
"# Add the update function to the click event on each trace\n",
"for i in range( len(fig.data) ):\n",
" fig.data[i].on_click(update_trace) \n",
"\n",
"# Display the image\n",
"image"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
# Analysing Time Series using SAX
%% Cell type:markdown id: tags:
This notebook contains a prototype approach for analyzing time series data.
The main goals of these applications are to:
1. Provide an interactive view of the data (zoom in/out, show info on hover)
2. Cluster *windows* (consecutive series of data points) based on similarity
3. Visualise the clusters in some way

The applications are to be used in a large-scale data context, so clustering and retrieval of similar windows should be fast. Some algorithms compare the raw data of every window to every other window (e.g. using DTW), but this requires O(n^2) comparisons for n windows, which is too expensive for big data.

A faster way to cluster windows is to map each window to a fixed-length symbol. Because the length of the symbol is fixed, we can assign each window to a bucket keyed by its symbol. A cluster is then the set of all windows within the same bucket. When we receive a new window, we can simply compute its symbol and look up all similar windows.

One example of this is Symbolic Aggregate approXimation (SAX). SAX is used to transform a sequence of rational numbers (i.e., a time series) into a sequence of letters (i.e., a string).
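
To make the transformation concrete, here is a minimal sketch of how a single window could be turned into a SAX word using only the Python standard library. It is not the saxpy implementation: the function name `sax_word` is hypothetical, and the sketch assumes the window length is divisible by `paa_size`.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(window, paa_size, alphabet_size):
    """Convert one window of numbers into a SAX word of `paa_size` letters."""
    # 1. Z-normalise so windows are compared by shape rather than by level.
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        normed = [0.0] * len(window)
    else:
        normed = [(v - mu) / sigma for v in window]

    # 2. Piecewise Aggregate Approximation: average paa_size equal segments.
    #    Assumption of this sketch: len(window) is divisible by paa_size.
    seg = len(normed) // paa_size
    paa = [mean(normed[k * seg:(k + 1) * seg]) for k in range(paa_size)]

    # 3. Breakpoints that split the standard normal into equiprobable regions.
    cuts = [NormalDist().inv_cdf(j / alphabet_size)
            for j in range(1, alphabet_size)]

    # 4. Map every segment mean to the letter of the region it falls in.
    return "".join(chr(ord("a") + sum(c <= v for c in cuts)) for v in paa)


print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], paa_size=4, alphabet_size=4))  # abcd
```

Because of the normalisation step, a window with the same shape at a different scale (for example `[10, 20, ..., 80]`) yields the same word, which is what lets windows with similar shapes land in the same bucket.
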
%% Cell type:markdown id: tags:
We start off with a simple dataset containing weather information from 2013-2017. The code below shows a snippet of the dataset.

Next we cluster the windows using the **saxpy** library. The parameters for the clustering are as follows:
- *window_size*: The size (number of data points) of one window.
- *step_size*: The distance (number of data points) between consecutive windows. Unless window_size is very small, the window starting at data point *n* will be almost identical to the one starting at data point *n+1*. Because such near-duplicates are not very interesting for a data analyst, it is better to keep some distance between windows.
- *paa_size*: The number of letters in one symbol.
- *cut_size*: The number of distinct letters available (the alphabet size).

The output of this code is a dictionary that maps every symbol (cluster) to a list of start indexes of the windows contained in that cluster.
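
The shape of that dictionary can be illustrated with a small sketch. It assumes that `step_size` is the stride between the starts of consecutive windows. `bucket_windows` and `trend_symbol` are hypothetical helpers, and `trend_symbol` is only a toy stand-in for a real SAX symbol so that the example stays self-contained.

```python
from collections import defaultdict


def bucket_windows(series, window_size, step_size, to_symbol):
    """Group window start indexes by the symbol of each window."""
    buckets = defaultdict(list)
    # Slide over the series, advancing step_size points between windows.
    for start in range(0, len(series) - window_size + 1, step_size):
        window = series[start:start + window_size]
        buckets[to_symbol(window)].append(start)
    return dict(buckets)


def trend_symbol(window):
    """Toy stand-in for a SAX symbol: 'u' if the window rises, else 'd'."""
    return "u" if window[-1] > window[0] else "d"


series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4]
clusters = bucket_windows(series, window_size=3, step_size=2,
                          to_symbol=trend_symbol)
print(clusters)  # {'u': [0, 6], 'd': [2, 4]}

# Retrieving windows similar to a new one is a single dictionary lookup.
print(clusters.get(trend_symbol([5, 6, 9]), []))  # [0, 6]
```

Building the dictionary takes one pass over the windows, and retrieval does not require comparing the new window against every stored window.
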
Now that we have cluster information, it's time to visualize it. We start off with a simple visualization in which, for each cluster, we can see all the data points of the raw data that belong to it.
%% Cell type:code id: tags:
``` python
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

fig = go.FigureWidget()
# fig.add_trace(
#     go.Scatter(x=meantemp.index, y=meantemp.values)
# )
default_size = 5
delta_size = 2
default_opacity = 0.3

# Draw a trace for every symbol (which may contain multiple windows)
for symbol in sax_dict:
    xvalues = []
    yvalues = []
    for i in sax_dict[symbol]:
        xvalues.extend(meantemp.index[i:i+window_size])
        yvalues.extend(meantemp.values[i:i+window_size])
    fig.add_trace(
        go.Scatter(
            x=xvalues + xvalues,
            y=yvalues + [0]*len(yvalues),
            name=symbol,
            mode='markers',
            marker=dict(
                size=default_size,
                opacity=0.5,
            )
        )
    )

# Update the figure when clicked on a point, so we can highlight the clicked trace