Seq2Seq Series (Part 1): High-Dimensional Time Series Forecasting with Neural Networks

This notebook demonstrates, using Python/Keras, how seq2seq neural networks can be used for time series forecasting, particularly for high-dimensional time series: scenarios where a large number (100,000+) of series must be forecast simultaneously. The biggest advantage of neural networks over traditional sequence models such as ARIMA is that they do not require building large numbers of fine-tuned, per-series model parameters.

The data used here is the Kaggle Wikipedia page traffic dataset. The corresponding competition asks for forecasts 60 days into the future, but here we only predict 14 days ahead. The encoding stage of the model will, however, use the full history of every series in "train_1.csv". The focus of this post is a relatively simple implementation of the core seq2seq architecture, rather than training an optimal model (see the later posts in this series for that). The outline:

  1. Loading and inspecting the data
  2. Formatting the data
  3. Modeling - Training Architecture
  4. Modeling - Inference Architecture
  5. Generating and visualizing predictions

Loading and Inspecting the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

df = pd.read_csv('../data/train_1.csv')
df.head()

As the preview shows, the data contains a large number of NaN values; for convenience, we will fill these with 0 later on.
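Before filling, it can be useful to see how much data is actually missing. A minimal sanity-check sketch (my addition, not from the original notebook), using the df loaded above:

# Count NaN entries across the date columns (column 0 is 'Page').
n_missing = df.iloc[:, 1:].isnull().sum().sum()
n_total = df.iloc[:, 1:].size
print('%d of %d entries are NaN (%.1f%%)' % (n_missing, n_total, 100. * n_missing / n_total))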

df.info()

data_start_date = df.columns[1]
data_end_date = df.columns[-1]
print('Data ranges from %s to %s' % (data_start_date, data_end_date))

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to 2016-12-31
dtypes: float64(550), object(1)
memory usage: 609.8+ MB

Data ranges from 2015-07-01 to 2016-12-31

Next we define a function that randomly selects a few pages and plots their series, as shown in the figure below. To keep the selected series on comparable scales, a log1p transform is applied first: each value \(x\) in a series is mapped to \(\log(1+x)\).

def plot_random_series(df, n_series):
    
    sample = df.sample(n_series, random_state=8)
    page_labels = sample['Page'].tolist()
    series_samples = sample.loc[:,data_start_date:data_end_date]
    
    plt.figure(figsize=(10,6))
    
    for i in range(series_samples.shape[0]):
        np.log1p(pd.Series(series_samples.iloc[i]).astype(np.float64)).plot(linewidth=1.5)
    
    plt.title('Randomly Selected Wikipedia Page Daily Views Over Time (Log(views) + 1)')
    plt.legend(page_labels)

plot_random_series(df, 6)
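A side note on the transform (my addition): np.log1p has an exact inverse, np.expm1, which is what will let us map log-scale predictions back to raw page views later. A quick sketch:

# log1p maps x to log(1 + x); expm1 maps it back exactly.
# Note that log1p(0) == 0, so zero-filled days stay at zero.
x = np.array([0., 1., 10., 1000.])
y = np.log1p(x)
assert np.allclose(np.expm1(y), x)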

Formatting the Data

Unfortunately, we cannot hand this dataframe to Keras as-is; we need several transformation steps to produce a set of numpy arrays that Keras can consume. Before that, we first have to work out how to properly split the time series into encoding intervals and decoding intervals for training and validating the model.

We use walk-forward validation to split the data into training and validation sets: fix one time window as the training set, then shift it forward by one interval (here, 14 days) to get the validation set, so that we measure performance on data the model has not seen. Artur Suilin drew a very clear diagram comparing this validation scheme with the traditional side-by-side scheme. I highly recommend reading through his entire repo: he trained a top-ranked seq2seq model on this same Wikipedia page traffic data and earned a gold medal in the competition.

Splitting the Series for Training and Validation

We split the data into the following four segments:

  1. Training encoding interval
  2. Training decoding interval (training targets, 14 days long)
  3. Validation encoding interval
  4. Validation decoding interval (validation targets, 14 days long)

Translator's note: segments 1, 2, 3, 4 above correspond to x_train, y_train, x_test, y_test.

from datetime import timedelta

pred_steps = 14
pred_length=timedelta(pred_steps)

first_day = pd.to_datetime(data_start_date)
last_day = pd.to_datetime(data_end_date)

val_pred_start = last_day - pred_length + timedelta(1)
val_pred_end = last_day

train_pred_start = val_pred_start - pred_length
train_pred_end = val_pred_start - timedelta(days=1)

enc_length = train_pred_start - first_day

train_enc_start = first_day
train_enc_end = train_enc_start + enc_length - timedelta(1)

val_enc_start = train_enc_start + pred_length
val_enc_end = val_enc_start + enc_length - timedelta(1)

print('Train encoding:', train_enc_start, '-', train_enc_end)
print('Train prediction:', train_pred_start, '-', train_pred_end, '\n')
print('Val encoding:', val_enc_start, '-', val_enc_end)
print('Val prediction:', val_pred_start, '-', val_pred_end)

print('\nEncoding interval:', enc_length.days)
print('Prediction interval:', pred_length.days)

Output:

Train encoding: 2015-07-01 00:00:00 - 2016-12-03 00:00:00
Train prediction: 2016-12-04 00:00:00 - 2016-12-17 00:00:00

Val encoding: 2015-07-15 00:00:00 - 2016-12-17 00:00:00
Val prediction: 2016-12-18 00:00:00 - 2016-12-31 00:00:00

Encoding interval: 522
Prediction interval: 14

Formatting the Data for Keras

With the interval dates in hand, we can write the functions that convert the data into a format suitable for Keras. The steps:

  1. Convert the time series to a numpy array, using a date_to_index variable to store the mapping between dates and array indices.
  2. Write a function get_time_block_series() that extracts a specific time block from all of the series.
  3. Write functions that preprocess all of the series:
    • Smooth out the scale of the series with a log1p transform followed by subtracting each series' mean, then reshape into the (n_series, n_timesteps, n_features) tensor format that Keras expects.
    • Also write an inverse-transform function that maps log-scale predictions back to real-scale predictions (a sketch of this inverse appears after the code block below).

# 1.
date_to_index = pd.Series(index=pd.Index([pd.to_datetime(c) for c in df.columns[1:]]),
                          data=[i for i in range(len(df.columns[1:]))])
series_array = df[df.columns[1:]].values

# 2.
def get_time_block_series(series_array, date_to_index, start_date, end_date):
    
    inds = date_to_index[start_date:end_date]
    return series_array[:,inds]

# 3.
def transform_series_encode(series_array):
    
    series_array = np.log1p(np.nan_to_num(series_array)) # filling NaN with 0
    series_mean = series_array.mean(axis=1).reshape(-1,1)
    series_array = series_array - series_mean
    series_array = series_array.reshape((series_array.shape[0],series_array.shape[1], 1))
    
    return series_array, series_mean

def transform_series_decode(series_array, encode_series_mean):
    
    series_array = np.log1p(np.nan_to_num(series_array)) # filling NaN with 0
    series_array = series_array - encode_series_mean
    series_array = series_array.reshape((series_array.shape[0],series_array.shape[1], 1))
    
    return series_array
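The inverse-transform function mentioned in step 3 does not appear in the original code cell. Here is a minimal sketch of what it could look like (the name untransform_series is my own), assuming predictions of shape (n_series, n_steps, 1) and the per-series means returned by transform_series_encode:

def untransform_series(pred_array, encode_series_mean):
    # Undo the mean-centering, then invert log1p with expm1
    # to recover predictions on the original page-view scale.
    pred_array = pred_array.reshape(pred_array.shape[0], pred_array.shape[1])
    return np.expm1(pred_array + encode_series_mean)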

Modeling - Training Architecture

The architecture and code in this section are adapted from the Keras blog introduction to seq2seq. Chollet's example demonstrates classic seq2seq applied to machine translation, and the steps we implement here are very similar. During training we use teacher forcing: the true target values, lagged by one time step, are fed as inputs to the decoder. Intuitively, we are teaching the network to predict the next time step by fitting it against the true previous time steps.
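To make the lagging concrete, here is a toy illustration (mine, not from the original post) of how a teacher-forcing decoder input is built from a target sequence:

# Toy example: the encoder input ends at value 0.5 and the target
# sequence is [1, 2, 3, 4]. The decoder input is the target shifted
# right by one step, with the last encoder value in the first slot:
#   decoder input : [0.5, 1, 2, 3]
#   decoder target: [1,   2, 3, 4]
target = np.array([1., 2., 3., 4.]).reshape(1, 4, 1)
dec_input = np.zeros(target.shape)
dec_input[:, 1:, 0] = target[:, :-1, 0]
dec_input[:, 0, 0] = 0.5   # last encoder value
print(dec_input[0, :, 0])  # [0.5 1. 2. 3.]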

The architecture is shown in the figure below (again from Artur Suilin), except that we use LSTMs rather than GRUs.

from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.optimizers import Adam

Translator's note: if your TensorFlow version is 2.0 or above, replace keras.xxx with tensorflow.keras.xxx in the imports.

latent_dim = 50 # LSTM hidden units
dropout = .20

# Define an input series and encode it with an LSTM.
encoder_inputs = Input(shape=(None, 1))
encoder = LSTM(latent_dim, dropout=dropout, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the final states. These represent the "context"
# vector that we use as the basis for decoding.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
# This is where teacher forcing inputs are fed in.
decoder_inputs = Input(shape=(None, 1))

# We set up our decoder using `encoder_states` as initial state.
# We return full output sequences and return internal states as well.
# We don't use the return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, dropout=dropout, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

decoder_dense = Dense(1) # 1 continuous output at each timestep
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Output:

Model: "model"
________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
========================================================================================
input_1 (InputLayer) [(None, None, 1)] 0
________________________________________________________________________________________
input_2 (InputLayer) [(None, None, 1)] 0
________________________________________________________________________________________
lstm (LSTM) [(None, 50), (None, 10400 input_1[0][0]
________________________________________________________________________________________
lstm_1 (LSTM) [(None, None, 50), ( 10400 input_2[0][0]
lstm[0][1]
lstm[0][2]
________________________________________________________________________________________
dense (Dense) (None, None, 1) 51 lstm_1[0][0]
========================================================================================
Total params: 20,851
Trainable params: 20,851
Non-trainable params: 0
________________________________________________________________________________________

With the architecture defined, we can start training. Without a decent GPU, this may take quite a few minutes.

We use the functions defined earlier to process the series data and feed it to the model, training with mean absolute error (MAE) as the loss function. You can also improve the model by using more data and tuning hyperparameters (learning rate, number of epochs, etc.).

first_n_samples = 20000
batch_size = 2**11
epochs = 100

# sample of series from train_enc_start to train_enc_end
encoder_input_data = get_time_block_series(series_array, date_to_index,
                                           train_enc_start, train_enc_end)[:first_n_samples]
encoder_input_data, encode_series_mean = transform_series_encode(encoder_input_data)

# sample of series from train_pred_start to train_pred_end
decoder_target_data = get_time_block_series(series_array, date_to_index,
                                            train_pred_start, train_pred_end)[:first_n_samples]
decoder_target_data = transform_series_decode(decoder_target_data, encode_series_mean)

# lagged target series for teacher forcing
decoder_input_data = np.zeros(decoder_target_data.shape)
decoder_input_data[:,1:,0] = decoder_target_data[:,:-1,0]
decoder_input_data[:,0,0] = encoder_input_data[:,-1,0]

model.compile(Adam(), loss='mean_absolute_error')
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2)

Output:

Train on 16000 samples, validate on 4000 samples
Epoch 1/100
16000/16000 [==============================] - 44s 3ms/sample - loss: 0.6736 - val_loss: 0.5213
Epoch 2/100
16000/16000 [==============================] - 38s 2ms/sample - loss: 0.5390 - val_loss: 0.4113
Epoch 3/100
16000/16000 [==============================] - 37s 2ms/sample - loss: 0.4568 - val_loss: 0.3751
...
Epoch 99/100
16000/16000 [==============================] - 38s 2ms/sample - loss: 0.3007 - val_loss: 0.2858
Epoch 100/100
16000/16000 [==============================] - 35s 2ms/sample - loss: 0.3003 - val_loss: 0.2853

Plot the loss convergence curves for the training and validation sets.

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error Loss')
plt.title('Loss Over Time')
plt.legend(['Train','Valid'])
plt.show();

[Figure: training and validation MAE loss over epochs]

Modeling - Inference Architecture

We are not done with the architecture yet. Next we need to build an inference model in Keras that uses the trained network to generate predictions. In short, the model first encodes the input sequence, then generates predictions one step at a time: the decoder receives its initial state vectors from the encoder, and each time it emits a prediction for one time step, its state vectors are updated and carried into the next step.

# from our previous model - mapping encoder sequence to state vectors
encoder_model = Model(encoder_inputs, encoder_states)

# A modified version of the decoding stage that takes in predicted target inputs
# and encoded state vectors, returning predicted target outputs and decoder state vectors.
# We need to hang onto these state vectors to run the next step of the inference loop.
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                      [decoder_outputs] + decoder_states)

def decode_sequence(input_seq):
    
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, 1))
    
    # Populate the first target sequence with end of encoding series pageviews
    target_seq[0, 0, 0] = input_seq[0, -1, 0]
    
    # Sampling loop for a batch of sequences - we will fill decoded_seq with predictions
    # (to simplify, here we assume a batch of size 1).
    
    decoded_seq = np.zeros((1,pred_steps,1))
    
    for i in range(pred_steps):
        
        output, h, c = decoder_model.predict([target_seq] + states_value)
        
        decoded_seq[0,i,0] = output[0,0,0]
        
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, 1))
        target_seq[0, 0, 0] = output[0,0,0]
        
        # Update states
        states_value = [h, c]
    
    return decoded_seq
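A quick usage note (my addition): decode_sequence expects a single series with its batch dimension kept, so slice with a range rather than a plain integer index, e.g.:

# Hypothetical usage: predict 14 steps for the series at index 100.
sample_pred = decode_sequence(encoder_input_data[100:101])   # shape (1, pred_steps, 1)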

Generating and Visualizing Predictions

Everything is now in place to make predictions on encoder/target series pairs that were not used in training (translator's note: i.e., the X_test/y_test pairs). We first build the validation set (recall that this data is shifted forward 14 days in time), then define a predict_and_plot() function that takes it as input, generates predictions, and plots the tail end of the encoder series along with the true target series and the predicted target series. This lets us eyeball the quality of the predictions.

encoder_input_data = get_time_block_series(series_array, date_to_index, val_enc_start, val_enc_end)
encoder_input_data, encode_series_mean = transform_series_encode(encoder_input_data)

decoder_target_data = get_time_block_series(series_array, date_to_index, val_pred_start, val_pred_end)
decoder_target_data = transform_series_decode(decoder_target_data, encode_series_mean)

def predict_and_plot(encoder_input_data, decoder_target_data, sample_ind, enc_tail_len=50):
    
    encode_series = encoder_input_data[sample_ind:sample_ind+1,:,:]
    pred_series = decode_sequence(encode_series)
    
    encode_series = encode_series.reshape(-1,1)
    pred_series = pred_series.reshape(-1,1)
    target_series = decoder_target_data[sample_ind,:,:1].reshape(-1,1)
    
    encode_series_tail = np.concatenate([encode_series[-enc_tail_len:],target_series[:1]])
    x_encode = encode_series_tail.shape[0]
    
    plt.figure(figsize=(10,6))
    
    plt.plot(range(1,x_encode+1),encode_series_tail)
    plt.plot(range(x_encode,x_encode+pred_steps),target_series,color='orange')
    plt.plot(range(x_encode,x_encode+pred_steps),pred_series,color='teal',linestyle='--')
    
    plt.title('Encoder Series Tail of Length %d, Target Series, and Predictions' % enc_tail_len)
    plt.legend(['Encoding Series','Target Series','Predictions'])

From the line plots below, we can see that the model's predictions effectively anticipate some of the trends and know when the series should level off.

That said, the predictions are clearly too conservative, failing to capture much of the variation in the data. The model could likely be improved by training on more samples, adjusting the network architecture or other hyperparameters, or training for more epochs.

For fancier architectures and more expressive predictions, see the second post in this series.

predict_and_plot(encoder_input_data, decoder_target_data, 100)

[Figure: encoder series tail, target series, and predictions for sample 100]

predict_and_plot(encoder_input_data, decoder_target_data, 6007)

[Figure: encoder series tail, target series, and predictions for sample 6007]

predict_and_plot(encoder_input_data, decoder_target_data, 33000)

[Figure: encoder series tail, target series, and predictions for sample 33000]


This post is a translation, not my original work; all credit goes to the original author: Joseph Eddy (data scientist, teacher, debate coach).

Original notebook on GitHub: An Introduction to High-Dimensional Time Series Forecasting with Neural Networks

If anything in the translation is inaccurate, corrections are welcome.
