Predicting Bitcoin with News using R

The dataset contains a daily summary of Bitcoin prices, where the CHANGE column is the percentage change of the last price of the day (PRICE) with respect to the first (OPEN).
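As a quick sanity check, here is that percentage change computed by hand, with made-up OPEN and PRICE values:

open_price <- 10000   # hypothetical OPEN
last_price <- 10250   # hypothetical PRICE
change <- (last_price - open_price) / open_price * 100
change  # 2.5 (percent)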

Goal: To make things simple, we’ll focus on predicting whether the price will rise (change > 0) or fall (change ≤ 0) on the following day, so we could potentially use the predictions ‘in real life’.

  • Have R 3.2.0 or later installed
  • Install the packages ‘openblender’, ‘randomForest’ and ‘MLmetrics’
install.packages("openblender")
install.packages("randomForest")
install.packages("MLmetrics")

Let’s import the libraries we’ll use:

library(openblender)
library(randomForest)
library(MLmetrics)

Now let’s pull the data through the OpenBlender API.

First, we’ll define the parameters (in this case it’s just the id of the Bitcoin dataset):

parameters <- list(
token='YOUR_TOKEN_HERE',
id_dataset='5d4c3af79516290b01c83f51'
)
  • Note: You need to create an account on openblender.io (free) and get your token in the ‘Account’ tab.

Now let’s pull the data into a Dataframe ‘df’:

action <- 'API_getObservationsFromDataset'
response <- openblender::call(action, parameters)
df <- response$sample
head(df)

Note: The values may vary as this dataset is updated daily.

First, we need to create our prediction target: whether ‘change’ will increase or decrease. To do this, let’s add a target threshold for success over zero to our parameters. And since we want to predict observations for ‘tomorrow’, we can’t use information from the same day, so let’s also add a one-period lag on the target:

parameters <- list(
token="YOUR_TOKEN_HERE",
id_dataset="5d4c3af79516290b01c83f51",
target_threshold=list(feature="change", success_thr_over=0),
lag_target_feature=list(feature="change_over_0", periods=1)
)
df <- openblender::call(action, parameters)$sample

Two things happened here.

  1. The ‘change’ feature was replaced by a new feature: ‘TARGET_change_over_0’, which is 1 if ‘change’ was positive and 0 otherwise.
  2. That feature was aligned with the data from the previous period (day).
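For intuition, here is a minimal local sketch of those two transformations in plain R, assuming a raw ‘change’ column and rows sorted oldest to newest (the API does all of this server-side):

df$change_over_0 <- as.integer(df$change > 0)           # 1 if the day's change was positive, 0 otherwise
df$TARGET_change_over_0 <- c(df$change_over_0[-1], NA)  # align each row with the NEXT day's outcome
df$change <- NULL                                       # drop the same-day feature to avoid leakage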

‘TARGET_change_over_0’ will be our target for machine learning.

Now, let’s take a look at correlations:

target_variable <- "TARGET_change_over_0"
# Correlate the numeric price features with the target
corr <- cor(df[, 2:7], df[, target_variable], method="pearson")
View(corr[order(corr), , drop=FALSE])  # sorted from most negative to most positive

The price features are essentially uncorrelated with the target and unlikely to be useful on their own.

After searching for correlations on OpenBlender, I found a dataset of Fox Business News headlines that helps generate good predictions for our specific target.

What we want is a way to convert the news titles into numerical features by counting repetitions of words and groups of words per news item, and then time-blend those counts with our Bitcoin dataset. This is simpler than it sounds.
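To see the counting idea, here is a toy version of that step in base R, with made-up headlines (the vectorizer additionally builds 2-grams such as ‘bitcoin price’):

titles <- c("Bitcoin price surges again", "Bitcoin price falls sharply")
tokens <- unlist(strsplit(tolower(titles), "\\s+"))  # split each headline into lowercase words
table(tokens)                                        # 1-gram counts across the headlines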

First, we need to create a TextVectorizer for the ‘title’ feature of the news:

action_vect <- "API_createTextVectorizer"
vect_params <- list(
token="YOUR_TOKEN_HERE",
name="Fox Business TextVectorizer",
sources=list(list(id_dataset="5d571f9e9516293a12ad4f6d",
features=list('title'))),
ngram_range=list(min=1, max=2),
language="en",
remove_stop_words="on",
min_count_limit=2
)
resp_vect <- openblender::call(action_vect, vect_params)
resp_vect

The TextVectorizer was created, and with this configuration it generated 2753 n-grams. We’ll use the generated id_textVectorizer later: 5df017779516296bfe384ce5 (this one is mine; you will have to use yours).

In the ‘vect_params’ above, we specified the following:

  • name: We’ll name it ‘Fox Business TextVectorizer’
  • sources: The id of the dataset and the names of the features to include as source (in this case only ‘title’)
  • ngram_range: The min and max length of the set of words which will be tokenized
  • language: English
  • remove_stop_words: So it eliminates stop-words from the source
  • min_count_limit: The minimum number of repetitions for a token to be included (one-time occurrences rarely help)

Now we want to time-blend the vectorized features with our Bitcoin data. This basically means joining the two datasets using the timestamp as the key (a manual sketch of the idea follows the parameter list below). Let’s add the blend to our original parameters for pulling data:

blend_action <- "API_getObservationsFromDataset"
blend_params <- list(
token="YOUR_TOKEN_HERE",
id_dataset="5d4c3af79516290b01c83f51",
target_threshold=list(feature="change", success_thr_over=0),
lag_target_feature=list(feature="change_over_0", periods=1),
blends=list(
list(
id_blend="YOUR_ID_TEXTVECTORIZER_HERE",
blend_type="text_ts",
restriction="predictive",
blend_class="closest_observation",
specifications=list(time_interval_size=3600*12)
)
),
date_filter=list(start_date="2019-08-20T16:59:35.825Z", end_date="2019-11-04T17:59:35.825Z"),
drop_non_numeric=1
)
resp_blend <- openblender::call(blend_action, blend_params)
df_blend <- resp_blend$sample
View(df_blend)

We got a very big dataframe with over 2,000 features, one for each n-gram, plus our original target. Everything is aligned and ready for ML.

What we’re specifying above is the following:

  • id_blend : The id from our textVectorizer
  • blend_type : ‘text_ts’ so it knows it’s a text and timestamp blend
  • restriction : ‘predictive’, so that it doesn’t blend news from the future into each observation, only news that happened before
  • blend_class : ‘closest_observation’, so that it blends the observations closest in time
  • specifications : the maximum time into the past from which observations will be brought, in seconds; in this case 12 hours (3600*12). This simply means every Bitcoin price observation is predicted with news from the preceding 12 hours
  • date_filter : I specified the 4th of November as ‘end_date’ because that’s the day I wrote this, but you can change it to the date you’re reading it
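Here is the manual sketch promised above: a self-contained, hedged illustration of a ‘predictive’ 12-hour time-blend, using made-up unix timestamps and a single hypothetical n-gram count column (the API does all of this for you):

price <- data.frame(timestamp = c(1572800000, 1572886400))             # two price observations
news  <- data.frame(timestamp = c(1572790000, 1572795000, 1572880000), # three headlines
                    count_bitcoin = c(1, 2, 1))                        # per-headline n-gram counts
window <- 3600 * 12                                                    # 12 hours, in seconds
# For each price timestamp t, aggregate only news from (t - 12h, t]; never from the future
price$count_bitcoin_12h <- sapply(price$timestamp, function(t)
  sum(news$count_bitcoin[news$timestamp > t - window & news$timestamp <= t]))
price  # the new column holds 3 and 1: counts from the preceding 12 hours only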

Now we finally have the cleansed dataset exactly as we need it with the lagged target and the blended numerical data.

Let’s take a look at the top correlations with ‘TARGET_change_over_0’:

y <- df_blend[, "TARGET_change_over_0"]
df_blend$TARGET_change_over_0 <- NULL
x <- df_blend
View(cor(x, y))

Top negatively / top positively correlated n-grams

There are several correlated features now. Let’s separate our train and test sets chronologically, so we can train with previous observations and test on future ones.

# The observations are sorted newest-first, so the first 29% of rows are the most
# recent ones; we hold those out for testing and train on the older remainder.
div <- as.integer(round(nrow(x) * 0.29))
x_test <- x[1:div, ]
y_test <- y[1:div]
x_train <- x[(div + 1):nrow(x), ]
y_train <- y[(div + 1):length(y)]
dim(x_test)
length(y_test)
dim(x_train)
length(y_train)

We have 41 observations to train with and 17 to test with.

# y_train is numeric (0/1), so randomForest fits a regression forest;
# the predictions will be floats that we threshold at 0.5 below
rf <- randomForest(x = x_train, y = y_train, ntree = 1000)
y_pred <- predict(rf, newdata = x_test)
df_res <- as.data.frame(list(
y_test=y_test,
y_pred=y_pred
))
head(df_res)

Our real ‘y_test’ is binary, but our predictions are floats, so let’s round them: if a prediction is above 0.5 we call it a price increase, otherwise a decrease.

threshold <- 0.5
preds <- ifelse(df_res$y_pred > threshold, 1, 0)  # round the regression output to binary classes
AUC(preds, y_test)
ConfusionMatrix(preds, y_test)
Accuracy(preds, y_test)

We got 64.7% of the predictions correct (11 of 17), with a 0.53 AUC to back it.

  • 10 times we predicted a decrease and it decreased (correct)
  • 5 times we predicted a decrease and it increased (wrong)
  • 1 time we predicted an increase and it decreased (wrong)
  • 1 time we predicted an increase and it increased (correct)

This tutorial was for educational purposes, thank you for reading.

