analysis.Rmd

---
title: "Analysis_Music_Challenge"
author: "Jaime Gacitua - ADS"
date: "November 11, 2016"
output:
  html_notebook: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(plotly)
library(rhdf5)
library(dplyr)
source('./lib/features.R')
#rm(list=ls(all=TRUE))
```

Extract and Perform High level analysis of song lyrics data
```{r, warning=FALSE, message=FALSE, error=FALSE}
# Extract lyrics data
common_id <- read_csv('./data/common_id.txt', col_names = FALSE)
load('./data/lyr.RData')

# Explore lyrics words
frequencies <- data.frame('word' = names(lyr[,-1]), 'freq' = colSums(lyr[,-1]) )
frequencies <- frequencies[order(frequencies$freq, decreasing = TRUE),]
frequencies$word <- factor(frequencies$word, levels = frequencies$word)
frequencies$pct <- frequencies$freq / sum(frequencies$freq)

a <- plot_ly(data = frequencies, x = ~word, y = ~pct, type = 'scatter', mode = 'markers',
        text = ~word) %>%
  layout(title = "Word frequency in songs | Percentage",
         scene = list(
           xaxis = list(title = "Word",
                        type = 'category',
                        categoryorder = 'array',
                        categoryarray = ~word), 
           yaxis = list(title = "Percentage")))

head(frequencies, n = 50)


a

```

## Understanding the music features

```{r}

###########################################
# Understanding features of  h5 files   ###
# More info in:                         ###
# http://labrosa.ee.columbia.edu/millionsong/pages/example-track-description
###########################################

#h5f <- h5dump('./data/data/A/A/A/TRAAABD128F429CF47.h5', load=TRUE)

h5f <- h5dump('./data/TestSongFile100/testsong1.h5', load=TRUE)
analysis.test <- h5f$analysis
data.test <- h5f$data
metadata.test <- h5f$metadata
musicbrainz.test <- h5f$musicbrainz

# Tempo could help. Seems to fit with beats
beat.diff <- diff(analysis.test$beats_start,1)
hist(beat.diff)
bpm.test <- mean(60 / beat.diff)
bpm.test
tempo.test <- analysis.test$songs$tempo
tempo.test

bpm.var <- var(60 / beat.diff)
bpm.var

# Tatum
tatums.diff <- diff(analysis.test$tatums_start,1)
hist(tatums.diff)
tatums.mean <- mean(tatums.diff)
tatums.var <- var(tatums.diff)
tatums.mean
tatums.var


analysis.test$songs$duration
max(analysis.test$beats_start)+mean(beat.diff)

mean(analysis.test$segments_loudness_max)
var(analysis.test$segments_loudness_max)
analysis.test$songs$loudness

c(analysis.test$segments_timbre)  %>% summary()

a <- apply(analysis.test$segments_timbre, 1, function(x){
  median(x)
} )
a <- t(a)

adf <- data.frame(t(a))


names(analysis.test)


H5close()


#####
# ETL features
#####

### Extract + transform Features


dir.h5 <- './data/data/'
files.list <- as.matrix(list.files(dir.h5, recursive = TRUE))
                        
#song.features <- get.features(files.list, dir.h5)


```

## Logistic Regression over each Word

```{r}
### Logistic regression over each word

## 1 Get word column
word.col <- lyr[,c(1, 500)]
word.col$bonita1 <- 0
word.col$bonita1[word.col$bonita >= 1] <- 1

names(word.col)[1] <- "song"

## 2 Merge column with features
word.col.and.features <- inner_join(word.col, song.features.df, by='song')

## Perform logistic regression
model <- glm(bonita ~ tempo, family = binomial(link = 'logit'), data = word.col.and.features)

## Understand output
summary(model)

anova(model, test = 'Chisq')
```

Calculate probability of occurrence of word in each song. Test algorighm, more than measuring prediction error.

Below songs that have a high prediction, with bonita, and what is their corresponding tempo.

```{r}
predictor <- data.frame(tempo = word.col.and.features$tempo)
fitted.results <- predict(model, newdata = predictor, type = 'response')
result = cbind(word.col.and.features, fitted.results)
head(result)

a <- filter(result, result$bonita1 == 1 | fitted.results == max(fitted.results))
a
```

There are some songs without tempo, those have a higher probability! Will research one of them.

```{r}
song.no.tempo <- "TRATCDG128F932B4D2"

song.features.df[song.features.df$song == "TRATCDG128F932B4D2",]

h5f <- h5dump('./data/data/A/T/C/TRATCDG128F932B4D2.h5', load=TRUE)

analysis.test <- h5f$analysis
data.test <- h5f$data
metadata.test <- h5f$metadata
musicbrainz.test <- h5f$musicbrainz

# Tempo could help. Seems to fit with beats
analysis.test$beats_start
analysis.test$songs$tempo

# Tatum
analysis.test$tatums_start

H5close()
```

There is no data for a few songs. Will have to do missing data management.