Notes BigData
===============
Definition
Web Intelligence and Big Data course
Why Big Data
Hadoop Ecosystem
MapReduce
Miscellaneous
Analysis
Definition
===========
Ref: http://www.intel.com.au/content/www/au/en/bigdata/unstructureddataanalyticspaper.html
- All history until 2003: 5 exabytes
- 2003 to 2012: 2.7 zettabytes
- data generated by more sources and devices, including video
- data are UNSTRUCTURED: text, dates, facts. Traditional analytics handle structured data (RDBMS).
- Analytics = profit. Gartner survey: companies that use Big Data outperform competitors by 20%.

Web Intelligence and Big Data (WIBD) course
======================================
50 billion pages indexed by Google.
More surprising events make better news.
- if an event has probability p, then it carries
information = log_2(1/p) = -log_2(p) bits of information.
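A quick numeric check of the surprisal formula - a minimal Python sketch (the probabilities are made up):

    import math

    def surprisal_bits(p):
        # information content of an event with probability p, in bits
        return -math.log2(p)

    print(surprisal_bits(0.5))   # 1.0 bit: a fair coin flip
    print(surprisal_bits(0.01))  # ~6.64 bits: a rare, surprising event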
Mutual Information (MI) - between the transmitted and received ends of a channel
- need to maximise MI
- eg mutual information between Ad$ and Sales
- eg AdSense - given a webpage, guess its keywords.
IDF = Inverse Document Frequency
- rare words make better keywords.
- IDF of word w = log_2(N / N_w)
where N = total number of docs, N_w = number of docs containing the word w.
TF = Term Frequency
- number of times the term appears in that specific document.
- more frequent words (in that doc) make better keywords.
- TF = freq of w in doc d = n_w^d
TFIDF = TF x IDF = n_w^d x log_2(N/N_w)
- words with high TFIDF are good keywords.
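A minimal TF-IDF sketch in Python following the formulas above (the three-document corpus is invented):

    import math

    docs = ["big data needs new tools",
            "big data analytics drives profit",
            "rare words make good keywords"]
    corpus = [d.split() for d in docs]
    N = len(corpus)

    def tfidf(word, doc):
        tf = doc.count(word)                       # n_w^d: freq of w in doc d
        n_w = sum(1 for d in corpus if word in d)  # number of docs containing w
        return tf * math.log2(N / n_w)             # TF x IDF

    print(tfidf("big", corpus[0]))   # common word -> low score (~0.58)
    print(tfidf("rare", corpus[2]))  # rare word -> higher score (~1.58)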
Mutual Information between all pages and all words is proportional to
SUM_d SUM_w { n_w^d x log_2(N/N_w) }
Mutual Information: Input F -> Machine Learning -> Output B
Feature F, behaviour B; if they are independent, the MI is zero and F predicts nothing about B.
Entropy H(F), H(B)
Mutual Information I(F,B) = SUM_f SUM_b p(f,b) log_2 { p(f,b) / (p(f) p(b)) }
Shannon: I(F,B) = H(F) + H(B) - H(F,B)
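A small sketch computing I(F,B) directly from a joint distribution table (the joint probabilities are invented):

    import math

    # joint distribution p(f,b) over a binary feature F and behaviour B
    p = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

    pf = {f: p[(f, 0)] + p[(f, 1)] for f in (0, 1)}   # marginal p(f)
    pb = {b: p[(0, b)] + p[(1, b)] for b in (0, 1)}   # marginal p(b)

    I = sum(v * math.log2(v / (pf[f] * pb[b])) for (f, b), v in p.items())
    print(I)  # ~0.28 bits > 0: F and B are dependent; 0 would mean independent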
WIBD - Naive Bayes
===================
Consider the problem P(BUY / r,f,g,c), where r,f,g,c are features or keywords in web shopping.
Bayes Rule: P(B,R) = P(B/R).P(R) = P(R/B).P(B)
Naive Bayes assumes r,f,g,c are INDEPENDENT (given the class)
- can derive the likelihood ratio
         p(r/B) * p(c/B) * p(other features/B) * ... * p(B)
L = ---------------------------------------------------------------
     p(r/notB) * p(c/notB) * p(other features/notB) * ... * p(notB)
so if L > 1 we predict a BUY; if L < 1, no BUY.
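A minimal likelihood-ratio sketch of the Naive Bayes decision (the priors and conditional probabilities are invented; in practice they are estimated from counts):

    p_buy, p_not = 0.3, 0.7             # priors P(B), P(notB)
    p_f_buy = {"r": 0.8, "c": 0.6}      # P(r/B), P(c/B)
    p_f_not = {"r": 0.2, "c": 0.3}      # P(r/notB), P(c/notB)

    def likelihood_ratio(features):
        num, den = p_buy, p_not
        for f in features:              # independence: probabilities multiply
            num *= p_f_buy[f]
            den *= p_f_not[f]
        return num / den

    L = likelihood_ratio(["r", "c"])
    print("BUY" if L > 1 else "no BUY", L)   # L ~ 3.4 -> BUY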
WIBD - Learn
==============
input X = x1, x2, ..., xn (n-dimensional)
output y in {y1, y2, ..., ym}
function f(X) = E[Y/X], the conditional expectation
= y1*P(y1/X) + y2*P(y2/X) + ... + ym*P(ym/X)
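A tiny worked example of the expectation (the conditional probabilities are made up):

    ys = [0, 1]                  # possible outputs y1, y2
    p_y_given_x = [0.2, 0.8]     # P(y1/X), P(y2/X) for one particular X
    f_X = sum(y * p for y, p in zip(ys, p_y_given_x))
    print(f_X)                   # 0.8 = E[Y/X]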
Classification (video 5.2)
- eg X = {size, head, noise, legs}, Y = {animal names}
- eg X = {like, lot}, {hate, waste}, {not enjoy}, Y = {positive, negative}
Clustering - unsupervised
- allows us to get classes from the data. Need to choose the right features.
- used when we DON'T know the output classes to start with.
- by definition, clusters are regions MORE populated than random data would be.
- add uniformly random data with density P0(X); if r = P(X)/P0(X) is large, there is clustering;
then f(X) = E[Y/X] = r/(1+r), with y=1 for real data and y=0 for the added random uniform data.
- find things that go together to form a cluster. Eg negative sentiment: hate, bad - but no one needs to tell us they are negative to start with.
- other means of clustering: k-means, LSH (see the sketch below).
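A minimal k-means sketch with scikit-learn (assumes numpy and sklearn are installed; the 2-D points are invented):

    import numpy as np
    from sklearn.cluster import KMeans

    # two loose groups of 2-D points
    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)            # cluster id assigned to each point
    print(km.cluster_centers_)   # the two discovered centres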
Rules
- finding which features are related (correlated) to each other, ie trying to cluster the features instead of clustering the data.
- compare against data whose features are independent: P0(X) = P(x1) * P(x2) * ... * P(xn)
where x1 = chirping, x2 = 4-legged, etc; the xi are animal features
eg P(chirping) = number of chirping animals / total number of data points
- Association Rule Mining
if there are features A,B,C,D, we want to infer a rule, eg A,B,C => D
high support: P(A,B,C,D) > s; the technique is to find P(A) > s, P(B) > s etc first
high confidence: P(D/A,B,C) > c
high interestingness: P(D/A,B,C) / P(D) > i
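A small sketch of the support / confidence / interestingness computations over toy transactions (the data are invented):

    transactions = [{"A", "B", "C", "D"},
                    {"A", "B", "C"},
                    {"A", "B", "C", "D"},
                    {"B", "C"}]
    n = len(transactions)

    def support(items):
        # fraction of transactions containing all of `items`
        return sum(1 for t in transactions if items <= t) / n

    supp = support({"A", "B", "C", "D"})     # P(A,B,C,D) = 0.5
    conf = supp / support({"A", "B", "C"})   # P(D/A,B,C) ~ 0.67
    lift = conf / support({"D"})             # interestingness ~ 1.33
    print(supp, conf, lift)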
- Recommendation of books - customers are features of books and vice versa.
Use latent models: factor the (m x n) matrix as (m x k) TIMES (k x n)
eg people x books = (people x genre) TIMES (genre x books)
NNMF = Non-Negative Matrix Factorization
other example features: unemployment direction, interest rate direction, fraud
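A minimal latent-factor sketch using scikit-learn's NMF (assumes sklearn; the people x books ratings matrix is invented):

    import numpy as np
    from sklearn.decomposition import NMF

    # people x books ratings: 4 people, 5 books
    V = np.array([[5, 4, 0, 1, 0],
                  [4, 5, 0, 0, 1],
                  [0, 1, 5, 4, 5],
                  [1, 0, 4, 5, 4]], dtype=float)

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(V)   # people x genre (k = 2 latent factors)
    H = model.components_        # genre x books
    print(np.round(W @ H, 1))    # approximate reconstruction of V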
WIBD - Connect
===============
Logic Inference
"if A then B" is the SAME as "~A OR B"
Obama is president of USA: isPresidentOf(Obama, USA) - predicates and variables
IF X is president of C THEN X is leader of C: IF isPresidentOf(X,C) THEN isLeaderOf(X,C)
Query "If K then Q": the query means "~K OR Q" is TRUE,
which is the same as "K AND ~Q" being FALSE.
So proving K AND ~Q is FALSE (a contradiction) proves If K Then Q.
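A brute-force truth-table check of the equivalences above (plain Python, not a real theorem prover):

    for K in (False, True):
        for Q in (False, True):
            implication = (not K) or Q       # "if K then Q" as ~K OR Q
            refutation = not (K and not Q)   # "K AND ~Q is FALSE"
            assert implication == refutation
    print("'~K OR Q' and 'NOT (K AND ~Q)' agree on all truth assignments")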
WIBD - Prediction
==============
Linear Least Squares Regression:
data matrix X with entries x(i,j): i indexes the data points, j indexes the features; y_i is the outcome for the i-th point.
f(x) = E(y/X) is the function that minimizes the expected squared error E[(y - f(x))^2],
... so let f(x) = x^T.w, where w is the vector of unknown coefficients.
Take the vector derivative and equate to zero -> X^T.X.w - X^T.y = 0 (the normal equations).
R^2 is used to measure the fit of a linear regression.
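A minimal least-squares sketch with numpy (the data are invented); np.linalg.lstsq solves the normal equations X^T.X.w = X^T.y, and the last lines compute R^2:

    import numpy as np

    # 4 data points: an intercept column of ones plus 2 features
    X = np.array([[1, 1.0, 2.0],
                  [1, 2.0, 0.5],
                  [1, 3.0, 1.5],
                  [1, 4.0, 3.0]])
    y = np.array([3.1, 4.2, 6.0, 8.1])

    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # w minimises ||y - Xw||^2
    r2 = 1 - np.sum((y - X @ w)**2) / np.sum((y - y.mean())**2)
    print(w, r2)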
Non-linear correlation
- Logistic Regression - f(x) = 1 / (1 + exp(-w^T.x)) (sketch below)
- Support Vector Machines - data may be correlated at higher order, eg parabolic correlation etc.
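A tiny sketch of the logistic function from the formula above (the weights are invented):

    import math

    def logistic(w, x):
        # f(x) = 1 / (1 + exp(-w^T.x)): squashes a linear score into (0, 1)
        z = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-z))

    w = [0.8, -0.4]
    print(logistic(w, [2.0, 1.0]))   # ~0.77 > 0.5 -> predict class 1
    print(logistic(w, [0.0, 3.0]))   # ~0.23 < 0.5 -> predict class 0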
Neural Networks
- linear least squares
- non-linear, like logistic
- feed-forward, multi-layer
- feedback, like belief networks
Which prediction technique?
FEATURES  TARGET  CORRELATION          TECHNIQUE
num       num     stable/linear        Linear Regression
cat       num     -                    Linear Regression, Neural Networks
num       num     unstable/non-linear  Neural Networks
num       cat     stable/linear        Logistic Regression
num       cat     unstable/non-linear  Support Vector Machines
cat       cat     -                    Support Vector Machines, Naive Bayes, other Probabilistic Graphical Models
Why Big Data
==============
eg why Google (MapReduce), Yahoo (Pig), Facebook (Hive) had to invent a new stack
Challenges
1. Fault tolerance
2. Variety of data types, eg images, videos, music
3. Managing data volumes without archiving. Traditional systems need archives.
4. Parallelism was an add-on
Disadvantages of the traditional stack
1. Could not scale
2. Not suited for compute-intensive deep analytics, eg in the web world
3. Price-performance challenge: the new stack uses commodity hardware and open source
Hadoop Ecosystem (See NotesHadoop)
===================================
MapReduce (See NotesHadoop)
=============
Miscellaneous
==============
About our speaker: Ross is Chief Data Scientist at Teradata and currently works with major clients throughout Australia and New Zealand to help them exploit the value of 'big data'. He specializes in deployments involving non-relational, semi-structured data and analyses such as path analysis, text analysis and social network analysis. Previously, Ross was deputy headmaster of John Colet School for 18 years before working as a SAS analyst, a business development manager at Minitab Statistical Software, and founder and lead analyst at datamilk.com.
Ross Farrelly has a BSc (hons 1st class) in pure mathematics from Auckland University, a Masters in Applied Statistics from Macquarie University and a Masters of Applied Ethics from the Australian Catholic University.
Analysis
=========
path analysis
text analysis
social network analysis
natural language processing