Notes BigData
===============
Definition
Web Intelligence and Big Data course
Why Big Data
Hadoop Ecosystem
MapReduce
Miscellaneous
Analysis
Definition
===========
Ref: http://www.intel.com.au/content/www/au/en/bigdata/unstructureddataanalyticspaper.html
- All history until 2003: 5 exabytes
- 2003 to 2012: 2.7 zettabytes
- data generated by more sources and devices, including video
- data are UNSTRUCTURED: text, dates, facts. Traditional analytics handle structured data (RDBMS).
- Analytics = profit. Gartner survey: companies that use Big Data outperform competitors by 20%.

Web Intelligence and Big Data (WIBD) course
======================================
50 billion pages indexed by Google.
More surprising events make better news.
- if an event has probability p, then it carries
information = log_2(1/p) = -log_2(p) bits of information.
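A quick numeric check of the surprisal formula - a minimal Python sketch (the probabilities are made up):

    import math

    def surprisal_bits(p):
        # information content of an event with probability p, in bits
        return -math.log2(p)

    print(surprisal_bits(0.5))   # 1.0 bit: a fair coin flip
    print(surprisal_bits(0.01))  # ~6.64 bits: a rare, surprising event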
Mutual Information (MI) - between the transmitted and received ends of a channel
- need to maximise MI
- eg mutual information between Ad$ and Sales
- eg AdSense - given a webpage, guess its keywords.
IDF = Inverse Document Frequency
- rare words make better keywords.
- IDF of word w = log_2(N / N_w)
where N = total number of docs, N_w = number of docs containing the word w.
TF = Term Frequency
- number of times the term appears in that specific document.
- more frequent words (in that doc) make better keywords.
- TF = freq of w in doc d = n_w^d
TFIDF = TF x IDF = n_w^d x log_2(N/N_w)
- words with high TFIDF are good keywords.
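A minimal TF-IDF sketch in Python following the formulas above (the three-document corpus is invented):

    import math

    docs = ["big data needs new tools",
            "big data analytics drives profit",
            "rare words make good keywords"]
    corpus = [d.split() for d in docs]
    N = len(corpus)

    def tfidf(word, doc):
        tf = doc.count(word)                       # n_w^d: freq of w in doc d
        n_w = sum(1 for d in corpus if word in d)  # number of docs containing w
        return tf * math.log2(N / n_w)             # TF x IDF

    print(tfidf("big", corpus[0]))   # common word -> low score (~0.58)
    print(tfidf("rare", corpus[2]))  # rare word -> higher score (~1.58)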
Mutual Information between all pages and all words is proportional to
SUM_d SUM_w { n_w^d x log_2(N/N_w) }
Mutual Information: Input F -> Machine Learning -> Output B
Feature F, behaviour B; if they are independent, the MI is zero and F predicts nothing about B.
Entropy H(F), H(B)
Mutual Information I(F,B) = SUM_f SUM_b p(f,b) log_2 { p(f,b) / (p(f) p(b)) }
Shannon: I(F,B) = H(F) + H(B) - H(F,B)
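A small sketch computing I(F,B) directly from a joint distribution table (the joint probabilities are invented):

    import math

    # joint distribution p(f,b) over a binary feature F and behaviour B
    p = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

    pf = {f: p[(f, 0)] + p[(f, 1)] for f in (0, 1)}   # marginal p(f)
    pb = {b: p[(0, b)] + p[(1, b)] for b in (0, 1)}   # marginal p(b)

    I = sum(v * math.log2(v / (pf[f] * pb[b])) for (f, b), v in p.items())
    print(I)  # ~0.28 bits > 0: F and B are dependent; 0 would mean independent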
WIBD - Naive Bayes
===================
Consider the problem P(BUY / r,f,g,c), where r,f,g,c are features or keywords in web shopping.
Bayes Rule: P(B,R) = P(B/R).P(R) = P(R/B).P(B)
Naive Bayes assumes r,f,g,c are INDEPENDENT (given the class)
- can derive the likelihood ratio
         p(r/B) * p(c/B) * p(other features/B) * ... * p(B)
L = ---------------------------------------------------------------
     p(r/notB) * p(c/notB) * p(other features/notB) * ... * p(notB)
so if L > 1 we predict a BUY; if L < 1, no BUY.
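A minimal likelihood-ratio sketch of the Naive Bayes decision (the priors and conditional probabilities are invented; in practice they are estimated from counts):

    p_buy, p_not = 0.3, 0.7             # priors P(B), P(notB)
    p_f_buy = {"r": 0.8, "c": 0.6}      # P(r/B), P(c/B)
    p_f_not = {"r": 0.2, "c": 0.3}      # P(r/notB), P(c/notB)

    def likelihood_ratio(features):
        num, den = p_buy, p_not
        for f in features:              # independence: probabilities multiply
            num *= p_f_buy[f]
            den *= p_f_not[f]
        return num / den

    L = likelihood_ratio(["r", "c"])
    print("BUY" if L > 1 else "no BUY", L)   # L ~ 3.4 -> BUY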
WIBD - Learn
==============
input X = x1, x2, ..., xn (n-dimensional)
output y in {y1, y2, ..., ym}
function f(X) = E[Y/X], the conditional expectation
= y1*P(y1/X) + y2*P(y2/X) + ... + ym*P(ym/X)
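A tiny worked example of the expectation (the conditional probabilities are made up):

    ys = [0, 1]                  # possible outputs y1, y2
    p_y_given_x = [0.2, 0.8]     # P(y1/X), P(y2/X) for one particular X
    f_X = sum(y * p for y, p in zip(ys, p_y_given_x))
    print(f_X)                   # 0.8 = E[Y/X]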
Classification (video 5.2)
- eg X = {size, head, noise, legs}, Y = {animal names}
- eg X = {like, lot}, {hate, waste}, {not enjoy}, Y = {positive, negative}
Clustering - unsupervised
- allows us to get classes from the data. Need to choose the right features.
- used when we DON'T know the output classes to start with.
- by definition, clusters are regions MORE populated than random data would be.
- add uniformly random data with density P0(X); if r = P(X)/P0(X) is large, there is clustering;
then f(X) = E[Y/X] = r/(1+r), with y=1 for real data and y=0 for the added random uniform data.
- find things that go together to form a cluster. Eg negative sentiment: hate, bad - but no one needs to tell us they are negative to start with.
- other means of clustering: k-means, LSH (see the sketch below).
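A minimal k-means sketch with scikit-learn (assumes numpy and sklearn are installed; the 2-D points are invented):

    import numpy as np
    from sklearn.cluster import KMeans

    # two loose groups of 2-D points
    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)            # cluster id assigned to each point
    print(km.cluster_centers_)   # the two discovered centres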
Rules
- finding which features are related (correlated) to each other, ie trying to cluster the features instead of clustering the data.
- compare against data whose features are independent: P0(X) = P(x1) * P(x2) * ... * P(xn)
where x1 = chirping, x2 = 4-legged, etc; the xi are animal features
eg P(chirping) = number of chirping animals / total number of data points
- Association Rule Mining
if there are features A,B,C,D, we want to infer a rule, eg A,B,C => D
high support: P(A,B,C,D) > s; the technique is to find P(A) > s, P(B) > s etc first
high confidence: P(D/A,B,C) > c
high interestingness: P(D/A,B,C) / P(D) > i
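A small sketch of the support / confidence / interestingness computations over toy transactions (the data are invented):

    transactions = [{"A", "B", "C", "D"},
                    {"A", "B", "C"},
                    {"A", "B", "C", "D"},
                    {"B", "C"}]
    n = len(transactions)

    def support(items):
        # fraction of transactions containing all of `items`
        return sum(1 for t in transactions if items <= t) / n

    supp = support({"A", "B", "C", "D"})     # P(A,B,C,D) = 0.5
    conf = supp / support({"A", "B", "C"})   # P(D/A,B,C) ~ 0.67
    lift = conf / support({"D"})             # interestingness ~ 1.33
    print(supp, conf, lift)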
- Recommendation of books - customers are features of books and vice versa.
Use latent models: factor the (m x n) matrix as (m x k) TIMES (k x n)
eg people x books = (people x genre) TIMES (genre x books)
NNMF = Non-Negative Matrix Factorization
other example features: unemployment direction, interest rate direction, fraud
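A minimal latent-factor sketch using scikit-learn's NMF (assumes sklearn; the people x books ratings matrix is invented):

    import numpy as np
    from sklearn.decomposition import NMF

    # people x books ratings: 4 people, 5 books
    V = np.array([[5, 4, 0, 1, 0],
                  [4, 5, 0, 0, 1],
                  [0, 1, 5, 4, 5],
                  [1, 0, 4, 5, 4]], dtype=float)

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(V)   # people x genre (k = 2 latent factors)
    H = model.components_        # genre x books
    print(np.round(W @ H, 1))    # approximate reconstruction of V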
WIBD - Connect
===============
Logic Inference
"if A then B" is the SAME as "~A OR B"
Obama is president of USA: isPresidentOf(Obama, USA) - predicates and variables
IF X is president of C THEN X is leader of C: IF isPresidentOf(X,C) THEN isLeaderOf(X,C)
Query "If K then Q": the query means "~K OR Q" is TRUE,
which is the same as "K AND ~Q" being FALSE.
So proving K AND ~Q is FALSE (a contradiction) proves If K Then Q.
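A brute-force truth-table check of the equivalences above (plain Python, not a real theorem prover):

    for K in (False, True):
        for Q in (False, True):
            implication = (not K) or Q       # "if K then Q" as ~K OR Q
            refutation = not (K and not Q)   # "K AND ~Q is FALSE"
            assert implication == refutation
    print("'~K OR Q' and 'NOT (K AND ~Q)' agree on all truth assignments")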
WIBD - Prediction
==============
Linear Least Squares Regression:
data matrix X with entries x(i,j): i indexes the data points, j indexes the features; y_i is the outcome for the i-th point.
f(x) = E(y/X) is the function that minimizes the expected squared error E[(y - f(x))^2],
... so let f(x) = x^T.w, where w is the vector of unknown coefficients.
Take the vector derivative and equate to zero -> X^T.X.w - X^T.y = 0 (the normal equations).
R^2 is used to measure the fit of a linear regression.
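A minimal least-squares sketch with numpy (the data are invented); np.linalg.lstsq solves the normal equations X^T.X.w = X^T.y, and the last lines compute R^2:

    import numpy as np

    # 4 data points: an intercept column of ones plus 2 features
    X = np.array([[1, 1.0, 2.0],
                  [1, 2.0, 0.5],
                  [1, 3.0, 1.5],
                  [1, 4.0, 3.0]])
    y = np.array([3.1, 4.2, 6.0, 8.1])

    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # w minimises ||y - Xw||^2
    r2 = 1 - np.sum((y - X @ w)**2) / np.sum((y - y.mean())**2)
    print(w, r2)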
Non-linear correlation
- Logistic Regression - f(x) = 1 / (1 + exp(-w^T.x)) (sketch below)
- Support Vector Machines - data may be correlated at higher order, eg parabolic correlation etc.
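A tiny sketch of the logistic function from the formula above (the weights are invented):

    import math

    def logistic(w, x):
        # f(x) = 1 / (1 + exp(-w^T.x)): squashes a linear score into (0, 1)
        z = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-z))

    w = [0.8, -0.4]
    print(logistic(w, [2.0, 1.0]))   # ~0.77 > 0.5 -> predict class 1
    print(logistic(w, [0.0, 3.0]))   # ~0.23 < 0.5 -> predict class 0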
Neural Networks
- linear least squares
- non-linear, like logistic
- feed-forward, multi-layer
- feedback, like belief networks
Which prediction technique?
FEATURES  TARGET  CORRELATION          TECHNIQUE
num       num     stable/linear        Linear Regression
cat       num     -                    Linear Regression, Neural Networks
num       num     unstable/non-linear  Neural Networks
num       cat     stable/linear        Logistic Regression
num       cat     unstable/non-linear  Support Vector Machines
cat       cat     -                    Support Vector Machines, Naive Bayes, other Probabilistic Graphical Models
Why Big Data
==============
eg why Google (MapReduce), Yahoo (Pig), Facebook (Hive) had to invent a new stack
Challenges
1. Fault tolerance
2. Variety of data types, eg images, videos, music
3. Managing data volumes without archiving. Traditional systems need archives.
4. Parallelism was an add-on
Disadvantages of the traditional stack
1. Could not scale
2. Not suited for compute-intensive deep analytics, eg in the web world
3. Price-performance challenge: the new stack uses commodity hardware and open source
Hadoop Ecosystem (See NotesHadoop)
===================================
MapReduce (See NotesHadoop)
=============
Miscellaneous
==============
About our speaker: Ross is Chief Data Scientist at Teradata and currently works with major clients throughout Australia and New Zealand to help them exploit the value of 'big data'. He specializes in deployments involving non-relational, semi-structured data and analyses such as path analysis, text analysis and social network analysis. Previously, Ross was deputy headmaster of John Colet School for 18 years before working as a SAS analyst, a business development manager at Minitab Statistical Software, and founder and lead analyst at datamilk.com.
Ross Farrelly has a BSc (hons 1st class) in pure mathematics from Auckland University, a Masters in Applied Statistics from Macquarie University and a Masters of Applied Ethics from the Australian Catholic University.
Analysis
=========
path analysis
text analysis
social network analysis
natural language processing