• Home
  • About
    • Miles' Blog photo

      Miles' Blog

      Miles 用來記錄學習心得及筆記的地方。

    • Learn More
    • Email
    • Github
  • Posts
    • All Posts
    • All Tags
  • Projects

Association Rule - Apriori Algorithm

12 Mar 2017

Reading time ~1 minute

經典的解決啤酒尿布問題演算法。 具體的作法如下: 1.將所有的項目做成每個unique Item。 2.執行組合產生Sub Item Set。 3.若Sub Item Set的子項目已經被剔除則先移除該Sub Item Set。 3.根據Sub Item Set對其計次。 4.低於Min Support的對其剔除。 5.若待計次的Sub Item Set已經沒東西則脫離迴圈,否則回到step2。 6.根據Minimum Support和Minimum Confidence對不符合條件的進行剔除。

其中Support的算法為該Sub Item Set出現在Database的總次數/Database的資料總數(Denote the frequency of the rule within transactions)。 support(A=>B)=support({A,B})

Confidence的算法為該Sub Item Set出現在Database的總次數/已知該Sub Item Set的總數(Denote the percentage of transactions containing A which also contains B)。 Confidence(A=>B)=support({A,B})/support({A})

另外,由於Apriori演算法是建立在物品存在正關聯的基礎上,因此需要再引入一個Correlation做參考,Correlation算法為 Corr(A,B)=Confidence(A=>B)/Support(B) 當Corr>1時則為正關聯,該Rule較具參考價值,當介於0~1之間則為負關聯,較不具參考加值。

import itertools
#database  data
dataSet={'100':['a','b','c'],'200':['a','c'],'400':['a','d'],'500':['b','e','f']}
#minimum confidence value
minConfidenceScore=0.5
#minimum support value
minSupportScore=0.5
#initialize the needed variable
dtSet={}
supportSet={}
lastDeletedTradeList=[]
#get all unique trading items
allItemsList=sorted({x for v in dataSet.itervalues() for x in v})
allItemLen=len(allItemsList)
totalRecord=float(len(dataSet))
combineChooseNo=1
#algorithm start
while True:
    #combination
    combinationList=[]
    for combineTuple in list(itertools.combinations(allItemsList, int(combineChooseNo))):
        combinationList.append(set(list(combineTuple)))
    #delete the combinationSet which does not need to be analytics
    if len(lastDeletedTradeList)>0:
        for deletedTrade in lastDeletedTradeList:
            combinationList[:]=[combinationSet for combinationSet in combinationList if not deletedTrade.issubset(combinationSet)]
    #The database shoud be scan only if the length of candidate's item set does not equal to 0
    if len(combinationList)>0:
        for tID,items in dataSet.iteritems():
            for combinationSet in combinationList:
                if combinationSet.issubset(set(items)):
                    dtSet[tuple(combinationSet)]=dtSet.get(tuple(combinationSet),0)+1
        for k, v in dtSet.iteritems():
            if v/totalRecord >=minSupportScore:
                supportSet[k]=v/totalRecord
            else:
                lastDeletedTradeList.append(set(list(k)))
        combineChooseNo+=1
    else:
        break
#print out final analytics result
for supportA in supportSet:
    for supportB in supportSet:
        if supportA==supportB:
            continue
        elif set(list(supportA)).issubset(set(list(supportB))):
            if supportSet[supportB]/supportSet[supportA]>=minConfidenceScore:
                confidence=supportSet[supportB]/supportSet[supportA]
                corr=confidence/supportSet[supportB]
                print str(list(supportA))+'->'+str(list(set(list(supportB))-set(list(supportA))))+',Support:'+str(supportSet[supportB])+',Confidence:'+str(round(confidence,2))+',Corr:'+str(round(corr,2))
#Result

另外也可以擴展到高維度。也可視為分類問題(Classification)。 缺點是每次Create Item時都要去掃Database顯得很浪費時間,為改善效率可在期待的規則上做些限制。

參考: 老師書面講義



PythonData ScienceData Mining Share Tweet +1