My boss recently assigned me to a text-mining project, which gave me the ominous feeling of having to blaze an entirely new trail. I didn't know much about text mining before, so I headed over to Kaggle, the holy land of data science, to see how this kind of work is usually done, and grabbed some data to practice on.

Project overview:

The data comes entirely from a Kaggle dataset of tweets about the World Cup; in this post we use these short tweets for text mining and analysis.

About the data:

library(tidyverse)
library(tidytext)
library(visNetwork)

# Read the Kaggle World Cup tweets and take a first look
fifa <- read_csv("FIFA.csv")
glimpse(fifa)

## Observations: 530,000
## Variables: 16
## $ ID <dbl> 1.013597e+18, 1.013597e+18, 1.013597e+18, 1.0...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en...
## $ Date <dttm> 2018-07-02 01:35:45, 2018-07-02 01:35:44, 20...
## $ Source <chr> "Twitter for Android", "Twitter for Android",...
## $ len <int> 140, 139, 107, 142, 140, 140, 140, 138, 138, ...
## $ Orig_Tweet <chr> "RT @Squawka: Only two goalkeepers have saved...
## $ Tweet <chr> "Only two goalkeepers have saved three penalt...
## $ Likes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RTs <int> 477, 1031, 488, 0, 477, 153, 4, 1, 2199, 5146...
## $ Hashtags <chr> "WorldCup,POR,ENG", "WorldCup", "worldcup", "...
## $ UserMentionNames <chr> "Squawka Football", "FC Barcelona,Ivan Rakiti...
## $ UserMentionID <chr> "Squawka", "FCBarcelona,ivanrakitic,HNS_CFF",...
## $ Name <chr> "Cayleb", "Febri Aditya", "??", "Frida Carril...
## $ Place <chr> "Accra", "Bogor", NA, "Zapopan, Jalisco", NA,...
## $ Followers <int> 861, 667, 65, 17, 137, 29, 208, 7, 1, 158, 34...
## $ Friends <int> 828, 686, 67, 89, 216, 283, 338, 9, 6, 245, 3...

The columns are quite self-explanatory, and we won't need all of them for this work, so we keep only Source, Tweet, Hashtags, RTs, Name, and Place.
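For reference, a minimal sketch of that selection (the column names are taken from the glimpse() output above):

# Keep only the columns used in the rest of the analysis
fifa <- fifa %>% select(Source, Tweet, Hashtags, RTs, Name, Place)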

EDA:

Which words appear most often in the tweets?

fifa_tidy <- fifa %>% unnest_tokens(words, Tweet) %>%
  filter(!(words %in% stop_words$word)) %>%   # drop stop words
  filter(str_detect(words, "[a-z]"))          # keep tokens that contain letters

fifa_tidy %>% count(words, sort = TRUE) %>% top_n(20, wt = n) %>%
  ggplot(aes(x = reorder(words, n), y = n)) + geom_col(fill = "#AAB7B8") + theme_bw() +
  labs(y = NULL, x = NULL, title = "Top words in tweets") + coord_flip()

Across all the tweets, france, world, cup, final, and congratulations are the most frequent words, which matches what happened at last year's World Cup quite well.

fifa_tidy %>% filter(str_detect(Source, "^Twitter for")) %>%
  count(Source, words, sort = TRUE) %>% group_by(Source) %>% top_n(10, wt = n) %>%
  ggplot(aes(x = reorder(words, n), y = n, fill = Source)) + geom_col() +
  theme_bw() + facet_wrap(~Source, scales = "free", ncol = 2) +
  labs(y = NULL, x = NULL, title = "Top words of tweets in each source") + coord_flip() +
  theme(legend.position = "none")

This looks at which words are most frequent on each client platform. A Twitter user pointed out that the first "Twitter for iPhone" facet contains an extra space, which was indeed a moment of carelessness on my part.
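If you want to fix that, one option (assuming the stray space really does live in the Source field) is to collapse repeated whitespace with stringr::str_squish() before counting:

# Collapse doubled spaces so identical clients group together
fifa_tidy <- fifa_tidy %>% mutate(Source = str_squish(Source))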

fifa_tidy %>% count(words,sort = T) %>% top_n(500,wt=n) %>% wordcloud2::wordcloud2()

A word cloud is absolutely indispensable!!!

A major part of text mining is sentiment analysis: the sentiment of individual words often drives the overall meaning of a text. Here we visualize the sentiment of the tweets.

fifa_tidy_sentiment <- fifa_tidy %>% rename(word = words) %>%
  inner_join(get_sentiments("bing"), by = "word")

fifa_tidy_sentiment %>% group_by(word,sentiment) %>% summarise(total=n()) %>%
ungroup() %>% group_by(sentiment) %>% arrange(desc(total)) %>% top_n(10) %>%
ggplot(aes(x=reorder(word,total),y=total,fill=sentiment))+geom_col()+
facet_wrap(~sentiment,scales = "free")+theme_bw()+coord_flip()+
theme(legend.position = "none")

Maybe all the negative words came from Croatia fans, LOL!!!!

fifa_tidy_sentiment %>% group_by(word,sentiment) %>% summarise(total=n()) %>%
arrange(desc(total)) %>%
reshape2::acast(word ~ sentiment, value.var = "total", fill = 0) %>%
wordcloud::comparison.cloud(colors = c("#F8766D", "#00BFC4"),max.words = 350)

A fun thing about word clouds is comparing clouds of opposing sentiments against each other.

Digging deeper into tweet sentiment:

fifa_all_sens <- fifa_tidy %>% rename(word = words) %>% inner_join(get_sentiments("nrc"), by = "word")

fifa_all_sens %>% count(word, sentiment, sort = TRUE) %>% group_by(sentiment) %>% top_n(10, wt = n) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) + theme_bw() + facet_wrap(~sentiment, scales = "free", ncol = 3) +
  theme(legend.position = "none") + coord_flip() + labs(x = NULL, y = NULL, title = "The top 10 words under each sentiment category")

fifa_all_sens %>% count(word, sentiment) %>% ungroup() %>% bind_tf_idf(word, sentiment, n) %>%
  arrange(desc(tf_idf)) %>% group_by(sentiment) %>% top_n(15, wt = tf_idf) %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = sentiment)) +   # plot tf-idf, not raw counts
  geom_col(show.legend = FALSE) + labs(x = NULL, y = "tf-idf") + facet_wrap(~sentiment, ncol = 3, scales = "free") + coord_flip()

This is tf-idf, an important tool in text mining: term frequency weighted by inverse document frequency, so a word scores high when it is frequent within one category but rare across the others. My own understanding of it is still shallow, so please look up the details yourself.
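As a quick illustration, here is a made-up two-document mini-corpus (not part of the FIFA data) run through bind_tf_idf():

# Hypothetical mini-corpus: two documents with word counts
toy <- tibble(
  doc  = c("d1", "d1", "d2", "d2"),
  word = c("goal", "worldcup", "goal", "penalty"),
  n    = c(5, 1, 4, 3)
)
toy %>% bind_tf_idf(word, doc, n)
# "goal" appears in both documents, so idf = log(2/2) = 0 and its tf-idf is 0;
# "worldcup" and "penalty" each appear in only one, so idf = log(2/1) > 0.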

fifa_ngram<-fifa %>% unnest_tokens(bigram,Tweet,token = "ngrams", n=2) %>% select(bigram) %>%
separate(bigram,c("w1","w2"),sep=" ") %>%
filter(!w1 %in% stop_words$word,!w2 %in% stop_words$word) %>% count(w1,w2,sort = T)

fifa_ngram %>% unite(bigram,w1,w2,sep = " ") %>% wordcloud2::wordcloud2()

The earlier tokenization was one word at a time; with the text now split into two-word pairs, we can score the words that follow "worldcup":

fifa_ngram %>% filter(w1 == "worldcup") %>%
  inner_join(get_sentiments("afinn"), by = c("w2" = "word")) %>%   # recent tidytext names AFINN's score column "value"
  mutate(contribution = n * value) %>% arrange(desc(abs(contribution))) %>%
  mutate(w2 = reorder(w2, contribution)) %>% ggplot(aes(w2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) + coord_flip()

Since we are already tokenizing in pairs, we can look at how the words link to one another as a network:

big_graph <- na.omit(fifa_ngram) %>% filter(n > 4000) %>%   # keep only very frequent bigrams
  igraph::graph_from_data_frame() %>% toVisNetworkData()
visNetwork(big_graph$nodes, big_graph$edges) %>% visOptions(highlightNearest = TRUE)
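The network above only links words that were adjacent in a bigram. To measure actual word correlations (how strongly two words tend to appear in the same tweet), the widyr package's pairwise_cor() is the usual tidy-data tool. A minimal sketch, using row_number() as a hypothetical per-tweet key since the ID column was dropped earlier:

library(widyr)

fifa %>%
  mutate(tweet_id = row_number()) %>%        # hypothetical per-tweet key
  unnest_tokens(word, Tweet) %>%
  filter(!word %in% stop_words$word) %>%
  group_by(word) %>% filter(n() > 1000) %>%  # keep reasonably common words
  pairwise_cor(word, tweet_id, sort = TRUE)  # phi coefficient for each word pair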

Closing thoughts:

This is the first time I have written about text mining, and I only half-understand many of the technical terms. Most of what appears here I learned from this book, which I recommend:

Text Mining with R: A Tidy Approach

www.tidytextmining.com

It is well worth a read, and the English is quite manageable. There is no Chinese translation yet; if I find the time, I may translate it and put it on GitHub.

