Text Mining with R

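My boss recently assigned me to a text mining project, and I have the ominous feeling that I will once again have to break new ground. I did not know much about text mining beforehand, so I went to Kaggle, the mecca of data science, to see what a typical workflow looks like and to find some data to practice on.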

Project overview:

All of the data comes from a Twitter dataset on Kaggle containing tweets about the World Cup. This post uses these short tweets for text mining and analysis.

Data overview:

library(tidyverse)
library(tidytext)
library(visNetwork)

# Read the Kaggle export; the file name must be a quoted string
fifa <- read_csv("FIFA.csv")
glimpse(fifa)

————————————————————————————————————————————————
## Observations: 530,000
## Variables: 16
## $ ID <dbl> 1.013597e+18, 1.013597e+18, 1.013597e+18, 1.0...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en...
## $ Date <dttm> 2018-07-02 01:35:45, 2018-07-02 01:35:44, 20...
## $ Source <chr> "Twitter for Android", "Twitter for Android",...
## $ len <int> 140, 139, 107, 142, 140, 140, 140, 138, 138, ...
## $ Orig_Tweet <chr> "RT @Squawka: Only two goalkeepers have saved...
## $ Tweet <chr> "Only two goalkeepers have saved three penalt...
## $ Likes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ RTs <int> 477, 1031, 488, 0, 477, 153, 4, 1, 2199, 5146...
## $ Hashtags <chr> "WorldCup,POR,ENG", "WorldCup", "worldcup", "...
## $ UserMentionNames <chr> "Squawka Football", "FC Barcelona,Ivan Rakiti...
## $ UserMentionID <chr> "Squawka", "FCBarcelona,ivanrakitic,HNS_CFF",...
## $ Name <chr> "Cayleb", "Febri Aditya", "??", "Frida Carril...
## $ Place <chr> "Accra", "Bogor", NA, "Zapopan, Jalisco", NA,...
## $ Followers <int> 861, 667, 65, 17, 137, 29, 208, 7, 1, 158, 34...
## $ Friends <int> 828, 686, 67, 89, 216, 283, 338, 9, 6, 245, 3...

The data is clearly laid out. The analysis does not need every column, so we keep only Source, Tweet, Hashtags, RTs, Name, and Place.
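The column selection itself is not shown in the post. A minimal sketch of how it could be done with dplyr's `select()` follows; `fifa_demo` and `fifa_small` are hypothetical names, and the two rows are made-up stand-ins for the real `fifa` tibble.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the real `fifa` tibble (values are made up for illustration)
fifa_demo <- tibble(
  ID       = c(1, 2),
  lang     = c("en", "en"),
  Source   = c("Twitter for Android", "Twitter for iPhone"),
  Tweet    = c("What a final!", "Congratulations France"),
  Hashtags = c("WorldCup", "WorldCup,FRA"),
  RTs      = c(10L, 3L),
  Name     = c("user_a", "user_b"),
  Place    = c("Accra", NA)
)

# Keep only the six columns used in the rest of the analysis
fifa_small <- fifa_demo %>%
  select(Source, Tweet, Hashtags, RTs, Name, Place)

names(fifa_small)
```

On the real data the same `select()` call would be applied to `fifa` directly.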

EDA:

Which words appear most often in the tweets:

# One row per word: tokenize, drop stop words, keep only tokens that contain a letter
fifa_tidy <- fifa %>%
  unnest_tokens(words, Tweet) %>%
  filter(!(words %in% stop_words$word)) %>%
  filter(str_detect(words, "[a-z]"))

# Plot the 20 most frequent words
fifa_tidy %>%
  count(words, sort = TRUE) %>%
  top_n(20, wt = n) %>%
  ggplot(aes(x = reorder(words, n), y = n)) +
  geom_col(fill = "#AAB7B8") +
  theme_bw() +
  labs(x = NULL, y = NULL, title = "Top words in tweets") +
  coord_flip()

It looks like france, world, cup, final, and congratulations are the most frequent words across all tweets, which fits well with how last year's World Cup played out.
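To make the tokenizing and filtering steps easier to follow, here is a toy run of the same pipeline on two made-up tweets. `demo`, `demo_tidy`, and `word_counts` are hypothetical names used only for this illustration; the sketch assumes the default behavior of `unnest_tokens()`, which lowercases text and strips punctuation.

```r
library(dplyr)
library(tibble)
library(stringr)
library(tidytext)

# Two made-up tweets standing in for the real Tweet column
demo <- tibble(Tweet = c("France wins the World Cup!",
                         "What a final, France 2018"))

demo_tidy <- demo %>%
  unnest_tokens(words, Tweet) %>%             # one lowercase token per row
  filter(!(words %in% stop_words$word)) %>%   # drops stop words such as "the", "what", "a"
  filter(str_detect(words, "[a-z]"))          # drops tokens with no letters, such as "2018"

# Count how often each remaining word occurs, most frequent first
word_counts <- demo_tidy %>%
  count(words, sort = TRUE)

word_counts
```

The last filter explains why purely numeric tokens never show up in the word chart.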

# Top 10 words per client, restricted to sources starting with "Twitter for"
fifa_tidy %>%
  filter(str_detect(Source, "^Twitter for")) %>%
  count(Source, words, sort = TRUE) %>%
  group_by(Source) %>%
  top_n(10, wt = n) %>%
  ggplot(aes(x = reorder(words, n), y = n, fill = Source)) +
  geom_col() +
  theme_bw() +
  facet_wrap(~Source, scales = "free", ncol = 2) +
  labs(x = NULL, y = NULL, title = "Top words of tweets in each source") +
  coord_flip() +
  theme(legend.position = "none")
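One caveat with the faceted plot: `reorder(words, n)` orders the words once, across all sources, so the bars inside a single panel may not be sorted. A possible alternative, sketched here on made-up counts, is tidytext's `reorder_within()` together with `scale_x_reordered()`. This is not what the original post does; `counts`, `ordered`, and `p` are hypothetical names.

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(tidytext)

# Made-up per-source word counts, standing in for the real counts
counts <- tribble(
  ~Source,               ~words,   ~n,
  "Twitter for Android", "france", 50,
  "Twitter for Android", "final",  30,
  "Twitter for iPhone",  "final",  40,
  "Twitter for iPhone",  "france", 20
)

# Order words separately inside each Source
ordered <- counts %>%
  mutate(words = reorder_within(words, n, Source))

p <- ggplot(ordered, aes(x = words, y = n, fill = Source)) +
  geom_col() +
  scale_x_reordered() +   # strips the internal "___<Source>" suffix from axis labels
  facet_wrap(~Source, scales = "free", ncol = 2) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none")
```

Printing `p` draws the plot, with the bars sorted by count within each panel.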