merging tweets by date

April 2019

1

I hope this is not too basic a question. I have a data frame of tweets (in R), and my aim is to calculate the sentiment by date.

I would be grateful if anyone could advise me on how to concatenate the tweets (tweet$text) by date, so that each observation becomes a single string of that date's merged tweets.

For example, if I had:

Created_Date       Tweet
2014-01-04         "the iphone is magnificent"
2014-01-04         "the iphone's screen is poor"
2014-01-04         "I will always use Apple products"
2014-01-03         "iphone is overpriced, but I love it"
2014-01-03         "Siri is very sluggish"
2014-01-03         "iphone's maps app is poor compared to Android"

I would like a loop or function that merges the tweets by Created_Date, resulting in something like this:

Created_Date       Tweet
2014-01-04         "the iphone is magnificent", "the iphone's screen is poor", "I will always use Apple products"
2014-01-03         "iphone is overpriced, but I love it", "Siri is very sluggish", "iphone's maps app is poor compared to Android"

Here are my data

 dat <-   structure(list(Created_Date = structure(c(1388793600, 1388793600, 
    1388793600, 1388707200, 1388707200, 1388707200), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC"), Tweet = c("the iphone is magnificent", 
    "the iphone's screen is poor", "I will always use Apple products", 
    "iphone is overpriced, but I love it", "Siri is very sluggish", 
    "iphone's maps app is poor compared to Android")), .Names = c("Created_Date", 
    "Tweet"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -6L))

3 answers

0

If you are using the corpus library, then you can use the group argument to term_counts or term_matrix to aggregate (sum) by date.

In your case, if you are interested in counting the number of positive, negative, and neutral words, you can first create a "stemmer" that maps words to these categories:

library(corpus)
# map terms in the AFINN dictionary to Positive/Negative; others to Neutral
stem_sent <- new_stemmer(sentiment_afinn$term,
                         ifelse(sentiment_afinn$score > 0, "Positive", "Negative"),
                         default = "Neutral")

Then, you can use this as a stemmer and get the counts by group:

term_counts(dat$Tweet, group = dat$Created_Date, stemmer = stem_sent)
##   group      term     count
## 1 2014-01-03 Negative     2 
## 2 2014-01-04 Negative     1
## 3 2014-01-03 Neutral     17
## 4 2014-01-04 Neutral     14
## 5 2014-01-03 Positive     1

Or get a matrix of counts:

term_matrix(dat$Tweet, group = dat$Created_Date, stemmer = stem_sent)
## 2 x 3 sparse Matrix of class "dgCMatrix"
##            Negative Neutral Positive
## 2014-01-03        2      17        1
## 2014-01-04        1      14        .
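
If a single number per date is wanted, one simple option is to subtract the negative count from the positive count. The sketch below does that in base R, taking the counts printed above as given (the "." in the sparse matrix is a zero). The object names `counts` and `net` are made up here for illustration, and "positive minus negative" is only one of many possible scoring rules.

```r
# Counts copied from the term_matrix() output above; "." means 0.
counts <- matrix(c(2, 17, 1,
                   1, 14, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("2014-01-03", "2014-01-04"),
                                 c("Negative", "Neutral", "Positive")))

# Net sentiment per date: positive words minus negative words.
net <- counts[, "Positive"] - counts[, "Negative"]
print(net)
```

With these counts both dates come out at -1.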
1

An example using data.table

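library(data.table)
# `ta` is a data.table copy of the question's data `dat`
ta <- as.data.table(dat)
# paste() the tweets within each date; `:=` adds the result as a new column, so there are still 6 rows
ta[, cTweet := paste(Tweet, collapse = ","), by = Created_Date]
# keep only the date and the merged column (this drops the Tweet column)
ta1 <- ta[, .(Created_Date, cTweet)]
# key (and sort) the table by date, then keep one row per date with unique()
setkey(ta1, Created_Date)
unique(ta1)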
   Created_Date                                                                                                  cTweet
1:   2014-01-03 iphone is overpriced, but I love it,Siri is very sluggish,iphone's maps app is poor compared to Android
2:   2014-01-04                  the iphone is magnificent,the iphone's screen is poor,I will always use Apple products
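
The add-a-column-then-deduplicate steps above can probably be collapsed into one grouped summary. A minimal sketch, assuming the data.table package is installed; it rebuilds the question's data under the name `tweets`, which is a name chosen here for illustration only.

```r
library(data.table)

# The question's six tweets, rebuilt by hand.
tweets <- data.table(
  Created_Date = as.POSIXct(c("2014-01-04", "2014-01-04", "2014-01-04",
                              "2014-01-03", "2014-01-03", "2014-01-03"),
                            tz = "UTC"),
  Tweet = c("the iphone is magnificent",
            "the iphone's screen is poor",
            "I will always use Apple products",
            "iphone is overpriced, but I love it",
            "Siri is very sluggish",
            "iphone's maps app is poor compared to Android")
)

# .() in j returns one row per group, so no unique() step is needed.
merged <- tweets[, .(cTweet = paste(Tweet, collapse = ",")), by = Created_Date]

# Groups come back in order of first appearance; sort them by date.
setorder(merged, Created_Date)
print(merged)
```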

An example using dplyr (tidyverse)

library(tidyverse)
# group_by() groups the rows by date, and summarise() then collapses each
# group's tweets into one string. The pipe `%>%` passes the result of each
# step on to the next one. `dat` is the data frame from the question.

dat %>%
  group_by(Created_Date) %>%
  summarise(cTweet = paste(Tweet, collapse = ","))

# A tibble: 2 x 2
  Created_Date                                                                                                  cTweet
        <dttm>                                                                                                   <chr>
1   2014-01-03 iphone is overpriced, but I love it,Siri is very sluggish,iphone's maps app is poor compared to Android
2   2014-01-04                  the iphone is magnificent,the iphone's screen is poor,I will always use Apple products

An example using base R

aggregate(dat$Tweet, by = list(Created_Date = dat$Created_Date), FUN = function(x) paste(x, collapse = ","))
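
Another base R route is split() followed by sapply(). The sketch below is self-contained: it rebuilds the question's data with plain Date values instead of POSIXct (an assumption made for brevity), and the names `merged` and `result` are made up for illustration.

```r
# The question's six tweets, with Date instead of POSIXct for brevity.
dat <- data.frame(
  Created_Date = as.Date(c("2014-01-04", "2014-01-04", "2014-01-04",
                           "2014-01-03", "2014-01-03", "2014-01-03")),
  Tweet = c("the iphone is magnificent",
            "the iphone's screen is poor",
            "I will always use Apple products",
            "iphone is overpriced, but I love it",
            "Siri is very sluggish",
            "iphone's maps app is poor compared to Android"),
  stringsAsFactors = FALSE
)

# split() groups the tweets by date; sapply() collapses each group to one string.
merged <- sapply(split(dat$Tweet, dat$Created_Date), paste, collapse = ", ")

# Put the dates and merged strings back into a data frame, one row per date.
result <- data.frame(Created_Date = as.Date(names(merged)),
                     Tweet = unname(merged),
                     stringsAsFactors = FALSE)
print(result)
```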
0

Just a simple implementation using loops. Probably not the fastest solution imaginable, but easy to understand.

# construction of a sample data.frame
text = c("Some random text.", 
         "Yet another line.",
         "Will this ever stop.",
         "This may be the last one.",
         "It was not the last.")
date = c("9-11-2017",
         "11-11-2017",
         "10-11-2017",
         "11-11-2017",
         "10-11-2017")
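tweet = data.frame(text, date, stringsAsFactors = FALSE)

# vector of the distinct dates in the data.frame
# (levels() would return NULL here, because strings are no longer factors by default since R 4.0)
dates = sort(unique(tweet$date))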

# initialise results with empty strings
resultString = rep.int("", length(dates)) 

for(i in seq_along(dates)) # loop over different dates
{
    for(j in seq_along(tweet$text)) # loop over tweets
    {
        if (tweet$date[j] == dates[i]) # concatenate to resultString if dates match
        {
            resultString[i] = paste0(resultString[i], tweet$text[j])
        }
    }
}

# combine concatenated strings with dates in new data.frame
result = data.frame(date=dates, tweetsByDate=resultString)
result

# output:
#         date                               tweetsByDate
# 1 10-11-2017   Will this ever stop.It was not the last.
# 2 11-11-2017 Yet another line.This may be the last one.
# 3  9-11-2017                          Some random text.
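
For comparison, the same grouping can be sketched without explicit loops using tapply(). Two assumptions are made here that are not in the answer above: the date strings are read as day-month-year, so that parsing them with as.Date() sorts the groups chronologically, and the tweets are joined with a space rather than with no separator.

```r
text <- c("Some random text.",
          "Yet another line.",
          "Will this ever stop.",
          "This may be the last one.",
          "It was not the last.")
date <- c("9-11-2017", "11-11-2017", "10-11-2017", "11-11-2017", "10-11-2017")

# Assumption: the strings are day-month-year.
day <- as.Date(date, format = "%d-%m-%Y")

# tapply() applies paste() to the tweets of each distinct day.
merged <- tapply(text, day, paste, collapse = " ")

result <- data.frame(date = names(merged),
                     tweetsByDate = as.vector(merged),
                     stringsAsFactors = FALSE)
print(result)
```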