Clustering Twitter Data with Python, K-Means, and t-SNE
In the article “What People Write about Climate”, I analyzed Twitter posts using natural language processing, vectorization, and clustering. With this approach, it is possible to find distinct groups in unstructured text data, for example, to extract messages about ice melting or about electric transport from thousands of tweets about climate. During the processing of that data, another question arose: what if we applied the same algorithm not to the messages themselves but to the time when those messages were published? This would allow us to analyze when and how often different people post on social media. It can be important not only from a sociological or psychological perspective but, as we will see later, also for detecting bots or users sending spam. Last but not least, almost everybody uses social platforms nowadays, and it is just interesting to learn something new about ourselves. Obviously, the same algorithm can be applied not only to Twitter posts but to any media platform.
Methodology
I will use mostly the same approach as described in the first part about Twitter data analysis. Our data processing pipeline will consist of several steps:
- Collecting tweets with the specific hashtag and saving them in a CSV file. This was already done in the previous article, so I will skip the details here.
- Finding the general properties of the collected data.
- Calculating embedding vectors for each user based on the time of their posts.
- Clustering the data using the K-Means algorithm.
- Analyzing the results.
Let’s get started.
1. Loading the data
I will be using the Tweepy library to collect Twitter posts. More details can be found in the first part; here I will only post the source code:
import tweepy

api_key = "YjKdgxk..."
api_key_secret = "Qa6ZnPs0vdp4X...."
auth = tweepy.OAuth2AppHandler(api_key, api_key_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtag = "#climate"
language = "en"

def text_filter(s_data: str) -> str:
    """ Remove extra characters from text """
    return s_data.replace("&", "and").replace(";", " ").replace(",", " ") \
                 .replace('"', " ").replace("\n", " ").replace("  ", " ")

def get_hashtags(tweet) -> str:
    """ Parse retweeted data """
    hash_tags = ""
    if 'hashtags' in tweet.entities:
        hash_tags = ','.join(map(lambda x: x["text"], tweet.entities['hashtags']))
    return hash_tags

def get_csv_header() -> str:
    """ CSV header """
    return "id;created_at;user_name;user_location;user_followers_count;user_friends_count;retweets_count;favorites_count;retweet_orig_id;retweet_orig_user;hash_tags;full_text"

def tweet_to_csv(tweet):
    """ Convert tweet data to a CSV string """
    if not hasattr(tweet, 'retweeted_status'):
        full_text = text_filter(tweet.full_text)
        hashtags = get_hashtags(tweet)
        retweet_orig_id = ""
        retweet_orig_user = ""
        favs, retweets = tweet.favorite_count, tweet.retweet_count
    else:
        retweet = tweet.retweeted_status
        retweet_orig_id = retweet.id
        retweet_orig_user = retweet.user.screen_name
        full_text = text_filter(retweet.full_text)
        hashtags = get_hashtags(retweet)
        favs, retweets = retweet.favorite_count, retweet.retweet_count
    # addr_filter: location text cleanup helper (not shown in this snippet)
    s_out = f"{tweet.id};{tweet.created_at};{tweet.user.screen_name};{addr_filter(tweet.user.location)};{tweet.user.followers_count};{tweet.user.friends_count};{retweets};{favs};{retweet_orig_id};{retweet_orig_user};{hashtags};{full_text}"
    return s_out

if __name__ == "__main__":
    pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                          result_type="recent",
                          count=100,
                          lang=language).pages(limit)  # 'limit': maximum number of result pages to fetch

    with open("tweets.csv", "a", encoding="utf-8") as f_log:
        f_log.write(get_csv_header() + "\n")
        for ind, page in enumerate(pages):
            for tweet in page:
                # Get data per tweet
                str_line = tweet_to_csv(tweet)
                # Save to CSV
                f_log.write(str_line + "\n")
Using this code, we can get all Twitter posts with a specific hashtag made during the last 7 days. A hashtag is essentially our search query; we can find posts about climate, politics, or any other topic. Optionally, a language code allows us to search for posts in different languages. Readers are welcome to do extra research on their own; for example, it can be interesting to compare the results between English and Spanish tweets.
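As an illustration of how such a comparison could be set up, the same collection code can be wrapped into a small helper parameterized by hashtag and language. This is only a sketch: the function name, the file names, and the Spanish hashtag below are my assumptions, not code from the original article.

def collect_tweets(api, hashtag: str, language: str, filename: str, pages_limit: int = 100):
    """ Collect tweets for the given hashtag and language and append them to a CSV file """
    pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                          result_type="recent", count=100,
                          lang=language).pages(pages_limit)
    with open(filename, "a", encoding="utf-8") as f_log:
        f_log.write(get_csv_header() + "\n")
        for page in pages:
            for tweet in page:
                f_log.write(tweet_to_csv(tweet) + "\n")

# collect_tweets(api, "#climate", "en", "tweets_en.csv")
# collect_tweets(api, "#clima", "es", "tweets_es.csv")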
After the CSV file is saved, let’s load it into a dataframe, drop the unneeded columns, and see what kind of data we have:
import pandas as pd

df = pd.read_csv("climate.csv", sep=';', dtype={'id': object, 'retweet_orig_id': object, 'full_text': str, 'hash_tags': str}, parse_dates=["created_at"], lineterminator='\n')
df.drop(["retweet_orig_id", "user_friends_count", "retweets_count", "favorites_count", "user_location", "hash_tags", "retweet_orig_user", "user_followers_count"], inplace=True, axis=1)
df = df.drop_duplicates('id')
with pd.option_context('display.max_colwidth', 80):
    display(df)
In the same way as in the first part, I collected Twitter posts with the hashtag “#climate”. The result looks like this:
We don’t actually need the text or user id, but it can be useful for “debugging”, to see what the original tweet looks like. For future processing, we will need to know the day, time, and hour of each tweet. Let’s add columns to the dataframe:
import datetime

def get_time(dt: datetime.datetime):
    """ Get time (hours and minutes) from datetime """
    return dt.time()

def get_date(dt: datetime.datetime):
    """ Get date from datetime """
    return dt.date()

def get_hour(dt: datetime.datetime):
    """ Get hour from datetime """
    return dt.hour

df["date"] = df['created_at'].map(get_date)
df["time"] = df['created_at'].map(get_time)
df["hour"] = df['created_at'].map(get_hour)
We can easily verify the results:
display(df[["user_name", "date", "time", "hour"]])
Now we have all the needed information, and we are ready to go.
2. General Insights
As we could see from the last screenshot, 199,278 messages were loaded; these are messages with a “#Climate” hashtag, which I collected within several weeks. As a warm-up, let’s answer a simple question: how many messages per day about climate were people posting on average?
First, let’s calculate the total number of days and the total number of users:
days_total = df['date'].unique().shape[0]
print(days_total)
# > 46

users_total = df['user_name'].unique().shape[0]
print(users_total)
# > 79985
As we can see, the data was collected over 46 days, and in total, 79,985 Twitter users posted (or reposted) at least one message with the hashtag “#Climate” during that time. Obviously, we can only count users who made at least one post; alas, we cannot get the number of readers this way.
Let’s find the number of messages per day for each user. First, let’s group the dataframe by user name:
gr_messages_per_user = df.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
The “size” column gives us the number of messages every user sent. I also added the “size_per_day” column, which is easy to calculate by dividing the total number of messages by the total number of days. The result looks like this:
We can see that the most active users post up to 18 messages per day, and the most inactive users posted only one message within this 46-day period (1/46 = 0.0217). Let’s draw a histogram using NumPy and Bokeh:
import numpy as np
from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.transform import factor_cmap, factor_mark, cumsum
from bokeh.palettes import *

output_notebook()

users = gr_messages_per_user['user_name']
amount = gr_messages_per_user['size_per_day']
hist_e, edges_e = np.histogram(amount, density=False, bins=100)

# Draw
p = figure(width=1600, height=500, title="Messages per day distribution")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 0
# p.x_range.end = 150000
p.y_range.start = 0
p.xaxis[0].ticker.desired_num_ticks = 20
p.left[0].formatter.use_scientific = False
p.below[0].formatter.use_scientific = False
p.xaxis.axis_label = "Messages per day, avg"
p.yaxis.axis_label = "Amount of users"
show(p)
The output looks like this:
Apparently, we can see only one bar. Of all 79,985 users who posted messages with the “#Climate” hashtag, the vast majority (77,275 users) sent, on average, less than one message per day. It looks surprising at first glance, but actually, how often do we post tweets about the climate? Honestly, I never did it in all my life. We need to zoom the graph a lot to see other bars on the histogram:
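For reference, this “zoom” is nothing more than narrowing the axis ranges of the same histogram before calling show; a minimal sketch, where the upper bound of 100 is an illustrative value rather than the one used for the screenshot:

# Limit the Y axis so the small bars in the tail become visible
p = figure(width=1600, height=500, title="Messages per day distribution (zoomed)")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 0
p.y_range.start = 0
p.y_range.end = 100
p.xaxis.axis_label = "Messages per day, avg"
p.yaxis.axis_label = "Amount of users"
show(p)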
Only with this zoom level can we see that among all 79,985 Twitter users who posted something about “#Climate”, there are fewer than 100 “activists” posting messages every day! Okay, maybe “climate” is not something people post about daily, but is it the same with other topics? I created a helper function returning the percentage of “active” users who posted more than N messages per day:
def get_active_users_percent(df_in: pd.DataFrame, messages_per_day_threshold: int):
    """ Get percentage of active users with a messages-per-day threshold """
    days_total = df_in['date'].unique().shape[0]
    users_total = df_in['user_name'].unique().shape[0]
    gr_messages_per_user = df_in.groupby(['user_name'], as_index=False).size()
    gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
    users_active = gr_messages_per_user[gr_messages_per_user['size_per_day'] >= messages_per_day_threshold].shape[0]
    return 100*users_active/users_total
Then, using the same Tweepy code, I downloaded data frames for six topics from different domains. We can draw the results with Bokeh:
labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
counts = [get_active_users_percent(df_climate, messages_per_day_threshold=1),
          get_active_users_percent(df_politics, messages_per_day_threshold=1),
          get_active_users_percent(df_cats, messages_per_day_threshold=1),
          get_active_users_percent(df_humour, messages_per_day_threshold=1),
          get_active_users_percent(df_space, messages_per_day_threshold=1),
          get_active_users_percent(df_war, messages_per_day_threshold=1)]

palette = Spectral6
source = ColumnDataSource(data=dict(labels=labels, counts=counts, color=palette))

p = figure(width=1200, height=400, x_range=labels, y_range=(0,9),
           title="Percentage of Twitter users posting 1 or more messages per day",
           toolbar_location=None, tools="")
p.vbar(x='labels', top='counts', width=0.9, color='color', source=source)
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)
The results are interesting:
The most popular hashtag here is “#Cats”. In this group, about 6.6% of users make posts daily. Are their cats just cute, and they cannot resist the temptation? On the contrary, “#Humour” is a popular topic with a large number of messages, but the amount of people who post more than one message per day is minimal. On more serious topics like “#War” or “#Politics”, about 1.5% of users make posts daily. And surprisingly, many more people make daily posts about “#Space” compared to “#Humour”.
To clarify these figures in more detail, let’s find the distribution of the number of messages per user; it is not directly related to message time, but it is still interesting to find the answer:
def get_cumulative_percents_distribution(df_in: pd.DataFrame, steps=200):
    """ Get a distribution of the total percent of messages sent by a percent of users """
    # Group dataframe by user name and sort by the number of messages
    df_messages_per_user = df_in.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
    users_total = df_messages_per_user.shape[0]
    messages_total = df_messages_per_user["size"].sum()

    # Get cumulative messages/users ratio
    messages = []
    percentage = np.arange(0, 100, 0.05)
    for perc in percentage:
        msg_count = df_messages_per_user[:int(perc*users_total/100)]["size"].sum()
        messages.append(100*msg_count/messages_total)
    return percentage, messages
This method calculates the total number of messages posted by the most active users. The number itself can vary strongly for different topics, so I use percentages for both outputs. With this function, we can compare results for different hashtags:
# Calculate
percentage, messages1 = get_cumulative_percents_distribution(df_climate)
_, messages2 = get_cumulative_percents_distribution(df_politics)
_, messages3 = get_cumulative_percents_distribution(df_cats)
_, messages4 = get_cumulative_percents_distribution(df_humour)
_, messages5 = get_cumulative_percents_distribution(df_space)
_, messages6 = get_cumulative_percents_distribution(df_war)

labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
messages = [messages1, messages2, messages3, messages4, messages5, messages6]

# Draw
palette = Spectral6
p = figure(width=1200, height=400,
           title="Twitter messages per user percentage ratio",
           x_axis_label='Percentage of users',
           y_axis_label='Percentage of messages')
for ind in range(6):
    p.line(percentage, messages[ind], line_width=2, color=palette[ind], legend_label=labels[ind])
p.x_range.end = 100
p.y_range.start = 0
p.y_range.end = 100
p.xaxis.ticker.desired_num_ticks = 10
p.legend.location = 'bottom_right'
p.toolbar_location = None
show(p)
Because both axes are “normalized” to 0..100%, it is easy to compare results for different topics:
Again, the result looks interesting. We can see that the distribution is strongly skewed: 10% of the most active users post 50–60% of the messages (spoiler alert: as we will see soon, not all of them are humans ;).
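For a single hashtag, this share can also be checked directly from the grouped dataframe created earlier; a minimal sketch, where the 10% threshold is just the value discussed above:

# What share of all messages comes from the top 10% of users?
# gr_messages_per_user is already sorted by message count in descending order.
top_count = int(0.1 * gr_messages_per_user.shape[0])
top_share = 100 * gr_messages_per_user['size'].iloc[:top_count].sum() / gr_messages_per_user['size'].sum()
print(f"Top 10% of users posted {top_share:.1f}% of all messages")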
This graph was made by a function that is only about 20 lines of code. This “analysis” is pretty simple, but many additional questions can arise. There is a distinct difference between topics, and finding the answer to why it is so is clearly not simple. Which topics have the largest number of active users? Are there cultural or regional differences, and is the curve the same in different countries, like the US, Russia, or Japan? I encourage readers to do some tests on their own.
Now that we’ve got some basic insights, it’s time to do something more challenging. Let’s cluster all users and try to find some common patterns. To do this, we will first need to convert each user’s data into an embedding vector.
3. Making User Embeddings
An embedding vector is a list of numbers that represents the data for each user. In the previous article, I got embedding vectors from tweet words and sentences. Now, because I want to find patterns in the “temporal” domain, I will calculate embeddings based on the message time. But first, let’s find out what the data looks like.
As a reminder, we have a dataframe with all tweets collected for a specific hashtag. Every tweet has a user name, creation date, time, and hour:
Let’s create a helper function to show all tweet times for a specific user:
def draw_user_timeline(df_in: pd.DataFrame, user_name: str):
    """ Draw cumulative messages time for a specific user """
    df_u = df_in[df_in["user_name"] == user_name]
    days_total = df_u['date'].unique().shape[0]

    # Group messages by time of the day
    messages_per_day = df_u.groupby(['time'], as_index=False).size()
    msg_time = messages_per_day['time']
    msg_count = messages_per_day['size']

    # Draw
    p = figure(x_axis_type='datetime', width=1600, height=150, title=f"Cumulative tweets timeline during {days_total} days: {user_name}")
    p.vbar(x=msg_time, top=msg_count, width=datetime.timedelta(seconds=30), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    p.toolbar_location = None
    p.x_range.start = datetime.time(0,0,0)
    p.x_range.end = datetime.time(23,59,0)
    p.y_range.start = 0
    p.y_range.end = 1
    show(p)

draw_user_timeline(df, user_name="UserNameHere")
...
The result looks like this:
Here we can see messages made by some users within several weeks, displayed on a 00–24h timeline. We may already see some patterns here, but as it turned out, there is one problem. The Twitter API does not return a time zone. There is a “timezone” field in the message body, but it is always empty. Maybe when we see tweets in the browser, we see them in our local time; in that case, the original timezone is just not important. Or maybe it is a limitation of the free account. Anyway, we cannot cluster the data properly if one user from the US starts sending messages at 2 AM UTC and another user from India starts sending messages at 1 PM UTC; both timelines just will not match.
As a workaround, I tried to “estimate” the timezone myself by using a simple empirical rule: most people are sleeping at night, and they are highly unlikely to post tweets at that time ;) So, we can find the 9-hour interval where the average number of messages is minimal and assume that this is the “night” time for that user.
from typing import List

def get_night_offset(hours: List):
    """ Estimate the night position by calculating the rolling average minimum """
    night_len = 9
    min_pos, min_avg = 0, 99999
    # Find the minimum position
    data = np.array(hours + hours)
    for p in range(24):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p

    # Move the position right if possible (in case of a long sequence of similar numbers)
    for p in range(min_pos, len(data) - night_len):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p
        else:
            break
    return min_pos % 24

def normalize(hours: List):
    """ Shift the hours array so that the 'night' time is kept on the left """
    offset = get_night_offset(hours)
    data = hours + hours
    return data[offset:offset+24]
Practically, it works well in cases like this, where the “night” period can be easily detected:
Of course, some people wake up at 7 AM and some at 10 AM, and without a time zone we cannot detect that. Anyway, it is better than nothing, and this algorithm can be used as a “baseline”.
Obviously, the algorithm does not work in cases like this:
In this example, we just don’t know if this user was posting messages in the morning, in the evening, or after lunch; there is no information about that. But it is still interesting to see that some users post messages only at a specific time of the day. In this case, having a “virtual offset” is still helpful; it allows us to “align” all user timelines, as we will see soon in the results.
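To make the alignment idea more tangible, here is a small self-check of the two functions on a synthetic activity pattern (the hour counts are invented purely for illustration):

# Synthetic example: no messages during hours 3..12, some activity in the morning and evening
sample_hours = [2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 4, 5, 3, 2, 1, 1, 0, 1, 2]
print(get_night_offset(sample_hours))  # -> 4: the detected start of the 9-hour "night" window
print(normalize(sample_hours))         # the same counts, rotated so the quiet hours come first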
Now let’s calculate the embedding vectors. There can be different ways of doing this. I decided to use vectors in the form [SumTotal, Sum00, ..., Sum23], where SumTotal is the total amount of messages made by a user, and Sum00..Sum23 are the total numbers of messages made during each hour of the day. We can use Pandas’ groupby method with the two parameters “user_name” and “hour”, which does almost all the needed calculations for us:
def get_vectorized_users(df_in: pd.DataFrame):
    """ Get embedding vectors for all users
        Embedding format: [total messages, messages per hour-00, 01, .. 23]
    """
    gr_messages_per_user = df_in.groupby(['user_name', 'hour'], as_index=True).size()

    vectors = []
    users = gr_messages_per_user.index.get_level_values('user_name').unique().values
    for ind, user in enumerate(users):
        if ind % 10000 == 0:
            print(f"Processing {ind} of {users.shape[0]}")
        hours_all = [0]*24
        for hr, value in gr_messages_per_user[user].items():
            hours_all[hr] = value
        hours_norm = normalize(hours_all)
        vectors.append([sum(hours_norm)] + hours_norm)
    return users, np.asarray(vectors)

all_users, vectorized_users = get_vectorized_users(df)
Here, the “get_vectorized_users” method does the calculation. After calculating each 00..24h vector, I use the “normalize” function to apply the “timezone” offset, as described before.
Practically, the embedding vector for a relatively active user may look like this:
[120 0 0 0 0 0 0 0 0 0 1 2 0 2 2 1 0 0 0 0 0 18 44 50 0]
Here 120 is the total number of messages, and the rest is a 24-value array with the number of messages made within every hour (as a reminder, in our case the data was collected over 46 days). For an inactive user, the embedding may look like this:
[4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
Different embedding vectors could be created, and a more sophisticated scheme could provide better results. For example, it could be interesting to add the total number of “active” hours per day or to include the day of the week in the vector to see how a user’s activity varies between working days and weekends, and so on; one possible extension is sketched below.
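Purely as an illustration (this variant is not used in the rest of the article), a day-of-week extension of the embedding could look like the sketch below; the “weekday” column and the function name are my assumptions, derived from the existing “created_at” timestamp:

def get_vectorized_users_ext(df_in: pd.DataFrame):
    """ Sketch of an extended embedding: [total, per-hour 00..23, per-weekday Mon..Sun] """
    df_tmp = df_in.copy()
    df_tmp["weekday"] = df_tmp['created_at'].dt.weekday

    users, vectors = [], []
    for user, df_u in df_tmp.groupby('user_name'):
        # Messages per hour of the day
        hours_all = [0]*24
        for hr, value in df_u.groupby('hour').size().items():
            hours_all[hr] = value
        # Messages per day of the week
        weekdays_all = [0]*7
        for wd, value in df_u.groupby('weekday').size().items():
            weekdays_all[wd] = value
        users.append(user)
        vectors.append([sum(hours_all)] + normalize(hours_all) + weekdays_all)
    return np.asarray(users), np.asarray(vectors)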
4. Clustering
As in the previous article, I will be using the K-Means algorithm to find the clusters. First, let’s find the optimal K value using the elbow method:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

def draw_elbow_graph(x: np.array, k1: int, k2: int, k3: int):
    """ Draw the inertia for each K to find the 'elbow' """
    k_values, inertia_values = [], []
    for k in range(k1, k2, k3):
        print("Processing:", k)
        km = KMeans(n_clusters=k).fit(x)
        k_values.append(k)
        inertia_values.append(km.inertia_)

    plt.figure(figsize=(12,4))
    plt.plot(k_values, inertia_values, 'o')
    plt.title('Inertia for each K')
    plt.xlabel('K')
    plt.ylabel('Inertia')

draw_elbow_graph(vectorized_users, 2, 20, 1)
The result looks like this:
Let’s write the method to calculate the clusters and draw the timelines for some users:
from sklearn.metrics import silhouette_score, silhouette_samples

def get_clusters_kmeans(x, k):
    """ Get clusters using K-Means """
    km = KMeans(n_clusters=k).fit(x)
    s_score = silhouette_score(x, km.labels_)
    print(f"K={k}: Silhouette coefficient {s_score:0.2f}, inertia:{km.inertia_}")

    sample_silhouette_values = silhouette_samples(x, km.labels_)
    silhouette_values = []
    for i in range(k):
        cluster_values = sample_silhouette_values[km.labels_ == i]
        silhouette_values.append((i, cluster_values.shape[0], cluster_values.mean(), cluster_values.min(), cluster_values.max()))
    silhouette_values = sorted(silhouette_values, key=lambda tup: tup[2], reverse=True)
    for s in silhouette_values:
        print(f"Cluster {s[0]}: Size:{s[1]}, avg:{s[2]:.2f}, min:{s[3]:.2f}, max: {s[4]:.2f}")
    print()

    # Create a new dataframe with cluster labels
    data_len = x.shape[0]
    cdf = pd.DataFrame({
        "id": all_users,
        "vector": [str(v) for v in vectorized_users],
        "cluster": km.labels_,
    })

    # Show top clusters
    for cl in silhouette_values[:10]:
        df_c = cdf[cdf['cluster'] == cl[0]]
        # Show cluster
        print("Cluster:", cl[0], cl[2])
        with pd.option_context('display.max_colwidth', None):
            display(df_c[["id", "vector"]][:20])
        # Show first users
        for user in df_c["id"].values[:10]:
            draw_user_timeline(df, user_name=user)
        print()

    return km.labels_

clusters = get_clusters_kmeans(vectorized_users, k=5)
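Finally, since the title mentions t-SNE, here is a minimal sketch (not the article’s own code) of how the resulting clusters could be inspected visually by projecting the 25-dimensional user embeddings onto a 2D plane with scikit-learn’s TSNE; all parameters below are illustrative:

from sklearn.manifold import TSNE

# Project the user embeddings to 2D and color the points by their K-Means cluster label.
# On ~80,000 users this can take a while; subsampling the vectors is a reasonable shortcut.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
points_2d = tsne.fit_transform(vectorized_users)

plt.figure(figsize=(10, 8))
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=clusters, s=2, cmap='tab10')
plt.title("t-SNE projection of user embeddings")
plt.show()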