# Natural Language Analysis of Steem Posts

*in #utopian-io · 7 years ago (edited)*

*(Banner image: MT Analysis Banner.png)*

# INTRODUCTION

## Abstract

This is a natural language text analysis of the contents of user Posts contained in the Steem blockchain. The analysis is performed with the R language and supporting libraries.

Posts are an important part of the Steem ecosystem and arguably the backbone of the social platforms (portals) that use it.

While traditional BI analysis is useful, it tends to be time-series analysis of discrete data (e.g. trending population size and transaction volume). This analysis differs by trying to mine insights about psychographic, sentiment, or cultural influences.

Rich and thriving social platforms should exhibit thematic word patterns associated with cultural groups or topics of interest. This is a first attempt at identifying these patterns.

The content of a Post includes URLs, emoji, dingbats, and images, as well as text in numerous character sets and encodings. These contributions come from a wide variety of technologies, from smartphones to PCs and from Windows to Android, each with its own text-encoding nuances. Emoji, for example, are poorly supported by MS Windows and will originate mostly from iOS, Android, and OSX users. While emoji use is limited by platform, they appear in sufficient volume to generalize over the population (sampling theory).

## Bias & Exclusions

The Steem blockchain contains a high volume of marginal-value content such as food pictures, meme GIFs, bible quotes, and inspirational or motivational pictures. While users may enjoy consuming this content, it offers little value for NLP or text analysis. Extracting meaningful content from these binary images and videos is an image-processing exercise and out of scope for this analysis. I've therefore excluded several high-ranking categories/tags of predominantly multimedia content.

I've also excluded Korean and Spanish, which rank highly, and I apologize to those native speakers for my ignorance of their languages.

## Target Selection & First Data Draw

The first dataset is drawn from Q1 of 2017. At the time of this analysis, this is the most recent data available from [Steemdata.com](https://steemdata.com/), which had been undergoing engineering work.

With this code we select all the Categories and count the number of Posts they contain.

```r
# Connect to the SteemData MongoDB and count Posts per category for Q1 2017
mdb <- mongo(collection = "Posts", db = "SteemData",
             url = "mongodb://steemit:steemit@mongo1.steemdata.com:27017/SteemData")
cats <- paste('[
  { "$match": { "created": { "$gte": { "$date": "2017-01-01T00:00:00.00Z" },
                             "$lte": { "$date": "2017-03-30T00:00:00.00Z" } } } },
  { "$group": { "_id": { "category": "$category" }, "Post Count": { "$sum": 1 } } },
  { "$project": { "_id": 0, "category": "$_id.category", "Post Count": 1 } },
  { "$sort": { "Post Count": -1 } }
]', sep = "")
categories <- mdb$aggregate(cats)
```

![Extract-1.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516902769/ft0rh62plxfgnxdenz4q.png)

The dataset returns 9,463 distinct category tags. The average number of posts per category is 13 (the median being one). Somewhat surprisingly, the third quartile is two posts, indicating the vast majority of tags are an empty wasteland, with all the action going on in the top 20 or so.
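These quartile figures can be reproduced directly from the aggregation result. A minimal sketch, assuming the `categories` data frame returned by the query above:

```r
# Five-number summary of posts per category
# (assumes `categories` from the aggregation above)
nrow(categories)                         # 9,463 distinct tags
summary(categories$`Post Count`)         # min, quartiles, median, mean, max
quantile(categories$`Post Count`, 0.75)  # 3rd quartile: 2 posts
```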
Given the rather generic nature of these top tags and the low averages, we can assume niche and specialized communities are few and far between (with top-quartile exceptions like [#steemsilvergold](https://steemit.com/trending/steemsilvergold) and [#blockchainbi](https://steemit.com/trending/blockchainbi)).

As mentioned above, we exclude multimedia and non-English categories. The [#Life](https://steemit.com/trending/life) category appears to offer sufficient Post volume for text analysis.

## Target Acquisition & Second Data Draw

The second dataset extracts all the Posts tagged to the #Life category.

```r
mdb <- mongo(collection = "Posts", db = "SteemData",
             url = "mongodb://steemit:steemit@mongo1.steemdata.com:27017/SteemData")

# Extract Jan 2017
raw1 <- mdb$find(query = '{"created": {"$gte": {"$date": "2017-01-01T00:00:00.00Z"},
                                       "$lte": {"$date": "2017-01-31T00:00:00.00Z"}},
                           "category": {"$eq": "life"}}',
                 fields = '{"_id": 0, "body": 1}')

# Extract Feb 2017
raw2 <- mdb$find(query = '{"created": {"$gte": {"$date": "2017-02-01T00:00:00.00Z"},
                                       "$lte": {"$date": "2017-02-28T00:00:00.00Z"}},
                           "category": {"$eq": "life"}}',
                 fields = '{"_id": 0, "body": 1}')

# Extract Mar 2017
raw3 <- mdb$find(query = '{"created": {"$gte": {"$date": "2017-03-01T00:00:00.00Z"},
                                       "$lte": {"$date": "2017-03-31T00:00:00.00Z"}},
                           "category": {"$eq": "life"}}',
                 fields = '{"_id": 0, "body": 1}')
```

I have to break the data into three sets (one per month) because of my crappy, underpowered ten-year-old MacBook.

The query takes 23.63 seconds to run and returns:

- Month 1: 2,432 Posts
- Month 2: 2,316 Posts
- Month 3: 2,905 Posts

Browsing the content of the raw data for Month 1 shows it's pretty messy. Much of the kanji and other foreign-language text will have to be filtered out (coerced to UTF-8 encoding) for analysis, reducing our dataset further.

![rawtext.gif](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516902927/jaijsvgyu99mdy8auyox.gif)

There is a large volume of non-printing meta elements, including hyperlinks. These links are leaking traffic out of the Steem ecosystem to other internet destinations.

This code will extract the URLs, pull out the fully qualified domain names (FQDNs), and count them.

```r
# Extract URLs, reduce them to domains, then count and rank them
urls1 <- rm_url(raw1, replacement = " ", extract = TRUE, trim = FALSE, clean = TRUE)
urls1 <- domain(urls1[[1]])
urls1 <- as.data.frame(urls1, stringsAsFactors = FALSE)
names(urls1) <- c("domain")
urls1 <- sqldf("SELECT [domain], COUNT([domain]) AS [link count]
                FROM urls1
                GROUP BY [domain]
                ORDER BY [link count] DESC
                LIMIT 50")
```

#### January Top Traffic Referral Destinations

The Top 10 are mostly image- and video-hosting sites, with a Content Delivery Network (CDN) in the mix. Nothing too surprising here, with no significant changes month on month.
<p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903000/cedasg47aenix3sfgngx.png" alt="unnamed-chunk-1-1.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903000/cedasg47aenix3sfgngx.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903000/cedasg47aenix3sfgngx.png 2x" /> <h3>February Top Traffic Referral Destinations <p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903026/qenqgzujzrdomwrsbxn7.png" alt="unnamed-chunk-1-2.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903026/qenqgzujzrdomwrsbxn7.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903026/qenqgzujzrdomwrsbxn7.png 2x" /> <h3>March Top Traffic Referral Destinations <p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903048/rxmlibphwhnj8gpjlyff.png" alt="unnamed-chunk-1-3.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903048/rxmlibphwhnj8gpjlyff.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903048/rxmlibphwhnj8gpjlyff.png 2x" /> <h2>Building a Document Corpus <p dir="auto">Before further analysis, we want to preprocess our collection of texts and purge these URLs. We can use the <a href="https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html" target="_blank" rel="noreferrer noopener" title="This link will take you away from hive.blog" class="external_link">Quanteda package to do this. While not perfect it will make a pretty good effort. <pre><code>raw1.1 <- rm_url(raw1, replacement = " ", extract=FALSE, trim=FALSE, clean=TRUE) raw2.1 <- rm_url(raw2, replacement = " ", extract=FALSE, trim=FALSE, clean=TRUE) raw3.1 <- rm_url(raw3, replacement = " ", extract=FALSE, trim=FALSE, clean=TRUE) <p dir="auto">We can now bring the the processed text into a document Corpus; a data structure designed for text analysis. We repeat the code below three times on each dataset resulting in a separate Corpus for Jan, Feb and Mar. <pre><code># Load cleansed posts into a data.frame cps1 <- as.data.frame(raw1.1) # Assign a sequence id to each post cps1$id <- seq.int(nrow(cps1)) # Assign friendly column names colnames(cps1) <- c("text", "id") # Swap/Reverse the column positions cps1 <- cps1[c("id", "text")] # Build a document Corpus Corpus1 <- quanteda::corpus(cps1) <p dir="auto">The Jan Corpus contains 1,478,770 words (of which 58,433 are unique) and 48,434 sentences. <p dir="auto">The Feb Corpus contains 1,419,230 words (of which 56,802 are unique) and 40,146 sentences. <p dir="auto">The Mar Corpus contains 1,353,873 words (of which 55196 are unique) and 44,815 sentences. <p dir="auto">We observe more words were written in Feb despite having two fewer days than Mar. Incidentally, other analysis suggests user account growth between these two months too. More users contributing fewer words is a curious anomaly. <h2>Creating a Document Frequency Matrix (DFM) <p dir="auto">With our three Corpi we can now perform some basic text processing. Specifically we eliminate "Stop Words" and punctuation. Stop words are those with little meaning, such as "and", "the", "a", "an". <p dir="auto">Some Steem specific Stop Words are also removed. 
## Creating a Document Frequency Matrix (DFM)

With our three Corpora we can now perform some basic text processing. Specifically, we eliminate "Stop Words" and punctuation. Stop words are those with little meaning, such as "and", "the", "a", and "an".

Some Steem-specific stop words are also removed. These include stray HTML tags, CSS elements, and line breaks, as well as Steem vocabulary. Given the relative youth of Steem, it is clear users want to talk about the platform itself. Without removing these words, they consistently appear as the most frequent terms, drowning out anything actually related to #Life.

This preprocessing takes about 25 seconds.

```r
# Define some Stop Words
steem_stops <- c("steem", "steemit", "resteem", "upvote", "SBD", "n", "s", "t",
                 "re", "nbsp", "p", "li", "br", "strong", "quot", "img", "height",
                 "width", "src", "center", "em", "html", "de", "href", "h1", "h2",
                 "h3", "960", "720", "en", tm::stopwords("en"))

# Create a DFM and further preprocess
dfm1 <- dfm(Corpus1, tolower = TRUE, stem = FALSE,
            remove = steem_stops, remove_punct = TRUE)

# Calculate and sort Word Frequency
dfm1.1 <- sort(colSums(dfm1), decreasing = TRUE)
dfm1.1.wf <- data.frame(word = names(dfm1.1), freq = dfm1.1)
```

#### January Top 10 Word Frequency

```
   can   will    one people   like   time   life   just    get    day
  3807   3660   2985   2887   2802   2794   2708   2524   2019   1850
```

#### February Top 10 Word Frequency

```
   can   will   time people    one   like   life   just    get    day
  3555   3056   2581   2563   2560   2452   2126   2107   1839   1576
```

#### March Top 10 Word Frequency

```
   can   will    one people   time   like   life   just    get     us
  3700   3201   2934   2762   2660   2622   2340   2339   1834   1708
```

![unnamed-chunk-6-4.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903203/eiuohztwlaf9ygspdbkg.png)

Similar words reappear consistently, with "can" being the consistent top verb. Collective nouns ("people", "us") are common, but without action verbs we can't infer what these persons might be up to. I was expecting to see words like "yoga", "meditation", "happiness", "gratitude", etc.

Phrasal verbs might give more insight, but this will require assembly of bi-grams (sketched below). Additional time and more serious compute resources would be required for this.
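A minimal sketch of that bi-gram assembly using quanteda's `tokens_ngrams()`, reusing the `steem_stops` list defined above (untested on this dataset; the top-20 cut is an illustrative choice):

```r
library(quanteda)

# Tokenize, strip punctuation and stop words, then form bi-grams
toks1 <- tokens(Corpus1, remove_punct = TRUE)
toks1 <- tokens_remove(toks1, steem_stops)
bigrams1 <- tokens_ngrams(toks1, n = 2, concatenator = " ")

# Rank the bi-grams by frequency to surface phrasal verbs
dfm_bi <- dfm(bigrams1)
topfeatures(dfm_bi, 20)
```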
# Assess Topics with a Latent Dirichlet Allocation (LDA) Model

In an attempt to gain more insight into what users are thinking and feeling, we can attempt to mine out word groupings with a word cluster analysis. We hope these word clusters can identify Topics and Themes.

```r
library(topicmodels)

# Fit a 5-topic LDA model on the DFM and list the top 10 terms per topic
dfm1LDAFit <- LDA(convert(dfm1, to = "topicmodels"), k = 5)
get_terms(dfm1LDAFit, 10)
```

After playing around with different parameters (number of groups and words per group), we find no obvious themes in the clusters.

#### January Topic Clusters

```
##       Topic 1  Topic 2 Topic 3 Topic 4     Topic 5
##  [1,] "one"    "life"  "can"   "can"       "will"
##  [2,] "will"   "make"  "will"  "people"    "can"
##  [3,] "life"   "us"    "just"  "time"      "time"
##  [4,] "time"   "one"   "us"    "like"      "want"
##  [5,] "day"    "now"   "day"   "know"      "people"
##  [6,] "people" "much"  "like"  "get"       "just"
##  [7,] "just"   "like"  "think" "one"       "make"
##  [8,] "get"    "know"  "get"   "see"       "like"
##  [9,] "always" "can"   "go"    "even"      "now"
## [10,] "really" "way"   "one"   "something" "good"
```

![unnamed-chunk-6-5.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903247/qxhygnn9lqgxlpypwwih.png)

#### February Topic Clusters

```
##       Topic 1  Topic 2  Topic 3     Topic 4  Topic 5
##  [1,] "can"    "one"    "will"      "life"   "can"
##  [2,] "one"    "life"   "can"       "people" "people"
##  [3,] "get"    "time"   "time"      "like"   "get"
##  [4,] "people" "many"   "also"      "one"    "just"
##  [5,] "like"   "like"   "something" "new"    "even"
##  [6,] "will"   "will"   "just"      "see"    "will"
##  [7,] "day"    "see"    "know"      "make"   "day"
##  [8,] "us"     "us"     "like"      "time"   "time"
##  [9,] "make"   "now"    "take"      "day"    "want"
## [10,] "time"   "things" "work"      "just"   "much"
```

![unnamed-chunk-6-6.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903274/ivihh5k4zeyi1gnvkroo.png)

#### March Topic Clusters

```
##       Topic 1  Topic 2 Topic 3  Topic 4  Topic 5
##  [1,] "life"   "time"  "can"    "one"    "one"
##  [2,] "will"   "will"  "will"   "people" "life"
##  [3,] "people" "want"  "people" "time"   "like"
##  [4,] "just"   "can"   "get"    "just"   "people"
##  [5,] "know"   "like"  "time"   "much"   "get"
##  [6,] "also"   "just"  "us"     "know"   "good"
##  [7,] "first"  "one"   "like"   "things" "something"
##  [8,] "us"     "life"  "love"   "jpg"    "just"
##  [9,] "good"   "many"  "want"   "can"    "back"
## [10,] "like"   "need"  "feel"   "now"    "things"
```

![unnamed-chunk-6-7.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903296/m2lne0gkkbwvtyuenagj.png)

## Retargeting & Refocusing

At this point I'm beginning to think this Category is full of rather generic, non-specific, and uninteresting abstract material. This might seem obvious given the name, but I was hoping to see themes or subgroupings. So I decided to compare it against other Category Tags.

Given the steep drop-off in Post volume and the exclusions mentioned earlier, there aren't many to choose from.
<p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903372/tfbjq2a576vcuw8eqklj.png" alt="Extract-1.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903372/tfbjq2a576vcuw8eqklj.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903372/tfbjq2a576vcuw8eqklj.png 2x" /> <p dir="auto">I drew data from January 2017 for the categories <a href="https://steemit.com/trending/health" target="_blank" rel="noreferrer noopener" title="This link will take you away from hive.blog" class="external_link">#Health and <a href="https://steemit.com/trending/travel" target="_blank" rel="noreferrer noopener" title="This link will take you away from hive.blog" class="external_link">#Travel <pre><code> raw2<- mdb$find(query='{"created": {"$gte": {"$date": "2017-01-01T00:00:00.00Z"}, "$lte": {"$date": "2017-01-31T00:00:00.00Z"} },"category": {"$eq" : "health"} }', fields='{"_id":0, "body":1}') raw3<- mdb$find(query='{"created": {"$gte": {"$date": "2017-01-01T00:00:00.00Z"}, "$lte": {"$date": "2017-01-31T00:00:00.00Z"} },"category": {"$eq" : "travel"} }', fields='{"_id":0, "body":1}') <p dir="auto">There isn't much data to work with. <ul> <li>Life Category : 2432 Posts <li>Health Category: 379 Posts <li>Travel Category: 379 Posts <p dir="auto">It appears that #Health and #Travel also contain a large number of URLs referring traffic out of the Steemit platforms. These are a similar mix of media hosts and CDNs. <h4>Health Category Referral Destinations <p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903408/njgupvvh2pmzkviaa3zn.png" alt="unnamed-chunk-1-2.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903408/njgupvvh2pmzkviaa3zn.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903408/njgupvvh2pmzkviaa3zn.png 2x" /> <p dir="auto">I can't explain the curious appearance of <a href="http://saramiller.com/" target="_blank" rel="noreferrer noopener" title="This link will take you away from hive.blog" class="external_link">saramiller in this list. <h4>Travel Category Referral Destinations <p dir="auto"><img src="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903433/jxpqlmvezekg3ik0dg1y.png" alt="unnamed-chunk-1-3.png" srcset="https://images.hive.blog/768x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903433/jxpqlmvezekg3ik0dg1y.png 1x, https://images.hive.blog/1536x0/https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903433/jxpqlmvezekg3ik0dg1y.png 2x" /> <p dir="auto">If we create a Corpus for Health and Travel and recalculate their Word Frequency we can compare them to the Life category. <p dir="auto">Given the significanty fewer posts in these categories I have to tune the Word Frequency parameters to observe the top performers. The Word Clouds, have a minimum frequency of 500. 
#### Comparing Word Frequency

![unnamed-chunk-6-4.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903458/pr7xyyjugnw2ej8tzkez.png)

#### Health

![unnamed-chunk-6-2.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903483/cpgs9a0jwmbj9isegwva.png)
![unnamed-chunk-6-6.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903498/tyb7oqg9herfrpqzfauu.png)

#### Travel

![unnamed-chunk-6-3.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903523/tpm4yke7mxqxn2qpeakq.png)
![unnamed-chunk-6-7.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516903532/atx8k9apcvszuqx22d6g.png)

# Emoji & Emoticons

This is based on the very impressive work by [Jessica Peterka-Bonetta](https://github.com/today-is-a-good-day/emojis). I won't repost her code, or show how inelegantly I butchered it. I also credit [Tim Whitlock](https://apps.timwhitlock.info/emoji/tables/unicode) for his invaluable online resource.

Extracting and counting the emoji in #Life, #Health, and #Travel shows many similarities.
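The gist of her approach, sketched here rather than quoted: transcode the post bodies so each emoji becomes a stable ASCII byte rendering, then match against a reference dictionary and count. The `emDict.csv` lookup ships with her GitHub repository; the column names used below are assumptions about that file.

```r
library(stringr)

# Reference dictionary mapping emoji byte renderings to names
# (emDict.csv is from the today-is-a-good-day/emojis repo;
#  column names here are assumptions)
emDict <- read.csv2("emDict.csv", stringsAsFactors = FALSE)

# Render each post body's emoji as ASCII byte sequences, e.g. "<e2><9d><a4>"
texts <- iconv(raw1$body, from = "UTF-8", to = "ASCII", sub = "byte")

# Count occurrences of every dictionary emoji across all posts
counts <- sapply(emDict$ftu8, function(e) sum(str_count(texts, fixed(e))))
emojiFreq <- data.frame(description = emDict$EN,
                        unicode = emDict$unicode,
                        count = counts)
head(emojiFreq[order(-emojiFreq$count), ], 10)
```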
#### LIFE - Top Emoji for month of Jan 2017

| # | Emoji | Description | Unicode | Count |
|---|-------|-------------|---------|-------|
| 1 | ⤴ | Right Arrow Curving Up | U+2934 | 37 |
| 2 | © | Copyright | U+00A9 | 31 |
| 3 | ♥ | Heart Suit | U+2665 | 23 |
| 4 | Ⓜ | Circled M | U+24C2 | 9 |
| 5 | ™ | Trade Mark | U+2122 | 8 |
| 6 | ❤ | Red Heart | U+2764 | 3 |
| 7 | ® | Registered | U+00AE | 2 |
| 8 | ❄ | Snowflake | U+2744 | 2 |
| 9 | ✈ | Airplane | U+2708 | 1 |
| 10 | 😄 | Smiling Face | U+1F604 | 1 |

#### HEALTH - Top Emoji for month of Jan 2017

| # | Emoji | Description | Unicode | Count |
|---|-------|-------------|---------|-------|
| 1 | © | Copyright | U+00A9 | 7 |
| 2 | ® | Registered | U+00AE | 5 |
| 3 | ❄ | Snowflake | U+2744 | 2 |
| 4 | ✌ | Victory Hand | U+270C | 2 |
| 5 | ♂ | Male Sign | U+2642 | 1 |
| 6 | ❤ | Red Heart | U+2764 | 1 |
| 7 | ☠ | Skull & Crossbones | U+2620 | 1 |
| 8 | ☀ | Sun | U+2600 | 1 |

#### TRAVEL - Top Emoji for month of Jan 2017

| # | Emoji | Description | Unicode | Count |
|---|-------|-------------|---------|-------|
| 1 | © | Copyright | U+00A9 | 65 |
| 2 | ✈ | Airplane | U+2708 | 5 |
| 3 | ✔ | Heavy Check Mark | U+2714 | 2 |
| 4 | ⤴ | Right Arrow Curving Up | U+2934 | 2 |
| 5 | ® | Registered | U+00AE | 1 |
| 6 | ✌ | Victory Hand | U+270C | 1 |

I was somewhat surprised to see the "copyright", "registered", and "trade mark" emoji appearing so dominantly. For a community of open-source advocates, I'd expect a more liberal, re-sharing mindset. However, this is a small sample size and there may be confounding factors, such as user accounts belonging to commercial entities.

There are too few sentiment emoji for sentiment analysis at this time. With a bigger dataset we can attempt to [score sentiment](http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html) with the weights defined in the [paper by P. Kralj Novak, J. Smailović, B. Sluban & I. Mozetič](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296).
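A minimal sketch of that future scoring step, assuming an `emojiFreq` count table like the one above and a `sentiment` table taken from the Emoji Sentiment Ranking page; the `unicode` and `sentiment_score` column names are assumptions:

```r
# Join emoji counts with the published sentiment weights
# (hypothetical column names; adjust to the actual table headers)
scored <- merge(emojiFreq, sentiment, by = "unicode")

# Frequency-weighted mean sentiment for the category
weighted.mean(scored$sentiment_score, w = scored$count)
```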
<p dir="auto"><br /><hr /><em>Posted on <a href="https://utopian.io/utopian-io/@morningtundra/natural-language-analysis-of-steem-posts" target="_blank" rel="noreferrer noopener" title="This link will take you away from hive.blog" class="external_link">Utopian.io - Rewarding Open Source Contributors<hr /><p>

Hey @morningtundra I am @utopian-io. I have just upvoted you!

Achievements

  • You have less than 500 followers. Just gave you a gift to help you succeed!
  • Seems like you contribute quite often. AMAZING!

Community-Driven Witness!

I am the first and only Steem Community-Driven Witness. Participate on Discord. Let's GROW TOGETHER!


Up-vote this comment to grow my power and help Open Source contributions like this one. Want to chat? Join me on Discord https://discord.gg/Pc8HG9x

Would love to learn more about this; I'll check out the group.


Terrific analytics man, this is very useful information for people looking to target certain demos. Bravo -- subscribed!

I'm a lil smarter after reading this post. Thanks.

Thank you
Hopefully by then I'll have a better computer! :D -- I hope❤️

:-) You should see this piece of junk - on its 3rd battery, 2nd power supply, 2nd screen, and 2nd HDD. It's been dropped, splashed, and frozen (in my car during a snow storm). It's a survivor for sure.

Thank you so much for sharing this, as well as including the links. Greatly appreciated :)

You can gain more attention for your post; just make it a little more attractive. Here is how you can do it!
https://steemit.com/steemit/@teamnepal/7-ways-make-your-post-get-real-attention

Great post!

This is very interesting. I'd be interested to see what this analysis would look like if you only looked at posts over a certain reward amount and filtered out auxiliary verbs like "can, will, would, should."

I see steemdata is nearly caught up. I might have another run at this in a few days, after my surgery.

Thanks for sharing.

I wonder how this would contrast if you did an analysis against Medium's content. Of course, Steemit is younger than Medium. I wonder when Steemit's content will reach the quality of Medium's posts. I wonder if that comparison is fair.

I hope with time it’ll get there as I’m getting rather tired of the spam and food pics. It feels like the early days of IGram.

Haha. Tired of those color contests, too.

Steemit on! Follow me back.

Hello friend... you are incredible, thank you for this detailed information. Blessings!

Yes, a good thing, Mr. morningtundra.

One who started early always has an advantage.

Great post; I love how you outline your process.

As an aside, how large a portion of Steemit consists of Spanish or Korean speakers? I wonder if it's a big slice.

That's a good question I might try to answer in a future post. I'd have to count posts containing Spanish. Counting other foreign character sets will be harder without help from native speakers. I can see from just eyeballing the data that nearly half use foreign character sets.

Thanks for answering, that answers my question. Not sure if that's useful information, though.

Thank you so much for sharing this and including the links, very good :)

We are all a strong family on Steemit.