CrowdTangle Codebook
Written by Christina Fan

Matthew Garmur, Gary King, Zagreb Mukerjee, Nathaniel Persily, and Brandon Silverman

Version 1.1

October 16th, 2019

This document describes the CrowdTangle API and user interface being provided to researchers by Social Science One under its collaboration framework with Facebook, including the scope, structure, and fields of the data. CrowdTangle is a content discovery and analytics platform designed to give content creators the data and insights they need to succeed. The CrowdTangle API surfaces stories, along with data to measure their social performance and identify influencers.


Data Access

To obtain access to these data, see crowdtangle.com/academics.

Recommended capabilities. Research teams should have experience working with data sets that do not fit into memory. Specifically, teams will need the capability to write queries using HQL or SQL and will need to write R and/or Python analysis code that does not exhaust system RAM.
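As a minimal sketch of analysis code that does not exhaust system RAM, the Python below streams a post export one row at a time and keeps only running totals in memory. The column names ("Page Name", "Likes") and the sample rows are assumptions for illustration, not the actual export schema.

```python
import csv
import io
from collections import defaultdict

def total_likes_by_page(rows):
    """Sum likes per page while holding only one row in memory at a time."""
    totals = defaultdict(int)
    for row in rows:
        # Hypothetical column names; adjust to the headers in your export.
        totals[row["Page Name"]] += int(row["Likes"] or 0)
    return dict(totals)

# Tiny in-memory stand-in for a large CSV opened with open(path, newline="").
sample = io.StringIO("Page Name,Likes\nPage A,10\nPage B,5\nPage A,7\n")
totals = total_likes_by_page(csv.DictReader(sample))
print(totals)
```

Because csv.DictReader reads lazily, the same function works unchanged on a file that is far larger than memory.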


Requirements. (1) Each team granted access to this API must participate in the Social Science One community Slack channel to answer occasional questions from other users. (2) All publications that result must include this citation: Garmur, Matt; King, Gary; Mukerjee, Zagreb; Persily, Nate; Silverman, Brandon, 2019, "CrowdTangle Platform and API", https://doi.org/10.7910/DVN/SCCQYD, Harvard Dataverse, V2


Unit of analysis

A Facebook “post” -- a link, text post, video, or image, shared by a public page, group, or (possibly) verified public person who chooses to make their profile public.

Scope

All posts from Facebook that are:

  1. Made by a public page, group, or (possibly) verified public person, who

       • Has ever (since 2014) had > 110k likes, OR

       • Has ever been added to a CrowdTangle list by anyone,

  2. AND are posted without the poster aiming at a particular audience using Facebook targeting and gating tools (e.g., age-gating for alcohol pages, geo-gating if the content has country rights restrictions, targeting to women, etc.)

There are no explicit restrictions on language or country.

 
Variables

A data example can be found here (Google Sheet); it is helpful to follow along with while reading the variable descriptions below. This particular example is based on US General News, from 9/16/2018 to 10/16/2018. It excludes benchmark information.


Summary Information

Example Post:

  1. Name

  2. A Post

  3. Message

  4. Link Text

  5. Link Description

 

 

Name: The page's unique visible and searchable name

Page Category: For Pages only, this section includes the self-described category of the page. View all page categories here.

Page Admin Top Country: For Pages only, the country from which the plurality of page administrators originate.

Page Description: For Pages only, the description for the Page as submitted by the Page administrators.

 
Number of Likes: The size of the page (in terms of Likes on Facebook, not Facebook “followers”) at the time the page posted a specific post. As of January 26, 2021, this will display the number of Followers on a page, rather than Page Likes. You can select either Page Likes or Followers in your Dashboard settings. Learn more here.

 
Created: The date and time a post was officially posted, UTC time zone. Example: 2018-09-21 05:22:27 EDT


Type: The format of the post. For Facebook this includes links, photo, native video, non-native video (i.e. YouTube links), and live video. For Instagram this includes photos and videos in the post stream. Will be one of these text strings: Photo, Native Video, Link, Status, Live Video


Video Length: If the post is a video, the length of the video.

URL: The URL of a post on Facebook.

 
Message: The blurb of the post, written when the post is uploaded.

 
Link: The link the publisher uploaded, which could be a link-shortened URL.

 
Final Link: The unfurled link, if a URL has been shortened.

Image Text: Any text written on the image, scraped using OCR. Learn more about which languages are supported here.


Link Text: The headline of a link URL or the title of a native video. For example, this will often be a news article title.


Description: For link posts, the sub-header of a link URL: the text that shows up under a link, which is set in the HTML of the linked page (by the author of that page, not by the author of the post).

 
Sponsor ID: For branded content, the page ID of the marketer, not the page poster.

 
Branded content (also known as a “handshake”) is a special feature available to certain brands and pages, where a post on a page can be sponsored by a brand for native advertising. This field shows the ID of the brand, which is a number. The ID corresponds to an address: for example, Nike is 15087023444, and facebook.com/15087023444 redirects to facebook.com/nike.
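The mapping from a Sponsor ID to its address can be sketched in a few lines of Python. The helper name sponsor_profile_url and the https:// scheme are assumptions for illustration; the codebook only states that facebook.com/<ID> redirects to the brand's page.

```python
def sponsor_profile_url(sponsor_id):
    """Build the facebook.com/<id> address for a numeric Sponsor ID."""
    sponsor_id = str(sponsor_id).strip()
    if not sponsor_id.isdigit():
        raise ValueError(f"Sponsor ID should be numeric, got {sponsor_id!r}")
    # The https:// prefix is an assumption; the codebook gives only the host and path.
    return f"https://facebook.com/{sponsor_id}"

nike_url = sponsor_profile_url(15087023444)
print(nike_url)
```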

Sponsor Name: For branded content, the Marketer page name.

Sponsor Category: For branded content, the Marketer page category.

Score: Based on CrowdTangle's “overperforming” metric, this is the level at which a post overperformed. Overperformance is computed relative to similar posts from the same page in similar timeframes. For example, high overperformance for a New York Times video posted in the last 15 minutes would mean that the post got more interactions than previously posted New York Times videos did in their first 15 minutes.

The score can be computed with the following equations:

  • Interactions is the total number of interactions (like, share etc.). The default behavior is that all of these are simply added together.

  • The threshold is a minimum set to avoid high variance for small numbers of interactions. For Facebook posts it is 5 likes, 2 of comments/shares/non-like reactions, 100 total page views, and 2 post views. For Instagram it is 5 likes, 2 views, and 2 comments.

  • Benchmark is the smoothed average of interactions for that page for the last 100 similar posts (so if the post is a Fox News video, the last 100 videos from Fox News). To generate the benchmarks, we take the last 100 posts from a given account and of a given post type (link post, image post, etc.). We drop the top and bottom 25% of those 100 posts, and calculate the mean number of interactions that the middle 50% of the posts have at each age (15 minutes old, 60 minutes old, 5 hours old, etc.). For more details on benchmark calculation, see below.

  • The setup for different cases is intended to give a relatively smooth curve for different values of interactions and benchmark, without asymptotic behavior for small values of either.

A more detailed description of scores, along with descriptions of the reasoning behind the logic, can be found here.

Reactions and Interactions:

Each column in this set is the total number of interactions of one kind (likes, comments, shares, etc.) with a post. The totals cover the period between when the post was published and when the API call is made. They do not include interactions with shares of the post or interactions with comments on it.

Reactions:

Users can “react” to posts on Facebook to communicate a range of emotional responses. By clicking the “Like” button displayed beneath each post, the user can “Like” the post, which is the default reaction. By hovering over or long-pressing the Like button, a user can access the full set of reaction types: Like, Wow, Sad, Angry, Love, Haha.

Each person can give only one of the reaction types, and give it only once.

 

 

Likes: The total number of likes on a Facebook post, created by users clicking the thumbs-up “Like” icon. This does not include other reaction forms.

Love: On Facebook, the total number of love reactions on that post, created by users clicking the heart-shaped icon.

Wow: On Facebook, the total number of wow reactions on that post, created by users clicking the Wow face.

Haha: On Facebook, the total number of haha reactions on that post.

Sad: On Facebook, the total number of sad reactions on that post.

Angry: On Facebook, the total number of angry reactions on that post.

 
Other interactions:


Comments: Users can create a top-level comment on a post by clicking the comment button beneath the post. Users can also click the “Reply” button on a comment to create a second-level comment as a reply to the original, and react to each comment in the same way they would to a post.

This field is the total number of top-level comments on a Facebook or Instagram post. "Top-level" means it does not include “threaded comments,” or replies to comments: the Facebook comments are in a two-tier hierarchy, with comments on the post and replies to comments. For privacy reasons, only the first tier is included. See image below for more details.

   

Shares: Users may “share” a post to push the post to their own friends and followers. This field represents the total number of shares off of that post, not including shares of shares.
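As noted under Score, the default behavior is to add all interaction types together. A minimal Python sketch of that sum follows; it assumes each post is a dictionary keyed by the column names used in this codebook, and the sample values are made up.

```python
# Column names follow the fields described above; the dictionary layout is an assumption.
INTERACTION_COLUMNS = ["Likes", "Comments", "Shares", "Love", "Wow", "Haha", "Sad", "Angry"]

def total_interactions(post):
    """Add up every interaction column, treating missing or blank values as zero."""
    return sum(int(post.get(column) or 0) for column in INTERACTION_COLUMNS)

# Made-up values for illustration only.
post = {"Likes": 120, "Comments": 14, "Shares": 9, "Love": 30,
        "Wow": 2, "Haha": 5, "Sad": 0, "Angry": 1}
total = total_interactions(post)
print(total)
```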

 
Video Share Status: On Facebook, whether a native video was originally uploaded as this post or cross-posted from another page. Can be: “original”, “share” or blank.

 
Crossposting is a special Facebook feature available to certain brands/media entities, and means posting from a central video library to several pages. For example, a parent media company might control several local media station pages. It would then be able to crosspost a news video from its central library to local station pages across a state or region. The total number of views for that video across pages is shown here.

 
Post Views: On Facebook, the number of views a native video accumulated directly from that particular post. This does not include video views accumulated from shares of that post.

 
Total Views: The combined views for a native Facebook video of both the views from the parent post, and the shares of that parent post.
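Since Total Views combines the parent post's views with views from shares of that post, the share-driven portion can be recovered by subtraction. The Python below is a small sketch of that arithmetic; the function name and sample numbers are hypothetical.

```python
def views_from_shares(total_views, post_views):
    """Views accumulated from shares of the post: Total Views minus Post Views."""
    if total_views < post_views:
        raise ValueError("Total Views should not be smaller than Post Views")
    return total_views - post_views

# Made-up numbers for illustration only.
shared_views = views_from_shares(total_views=1500, post_views=1200)
print(shared_views)
```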

 
Total Views for all Crossposts: The total number of views, across all crossposts, for a Facebook native video that has been cross-posted on Facebook. See description of crossposting under “Video Share Status”.

 
Benchmarks: The CrowdTangle benchmarks are computed for each post and each interaction type. The benchmarks are used in showing over/underperformance, and roughly correspond to the average number of interactions of that type on similar posts by the same page.

 
Benchmarks are calculated from the last 100 posts across 3 dimensions:

  • Account (New York Times, Nike, etc.)

  • Post Type (photo, video, link, etc.)

  • Age of post (broken into buckets that increase in size as the post ages -- 0-15 minutes old is a bucket, as is 12-15 hours old, as is 6-7 days old)

Within the last 100 posts that share the same 3 dimensions, we sort by each metric (likes, comments, shares, etc.) and then delete the top 25 and bottom 25 to try to account for power-law effects. We then average the middle 50 to get a benchmark for that metric (e.g., likes) for that account (e.g., NYT) for that type (e.g., photo) for that age (e.g., 0-15 minutes old). We do this for every iteration and then compare a post's actual data against the benchmark that matches its profile (e.g., 10 actual likes vs. 5 expected/benchmarked likes).

 
As an example, suppose the New York Times posted a photo 12 minutes ago and we want to compute benchmark Likes. We will consider the last 100 photos posted by the NYT and count how many Likes each one got in the first 15 minutes of posting. Then we will throw out the top and bottom 25 photos by Likes, and average the Likes of the remaining 50.
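The trimming and averaging described above can be sketched in Python as follows. This is an illustration of the stated procedure, not CrowdTangle's actual code; the function name and the made-up like counts are assumptions.

```python
def benchmark(counts):
    """Average the middle half of the counts after dropping the top and bottom quarter."""
    if not counts:
        raise ValueError("at least one count is required")
    ordered = sorted(counts)
    drop = len(ordered) // 4  # 25 posts from each end when there are 100
    middle = ordered[drop:len(ordered) - drop]
    return sum(middle) / len(middle)

# Hypothetical like counts for the last 100 photos in their first 15 minutes.
likes_first_15_minutes = list(range(1, 101))
expected_likes = benchmark(likes_first_15_minutes)
print(expected_likes)
```

With 100 counts, the function drops 25 from each end and averages the remaining 50, matching the description above.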

 
These benchmarks are referred to below as “expected number of likes” etc.

 
Benchmark Likes: The expected number of likes a post of a certain type should have after a given amount of time.

 
Benchmark Comments: The expected number of comments a post should have after a given amount of time.

 
Benchmark Shares: On Facebook, the expected number of shares a post should have after a given amount of time.

 
Benchmark Love: On Facebook, the expected number of love reactions a post should have after a given amount of time.

 
Benchmark Wow: On Facebook, the expected number of wow reactions a post should have after a given amount of time.

 
Benchmark Haha: On Facebook, the expected number of haha reactions a post should have after a given amount of time.

 
Benchmark Sad: On Facebook, the expected number of sad reactions a post should have after a given amount of time.

 
Benchmark Angry: On Facebook, the expected number of angry reactions a post should have after a given amount of time.

 
Benchmark Post Views: For Facebook native videos, the expected number of post-level video views a post should have after a given amount of time.

 
Benchmark Total Views: For Facebook native videos, the expected number of post-level plus shared video views a post should have after a given amount of time.

 
Benchmark Total Views for all Crossposts: For crossposted videos on Facebook, the expected number of crossposted video views a post should have after a given amount of time. See description of crossposting under “Video Share Status”.


Benchmarks and Timesteps in Post CSVs:

In the Benchmarks section of this codebook, “age of post” is one of the 3 dimensions used to create a benchmark. The dimensions are there to cluster posts based on relevant information, and posts typically gain more engagement the longer they exist, so we chose to bucket posts within comparable ages.

Since posts on many social media platforms tend to display more variability early in their lives than later, our time buckets (“timesteps”) start off very short, and grow as they get older. Comparing a popular post that’s 2 hours old to a post that’s 15 minutes old does not seem terribly relevant, whereas comparing a post that’s 19 days and 2 hours old to a post that’s 19 days and 10 hours old could be an actionable comparison.

The timestep sizes follow a shape that mimics a logarithmic curve, though it’s not actually logarithmic. The first timesteps are each 15 minutes long, then they grow to 30 minutes, and eventually end up at a full 24 hours.

End times specified in the link below are exclusive of the actual final moment. For example, “0-15 minutes” means between 0 and anything just under 15 minutes. Once it hits 15 minutes exactly, that becomes part of the “15-30 minutes” timestep. 
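The exclusive end-time rule can be sketched in Python as a half-open interval lookup. The boundaries in EDGES are hypothetical and cover only the first two hours; they are not the actual timestep list.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries in minutes; the real timestep list is not reproduced here.
EDGES = [0, 15, 30, 45, 60, 90, 120]

def timestep(age_minutes):
    """Return the (start, end) bucket containing age_minutes, with end exclusive."""
    if not EDGES[0] <= age_minutes < EDGES[-1]:
        raise ValueError("age falls outside the illustrative boundaries")
    index = bisect_right(EDGES, age_minutes) - 1
    return EDGES[index], EDGES[index + 1]

just_under = timestep(14.99)  # still in the first bucket
exactly_15 = timestep(15)     # moves to the next bucket
print(just_under, exactly_15)
```

Using bisect_right makes an age that lands exactly on a boundary fall into the later bucket, which is the behavior described above.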

Timesteps are listed below:

