A Twitter Developer account is required to run this script.

In [1]:
import tweepy
import json
import pandas as pd
import numpy as np
In [2]:
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_rows', 20)

Read Twitter API credentials

Read Twitter API credentials from twitter-credentials.json. The JSON file should contain the following key/value pairs:

{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
    "bearer_token": "YOUR_BEARER_TOKEN"
}

Create twitter-credentials.json with your own keys and tokens in the same folder as this Jupyter notebook.
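
A missing or empty key in the credentials file only surfaces later as an authentication error, so it can help to check the file right after loading it. The following is a minimal sketch, not part of the original notebook; REQUIRED_KEYS and missing_credential_keys are hypothetical names introduced here, and the example uses an in-memory JSON string rather than the real file.

```python
import json

# Keys the notebook reads from twitter-credentials.json.
REQUIRED_KEYS = {
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
    "bearer_token",
}


def missing_credential_keys(credentials):
    """Return the required keys that are absent or empty, sorted by name."""
    return sorted(k for k in REQUIRED_KEYS if not credentials.get(k))


# Example with an in-memory JSON string instead of the real file.
sample = json.loads('{"consumer_key": "abc", "bearer_token": ""}')
print(missing_credential_keys(sample))
```

An empty list means every expected key is present and non-empty.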

In [3]:
with open('twitter-credentials.json', 'r') as f:
    twitter_credentials = json.load(f)

Create Tweepy client with credentials

Initialize tweepy.Client (Twitter API v2).

In [4]:
client = tweepy.Client(
    consumer_key=twitter_credentials["consumer_key"],
    consumer_secret=twitter_credentials["consumer_secret"],
    access_token=twitter_credentials["access_token"],
    access_token_secret=twitter_credentials["access_token_secret"],
    bearer_token=twitter_credentials["bearer_token"],
    wait_on_rate_limit=True
)

Create a function to retrieve tweets

This function handles pagination manually: it passes the next_token from each response into the following request until max_num_tweets tweets have been retrieved or no further pages remain. Tweepy also provides tweepy.Paginator, which automates the same token handling.
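
The token-following loop can be illustrated without network access. The sketch below is an assumption-laden stand-in, not the notebook's code: fetch_page is a hypothetical function that imitates an API returning a page of data plus a next_token, and collect follows that token the same way the function in this section does.

```python
def fetch_page(items, page_size, token=None):
    """Hypothetical stand-in for an API call: returns (data, next_token)."""
    start = int(token) if token else 0
    end = start + page_size
    data = items[start:end]
    # A token is only returned while more items remain.
    next_token = str(end) if end < len(items) else None
    return data, next_token


def collect(items, limit, page_size=100):
    """Follow next_token until `limit` items are collected or pages run out."""
    collected = []
    token = None
    first_request = True
    while len(collected) < limit and (first_request or token):
        data, token = fetch_page(items, page_size, token)
        collected.extend(data)
        first_request = False
    return collected


source = list(range(250))
print(len(collect(source, limit=1000)))  # source is exhausted after three pages
print(len(collect(source, limit=150)))   # stops once the limit is reached
```

Because whole pages are appended, the second call returns 200 items rather than exactly 150; the loop only checks the limit between requests.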

In [5]:
def retrieve_tweets(query, max_num_tweets=100, sort_order="recency"):
    response = None
    df = None
    num_tweets_retrieved = 0
    next_token = None
    author_id_username_map = {}
    
    print('==========================================')
    print(f'Retrieving tweets using query: {query}')
    
    while (max_num_tweets > num_tweets_retrieved) and ((response is None) or next_token):
        max_results = max(min(max_num_tweets - num_tweets_retrieved, 100), 10)
        
        print(f'num_tweets_retrieved={num_tweets_retrieved}, max_results={max_results}, next_token={next_token}')
        
        response = client.search_recent_tweets(
            query=query,
            max_results=max_results,
            sort_order=sort_order,
            next_token=next_token,
            user_fields=["name" , "username"],
            tweet_fields=["author_id"],
            expansions=["entities.mentions.username"]
        )
        
        # response.data is None when the page contains no tweets
        if not response.data:
            break

        new_df = pd.DataFrame(response.data)
        
        if 'edit_history_tweet_ids' in new_df.columns:
            new_df.drop(columns=['edit_history_tweet_ids'], inplace=True)
        num_tweets_retrieved += len(response.data)
        
        ids = new_df['author_id'].unique().tolist()

        # look up usernames for this page's authors; data is None if no user could be resolved
        for user in client.get_users(ids=ids).data or []:
            author_id_username_map[user['id']] = user['username']
            
        new_df['username'] = new_df['author_id'].map(author_id_username_map)

        if df is None:
            df = new_df
        else:
            df = pd.concat([df, new_df], ignore_index=True)
        
        # extract next_token for subsequent call
        next_token = response.meta.get('next_token')
        
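    if df is None:
        # no tweets matched; return an empty frame with the expected columns
        return pd.DataFrame(columns=['id', 'username', 'text', 'entities'])

    # reindex keeps the column order and adds an all-NaN 'entities' column if no tweet had entities
    df = df.reindex(columns=['id', 'username', 'text', 'entities'])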

    return df
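
The max_results expression in retrieve_tweets clamps the number of tweets still needed into the range the recent search endpoint accepts (10 to 100 per request). A small sketch of that arithmetic follows; page_size is a hypothetical helper name used only for illustration.

```python
def page_size(target, retrieved, api_min=10, api_max=100):
    """Clamp the number of tweets still needed into the API's allowed range."""
    return max(min(target - retrieved, api_max), api_min)


print(page_size(5000, 0))     # 100
print(page_size(5000, 4950))  # 50
print(page_size(5000, 4999))  # 10
```

The last case explains why a run can return slightly more than max_num_tweets: when fewer than 10 tweets remain, the request still asks for 10. In the logged run for the first query, 4999 tweets had been retrieved before the final request, and the resulting frame has 5009 rows.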

Invoke retrieve_tweets() with a keyword

The first invocation queries English-language tweets from verified users, excluding retweets.

The second invocation queries only English-language retweets from all users (both verified and non-verified).
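
Both query strings are built from the same operators (is:retweet, is:verified, lang:), so they can be assembled from flags. The helper below is a hypothetical sketch, not part of the original notebook; build_query and its parameters are names assumed for illustration.

```python
def build_query(keyword, retweets, verified_only=False, lang="en"):
    """Assemble a search query string from the operators used in this notebook."""
    parts = [f'"{keyword}"']  # quoted for an exact phrase match
    parts.append("is:retweet" if retweets else "-is:retweet")
    if verified_only:
        parts.append("is:verified")
    parts.append(f"lang:{lang}")
    return " ".join(parts)


print(build_query("ChatGPT", retweets=False, verified_only=True))
print(build_query("ChatGPT", retweets=True))
```

The two printed strings match the queries passed to retrieve_tweets() in the cell that follows.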

In [6]:
search_keyword = 'ChatGPT'

df_tweets = retrieve_tweets(
    query=f'"{search_keyword}" -is:retweet is:verified lang:en',
    max_num_tweets=5000
)

df_retweets = retrieve_tweets(
    query=f'"{search_keyword}" is:retweet lang:en',
    max_num_tweets=5000
)

display(df_tweets.head(3))
print(f"df_tweets has {df_tweets.shape[0]} row(s)")
display(df_retweets.head(3))
print(f"df_retweets has {df_retweets.shape[0]} row(s)")
==========================================
Retrieving tweets using query: "ChatGPT" -is:retweet is:verified lang:en
num_tweets_retrieved=0, max_results=100, next_token=None
num_tweets_retrieved=100, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zzuniq3oq41sg2rtdf9njx
num_tweets_retrieved=200, max_results=100, next_token=b26v89c19zqg8o3fqk70du2ejjroey2sg7m0cmvlv9ksd
num_tweets_retrieved=300, max_results=100, next_token=b26v89c19zqg8o3fqk70du23l079iimfv5q4bx09sq2yl
num_tweets_retrieved=400, max_results=100, next_token=b26v89c19zqg8o3fqk70du1i37bvfy5t8s81007yi3r7h
num_tweets_retrieved=500, max_results=100, next_token=b26v89c19zqg8o3fqk70drz6ascr52oai4o9mfvfgi16l
num_tweets_retrieved=600, max_results=100, next_token=b26v89c19zqg8o3fqk70drxz9mi8p1fnh1m6qjgc4h8cd
num_tweets_retrieved=700, max_results=100, next_token=b26v89c19zqg8o3fqk70drx2orhj03hsx9atnbe2w08e5
num_tweets_retrieved=800, max_results=100, next_token=b26v89c19zqg8o3fqk70dpu5j3aoy18onnqfql4tnrm2l
num_tweets_retrieved=900, max_results=100, next_token=b26v89c19zqg8o3fqk70dpt8zryl6o6xdz2eymc6ubkhp
num_tweets_retrieved=1000, max_results=100, next_token=b26v89c19zqg8o3fqk70dpscsjwg8g7l9w4xj6oeixkzh
num_tweets_retrieved=1100, max_results=100, next_token=b26v89c19zqg8o3fqk70dprr34uqr0p8zfxuquzy8m1dp
num_tweets_retrieved=1200, max_results=100, next_token=b26v89c19zqg8o3fqk70dnpfas0fpgab565rt60gdk29p
num_tweets_retrieved=1300, max_results=100, next_token=b26v89c19zqg8o3fqk70dnp4c8fmtmqe51l9y1w3bv3b1
num_tweets_retrieved=1400, max_results=100, next_token=b26v89c19zqg8o3fqk70dnoiyyl8fajas81hkce93rn99
num_tweets_retrieved=1500, max_results=100, next_token=b26v89c19zqg8o3fqk70dno80hczfu2xxqffcg9nl2dfh
num_tweets_retrieved=1600, max_results=100, next_token=b26v89c19zqg8o3fqk70dnnx6k08p0kmofauxw0wl3jb1
num_tweets_retrieved=1700, max_results=100, next_token=b26v89c19zqg8o3fqk70dnnmk5ud90iy2jqs6vjvzhful
num_tweets_retrieved=1800, max_results=100, next_token=b26v89c19zqg8o3fqk70dnnbusrycxzpnxnv26sr2it8d
num_tweets_retrieved=1900, max_results=100, next_token=b26v89c19zqg8o3fqk70dnn0usaed9vh41hi32ueu8f3x
num_tweets_retrieved=2000, max_results=100, next_token=b26v89c19zqg8o3fqk6zz1a0poszl1ggzu693zegi1ev1
num_tweets_retrieved=2100, max_results=100, next_token=b26v89c19zqg8o3fqk6zz19f4tu60tmiyhdmzvckm743h
num_tweets_retrieved=2200, max_results=100, next_token=b26v89c19zqg8o3fqk6zz1944t52xubi3726tqhexxlz1
num_tweets_retrieved=2300, max_results=100, next_token=b26v89c19zqg8o3fqk6zz18ijzo6hujro6m35aqvi89kt
num_tweets_retrieved=2400, max_results=100, next_token=b26v89c19zqg8o3fqk6zz17mit1cejsluljjfsxlgkk8t
num_tweets_retrieved=2500, max_results=100, next_token=b26v89c19zqg8o3fqk6zyz5aftm53i7g77sw4z3mmws59
num_tweets_retrieved=2600, max_results=100, next_token=b26v89c19zqg8o3fqk6zyz4ebmsqklqezf7zswmd9cepp
num_tweets_retrieved=2700, max_results=100, next_token=b26v89c19zqg8o3fqk6zyz3snqle5vjz301g6mbau59j1
num_tweets_retrieved=2800, max_results=100, next_token=b26v89c19zqg8o3fqk6zyz372uzpj0actvengjm2vmvst
num_tweets_retrieved=2900, max_results=100, next_token=b26v89c19zqg8o3fqk6zyx0v2vbj6gctvmfyqe9f4gvi5
num_tweets_retrieved=3000, max_results=100, next_token=b26v89c19zqg8o3fqk6zyx09o3hjxnxyx89hjzp5doku5
num_tweets_retrieved=3100, max_results=100, next_token=b26v89c19zqg8o3fqk6zywznypxtpbn7y8zfnjnw0stbx
num_tweets_retrieved=3200, max_results=100, next_token=b26v89c19zqg8o3fqk6zywz2aso2o9062qu4p93e48u4d
num_tweets_retrieved=3300, max_results=100, next_token=b26v89c19zqg8o3fqk6zywygw1waj9rngjf43lueb8osd
num_tweets_retrieved=3400, max_results=100, next_token=b26v89c19zqg8o3fqk6zywy5xiyv3bodgblchi2ats0hp
num_tweets_retrieved=3500, max_results=100, next_token=b26v89c19zqg8o3fqk6zyuw4t0k2un1eow3hzlxyxhchp
num_tweets_retrieved=3600, max_results=100, next_token=b26v89c19zqg8o3fqk6zyuvu240oebxqvb5s6ka08lunx
num_tweets_retrieved=3700, max_results=100, next_token=b26v89c19zqg8o3fqk6zyuvjcpw5nv6unkypefkg44upp
num_tweets_retrieved=3800, max_results=100, next_token=b26v89c19zqg8o3fqk6zyuuxwesoy5gtd4heg6ldxr4vx
num_tweets_retrieved=3900, max_results=100, next_token=b26v89c19zqg8o3fqk6zyuuc8idjpixne6sdy4a5zacu5
num_tweets_retrieved=4000, max_results=100, next_token=b26v89c19zqg8o3fqk6zyutqj3c0p44586eoionlbvgql
num_tweets_retrieved=4100, max_results=100, next_token=b26v89c19zqg8o3fqk6zysr47dznyt549ihptlztiyxkt
num_tweets_retrieved=4200, max_results=100, next_token=b26v89c19zqg8o3fqk6zysq7r2732m7qm4qdudgoo235p
num_tweets_retrieved=4300, max_results=100, next_token=b26v89c19zqg8o3fqk6zysopmw441hrhwh1ws13c9bukd
num_tweets_retrieved=4400, max_results=100, next_token=b26v89c19zqg8o3fqk6zyqkl9yy5tjr6dwarjb8wl7ddp
num_tweets_retrieved=4500, max_results=100, next_token=b26v89c19zqg8o3fqk6zyogrddgk16zhlk956oxcgwvi5
num_tweets_retrieved=4600, max_results=100, next_token=b26v89c19zqg8o3fqk6zymd8ibpuc7ns8qyv3mwlrpo59
num_tweets_retrieved=4700, max_results=100, next_token=b26v89c19zqg8o3fqk6zymc1b3dmdopieamqpst35ocxp
num_tweets_retrieved=4800, max_results=100, next_token=b26v89c19zqg8o3fqk6zyk87fz0s1g8sfdghsj7o2it8d
num_tweets_retrieved=4900, max_results=100, next_token=b26v89c19zqg8o3fqk6zyi36f7pyda6x2stla0l5c4eil
num_tweets_retrieved=4999, max_results=10, next_token=b26v89c19zqg8o3fqk6zjvo2pmewu4i54qp9abvs7bhbx
==========================================
Retrieving tweets using query: "ChatGPT" is:retweet lang:en
num_tweets_retrieved=0, max_results=100, next_token=None
num_tweets_retrieved=100, max_results=100, next_token=b26v89c19zqg8o3fqk70du3awtof4uojjedysj0dgnywt
num_tweets_retrieved=200, max_results=100, next_token=b26v89c19zqg8o3fqk70du3awseklq1wtctfln027w0al
num_tweets_retrieved=300, max_results=100, next_token=b26v89c19zqg8o3fqk70du3avb9s16t065ae6ultq4v3x
num_tweets_retrieved=400, max_results=100, next_token=b26v89c19zqg8o3fqk70du3av9zpz7z89bcsu8awy6lj1
num_tweets_retrieved=500, max_results=100, next_token=b26v89c19zqg8o3fqk70du3atsfmiji9m37gzyrlz7tkt
num_tweets_retrieved=600, max_results=100, next_token=b26v89c19zqg8o3fqk70du3atr5ml4tva2m05t6sypocd
num_tweets_retrieved=700, max_results=100, next_token=b26v89c19zqg8o3fqk70du3as9lnhinxzlkoi3xm2e4cd
num_tweets_retrieved=800, max_results=100, next_token=b26v89c19zqg8o3fqk70du3as8qv879rvk61jlnxz8xh9
num_tweets_retrieved=900, max_results=100, next_token=b26v89c19zqg8o3fqk70du3aqrehddifsrv25r2b3snst
num_tweets_retrieved=1000, max_results=100, next_token=b26v89c19zqg8o3fqk70du3aqqjp3lqgkjsl9dwa2habh
num_tweets_retrieved=1100, max_results=100, next_token=b26v89c19zqg8o3fqk70du3aqphgxyta5sdw5ryk74dx9
num_tweets_retrieved=1200, max_results=100, next_token=b26v89c19zqg8o3fqk70du3ap84ytpmjij4itlfh61wu5
num_tweets_retrieved=1300, max_results=100, next_token=b26v89c19zqg8o3fqk70du3ap7a9rzquh61tgi05ifpml
num_tweets_retrieved=1400, max_results=100, next_token=b26v89c19zqg8o3fqk70du30agl3p0z2ejsuzcg8y3tdp
num_tweets_retrieved=1500, max_results=100, next_token=b26v89c19zqg8o3fqk70du30afq871vh1yby4sbwdjfr1
num_tweets_retrieved=1600, max_results=100, next_token=b26v89c19zqg8o3fqk70du308y6ca4ck77a6t8e166we5
num_tweets_retrieved=1700, max_results=100, next_token=b26v89c19zqg8o3fqk70du308x3wkfe8hsrww4fbzle2l
num_tweets_retrieved=1800, max_results=100, next_token=b26v89c19zqg8o3fqk70du307fju7p5unefsswfva7799
num_tweets_retrieved=1900, max_results=100, next_token=b26v89c19zqg8o3fqk70du307ehb9yf3pkx4zzmk7w4u5
num_tweets_retrieved=2000, max_results=100, next_token=b26v89c19zqg8o3fqk70du305wpu48vghwon0qqcmet8d
num_tweets_retrieved=2100, max_results=100, next_token=b26v89c19zqg8o3fqk70du305vnhm5azfyp027wew8xa5
num_tweets_retrieved=2200, max_results=100, next_token=b26v89c19zqg8o3fqk70du304eb2pc84rngjon2lph37h
num_tweets_retrieved=2300, max_results=100, next_token=b26v89c19zqg8o3fqk70du304d8o2u3rv0hifrqty9mkd
num_tweets_retrieved=2400, max_results=100, next_token=b26v89c19zqg8o3fqk70du302vwcdxinwaespmusbclml
num_tweets_retrieved=2500, max_results=100, next_token=b26v89c19zqg8o3fqk70du302v1k4x4j3fj6ya6560hvh
num_tweets_retrieved=2600, max_results=100, next_token=b26v89c19zqg8o3fqk70du302u6u1u6n1z9bwbx3c6xrx
num_tweets_retrieved=2700, max_results=100, next_token=b26v89c19zqg8o3fqk70du301cuicryibsazs84fpkj99
num_tweets_retrieved=2800, max_results=100, next_token=b26v89c19zqg8o3fqk70du301bzp0vo4nwtxcmrotd6rh
num_tweets_retrieved=2900, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zzufwd2jzyi59uqe0oxbwd
num_tweets_retrieved=3000, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zztl5617d7ovkerq32gd4t
num_tweets_retrieved=3100, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zzsqbtzj7ysp2soa6nefst
num_tweets_retrieved=3200, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zyb6dscdak49lehm27mc8t
num_tweets_retrieved=3300, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zyabh7xwnwfowkb6c16oot
num_tweets_retrieved=3400, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zwsk273cafb50t0lf0h8ql
num_tweets_retrieved=3500, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zwrhmi073z5s3kks0em071
num_tweets_retrieved=3600, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pi0s9wyit4ppkjxif5z7ul
num_tweets_retrieved=3700, max_results=100, next_token=b26v89c19zqg8o3fqk70du2phzxa3ql3q0lvsxye98osd
num_tweets_retrieved=3800, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pgidaz28zyq22dynne9h19
num_tweets_retrieved=3900, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pghb0nyh50nhsx6r7yv4ot
num_tweets_retrieved=4000, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pezylq8lvodykytriln799
num_tweets_retrieved=4100, max_results=100, next_token=b26v89c19zqg8o3fqk70du2peywenmzhxn3d5gi9nxhml
num_tweets_retrieved=4200, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pdhrl0ofvcyn2on9x2ok8t
num_tweets_retrieved=4300, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pdh4llkfxpzyjjwt49zdrx
num_tweets_retrieved=4400, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pdghl4hdrn35q9l2bs8nzx
num_tweets_retrieved=4500, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pdfujionmbg01lmm8nm459
num_tweets_retrieved=4600, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pbypqzdetirzwxhtxadmrh
num_tweets_retrieved=4700, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pbxv1yetw2mvthnve8akql
num_tweets_retrieved=4800, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pagb62uow9s53klf6xium5
num_tweets_retrieved=4900, max_results=100, next_token=b26v89c19zqg8o3fqk70du2pafgcqcibwzb9jr611glfh
  | id                  | username      | text | entities
0 | 1623376134317690881 | thedailybeast | Plus! Weill and Sommer recap a bombshell investigation into Eliza Bleu, an overnight internet sensation who rose to prominence by becoming an ombu... | NaN
1 | 1623376131570413568 | thedailybeast | Also on the podcast, Weill and Sommer interview @trevoraaronson, the host of investigative podcast the Alphabet Boys, which tells the story of FBI... | {'mentions': [{'start': 48, 'end': 63, 'username': 'trevoraaronson', 'id': '22180521'}]}
2 | 1623376128651182081 | thedailybeast | “Ultimately, I think this AI stuff is awful,” Sommer says. “I think the people in 'Dune' knew what was up when they banned AI and it seemed like t... | NaN
df_tweets has 5009 row(s)
  | id                  | username        | text | entities
0 | 1623376268245798939 | afaf11140627605 | RT @acrianetwork: https://t.co/SZ1H9KCR87 will take #ChatGPT to the next level: We enable the sharing of training data 💡\n\n10 Million AINF T… | {'mentions': [{'start': 3, 'end': 16, 'username': 'acrianetwork', 'id': '1344595602479468546'}]}
1 | 1623376261480587265 | ayon_parvez     | RT @EstadoLatente: The ΔI𝚝𝚊𝚗𝚜 are a new breed of being\n❤️‍🔥\nSubmission teaser for @runwayml's #AIFilm Festival. Pictures, animation, poems… | {'mentions': [{'start': 3, 'end': 17, 'username': 'EstadoLatente', 'id': '1137227054'}, {'start': 82, 'end': 91, 'username': 'runwayml', 'id': '10...
2 | 1623376255163764740 | LanieGirl00     | RT @StopTechnocracy: AI will totally disrupt the knowledge world in 2023\nhttps://t.co/b3tIxqIHpC | {'mentions': [{'start': 3, 'end': 19, 'username': 'StopTechnocracy', 'id': '77918673'}]}
df_retweets has 5000 row(s)

Save as CSV files

The "entities" column is dropped before saving. Although it can be useful, mentions can be extracted from the tweet text with regular expressions later if required.
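
As a rough illustration of the regular-expression approach, the sketch below pulls @mentions out of tweet text. It is an approximation under stated assumptions: usernames are taken to be 1 to 15 ASCII letters, digits, or underscores, and MENTION_RE and extract_mentions are hypothetical names. Mentions cut off by a truncated text ("…") may be missed or shortened.

```python
import re

# Assumed username shape: 1-15 ASCII letters, digits, or underscores.
# The lookbehind skips "@" inside words (e.g. email addresses); the
# lookahead rejects candidates longer than 15 characters.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")


def extract_mentions(text):
    """Return the usernames mentioned in a tweet's text, in order of appearance."""
    return MENTION_RE.findall(text)


print(extract_mentions("RT @StopTechnocracy: AI will totally disrupt the knowledge world in 2023"))
```

On a DataFrame loaded from the saved CSV, the same function could be applied with df['text'].apply(extract_mentions).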

In [7]:
# safeguard only: retrieve_tweets() already returns just id, username, text and entities
for d in [df_tweets, df_retweets]:
    if 'withheld' in d.columns:
        d.drop(columns=['withheld'], inplace=True)

df_tweets.drop(columns=['entities']).to_csv(f'{search_keyword}-tweets.csv', index=False)
df_retweets.drop(columns=['entities']).to_csv(f'{search_keyword}-retweets.csv', index=False)