Code for replicating "Right and left, partisanship predicts vulnerability to misinformation" by Dimitar Nikolov, Alessandro Flammini and Filippo Menczer


Introduction

In this repository, you can find code and instructions for reproducing the plots from "Right and left, partisanship predicts vulnerability to misinformation" by Dimitar Nikolov, Alessandro Flammini, and Filippo Menczer.

To start, clone the repo:

$ git clone https://github.com/dimitargnikolov/twitter-misinformation.git

Run all subsequent commands from the directory where you cloned the repo.

Datasets

There are three datasets you need to obtain. Before you begin, create a data directory at the root of the repo.

Link Sharing on Twitter

This dataset contains link-sharing actions that occurred on Twitter during June 2017. It is available on the Harvard Dataverse. Download it and place it in the data directory as domain-shares.data (see the layout below).

Political Valence

This is a dataset from Facebook that assigns political valence scores to several popular news sites. You can request access to it on Dataverse. Once you have access, put the top500.csv file into the data directory.
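The workflow below appears to read top500.csv as a comma-delimited file with a single header row, with the domain in the first column and the valence score in the second (see the create_domain_list.py invocation in the Snakefile snippets). Under those assumptions, a quick way to peek at the file is:

import csv

# Peek at the political-valence data. The column layout is assumed from the
# Snakefile rule (-domain1 0 -data1 1 -delim1 , -skip1 1); the actual
# Dataverse file may differ.
with open('data/top500.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in list(reader)[:5]:
        print(row[0], row[1])  # domain, valence score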

Misinformation

This is a dataset of manually curated sources of misinformation available at OpenSources.co. Clone it from GitHub into your data directory:

$ git clone https://github.com/BigMcLargeHuge/opensources.git data/opensources

data Directory

Once you obtain all data as described above, your data directory should look like this:

data
├── domain-shares.data
├── opensources
│   ├── CONTRIBUTING.md
│   ├── LICENSE
│   ├── README.md
│   ├── badges.txt
│   ├── releasenotes.txt
│   └── sources
│       ├── sources.csv
│       └── sources.json
└── top500.csv
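If you want to double-check the layout before running the workflow, a minimal sanity check based on the paths above is:

from pathlib import Path

# Verify that the key input files from the layout above are in place.
expected = [
    'data/domain-shares.data',
    'data/top500.csv',
    'data/opensources/sources/sources.csv',
    'data/opensources/sources/sources.json',
]
for p in expected:
    print('ok     ' if Path(p).exists() else 'MISSING', p)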

Environment

Make sure you have Python 3 installed on your system. Then, set up a virtualenv with the required modules at the root of the cloned repository:

$ virtualenv -p python3 VENV
$ source VENV/bin/activate
$ pip install -r requirements.txt

From now on, any time you want to run the analysis, activate your virtual environment with:

$ source VENV/bin/activate

Workflow

The replication code is contained in the .py files in the scripts directory. You can automate their execution with the provided Snakemake workflow:

$ cd workflow
$ snakemake -p

The execution will display the actual shell commands being executed, so you can run them individually if you want. You can inspect the workflow/Snakefile file to see how the inputs and outputs for each script are specified. In addition, you can execute each script with

$ python <script_name.py> --help

to learn about what it does.

At the end of the execution, the generated plots will be in the data directory.

To regenerate the plots from scratch, run the following from the workflow directory:

$ snakemake clean
$ snakemake -p

Contact

If you have any questions about running this code or obtaining the data, please open an issue in this repository and we will get back to you as soon as possible.

Code Snippets

The Python snippet below is scripts/create_domain_list.py; the shell fragments that follow are excerpts from the rules in workflow/Snakefile.

import os
import argparse
import csv
import logging
from operator import itemgetter
from utils import domain
from config import DEBUG_LEVEL

logging.basicConfig(level=DEBUG_LEVEL)


def read_domain_data(filepath, domain_col, data_cols_to_read, delimiter, skip_rows):
    domains = {}
    row_count = 0 # for debugging

    with open(filepath, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)

        # skip headers
        if skip_rows is not None:
            for _ in range(skip_rows):
                row_count += 1
                next(reader)

        # process the rows
        for row in reader:
            if domain_col >= len(row):
                raise ValueError('Invalid domain index: {}, {}'.format(domain_col, ', '.join(row)))
            d = domain(row[domain_col])

            if d in domains:
                logging.info('Domain has already been processed: {}. Overwriting with new values.'.format(d))
                logging.info('Existing data: {}'.format(', '.join(domains[d])))
                logging.info('New data: {}'.format(', '.join(row)))

            new_row = []
            if data_cols_to_read is not None:
                for idx in data_cols_to_read:
                    if idx >= len(row):
                        raise ValueError('Invalid index: {}, {}'.format(idx, ', '.join(row)))
                    elif idx == domain_col:
                        logging.info('Data column the same as the domain column. Skipping.')
                    else:
                        new_row.append(row[idx])

            row_count += 1
            domains[d] = new_row
    return domains


def main():
    parser = argparse.ArgumentParser(
        description=('Create a list of domains with standardized URLs. '
                     'Do this either from a primary CSV or from a provided list.'),
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('dest_file', type=str, 
                        help=('Destination file for the combined data. '
                              'The normalized domain will always be in the first column, '
                              'followed by the columns to keep from the primary file, '
                              'followed by the columns to keep from the secondary file.'))

    parser.add_argument('-p', '--primary_csv', type=str, 
                        help=('A CSV file containing domains. '
                              'All domains from this file will be kept in the final output.'))

    parser.add_argument('-s', '--secondary_csv', type=str, default=None,
                        help=('A CSV containing domain data. '
                              'Domains from this file will not be kept '
                              'unless they appear in the primary file.'))

    parser.add_argument('-domain1', '--primary_domain_col', type=int, default=0,
                        help='The column with the domain in the primary source file.')

    parser.add_argument('-data1', '--primary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the primary source file to keep in the output.')

    parser.add_argument('-delim1', '--primary_delim', type=str, default='\t',
                        help='The delimiter in the primary source file.')

    parser.add_argument('-skip1', '--primary_skip_rows', type=int, default=0,
                        help='The number of header rows in the primary source file to skip.')

    parser.add_argument('-domain2', '--secondary_domain_col', type=int, default=0,
                        help='The column with the domain in the secondary source file.')

    parser.add_argument('-data2', '--secondary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the secondary source file to keep in the output.')

    parser.add_argument('-delim2', '--secondary_delim', type=str, default='\t',
                        help='The delimiter in the secondary source file.')

    parser.add_argument('-skip2', '--secondary_skip_rows', type=int, default=0,
                        help='The number of header rows in the secondary source file to skip.')

    parser.add_argument('-ddelim', '--dest_delim', type=str, default='\t',
                        help='The delimiter in the destination file.')

    parser.add_argument('-dhead', '--dest_col_headers', type=str, nargs='+',
                        help=('The column headers in the destination file. '
                              'Must match the number of columns being kept from both source files, '
                              'plus the first column for the domain.'))

    parser.add_argument('-exclude', '--exclude_domains', type=str, nargs='+',
                        help='A list of domains to exclude.')

    parser.add_argument('-include', '--include_domains', type=str, nargs='+',
                        help='A list of additional domains to include in the final list.')

    args = parser.parse_args()

    if os.path.dirname(args.dest_file) != '' and not os.path.exists(os.path.dirname(args.dest_file)):
        os.makedirs(os.path.dirname(args.dest_file))

    if (args.primary_csv is None or not os.path.exists(args.primary_csv)) and args.include_domains is None:
        raise ValueError('No input provided.')

    # read the CSVs
    logging.debug('Reading primary file.')
    if args.primary_csv is not None:
        primary_data = read_domain_data(
            args.primary_csv,
            args.primary_domain_col,
            args.primary_data_cols,
            args.primary_delim,
            args.primary_skip_rows
        )
    else:
        primary_data = {}

    if args.include_domains is not None:
        for raw_d in args.include_domains:
            d = domain(raw_d)
            if d not in primary_data:
                primary_data[d] = []

    logging.debug('Reading secondary file.')
    if args.secondary_csv is not None:
        secondary_data = read_domain_data(
            args.secondary_csv,
            args.secondary_domain_col,
            args.secondary_data_cols,
            args.secondary_delim,
            args.secondary_skip_rows
        )
    else:
        secondary_data = {}

    # combine the data from both files into rows
    excluded_domains = frozenset(args.exclude_domains) if args.exclude_domains is not None else frozenset()
    combined_rows = []
    for d in primary_data.keys():
        if d in excluded_domains:
            logging.info('Skipping {}'.format(d))
            continue
        new_row = [d]
        new_row.extend(primary_data[d])
        if d in secondary_data:
            new_row.extend(secondary_data[d])
        combined_rows.append(new_row)
    sorted_data = sorted(combined_rows)

    # write the data to the dest file
    logging.debug('Writing combined file.')
    with open(args.dest_file, 'w') as f:
        writer = csv.writer(f, delimiter=args.dest_delim)
        if args.dest_col_headers is not None:
            sorted_data.insert(0, args.dest_col_headers)
        else:
            sorted_data.insert(0, ['domain'])

        writer.writerows(sorted_data)


if __name__ == '__main__':
    main()
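Note that the script imports a domain() normalizer from utils, which is not included in this snippet. A rough sketch of the kind of URL-to-domain normalization it presumably performs (the actual implementation in scripts/utils.py may differ) is:

from urllib.parse import urlparse

def domain(url):
    # Hypothetical stand-in for utils.domain(): reduce a URL or bare hostname
    # to a lowercase domain. The real utils.py implementation may differ.
    netloc = urlparse(url).netloc if '//' in url else url.split('/')[0]
    netloc = netloc.split(':')[0].strip().lower()  # drop any port, lowercase
    if netloc.startswith('www.'):
        netloc = netloc[len('www.'):]
    return netloc

print(domain('https://www.Example.com/some/page'))  # example.com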
shell:
    '''
    rm -rf {data_dir}/tweets/clean {data_dir}/tweets/with-* {data_dir}/tweets/only-* \
           {data_dir}/counts {data_dir}/sources \
           {data_dir}/indexed-tweets {data_dir}/measures \
           {data_dir}/plots
    '''.format(**config)
    shell:
        'python {code_dir}/scripts/clean_tweets.py {{threads}} {{input}} {{output}}'.format(**config)

    shell:
        'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Count"'.format(**config)

    shell:
        'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.news}} \
    -domain1 0 -data1 1 -delim1 , -skip1 1 \
    -dhead domain "political bias" \
    -ddelim $'\t' \
    -exclude en.wikipedia.org amazon.com vimeo.com m.youtube.com youtube.com whitehouse.gov twitter.com
    '''.format(**config)
shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.misinfo}} \
    -domain1 0 -data1 1 2 3 -delim1 , -skip1 1 \
    -dhead domain type1 type2 type3 \
    -ddelim $'\t'
    '''.format(**config)
shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -s {{input}} \
    -domain2 0 -data2 1 -delim2 $'\t' -skip2=0 \
    -dhead domain pagerank \
    -ddelim $'\t' \
    -include Snopes.com PolitiFact.com FactCheck.org OpenSecrets.org TruthOrFiction.com HoaxSlayer.com
    '''.format(**config)
    shell:
        'python {code_dir}/scripts/strip_tweets.py {{threads}} --domains={{input.domains}} {{input.tweets}} {{output}}'.format(**config)

    shell:
        'python {code_dir}/scripts/expand_misinfo_dataset.py {{threads}} {{input}} {{wildcards.misinfo_type}} {{output}}'.format(**config)

    shell:
        'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

    shell:
        'python {code_dir}/scripts/strip_tweets.py {{threads}} --users={{input.users}} {{input.tweets}} {{output}}'.format(**config)

    shell:
        'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

    shell:
        'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

    shell:
        'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Counts"'.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
           --min_num_tweets={tweets_to_sample}
    '''.format(**config)
shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
           --num_tweets={tweets_to_sample}
    '''.format(**config)
shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
           --num_tweets={tweets_to_sample} --use_partition
    '''.format(**config)
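In these rules, {code_dir}, {data_dir}, {index_level}, and {tweets_to_sample} are filled in from a config mapping via .format(**config), while doubled braces such as {{input}}, {{output}}, and {{threads}} are escaped so that Snakemake substitutes them later for each rule. A minimal illustration of that two-stage substitution, with made-up config values, is:

# Sketch of the two-stage substitution used by the Snakefile rules above.
# The config keys mirror the placeholders in the snippets; the values here
# are illustrative only.
config = {'code_dir': '..', 'tweets_to_sample': 10}

template = ('python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} '
            '--num_tweets={tweets_to_sample}')

cmd = template.format(**config)
print(cmd)
# python ../scripts/compute_hbias.py {threads} {input} {output} --num_tweets=10
# Snakemake then fills in {threads}, {input}, and {output} per rule at run time.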