Thirty-Two Years of IEEE VIS


This repository contains the data files and code (data processing & analysis) for the paper Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations.

Updated Findings

In Fig. 3(d) and 3(e), we showed that the number of citations to VIS papers from non-VIS papers has been increasing dramatically, but we did not analyze the publication venues of these citing papers. We later did this analysis and found that citations coming from IEEE Transactions on Visualization and Computer Graphics accounted for 12.4% of all 153,549 citations (undeduplicated). Citations from Computer Graphics Forum, HCI venues, PacificVis, and journals in the field of visualization such as Information Visualization and Journal of Visualization are also major sources. This indicates that the impact of VIS is mostly confined to the visualization and HCI areas. Detailed results are available at https://hongtaoh.com/files/top_venues.html.

For the replicability committee:

Please go to the reproduce folder and simply run bash script.sh.


Structure

This repository consists of four folders:

  1. analyses_and_get_figures contains Jupyter notebooks that generate the statistics and figures reported in the Results section of our paper.

  2. data contains the data files we created and analyzed.

  3. results contains the output figures generated by the code in analyses_and_get_figures. Figures from both the paper and the supplementary material are included.

  4. workflow contains (1) scripts to obtain data, and (2) Jupyter notebooks to validate data.

analyses_and_get_figures and results are easy to understand. The most difficult and critical parts are workflow and data. For detailed data generation & processing procedures, refer to workflow. For detailed descriptions of the data generated and used in the study, refer to the data folder.

Important data

The most important data files in analysis are as follows:

  1. data/ht_class/ht_cleaned_author_df.csv

  2. data/ht_class/ht_cleaned_paper_df.csv

  3. data/interim/openalex_author_df.csv

  4. data/processed/openalex_concept_df.csv

  5. data/processed/large/openalex_citation_concept_df.csv

  6. data/processed/large/openalex_reference_concept_df.csv

  7. data/processed/openalex_refeernce_concept_df_unique.csv

Data dictionaries for public data

We have also made data that might be useful for other researchers working on scientometric analysis available on Google Sheets: https://docs.google.com/spreadsheets/d/1JRo33XurW28bGK_Snplno1dbRLDkSZf1T7JmpjNDvTw/

VIS PAPER 1990-2021

  • Conference: The conference track of VIS papers. There are four tracks: InfoVis, SciVis, VAST, and vis. Since 2021, IEEE VIS no longer distinguishes between conference tracks, so we assigned the term 'VIS' to all papers published in and after 2021

  • Year: The year this paper was published

  • Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEEVIS papers)

  • DOI: Paper DOI

  • PaperType: either 'J' (Journal paper) or 'C' (conference paper). This data is from vispubdata . For IEEEVIS 2021 papers, we classified them all as 'J'

  • OpenAlex ID: The OpenAlex ID associated with this paper. With an ID, for example, W3203914472, you can access this paper's metadata on OpenAlex through https://api.openalex.org/works/W3203914472 (see the example after this list)

  • Number of References: Number of references as shown on OpenAlex (as of June 2022)

  • Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)

  • Number of Citations: Number of citations as shown on OpenAlex (as of June 2022)

  • Number of Authors: Number of authors

  • Cross-type Collaboration: Whether a paper involves collaborations among researchers from universities and non-educational affiliations (e.g., companies, facilities, government, healthcare, etc.)

  • Cross-country Collaboration: Whether a paper involves collaborations among researchers from different countries or regions

  • With US Authors: Whether a paper involves at least one author from the United States

  • Both Cross-type and Cross-country Collaboration: Whether a paper is both a cross-type and a cross-country collaboration paper

  • Google Scholar Citation: Citation counts as shown on Google Scholar (as of June 2022)

  • Award: Whether a paper is an award-winning paper. Note that we exclude Test of Time awards

  • Award Name: The award an award-winning paper received. BP: Best Paper; HM: Honorable Mention; BCS: Best Case Study

  • Award Track: The conference track that presented this award to the paper
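
As a concrete example, here is a minimal sketch of pulling one paper's metadata from the OpenAlex API with requests, using the example work ID mentioned in the OpenAlex ID entry above (field names such as title and cited_by_count follow the OpenAlex works schema):

import requests

# fetch metadata for one VIS paper by its OpenAlex work ID
work = requests.get('https://api.openalex.org/works/W3203914472').json()
print(work['title'])
print(work['publication_year'])
print(work['cited_by_count'])  # citation count as recorded by OpenAlex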

VIS AUTHORS 1990-2021

  • Year: The year this paper was published

  • DOI: Paper DOI

  • Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEEVIS papers)

  • Number of Authors: Number of authors

  • Author Position: Author position

  • Author Name: Author name

  • OpenAlex Author ID: OpenAlex author ID

  • Affiliation Name: Author affiliation name

  • Affiliation country code: alpha-2 (ISO 3166) country code for affiliations

  • Affiliation Type: The type of an affiliation, as defined by ROR

  • Binary Type: The type of an affiliation, either education or non-education
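
As a quick sketch of how the corresponding CSV (data/ht_class/ht_cleaned_author_df.csv) can be used, assuming the column is named 'Affiliation Country Code' as in the processing scripts below:

import pandas as pd

# load the cleaned author table and count author rows per affiliation country
author_df = pd.read_csv('data/ht_class/ht_cleaned_author_df.csv')
print(author_df['Affiliation Country Code'].value_counts().head())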

VIS PAPER CONCEPTS

  • Year: The year this paper was published

  • DOI: Paper DOI

  • Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEEVIS papers)

  • Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)

  • Index of Concept: Index of Concept as shown on OpenAlex (as of June 2022)

  • Concept: Concept name

  • Concept ID: Concept ID on OpenAlex

  • Wikidata: Link to Wikidata page of a Concept

  • Level: The level of this Concept as defined by OpenAlex. Level 0 indicates root Concepts like Computer Science and Psychology. The larger the number, the more granular the Concept.

  • Score: The score assigned to this Concept by OpenAlex. A higher score indicates this Concept is a better representation of a paper.

Google Scholar Citations

  • Year: The year this paper was published

  • DOI: Paper DOI

  • IEEE Title: Paper title as shown on IEEE Xplore (as of June 2022)

  • Title on Google Scholar: Paper title as shown on Google Scholar (as of June 2022)

  • Citation Link: Link to papers citing a VIS paper on Google Scholar (as of June 2022)

  • Citation Counts on Google Scholar: Citation counts on Google Scholar (as of June 2022)

Large data

The large folder within data/processed is empty because GitHub does not allow uploading files larger than 100 MB. Large files are stored in the OSF repository at https://osf.io/zkvjm/ (OSF Storage -> large).

Dependencies

This project uses python 3.8 with the following packages:

snakemake
pandas
numpy
matplotlib
seaborn
altair
scikit-learn
scipy
plotnine
beautifulsoup4
selenium
urllib3
requests
lxml

All packages can be installed with pip install pkgname, for example, pip install scikit-learn. For lxml, use conda install -c anaconda lxml.

snakemake is used for the workflow. For details, see my tutorial on snakemake.
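
If you are new to snakemake, the following is a minimal, illustrative rule (not our actual Snakefile; the file names and script below are hypothetical) showing how snakemake chains an input to an output via a command. Running snakemake --cores 1 builds any output that is missing or out of date.

rule example_rule:
    input:
        "data/raw/example_input.csv"
    output:
        "data/processed/example_output.csv"
    shell:
        "python example_script.py {input} {output}"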

For citation analysis, we also used R . See citation_analysis.R .

For python, we recommend conda and creating a virtual environment. After installing anaconda, you can create a virtual environment:

conda create --name 32vis python=3.8
conda activate 32vis

Then you can install packages with conda or pip .

You can also use the environment.yml and requirements.yml files, but they contain many packages that are not used at all.
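
If you do want to use environment.yml, and assuming it is a standard conda environment file, you can recreate the environment directly:

conda env create -f environment.yml
conda activate 32vis   # or whatever environment name is specified inside environment.yml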

Reproducibility

Our work is designed to be reproducible.

Re-generate data?

If you want to reproduce our work from the very beginning, after installing the necessary packages mentioned above, you can delete all folders in the data folder except for raw and README.md.

Then:

conda activate 32vis
cd workflow
snakemake --cores 1

This will generate all data again. Please note that:

  1. We obtained data from the OpenAlex API. However, OpenAlex updates its data every two weeks, which means the data you get will differ from ours. The degree of difference is a function of time; for example, if you recreate the data ten years from now, your data will be very different from ours.

  2. Crawling Google Scholar requires human participation due to reCAPTCHA security checks.

After all data is obtained, you can run all files in analyses_and_get_figures to reproduce our results.

Okay with our current data?

If you don't plan to re-generate all the data but just want to reproduce results based on the data we already have, you can simply run all files in analyses_and_get_figures directly.
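
For example, one (optional) way to execute a notebook non-interactively is jupyter nbconvert, assuming jupyter is installed; the notebook name below is just a placeholder:

jupyter nbconvert --to notebook --execute --inplace analyses_and_get_figures/your_notebook.ipynb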

Citation

@article{hao2022thirty,
 title={Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations},
 author={Hao, Hongtao and Cui, Yumian and Wang, Zhengxiang and Kim, Yea-Seul},
 journal={IEEE Transactions on Visualization and Computer Graphics},
 year={2022},
 doi={10.1109/TVCG.2022.3209422},
 publisher={IEEE}
}

Code Snippets

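The snippet below trains a logistic-regression classifier (CountVectorizer + LogisticRegression) on raw affiliation strings to predict affiliation country codes, writes a classification report, and applies the model (plus hand-coded corrections) to the merged author table.
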
import sys
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from collections import Counter
from bs4 import BeautifulSoup

def get_simple_df(fname):
	"""
		- remove nan, 
		- get only two target columns, i.e., raw string and aff type
		- drop duplicates
	"""
	raw_string = 'Raw Affiliation String'
	aff_type = 'First Institution Country Code'
	df = pd.read_csv(fname)
	df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
	df = df[[raw_string, aff_type]]
	df = df.drop_duplicates()
	return df

def get_df(cit_author, ref_author, oa_author):
	"""concatenate, drop_duplicates, reset index, rename columns,
		factorize label_str

	Returns:
		the df used for model training and testing. It contains three columns:
			1. aff, which is pre-processed strings of affiliations
			2. label_str, which is country codes in strings,
			3. label: which is factorized version of country codes
	"""

	df = pd.concat(
		[oa_author, ref_author, cit_author], ignore_index = True
		).drop_duplicates().reset_index(drop=True)
	df.columns = ['aff', 'label_str']
	df = df.assign(label = pd.factorize(df['label_str'])[0])
	return df 

def get_dicts(df):
	"""get two dicts; id <--> cntry
	"""
	cntry_to_id = dict(zip(df.label_str, df.label))
	id_to_cntry = dict(zip(df.label, df.label_str))
	return cntry_to_id, id_to_cntry

def clean_text(text):
    """
    Takes a string and returns a string
    """
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text 
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff

def logist_regression(df):
	'''
	Input: 
		df: df
	Returns:
		logreg: logistic regression model
	'''
	X = df.aff
	y = df.label
	X_train, X_test, y_train, y_test = train_test_split(
		X, y, test_size=0.2, random_state = 42)
	logreg = Pipeline([('vect', CountVectorizer(stop_words='english', min_df = 5)),
				('clf', LogisticRegression(max_iter=600)),
			   ])
	print('model training now...')
	logreg.fit(X_train, y_train)

	y_train_pred = logreg.predict(X_train)
	y_test_pred = logreg.predict(X_test)

	target_names = list(set([id_to_cntry[x] for x in y_test]))

	f = open(CNTRY_CLASSIFICATION_REPORT,'a')
	f.write('The following is the result for affiliation country code classification' + '\n')
	f.write('Test set accuracy %s' % accuracy_score(y_test_pred, y_test))
	f.write('\n')
	precision, recall, fscore, support = multi_score(
		y_test, 
		y_test_pred, 
		average='weighted'
	)
	f.write('precision: {}'.format(precision))
	f.write('\n')
	f.write('recall: {}'.format(recall))
	f.write('\n')
	f.write('fscore: {}'.format(fscore))
	f.write('\n')
	f.write('support: {}'.format(support))
	f.write('\n')
	f.write('\n')
	f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
	# f.write(classification_report(y_test, y_test_pred, target_names=target_names))
	f.close()

	return logreg

def get_processed_merged_author(DF, LOGREG):
	'''
	Input: 
		- DF: merged
		- LOGREG
	Returns:
		- DF with cntry classification results
	'''
	# clean text for affs to be predicted
	DF['IEEE Author Affiliation Filled_Processed'] = DF[
		'IEEE Author Affiliation Filled'].apply(clean_text)
	pred = LOGREG.predict(DF['IEEE Author Affiliation Filled_Processed'])
	results = [id_to_cntry[x] for x in pred]
	DF['country_code_results'] = results
	# if I have handcoded the country codes, use those first
	DF = DF.assign(country_code_results_updated = 
	    np.where(DF['First Institution Country Code By Hand'].notnull(), 
	         DF['First Institution Country Code By Hand'],
	         DF['country_code_results']
	        ))
	return DF

if __name__ == '__main__':

	CIT_AUTHOR = sys.argv[1]
	REF_AUTHOR = sys.argv[2]
	# openalex author df for VIS papers:
	OA_AUTHOR = sys.argv[3]
	MERGED_AUTHOR = sys.argv[4]
	MERGED_CNTRY_PREDICTED = sys.argv[5]
	CNTRY_CLASSIFICATION_REPORT = sys.argv[6]

	# load datasets:
	cit_author = get_simple_df(CIT_AUTHOR)
	ref_author = get_simple_df(REF_AUTHOR)
	oa_author = get_simple_df(OA_AUTHOR)
	merged = pd.read_csv(MERGED_AUTHOR)

	# get df for model training and testing
	df = get_df(cit_author, ref_author, oa_author)

	# clean affiliation texts 
	df['aff'] = df['aff'].apply(clean_text)

	df = df.drop_duplicates()
	f = open(CNTRY_CLASSIFICATION_REPORT,'a')
	f.write(f'there are {df.shape[0]} training examples in country classification.')
	f.write('\n')
	f.close()


	# get dicts
	cntry_to_id, id_to_cntry = get_dicts(df)

	# get logreg
	logreg = logist_regression(df)

	merged_processed = get_processed_merged_author(merged, logreg)

	# export merged_processed
	cols_to_keep = [
		'Year',
		'DOI',
		'Title',
		'IEEE Number of Authors',
		'IEEE Author Position', 
		'IEEE Author Name',
		'OpenAlex Author ID',
		'IEEE Author Affiliation Filled',
		'country_code_results_updated', 
		]
	col_renamer = {
		'Year':'Year',
		'DOI':'DOI',
		'Title':'Title',
		'IEEE Number of Authors':'Number of Authors',
		'IEEE Author Position':'Author Position', 
		'IEEE Author Name':'Author Name',
		'OpenAlex Author ID':'OpenAlex Author ID',
		'IEEE Author Affiliation Filled':'Affiliation Name',
		'country_code_results_updated':'Affiliation Country Code', 
		}
	merged_cntry_predicted = merged_processed[cols_to_keep]
	merged_cntry_predicted.rename(columns = col_renamer).to_csv(
		MERGED_CNTRY_PREDICTED, index = False
	)
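The next snippet does the same for affiliation types: it trains both a multiclass classifier and a binary (education vs. non-education) classifier on raw affiliation strings and applies them to the merged author table.
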
import sys
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from bs4 import BeautifulSoup

def get_simple_df(fname):
	"""
		- remove nan, 
		- get only two target columns, i.e., raw string and aff type
		- drop duplicates
	"""
	raw_string = 'Raw Affiliation String'
	aff_type = 'First Institution Type'
	df = pd.read_csv(fname)
	df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
	df = df[[raw_string, aff_type]]
	df = df.drop_duplicates()
	return df

def get_df(cit_author, ref_author, oa_author):
	"""concatenate, drop_duplicates, reset index, rename columns,
		factorize label_str

	Returns:
		the df used for model training and testing. It contains five columns:
			1. aff, which is pre-processed strings of affiliations
			2. label_str, which is country codes in strings,
			3. label: which is factorized version of country codes
			4. binary_label_str
			5. binary_label
	"""

	df = pd.concat(
		[oa_author, ref_author, cit_author], ignore_index = True
		).drop_duplicates().reset_index(drop=True)
	df.columns = ['aff', 'label_str']
	df = df.assign(label = pd.factorize(df['label_str'])[0])
	df = df.assign(binary_label_str = np.where(
		df.label_str == 'education', 'education', 'non-education'))
	df = df.assign(binary_label = pd.factorize(df['binary_label_str'])[0])
	return df 

def get_dicts(df):
	"""get four dicts; id <--> type, for both binary and multiclass
	"""
	multi_type_to_id = dict(zip(df.label_str, df.label))
	id_to_multi_type = dict(zip(df.label, df.label_str))
	binary_type_to_id = dict(zip(df.binary_label_str, df.binary_label))
	id_to_binary_type = dict(zip(df.binary_label, df.binary_label_str))
	return multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type

def clean_text(text):
    """
    Takes a string and returns a string
    """
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text 
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff

def logist_regression(df, LABEL):
	'''
	Input: 
		df: df
		LABEL: 'label' if multiclass and 'binary_label' if binary
	Returns:
		logreg: logistic regression classifier (model)

	'''
	X = df.aff
	y = df[LABEL]
	X_train, X_test, y_train, y_test = train_test_split(
		X, y, test_size=0.2, random_state = 42)
	logreg = Pipeline([('vect', CountVectorizer(stop_words='english', min_df = 2)),
				('clf', LogisticRegression(max_iter=600)),
			   ])
	print('model training now...')
	logreg.fit(X_train, y_train)

	y_train_pred = logreg.predict(X_train)
	y_test_pred = logreg.predict(X_test)

	target_names = list(set(df.label_str)) if LABEL == 'label' else list(set(df.binary_label_str))
	logreg_type = 'multiclass classification' if LABEL == 'label' else 'binary classification'

	f = open(TYPE_CLASSIFICATION_REPORT,'a')
	f.write('The following is the result for aff type' + ' : ' + logreg_type + '\n')
	f.write('Test set accuracy %s' % accuracy_score(y_test, y_test_pred))
	f.write('\n')
	precision, recall, fscore, support = multi_score(
		y_test, 
		y_test_pred, 
		average='weighted'
	)
	f.write('precision: {}'.format(precision))
	f.write('\n')
	f.write('recall: {}'.format(recall))
	f.write('\n')
	f.write('fscore: {}'.format(fscore))
	f.write('\n')
	f.write('support: {}'.format(support))
	f.write('\n')
	f.write('\n')
	f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
	# f.write('\n')
	# f.write(classification_report(y_test, y_test_pred, target_names=target_names))
	f.write('\n')
	f.write('\n')

	f.close()

	return logreg

def get_processed_merged_author(DF, LOGREG_MULTI, LOGREG_BINARY):
	'''
	Input: 
		- DF: merged
		- LOGREG_MULTI
		- LOGREG_BINARY
	Returns:
		- DF with binary and multiclass classification results
	'''
	# clean text for affs to be predicted
	DF['IEEE Author Affiliation Filled_Processed'] = DF[
		'IEEE Author Affiliation Filled'].apply(clean_text)
	pred_binary = LOGREG_BINARY.predict(DF['IEEE Author Affiliation Filled_Processed'])
	pred_binary_type = [id_to_binary_type[x] for x in pred_binary]
	pred_multi = LOGREG_MULTI.predict(DF['IEEE Author Affiliation Filled_Processed'])
	pred_multi_type = [id_to_multi_type[x] for x in pred_multi]
	DF['aff_type_results_binary'] = pred_binary_type
	DF['aff_type_results_multiclass'] = pred_multi_type
	# use type by hand if exists
	DF = DF.assign(aff_type_results_binary_updated = 
	    np.where(DF['Binary Institution Type By Hand'].notnull(), 
	         DF['Binary Institution Type By Hand'],
	         DF['aff_type_results_binary']
	        ))
	# use type by hand if exists
	DF = DF.assign(aff_type_results_multiclass_updated = 
	    np.where(DF['First Institution Type By Hand'].notnull(), 
	         DF['First Institution Type By Hand'],
	         DF['aff_type_results_multiclass']
	        ))
	return DF

if __name__ == '__main__':

	CIT_AUTHOR = sys.argv[1]
	REF_AUTHOR = sys.argv[2]
	# openalex author df for VIS papers:
	OA_AUTHOR = sys.argv[3]
	MERGED_AUTHOR = sys.argv[4]
	MERGED_AFF_TYPE_PREDICTED = sys.argv[5]
	TYPE_CLASSIFICATION_REPORT = sys.argv[6]

	# load datasets:
	cit_author = get_simple_df(CIT_AUTHOR)
	ref_author = get_simple_df(REF_AUTHOR)
	oa_author = get_simple_df(OA_AUTHOR)
	merged = pd.read_csv(MERGED_AUTHOR)

	# get df for model training and testing
	df = get_df(cit_author, ref_author, oa_author)

	# clean affiliation texts 
	df['aff'] = df['aff'].apply(clean_text)

	# drop duplicates after text pre-processing
	df = df.drop_duplicates()
	f = open(TYPE_CLASSIFICATION_REPORT,'a')
	f.write(f'there are {df.shape[0]} training examples in aff type classification.')
	f.write('\n')
	f.write('\n')
	f.close()

	# get dicts
	multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type = get_dicts(df)

	# get logreg
	logreg_multi = logist_regression(df, 'label')
	logreg_binary = logist_regression(df, 'binary_label')

	merged_processed = get_processed_merged_author(merged, logreg_multi, logreg_binary)

	# export merged_processed
	cols_to_keep = [
		'Year',
		'DOI',
		'Title',
		'IEEE Number of Authors',
		'IEEE Author Position', 
		'IEEE Author Name',
		'OpenAlex Author ID',
		'IEEE Author Affiliation Filled',
		'aff_type_results_multiclass_updated', 
		'aff_type_results_binary_updated', 
		]
	col_renamer = {
		'Year':'Year',
		'DOI':'DOI',
		'Title':'Title',
		'IEEE Number of Authors':'Number of Authors',
		'IEEE Author Position':'Author Position', 
		'IEEE Author Name':'Author Name',
		'OpenAlex Author ID':'OpenAlex Author ID',
		'IEEE Author Affiliation Filled':'Affiliation Name',
		'aff_type_results_multiclass_updated':'Multiclass Affiliation Type', 
		'aff_type_results_binary_updated':'Binary Affiliation Type',
		}
	merged_aff_type_predicted = merged_processed[cols_to_keep]
	merged_aff_type_predicted.rename(columns = col_renamer).to_csv(
		MERGED_AFF_TYPE_PREDICTED, index=False
	)
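The following snippet uses Selenium to query Google Scholar for each paper (by title, or by DOI for a small list of problematic cases) and records the citation link and citation count.
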
import sys
import pandas as pd
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException 
from selenium.common.exceptions import ElementNotInteractableException
import os
import random
import re
import csv
import numpy as np
import urllib.parse

PAPERS_TO_STUDY = sys.argv[1]
IEEE_PAPER_DF = sys.argv[2]
GSCHOLAR_DATA = sys.argv[3]

def specify_driver_options():
	"""
	specify driver options
	"""
	options = Options()
	options.set_preference("browser.download.folderList", 2)
	options.set_preference("browser.download.manager.showWhenStarting", 
						   False)
	options.set_preference("browser.helperApps.neverAsk.saveToDisk", 
						   "text/plain, text/txt, application/plain, application/txt")

def read_txt(INPUT):
	"""read txt files and return a list
	"""
	raw = open(INPUT, "r")
	reader = csv.reader(raw)
	allRows = [row for row in reader]
	data = [i[0] for i in allRows]
	return data

def get_dicts(INPUT): # INPUT here is ieee_paper_df
	# get year_dict and title_dict
	df = pd.read_csv(INPUT)
	dois = df.loc[:, "DOI"].tolist()
	titles = df.loc[:, "IEEE Title"].tolist()
	years = df.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return doi_year_dict, doi_title_dict

def get_gscholar_data_by_title(doi, doi_index):
	# TITLE QUERY
	if doi in title_recode_dict.keys():
		title = title_recode_dict[doi]
	else:
		title = doi_title_dict[doi]
	title_to_query = urllib.parse.quote_plus(title)
	doi_to_query = urllib.parse.quote_plus(doi)
	query_string = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C50&q='
	# IF DOI IN TO_QUERY_BY_DOI, USE DOI QUERY
	if doi in to_query_by_doi:
		driver.get(query_string + doi_to_query + '&btnG=')
	# IF NOT, USE TITLE QUERY
	else:
		driver.get(query_string + title_to_query + '&btnG=')
	gs_paper_e = wait.until(EC.presence_of_element_located((
			By.CSS_SELECTOR, 'h3.gs_rt')))
	gs_paper_title = gs_paper_e.text
	gs_citation_e = wait.until(
		EC.presence_of_element_located((By.XPATH, '//div[@class="gs_fl"]//child::a[3]'
	)))
	citation_link = gs_citation_e.get_attribute('href')
	citation_count_string = gs_citation_e.get_attribute('innerHTML')
	if citation_count_string == "Related articles":
		gs_citation_count = 0
	else:
		gs_citation_count = int(re.findall(r'\d+', citation_count_string)[0])
	gscholar_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'IEEE Title': title,
		'Title on Google Scholar': gs_paper_title,
		'Citation Link': citation_link,
		'Citation Counts on Google Scholar': gs_citation_count,
	}
	gscholar_dict_list.append(gscholar_dict)

def main(DOIS):
	for doi in DOIS:
		doi_index = DOIS.index(doi) + 1
		get_gscholar_data_by_title(doi, doi_index)
		print(f'{doi_index} is done')
		time.sleep(0.2+random.uniform(0, 0.2)) 
	driver.close()
	driver.quit()

if __name__ == '__main__':
	driver = webdriver.Firefox(options=specify_driver_options())
	wait = WebDriverWait(driver, 90)
	DOIS = read_txt(PAPERS_TO_STUDY)
	doi_year_dict, doi_title_dict = get_dicts(IEEE_PAPER_DF)
	random_dois = random.sample(DOIS, 10)
	random_dois.append('10.1109/INFVIS.2001.963279')
	gscholar_dict_list = []
	title_recode_dict = {
	# If I don't change the title for querying, the results are wrong:
		# This is the real title on PDF:
		'10.1109/VISUAL.1999.809889': 'Enabling classification and shading for 3 D texture mapping based volume rendering using OpenGL and extensions',
	}
	to_query_by_doi = [
	# If I query by title, the results are false:
		'10.1109/VISUAL.1993.398863',
		'10.1109/VISUAL.1996.567807',
		'10.1109/VISUAL.1998.745315',
		'10.1109/INFVIS.2001.963282',
		'10.1109/VISUAL.1992.235194',
		'10.1109/VISUAL.1993.398866',
		'10.1109/VISUAL.1998.745348',
		'10.1109/VISUAL.1997.663925',
		'10.1109/VISUAL.1993.398900',
		'10.1109/VISUAL.2000.885719',
		'10.1109/TVCG.2021.3114849',
		'10.1109/VISUAL.1991.175771',
		'10.1109/INFVIS.2001.963279',
		'10.1109/INFVIS.2001.963295',
		'10.1109/VIS.1999.10000',
	]
	main(DOIS)
	df = pd.DataFrame(gscholar_dict_list)
	df.to_csv(GSCHOLAR_DATA, index = False)
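This snippet combines the predicted country codes and affiliation types, derives the cross-country and cross-type collaboration flags for each DOI, and writes ht_cleaned_author_df.
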
import sys
import pandas as pd
import numpy as np
import itertools

MERGED_CNTRY_PREDICTED = sys.argv[1]
MERGED_AFF_TYPE_PREDICTED = sys.argv[2]
HT_CLEANED_AUTHOR_DF = sys.argv[3]

def get_cross_country_dic(df):
	cross_country_dic = {}
	for group in df.groupby('DOI'):
	    DOI = group[0]
	    country_codes = group[1]['Affiliation Country Code'].tolist()
	    num_of_cntry = len(list(set(country_codes)))
	    if num_of_cntry != 1:
	        cross_country_dic[DOI] = True
	    else:
	        cross_country_dic[DOI] = False
	return cross_country_dic

def get_cross_type_dic(df):
	cross_type_dic = {}
	for group in df.groupby('DOI'):
	    DOI = group[0]
	    types = group[1]['Binary Type'].tolist()
	    num_of_types = len(list(set(types)))
	    if num_of_types != 1:
	        cross_type_dic[DOI] = True
	    else:
	        cross_type_dic[DOI] = False
	return cross_type_dic

if __name__ == '__main__':
	# load data
	cntry_df = pd.read_csv(MERGED_CNTRY_PREDICTED)
	type_df = pd.read_csv(MERGED_AFF_TYPE_PREDICTED)

	if cntry_df.shape[0] == type_df.shape[0]:
		print('cntry_df has the same length with type_df')

	# get the column of affiliation type
	multi_aff_types = type_df['Multiclass Affiliation Type']
	binary_aff_types = type_df['Binary Affiliation Type']

	# assign them to cntry_df and rename columns
	cntry_df = cntry_df.assign(multi_aff_type = multi_aff_types)
	cntry_df = cntry_df.assign(binary_aff_type = binary_aff_types)
	cntry_df.rename(
		columns = {
			'multi_aff_type': 'Affiliation Type',
			'binary_aff_type': 'Binary Type',
		}, 
		inplace=True
	)

	df = cntry_df.copy()

	cross_country_dic = get_cross_country_dic(df)
	cross_type_dic = get_cross_type_dic(df)

	df['Cross-type Collaboration'] = df.DOI.apply(
    	lambda x: cross_type_dic[x]
	)
	df['International Collaboration'] = df.DOI.apply(
    	lambda x: cross_country_dic[x]
	)

	df.to_csv(HT_CLEANED_AUTHOR_DF, index=False)
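This snippet merges vispubdata, OpenAlex paper metadata, the cleaned author data, Google Scholar citation counts, and award information into ht_cleaned_paper_df.
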
import sys
import pandas as pd
import numpy as np
from functools import reduce 

PAPER_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
HT_CLEANED_AUTHOR_DF = sys.argv[4]
GSCHOLAR_DATA = sys.argv[5]
AWARD_PAPER_DF = sys.argv[6]
HT_CLEANED_PAPER_DF = sys.argv[7]

def get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY):
	cols = [
		'Conference',
		'Year',
		'Title',
		'DOI',
		'FirstPage',
		'LastPage',
		'PaperType',
	]
	vispd = VISPUBDATA_PLUS[
		VISPUBDATA_PLUS.DOI.isin(PAPER_TO_STUDY)].loc[:, cols].reset_index(drop=True)
	vispd.loc[vispd.Year == 2021, 'PaperType'] = 'J'
	return vispd 

def get_alex(OPENALEX_PAPER_DF):
	cols = [
		'DOI',
		'OpenAlex Year',
		'OpenAlex Publication Date',
		'OpenAlex ID',
		'OpenAlex Venue Name',
		'OpenAlex First Page',
		'OpenAlex Last Page',
		'Number of Pages',
		'Number of References',
		'Number of Concepts',
		'Number of Citations',
	]
	alex = OPENALEX_PAPER_DF.loc[:, cols]
	return alex 

def get_authors(HT_CLEANED_AUTHOR_DF):
	cols = [
		'DOI',
		'Number of Authors',
		'Cross-type Collaboration',
		'International Collaboration',
		'With US Authors',
	]
	# create the column of "With US Authors"
	for doi in list(set(HT_CLEANED_AUTHOR_DF.DOI)):
		if 'US' in HT_CLEANED_AUTHOR_DF[
			HT_CLEANED_AUTHOR_DF.DOI == doi]['Affiliation Country Code'].tolist():
			HT_CLEANED_AUTHOR_DF.loc[HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = True
		else:
			HT_CLEANED_AUTHOR_DF.loc[HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = False
	HT_CLEANED_AUTHOR_DF.drop_duplicates(subset=['DOI'], inplace=True)
	authors = HT_CLEANED_AUTHOR_DF.loc[:, cols].reset_index(drop=True) 
	# create the column of both cross-type and cross-country collaboration
	authors['Both Cross-type and Cross-country Collaboration'] = authors[
		'Cross-type Collaboration'] * authors['International Collaboration']
	# rename column
	authors.rename(
		columns={'International Collaboration': 'Cross-country Collaboration'},
		inplace=True
	)
	return authors 

def get_gscholar(GSCHOLAR_DATA):
	cols = [
		'DOI',
		'IEEE Title',
		'Citation Counts on Google Scholar',
	]
	gscholar = GSCHOLAR_DATA.loc[:, cols]
	return gscholar

def get_df_merged(dfs):
	df_merged = reduce(lambda left,right: pd.merge(left,right,on='DOI'), dfs)
	return df_merged

def get_award_dicts(AWARD_PAPER_DF):
	awards = AWARD_PAPER_DF[AWARD_PAPER_DF.Award != 'TT']
	kwargs = {'Track Updated': np.where(awards.Year == 2021, 'VIS', awards.Track)}
	awards = awards.assign(**kwargs)
	award_dois = awards.DOI.tolist()
	award_names = awards.Award.tolist()
	award_tracks = awards['Track Updated'].tolist()
	doi_award_name_dict = dict(zip(award_dois, award_names))
	doi_award_track_dict = dict(zip(award_dois, award_tracks))
	return award_dois, doi_award_name_dict, doi_award_track_dict

def get_df_final(df_merged, award_dois, doi_award_name_dict, doi_award_track_dict):
	df_merged['Award'] = df_merged['DOI'].apply(
		lambda x: True if x in award_dois else False
	)
	df_merged['Award Name'] = df_merged['DOI'].apply(
		lambda x: doi_award_name_dict[x] if x in award_dois else np.nan)
	df_merged['Award Track'] = df_merged['DOI'].apply(
		lambda x: doi_award_track_dict[x] if x in award_dois else np.nan)
	df_final = df_merged
	return df_final

def main():
	# process data
	vispd = get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY)
	alex = get_alex(OPENALEX_PAPER_DF)
	authors = get_authors(HT_CLEANED_AUTHOR_DF)
	gscholar = get_gscholar(GSCHOLAR_DATA)
	# merge data
	dfs = [vispd, alex, authors, gscholar]
	df_merged = get_df_merged(dfs)
	# get award data
	award_dois, doi_award_name_dict, doi_award_track_dict = get_award_dicts(AWARD_PAPER_DF)
	df_final = get_df_final(
		df_merged, award_dois, doi_award_name_dict, doi_award_track_dict)
	# write to file
	df_final.to_csv(HT_CLEANED_PAPER_DF, index=False)

if __name__ == '__main__':
	# load data
	VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
	PAPER_TO_STUDY = pd.read_csv(PAPER_TO_STUDY, header=None)[0].tolist()
	OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
	HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
	GSCHOLAR_DATA = pd.read_csv(GSCHOLAR_DATA)
	AWARD_PAPER_DF = pd.read_csv(AWARD_PAPER_DF)
	main()
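The snippet below scrapes IEEE Xplore (following DOI redirects) for paper titles and author metadata, with request retries and a special case for the one DOI that has no author information on Xplore.
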
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
import json
import numpy as np
import sys
import random
import time
from io import StringIO
from html.parser import HTMLParser
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import re

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
IEEE_AUTHOR_DF = sys.argv[3]
IEEE_PAPER_DF = sys.argv[4]
PROBLEM_DOIS = sys.argv[5]

def get_s():
	# set retry if status codes in [ 500, 502, 503, 504, 429]
	# also return headers
	s = requests.Session()
	retries = Retry(total=5,
		backoff_factor=0.1,
		status_forcelist=[ 500, 502, 503, 504, 429],
	)
	s.mount('http://', HTTPAdapter(max_retries=retries))
	s.mount('https://', HTTPAdapter(max_retries=retries))
	headers = {
	"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
	'Accept': 'application/json',
	}
	return s, headers

def get_dicts(VISPUBDATA_PLUS):
	# get year_dict and title_dict
	vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
	dois = vispd_plus.loc[:, "DOI"].tolist()
	titles = vispd_plus.loc[:, "Title"].tolist()
	years = vispd_plus.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return doi_year_dict, doi_title_dict

def get_response(URL):
	response = s.get(url=URL, headers=headers)
	while response.status_code != 200:
		print(f'response status code is {response.status_code}. retrying now...')
		time.sleep(5)
		response = s.get(url=URL, headers=headers)
	return response 

def get_soup(RESPONSE):
	html = RESPONSE.text
	soup = BeautifulSoup(html, 'lxml')
	return soup 

def get_j(DOI, SOUP):
	if DOI != '10.1109/VIS.1999.10000':
		str = SOUP.find_all('script')[11].string.rsplit(
			'xplGlobal.document.metadata=')[1].rsplit(
			'xplGlobal.document.userLoggedIn=')[0]

		# delete anything after the last `}`
		str = str.replace(re.findall(r'[^\}]+$', str)[0], '')
		j = json.loads(str)
	else:
		j = None
	return j

# strip html tags and entities in titles
# source: https://stackoverflow.com/a/925630
class MLStripper(HTMLParser):
	def __init__(self):
		super().__init__()
		self.reset()
		self.strict = False
		self.convert_charrefs= True
		self.text = StringIO()
	def handle_data(self, d):
		self.text.write(d)
	def get_data(self):
		return self.text.getvalue()

def strip_tags(html):
	s = MLStripper()
	s.feed(html)
	return s.get_data()

# def get_ieee_title(J):
# 	# get ieee paper title
# 	title_raw = J['title']
# 	title = strip_tags(title_raw)
# 	return title

def update_paper_dict_list(J, DOI):
	if DOI != '10.1109/VIS.1999.10000':
		title_raw = J['title']
		ieee_title = strip_tags(title_raw)
		ieee_doi = J['doi']
	else:
		ieee_title = doi_title_dict[DOI]
		ieee_doi = DOI
	paper_dict = {
		'Year': doi_year_dict[DOI],
		'DOI': DOI,
		'Title': doi_title_dict[DOI],
		'IEEE Title': ieee_title,
		'IEEE DOI': ieee_doi,
	}
	paper_dict_list.append(paper_dict)

def update_author_dict_list(J, DOI):
	AUTHOR_JSON = J['authors']
	for i in AUTHOR_JSON:
		try:
			first_name = i['firstName']
		except:
			first_name = None
		try:
			last_name = i['lastName']
		except:
			last_name = None
		try:
			author_name = i['name']
		except:
			author_name = None
		author_num = len(AUTHOR_JSON)
		author_position = AUTHOR_JSON.index(i) + 1
		try:
			affiliation_element = i['affiliation']
			affiliation_name = affiliation_element[0]
			affiliation_num = len(affiliation_element)
			one_affiliation = True if affiliation_num == 1 else False
		except:
			affiliation_name = None
			affiliation_num = None
			one_affiliation = None
		try:
			author_id = 'https://ieeexplore.ieee.org/author/' + i['id']
		except:
			author_id = None
		author_dict = {
			'Year': doi_year_dict[DOI],
			'DOI': DOI,
			'Title': doi_title_dict[DOI],
			# 'IEEE Title': IEEE_TITLE,
			# 'First Name': first_name,
			# 'Last Name': last_name,
			'Number of Authors': author_num,
			'Author Position': author_position,
			'Author Name': author_name,
			'Author ID': author_id,
			'Author Affiliation': affiliation_name,
			# 'Number of Affiliations': affiliation_num,
			'One Affiliation': one_affiliation,
		}
		author_dict_list.append(author_dict)

def get_empty_author_dict(DOI):
	author_dict = {
		'Year': doi_year_dict[DOI],
		'DOI': DOI,
		'Title': doi_title_dict[DOI],
	}
	author_dict_list.append(author_dict)

def main(DOIS):
	for DOI in DOIS:
		doi_index = DOIS.index(DOI) + 1
		url = 'https://doi.org/' + DOI
		response = get_response(url)
		soup = get_soup(response)
		j = get_j(DOI, soup)
		update_paper_dict_list(j, DOI)
		try:
			if DOI != '10.1109/VIS.1999.10000':
				update_author_dict_list(j, DOI)
			else:
				get_empty_author_dict(DOI)
		except:
			problem_dois_list.append(DOI)
			print(f'something wrong with {DOI}')
		time.sleep(0.4+random.uniform(0, 0.4)) 
		print(f'{doi_index} is done')

if __name__ == '__main__':
	s, headers = get_s()
	PAPERS = pd.read_csv(PAPERS_TO_STUDY, header=None)
	DOIS = PAPERS[0].tolist()
	random_dois = random.sample(DOIS, 10)
	random_dois.append('10.1109/VIS.1999.10000')
	doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
	headers = {
	'User-agent':
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
	}
	author_dict_list = []
	paper_dict_list = []
	problem_dois_list = []
	# main(random_dois)
	main(DOIS)
	author_df = pd.DataFrame(author_dict_list)
	paper_df = pd.DataFrame(paper_dict_list)
	author_df.to_csv(IEEE_AUTHOR_DF, index=False)
	paper_df.to_csv(IEEE_PAPER_DF, index=False)
	with open(PROBLEM_DOIS, 'w') as f:
		for doi in problem_dois_list:
			f.write("%s\n" % doi)
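The final snippet (truncated here) merges the IEEE and OpenAlex author tables using fuzzy name matching and applies a series of manual corrections to author names, affiliations, country codes, and affiliation types.
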
import sys
import pandas as pd
import re
import numpy as np
import csv
import difflib 

IEEE_AUTHOR = sys.argv[1]
OPENALEX_AUTHOR = sys.argv[2]
PAPERS_TO_STUDY = sys.argv[3]
VISPUBDATA = sys.argv[4]
MERGED_AUTHOR_DF = sys.argv[5]

def get_dicts(VISPUBDATA):
	# get year_dict and title_dict
	vispd = pd.read_csv(VISPUBDATA)
	dois = vispd.loc[:, "DOI"].tolist()
	titles = vispd.loc[:, "Title"].tolist()
	years = vispd.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return doi_year_dict, doi_title_dict

def read_txt(INPUT):
	"""read txt files and return a list
	"""
	raw = open(INPUT, "r")
	reader = csv.reader(raw)
	allRows = [row for row in reader]
	data = [i[0] for i in allRows]
	return data

def update_ieee_orig(DF): # DF here is ieee_orig
	"""update ieee_orig

	ieee_orig is wrong in '10.1109/TVCG.2008.157' as it contains an additional author that shouldn't be there;
	also, ieee_orig lacks author info for '10.1109/VIS.1999.10000'.

	What this function does is to delete the additional author in '10.1109/TVCG.2008.157' and update info in 
	that paper. Then, I added author data manually for '10.1109/VIS.1999.10000'.

	"""
	DF = DF.drop(DF[DF.DOI == '10.1109/VIS.1999.10000'].index)
	row_to_drop = DF.index[DF.DOI == '10.1109/TVCG.2008.157'].tolist()[0]
	df_dropped = DF.drop([row_to_drop])
	df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Number of Authors'] -= 1
	df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Author Position'] -= 1.0
	df = df_dropped
	FILL_DATA = [
	{
		'Year': 1999,
		'DOI': '10.1109/VIS.1999.10000',
		'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
		'Number of Authors': 3,
		'Author Position': 1,
		'Author Name': 'Daniel Cohen-Or',
		'Author ID': np.NaN,
		'Author Affiliation': 'Tel Aviv University',
		'One Affiliation': True,
	},
	{
		'Year': 1999,
		'DOI': '10.1109/VIS.1999.10000',
		'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
		'Number of Authors': 3,
		'Author Position': 2,
		'Author Name': 'David Levin',
		'Author ID': np.NaN,
		'Author Affiliation': 'Tel Aviv University',
		'One Affiliation': True,
	},
	{
		'Year': 1999,
		'DOI': '10.1109/VIS.1999.10000',
		'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
		'Number of Authors': 3,
		'Author Position': 3,
		'Author Name': 'Offir Remez',
		'Author ID': np.NaN,
		'Author Affiliation': 'Tel Aviv University',
		'One Affiliation': True,
	}
	]
	fill_data_df = pd.DataFrame(FILL_DATA)
	df = df.append(fill_data_df, ignore_index = True)
	return df

def get_diff_dois(IEEE, ALEX): # ieee, alex
	# return a list of DOIs where alex is wrong in Number of Authors
	DOIS = list(set(IEEE.DOI))
	diff_dois = []
	for doi in DOIS:
		ieee_n = IEEE[IEEE.DOI == doi]['Number of Authors'].tolist()[0]
		alex_n = ALEX[ALEX.DOI == doi]['Number of Authors'].tolist()[0]
		if ieee_n != alex_n:
			diff_dois.append(doi)
	return diff_dois 

def get_alex_new(IEEE, ALEX, DIFF_DOIS):
	"""
	For DOIs where alex is wrong in Number of Authors, get correct data from IEEE first
	Drop the rows where alex is wrong from alex, and append the correct ieee data to alex_dropped

	Returns:
		alex_new, where data of Number of Authors is correct
	"""
	df_to_append = IEEE[IEEE.DOI.isin(DIFF_DOIS)].iloc[:, 0:6]
	alex_dropped = ALEX.drop(ALEX[ALEX.DOI.isin(DIFF_DOIS)].index)
	alex_new = alex_dropped.append(df_to_append, ignore_index = True)
	return alex_new

def get_sorted_dfs(IEEE, ALEX_NEW, PAPERS):
	"""sort ieee and alex author df by paper index and author position

	I added a variable 'Paper Index' to both ieee and alex_new. I 
	also added a prefix of 'IEEE ' in ieee. Then I sort the two datasets 
	by 'Paper Index' and 'Author Position'. 

	Returns:
		two dataframes, ieee_sorted, and alex_new_sorted

	"""
	IEEE['Paper Index'] = [PAPERS.index(i) for i in IEEE.DOI.tolist()]
	ALEX_NEW['Paper Index'] = [PAPERS.index(i) for i in ALEX_NEW.DOI.tolist()]
	IEEE = IEEE.add_prefix('IEEE ')
	alex_new_sorted = ALEX_NEW.sort_values(
		by=['Paper Index', 'Author Position'], ).reset_index(drop=True)
	ieee_sorted = IEEE.sort_values(
		by=['IEEE Paper Index', 'IEEE Author Position'], ).reset_index(drop=True)
	return ieee_sorted, alex_new_sorted

def get_concat_df(IEEE, ALEX, PAPERS): # ieee_sorted, alex_sorted
	"""check https://stackoverflow.com/a/13680953 for details
	"""
	fuzzy_match_df_list = []
	mismatch_doi_list = []
	for doi in PAPERS:
		df1 = IEEE[IEEE['IEEE DOI'] == doi]
		df2 = ALEX[ALEX['DOI'] == doi]
		try:
			kwargs = {'IEEE Author Name': 
			df2['Author Name'].apply(
				lambda x: difflib.get_close_matches(
					x, df1['IEEE Author Name'], cutoff=0.6)[0])
			}
		except:
			kwargs = {'IEEE Author Name': df1['IEEE Author Name']}
			mismatch_doi_list.append(doi)
		df2 = df2.assign(**kwargs)
		df = df1.merge(df2, on='IEEE Author Name', how='inner')
		fuzzy_match_df_list.append(df)
	print(f'in {len(mismatch_doi_list)} dois, fuzzy matching was not successful, so I assumed author position in merging')
	df = pd.concat(fuzzy_match_df_list, ignore_index=True)
	return df 

def flatten(t):
	"""convert list of lists to a list of items"""
	"""source: https://stackoverflow.com/a/952952"""
	return [item for sublist in t for item in sublist]

def update_with_vispubdata_author_data(VISPD, DF): # vispd, concat_df
	ieee_wrong = [
	'10.1109/INFVIS.2005.1532150',
	'10.1109/VISUAL.2005.1532819',
	'10.1109/VISUAL.2005.1532794',
	'10.1109/VISUAL.1992.235178',
	]
	correct_author_num = [5, 2, 5, 4]
	correct_author_num_dict = dict(zip(ieee_wrong, correct_author_num))
	vispd_names = VISPD.loc[VISPD.DOI.isin(ieee_wrong), 'AuthorNames-Deduped'].tolist()
	dois = flatten([np.repeat(doi, correct_author_num_dict[doi]) for doi in ieee_wrong])
	years = [doi_year_dict[x] for x in dois]
	titles = [doi_title_dict[x] for x in dois]
	author_names = flatten([x.split(';') for x in vispd_names])
	author_nums = flatten([np.repeat(i, i) for i in correct_author_num])
	author_positions = flatten([range(1, i+1) for i in correct_author_num])
	paper_index = [papers.index(doi) for doi in dois]
	DF_TO_FILL = pd.DataFrame({
		'IEEE DOI': dois,
		'DOI': dois,
		'IEEE Year': years,
		'Year': years,
		'IEEE Title': titles,
		'Title': titles,
		'IEEE Number of Authors': author_nums,
		'IEEE Author Position': author_positions,
		'IEEE Author Name': author_names,
		'Number of Authors': author_nums,
		'Author Position': author_positions,
		'Author Name': author_names,
		'IEEE Paper Index': paper_index,
		'Paper Index': paper_index,
	})
	df_dropped = DF.drop(DF[DF['IEEE DOI'].isin(ieee_wrong)].index)
	df_new = df_dropped.append(DF_TO_FILL, ignore_index=True)
	df_new = df_new.sort_values(
		by=['IEEE Paper Index', 'IEEE Author Position'], ).reset_index(drop=True)
	return df_new

def update_country_code(DF, DOI, NEW_DATA):
	DF.loc[DF['DOI'] == DOI, 'First Institution Country Code By Hand'] = NEW_DATA
	# this is to change openalex author names to be the same as IEEE author names
	# DF.loc[DF['DOI'] == DOI, 'Author Name'] = DF.loc[DF['DOI'] == DOI, 'IEEE Author Name']
	return DF 

def update_country_code_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING, 'First Institution Country Code By Hand'] = NEW_DATA
    return DF 

def update_type(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'First Institution Type By Hand'] = NEW_DATA
    return DF 

def update_type_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING, 'First Institution Type By Hand'] = NEW_DATA
    return DF 

def update_affiliations(DF, DOI, NEW_DATA):
    # update both ieee author affiliation, alex first institution names and raw string
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Affiliation'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'First Institution Name'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'Raw Affiliation String'] = NEW_DATA
    return DF 

def update_author_name(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Name'] = NEW_DATA
    return DF


def update_concat_df(DF): # DF here is concat_df
	"""Update data for specific DOIs

	Return:
		still concat_df, but updated
	"""
	# '10.1109/VISUAL.1996.568115',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.568115',
		['US']*3,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.568115',
		['company']*2 + ['facility'],
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.568115',
		['MRJ, Inc']*2 + ['NASA Ames Research Center']
	)
	# '10.1109/VISUAL.2000.885735'
	update_country_code(
		DF, 
		'10.1109/VISUAL.2000.885735',
		np.repeat('NL', 6),
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2000.885735',
		['government']*2 + ['education']*4,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2000.885735',
		np.append(
			np.repeat(
				'Center for Mathematics and Computer Science, CWI, Amsterdam, Netherlands', 2),
			np.repeat(
				'Swammerdam Inst. for Life Sciences, BioCentrum Amsterdam, Amsterdam, Netherlands', 4)
			)
	)
	# '10.1109/VISUAL.1996.568143',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.568143',
		['US']*6,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.568143',
		['education']*6,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.568143',
		['Ohio State University, Columbus, OH, USA']*6
	)
	# '10.1109/VISUAL.1999.809936',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809936',
		['US']*3,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1999.809936',
		['education']*3,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1999.809936',
		['Worcester Polytechnic Institute, Worcester, MA, USA']*3
	)
	# '10.1109/INFVIS.2002.1173147',
	# IEEE Xplore got author name wrong
	update_country_code(
		DF, 
		'10.1109/INFVIS.2002.1173147',
		['SE', 'US', 'SE'],
	)
	update_type(
		DF, 
		'10.1109/INFVIS.2002.1173147',
		['education']*3,
	)
	update_affiliations(
		DF, 
		'10.1109/INFVIS.2002.1173147',
		[
			'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
			'Dept. of Psychology, Indiana University, Bloomington, Indiana, USA',
			'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
		]
	)
	update_author_name(
		DF, 
		'10.1109/INFVIS.2002.1173147',
		['M. Lind', 'G.P. Bingham', 'C. Forsell'],
	)
	# '10.1109/VISUAL.1992.235175',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1992.235175',
		['US']*12,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1992.235175',
		['company']*3 + ['government']*2 + ['education']*6 + ['company']*1
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1992.235175',
		[
			'Unisys Corporation',
			'Sterling Software',
			'Unisys Corporation',
			'U.S. Environmental Protection Agency, United States',
			'U.S. Environmental Protection Agency',
			'University of Alabama in Huntsville (UAH), United States',
			'Florida State University, United States',
			'Florida State University, United States',
			'University of Wisconsin, Madison, WI, United States',
			'University of Wisconsin, Madison, WI, United States',
			'University of Wisconsin, Madison, WI, United States',
			'IBM T.J. Watson Research Center, United States',
		]
	)
	# '10.1109/TVCG.2006.182',
	update_country_code(
		DF, 
		'10.1109/TVCG.2006.182',
		['US']*5,
	)
	update_type(
		DF, 
		'10.1109/TVCG.2006.182',
		['education']*5,
	)
	update_affiliations(
		DF, 
		'10.1109/TVCG.2006.182',
		['Brown University, United States']*5,
	)
	# '10.1109/TVCG.2015.2467971',
	update_country_code(
		DF, 
		'10.1109/TVCG.2015.2467971',
		['US']*5,
	)
	update_type(
		DF, 
		'10.1109/TVCG.2015.2467971',
		['education']*5, 
	)
	update_affiliations(
		DF, 
		'10.1109/TVCG.2015.2467971',
		['University of North Carolina at Charlotte, NC, United States']*5,
	)
	# '10.1109/SciVis.2015.7429489', 
	# author affiliations listed on IEEE are all WRONG!!!
	# I found the authors' correct affiliations on their IEEE author ID pages
	update_country_code(
		DF, 
		'10.1109/SciVis.2015.7429489',
		['DE']*5,
	)
	update_type(
		DF, 
		'10.1109/SciVis.2015.7429489',
		['education']*5, 
	)
	update_affiliations(
		DF, 
		'10.1109/SciVis.2015.7429489',
		['Technical University of Munich, Germany']*5,
	)
	# '10.1109/VISUAL.2005.1532821',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532821',
		['AT', 'HR', 'AT', 'AT', 'US'],
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532821',
		['company']*4 + ['education']*1 
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532821',
		['VRVis Research Center Vienna, Austria'] + ['AVL-AST Zagreb, Croatia'] + [
		'VRVis Research Center Vienna, Austria']*2 + ['Virginia Tech']
	)
	# '10.1109/VISUAL.2000.885692',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2000.885692',
		['US']*6,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2000.885692',
		['education']*6,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2000.885692',
		['University of Utah, Salt Lake City, UT, USA']*4 + ['Vanderbilt University, USA'] + [
		  'University of Utah, Salt Lake City, UT, USA'],
	)
	# '10.1109/VISUAL.1999.809912',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809912',
		['DE']*4,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1999.809912',
		['education']*2 + ['healthcare']*2,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1999.809912',
		['WSUGRIS, University of Tubingen, Tubingen, Germany']*2 + [
		 'Department of Neuroradiology, University Hospital Tubingen, Tubingen, Germany']*2 ,
	)
	# '10.1109/VISUAL.1999.809929',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809929',
		['US']*4,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1999.809929',
		['company']*4,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1999.809929',
		['IBM T.J. Watson Research Center, United States']*3 + [
		 'UBS Group AG'] ,
	)
	# '10.1109/VISUAL.1999.809884',
	# In this paper, openalex got the country wrong and ieee got some of the affiliations wrong
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809884',
		['DE']*5,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1999.809884',
		['nonprofit']*4 + ['education']*1,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1999.809884',
		['German National Research Centre for Information Technology, Germany']*4 + [
		 'Department of Physics & Astronomy, University of Heidelberg, Germany'] ,
	)
	# '10.1109/VISUAL.1999.809920',
	# openalex got country wrong
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809920',
		['DE']*5,
	)
	# '10.1109/VISUAL.1993.398911',
	# openalex got this paper country wrong for the last two authors
	update_country_code(
		DF, 
		'10.1109/VISUAL.1993.398911',
		['RU']*4 + ['DE']*2,
	)
	# '10.1109/VISUAL.2005.1532816',
	# ieee xplore got author positions and author affiliations wrong
	update_author_name(
		DF, 
		'10.1109/VISUAL.2005.1532816',
		[
			'Gregor Schlosser',
			'J ̈urgen Hesser',
			'Frank Zeilfelder',
			'Christian Rossl',
			'Reinhard Manner',
			'Gunther Nurnberger',
			'Hans-Peter Seidel',
		],
	)
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532816',
		['DE']*7,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532816',
		['education']*3 + ['nonprofit']*1 + ['education']*2 + ['nonprofit']*1,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532816',
		['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany']*2 +
		['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
		['Max Planck Institut für Informatik, Saarbruecken, Germany'] +
		['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany'] +
		['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
		['Max Planck Institut für Informatik, Saarbruecken, Germany'],
	)
	# '10.1109/VAST.2016.7883507',
	# This is the paper where I don't have the ieee author affiliation or the openalex raw string,
	# but I do have the openalex first institution name.
	# Another note: the information on IEEE about the first two authors of this paper is WRONG!
	update_country_code(
		DF, 
		'10.1109/VAST.2016.7883507',
		['DE']*5,
	)
	update_type(
		DF, 
		'10.1109/VAST.2016.7883507',
		['education']*5,
	)
	update_affiliations(
		DF, 
		'10.1109/VAST.2016.7883507',
		['University of Stuttgart, Germany']*5
	)
	# '10.1109/VISUAL.2004.38',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2004.38',
		['CN']*1 + ['US']*3,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2004.38',
		['education']*3 + ['company']*1,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2004.38',
		['Zhejiang University, China'] + ['Carnegie Mellon University, United States'] + [
			'Massachusetts Institute Of Technology, United States'] + [
				'Mitsubishi Electric Research Laboratories, United States']
	)
	"""The following are cases where i have raw string, but not type or country code"""
	# '10.1109/TVCG.2006.195',
	update_country_code(
		DF, 
		'10.1109/TVCG.2006.195',
		['NL']*3
	)
	update_type(
		DF, 
		'10.1109/TVCG.2006.195',
		['education']*2 + ['government']*1,
	)
	update_affiliations(
		DF, 
		'10.1109/TVCG.2006.195',
		['Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Netherlands']*2 + [
			'Center for Mathematics and Computer Science (CWI), Netherlands'
		]*1
	)
	# '10.1109/VISUAL.1996.567752',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.567752',
		['US']*3
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.567752',
		['company']*3
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.567752',
		['GE Corporate Research & Development, United States']*3,
	)
	# '10.1109/VISUAL.1999.809907',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1999.809907',
		['NL']*2
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1999.809907',
		['government']*2
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1999.809907',
		['Center for Mathematics and Computer Science (CWI), Netherlands']*2,
	)
	# '10.1109/VISUAL.2004.88',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2004.88',
		['DE']*2
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2004.88',
		['nonprofit'] + ['education']
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2004.88',
		['Caesar Research Center, Bonn, Germany'] + [
		'Interdisciplinary Center for Scientific Computing, Heidelberg, Germany'],
	)
	# '10.1109/VISUAL.2004.113',
	update_type_by_raw_string(
		DF,
		'DLR Goettingen',
		['government']
	)
	update_country_code_by_raw_string(
		DF,
		'DLR Goettingen',
		'DE'
	)
	# '10.1109/VISUAL.2000.885722',
	update_type_by_raw_string(
		DF,
		'ETH Zentrum, CH - 8092 Switzerland',
		'education'
	)
	update_country_code_by_raw_string(
		DF,
		'ETH Zentrum, CH - 8092 Switzerland',
		'CH'
	)
	# '10.1109/VISUAL.2000.885715',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2000.885715',
		['DE']*3 + ['NL'] + ['DE'] + ['NL']
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2000.885715',
		['education']*6,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2000.885715',
		['University of Bonn, Bonn, Germany'] * 3 + ['Eindhoven University of Technology'] + [
			'University of Bonn, Bonn, Germany'] + ['Eindhoven University of Technology']
	)
	# '10.1109/VISUAL.2000.885731',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2000.885731',
		['US']*6,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2000.885731',
		['education']*6,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2000.885731',
		['Brown University, United States']*6,
	)
	# '10.1109/VISUAL.1996.568133',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.568133',
		['US']*7,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.568133',
		['healthcare'] + ['education'] + ['facility']*2 + ['healthcare'] + ['education']*2,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.568133',
		['National Jewish Center for Immunology and Respiratory Medicine, United States'] + [
		'University of New Mexico, United States'] + [
		'Sandia National Laboratories, United States']*2 + [
		'National Jewish Center for Immunology and Respiratory Medicine, United States'] + [
		'State University of New York at Stony Brook, United States'] + [
		'University of New Mexico, United States']
	)
	# '10.1109/VISUAL.2005.1532808',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532808',
		['DE'],
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532808',
		['education'],
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532808',
		['University of Stuttgart']
	)
	# '10.1109/VISUAL.1998.745350',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1998.745350',
		['US']*6,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1998.745350',
		['facility']*6,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1998.745350',
		['Naval Research Lab, Washington, D.C.']*6
	)
	# '10.1109/VISUAL.2005.1532776',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532776',
		['US']*7,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532776',
		['company']*3 + ['facility']*2 + ['company']*2,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532776',
		['Kitware, United States']*3 + [
		'Sandia National Laboratories, United States']*2 + [
		'Simmetrix, United States']*2,
	)
	# '10.1109/VISUAL.1996.568150',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.568150',
		['NL']*4,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.568150',
		['nonprofit'] + ['government']*2 + ['education']
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.568150',
		['Netherlands Energy Research Foundation, Netherlands'] + [
		'Centre for Mathematics and Computer Science (CWI), Netherlands']*2 + [
		'Vrije Universiteit, Netherlands']
	)
	# '10.1109/VISUAL.1990.146398',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1990.146398',
		['US']*4,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1990.146398',
		['government'] + ['company']*3 
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1990.146398',
		['NASA Ames Research Center, Moffett Field, CA, USA'] + [
		'Sterling Software, United States'] + [
			'Crossfield Marketing, United States'] + [
			'Crystal River Engineering, Inc., Groveland, CA, USA']
	) 
	# '10.1109/VISUAL.1996.568120',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1996.568120',
		['US']*3,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1996.568120',
		['education']*3 
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1996.568120',
		['University of Illinois at Chicago, United States'] + [
		'University of Chicago, United States'] + [
			'University of Illinois at Chicago, United States']
	) 
	"""BELOW ARE WHERE I FILL AUTHOR DATA FOR ROWS WHERE DATA WAS FROM VISPUBDATA RAW"""
	# '10.1109/INFVIS.2005.1532150',
	update_country_code(
		DF, 
		'10.1109/INFVIS.2005.1532150',
		['US']*5,
	)
	update_type(
		DF, 
		'10.1109/INFVIS.2005.1532150',
		['education']*5,
	)
	update_affiliations(
		DF, 
		'10.1109/INFVIS.2005.1532150',
		['Stanford University, United States']*5,
	) 
	# '10.1109/VISUAL.2005.1532819',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532819',
		['CA']*2,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532819',
		['education']*2,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532819',
		['University of Alberta, Canada']*2,
	) 
	# '10.1109/VISUAL.2005.1532794',
	update_country_code(
		DF, 
		'10.1109/VISUAL.2005.1532794',
		['US']*5,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.2005.1532794',
		['facility'] + ['education']*3 + ['facility'],
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.2005.1532794',
		['Oak Ridge National Lab, United States'] + [
			'The University of Tennessee, United States']*3 + [
			'Oak Ridge National Lab, United States'],
	) 
	# '10.1109/VISUAL.1992.235178',
	update_country_code(
		DF, 
		'10.1109/VISUAL.1992.235178',
		['US']*4,
	)
	update_type(
		DF, 
		'10.1109/VISUAL.1992.235178',
		['education']*4,
	)
	update_affiliations(
		DF, 
		'10.1109/VISUAL.1992.235178',
		['University of Utah, United States']*4,
	) 
	## The IEEE website updated the name of Sehi L'Yi, but this update is 
	## different from the name shown on the PDF. I changed it back. 
	# '10.1109/TVCG.2021.3114876',
	update_author_name(
		DF, 
		'10.1109/TVCG.2021.3114876', 
		["Sehi L'Yi", 'Qianwen Wang', 'Fritz Lekschas', 'Nils Gehlenborg'],
	)
	## I found that in this paper, some authors' affiliations contain two institutions
	update_country_code(
		DF, 
		'10.1109/TVCG.2011.207',
		['DE']*4,
	)
	update_type(
		DF, 
		'10.1109/TVCG.2011.207',
		['company'] + ['education']*1 + ['company']*2,
	)
	update_affiliations(
		DF, 
		'10.1109/TVCG.2011.207',
		['Fraunhofer MEVIS, Germany'] + [
			'Center of Complex Systems and Visualization (CeVis), University of Bremen, Germany']*1 + [
			'Fraunhofer MEVIS, Germany']*2,
	) 
	## I found that in this paper, the first author has two affiliations
	update_country_code(
		DF, 
		'10.1109/INFVIS.2004.1',
		['FR']*3,
	)
	update_type(
		DF, 
		'10.1109/INFVIS.2004.1',
		['education']*1 + ['nonprofit']*1 + ['education']*1
	)
	update_affiliations(
		DF, 
		'10.1109/INFVIS.2004.1',
		['ecole des mines de nantes nantes france'] + ['INRIA']*1 + ['ecole des mines de nantes nantes france'],
	) 

	return DF

def manual_update(DF, DOI, AUTHOR_NAME, COL_TO_CHANGE, TEXT):
	"""This is to manually update errors in rows where ieee author info is nan 
	and where openalex author info is complete
	"""
	DF.loc[(DF['DOI'] == DOI) & (DF['IEEE Author Name'] == AUTHOR_NAME), COL_TO_CHANGE] = TEXT

def manual_update_concat_df(DF): # DF here is concat_df
    manual_update(
        DF,
        '10.1109/VISUAL.1997.663848',
        'R. Machiraju',
        'Raw Affiliation String',
        'Mississippi State University, Mississippi, United States'
    )
    manual_update(
        DF,
        '10.1109/VISUAL.2004.128',
        'E. Parkinson',
        'Raw Affiliation String',
        'VA Tech Hydro Corporation, Switzerland',
    )
    manual_update(
        DF,
        '10.1109/VISUAL.2004.128',
        'E. Parkinson',
        'First Institution Type',
        'company'
    )
    manual_update(
        DF,
        '10.1109/VISUAL.2004.128',
        'E. Parkinson',
        'First Institution Country Code',
        'CH',
    )
    manual_update(
        DF,
        '10.1109/INFVIS.1999.801864',
        'J. Sean',
        'IEEE Author Name',
        'Jeffrey Senn',
    )
    manual_update(
        DF,
        '10.1109/INFVIS.1999.801864',
        'J. Sean',
        'Author Name',
        'Jeffrey Senn',
    )
    manual_update(
        DF,
        '10.1109/TVCG.2019.2934260',
        'Andrew J. Solis',
        'Raw Affiliation String',
        'University of Texas Austin, Texas, United States',
    )
    manual_update(
        DF,
        '10.1109/TVCG.2019.2934260',
        'Andrew J. Solis',
        'First Institution Name',
        'University of Texas Austin',
    )

def get_concat_df_filled(DF): # DF here is concat_df
	""" find out who don't have affilition, and fill the data manually

	Get the subset of concat_df where there does not exist any affiliation name. 
	Then drop this subset from concat_df

	Update this subset's IEEE Author Affiliation with fill_dict, and then append 
	this updated subset to concat_df_dropped

	Returns:
		concat_df_filled, where all authors have at least one affiliation name

	"""
	fill_dict = {
	'K.I. Joy': 'University of California, Davis, United States',
	'H. Pfister': 'Department of Computer Science, State University of New York at Stony Brook, United States',
	'A.J. Kolojechick': 'Carnegie Mellon University,School of Computer Science,Pittsburgh,United States',
	'M. Roth': 'Computer Graphics Research Group, Deptartment of Computer Science, ETH Zurich, Switzerland',
	'P.C. Wong': 'Pacific Northwest National Laboratory, United States',
	'H. Foote': 'Pacific Northwest National Laboratory, United States',
	'W. Strasser': 'Computer Graphics Lab, University of Tubingen, Germany',
	'M. Tuveri': 'Center for Advanced Studies, Research and Development in Sardinia, Cagliari, Italy',
	'N. Fanst': 'Georgia Institute of Technology, United States',
	'Heike Janicke': 'Image and Signal Processing Group at the Universität Leipzig, Germany',
	'A. Vilanova': 'Institute of Computer Graphics, Vienna University of Technology, Austria',
	'P. Thiansathaporn': 'Department of Physics & Astronomy, University of North Carolina, Chapel Hill, United States',
	'B. Hegedust': 'Institute of Computer Graphics, Vienna University of Technology, Austria',
	'W.C. Flowers': 'Massachusetts Institute of Technology, United States',
	'G. Turk': 'GVU Center, College of Computing, Georgia Institute of Technology, United States',
	'P. Ermest': 'Philips Medical Systems, Best, Netherlands',
	'T. Moller': 'Department Of Computer And Information Science, The Ohio State University, Columbus, Ohio, United States',
	'K. Fostiropoulos': 'German National Research Centre for Information Technology, Germany',
	'F. Sobieczky': 'University of Göttingen, Germany',
	'W. Bertelheimer': 'Bayerische Motoren Werke AG (BMW) Corporation, Germany',
	}
	to_fill_df = DF[(
		DF['IEEE Author Affiliation'].isnull()) & (
		DF['Raw Affiliation String'].isnull()) & (
		DF['First Institution Name'].isnull())
	]
	rows_to_drop = DF.index[(
		DF['IEEE Author Affiliation'].isnull()) & (
		DF['Raw Affiliation String'].isnull()) & (
		DF['First Institution Name'].isnull())
	]
	concat_df_dropped = DF.drop(rows_to_drop)
	if concat_df_dropped.shape[0] + to_fill_df.shape[0] == DF.shape[0]:
		print('concat_df_dropped has correct row numbers')
	else:
		print('concat_df_dropped has incorrect row numbers')
	name_list = to_fill_df['IEEE Author Name'].tolist()
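	# note: DataFrame.assign() accepts a callable; this lambda ignores its argument and simply
	# returns the filled affiliation list, aligned with to_fill_df's rows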
	kwargs = {'IEEE Author Affiliation' : lambda x: [fill_dict[i] for i in name_list]}
	to_fill_df = to_fill_df.assign(**kwargs)
	concat_df_filled = concat_df_dropped.append(
		to_fill_df, ignore_index=True).sort_values(
		by=['IEEE Paper Index', 'IEEE Author Position'], ).reset_index(drop=True)
	return concat_df_filled

def recode_to_edu(DF): # df here is concat_df_filled
	# openalex got these institutions' type wrong. they should be education.
	edu_recode_list = [
		'Paris Diderot University',
		'Paris Descartes University',
		'École Polytechnique Fédérale de Lausanne',
		'Johns Hopkins University School of Medicine'
	]
	DF.loc[
	  DF['First Institution Name'].isin(edu_recode_list), 'First Institution Type'
	] = 'education'
	return DF 

def get_alex_raw_string_correct(DF): # DF here is concat_df_filled
	"""if openalex raw string is wrong, correct/update it with ieee author affliation
	"""
	openalex_raw_string_wrong = [
		'10.1109/VISUAL.1999.809920', 
		'10.1109/VISUAL.1999.809884', 
		'10.1109/VISUAL.1993.398911',
	]
	DF.loc[DF.DOI.isin(openalex_raw_string_wrong), 'Raw Affiliation String'] = DF.loc[
		DF.DOI.isin(openalex_raw_string_wrong)]['IEEE Author Affiliation']
	return DF

def binary_type(row):
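	# collapse the detailed OpenAlex institution types into education vs. non-education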
	if row['First Institution Type'] == 'education':
		binary_type = 'education'
	elif row['First Institution Type'] in [
		'facility', 'government', 'company', 'healthcare', 'archive', 'nonprofit','other'
	]:
		binary_type = 'non-education'
	else:
		binary_type = np.NaN
	return binary_type

def binary_type_by_hand(row):
	'''This is to transform the institution type hand-coded by me into a binary type
	'''
	if row['First Institution Type By Hand'] == 'education':
		binary_type = 'education'
	elif row['First Institution Type By Hand'] in [
		'facility', 'government', 'company', 
		'healthcare', 'archive', 'nonprofit', 'other', 
		# just in case I have input these by hand:
		'noneducation', 'non-education'
	]:
		binary_type = 'non-education'
	else:
		binary_type = np.NaN
	return binary_type

def add_binary_type(DF): # DF here is concat_df_filled
	DF['Binary Institution Type'] = DF.apply(binary_type, axis=1)
	DF['Binary Institution Type By Hand'] = DF.apply(binary_type_by_hand, axis=1)
	return DF 

def check_delete_rename(DF): # DF here is concat_df_filled
	# check paper index, author num, and author positions
	if DF['IEEE Paper Index'].tolist() == DF['Paper Index'].tolist():
		print('Two paper index vectors are equal')
	else:
		print('Something wrong with paper index vectors')
	if DF['IEEE Number of Authors'].tolist() == DF['Number of Authors'].tolist():
		print('Two author num vectors are equal')
	else:
		print('Something wrong with author num vectors')
	if DF['IEEE Author Position'].tolist() == DF['Author Position'].tolist():
		print('Two author position vectors are equal')
	else:
		print('Something wrong with author position vectors, '
			'but this is expected as it indicates that the fuzzy matching above works.')
	# delete useless columns
	DF.drop(['Year', 'DOI', 'Title', 'IEEE Paper Index', 'Paper Index'], axis=1, inplace=True)
	# add a column called IEEE Author Affiliation Filled. It is basically the same as 
	# IEEE Author Affiliation. The only difference is that if the IEEE value is NaN, 
	# I take the data from the OpenAlex raw string
	DF['IEEE Author Affiliation Filled'] = np.where(
		DF['IEEE Author Affiliation'].notnull(),
		DF['IEEE Author Affiliation'],
		DF['Raw Affiliation String'],
	)
	# rename columns
	DF.rename(columns={
		'IEEE Year': 'Year',
		'IEEE DOI': 'DOI',
		'IEEE Title': 'Title',
		'IEEE Author Affiliation': 'IEEE Author Affiliation Updated',
		'First Institution Name': 'First Institution Name Updated',
		'Raw Affiliation String': 'Raw Affiliation String Updated',
		# 'First Institution Type': 'First Institution Type Updated',
		# 'First Institution Country Code': 'First Institution Country Code Updated',
	}, inplace=True)
	return DF

def main():
	ieee = update_ieee_orig(ieee_orig)
	diff_dois = get_diff_dois(ieee, alex)
	alex_new = get_alex_new(ieee, alex, diff_dois)
	ieee_sorted, alex_sorted = get_sorted_dfs(ieee, alex_new, papers)
	concat_df = get_concat_df(ieee_sorted, alex_sorted, papers)
	concat_df = update_with_vispubdata_author_data(vispd, concat_df)
	concat_df = update_concat_df(concat_df)
	manual_update_concat_df(concat_df)
	concat_df_filled = get_concat_df_filled(concat_df)
	concat_df_filled = recode_to_edu(concat_df_filled)
	concat_df_filled = get_alex_raw_string_correct(concat_df_filled)
	concat_df_filled = add_binary_type(concat_df_filled)
	concat_df_filled = check_delete_rename(concat_df_filled)
	return concat_df_filled

if __name__ == '__main__':
	vispd = pd.read_csv(VISPUBDATA)
	doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA)
	ieee_orig = pd.read_csv(IEEE_AUTHOR)
	alex = pd.read_csv(OPENALEX_AUTHOR)
	papers = read_txt(PAPERS_TO_STUDY)
	df = main()
	df.to_csv(MERGED_AUTHOR_DF, index=False)
import pandas as pd 
import numpy as np 
import requests
import random
import math
import re 
import sys 
import time 
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import json 

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CITATION_AUTHOR_DF = sys.argv[2]
OPENALEX_CITATION_CONCEPT_DF = sys.argv[3]
OPENALEX_CITATION_PAPER_DF = sys.argv[4]

def get_dicts(OPENALEX_PAPER_DF): # vispd_openalex_match here is OPENALEX_PAPER_DF
	df = pd.read_csv(OPENALEX_PAPER_DF)
	dois = df['DOI'].tolist()
	urls = df['Citation API URL'].tolist()
	openalex_ids = df['OpenAlex ID'].tolist()
	years = df['Year'].tolist()
	titles = df['Title'].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	doi_url_dict = dict(zip(dois, urls))
	doi_openalexID_dict = dict(zip(dois, openalex_ids))
	return [dois, urls, doi_year_dict, doi_title_dict, doi_url_dict, doi_openalexID_dict]

def get_s():
	# set retry if status codes in [500, 502, 503, 504, 429]
	# also return headers
	s = requests.Session()
	retries = Retry(total=5,
		backoff_factor=0.1,
		status_forcelist=[ 500, 502, 503, 504, 429],
	)
	s.mount('http://', HTTPAdapter(max_retries=retries))
	headers = {
	"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
	'Accept': 'application/json',
	}
	return s, headers

def get_concept_dict_list_from_concepts(doi, result, concepts):
	"""returns a list of dicts
	"""
	openalex_year = result['publication_year']
	openalex_id = re.sub('https://openalex.org/', '', result['id'])
	openalex_title = result['display_name']
	openalex_doi = result['doi']
	concept_dict_list = []
	num_concepts = len(concepts)
	for i in concepts:
		concept_index = concepts.index(i) + 1
		concept_name = i['display_name']
		openalex_concept_id = i['id']
		wikidata_url = i['wikidata']
		level = i['level']
		score = i['score']
		concept_dict = {
			'Cited Ppaer Year': doi_year_dict[doi],
			'Cited Paper DOI': doi,
			'Cited Paper Title': doi_title_dict[doi],
			'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
			'Citation Paper Year': openalex_year,
			'Citation Paper OpenAlex ID': openalex_id,
			'Citation Ppaer OpenAlex Title': openalex_title,
			'Citation Paper OpenAlex DOI': openalex_doi,
			'Number of Concepts': num_concepts,
			'Index of Concept': concept_index,
			'Concept': concept_name,
			'Concept ID': openalex_concept_id,
			'Wikidata': wikidata_url,
			'Level': level,
			'Score': score,
		}
		concept_dict_list.append(concept_dict)
	return concept_dict_list

def get_author_dict_list_from_authors(doi, result, authors):
	"""returns a list of dicts
	"""
	openalex_year = result['publication_year']
	openalex_id = re.sub('https://openalex.org/', '', result['id'])
	openalex_title = result['display_name']
	openalex_doi = result['doi']
	author_dict_list = []
	num_authors = len(authors)
	for i in authors:
		author = i['author']
		author_name = author['display_name']
		author_position = authors.index(i) + 1
		position_type = i['author_position']
		openalex_author_id = author['id']
		author_orcid = author['orcid']
		raw_affiliation_string = i['raw_affiliation_string']
		if len(i['institutions']) == 0:
			num_institutions = np.NaN
			first_institution = np.NaN
			institution_name = np.NaN
			institution_id = np.NaN
			ror = np.NaN
			country_code = np.NaN
			institution_type = np.NaN
		else:
			num_institutions = len(i['institutions'])
			first_institution = i['institutions'][0]
			# Check whether the institution object is empty
			# this is because, in the first citation of 10.1109/TVCG.2007.70599
			# the first author's institution is empty, which causes errors 
			if first_institution:
				institution_name = first_institution['display_name']
				institution_id = first_institution['id']
				ror = first_institution['ror']
				country_code = first_institution['country_code']
				institution_type = first_institution['type']
			else:
				institution_name = np.NaN
				institution_id = np.NaN
				ror = np.NaN
				country_code = np.NaN
				institution_type = np.NaN
		author_dict = {
			'Cited Ppaer Year': doi_year_dict[doi],
			'Cited Paper DOI': doi,
			'Cited Paper Title': doi_title_dict[doi],
			'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
			'Citation Paper Year': openalex_year,
			'Citation Paper OpenAlex ID': openalex_id,
			'Citation Ppaer OpenAlex Title': openalex_title,
			'Citation Paper OpenAlex DOI': openalex_doi,
			'Number of Authors': num_authors,
			'Author Name': author_name,
			'Author Position': author_position,
			'Author Position Type': position_type,
			'OpenAlex Author ID': openalex_author_id,
			'Author ORCID': author_orcid,
			'Number of Affiliations': num_institutions,
			'First Institution Name': institution_name,
			'Raw Affiliation String': raw_affiliation_string,
			'First Institution ID': institution_id,
			'First Institution ROR': ror,
			'First Institution Type': institution_type,
			'First Institution Country Code': country_code
		}
		author_dict_list.append(author_dict)
	return author_dict_list

def get_paper_dict_from_json_result(j, doi):
	"""returns a dict 
	"""
	authors = j['authorships']
	num_authors = len(authors)
	concepts = j['concepts']
	num_concepts = len(concepts)
	openalex_year = j['publication_year']
	openalex_id = re.sub('https://openalex.org/', '', j['id'])
	openalex_title = j['display_name']
	openalex_doi = j['doi']
	openalex_publication_date = j['publication_date']
	venue = j['host_venue']
	openalex_venue_id = venue['id']
	openalex_url = venue['url']
	openalex_venue_name = venue['display_name']
	openalex_publisher = venue['publisher']
	publication_type = j['type']
	openalex_first_page = j['biblio']['first_page']
	openalex_last_page = j['biblio']['last_page']
	# num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None 
	# 	else int(openalex_last_page) - int(openalex_first_page) + 1)
	num_references = len(j['referenced_works'])
	num_citations = j['cited_by_count']
	# cited_by_api_url is a little complicated: a title query returns it as a list,
	#   whereas a doi query returns it as a str
	cited_url = j['cited_by_api_url']
	cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
	num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
	paper_dict = {
		'Cited Ppaer Year': doi_year_dict[doi],
		'Cited Paper DOI': doi,
		'Cited Paper Title': doi_title_dict[doi],
		'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
		'OpenAlex Year': openalex_year,
		'OpenAlex Publication Date': openalex_publication_date,
		'Citation Paper OpenAlex ID': openalex_id,
		'Citation Paper OpenAlex Title': openalex_title,
		'Citation Paper OpenAlex DOI': openalex_doi,
		'Citation Paper OpenAlex URL': openalex_url,
		'OpenAlex Venue ID': openalex_venue_id,
		'OpenAlex Venue Name': openalex_venue_name,
		'OpenAlex Publisher': openalex_publisher,
		'Publication Type': publication_type,
		'OpenAlex First Page': openalex_first_page,
		'OpenAlex Last Page': openalex_last_page,
		# 'Number of Pages': num_pages,
		'Number of References': num_references,
		'Number of Authors': num_authors,
		'Number of Concepts': num_concepts,
		'Number of Citations': num_citations,
		'Citation API URL': cited_by_api_url,
		'Number of Citation API URLs': num_cited_by_api_url,
	}
	return paper_dict

def get_empty_dict_list(doi):
	dict_list = [{
		'Cited Ppaer Year': doi_year_dict[doi],
		'Cited Paper DOI': doi,
		'Cited Paper Title': doi_title_dict[doi],
		'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
	}]
	return dict_list

def get_empty_dict(doi):
	a_dict = {
		'Cited Ppaer Year': doi_year_dict[doi],
		'Cited Paper DOI': doi,
		'Cited Paper Title': doi_title_dict[doi],
		'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
	}
	return a_dict 

def get_json_result(url, s, headers):
	"""if 404 or other error codes, retry
	This function guards against error responses. I am pretty sure that every cited_by_api_url will get 
		a status_code of 200, which is why I am confident using this function.

	Also, it should be noted that if the status code is 404, then s.get(url).json() will 
		throw an error, so I don't need to check the status code in this function. 
	"""
	try: 
		j = s.get(url, headers=headers).json()
	except:
		time.sleep(1) 
		return get_json_result(url, s, headers)
	else:
		return j

def main(DOIS, s, headers):
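	# for each cited paper, page through its citing works (50 per page) and collect concept, author, and paper dicts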
	for doi in DOIS:
		# to make sure the api-url is not nan:
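		# (NaN != NaN, so the equality check below fails only when the Citation API URL is missing)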
		if doi_url_dict[doi] == doi_url_dict[doi]:
			url = doi_url_dict[doi] + '&per-page=50'
			j0 = get_json_result(url, s, headers)
			count = j0['meta']['count']
			per_page = 50
			total_pages = math.ceil(count/per_page)
			# checking whether results are empty
			if count > 0:
				# for every page
				for i in range(1,total_pages+1):
					list_of_concept_dict_lists = []
					list_of_author_dict_lists = []
					paper_dict_list = []
					j = get_json_result(url + f'&page={i}', s, headers=headers)
					results = j['results']
					# for every result in a page
					for result in results:
						concepts = result['concepts']
						authors = result['authorships']
						concept_dict_list = get_concept_dict_list_from_concepts(doi, result, concepts)
						author_dict_list = get_author_dict_list_from_authors(doi, result, authors)
						paper_dict = get_paper_dict_from_json_result(result, doi)
						list_of_concept_dict_lists.append(concept_dict_list)
						list_of_author_dict_lists.append(author_dict_list)
						paper_dict_list.append(paper_dict)
					lists_concepts.append(list_of_concept_dict_lists)
					lists_authors.append(list_of_author_dict_lists)
					list_of_paper_dict_lists.append(paper_dict_list)
					time.sleep(0.2)

			# if empty results:
			else:
				list_of_concept_dict_lists = []
				list_of_author_dict_lists = []
				paper_dict_list = []
				concept_dict_list = get_empty_dict_list(doi)
				author_dict_list = get_empty_dict_list(doi)
				paper_dict = get_empty_dict(doi)
				list_of_concept_dict_lists.append(concept_dict_list)
				list_of_author_dict_lists.append(author_dict_list)
				paper_dict_list.append(paper_dict)
				lists_concepts.append(list_of_concept_dict_lists)
				lists_authors.append(list_of_author_dict_lists)
				list_of_paper_dict_lists.append(paper_dict_list)
		else:
			list_of_concept_dict_lists = []
			list_of_author_dict_lists = []
			paper_dict_list = []
			concept_dict_list = get_empty_dict_list(doi)
			author_dict_list = get_empty_dict_list(doi)
			paper_dict = get_empty_dict(doi)
			list_of_concept_dict_lists.append(concept_dict_list)
			list_of_author_dict_lists.append(author_dict_list)
			paper_dict_list.append(paper_dict)
			lists_concepts.append(list_of_concept_dict_lists)
			lists_authors.append(list_of_author_dict_lists)
			list_of_paper_dict_lists.append(paper_dict_list)
		print(f'{DOIS.index(doi) + 1} is done')
		time.sleep(0.5)

if __name__ == '__main__':
	# I don't need to worry about papers having no citations. 
	# This is because even if there are no citations, there is still a cited_by_api_url,
	# and the result count at that cited_by_api_url will be zero.
	# I have handled this case in main().
	# call get_dicts() once instead of re-reading the CSV for each element
	(dois, urls, doi_year_dict, doi_title_dict,
		doi_url_dict, doi_openalexID_dict) = get_dicts(OPENALEX_PAPER_DF)
	random_dois = random.sample(dois, 10)
	lists_concepts = [] # list of lists of concept dict lists
	lists_authors = [] # list of lists of author dict lists
	list_of_paper_dict_lists = [] # list of paper dict lists
	s, headers = get_s()
	main(dois, s, headers)

author_df_initiate = pd.DataFrame()
concept_df_initiate = pd.DataFrame()

def build_df_from_lists(lists, df):
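	# each element of `lists` is one page of results; each inner element is the dict list for a single citing paper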
	for i in lists:
		df1 = pd.concat([pd.DataFrame(l) for l in i], ignore_index=True)
		df = df.append(df1, ignore_index=True)
	return df 

author_df = build_df_from_lists(lists_authors, author_df_initiate)
concept_df = build_df_from_lists(lists_concepts, concept_df_initiate)
paper_df = pd.concat(
	[pd.DataFrame(l) for l in list_of_paper_dict_lists], ignore_index=True)

author_df.to_csv(OPENALEX_CITATION_AUTHOR_DF, index=False)
concept_df.to_csv(OPENALEX_CITATION_CONCEPT_DF, index=False)
paper_df.to_csv(OPENALEX_CITATION_PAPER_DF, index=False)
import pandas as pd 
import numpy as np 
import requests
import random
import math
import csv  
import re 
import sys 
import time 
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
OPENALEX_AUTHOR_DF = sys.argv[4]
OPENALEX_CONCEPT_DF = sys.argv[5]
OPENALEX_REFERENCE_DF = sys.argv[6]
TITEL_QUERY_EMPTY_DOI_QUERY_404_DFS = sys.argv[7]
TITLE_QUERY_404_DFS = sys.argv[8]
DOI_QUERY_404_DFS = sys.argv[9]

def read_txt(INPUT):
	"""read a txt file and return a list of its lines
	"""
	with open(INPUT, "r") as raw:
		reader = csv.reader(raw)
		allRows = [row for row in reader]
	data = [i[0] for i in allRows]
	return data

def get_dicts(VISPUBDATA_PLUS):
	# get year_dict and title_dict
	vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
	dois = vispd_plus.loc[:, "DOI"].tolist()
	titles = vispd_plus.loc[:, "Title"].tolist()
	years = vispd_plus.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return [doi_year_dict, doi_title_dict]

def get_concept_dict_list_from_concepts(doi, concepts):
	"""returns a list of dicts
	"""
	concept_dict_list = []
	num_concepts = len(concepts)
	# first check whether the list concepts is empty:
	if concepts:
		for i in concepts:
			concept_index = concepts.index(i) + 1
			concept_name = i['display_name']
			openalex_concept_id = i['id']
			wikidata_url = i['wikidata']
			level = i['level']
			score = i['score']
			concept_dict = {
				'Year': doi_year_dict[doi],
				'DOI': doi,
				'Title': doi_title_dict[doi],
				'Number of Concepts': num_concepts,
				'Index of Concept': concept_index,
				'Concept': concept_name,
				'Concept ID': openalex_concept_id,
				'Wikidata': wikidata_url,
				'Level': level,
				'Score': score,
			}
			concept_dict_list.append(concept_dict)
	# if concept list is empty, 'number of concepts' will be NaN
	else:
		concept_dict = {
			'Year': doi_year_dict[doi],
			'DOI': doi,
			'Title': doi_title_dict[doi],
		}
		concept_dict_list.append(concept_dict)
	return concept_dict_list

def get_reference_dict_list_from_referenced_works(doi, referenced_works):
	reference_dict_list = []
	num_references = len(referenced_works)
	# first check whether the list of referenced works is empty
	if referenced_works:
		for i in referenced_works:
			reference_index = referenced_works.index(i) + 1
			reference_dict = {
				'Year': doi_year_dict[doi],
				'DOI': doi,
				'Title': doi_title_dict[doi],
				'Number of References': num_references,
				'Index of Reference': reference_index,
				'Reference': i,
			}
			reference_dict_list.append(reference_dict)
	# if refs list is empty, 'number of references' will be NaN
	else:
		reference_dict = {
			'Year': doi_year_dict[doi],
			'DOI': doi,
			'Title': doi_title_dict[doi],
		}
		reference_dict_list.append(reference_dict)
	return reference_dict_list

def get_author_dict_list_from_authors(doi, authors):
	"""returns a list of dicts
	"""
	author_dict_list = []
	num_authors = len(authors)
	# first check whether authors is empty
	if authors:
		for i in authors:
			author = i['author']
			author_name = author['display_name']
			author_position = authors.index(i) + 1
			position_type = i['author_position']
			openalex_author_id = author['id']
			author_orcid = author['orcid']
			raw_affiliation_string = i['raw_affiliation_string']
			if len(i['institutions']) == 0:
				num_institutions = np.NaN
				first_institution = np.NaN
				institution_name = np.NaN
				institution_id = np.NaN
				ror = np.NaN
				country_code = np.NaN
				institution_type = np.NaN
			else:
				num_institutions = len(i['institutions'])
				first_institution = i['institutions'][0]
				institution_name = first_institution['display_name']
				institution_id = first_institution['id']
				ror = first_institution['ror']
				country_code = first_institution['country_code']
				institution_type = first_institution['type']
			author_dict = {
				'Year': doi_year_dict[doi],
				'DOI': doi,
				'Title': doi_title_dict[doi],
				'Number of Authors': num_authors,
				'Author Name': author_name,
				'Author Position': author_position,
				'Author Position Type': position_type,
				'OpenAlex Author ID': openalex_author_id,
				'Author ORCID': author_orcid,
				'Number of Affiliations': num_institutions,
				'First Institution Name': institution_name,
				'Raw Affiliation String': raw_affiliation_string,
				'First Institution ID': institution_id,
				'First Institution ROR': ror,
				'First Institution Type': institution_type,
				'First Institution Country Code': country_code
			}
			author_dict_list.append(author_dict)
	# if authors list is empty, 'number of authors' will be NaN
	else:
		author_dict = {
			'Year': doi_year_dict[doi],
			'DOI': doi,
			'Title': doi_title_dict[doi],
		}
		author_dict_list.append(author_dict)
	return author_dict_list

def get_paper_dict_from_json_result(j, doi):
	"""returns a dict 
	"""
	authors = j['authorships']
	num_authors = len(authors)
	concepts = j['concepts']
	num_concepts = len(concepts)
	openalex_id = re.sub('https://openalex.org/', '', j['id'])
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	openalex_publication_date = j['publication_date']
	openalex_doi = j['doi']
	venue = j['host_venue']
	openalex_venue_id = venue['id']
	openalex_url = venue['url']
	openalex_venue_name = venue['display_name']
	openalex_publisher = venue['publisher']
	publication_type = j['type']
	openalex_first_page = j['biblio']['first_page']
	openalex_last_page = j['biblio']['last_page']
	num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None 
		else int(openalex_last_page) - int(openalex_first_page) + 1)
	num_references = len(j['referenced_works'])
	num_citations = j['cited_by_count']
	# cited_by_api_url is a little complicated: a title query returns it as a list,
	#   whereas a doi query returns it as a str
	cited_url = j['cited_by_api_url']
	cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
	num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
		'OpenAlex Year': openalex_year,
		'OpenAlex Publication Date': openalex_publication_date,
		'OpenAlex ID': openalex_id,
		'OpenAlex Title': openalex_title,
		'OpenAlex DOI': openalex_doi,
		'OpenAlex URL': openalex_url,
		'OpenAlex Venue ID': openalex_venue_id,
		'OpenAlex Venue Name': openalex_venue_name,
		'OpenAlex Publisher': openalex_publisher,
		'Publication Type': publication_type,
		'OpenAlex First Page': openalex_first_page,
		'OpenAlex Last Page': openalex_last_page,
		'Number of Pages': num_pages,
		'Number of References': num_references,
		'Number of Authors': num_authors,
		'Number of Concepts': num_concepts,
		'Number of Citations': num_citations,
		'Citation API URL': cited_by_api_url,
		'Number of Citation API URLs': num_cited_by_api_url,
	}
	return paper_dict

def get_empty_dict_list(doi):
	dict_list = [{
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
	}]
	return dict_list

def get_empty_paper_dict(doi):
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
	}
	return paper_dict

def get_title_query_response(doi):
	title = doi_title_dict[doi]
	title_to_query = re.sub(r'\:|\?|\&|\,', '', title)
	response = requests.get(
		'https://api.openalex.org/works?filter=title.search:' + title_to_query)
	return response, title_to_query

def check_results_count(response):
	j = response.json()
	count = j['meta']['count']
	return j, count 

def get_doi_query_response(doi):
	response = requests.get("https://api.openalex.org/works/doi:" + doi)
	return response

def get_data(doi, doi_index):
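	# overall strategy: try a title search on OpenAlex first and fall back to a DOI query;
	# DOIs listed in to_query_by_doi skip the title search and are queried by DOI directly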
	# if doi not in to_query_by_doi, query title first
	if doi not in to_query_by_doi:
		# query title first:
		response = get_title_query_response(doi)[0]
		# if the response.status_code is in retry_code, then there is something wrong.
		#    I will sleep for a while and try again. Note that if the status_code is 404, 
		#      I add the doi to title_query_404_list (see below, if status_code != 200), rather than retrying
		while response.status_code in retry_code:
			print(f'Title query has errors for {doi_index} : {doi_title_dict[doi]}. Error status code is {response.status_code}. Retrying...')
			time.sleep(3)
			response = get_title_query_response(doi)[0]
		# if title query succeeds:
		if response.status_code == 200:
			# get json and check results count:
			j = check_results_count(response)[0]
			count = check_results_count(response)[1]
			# if count is non-zero:
			if count > 0:
				# if doi not in special_result_index_dict, use index of 0
				#   otherwise, use the value corresponding to the key
				if doi not in list(special_result_index_dict.keys()):
					correct_result = j['results'][0]
				else:
					correct_index = special_result_index_dict[doi]
					correct_result = j['results'][correct_index]
				authors = correct_result['authorships']
				concepts = correct_result['concepts']
				referenced_works = correct_result['referenced_works']
				paper_dict = get_paper_dict_from_json_result(correct_result, doi)
				author_dict_list = get_author_dict_list_from_authors(doi, authors)
				concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
				reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
			# if count is zero, query doi instead
			else:
				# get doi query response:
				response2 = get_doi_query_response(doi)
				# if status code is in retry_code, retry
				while response2.status_code in retry_code:
					print(f'doi query has error for {doi_index} : {doi}, error status code is {response2.status_code}, retrying...')
					time.sleep(3)
					response2 = get_doi_query_response(doi)
				# if doi query succeeds:
				if response2.status_code == 200:
					j2 = response2.json()
					authors = j2['authorships']
					concepts = j2['concepts']
					referenced_works = j2['referenced_works']
					paper_dict = get_paper_dict_from_json_result(j2, doi)
					author_dict_list = get_author_dict_list_from_authors(doi, authors)
					concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
					reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
				# if the doi query fails, add the doi to title_query_empty_doi_query_404_list
				else:
					error_status_code.append(response2.status_code)
					title_query_empty_doi_query_404_list.append(doi)
					paper_dict = get_empty_paper_dict(doi)
					author_dict_list = get_empty_dict_list(doi)
					concept_dict_list = get_empty_dict_list(doi)
					reference_dict_list = get_empty_dict_list(doi)
					print(f'doi query fails for {doi_index} : {doi}')
		# if the title query itself fails (most likely status code 404), which is very unlikely,
		#    add the doi to title_query_404_list
		else:
			title_query_404_list.append(doi)
			error_status_code.append(response.status_code)
			paper_dict = get_empty_paper_dict(doi)
			author_dict_list = get_empty_dict_list(doi)
			concept_dict_list = get_empty_dict_list(doi)
			reference_dict_list = get_empty_dict_list(doi)
			print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
	# if doi in to_query_by_doi, use doi query
	else:
		# get doi query response:
		response0 = get_doi_query_response(doi)
		# if status code is in retry_code, retry
		while response0.status_code in retry_code:
			print(f'doi query for {doi_index} : {doi} has error, status code is {response0.status_code}, retrying...')
			time.sleep(3)
			response0 = get_doi_query_response(doi)
		# if doi query succeeds:
		if response0.status_code == 200:
			j0 = response0.json()
			authors = j0['authorships']
			concepts = j0['concepts']
			referenced_works = j0['referenced_works']
			paper_dict = get_paper_dict_from_json_result(j0, doi)
			author_dict_list = get_author_dict_list_from_authors(doi, authors)
			concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
			reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
		# if the doi query fails, add the doi to doi_query_404_list
		else:
			error_status_code.append(response0.status_code)
			doi_query_404_list.append(doi)
			paper_dict = get_empty_paper_dict(doi)
			author_dict_list = get_empty_dict_list(doi)
			concept_dict_list = get_empty_dict_list(doi)
			reference_dict_list = get_empty_dict_list(doi)
			print(f'doi query fails for {doi_index} : {doi}')
	list_of_paper_dicts.append(paper_dict)
	list_of_author_dict_lists.append(author_dict_list)
	list_of_concept_dict_lists.append(concept_dict_list)
	list_of_reference_dict_lists.append(reference_dict_list)

def main(DOIS):
	for doi in DOIS:
		doi_index = DOIS.index(doi) + 1
		get_data(doi, doi_index)
		print(f'{doi_index} is done')
		time.sleep(0.5)
	print(list(set(error_status_code)))

if __name__ == '__main__':
	papers_to_study = read_txt(PAPERS_TO_STUDY)
	random_papers_to_study = random.sample(papers_to_study, 10)
	doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
	list_of_paper_dicts = []
	list_of_author_dict_lists = []
	list_of_concept_dict_lists = []
	list_of_reference_dict_lists = []
	title_query_empty_doi_query_404_list = []
	title_query_404_list = []
	doi_query_404_list = []
	retry_code = [ 500, 502, 503, 504, 429]
	error_status_code = []
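	# DOIs for which get_data skips the title search and queries OpenAlex by DOI directly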
	to_query_by_doi = [
		'10.1109/VISUAL.2001.964489',
		'10.1109/VISUAL.1996.568113',
		'10.1109/VISUAL.1999.809896',
		'10.1109/VISUAL.1991.175771',
		'10.1109/VISUAL.1998.745302',
		'10.1109/VISUAL.1993.398868',
		'10.1109/INFVIS.2005.1532128',
		'10.1109/VISUAL.1993.398859',
		'10.1109/VISUAL.1991.175795',
		'10.1109/VISUAL.2003.1250401',
		'10.1109/VISUAL.1991.175789',
		'10.1109/VISUAL.2000.885739',
		'10.1109/TVCG.2014.2346922',
		'10.1109/VISUAL.1999.809871',
		'10.1109/VISUAL.1996.567807',
		'10.1109/VISUAL.2000.885692',
		'10.1109/VISUAL.1991.175777',
		'10.1109/VISUAL.1998.745315',
		'10.1109/VISUAL.1997.663909',
		'10.1109/VISUAL.2000.885697',
		'10.1109/VISUAL.2001.964504',
		'10.1109/TVCG.2006.168',
		'10.1109/TVCG.2007.70617',
		'10.1109/VISUAL.1997.663910',
		'10.1109/VISUAL.1997.663931',
		'10.1109/VISUAL.2002.1183792',
		'10.1109/VISUAL.1992.235201',
		'10.1109/VISUAL.1996.568128',
		'10.1109/VISUAL.1997.663923',
		'10.1109/VAST.2011.6102441',
		'10.1109/VISUAL.2000.885732',
		'10.1109/VISUAL.2001.964522',
		'10.1109/VISUAL.2005.1532812',
		'10.1109/VISUAL.1998.745350',
		'10.1109/INFVIS.2001.963282',
		'10.1109/VISUAL.1995.480804',
		'10.1109/VISUAL.2005.1532847',
		'10.1109/INFVIS.1996.559229',
		'10.1109/VISUAL.2000.885738',
		'10.1109/VISUAL.1991.175800',
		'10.1109/VISUAL.1993.398865',
		'10.1109/VISUAL.1993.398866',
		'10.1109/VISUAL.1998.745348',
		'10.1109/VISUAL.1993.398867',
		'10.1109/VISUAL.1997.663925',
		'10.1109/VISUAL.1993.398900',
		'10.1109/VISUAL.1992.235181',
		'10.1109/VISUAL.1992.235195',
		'10.1109/VISUAL.2000.885719',
		'10.1109/VISUAL.1991.175816',
		'10.1109/VISUAL.1990.146414',
		'10.1109/VISUAL.1993.398861',
		'10.1109/VISUAL.1993.398872',
		'10.1109/VISUAL.1994.346292',
		'10.1109/VISUAL.1994.346295',
		'10.1109/VISUAL.1994.346297',
		'10.1109/VISUAL.1994.346301',
		'10.1109/VISUAL.1999.809913',
		'10.1109/VISUAL.2001.964546',
		'10.1109/VISUAL.2003.1250404',
		'10.1109/TVCG.2014.2346442',
		'10.1109/TVCG.2020.3028948',
		'10.1109/TVCG.2020.3030363',
		'10.1109/TVCG.2020.3030364',
		'10.1109/tvcg.2021.3114784',
		'10.1109/tvcg.2021.3114780',
		'10.1109/tvcg.2021.3114782',
		'10.1109/tvcg.2021.3114783',
		'10.1109/tvcg.2021.3114836',
		'10.1109/TVCG.2021.3064037',
		'10.1109/TVCG.2021.3114849',
		'10.1109/TVCG.2021.3114842',
		'10.1109/TVCG.2021.3114766',
		'10.1109/TVCG.2021.3114777'
	]
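	# for these DOIs, the correct record is not the first title-query result;
	# the value gives the index of the correct result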
	special_result_index_dict = {
		'10.1109/VISUAL.1992.235194': 4,
	}
	main(papers_to_study)

paper_df = pd.DataFrame(list_of_paper_dicts)
author_df = pd.concat(
	[pd.DataFrame(l) for l in list_of_author_dict_lists], ignore_index=True)
concept_df = pd.concat(
	[pd.DataFrame(l) for l in list_of_concept_dict_lists], ignore_index=True)
reference_df = pd.concat(
	[pd.DataFrame(l) for l in list_of_reference_dict_lists], ignore_index=True)

paper_df.to_csv(OPENALEX_PAPER_DF, index=False)
author_df.to_csv(OPENALEX_AUTHOR_DF, index=False)
concept_df.to_csv(OPENALEX_CONCEPT_DF, index=False)
reference_df.to_csv(OPENALEX_REFERENCE_DF, index=False)

with open(TITEL_QUERY_EMPTY_DOI_QUERY_404_DFS, 'w') as f:
	for doi in title_query_empty_doi_query_404_list:
		f.write("%s\n" % doi)

with open(TITLE_QUERY_404_DFS, 'w') as f:
	for doi in title_query_404_list:
		f.write("%s\n" % doi)

with open(DOI_QUERY_404_DFS, 'w') as f:
	for doi in doi_query_404_list:
		f.write("%s\n" % doi)
import pandas as pd 
import numpy as np 
import requests
import random
import re 
import sys 
import time 
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

OPENALEX_REFERENCE_DF = sys.argv[1]
OPENALEX_REFERENCE_PAPER_DF_UNIQUE = sys.argv[2]
OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE = sys.argv[3]
OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE = sys.argv[4]
OPENALEX_REFERENCE_PAPER_DF = sys.argv[5]
OPENALEX_REFERENCE_AUTHOR_DF = sys.argv[6]
OPENALEX_REFERENCE_CONCEPT_DF = sys.argv[7]
OPENALEX_REFERENCE_ERROR_DF = sys.argv[8]

def get_unique_ref_urls(ref_df): # ref_df here is OPENALEX_REFERENCE_DF
	# returns the dataframe and a list of unique reference paper urls
	ref = pd.read_csv(ref_df).dropna(subset=['Number of References'])
	unique_ref_urls = list(set(ref.Reference.tolist()))
	return ref, unique_ref_urls

def get_s():
	# set retry if status codes in [ 500, 502, 503, 504, 429]
	# also return headers
	s = requests.Session()
	retries = Retry(total=5,
		backoff_factor=0.1,
		status_forcelist=[ 500, 502, 503, 504, 429],
	)
	s.mount('http://', HTTPAdapter(max_retries=retries))
	# also mount the adapter for https, since the OpenAlex API is queried over https
	s.mount('https://', HTTPAdapter(max_retries=retries))
	headers = {
	"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
	'Accept': 'application/json',
	}
	return s, headers

def get_paper_dict_from_json_result(j, url, paper_dict_list):
	"""returns a dict 
	"""
	authors = j['authorships']
	num_authors = len(authors)
	concepts = j['concepts']
	num_concepts = len(concepts)
	openalex_id = re.sub('https://openalex.org/', '', j['id'])
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	openalex_publication_date = j['publication_date']
	openalex_doi = j['doi']
	venue = j['host_venue']
	openalex_venue_id = venue['id']
	openalex_url = venue['url']
	openalex_venue_name = venue['display_name']
	openalex_publisher = venue['publisher']
	publication_type = j['type']
	openalex_first_page = j['biblio']['first_page']
	openalex_last_page = j['biblio']['last_page']
	# num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None 
	# 	else int(openalex_last_page) - int(openalex_first_page) + 1)
	num_references = len(j['referenced_works'])
	num_citations = j['cited_by_count']
	# cited_by_api_url is a little bit complicated because in the results of title query
	#   it returns a list whereas it returns a str in doi query
	cited_url = j['cited_by_api_url']
	cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
	num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
	paper_dict = {
		'Reference': re.sub('//api.', '//', url),
		'OpenAlex Year': openalex_year,
		'OpenAlex Publication Date': openalex_publication_date,
		'OpenAlex ID': openalex_id,
		'OpenAlex Title': openalex_title,
		'OpenAlex DOI': openalex_doi,
		'OpenAlex URL': openalex_url,
		'OpenAlex Venue ID': openalex_venue_id,
		'OpenAlex Venue Name': openalex_venue_name,
		'OpenAlex Publisher': openalex_publisher,
		'Publication Type': publication_type,
		'OpenAlex First Page': openalex_first_page,
		'OpenAlex Last Page': openalex_last_page,
		# 'Number of Pages': num_pages,
		'Number of References for Reference paper': num_references,
		'Number of Citations': num_citations,
		'Number of Authors': num_authors,
		'Number of Concepts': num_concepts,
		'Citation API URL': cited_by_api_url,
		'Number of Citation API URLs': num_cited_by_api_url,
	}
	paper_dict_list.append(paper_dict)
	return paper_dict_list

def get_author_dict_list_from_authors(j, url, author_dict_list):
	"""returns a list of dicts
	"""
	openalex_id = re.sub('https://openalex.org/', '', j['id'])
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	authors = j['authorships']
	num_authors = len(authors)
	for i in authors:
		author = i['author']
		author_name = author['display_name']
		author_position = authors.index(i) + 1
		position_type = i['author_position']
		openalex_author_id = author['id']
		author_orcid = author['orcid']
		raw_affiliation_string = i['raw_affiliation_string']
		if len(i['institutions']) == 0:
			num_institutions = np.NaN
			first_institution = np.NaN
			institution_name = np.NaN
			institution_id = np.NaN
			ror = np.NaN
			country_code = np.NaN
			institution_type = np.NaN
		else:
			num_institutions = len(i['institutions'])
			first_institution = i['institutions'][0]
			institution_name = first_institution['display_name']
			institution_id = first_institution['id']
			ror = first_institution['ror']
			country_code = first_institution['country_code']
			institution_type = first_institution['type']
		author_dict = {
			'Reference': re.sub('//api.', '//', url),
			'Reference OpenAlex Year': openalex_year,
			'Reference OpenAlex ID': openalex_id,
			'Reference OpenAlex Title': openalex_title,
			'Number of Authors': num_authors,
			'Author Name': author_name,
			'Author Position': author_position,
			'Author Position Type': position_type,
			'OpenAlex Author ID': openalex_author_id,
			'Author ORCID': author_orcid,
			'Number of Affiliations': num_institutions,
			'First Institution Name': institution_name,
			'Raw Affiliation String': raw_affiliation_string,
			'First Institution ID': institution_id,
			'First Institution ROR': ror,
			'First Institution Type': institution_type,
			'First Institution Country Code': country_code
		}
		author_dict_list.append(author_dict)
	return author_dict_list

def get_concept_dict_list_from_concepts(j, url, concept_dict_list):
	"""returns a list of dicts
	"""
	openalex_id = re.sub('https://openalex.org/', '', j['id'])
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	concepts = j['concepts']
	num_concepts = len(concepts)
	for i in concepts:
		concept_index = concepts.index(i) + 1
		concept_name = i['display_name']
		openalex_concept_id = i['id']
		wikidata_url = i['wikidata']
		level = i['level']
		score = i['score']
		concept_dict = {
			'Reference': re.sub('//api.', '//', url),
			'Reference OpenAlex Year': openalex_year,
			'Reference OpenAlex ID': openalex_id,
			'Reference OpenAlex Title': openalex_title,
			'Number of Concepts': num_concepts,
			'Index of Concept': concept_index,
			'Concept': concept_name,
			'Concept ID': openalex_concept_id,
			'Wikidata': wikidata_url,
			'Level': level,
			'Score': score,
		}
		concept_dict_list.append(concept_dict)
	return concept_dict_list

def main(URLS, s, headers):
	for url in URLS:
		url_index = URLS.index(url) + 1
		api_url = re.sub('https://', 'https://api.', url)
		response = s.get(api_url, headers=headers)
		# if response.status_code is in retry_code, then something is wrong:
		#    I will sleep for a while and try again. Note that if the status_code is 404,
		#      the except block below catches it and puts the url in error_url_dict_list
		while response.status_code in retry_code:
			print(f'doi query {url_index} : {api_url} has error, status code is {response.status_code}, retrying...')
			time.sleep(3)
			response = s.get(api_url, headers=headers)
		# note that if the error code is 404, processing the response below will fail,
		#   so that url will NOT be included in paper_dict_list, author_dict_list, or concept_dict_list.
		#   Instead, that url will be put in error_url_dict_list.
		#   This is not a problem because later, when I merge with REF, the merged file
		#   will show NaN for 'Number of Concepts'.
		#   In fact, even if I created empty dicts for those urls with 404 status codes,
		#   the final merged output would be the same.
		try:
			j = response.json()
			get_paper_dict_from_json_result(j, url, paper_dict_list)
			get_author_dict_list_from_authors(j, url, author_dict_list)
			get_concept_dict_list_from_concepts(j, url, concept_dict_list)
			print(f'{url_index} / {len(URLS)} is done')
		except:
			error_url_dict = {
			    'Error URL': url,
			    'Error Status Code': response.status_code,
			}
			error_url_dict_list.append(error_url_dict)
			print(f'{url} : {response.status_code}')
		time.sleep(0.5)

if __name__ == '__main__':
	s = get_s()[0]
	headers = get_s()[1]
	# REF is openalex_reference_df with rows omitted whose 'number of reference' is missing
	REF = get_unique_ref_urls(OPENALEX_REFERENCE_DF)[0]
	URLS = get_unique_ref_urls(OPENALEX_REFERENCE_DF)[1]
	random_urls = URLS[0:11]
	paper_dict_list = []
	author_dict_list = []
	concept_dict_list = []
	error_url_dict_list = []
	retry_code = [ 500, 502, 503, 504, 429]
	main(URLS, s, headers)
	paper_df = pd.DataFrame(paper_dict_list)
	author_df = pd.DataFrame(author_dict_list)
	concept_df = pd.DataFrame(concept_dict_list)
	error_df = pd.DataFrame(error_url_dict_list)
	ref_paper_df = REF.merge(paper_df, on="Reference", how='left')
	ref_author_df = REF.merge(author_df, on="Reference", how='left')
	ref_concept_df = REF.merge(concept_df, on="Reference", how='left')
	paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF_UNIQUE, index=False)
	author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE, index=False)
	concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE, index=False)
	ref_paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF, index=False)
	ref_author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF, index=False)
	ref_concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF, index=False)
	error_df.to_csv(OPENALEX_REFERENCE_ERROR_DF, index=False)
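
The 'Reference' column round-trips between the plain OpenAlex URL stored in OPENALEX_REFERENCE_DF and its API form: main() converts each stored URL to the api.openalex.org form before querying, and the dict builders convert it back so the later merge on 'Reference' lines up. A minimal sketch of that round trip, using a made-up OpenAlex work ID:

import re

# made-up OpenAlex work URL, in the form stored in the 'Reference' column
url = 'https://openalex.org/W1234567890'

# main() queries the API form of the URL
api_url = re.sub('https://', 'https://api.', url)
print(api_url)  # https://api.openalex.org/W1234567890

# the dict builders store the plain form again, so it matches REF on merge
print(re.sub('//api.', '//', api_url))  # https://openalex.org/W1234567890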
import pandas as pd
import csv
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
PAPERS_TO_STUDY = sys.argv[2]

def read_txt(INPUT):
    """read txt files and return a list
    """
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_papers_to_study(INPUT): # INPUT here is vispd_plus_good_papers
    vispd_plus_good_papers = read_txt(INPUT)
    to_exclude_from_analysis = [
        '10.1109/VISUAL.1990.146412', # this one simply cannot be found by either title or doi query
        '10.1109/VISUAL.2003.1250379', # this one is wrong match and I can't find a way to locate it on openalex
    ]
    papers_to_study = [
        x for x in vispd_plus_good_papers if x not in to_exclude_from_analysis
    ]
    return papers_to_study

papers_to_study = get_papers_to_study(VISPD_PLUS_GOOD_PAPERS)

with open(PAPERS_TO_STUDY, 'w') as f:
    for doi in papers_to_study:
        f.write("%s\n" % doi)
import sys
import pandas as pd
import requests
from bs4 import BeautifulSoup

TITLES_2021 = sys.argv[1]

def get_page(url):
	r = requests.get(url)
	soup = BeautifulSoup(r.content, 'lxml')
	page = soup.find('article')
	return page

page = get_page('http://ieeevis.org/year/2021/info/papers-sessions')

def get_all_title_str(page):
	"""all_title_str contains both full and short papers' titles
	"""
	strong_elements = page.find_all('strong')
	time_str_elements = [
		x for x in strong_elements if 'CDT' in x.string or 'October' in x.string
	]
	all_title_str = [x.string for x in strong_elements if x not in time_str_elements]
	return all_title_str

all_title_str = get_all_title_str(page)

def get_str_to_exclude(page):
	"""i obtain the list of short paper titles

	First, I obtain both 'strong' and 'em'. Then, I obtain the index of the line that contain 'short papers:'
	That will serve as the "starting index" later. 

	Then, for each line that contain 'short papers:', i obtain the index of the immediate line that contains
	'session chair:'. That index will serve as the "end index".

	For each "starting" and "end" pair, I got the elements in between an extract their string. These include 
	all short papers' titlees. 

	"""
	strong_and_em = page.find_all(['strong', 'em'])
	short_paper_em_idx = [
		strong_and_em.index(i) for i in strong_and_em if 'Short Papers:' in i.string
	]
	session_chair_em_idx = [
		strong_and_em.index(i) for i in strong_and_em if 'Session Chair:' in i.string
	]
	end_idx_list = []
	for idx in short_paper_em_idx:
		end_idx = session_chair_em_idx.index(idx+1)
		end_idx_list.append(session_chair_em_idx[end_idx+1])
	start_end_dic = dict(zip(short_paper_em_idx, end_idx_list))
	str_to_exclude_list = []
	for start in start_end_dic.keys():
		to_exclude = strong_and_em[start:start_end_dic[start]]
		str_to_exclude = [x.string for x in to_exclude]
		str_to_exclude_list.append(str_to_exclude)
	str_to_exclude_list_flattened = [
		item for sublist in str_to_exclude_list for item in sublist
	]
	return str_to_exclude_list_flattened

str_to_exclude = get_str_to_exclude(page)

title_str = [x for x in all_title_str if x not in str_to_exclude]

title_str.remove(
	'Jurassic Mark: Inattentional Blindness for a Datasaurus Reveals that Visualizations are Explored, not Seen'
)

# This paper changed its title for publication in TVCG
title_replace_dict = {
    'IRVINE: Using Interactive Clustering and Labeling to Analyze Correlation Patterns: A Design Study from the Manufacturing of Electrical Engines':
    'IRVINE: A Design Study on Analyzing Correlation Patterns of Electrical Engines',
}

def replace_title(TITLES, DIC):
    for i,n in enumerate(TITLES):
        if n in DIC.keys():
            TITLES[i] = DIC[n]
    return TITLES

title_str = replace_title(title_str, title_replace_dict)

if len(title_str) == 170:
	print('title_str has 170 elements. everything correct')
else:
	print('something is wrong. the length of title_str is not 170')

df = pd.DataFrame(title_str, columns=['title'])

df.to_csv(TITLES_2021, index=False)
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys
import time

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_1 = sys.argv[3]
TITEL_QUERY_EMPTY_DOI_QUERY_404_1 = sys.argv[4]
TITLE_QUERY_404_1 = sys.argv[5]

def read_txt(INPUT):
	"""read txt files and return a list
	"""
	raw = open(INPUT, "r")
	reader = csv.reader(raw)
	allRows = [row for row in reader]
	data = [i[0] for i in allRows]
	return data

def get_dicts(VISPUBDATA_PLUS):
	# get year_dict and title_dict
	vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
	dois = vispd_plus.loc[:, "DOI"].tolist()
	titles = vispd_plus.loc[:, "Title"].tolist()
	years = vispd_plus.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return [doi_year_dict, doi_title_dict]

# def get_s():
# 	# set retry if status codes in [ 500, 502, 503, 504, 429]
# 	# als return headers
# 	s = requests.Session()
# 	retries = Retry(total=5,
# 		backoff_factor=0.1,
# 		status_forcelist=[ 500, 502, 503, 504, 429],
# 	)
# 	s.mount('http://', HTTPAdapter(max_retries=retries))
# 	headers = {
# 	"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
# 	'Accept': 'application/json',
# 	}
# 	return s, headers

def get_title_query_response(doi):
	title_original = doi_title_dict[doi]
	title = re.sub(r'\:|\?|\&|\,', '', title_original)
	response = requests.get(
		'https://api.openalex.org/works?filter=title.search:' + title)
	return response

def check_results_count(response):
	j = response.json()
	count = j['meta']['count']
	return j, count 

def get_doi_query_response(doi):
	response = requests.get("https://api.openalex.org/works/doi:" + doi)
	return response

def get_paper_dict_from_json_result(j, doi):
	openalex_id = j['id']
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	openalex_doi = j['doi']
	venue = j['host_venue']
	openalex_venue = venue['id']
	openalex_url = venue['url']
	openalex_journal = venue['display_name']
	openalex_publisher = venue['publisher']
	openalex_first_page = j['biblio']['first_page']
	openalex_last_page = j['biblio']['last_page']
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
		'OpenAlex Year': openalex_year,
		'OpenAlex ID': openalex_id,
		'OpenAlex Title': openalex_title,
		'OpenAlex DOI': openalex_doi,
		'OpenAlex URL': openalex_url,
		'OpenAlex Venue': openalex_venue,
		'OpenAlex Journal': openalex_journal,
		'OpenAlex Publisher': openalex_publisher,
		'OpenAlex First Page': openalex_first_page,
		'OpenAlex Last Page': openalex_last_page,
	}
	return paper_dict

def get_empty_paper_dict(doi):
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
	}
	return paper_dict

def get_paper_dict_list(doi, doi_index):
	# query title first:
	response = get_title_query_response(doi)
	while response.status_code in retry_code:
		print(f'title query for {doi_index} : {doi} has error. Error status code is {response.status_code}. Retrying...')
		time.sleep(1)
		response = get_title_query_response(doi)
	# if title query succeeds:
	if response.status_code == 200:
		# get json and check results count:
		j = check_results_count(response)[0]
		count = check_results_count(response)[1]
		# if count is non-zero:
		if count > 0:
			first_result = j['results'][0]
			paper_dict = get_paper_dict_from_json_result(first_result, doi)
		# if count is zero, use doi query instead
		else:
			# get doi query response:
			response2 = get_doi_query_response(doi)
			while response2.status_code in retry_code:
				print(f'doi query for {doi_index} : {doi} has error. Error status code is {response2.status_code}. Retrying...')
				time.sleep(1)
				response2 = get_doi_query_response(doi)
			# if doi query succeeds:
			if response2.status_code == 200:
				j2 = response2.json()
				paper_dict = get_paper_dict_from_json_result(j2, doi)
			# empty title query, and 404 for doi query:
			else:
				error_status_code.append(response2.status_code)
				title_query_empty_doi_query_404_list.append(doi)
				paper_dict = get_empty_paper_dict(doi)
				print(f'doi query is not successful for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')

	# if title query fails:	
	else:
		title_query_404_list.append(doi)
		error_status_code.append(response.status_code)
		# error_status_code.append([doi, response.status_code])
		paper_dict = get_empty_paper_dict(doi)
		print(f'title query is not successful for {doi_index} : {doi_title_dict[doi]}')
	paper_dict_list.append(paper_dict)

def main(DOIS):
	for doi in DOIS:
		doi_index = DOIS.index(doi) + 1
		get_paper_dict_list(doi, doi_index)
		print(f'{doi_index} is done')
		time.sleep(0.5)
	print(list(set(error_status_code)))

if __name__ == '__main__':
	# note on 2022-01-21: it's not a bug here, but it might be error-prone:
	# I defined globals here and then used them directly in the `main` function
	# without passing them in as parameters, like `main(vispd_plus_good_papers, s)`.
	# It works, but as I said, it might be error-prone.
	vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
	doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
	doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
	retry_code = [ 500, 502, 503, 504, 429]
	paper_dict_list = []
	title_query_empty_doi_query_404_list = []
	title_query_404_list = []
	error_status_code = []
	# s = get_s()[0]
	# headers = get_s()[1]
	main(vispd_plus_good_papers)

paper_df = pd.DataFrame(paper_dict_list)

paper_df.to_csv(VISPD_OPENALEX_MATCH_1, index=False)

with open(TITEL_QUERY_EMPTY_DOI_QUERY_404_1, 'w') as f:
	for doi in title_query_empty_doi_query_404_list:
		f.write("%s\n" % doi)

with open(TITLE_QUERY_404_1, 'w') as f:
	for doi in title_query_404_list:
		f.write("%s\n" % doi)
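
Both query forms used above (title search first, DOI lookup as a fallback) can be tried against the OpenAlex API directly. A minimal sketch, using one DOI from the dataset and a placeholder search string (assuming the API is reachable):

import requests

# DOI lookup, as in get_doi_query_response
doi = '10.1109/TVCG.2006.168'
doi_response = requests.get('https://api.openalex.org/works/doi:' + doi)
if doi_response.status_code == 200:
	j = doi_response.json()
	print(j['id'], j['display_name'])  # OpenAlex ID and title of the matched work

# title search, as in get_title_query_response ('treemap visualization' is a placeholder)
title_response = requests.get(
	'https://api.openalex.org/works?filter=title.search:' + 'treemap visualization')
if title_response.status_code == 200:
	print(title_response.json()['meta']['count'])  # number of works matching the search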
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys
import time

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_2 = sys.argv[3]
TITEL_QUERY_EMPTY_DOI_QUERY_404_2 = sys.argv[4]
TITLE_QUERY_404_2 = sys.argv[5]
DOI_QUERY_404_2 = sys.argv[6]

def read_txt(INPUT):
	"""read txt files and return a list
	"""
	raw = open(INPUT, "r")
	reader = csv.reader(raw)
	allRows = [row for row in reader]
	data = [i[0] for i in allRows]
	return data

def get_dicts(VISPUBDATA_PLUS):
	# get year_dict and title_dict
	vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
	dois = vispd_plus.loc[:, "DOI"].tolist()
	titles = vispd_plus.loc[:, "Title"].tolist()
	years = vispd_plus.loc[:, "Year"].tolist()
	doi_year_dict = dict(zip(dois, years))
	doi_title_dict = dict(zip(dois, titles))
	return [doi_year_dict, doi_title_dict]

def get_title_query_response(doi):
	title_original = doi_title_dict[doi]
	title = re.sub(r'\:|\?|\&|\,', '', title_original)
	response = requests.get(
		'https://api.openalex.org/works?filter=title.search:' + title)
	return response

def check_results_count(response):
	j = response.json()
	count = j['meta']['count']
	return j, count 

def get_doi_query_response(doi):
	response = requests.get("https://api.openalex.org/works/doi:" + doi)
	return response

def get_paper_dict_from_json_result(j, doi):
	openalex_id = j['id']
	openalex_title = j['display_name']
	openalex_year = j['publication_year']
	openalex_doi = j['doi']
	venue = j['host_venue']
	openalex_venue = venue['id']
	openalex_url = venue['url']
	openalex_journal = venue['display_name']
	openalex_publisher = venue['publisher']
	openalex_first_page = j['biblio']['first_page']
	openalex_last_page = j['biblio']['last_page']
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
		'OpenAlex Year': openalex_year,
		'OpenAlex ID': openalex_id,
		'OpenAlex Title': openalex_title,
		'OpenAlex DOI': openalex_doi,
		'OpenAlex URL': openalex_url,
		'OpenAlex Venue': openalex_venue,
		'OpenAlex Journal': openalex_journal,
		'OpenAlex Publisher': openalex_publisher,
		'OpenAlex First Page': openalex_first_page,
		'OpenAlex Last Page': openalex_last_page,
	}
	return paper_dict

def get_empty_paper_dict(doi):
	paper_dict = {
		'Year': doi_year_dict[doi],
		'DOI': doi,
		'Title': doi_title_dict[doi],
	}
	return paper_dict

def update_paper_dict_list(doi, doi_index):
	if doi not in to_query_by_doi:
		# query title first:
		response = get_title_query_response(doi)
		# if status code is in retry_code, retry:
		while response.status_code in retry_code:
			print(f'title query for {doi_index} : {doi} is having errors, error status code is {response.status_code}, retrying...')
			time.sleep(1)
			response = get_title_query_response(doi)
		# if title query succeeds:
		if response.status_code == 200:
			# get json and check results count:
			j = check_results_count(response)[0]
			count = check_results_count(response)[1]
			# if count is non-zero:
			if count > 0:
				# if doi not in special_result_index_dict, use index of 0
				if doi not in list(special_result_index_dict.keys()):
					first_result = j['results'][0]
					paper_dict = get_paper_dict_from_json_result(first_result, doi)
				else:
					correct_index = special_result_index_dict[doi]
					correct_result = j['results'][correct_index]
					paper_dict = get_paper_dict_from_json_result(correct_result, doi)
			# if count is zero, use doi query instead
			else:
				# get doi query response:
				response2 = get_doi_query_response(doi)
				# if status code is in retry_code, retry:
				while response2.status_code in retry_code:
					print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response2.status_code}, retrying...')
					time.sleep(1)
					response2 = get_doi_query_response(doi)
				# if doi query succeeds:
				if response2.status_code == 200:
					j2 = response2.json()
					paper_dict = get_paper_dict_from_json_result(j2, doi)
				# if doi query fails, add the list to no_result list
				else:
					# empty title query results and bad doi query
					error_status_code.append(response2.status_code)
					title_query_empty_doi_query_404_list.append(doi)
					paper_dict = get_empty_paper_dict(doi)
					print(f'doi query fails for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')

		# if title query fails:	
		else:
			title_query_404_list.append(doi)
			error_status_code.append(response.status_code)
			paper_dict = get_empty_paper_dict(doi)
			print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
	else:
		response0 = get_doi_query_response(doi)
		# if status code is in retry_code, retry
		while response0.status_code in retry_code:
			print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response0.status_code}, retrying...')
			time.sleep(3)
			response0 = get_doi_query_response(doi)
		# if doi query succeeds:
		if response0.status_code == 200:
			j0 = response0.json()
			paper_dict = get_paper_dict_from_json_result(j0, doi)
		# if doi query fails:
		else:
			error_status_code.append(response0.status_code)
			doi_query_404_list.append(doi)
			paper_dict = get_empty_paper_dict(doi)
			print(f'doi query fails for {doi_index} : {doi}')
	paper_dict_list.append(paper_dict)

def main(DOIS):
	for doi in DOIS:
		doi_index = DOIS.index(doi) + 1
		update_paper_dict_list(doi, doi_index)
		print(f'{doi_index} is done')
		time.sleep(0.5)
	print(list(set(error_status_code)))

if __name__ == '__main__':
	# note on 2022-01-21: it's not a bug here, but it might be error-prone:
	# I defined globals here and then used them directly in the `main` function
	# without passing them in as parameters, like `main(vispd_plus_good_papers, s)`.
	# It works, but as I said, it might be error-prone.
	vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
	doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
	doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
	retry_code = [ 500, 502, 503, 504, 429]
	paper_dict_list = []
	title_query_empty_doi_query_404_list = []
	title_query_404_list = []
	doi_query_404_list = []
	error_status_code = []
	to_query_by_doi = [
		'10.1109/VISUAL.2001.964489',
		'10.1109/VISUAL.1996.568113',
		'10.1109/VISUAL.1999.809896',
		'10.1109/VISUAL.1991.175771',
		'10.1109/VISUAL.1998.745302',
		'10.1109/VISUAL.1993.398868',
		'10.1109/INFVIS.2005.1532128',
		'10.1109/VISUAL.1993.398859',
		'10.1109/VISUAL.1991.175795',
		'10.1109/VISUAL.2003.1250401',
		'10.1109/VISUAL.1991.175789',
		'10.1109/VISUAL.2000.885739',
		'10.1109/TVCG.2014.2346922',
		'10.1109/VISUAL.1999.809871',
		'10.1109/VISUAL.1996.567807',
		'10.1109/VISUAL.2000.885692',
		'10.1109/VISUAL.1991.175777',
		'10.1109/VISUAL.1998.745315',
		'10.1109/VISUAL.1997.663909',
		'10.1109/VISUAL.2000.885697',
		'10.1109/VISUAL.2001.964504',
		'10.1109/TVCG.2006.168',
		'10.1109/TVCG.2007.70617',
		'10.1109/VISUAL.1997.663910',
		'10.1109/VISUAL.1997.663931',
		'10.1109/VISUAL.2002.1183792',
		'10.1109/VISUAL.1992.235201',
		'10.1109/VISUAL.1996.568128',
		'10.1109/VISUAL.1997.663923',
		'10.1109/VAST.2011.6102441',
		'10.1109/VISUAL.2000.885732',
		'10.1109/VISUAL.2001.964522',
		'10.1109/VISUAL.2005.1532812',
		'10.1109/VISUAL.1998.745350',
		'10.1109/INFVIS.2001.963282',
		'10.1109/VISUAL.1995.480804',
		'10.1109/VISUAL.2005.1532847',
		'10.1109/INFVIS.1996.559229',
		'10.1109/VISUAL.2000.885738',
		'10.1109/VISUAL.1991.175800',
		'10.1109/VISUAL.1993.398865',
		'10.1109/VISUAL.1993.398866',
		'10.1109/VISUAL.1998.745348',
		'10.1109/VISUAL.1993.398867',
		'10.1109/VISUAL.1997.663925',
		'10.1109/VISUAL.1993.398900',
		'10.1109/VISUAL.1992.235181',
		'10.1109/VISUAL.1992.235195',
		'10.1109/VISUAL.2000.885719',
		'10.1109/VISUAL.1991.175816',
		'10.1109/VISUAL.1990.146414',
		'10.1109/VISUAL.1993.398861',
		'10.1109/VISUAL.1993.398872',
		'10.1109/VISUAL.1994.346292',
		'10.1109/VISUAL.1994.346295',
		'10.1109/VISUAL.1994.346297',
		'10.1109/VISUAL.1994.346301',
		'10.1109/VISUAL.1999.809913',
		'10.1109/VISUAL.2001.964546',
		'10.1109/VISUAL.2003.1250404',
		'10.1109/TVCG.2014.2346442',
		'10.1109/TVCG.2020.3028948',
		'10.1109/TVCG.2020.3030363',
		'10.1109/TVCG.2020.3030364',
		'10.1109/tvcg.2021.3114784',
		'10.1109/tvcg.2021.3114780',
		'10.1109/tvcg.2021.3114782',
		'10.1109/tvcg.2021.3114783',
		'10.1109/tvcg.2021.3114836',
		'10.1109/TVCG.2021.3064037',
		'10.1109/TVCG.2021.3114849',
		'10.1109/TVCG.2021.3114842',
		'10.1109/TVCG.2021.3114766',
		'10.1109/TVCG.2021.3114777'
	]
	special_result_index_dict = {
		'10.1109/VISUAL.1992.235194': 4,
	}
	main(vispd_plus_good_papers)

paper_df = pd.DataFrame(paper_dict_list)

paper_df.to_csv(VISPD_OPENALEX_MATCH_2, index=False)

with open(TITEL_QUERY_EMPTY_DOI_QUERY_404_2, 'w') as f:
	for doi in title_query_empty_doi_query_404_list:
		f.write("%s\n" % doi)

with open(TITLE_QUERY_404_2, 'w') as f:
	for doi in title_query_404_list:
		f.write("%s\n" % doi)

with open(DOI_QUERY_404_2, 'w') as f:
	for doi in doi_query_404_list:
		f.write("%s\n" % doi)
import pandas as pd
import sys

VISPUBDATA_PLUS = sys.argv[1]
VISPD_PLUS_GOOD_PAPERS = sys.argv[2]

def get_vispd_plus_good_papers(INPUT):
    """get the list of good dois
    """
    vispd_plus = pd.read_csv(INPUT)
    jc = ['J', 'C']
    good_papers = vispd_plus[
        (vispd_plus.PaperType.isin(jc)) | (vispd_plus.Year > 2020)
        ]
    dois = good_papers.loc[:, "DOI"].tolist()
    # remove the invalid DOI
    dois.remove('10.0000/00000001')
    return dois

vispd_plus_good_papers = get_vispd_plus_good_papers(VISPUBDATA_PLUS)

with open(VISPD_PLUS_GOOD_PAPERS, 'w') as f:
    for doi in vispd_plus_good_papers:
        f.write("%s\n" % doi)
import sys
import pandas as pd

DOIS_2021 = sys.argv[1]
VISPUBDATA = sys.argv[2]
VISPUBDATA_PLUS = sys.argv[3]

if __name__ == '__main__':
	dois_2021_df = pd.read_csv(DOIS_2021)
	vispd = pd.read_csv(VISPUBDATA)
	vispd_plus = vispd.append(dois_2021_df, ignore_index=True)
	vispd_plus.to_csv(VISPUBDATA_PLUS, index=False)
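
One caveat about the snippet above: DataFrame.append was deprecated and later removed in pandas 2.0, so on a recent pandas the same step can be written with pd.concat (a sketch, assuming the same two frames read in above):

# equivalent of vispd.append(dois_2021_df, ignore_index=True) on newer pandas,
# where DataFrame.append is no longer available
vispd_plus = pd.concat([vispd, dois_2021_df], ignore_index=True)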
import pandas as pd
import urllib
import requests
from bs4 import BeautifulSoup
import re
import csv
import random
import numpy as np
import time
import sys

INPUT = sys.argv[1]
OUT_FNAME = sys.argv[2]

def get_wos_id_from_doi(doi):
	url = f'http://ws.isiknowledge.com/cps/openurl/service?url_ver=Z39.88-2004&rft_id=info:doi/{doi}'
	headers = {
		"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
	}
	response = requests.get(url=url, headers=headers)
	wos_url = response.history[-1].url
	wos_id_list = re.findall(r'(?<=2FWOS%3A)(.*)(?=%3F)', wos_url)
	if wos_id_list:
		wos_id = wos_id_list[0]
	else:
		wos_id = np.NaN
	doi_wos_dict = {
		'DOI': doi,
		'WOS ID': wos_id
	}
	doi_wos_dict_list.append(doi_wos_dict)

def get_dois(INPUT):
	good_dois = open(INPUT, 'r')
	reader = csv.reader(good_dois)
	allRows = [row for row in reader]
	dois = [i[0] for i in allRows]
	return dois 

def build_df_from_dict_list(df, dict_list):
	"""build df from a list of dictionaries

	Arguments:
	   df: an empty df you just initiated

	   dict_list: a list of dictionaries containing data you want to form a df

	Returns:
	  The updated df
	"""
	for i in dict_list:
		df_1 = pd.DataFrame([i])
		df = df.append(df_1, ignore_index=True)
	return df

def main():
	for doi in dois:
		get_wos_id_from_doi(doi)
		time.sleep(2+random.uniform(0, 2))
		print(f'{dois.index(doi) + 1} is done')

if __name__ == '__main__':
	# initiate a list of dicts
	doi_wos_dict_list = []
	dois = get_dois(INPUT)
	main()
	# initiate a dataframe 
	doi_wos_df_initiate = pd.DataFrame(columns=['DOI', 'WOS ID'])
	doi_wos_df = build_df_from_dict_list(
		doi_wos_df_initiate, doi_wos_dict_list)
	doi_wos_df.to_csv(OUT_FNAME, index=False)
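
To illustrate what get_wos_id_from_doi extracts from the redirect URL, here is a small sketch of the same regular expression applied to a made-up URL with a fabricated WOS ID (only the '%2FWOS%3A<id>%3F' fragment matters; real resolver URLs may differ):

import re
import numpy as np

# made-up redirect URL, for illustration only
wos_url = 'http://example.com/redirect%2FWOS%3A000123456700001%3Fmode=FullRecord'
wos_id_list = re.findall(r'(?<=2FWOS%3A)(.*)(?=%3F)', wos_url)
wos_id = wos_id_list[0] if wos_id_list else np.NaN
print(wos_id)  # 000123456700001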
import sys
import pandas as pd
import itertools
from collections import Counter

HT_CLEANED_AUTHOR_DF = sys.argv[1]
AUTHOR_CHORD_DF = sys.argv[2]
TS_AUTHOR_CHORD_DF = sys.argv[3]

def get_dic(DF): # DF here is HT_CLEANED_AUTHOR_DF
	"""get the dictionary of bicode counts"""
	tuple_list = []
	for group in DF.groupby('DOI'):
		country_codes = list(set(group[1]['Affiliation Country Code']))
		if len(country_codes) > 1:
			tuples = [x for x in itertools.combinations(country_codes, 2)]
			tuple_list.append(tuples)
	bicode = list(itertools.chain(*tuple_list))
	bicode_counts = Counter(bicode)
	bicode_counts_dic = dict(bicode_counts)
	return bicode_counts_dic 

def get_chord_df(DIC): # DIC here is bicode_counts_dic
	"""
	Return:
		A dataframe containing three columns: source, target, value.
		Even though I am using `source` and `target`, this is an undirected network.
	"""
	chord_df = pd.DataFrame(DIC.items(), columns=['pairs','value'])
	chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
	chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
	chord_df_sorted = chord_df[
		['source', 'target', 'value']].sort_values(
		by='value', ascending=False).reset_index(drop=True)
	return chord_df_sorted

def get_ts_chord_df(DF, ts_chord_data): # DF here is HT_CLEANED_AUTHOR_DF
	"""
	get time-series data: group by year first, get each year's data, then concatenate
	"""
	for year_group in DF.groupby("Year"):
		bicode_counts_dic = get_dic(year_group[1])
		chord_df = pd.DataFrame(
			bicode_counts_dic.items(), columns=['pairs','value'])
		chord_df['year'] = year_group[0]
		chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
		chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
		chord_df_sorted = chord_df[
			['source', 'target', 'value', 'year']].sort_values(
			by='value', ascending=False).reset_index(drop=True)
		ts_chord_data.append(chord_df_sorted)
	ts_chord_df = pd.concat(
		ts_chord_data, ignore_index=True)
	return ts_chord_df 

def rename_countries(DF):
	"""to convert country codes to name"""
	DF.replace({
		'CH': 'Switzerland',
		'CN': 'China',
		'DE': 'Germany',
		'CA': 'Canada',
		'FR': 'France',
		'NL': 'Netherlands',
		'AT': 'Austria',
		'AU': 'Australia',
	},
		inplace=True
	)
	return DF 

if __name__ == '__main__':
	HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
	ts_chord_data = []
	bicode_counts_dic = get_dic(HT_CLEANED_AUTHOR_DF)
	chord_df = get_chord_df(bicode_counts_dic)
	chord_df.to_csv(AUTHOR_CHORD_DF, index=False)
	ts_chord_df = get_ts_chord_df(HT_CLEANED_AUTHOR_DF, ts_chord_data)
	ts_chord_df.to_csv(TS_AUTHOR_CHORD_DF, index=False)
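
As a small worked example of the pair counting inside get_dic: a paper whose authors come from three distinct countries contributes each unordered country pair exactly once.

import itertools
from collections import Counter

# toy example with fabricated country codes: one paper, authors from three countries
country_codes = ['US', 'DE', 'CN']
pairs = list(itertools.combinations(country_codes, 2))
print(pairs)                 # [('US', 'DE'), ('US', 'CN'), ('DE', 'CN')]
print(dict(Counter(pairs)))  # each pair counted once for this paper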
import pandas as pd 
import sys
import numpy as np
import itertools
from collections import Counter

VISPUBDATA_PLUS = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
REFERENCE_CONCEPT_DF = sys.argv[3]
CITATION_CONCEPT_DF = sys.argv[4]
SANKEY_AGGREGATED_DF = sys.argv[5]
SANKEY_TS_DF = sys.argv[6]

def get_vis_doi_concept_dic(DF, LEVEL): # DF here is OPENALEX_CONCEPT_DF
	vis_levelns_df = DF[DF.Level == LEVEL].reset_index(drop=True)
	max_score_leveln = []
	for group in vis_levelns_df.groupby('DOI'):
		max_score = max(group[1]['Score'])
		df_to_append = group[1][group[1]['Score'] == max_score]
		max_score_leveln.append(df_to_append)
	vis_leveln_df = pd.concat(max_score_leveln, ignore_index=True)
	vis_leveln_doi_concept_dic = dict(
		zip(vis_leveln_df.DOI, vis_leveln_df.Concept))
	return vis_leveln_doi_concept_dic

def get_leveln_df(DF, LEVEL, ID_NAME): 
	"""
	inputs:
		DF is either REF_DF or CIT_DF
		ID_NAME is either REF_ID_NAME or CIT_ID_NAME
	Returns:
		a dataframe of three columns: 
			1. IEEE VIS papers' DOI
			2. REF/CIT papers' concept
			3. the REF/CIT papers' OpenAlex ID (ID_NAME)
	"""
	dfs = []
	levelns_df = DF[DF.Level == LEVEL]
	# keep only the highest score concept
	for group in levelns_df.groupby(ID_NAME):
		dff = group[1].sort_values(by='Score', ascending=False)
		max_score = max(dff['Score'])
		dff_to_append = dff[dff['Score'] == max_score]
		dfs.append(dff_to_append)

	leveln_df = pd.concat(dfs, ignore_index=True)[['DOI', 'Concept', ID_NAME]]
	return leveln_df

def get_leveln_output_df(DF, VIS_DOI_CONCEPT_DIC, YEAR_DICT, YEAR_KEY, SUFFIX):
	"""
	inputs:
		DF is either REF_LEVELN_DF or CIT_LEVELN_DF
		YEAR_DICT is either DOI_YEAR_DICT or CIT_ID_YEAR_DICT
		YEAR_KEY is either REF_YEAR_KEY, or CIT_YEAR_KEY
		SUFFIX is either REF_SUFFIX or CIT_SUFFIX

	The purpose of this step:
		1. map DOI to IEEE VIS concept
		2. get the year when this citation happens
	"""

	DF['IEEE VIS Concept'] = DF.DOI.apply(
		lambda x: VIS_DOI_CONCEPT_DIC[
			x] if x in VIS_DOI_CONCEPT_DIC.keys() else np.NaN
	)
	DF['Year'] = DF[YEAR_KEY].apply(lambda x: YEAR_DICT[x])
	leveln_df_nonan = DF[DF['IEEE VIS Concept'].notnull()]
	leveln_df_output = leveln_df_nonan.drop(
		columns=['DOI']).reset_index(drop=True)
	if SUFFIX == REF_SUFFIX:
		leveln_df_output['Concept'] = leveln_df_output[
			'Concept'].apply(lambda s: s + REF_SUFFIX)
	else:
		leveln_df_output['Concept'] = leveln_df_output[
			'Concept'].apply(lambda s: s + CIT_SUFFIX)
	leveln_df_output['IEEE VIS Concept'] = leveln_df_output[
		'IEEE VIS Concept'].apply(lambda s: s + "(v)")
	return leveln_df_output

def get_leveln_aggregated(SOURCE, DF, LEVEL): 
	"""
	inputs:
		SOURCE is either 'REF' or 'CIT'
		DF is either REF_LEVELN_OUTPUT or CIT_LEVELN_OUTPUT
	"""
	if SOURCE == 'REF':
		tuples = list(zip(
			DF['Concept'], 
			DF['IEEE VIS Concept'],
		))
	else:
		tuples = list(zip(
			DF['IEEE VIS Concept'],
			DF['Concept'], 
		))
	biconcept_counts = Counter(tuples)
	dic = dict(biconcept_counts)
	sankey_df = pd.DataFrame(dic.items(), columns=['pairs','value'])
	sankey_df['level'] = LEVEL
	sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
	sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
	sankey_df_sorted = sankey_df[
		['source', 'target', 'value', 'level']].sort_values(
		by='value', ascending=False).reset_index(drop=True)
	sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
	return sankey_df_sorted

def get_ts_year_group_data(SOURCE, DF, LEVEL):
	"""
	inputs:
		SOURCE is either 'REF' or 'CIT'
		DF is year_group

	This is much the same as the get_leveln_aggregated() function
	"""
	if SOURCE == 'REF':
		tuples = list(zip(
			DF[1]['Concept'], 
			DF[1]['IEEE VIS Concept'],
		))
	else:
		tuples = list(zip(
			DF[1]['IEEE VIS Concept'],
			DF[1]['Concept'], 
		))
	biconcept_counts = Counter(tuples)
	dic = dict(biconcept_counts)
	sankey_df = pd.DataFrame(dic.items(), columns=['pairs','value'])
	sankey_df['level'] = LEVEL
	sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
	sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
	sankey_df_sorted = sankey_df[
		['source', 'target', 'value', 'level']].sort_values(
		by='value', ascending=False).reset_index(drop=True)
	sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
	sankey_df_sorted['year'] = DF[0]
	return sankey_df_sorted

if __name__ == '__main__':
	VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
	OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
	REF_DF = pd.read_csv(REFERENCE_CONCEPT_DF)
	CIT_DF = pd.read_csv(CITATION_CONCEPT_DF)

	REF_ID_NAME = 'Reference OpenAlex ID'
	CIT_ID_NAME = 'Citation Paper OpenAlex ID'

	REF_DF = REF_DF[REF_DF[REF_ID_NAME].notnull()]
	CIT_DF = CIT_DF[CIT_DF[CIT_ID_NAME].notnull()]
	CIT_DF.rename(columns = {'Cited Paper DOI': 'DOI'}, inplace=True)

	DOI_YEAR_DICT = dict(zip(
		VISPUBDATA_PLUS.DOI, VISPUBDATA_PLUS.Year
	))

	CIT_ID_YEAR_DICT = dict(zip(
		CIT_DF[CIT_ID_NAME], CIT_DF['Citation Paper Year']
	))

	REF_YEAR_KEY = 'DOI'
	CIT_YEAR_KEY = CIT_ID_NAME

	# Set parameters
	START_LEVEL = 0
	END_LEVEL = 3
	CUTOFF = 500
	REF_SUFFIX = '(r)'
	CIT_SUFFIX = '(c)'

	# initiate dfs
	REF_LEVELN_AGGREGATED_DFS = []
	CIT_LEVELN_AGGREGATED_DFS = []
	REF_LEVELN_TS_DFS = []
	CIT_LEVELN_TS_DFS = []

	for LEVEL in range(START_LEVEL, END_LEVEL + 1):
		VIS_DOI_CONCEPT_DIC = get_vis_doi_concept_dic(
			OPENALEX_CONCEPT_DF,
			LEVEL
		)

		# REFERENCE -> VIS
		REF_LEVELN_DF = get_leveln_df(
			REF_DF, 
			LEVEL, 
			REF_ID_NAME,
		)
		REF_LEVELN_OUTPUT = get_leveln_output_df(
			REF_LEVELN_DF, 
			VIS_DOI_CONCEPT_DIC, 
			DOI_YEAR_DICT, 
			REF_YEAR_KEY, 
			REF_SUFFIX,
		)
		REF_LEVELN_AGGREGATED = get_leveln_aggregated(
			'REF',
			REF_LEVELN_OUTPUT, 
			LEVEL,
		)

		REF_LEVELN_AGGREGATED_DFS.append(REF_LEVELN_AGGREGATED)

		# TIMESERIES:
		REF_LEVELN_YEAR_GROUP_DFS = []
		for year_group in REF_LEVELN_OUTPUT.groupby('Year'):
			year_group_data = get_ts_year_group_data(
				'REF',
				year_group,
				LEVEL
				)
			REF_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
		REF_LEVELN_TS_DF = pd.concat(
			REF_LEVELN_YEAR_GROUP_DFS,
			ignore_index = True,
		)
		REF_LEVELN_TS_DFS.append(REF_LEVELN_TS_DF)

		# VIS -> CITATION
		CIT_LEVELN_DF = get_leveln_df(
			CIT_DF, 
			LEVEL, 
			CIT_ID_NAME,
		)
		CIT_LEVELN_OUTPUT = get_leveln_output_df(
			CIT_LEVELN_DF, 
			VIS_DOI_CONCEPT_DIC, 
			CIT_ID_YEAR_DICT, 
			CIT_YEAR_KEY, 
			CIT_SUFFIX,
		)
		CIT_LEVELN_AGGREGATED = get_leveln_aggregated(
			'CIT',
			CIT_LEVELN_OUTPUT, 
			LEVEL,
		)

		CIT_LEVELN_AGGREGATED_DFS.append(CIT_LEVELN_AGGREGATED)

		# TIMESERIES:
		CIT_LEVELN_YEAR_GROUP_DFS = []
		for year_group in CIT_LEVELN_OUTPUT.groupby('Year'):
			year_group_data = get_ts_year_group_data(
				'CIT',
				year_group,
				LEVEL,
			)
			CIT_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
		CIT_LEVELN_TS_DF = pd.concat(
			CIT_LEVELN_YEAR_GROUP_DFS,
			ignore_index = True,
		)
		CIT_LEVELN_TS_DFS.append(CIT_LEVELN_TS_DF)

		print(f'level {LEVEL} is done')

	# GET AGGREGATED_DF
	ref_aggregated = pd.concat(
		REF_LEVELN_AGGREGATED_DFS,
		ignore_index = True,
	)
	ref_aggregated['source name'] = 'REF'
	cit_aggregated = pd.concat(
		CIT_LEVELN_AGGREGATED_DFS,
		ignore_index = True,
	)
	cit_aggregated['source name'] = 'VIS'

	aggregated_df = pd.concat(
		[ref_aggregated, cit_aggregated],
		ignore_index = True,
	)

	# GET TS_DF
	ref_timeseries = pd.concat(
		REF_LEVELN_TS_DFS,
		ignore_index = True,
	)
	ref_timeseries['source name'] = 'REF'
	cit_timeseries = pd.concat(
		CIT_LEVELN_TS_DFS,
		ignore_index = True,
	)
	cit_timeseries['source name'] = 'VIS'

	ts_df = pd.concat(
		[ref_timeseries, cit_timeseries],
		ignore_index = True,
	)

	# Write to file
	aggregated_df.to_csv(SANKEY_AGGREGATED_DF, index=False)
	ts_df.to_csv(SANKEY_TS_DF, index=False)

	print('sankey data has been saved!')
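
The same pairs-to-dataframe idiom appears in both get_leveln_aggregated and get_ts_year_group_data above; a small worked sketch with fabricated concept pairs:

import pandas as pd
from collections import Counter

# fabricated (source, target) concept pairs, in the REF -> VIS direction
tuples = [('Computer science(r)', 'Visualization(v)'),
          ('Computer science(r)', 'Visualization(v)'),
          ('Mathematics(r)', 'Visualization(v)')]
dic = dict(Counter(tuples))
sankey_df = pd.DataFrame(dic.items(), columns=['pairs', 'value'])
sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
print(sankey_df[['source', 'target', 'value']].sort_values(by='value', ascending=False))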
import sys
import numpy as np
import pandas as pd
from collections import Counter

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
TOP_CONCEPTS_TRENDS_DF = sys.argv[3]

def get_year_count_dic(DF): # DF here is openalex_paper_df
	"""I want proportion. So I need first know total number of pubs each year"""
	year_count_df = DF.groupby(
		'Year').size().to_frame('count').reset_index()
	year_count_dic = dict(
		zip(year_count_df['Year'], year_count_df['count']))
	return year_count_dic 

def get_top_concepts_rank_and_total(DF, LEVEL, CUTOFF): # DF here is OPENALEX_CONCEPT_DF
	"""get the top concepts, its rank, and its historical total
	"""
	# filter by specific level
	lvl = DF[DF.Level == LEVEL]

	# get the total frequency of the concepts within that level
	lvl_df = lvl.groupby(['Concept', 'Concept ID']).size().to_frame(
		'frequency').reset_index().sort_values(
		by='frequency', ascending=False).head(CUTOFF)

	# get the rank of each of the top 10 concepts within that level
	# generate two dics: one for rank, and the other for total
	lvl_df['rank'] = range(1, CUTOFF+1)
	top_concepts = lvl_df['Concept']
	concept_rank_dic = dict(zip(lvl_df['Concept'], lvl_df['rank']))
	concept_historical_total_dic = dict(zip(lvl_df['Concept'], lvl_df['frequency']))
	return top_concepts, concept_rank_dic, concept_historical_total_dic


def get_ts_for_top(DF, TOP_CONCEPTS): # DF here is OPENALEX_CONCEPT_DF
	"""
	get timeseries data for top concepts

	Returns:
		a dataframe where in each row I have a concept, a year, and 
		the total frequency of that concept in that year

	"""

	top_concepts_ts_df = DF[DF.Concept.isin(TOP_CONCEPTS)].groupby(
		['Concept', 'Year']).size().to_frame(
		'Concept Yearly Frequency').reset_index()
	return top_concepts_ts_df


def update_dfs(
	DF, 
	i, 
	TOP_RANK_DIC, 
	TOP_TOTAL_DIC, 
	YEAR_COUNT_DIC,
	DFS
	): # DF here is TOP_CONCEPTS_TS_DF

	LEVEL = i
	dfss = []
	start = 1990 ; end = 2021
	year_idx = range(start, end+1)

	for group in DF.groupby('Concept'):
		"""Normalize each concept in each level by the same time range, i.e., 1990-2021"""
		year_frequency_dic = dict(
			zip(group[1]['Year'], group[1]['Concept Yearly Frequency']))
		concepts = [group[1].iloc[0, :].Concept] * len(year_idx)
		frequencies = [
			year_frequency_dic[
			x] if x in year_frequency_dic.keys() else 0 for x in year_idx]
		time_series_df = pd.DataFrame(
			list(zip(concepts, year_idx, frequencies)), 
			columns = [f'concept_{LEVEL}', f'year_{LEVEL}', f'yearly frequency_{LEVEL}'])
		time_series_df[f'rank_{LEVEL}'] = time_series_df[f'concept_{LEVEL}'].apply(
			lambda x: TOP_RANK_DIC[x])
		time_series_df[f'level_{LEVEL}'] = LEVEL
		time_series_df[f'concept historical total_{LEVEL}'] = time_series_df[
			f'concept_{LEVEL}'].apply(
			lambda x: TOP_TOTAL_DIC[x])
		time_series_df[f'yearly vis total_{LEVEL}'] = time_series_df[f'year_{LEVEL}'].apply(
			lambda x: YEAR_COUNT_DIC[x])
		time_series_df[f'proportion_{LEVEL}'] = time_series_df[
			f'yearly frequency_{LEVEL}'] / time_series_df[f'yearly vis total_{LEVEL}']
		# time_series_df is for each concept within each level
		# dfss is to contain all concepts data within a level
		dfss.append(time_series_df.reset_index(drop=True))
	level_df_to_append = pd.concat(dfss, ignore_index = True)
	level_df_to_append.sort_values(by=[f'rank_{LEVEL}', f'year_{LEVEL}'], inplace=True)
	DFS.append(level_df_to_append.reset_index(drop=True))


if __name__ == '__main__':
	# Set parameters
	START_LEVEL = 0
	END_LEVEL = 3
	# CUTOFF = 30
	CUTOFF = 10

	OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
	OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)

	YEAR_COUNT_DIC = get_year_count_dic(OPENALEX_PAPER_DF)

	DFS = []
	for i in range(START_LEVEL, END_LEVEL+1):
		TOP_CONCEPTS, TOP_RANK_DIC, TOP_TOTAL_DIC = get_top_concepts_rank_and_total(
			OPENALEX_CONCEPT_DF, 
			i, 
			CUTOFF
		)

		TOP_CONCEPTS_TS_DF = get_ts_for_top(
			OPENALEX_CONCEPT_DF, TOP_CONCEPTS
		)
		update_dfs(
		TOP_CONCEPTS_TS_DF, 
		i, 
		TOP_RANK_DIC, 
		TOP_TOTAL_DIC, 
		YEAR_COUNT_DIC,
		DFS
		)

	# concat, validate, and write to file

	dff = pd.concat(DFS, axis=1)

	print(dff['year_1'].tolist() == dff['year_2'].tolist())
	print(dff['year_1'].tolist() == dff['year_3'].tolist())
	print(dff['rank_1'].tolist() == dff['rank_3'].tolist())
	print(dff['rank_1'].tolist() == dff['rank_2'].tolist())

	dff.to_csv(TOP_CONCEPTS_TRENDS_DF, index = False)
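
A small worked example of the normalization above: a concept's yearly proportion is its yearly frequency divided by the total number of VIS papers published that year.

import pandas as pd

# toy example with fabricated numbers
paper_df = pd.DataFrame({'Year': [1990, 1990, 1991]})
year_count_dic = dict(paper_df.groupby('Year').size())
print(year_count_dic)  # {1990: 2, 1991: 1}

yearly_frequency = 1   # say a concept is tagged on one 1990 paper
print(yearly_frequency / year_count_dic[1990])  # proportion for 1990 -> 0.5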
import sys
import numpy as np
import pandas as pd
import itertools
from collections import Counter

OPENALEX_CONCEPT_DF = sys.argv[1]
AGGREGATED_COOCCURANCE_DF = sys.argv[2]
TS_AGGREGATED_COOCCURANCE_DF = sys.argv[3]

def get_level_df(DF, LEVEL):
	# subset by level
	level_df = DF[DF.Level == LEVEL].reset_index(drop=True)
	return level_df

def get_dic(LEVEL_DF): # LEVEL_DF here is a single-level subset of OPENALEX_CONCEPT_DF
	"""get the dictionary of biconcept counts"""

	# initiate a tuple list
	tuple_list = []

	# for each IEEE VIS paper, get pairwise combinations of this level's concepts
	# if at least two such concepts exist
	for group in LEVEL_DF.groupby('DOI'):
		concepts = list(set(group[1].Concept))
		if len(concepts) > 1:
			tuples = [x for x in itertools.combinations(concepts, 2)]
			tuple_list.append(tuples)

	# get biconcepts dictionary
	biconcepts = list(itertools.chain(*tuple_list))
	biconcept_counts_dic = dict(Counter(biconcepts))

	return biconcept_counts_dic

def update_data(DIC, LEVEL, CUTOFF, DATA): # DIC: biconcept_counts_dic
	# DATA: cooccurance_aggregated_data
	chord_df = pd.DataFrame(DIC.items(), columns=['pairs','value'])
	chord_df['level'] = LEVEL
	chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
	chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
	chord_df = chord_df[
		['source', 'target', 'value', 'level']].sort_values(
		by='value', ascending=False).reset_index(drop=True)
	chord_df = chord_df[chord_df['value'] >= CUTOFF]
	DATA.append(chord_df)

def update_ts_data(DIC, YEAR, LEVEL, CUTOFF, DATA):
	"""get timeseries chord dataframe"""
	chord_df = pd.DataFrame(DIC.items(), columns=['pairs','value'])
	chord_df['year'] = YEAR
	chord_df['level'] = LEVEL
	chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
	chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
	chord_df = chord_df[
		['source', 'target', 'value', 'year', 'level']].sort_values(
		by='value', ascending=False).reset_index(drop=True)
	chord_df = chord_df[chord_df['value'] >= CUTOFF]
	DATA.append(chord_df)


if __name__ == '__main__':
	OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)

	"""set parameters """ 

	CUTOFF = 1 # cutoff number for cooccurance
	START = 0 # top level
	END = 3 # lowest level 

	"""Get Aggregated data """

	# Aggregated data, involving data of all levels
	cooccurance_aggregated_data = []

	# iterate through all levels
	for LEVEL in range(START, END + 1):
		LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)
		biconcept_counts_dic = get_dic(LEVEL_DF)
		update_data(
			biconcept_counts_dic, LEVEL, CUTOFF, cooccurance_aggregated_data)

	# write to file
	aggregated_df = pd.concat(cooccurance_aggregated_data, ignore_index=True)
	aggregated_df.to_csv(AGGREGATED_COOCCURANCE_DF, index=False)


	"""Get Timeseries data """

	cooccurance_timeseries_aggregated_data = []

	for LEVEL in range(START, END + 1):

		# initiate time series data for each level
		# it will collect each year's data within the current LEVEL
		cooccurance_timeseries_data = []

		LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)

		for YEAR_GROUP in LEVEL_DF.groupby('Year'):
			biconcept_counts_dic = get_dic(YEAR_GROUP[1])
			update_ts_data(
				biconcept_counts_dic, 
				YEAR_GROUP[0], 
				LEVEL, 
				CUTOFF, 
				cooccurance_timeseries_data
			)

		# this is the final data for each level
		cooccurance_timeseries_df = pd.concat(
			cooccurance_timeseries_data, ignore_index=True)

		# append this level's data to aggregated data list
		cooccurance_timeseries_aggregated_data.append(cooccurance_timeseries_df)

	# write to file
	ts_aggregated_df = pd.concat(
		cooccurance_timeseries_aggregated_data, ignore_index=True)
	ts_aggregated_df.to_csv(TS_AGGREGATED_COOCCURANCE_DF, index=False)
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import sys

# input
IEEE_AUTHOR_DF = sys.argv[1]

# output
AWARD_PAPER_DF = sys.argv[2]


def get_paragraphs(url):
    r = requests.get(url)
    # fail loudly on non-200 responses
    r.raise_for_status()
    soup = bs(r.text, 'html.parser')
    article = soup.find('article')
    return list(article.stripped_strings)


def rename(x):
    if 'Honorable Mention Awards' in x:
        return 'HM'
    if 'Best Paper Award' in x:
        return 'BP'
    if 'Test of Time Award' in x:
        return 'TT'
    if 'Best Case Study Award' in x:
        return 'BCS'
    raise ValueError("Unknow award:", x)


# reorder a collected record [author, title, track, doi, award, year]
# into the output column order [Year, DOI, Award, Track, Title, Author]
rearranger = lambda x: [x[-1], x[-3], x[-2], x[-4], x[1], x[0]]


def get_parsed_results(years, years_idx, paragraphs):
    results = []
    intervals = zip(years_idx, years_idx[1:] + [len(paragraphs)])

    # every loop includes a year's awards
    for idx, (y1, y2) in enumerate(intervals):
        year = years[idx]
        paper_info = [] # initialize a list to store a paper info
        for i in range(y1+1, y2):
            p = paragraphs[i]

            if p.endswith(('Awards:', 'Award:')): 
                award = p.replace(':', '')
                award = rename(award)
                continue

            if p.endswith("\nDOI:"):
                p = p.replace(".\nDOI:", "").replace("Awarded at: ", '')

            if p == "DOI:":
                p = 'Vis'

            # every paper info has four lines: author, title, awarded at, DOI
            paper_info.append(p) 

            # DOI lines are the only ones containing '/', and each follows a line ending in "DOI:"
            if '/' in p and paragraphs[i-1].endswith("DOI:"):
                paper_info.extend([award, year]) # add award type and year
                results.append(paper_info)
                paper_info = []     

    return list(map(rearranger, results))


def doi_debug(results):
    df = pd.read_csv(IEEE_AUTHOR_DF)
    dois = df['DOI'].unique().tolist()
    dois_lower = [d.lower() for d in dois]

    for idx, res in enumerate(results):        
        if res[1] in dois:
            pass
        elif res[1].lower() in dois_lower:
            i = dois_lower.index(res[1].lower())
            print(res[1] + " has been unified as --> " + dois[i])
            results[idx][1] = dois[i]
        else:
            print(f"DOI: {res[1]} does not exist in {IEEE_AUTHOR_DF}!")

    return results



def get_2021_tt_papers():
    url = 'http://ieeevis.org/year/2021/info/awards/test-of-time-awards'
    paragraphs = get_paragraphs(url)
    tracks = ['VAST', 'InfoVis', 'SciVis']
    tracks_idx = [paragraphs.index(a) for a in tracks]

    years, years_idx = [], []
    for idx, p in enumerate(paragraphs):
        p = p.replace(":", "")
        if p.isdigit():
            years.append(int(p))
            years_idx.append(idx)

    def get_track(year_idx):
        for i in range(-1, -4, -1):
            if year_idx > tracks_idx[i]:
                return tracks[i]

    results = []
    award = 'TT'

    for idx, y_idx in enumerate(years_idx):
        year = years[idx]
        title = paragraphs[y_idx+1]
        author = paragraphs[y_idx+2]
        doi = paragraphs[y_idx+4]
        track = get_track(y_idx)
        results.append([year, doi, award, track, title, author])
    return doi_debug(results)


def main():
    url = 'http://ieeevis.org/year/2022/info/history/best-paper-award'
    paragraphs = get_paragraphs(url)
    years = [y for y in range(2021, 1989, -1)]
    years_idx = [paragraphs.index(str(y)) for y in years]
    assert len(years) == len(years_idx)
    results = get_parsed_results(years, years_idx, paragraphs)
    results = doi_debug(results)
    results.extend(get_2021_tt_papers())
    columns = ['Year', 'DOI', 'Award', 'Track', 'Title', 'Author']
    df = pd.DataFrame(results, columns=columns)
    df.to_csv(AWARD_PAPER_DF, index=False)


if __name__ == '__main__':
    main()
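
As a hypothetical usage sketch (the real input and output paths are supplied by the corresponding Snakemake rule; see the shell directives listed below), the script could be run and its output checked like this. File names here are placeholders.

# Hypothetical invocation (paths are placeholders):
#   python scripts/scrape_award_papers.py ieee_author_df.csv award_paper_df.csv
import pandas as pd

awards = pd.read_csv("award_paper_df.csv")  # placeholder for AWARD_PAPER_DF
# Columns written by the script: Year, DOI, Award, Track, Title, Author
print(awards.groupby("Award").size())  # counts of BP / HM / TT / BCS papers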
The remaining snippets shown on this page are the shell directives of the workflow's Snakemake rules, one per data-collection or plotting step:

shell: "python scripts/get_titles_2021.py {output}"
shell: "python scripts/get_vispd_plus.py {input} {output}"
shell: "python scripts/get_vispd_plus_good_papers.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_1.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_2.py {input} {output}"
shell: "python scripts/get_papers_to_study.py {input} {output}"
shell: "python scripts/get_openalex_dfs.py {input} {output}"
shell: "python scripts/get_openalex_citation_dfs.py {input} {output}"
shell: "python scripts/get_ieee_author_and_paper_title.py {input} {output}"
shell: "python scripts/get_merged_author_df.py {input} {output}"
shell: "python scripts/get_openalex_reference_dfs.py {input} {output}"
shell: "python scripts/scrape_award_papers.py {input} {output}"
shell: "python scripts/get_gscholar_data.py {input} {output}"
shell: "python scripts/get_wos_id.py {input} {output}"
shell: "python scripts/CLASS_country.py {input} {output}"
shell: "python scripts/CLASS_type.py {input} {output}"
shell: "python scripts/get_HT_cleaned_author_df.py {input} {output}"
shell: "python scripts/get_HT_cleaned_paper_df.py {input} {output}"
shell: "python scripts/plot_data_author_chord_diagram_data.py {input} {output}"
shell: "python scripts/plot_vis_concepts_cooccurance_data.py {input} {output}"
shell: "python scripts/plot_top_concepts_trends.py {input} {output}"
shell: "python scripts/plot_sankey_data.py {input} {output}"

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://32vis.hongtaoh.com/
Name: 32vis
Version: 1
Downloaded: 0
Copyright: Public Domain
License: None