This repository contains the data files and code (data processing and analysis) for the paper Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations.
Updated Findings
In Fig. 3(d) and 3(e), we showed that the number of citations of VIS papers from non-VIS papers has been increasing dramatically, but we did not analyze the publication venues of these citing papers. We did this later and found that citations coming from IEEE Transactions on Visualization and Computer Graphics accounted for 12.4% of all 153,549 citations (undeduplicated). Citations from Computer Graphics Forum, HCI venues, PacificVis, and journals in the field of visualization such as Information Visualization and Journal of Visualization are also major sources. This indicates that the impact of VIS is mostly confined to the visualization and HCI areas. Detailed results are available at https://hongtaoh.com/files/top_venues.html.
For the replicability committee: please go to the `reproduce` folder and simply run `bash script.sh`.
Structure
This repository consists of four folders:

- `analyses_and_get_figures` contains the Jupyter notebooks that produce the statistics and figures reported in the Results section of our paper.
- `data` contains the data files we created and analyzed.
- `results` contains the output figures generated by the code in `analyses_and_get_figures`. Figures in both the paper and the supplementary material are included.
- `workflow` contains (1) scripts to obtain data, and (2) Jupyter notebooks to validate data.

`analyses_and_get_figures` and `results` are easy to understand. The most difficult and critical parts are `workflow` and `data`. For detailed data generation and processing procedures, refer to `workflow`. For detailed descriptions of the data that were generated and used in the study, refer to the `data` folder.
Important data
The most important data files in the analyses are as follows:

- `data/ht_class/ht_cleaned_author_df.csv`
- `data/ht_class/ht_cleaned_paper_df.csv`
- `data/interim/openalex_author_df.csv`
- `data/processed/openalex_concept_df.csv`
- `data/processed/large/openalex_citation_concept_df.csv`
- `data/processed/large/openalex_reference_concept_df.csv`
- `data/processed/openalex_refeernce_concept_df_unique.csv`
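These files are plain CSVs, so exploring them requires nothing beyond pandas. A minimal sketch (paths are relative to the repository root; the files under `large` must first be downloaded from OSF, as described below):

```python
import pandas as pd

# Core paper- and author-level tables (paths relative to the repository root)
author_df = pd.read_csv('data/ht_class/ht_cleaned_author_df.csv')
paper_df = pd.read_csv('data/ht_class/ht_cleaned_paper_df.csv')

# Quick sanity checks on what each table contains
print(paper_df.shape, author_df.shape)
print(paper_df.columns.tolist())
```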
Data dictionaries for public data
We have also made data that might be useful for other researchers working on scientometric analysis available on Google Sheets: https://docs.google.com/spreadsheets/d/1JRo33XurW28bGK_Snplno1dbRLDkSZf1T7JmpjNDvTw/
VIS PAPER 1990-2021

- Conference: The conference track of VIS papers. There are four tracks: InfoVis, SciVis, VAST, Vis. Since 2021, IEEE VIS no longer distinguishes between conference tracks, so we assigned the term 'VIS' to all papers published in and after 2021
- Year: The year this paper was published
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- DOI: Paper DOI
- PaperType: Either 'J' (journal paper) or 'C' (conference paper). This data is from vispubdata. For IEEE VIS 2021 papers, we classified them all as 'J'
- OpenAlex ID: The OpenAlex ID associated with this paper. With an ID, for example, W3203914472, you can access this paper's metadata on OpenAlex through https://api.openalex.org/works/W3203914472
- Number of References: Number of references as shown on OpenAlex (as of June 2022)
- Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)
- Number of Citations: Number of citations as shown on OpenAlex (as of June 2022)
- Number of Authors: Number of authors
- Cross-type Collaboration: Whether a paper involves collaborations among researchers from universities and non-educational affiliations (e.g., companies, facilities, government, healthcare, etc.)
- Cross-country Collaboration: Whether a paper involves collaborations among researchers from different countries or regions
- With US Authors: Whether a paper involves at least one author from the United States
- Both Cross-type and Cross-country Collaboration: Whether a paper is both a cross-type and a cross-country collaboration paper
- Google Scholar Citation: Citation counts as shown on Google Scholar (as of June 2022)
- Award: Whether a paper is an award-winning paper. Note that we exclude Test of Time awards
- Award Name: If a paper won an award, which award it got. BP: Best Paper; HM: Honorable Mention; BCS: Best Case Study
- Award Track: The conference track that presented this award to the paper
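As a quick illustration of how these columns can be used, here is a sketch that assumes the sheet has been exported locally as a hypothetical `papers.csv` with exactly the column names above, and that the boolean columns load as booleans:

```python
import pandas as pd

# Hypothetical local export of the "VIS PAPER 1990-2021" sheet
papers = pd.read_csv('papers.csv')

# Share of cross-country collaborations per year
share = papers.groupby('Year')['Cross-country Collaboration'].mean()
print(share.tail())

# Award-winning papers (Test of Time awards are already excluded in the data)
print(papers.loc[papers['Award'], ['Year', 'Title', 'Award Name']].head())
```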
VIS AUTHORS 1990-2021

- Year: The year this paper was published
- DOI: Paper DOI
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- Number of Authors: Number of authors
- Author Position: Author position
- Author Name: Author name
- OpenAlex Author ID: OpenAlex author ID
- Affiliation Name: Author affiliation name
- Affiliation Country Code: alpha-2 (ISO 3166) country code for affiliations
- Affiliation Type: The type of an affiliation, as defined by ROR
- Binary Type: The type of an affiliation, either education or non-education
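The Binary Type column makes it easy to measure non-academic participation over time. A sketch, assuming a hypothetical local export `authors.csv` of this sheet:

```python
import pandas as pd

# Hypothetical local export of the "VIS AUTHORS 1990-2021" sheet
authors = pd.read_csv('authors.csv')

# Fraction of author records with a non-education affiliation, by year
non_edu = authors['Binary Type'] == 'non-education'
print(non_edu.groupby(authors['Year']).mean().tail())
```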
VIS PAPER CONCEPTS

- Year: The year this paper was published
- DOI: Paper DOI
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)
- Index of Concept: Index of concept as shown on OpenAlex (as of June 2022)
- Concept: Concept name
- Concept ID: Concept ID on OpenAlex
- Wikidata: Link to the Wikidata page of a Concept
- Level: The level of this Concept as defined by OpenAlex. Level 0 indicates root Concepts like Computer Science and Psychology. The larger the number, the more granular a Concept is.
- Score: The score assigned to this Concept by OpenAlex. A higher score indicates this Concept is a better representation of a paper.
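Level and Score together let you pick out the dominant, reasonably specific topic of each paper. A sketch, again assuming a hypothetical local export `concepts.csv`:

```python
import pandas as pd

# Hypothetical local export of the "VIS PAPER CONCEPTS" sheet
concepts = pd.read_csv('concepts.csv')

# Keep non-root concepts above a score threshold (0.4 is arbitrary,
# chosen only for illustration), then take the best-scoring one per paper
specific = concepts[(concepts['Level'] >= 1) & (concepts['Score'] > 0.4)]
top = specific.sort_values('Score', ascending=False).groupby('DOI').head(1)
print(top[['DOI', 'Concept', 'Level', 'Score']].head())
```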
Google Scholar Citations

- Year: The year this paper was published
- DOI: Paper DOI
- IEEE Title: Paper title as shown on IEEE Xplore (as of June 2022)
- Title on Google Scholar: Paper title as shown on Google Scholar (as of June 2022)
- Citation Link: Link to papers citing a VIS paper on Google Scholar (as of June 2022)
- Citation Counts on Google Scholar: Citation counts on Google Scholar (as of June 2022)
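Because both this sheet and the paper sheet carry the DOI, the two citation sources are easy to compare. A sketch using the same hypothetical exports as above:

```python
import pandas as pd

papers = pd.read_csv('papers.csv')      # hypothetical export, see above
gscholar = pd.read_csv('gscholar.csv')  # hypothetical export of this sheet

# Distribution of the difference between the two citation sources
merged = papers.merge(gscholar, on='DOI')
diff = merged['Citation Counts on Google Scholar'] - merged['Number of Citations']
print(diff.describe())
```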
Large data
The `large` folder within `data/processed` is empty because GitHub does not allow uploading files larger than 100 MB. The large files are stored in the OSF repository at https://osf.io/zkvjm/ (OSF Storage -> large).
Dependencies
This project uses Python 3.8 with the following packages:

- snakemake
- pandas
- numpy
- matplotlib
- seaborn
- altair
- scikit-learn
- scipy
- plotnine
- beautifulsoup4
- selenium
- urllib3
- requests
- lxml

All packages can be installed with `pip install pkgname`, for example, `pip install scikit-learn`. For `lxml`, use `conda install -c anaconda lxml`. `snakemake` is used for the workflow. For details, see my tutorial on snakemake.
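If you have not used snakemake before, the idea is that each data-generation step is a rule with declared inputs and outputs, and snakemake runs whatever is out of date. A minimal hypothetical sketch (the rule, script, and raw-file names here are invented for illustration; the real rules live in the `workflow` folder):

```python
# Snakefile -- hypothetical sketch, for illustration only
rule all:
    input:
        "data/ht_class/ht_cleaned_paper_df.csv"

rule clean_paper_df:
    input:
        "data/raw/vispubdata.csv"  # invented file name
    output:
        "data/ht_class/ht_cleaned_paper_df.csv"
    shell:
        "python scripts/clean_paper_df.py {input} {output}"  # invented script name
```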
For citation analysis, we also used R. See `citation_analysis.R`.
For Python, we recommend conda and creating a virtual environment. After installing anaconda, you can create a virtual environment:

```bash
conda create --name 32vis python=3.8
conda activate 32vis
```

Then you can install packages with `conda` or `pip`. You can also use the `environment.yml` and `requirements.yml`, but they contain many packages that are not used at all.
Reproducibility
Our work is designed to be reproducible.
Re-generate data?
If you want to reproduce our work from the very beginning, after installing the necessary packages mentioned above, you can delete all folders in the `data` folder except for `raw` and `README.md`. Then:

```bash
conda activate 32vis
cd workflow
snakemake --cores 1
```
This will generate all data again. Please note that:

- We obtained data from the OpenAlex API (see the sketch after this list). However, OpenAlex updates its data every two weeks, so the data you get will differ from ours, and the degree of difference grows with time. For example, if you recreate the data ten years from now, it will look very different from ours.
- Crawling Google Scholar requires a human participant because of the reCAPTCHA security checks.
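For reference, a single OpenAlex works lookup uses the ID format described in the data dictionary above. A minimal sketch with requests (the three fields printed are standard fields of OpenAlex works records):

```python
import requests

# Fetch the metadata of one VIS paper from the OpenAlex API
work = requests.get('https://api.openalex.org/works/W3203914472').json()
print(work['title'], work['publication_year'], work['cited_by_count'])
```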
After all the data is obtained, you can run all files in `analyses_and_get_figures` to reproduce our results.
Okay with our current data?
If you don't plan to re-generate all the data but just want to reproduce the results based on the data we already have, you can simply run all files in `analyses_and_get_figures` directly.
Citation

```bibtex
@article{hao2022thirty,
  title={Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations},
  author={Hao, Hongtao and Cui, Yumian and Wang, Zhengxiang and Kim, Yea-Seul},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2022},
  doi={10.1109/TVCG.2022.3209422},
  publisher={IEEE}
}
```
Code Snippets
This script trains a logistic regression classifier on affiliation strings to predict affiliation country codes, and applies it to the merged author table:

```python
import sys
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from bs4 import BeautifulSoup


def get_simple_df(fname):
    """
    - remove nan,
    - get only two target columns, i.e., raw string and aff type
    - drop duplicates
    """
    raw_string = 'Raw Affiliation String'
    aff_type = 'First Institution Country Code'
    df = pd.read_csv(fname)
    df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
    df = df[[raw_string, aff_type]]
    df = df.drop_duplicates()
    return df


def get_df(cit_author, ref_author, oa_author):
    """concatenate, drop_duplicates, reset index, rename columns, factorize label_str

    Returns:
        the df used for model training and testing. It contains three columns:
        1. aff, which is pre-processed strings of affiliations
        2. label_str, which is country codes in strings,
        3. label, which is factorized version of country codes
    """
    df = pd.concat(
        [oa_author, ref_author, cit_author], ignore_index=True
    ).drop_duplicates().reset_index(drop=True)
    df.columns = ['aff', 'label_str']
    df = df.assign(label=pd.factorize(df['label_str'])[0])
    return df


def get_dicts(df):
    """get two dicts; id <--> cntry"""
    cntry_to_id = dict(zip(df.label_str, df.label))
    id_to_cntry = dict(zip(df.label, df.label_str))
    return cntry_to_id, id_to_cntry


def clean_text(text):
    """Takes a string and returns a string"""
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff


def logist_regression(df):
    '''
    Input:
        df: df
    Returns:
        logreg: logistic regression model
    '''
    X = df.aff
    y = df.label
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    logreg = Pipeline([
        ('vect', CountVectorizer(stop_words='english', min_df=5)),
        ('clf', LogisticRegression(max_iter=600)),
    ])
    print('model training now...')
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    target_names = list(set([id_to_cntry[x] for x in y_test]))
    f = open(CNTRY_CLASSIFICATION_REPORT, 'a')
    f.write('The following is the result for affiliation country code classification' + '\n')
    f.write('Test set accuracy %s' % accuracy_score(y_test_pred, y_test))
    f.write('\n')
    precision, recall, fscore, support = multi_score(
        y_test, y_test_pred, average='weighted')
    f.write('precision: {}'.format(precision))
    f.write('\n')
    f.write('recall: {}'.format(recall))
    f.write('\n')
    f.write('fscore: {}'.format(fscore))
    f.write('\n')
    f.write('support: {}'.format(support))
    f.write('\n')
    f.write('\n')
    f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
    # f.write(classification_report(y_test, y_test_pred, target_names=target_names))
    f.close()
    return logreg


def get_processed_merged_author(DF, LOGREG):
    '''
    Input:
        - DF: merged
        - LOGREG
    Returns:
        - DF with cntry classification results
    '''
    # clean text for affs to be predicted
    DF['IEEE Author Affiliation Filled_Processed'] = DF[
        'IEEE Author Affiliation Filled'].apply(clean_text)
    pred = LOGREG.predict(DF['IEEE Author Affiliation Filled_Processed'])
    results = [id_to_cntry[x] for x in pred]
    DF['country_code_results'] = results
    # if I have handcoded the country codes, use those first
    DF = DF.assign(country_code_results_updated=np.where(
        DF['First Institution Country Code By Hand'].notnull(),
        DF['First Institution Country Code By Hand'],
        DF['country_code_results']
    ))
    return DF


if __name__ == '__main__':
    CIT_AUTHOR = sys.argv[1]
    REF_AUTHOR = sys.argv[2]
    # openalex author df for VIS papers:
    OA_AUTHOR = sys.argv[3]
    MERGED_AUTHOR = sys.argv[4]
    MERGED_CNTRY_PREDICTED = sys.argv[5]
    CNTRY_CLASSIFICATION_REPORT = sys.argv[6]
    # load datasets:
    cit_author = get_simple_df(CIT_AUTHOR)
    ref_author = get_simple_df(REF_AUTHOR)
    oa_author = get_simple_df(OA_AUTHOR)
    merged = pd.read_csv(MERGED_AUTHOR)
    # get df for model training and testing
    df = get_df(cit_author, ref_author, oa_author)
    # clean affiliation texts
    df['aff'] = df['aff'].apply(clean_text)
    df = df.drop_duplicates()
    f = open(CNTRY_CLASSIFICATION_REPORT, 'a')
    f.write(f'there are {df.shape[0]} training examples in country classification.')
    f.write('\n')
    f.close()
    # get dicts
    cntry_to_id, id_to_cntry = get_dicts(df)
    # get logreg
    logreg = logist_regression(df)
    merged_processed = get_processed_merged_author(merged, logreg)
    # export merged_processed
    cols_to_keep = [
        'Year', 'DOI', 'Title',
        'IEEE Number of Authors', 'IEEE Author Position', 'IEEE Author Name',
        'OpenAlex Author ID', 'IEEE Author Affiliation Filled',
        'country_code_results_updated',
    ]
    col_renamer = {
        'Year': 'Year',
        'DOI': 'DOI',
        'Title': 'Title',
        'IEEE Number of Authors': 'Number of Authors',
        'IEEE Author Position': 'Author Position',
        'IEEE Author Name': 'Author Name',
        'OpenAlex Author ID': 'OpenAlex Author ID',
        'IEEE Author Affiliation Filled': 'Affiliation Name',
        'country_code_results_updated': 'Affiliation Country Code',
    }
    merged_cntry_predicted = merged_processed[cols_to_keep]
    merged_cntry_predicted.rename(columns=col_renamer).to_csv(
        MERGED_CNTRY_PREDICTED, index=False)
```
This script does the same for affiliation types, training both a multiclass and a binary (education vs. non-education) classifier:

```python
import sys
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from bs4 import BeautifulSoup


def get_simple_df(fname):
    """
    - remove nan,
    - get only two target columns, i.e., raw string and aff type
    - drop duplicates
    """
    raw_string = 'Raw Affiliation String'
    aff_type = 'First Institution Type'
    df = pd.read_csv(fname)
    df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
    df = df[[raw_string, aff_type]]
    df = df.drop_duplicates()
    return df


def get_df(cit_author, ref_author, oa_author):
    """concatenate, drop_duplicates, reset index, rename columns, factorize label_str

    Returns:
        the df used for model training and testing. It contains five columns:
        1. aff, which is pre-processed strings of affiliations
        2. label_str, which is country codes in strings,
        3. label, which is factorized version of country codes
        4. binary_label_str
        5. binary_label
    """
    df = pd.concat(
        [oa_author, ref_author, cit_author], ignore_index=True
    ).drop_duplicates().reset_index(drop=True)
    df.columns = ['aff', 'label_str']
    df = df.assign(label=pd.factorize(df['label_str'])[0])
    df = df.assign(binary_label_str=np.where(
        df.label_str == 'education', 'education', 'non-education'))
    df = df.assign(binary_label=pd.factorize(df['binary_label_str'])[0])
    return df


def get_dicts(df):
    """get four dicts; id <--> type, for both binary and multiclass"""
    multi_type_to_id = dict(zip(df.label_str, df.label))
    id_to_multi_type = dict(zip(df.label, df.label_str))
    binary_type_to_id = dict(zip(df.binary_label_str, df.binary_label))
    id_to_binary_type = dict(zip(df.binary_label, df.binary_label_str))
    return multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type


def clean_text(text):
    """Takes a string and returns a string"""
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff


def logist_regression(df, LABEL):
    '''
    Input:
        df: df
        LABEL: 'label' if multiclass and 'binary_label' if binary
    Returns:
        logreg: logistic regression classifier (model)
    '''
    X = df.aff
    y = df[LABEL]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    logreg = Pipeline([
        ('vect', CountVectorizer(stop_words='english', min_df=2)),
        ('clf', LogisticRegression(max_iter=600)),
    ])
    print('model training now...')
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    target_names = list(set(df.label_str)) if LABEL == 'label' else list(set(df.binary_label_str))
    logreg_type = 'multiclass classification' if LABEL == 'label' else 'binary classification'
    f = open(TYPE_CLASSIFICATION_REPORT, 'a')
    f.write('The following is the result for aff type' + ' : ' + logreg_type + '\n')
    f.write('Test set accuracy %s' % accuracy_score(y_test, y_test_pred))
    f.write('\n')
    precision, recall, fscore, support = multi_score(
        y_test, y_test_pred, average='weighted')
    f.write('precision: {}'.format(precision))
    f.write('\n')
    f.write('recall: {}'.format(recall))
    f.write('\n')
    f.write('fscore: {}'.format(fscore))
    f.write('\n')
    f.write('support: {}'.format(support))
    f.write('\n')
    f.write('\n')
    f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
    # f.write('\n')
    # f.write(classification_report(y_test, y_test_pred, target_names=target_names))
    f.write('\n')
    f.write('\n')
    f.close()
    return logreg


def get_processed_merged_author(DF, LOGREG_MULTI, LOGREG_BINARY):
    '''
    Input:
        - DF: merged
        - LOGREG_MULTI
        - LOGREG_BINARY
    Returns:
        - DF with binary and multiclass classification results
    '''
    # clean text for affs to be predicted
    DF['IEEE Author Affiliation Filled_Processed'] = DF[
        'IEEE Author Affiliation Filled'].apply(clean_text)
    pred_binary = LOGREG_BINARY.predict(DF['IEEE Author Affiliation Filled_Processed'])
    pred_binary_type = [id_to_binary_type[x] for x in pred_binary]
    pred_multi = LOGREG_MULTI.predict(DF['IEEE Author Affiliation Filled_Processed'])
    pred_multi_type = [id_to_multi_type[x] for x in pred_multi]
    DF['aff_type_results_binary'] = pred_binary_type
    DF['aff_type_results_multiclass'] = pred_multi_type
    # use type by hand if exists
    DF = DF.assign(aff_type_results_binary_updated=np.where(
        DF['Binary Institution Type By Hand'].notnull(),
        DF['Binary Institution Type By Hand'],
        DF['aff_type_results_binary']
    ))
    # use type by hand if exists
    DF = DF.assign(aff_type_results_multiclass_updated=np.where(
        DF['First Institution Type By Hand'].notnull(),
        DF['First Institution Type By Hand'],
        DF['aff_type_results_multiclass']
    ))
    return DF


if __name__ == '__main__':
    CIT_AUTHOR = sys.argv[1]
    REF_AUTHOR = sys.argv[2]
    # openalex author df for VIS papers:
    OA_AUTHOR = sys.argv[3]
    MERGED_AUTHOR = sys.argv[4]
    MERGED_AFF_TYPE_PREDICTED = sys.argv[5]
    TYPE_CLASSIFICATION_REPORT = sys.argv[6]
    # load datasets:
    cit_author = get_simple_df(CIT_AUTHOR)
    ref_author = get_simple_df(REF_AUTHOR)
    oa_author = get_simple_df(OA_AUTHOR)
    merged = pd.read_csv(MERGED_AUTHOR)
    # get df for model training and testing
    df = get_df(cit_author, ref_author, oa_author)
    # clean affiliation texts
    df['aff'] = df['aff'].apply(clean_text)
    # drop duplicates after text pre-processing
    df = df.drop_duplicates()
    f = open(TYPE_CLASSIFICATION_REPORT, 'a')
    f.write(f'there are {df.shape[0]} training examples in aff type classification.')
    f.write('\n')
    f.write('\n')
    f.close()
    # get dicts
    multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type = get_dicts(df)
    # get logreg
    logreg_multi = logist_regression(df, 'label')
    logreg_binary = logist_regression(df, 'binary_label')
    merged_processed = get_processed_merged_author(merged, logreg_multi, logreg_binary)
    # export merged_processed
    cols_to_keep = [
        'Year', 'DOI', 'Title',
        'IEEE Number of Authors', 'IEEE Author Position', 'IEEE Author Name',
        'OpenAlex Author ID', 'IEEE Author Affiliation Filled',
        'aff_type_results_multiclass_updated', 'aff_type_results_binary_updated',
    ]
    col_renamer = {
        'Year': 'Year',
        'DOI': 'DOI',
        'Title': 'Title',
        'IEEE Number of Authors': 'Number of Authors',
        'IEEE Author Position': 'Author Position',
        'IEEE Author Name': 'Author Name',
        'OpenAlex Author ID': 'OpenAlex Author ID',
        'IEEE Author Affiliation Filled': 'Affiliation Name',
        'aff_type_results_multiclass_updated': 'Multiclass Affiliation Type',
        'aff_type_results_binary_updated': 'Binary Affiliation Type',
    }
    merged_aff_type_predicted = merged_processed[cols_to_keep]
    merged_aff_type_predicted.rename(columns=col_renamer).to_csv(
        MERGED_AFF_TYPE_PREDICTED, index=False)
```
This script crawls Google Scholar with Selenium to collect citation counts and citation links for each VIS paper:

```python
import sys
import time
import os
import random
import re
import csv
import urllib.parse
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementNotInteractableException

PAPERS_TO_STUDY = sys.argv[1]
IEEE_PAPER_DF = sys.argv[2]
GSCHOLAR_DATA = sys.argv[3]


def specify_driver_options():
    """specify driver options"""
    options = Options()
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference(
        "browser.helperApps.neverAsk.saveToDisk",
        "text/plain, text/txt, application/plain, application/txt")
    # return the configured options so they can be passed to the driver
    return options


def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def get_dicts(INPUT):
    # INPUT here is ieee_paper_df
    # get year_dict and title_dict
    df = pd.read_csv(INPUT)
    dois = df.loc[:, "DOI"].tolist()
    titles = df.loc[:, "IEEE Title"].tolist()
    years = df.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def get_gscholar_data_by_title(doi, doi_index):
    # TITLE QUERY
    if doi in title_recode_dict.keys():
        title = title_recode_dict[doi]
    else:
        title = doi_title_dict[doi]
    title_to_query = urllib.parse.quote_plus(title)
    doi_to_query = urllib.parse.quote_plus(doi)
    query_string = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C50&q='
    # IF DOI IN TO_QUERY_BY_DOI, USE DOI QUERY
    if doi in to_query_by_doi:
        driver.get(query_string + doi_to_query + '&btnG=')
    # IF NOT, USE TITLE QUERY
    else:
        driver.get(query_string + title_to_query + '&btnG=')
    gs_paper_e = wait.until(EC.presence_of_element_located((
        By.CSS_SELECTOR, 'h3.gs_rt')))
    gs_paper_title = gs_paper_e.text
    gs_citation_e = wait.until(EC.presence_of_element_located((
        By.XPATH, '//div[@class="gs_fl"]//child::a[3]')))
    citation_link = gs_citation_e.get_attribute('href')
    citation_count_string = gs_citation_e.get_attribute('innerHTML')
    if citation_count_string == "Related articles":
        gs_citation_count = 0
    else:
        gs_citation_count = int(re.findall(r'\d+', citation_count_string)[0])
    gscholar_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'IEEE Title': title,
        'Title on Google Scholar': gs_paper_title,
        'Citation Link': citation_link,
        'Citation Counts on Google Scholar': gs_citation_count,
    }
    gscholar_dict_list.append(gscholar_dict)


def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_gscholar_data_by_title(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.2 + random.uniform(0, 0.2))
    driver.close()
    driver.quit()


if __name__ == '__main__':
    driver = webdriver.Firefox(options=specify_driver_options())
    wait = WebDriverWait(driver, 90)
    DOIS = read_txt(PAPERS_TO_STUDY)
    doi_year_dict, doi_title_dict = get_dicts(IEEE_PAPER_DF)
    random_dois = random.sample(DOIS, 10)
    random_dois.append('10.1109/INFVIS.2001.963279')
    gscholar_dict_list = []
    title_recode_dict = {
        # If I don't change the title for querying, the results are wrong:
        # This is the real title on PDF:
        '10.1109/VISUAL.1999.809889':
            'Enabling classification and shading for 3 D texture mapping based '
            'volume rendering using OpenGL and extensions',
    }
    to_query_by_doi = [
        # If I query by title, the results are false:
        '10.1109/VISUAL.1993.398863',
        '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.1998.745315',
        '10.1109/INFVIS.2001.963282',
        '10.1109/VISUAL.1992.235194',
        '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348',
        '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900',
        '10.1109/VISUAL.2000.885719',
        '10.1109/TVCG.2021.3114849',
        '10.1109/VISUAL.1991.175771',
        '10.1109/INFVIS.2001.963279',
        '10.1109/INFVIS.2001.963295',
        '10.1109/VIS.1999.10000',
    ]
    main(DOIS)
    df = pd.DataFrame(gscholar_dict_list)
    df.to_csv(GSCHOLAR_DATA, index=False)
```
This script merges the predicted country codes and affiliation types and derives the paper-level collaboration indicators:

```python
import sys
import itertools
import pandas as pd
import numpy as np

MERGED_CNTRY_PREDICTED = sys.argv[1]
MERGED_AFF_TYPE_PREDICTED = sys.argv[2]
HT_CLEANED_AUTHOR_DF = sys.argv[3]


def get_cross_country_dic(df):
    # a paper is a cross-country collaboration if its authors'
    # affiliation country codes are not all the same
    cross_country_dic = {}
    for group in df.groupby('DOI'):
        DOI = group[0]
        country_codes = group[1]['Affiliation Country Code'].tolist()
        num_of_cntry = len(list(set(country_codes)))
        if num_of_cntry != 1:
            cross_country_dic[DOI] = True
        else:
            cross_country_dic[DOI] = False
    return cross_country_dic


def get_cross_type_dic(df):
    # a paper is a cross-type collaboration if its authors' binary
    # affiliation types (education vs. non-education) are not all the same
    cross_type_dic = {}
    for group in df.groupby('DOI'):
        DOI = group[0]
        types = group[1]['Binary Type'].tolist()
        num_of_types = len(list(set(types)))
        if num_of_types != 1:
            cross_type_dic[DOI] = True
        else:
            cross_type_dic[DOI] = False
    return cross_type_dic


if __name__ == '__main__':
    # load data
    cntry_df = pd.read_csv(MERGED_CNTRY_PREDICTED)
    type_df = pd.read_csv(MERGED_AFF_TYPE_PREDICTED)
    if cntry_df.shape[0] == type_df.shape[0]:
        print('cntry_df has the same length with type_df')
    # get the column of affiliation type
    multi_aff_types = type_df['Multiclass Affiliation Type']
    binary_aff_types = type_df['Binary Affiliation Type']
    # assign it to cntry_df and rename columns
    cntry_df = cntry_df.assign(multi_aff_type=multi_aff_types)
    cntry_df = cntry_df.assign(binary_aff_type=binary_aff_types)
    cntry_df.rename(
        columns={
            'multi_aff_type': 'Affiliation Type',
            'binary_aff_type': 'Binary Type',
        },
        inplace=True
    )
    df = cntry_df.copy()
    cross_country_dic = get_cross_country_dic(df)
    cross_type_dic = get_cross_type_dic(df)
    df['Cross-type Collaboration'] = df.DOI.apply(lambda x: cross_type_dic[x])
    df['International Collaboration'] = df.DOI.apply(lambda x: cross_country_dic[x])
    df.to_csv(HT_CLEANED_AUTHOR_DF, index=False)
```
This script assembles `ht_cleaned_paper_df.csv` by merging vispubdata, OpenAlex, author, Google Scholar, and award data:

```python
import sys
from functools import reduce
import pandas as pd
import numpy as np

PAPER_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
HT_CLEANED_AUTHOR_DF = sys.argv[4]
GSCHOLAR_DATA = sys.argv[5]
AWARD_PAPER_DF = sys.argv[6]
HT_CLEANED_PAPER_DF = sys.argv[7]


def get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY):
    cols = [
        'Conference', 'Year', 'Title', 'DOI',
        'FirstPage', 'LastPage', 'PaperType',
    ]
    vispd = VISPUBDATA_PLUS[
        VISPUBDATA_PLUS.DOI.isin(PAPER_TO_STUDY)].loc[:, cols].reset_index(drop=True)
    vispd.loc[vispd.Year == 2021, 'PaperType'] = 'J'
    return vispd


def get_alex(OPENALEX_PAPER_DF):
    cols = [
        'DOI', 'OpenAlex Year', 'OpenAlex Publication Date', 'OpenAlex ID',
        'OpenAlex Venue Name', 'OpenAlex First Page', 'OpenAlex Last Page',
        'Number of Pages', 'Number of References', 'Number of Concepts',
        'Number of Citations',
    ]
    alex = OPENALEX_PAPER_DF.loc[:, cols]
    return alex


def get_authors(HT_CLEANED_AUTHOR_DF):
    cols = [
        'DOI', 'Number of Authors', 'Cross-type Collaboration',
        'International Collaboration', 'With US Authors',
    ]
    # create the column of "With US Authors"
    for doi in list(set(HT_CLEANED_AUTHOR_DF.DOI)):
        if 'US' in HT_CLEANED_AUTHOR_DF[
                HT_CLEANED_AUTHOR_DF.DOI == doi]['Affiliation Country Code'].tolist():
            HT_CLEANED_AUTHOR_DF.loc[
                HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = True
        else:
            HT_CLEANED_AUTHOR_DF.loc[
                HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = False
    HT_CLEANED_AUTHOR_DF.drop_duplicates(subset=['DOI'], inplace=True)
    authors = HT_CLEANED_AUTHOR_DF.loc[:, cols].reset_index(drop=True)
    # create the column of both cross-type and cross-country collaboration
    authors['Both Cross-type and Cross-country Collaboration'] = authors[
        'Cross-type Collaboration'] * authors['International Collaboration']
    # rename column
    authors.rename(
        columns={'International Collaboration': 'Cross-country Collaboration'},
        inplace=True
    )
    return authors


def get_gscholar(GSCHOLAR_DATA):
    cols = [
        'DOI', 'IEEE Title', 'Citation Counts on Google Scholar',
    ]
    gscholar = GSCHOLAR_DATA.loc[:, cols]
    return gscholar


def get_df_merged(dfs):
    df_merged = reduce(lambda left, right: pd.merge(left, right, on='DOI'), dfs)
    return df_merged


def get_award_dicts(AWARD_PAPER_DF):
    awards = AWARD_PAPER_DF[AWARD_PAPER_DF.Award != 'TT']
    kwargs = {'Track Updated': np.where(awards.Year == 2021, 'VIS', awards.Track)}
    awards = awards.assign(**kwargs)
    award_dois = awards.DOI.tolist()
    award_names = awards.Award.tolist()
    award_tracks = awards['Track Updated'].tolist()
    doi_award_name_dict = dict(zip(award_dois, award_names))
    doi_award_track_dict = dict(zip(award_dois, award_tracks))
    return award_dois, doi_award_name_dict, doi_award_track_dict


def get_df_final(df_merged, award_dois, doi_award_name_dict, doi_award_track_dict):
    df_merged['Award'] = df_merged['DOI'].apply(
        lambda x: True if x in award_dois else False)
    df_merged['Award Name'] = df_merged['DOI'].apply(
        lambda x: doi_award_name_dict[x] if x in award_dois else np.nan)
    df_merged['Award Track'] = df_merged['DOI'].apply(
        lambda x: doi_award_track_dict[x] if x in award_dois else np.nan)
    df_final = df_merged
    return df_final


def main():
    # process data
    vispd = get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY)
    alex = get_alex(OPENALEX_PAPER_DF)
    authors = get_authors(HT_CLEANED_AUTHOR_DF)
    gscholar = get_gscholar(GSCHOLAR_DATA)
    # merge data
    dfs = [vispd, alex, authors, gscholar]
    df_merged = get_df_merged(dfs)
    # get award data
    award_dois, doi_award_name_dict, doi_award_track_dict = get_award_dicts(AWARD_PAPER_DF)
    df_final = get_df_final(
        df_merged, award_dois, doi_award_name_dict, doi_award_track_dict)
    # write to file
    df_final.to_csv(HT_CLEANED_PAPER_DF, index=False)


if __name__ == '__main__':
    # load data
    VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
    PAPER_TO_STUDY = pd.read_csv(PAPER_TO_STUDY, header=None)[0].tolist()
    OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
    HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
    GSCHOLAR_DATA = pd.read_csv(GSCHOLAR_DATA)
    AWARD_PAPER_DF = pd.read_csv(AWARD_PAPER_DF)
    main()
```
This script scrapes paper and author metadata from IEEE Xplore by parsing the metadata JSON embedded in each paper page:

```python
import sys
import json
import random
import time
import re
from io import StringIO
from html.parser import HTMLParser
import pandas as pd
import numpy as np
import requests, lxml
from bs4 import BeautifulSoup
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
IEEE_AUTHOR_DF = sys.argv[3]
IEEE_PAPER_DF = sys.argv[4]
PROBLEM_DOIS = sys.argv[5]


def get_s():
    # set retry if status codes in [500, 502, 503, 504, 429]
    # also return headers
    s = requests.Session()
    retries = Retry(
        total=5,
        backoff_factor=0.1,
        status_forcelist=[500, 502, 503, 504, 429],
    )
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def get_response(URL):
    response = s.get(url=URL, headers=headers)
    while response.status_code != 200:
        print(f'response status code is {response.status_code}. retrying now...')
        time.sleep(5)
        response = s.get(url=URL, headers=headers)
    return response


def get_soup(RESPONSE):
    html = RESPONSE.text
    soup = BeautifulSoup(html, 'lxml')
    return soup


def get_j(DOI, SOUP):
    if DOI != '10.1109/VIS.1999.10000':
        meta_str = SOUP.find_all('script')[11].string.rsplit(
            'xplGlobal.document.metadata=')[1].rsplit(
            'xplGlobal.document.userLoggedIn=')[0]
        # delete anything after the last `}`
        meta_str = meta_str.replace(re.findall(r'[^\}]+$', meta_str)[0], '')
        j = json.loads(meta_str)
    else:
        j = None
    return j


# strip html tags and entities in titles
# source: https://stackoverflow.com/a/925630
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


# def get_ieee_title(J):
#     # get ieee paper title
#     title_raw = J['title']
#     title = strip_tags(title_raw)
#     return title


def update_paper_dict_list(J, DOI):
    if DOI != '10.1109/VIS.1999.10000':
        title_raw = J['title']
        ieee_title = strip_tags(title_raw)
        ieee_doi = J['doi']
    else:
        ieee_title = doi_title_dict[DOI]
        ieee_doi = DOI
    paper_dict = {
        'Year': doi_year_dict[DOI],
        'DOI': DOI,
        'Title': doi_title_dict[DOI],
        'IEEE Title': ieee_title,
        'IEEE DOI': ieee_doi,
    }
    paper_dict_list.append(paper_dict)


def update_author_dict_list(J, DOI):
    AUTHOR_JSON = J['authors']
    for i in AUTHOR_JSON:
        try:
            first_name = i['firstName']
        except:
            first_name = None
        try:
            last_name = i['lastName']
        except:
            last_name = None
        try:
            author_name = i['name']
        except:
            author_name = None
        author_num = len(AUTHOR_JSON)
        author_position = AUTHOR_JSON.index(i) + 1
        try:
            affiliation_element = i['affiliation']
            affiliation_name = affiliation_element[0]
            affiliation_num = len(affiliation_element)
            one_affiliation = True if affiliation_num == 1 else False
        except:
            affiliation_name = None
            affiliation_num = None
            one_affiliation = None
        try:
            author_id = 'https://ieeexplore.ieee.org/author/' + i['id']
        except:
            author_id = None
        author_dict = {
            'Year': doi_year_dict[DOI],
            'DOI': DOI,
            'Title': doi_title_dict[DOI],
            # 'IEEE Title': IEEE_TITLE,
            # 'First Name': first_name,
            # 'Last Name': last_name,
            'Number of Authors': author_num,
            'Author Position': author_position,
            'Author Name': author_name,
            'Author ID': author_id,
            'Author Affiliation': affiliation_name,
            # 'Number of Affiliations': affiliation_num,
            'One Affiliation': one_affiliation,
        }
        author_dict_list.append(author_dict)


def get_empty_author_dict(DOI):
    author_dict = {
        'Year': doi_year_dict[DOI],
        'DOI': DOI,
        'Title': doi_title_dict[DOI],
    }
    author_dict_list.append(author_dict)


def main(DOIS):
    for DOI in DOIS:
        doi_index = DOIS.index(DOI) + 1
        url = 'https://doi.org/' + DOI
        response = get_response(url)
        soup = get_soup(response)
        j = get_j(DOI, soup)
        update_paper_dict_list(j, DOI)
        try:
            if DOI != '10.1109/VIS.1999.10000':
                update_author_dict_list(j, DOI)
            else:
                get_empty_author_dict(DOI)
        except:
            problem_dois_list.append(DOI)
            print(f'something wrong with {DOI}')
        time.sleep(0.4 + random.uniform(0, 0.4))
        print(f'{doi_index} is done')


if __name__ == '__main__':
    s = get_s()[0]
    headers = get_s()[1]
    PAPERS = pd.read_csv(PAPERS_TO_STUDY, header=None)
    DOIS = PAPERS[0].tolist()
    random_dois = random.sample(DOIS, 10)
    random_dois.append('10.1109/VIS.1999.10000')
    doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
    }
    author_dict_list = []
    paper_dict_list = []
    problem_dois_list = []
    # main(random_dois)
    main(DOIS)
    author_df = pd.DataFrame(author_dict_list)
    paper_df = pd.DataFrame(paper_dict_list)
    author_df.to_csv(IEEE_AUTHOR_DF, index=False)
    paper_df.to_csv(IEEE_PAPER_DF, index=False)
    with open(PROBLEM_DOIS, 'w') as f:
        for doi in problem_dois_list:
            f.write("%s\n" % doi)
```
This script merges the IEEE and OpenAlex author tables, fuzzy-matching author names and hand-correcting problematic records:

```python
import sys
import re
import csv
import difflib
import pandas as pd
import numpy as np

IEEE_AUTHOR = sys.argv[1]
OPENALEX_AUTHOR = sys.argv[2]
PAPERS_TO_STUDY = sys.argv[3]
VISPUBDATA = sys.argv[4]
MERGED_AUTHOR_DF = sys.argv[5]


def get_dicts(VISPUBDATA):
    # get year_dict and title_dict
    vispd = pd.read_csv(VISPUBDATA)
    dois = vispd.loc[:, "DOI"].tolist()
    titles = vispd.loc[:, "Title"].tolist()
    years = vispd.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def update_ieee_orig(DF):  # DF here is ieee_orig
    """update ieee_orig

    ieee_orig is wrong in '10.1109/TVCG.2008.157' as it contains an additional
    author that shouldn't be there; also, ieee_orig lacks author info for
    '10.1109/VIS.1999.10000'. What this function does is to delete the
    additional author in '10.1109/TVCG.2008.157' and update info in that paper.
    Then, I added author data manually for '10.1109/VIS.1999.10000'.
    """
    DF = DF.drop(DF[DF.DOI == '10.1109/VIS.1999.10000'].index)
    row_to_drop = DF.index[DF.DOI == '10.1109/TVCG.2008.157'].tolist()[0]
    df_dropped = DF.drop([row_to_drop])
    df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Number of Authors'] -= 1
    df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Author Position'] -= 1.0
    df = df_dropped
    FILL_DATA = [
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 1,
            'Author Name': 'Daniel Cohen-Or',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        },
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 2,
            'Author Name': 'David Levin',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        },
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 3,
            'Author Name': 'Offir Remez',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        }
    ]
    fill_data_df = pd.DataFrame(FILL_DATA)
    df = df.append(fill_data_df, ignore_index=True)
    return df


def get_diff_dois(IEEE, ALEX):  # ieee, alex
    # return a list of DOIs where alex is wrong in Number of Authors
    DOIS = list(set(IEEE.DOI))
    diff_dois = []
    for doi in DOIS:
        ieee_n = IEEE[IEEE.DOI == doi]['Number of Authors'].tolist()[0]
        alex_n = ALEX[ALEX.DOI == doi]['Number of Authors'].tolist()[0]
        if ieee_n != alex_n:
            diff_dois.append(doi)
    return diff_dois


def get_alex_new(IEEE, ALEX, DIFF_DOIS):
    """
    For DOIs where alex is wrong in Number of Authors, get correct data from
    IEEE first. Drop the rows where alex is wrong from alex, and append the
    correct ieee data to alex_dropped.

    Returns:
        alex_new, where data of Number of Authors is correct
    """
    df_to_append = IEEE[IEEE.DOI.isin(DIFF_DOIS)].iloc[:, 0:6]
    alex_dropped = ALEX.drop(ALEX[ALEX.DOI.isin(DIFF_DOIS)].index)
    alex_new = alex_dropped.append(df_to_append, ignore_index=True)
    return alex_new


def get_sorted_dfs(IEEE, ALEX_NEW, PAPERS):
    """sort ieee and alex author df by paper index and author position

    I added a variable 'Paper Index' to both ieee and alex_new. I also added
    a prefix of 'IEEE ' in ieee. Then I sort the two datasets by 'Paper Index'
    and 'Author Position'.

    Returns:
        two dataframes, ieee_sorted, and alex_new_sorted
    """
    IEEE['Paper Index'] = [PAPERS.index(i) for i in IEEE.DOI.tolist()]
    ALEX_NEW['Paper Index'] = [PAPERS.index(i) for i in ALEX_NEW.DOI.tolist()]
    IEEE = IEEE.add_prefix('IEEE ')
    alex_new_sorted = ALEX_NEW.sort_values(
        by=['Paper Index', 'Author Position'],
    ).reset_index(drop=True)
    ieee_sorted = IEEE.sort_values(
        by=['IEEE Paper Index', 'IEEE Author Position'],
    ).reset_index(drop=True)
    return ieee_sorted, alex_new_sorted


def get_concat_df(IEEE, ALEX, PAPERS):  # ieee_sorted, alex_sorted
    """check https://stackoverflow.com/a/13680953 for details"""
    fuzzy_match_df_list = []
    mismatch_doi_list = []
    for doi in PAPERS:
        df1 = IEEE[IEEE['IEEE DOI'] == doi]
        df2 = ALEX[ALEX['DOI'] == doi]
        try:
            kwargs = {'IEEE Author Name': df2['Author Name'].apply(
                lambda x: difflib.get_close_matches(
                    x, df1['IEEE Author Name'], cutoff=0.6)[0])
            }
        except:
            kwargs = {'IEEE Author Name': df1['IEEE Author Name']}
            mismatch_doi_list.append(doi)
        df2 = df2.assign(**kwargs)
        df = df1.merge(df2, on='IEEE Author Name', how='inner')
        fuzzy_match_df_list.append(df)
    print(f'in {len(mismatch_doi_list)} dois, fuzzy matching was not successful, '
          'so I assumed author position in merging')
    df = pd.concat(fuzzy_match_df_list, ignore_index=True)
    return df


def flatten(t):
    """convert list of lists to a list of items
    source: https://stackoverflow.com/a/952952
    """
    return [item for sublist in t for item in sublist]


def update_with_vispubdata_author_data(VISPD, DF):  # vispd, concat_df
    ieee_wrong = [
        '10.1109/INFVIS.2005.1532150',
        '10.1109/VISUAL.2005.1532819',
        '10.1109/VISUAL.2005.1532794',
        '10.1109/VISUAL.1992.235178',
    ]
    correct_author_num = [5, 2, 5, 4]
    correct_author_num_dict = dict(zip(ieee_wrong, correct_author_num))
    vispd_names = VISPD.loc[VISPD.DOI.isin(ieee_wrong), 'AuthorNames-Deduped'].tolist()
    dois = flatten([np.repeat(doi, correct_author_num_dict[doi]) for doi in ieee_wrong])
    years = [doi_year_dict[x] for x in dois]
    titles = [doi_title_dict[x] for x in dois]
    author_names = flatten([x.split(';') for x in vispd_names])
    author_nums = flatten([np.repeat(i, i) for i in correct_author_num])
    author_positions = flatten([range(1, i + 1) for i in correct_author_num])
    paper_index = [papers.index(doi) for doi in dois]
    DF_TO_FILL = pd.DataFrame({
        'IEEE DOI': dois,
        'DOI': dois,
        'IEEE Year': years,
        'Year': years,
        'IEEE Title': titles,
        'Title': titles,
        'IEEE Number of Authors': author_nums,
        'IEEE Author Position': author_positions,
        'IEEE Author Name': author_names,
        'Number of Authors': author_nums,
        'Author Position': author_positions,
        'Author Name': author_names,
        'IEEE Paper Index': paper_index,
        'Paper Index': paper_index,
    })
    df_dropped = DF.drop(DF[DF['IEEE DOI'].isin(ieee_wrong)].index)
    df_new = df_dropped.append(DF_TO_FILL, ignore_index=True)
    df_new = df_new.sort_values(
        by=['IEEE Paper Index', 'IEEE Author Position'],
    ).reset_index(drop=True)
    return df_new


def update_country_code(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'First Institution Country Code By Hand'] = NEW_DATA
    # this is to change openalex author names to be the same as IEEE author names
    # DF.loc[DF['DOI'] == DOI, 'Author Name'] = DF.loc[DF['DOI'] == DOI, 'IEEE Author Name']
    return DF


def update_country_code_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING,
           'First Institution Country Code By Hand'] = NEW_DATA
    return DF


def update_type(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'First Institution Type By Hand'] = NEW_DATA
    return DF


def update_type_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING,
           'First Institution Type By Hand'] = NEW_DATA
    return DF


def update_affiliations(DF, DOI, NEW_DATA):
    # update both ieee author affiliation, alex first institution names and raw string
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Affiliation'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'First Institution Name'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'Raw Affiliation String'] = NEW_DATA
    return DF


def update_author_name(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Name'] = NEW_DATA
    return DF


def update_concat_df(DF):  # DF here is concat_df
    """Update data for specific DOIs

    Return:
        still concat_df, but updated
    """
    # '10.1109/VISUAL.1996.568115',
    update_country_code(DF, '10.1109/VISUAL.1996.568115', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1996.568115', ['company'] * 2 + ['facility'])
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568115',
        ['MRJ, Inc'] * 2 + ['NASA Ames Research Center'])
    # '10.1109/VISUAL.2000.885735'
    update_country_code(DF, '10.1109/VISUAL.2000.885735', np.repeat('NL', 6))
    update_type(DF, '10.1109/VISUAL.2000.885735', ['government'] * 2 + ['education'] * 4)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885735',
        np.append(
            np.repeat('Center for Mathematics and Computer Science, CWI, Amsterdam, Netherlands', 2),
            np.repeat('Swammerdam Inst. for Life Sciences, BioCentrum Amsterdam, Amsterdam, Netherlands', 4)
        ))
    # '10.1109/VISUAL.1996.568143',
    update_country_code(DF, '10.1109/VISUAL.1996.568143', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.1996.568143', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568143',
        ['Ohio State University, Columbus, OH, USA'] * 6)
    # '10.1109/VISUAL.1999.809936',
    update_country_code(DF, '10.1109/VISUAL.1999.809936', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1999.809936', ['education'] * 3)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809936',
        ['Worcester Polytechnic Institute, Worcester, MA, USA'] * 3)
    # '10.1109/INFVIS.2002.1173147',
    # IEEE Xplore got author name wrong
    update_country_code(DF, '10.1109/INFVIS.2002.1173147', ['SE', 'US', 'SE'])
    update_type(DF, '10.1109/INFVIS.2002.1173147', ['education'] * 3)
    update_affiliations(
        DF, '10.1109/INFVIS.2002.1173147',
        [
            'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
            'Dept. of Psychology, Indiana University, Bloomington, Indiana, USA',
            'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
        ])
    update_author_name(
        DF, '10.1109/INFVIS.2002.1173147',
        ['M. Lind', 'G.P. Bingham', 'C. Forsell'])
    # '10.1109/VISUAL.1992.235175',
    update_country_code(DF, '10.1109/VISUAL.1992.235175', ['US'] * 12)
    update_type(
        DF, '10.1109/VISUAL.1992.235175',
        ['company'] * 3 + ['government'] * 2 + ['education'] * 6 + ['company'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.1992.235175',
        [
            'Unisys Corporation',
            'Sterling Software',
            'Unisys Corporation',
            'U.S. Environmental Protection Agency, United States',
            'U.S. Environmental Protection Agency',
            'University of Alabama in Huntsville (UAH), United States',
            'Florida State University, United States',
            'Florida State University, United States',
            'University of Wisconsin, Madison, WI, United States',
            'University of Wisconsin, Madison, WI, United States',
            'University of Wisconsin, Madison, WI, United States',
            'IBM T.J. Watson Research Center, United States',
        ])
    # '10.1109/TVCG.2006.182',
    update_country_code(DF, '10.1109/TVCG.2006.182', ['US'] * 5)
    update_type(DF, '10.1109/TVCG.2006.182', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/TVCG.2006.182',
        ['Brown University, United States'] * 5)
    # '10.1109/TVCG.2015.2467971',
    update_country_code(DF, '10.1109/TVCG.2015.2467971', ['US'] * 5)
    update_type(DF, '10.1109/TVCG.2015.2467971', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/TVCG.2015.2467971',
        ['University of North Carolina at Charlotte, NC, United States'] * 5)
    # '10.1109/SciVis.2015.7429489',
    # author affiliations listed on ieee are all WRONG!!!
    # I found the authors' correct affiliation on their ieee author id pages
    update_country_code(DF, '10.1109/SciVis.2015.7429489', ['DE'] * 5)
    update_type(DF, '10.1109/SciVis.2015.7429489', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/SciVis.2015.7429489',
        ['Technical University of Munich, Germany'] * 5)
    # '10.1109/VISUAL.2005.1532821',
    update_country_code(DF, '10.1109/VISUAL.2005.1532821', ['AT', 'HR', 'AT', 'AT', 'US'])
    update_type(DF, '10.1109/VISUAL.2005.1532821', ['company'] * 4 + ['education'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2005.1532821',
        ['VRVis Research Center Vienna, Austria'] + ['AVL-AST Zagreb, Croatia'] +
        ['VRVis Research Center Vienna, Austria'] * 2 + ['Virginia Tech'])
    # '10.1109/VISUAL.2000.885692',
    update_country_code(DF, '10.1109/VISUAL.2000.885692', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.2000.885692', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885692',
        ['University of Utah, Salt Lake City, UT, USA'] * 4 +
        ['Vanderbilt University, USA'] +
        ['University of Utah, Salt Lake City, UT, USA'])
    # '10.1109/VISUAL.1999.809912',
    update_country_code(DF, '10.1109/VISUAL.1999.809912', ['DE'] * 4)
    update_type(DF, '10.1109/VISUAL.1999.809912', ['education'] * 2 + ['healthcare'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809912',
        ['WSUGRIS, University of Tubingen, Tubingen, Germany'] * 2 +
        ['Department of Neuroradiology, University Hospital Tubingen, Tubingen, Germany'] * 2)
    # '10.1109/VISUAL.1999.809929',
    update_country_code(DF, '10.1109/VISUAL.1999.809929', ['US'] * 4)
    update_type(DF, '10.1109/VISUAL.1999.809929', ['company'] * 4)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809929',
        ['IBM T.J. Watson Research Center, United States'] * 3 + ['UBS Group AG'])
    # '10.1109/VISUAL.1999.809884',
    # In this paper, openalex got country wrong and ieee got some of the affiliation wrong
    update_country_code(DF, '10.1109/VISUAL.1999.809884', ['DE'] * 5)
    update_type(DF, '10.1109/VISUAL.1999.809884', ['nonprofit'] * 4 + ['education'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809884',
        ['German National Research Centre for Information Technology, Germany'] * 4 +
        ['Department of Physics & Astronomy, University of Heidelberg, Germany'])
    # '10.1109/VISUAL.1999.809920',
    # openalex got country wrong
    update_country_code(DF, '10.1109/VISUAL.1999.809920', ['DE'] * 5)
    # '10.1109/VISUAL.1993.398911',
    # openalex got this paper country wrong for the last two authors
    update_country_code(DF, '10.1109/VISUAL.1993.398911', ['RU'] * 4 + ['DE'] * 2)
    # '10.1109/VISUAL.2005.1532816',
    # ieee xplore got author positions and author affiliations wrong
    update_author_name(
        DF, '10.1109/VISUAL.2005.1532816',
        [
            'Gregor Schlosser',
            'Jürgen Hesser',
            'Frank Zeilfelder',
            'Christian Rossl',
            'Reinhard Manner',
            'Gunther Nurnberger',
            'Hans-Peter Seidel',
        ])
    update_country_code(DF, '10.1109/VISUAL.2005.1532816', ['DE'] * 7)
    update_type(
        DF, '10.1109/VISUAL.2005.1532816',
        ['education'] * 3 + ['nonprofit'] * 1 + ['education'] * 2 + ['nonprofit'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2005.1532816',
        ['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany'] * 2 +
        ['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
        ['Max Planck Institut für Informatik, Saarbruecken, Germany'] +
        ['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany'] +
        ['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
        ['Max Planck Institut für Informatik, Saarbruecken, Germany'])
    # '10.1109/VAST.2016.7883507',
    # This is the paper where i don't have ieee author affiliation or openalex raw string,
    # but i have openalex first institution name.
    # Another note: Information on IEEE about the first two authors of this paper is WRONG!
    update_country_code(DF, '10.1109/VAST.2016.7883507', ['DE'] * 5)
    update_type(DF, '10.1109/VAST.2016.7883507', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/VAST.2016.7883507',
        ['University of Stuttgart, Germany'] * 5)
    # '10.1109/VISUAL.2004.38',
    update_country_code(DF, '10.1109/VISUAL.2004.38', ['CN'] * 1 + ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.2004.38', ['education'] * 3 + ['company'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2004.38',
        ['Zhejiang University, China'] +
        ['Carnegie Mellon University, United States'] +
        ['Massachusetts Institute Of Technology, United States'] +
        ['Mitsubishi Electric Research Laboratories, United States'])
    """The following are cases where i have raw string, but not type or country code"""
    # '10.1109/TVCG.2006.195',
    update_country_code(DF, '10.1109/TVCG.2006.195', ['NL'] * 3)
    update_type(DF, '10.1109/TVCG.2006.195', ['education'] * 2 + ['government'] * 1)
    update_affiliations(
        DF, '10.1109/TVCG.2006.195',
        ['Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Netherlands'] * 2 +
        ['Center for Mathematics and Computer Science (CWI), Netherlands'] * 1)
    # '10.1109/VISUAL.1996.567752',
    update_country_code(DF, '10.1109/VISUAL.1996.567752', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1996.567752', ['company'] * 3)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.567752',
        ['GE Corporate Research & Development, United States'] * 3)
    # '10.1109/VISUAL.1999.809907',
    update_country_code(DF, '10.1109/VISUAL.1999.809907', ['NL'] * 2)
    update_type(DF, '10.1109/VISUAL.1999.809907', ['government'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809907',
        ['Center for Mathematics and Computer Science (CWI), Netherlands'] * 2)
    # '10.1109/VISUAL.2004.88',
    update_country_code(DF, '10.1109/VISUAL.2004.88', ['DE'] * 2)
    update_type(DF, '10.1109/VISUAL.2004.88', ['nonprofit'] + ['education'])
    update_affiliations(
        DF, '10.1109/VISUAL.2004.88',
        ['Caesar Research Center, Bonn, Germany'] +
        ['Interdisciplinary Center for Scientific Computing, Heidelberg, Germany'])
    # '10.1109/VISUAL.2004.113',
    update_type_by_raw_string(DF, 'DLR Goettingen', ['government'])
    update_country_code_by_raw_string(DF, 'DLR Goettingen', 'DE')
    # '10.1109/VISUAL.2000.885722',
    update_type_by_raw_string(DF, 'ETH Zentrum, CH - 8092 Switzerland', 'education')
    update_country_code_by_raw_string(DF, 'ETH Zentrum, CH - 8092 Switzerland', 'CH')
    # '10.1109/VISUAL.2000.885715',
    update_country_code(
        DF, '10.1109/VISUAL.2000.885715',
        ['DE'] * 3 + ['NL'] + ['DE'] + ['NL'])
    update_type(DF, '10.1109/VISUAL.2000.885715', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885715',
        ['University of Bonn, Bonn, Germany'] * 3 +
        ['Eindhoven University of Technology'] +
        ['University of Bonn, Bonn, Germany'] +
        ['Eindhoven University of Technology'])
    # '10.1109/VISUAL.2000.885731',
    update_country_code(DF, '10.1109/VISUAL.2000.885731', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.2000.885731', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885731',
        ['Brown University, United States'] * 6)
    # '10.1109/VISUAL.1996.568133',
    update_country_code(DF, '10.1109/VISUAL.1996.568133', ['US'] * 7)
    update_type(
        DF, '10.1109/VISUAL.1996.568133',
        ['healthcare'] + ['education'] + ['facility'] * 2 + ['healthcare'] + ['education'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568133',
        ['National Jewish Center for Immunology and Respiratory Medicine, United States'] +
        ['University of New Mexico, United States'] +
        ['Sandia National Laboratories, United States'] * 2 +
        ['National
```
Jewish Center for Immunology and Respiratory Medicine, United States'] + [ 'State University of New York at Stony Brook, United States'] + [ 'University of New Mexico, United States'] ) # '10.1109/VISUAL.2005.1532808', update_country_code( DF, '10.1109/VISUAL.2005.1532808', ['DE'], ) update_type( DF, '10.1109/VISUAL.2005.1532808', ['education'], ) update_affiliations( DF, '10.1109/VISUAL.2005.1532808', ['University of Stuttgart'] ) # '10.1109/VISUAL.1998.745350', update_country_code( DF, '10.1109/VISUAL.1998.745350', ['US']*6, ) update_type( DF, '10.1109/VISUAL.1998.745350', ['facility']*6, ) update_affiliations( DF, '10.1109/VISUAL.1998.745350', ['Naval Reseach Lab, Washington, D.C.']*6 ) # '10.1109/VISUAL.2005.1532776', update_country_code( DF, '10.1109/VISUAL.2005.1532776', ['US']*7, ) update_type( DF, '10.1109/VISUAL.2005.1532776', ['company']*3 + ['facility']*2 + ['company']*2, ) update_affiliations( DF, '10.1109/VISUAL.2005.1532776', ['Kitware, United States']*3 + [ 'Sandia National Laboratories, United States']*2 + [ 'Simmetrix, United States']*2, ) # '10.1109/VISUAL.1996.568150', update_country_code( DF, '10.1109/VISUAL.1996.568150', ['NL']*4, ) update_type( DF, '10.1109/VISUAL.1996.568150', ['nonprofit'] + ['government']*2 + ['education'] ) update_affiliations( DF, '10.1109/VISUAL.1996.568150', ['Netherlands Energy Research Foundation, Netherlands'] + [ 'Centre for Mathematics and Computer Science (CWI), Netherlands']*2 + [ 'Vrije Universiteit, Netherlands'] ) # '10.1109/VISUAL.1990.146398', update_country_code( DF, '10.1109/VISUAL.1990.146398', ['US']*4, ) update_type( DF, '10.1109/VISUAL.1990.146398', ['government'] + ['company']*3 ) update_affiliations( DF, '10.1109/VISUAL.1990.146398', ['NASA Ames Research Center, Moffett Field, CA, USA'] + [ 'Sterling Software, United States'] + [ 'Crossfield Marketing, United States'] + [ 'Crystal River Engineering, Inc., Groveland, CA, USA'] ) # '10.1109/VISUAL.1996.568120', update_country_code( DF, '10.1109/VISUAL.1996.568120', ['US']*3, ) update_type( DF, '10.1109/VISUAL.1996.568120', ['education']*3 ) update_affiliations( DF, '10.1109/VISUAL.1996.568120', ['University of Illinois at Chicago, United States'] + [ 'University of Chicago, United States'] + [ 'University of Illinois at Chicago, United States'] ) """BELOW ARE WHERE I FILL AUTHOR DATA FOR ROWS WHERE DATA WAS FROM VISPUBDATA RAW""" # '10.1109/INFVIS.2005.1532150', update_country_code( DF, '10.1109/INFVIS.2005.1532150', ['US']*5, ) update_type( DF, '10.1109/INFVIS.2005.1532150', ['education']*5, ) update_affiliations( DF, '10.1109/INFVIS.2005.1532150', ['Stanford University, United States']*5, ) # '10.1109/VISUAL.2005.1532819', update_country_code( DF, '10.1109/VISUAL.2005.1532819', ['CA']*2, ) update_type( DF, '10.1109/VISUAL.2005.1532819', ['education']*2, ) update_affiliations( DF, '10.1109/VISUAL.2005.1532819', ['University of Alberta, Canada']*2, ) # '10.1109/VISUAL.2005.1532794', update_country_code( DF, '10.1109/VISUAL.2005.1532794', ['US']*5, ) update_type( DF, '10.1109/VISUAL.2005.1532794', ['facility'] + ['education']*3 + ['facility'], ) update_affiliations( DF, '10.1109/VISUAL.2005.1532794', ['Oak Ridge National Lab, United States'] + [ 'The University of Tennessee, United States']*3 + [ 'Oak Ridge National Lab, United States'], ) # '10.1109/VISUAL.1992.235178', update_country_code( DF, '10.1109/VISUAL.1992.235178', ['US']*4, ) update_type( DF, '10.1109/VISUAL.1992.235178', ['education']*4, ) update_affiliations( DF, '10.1109/VISUAL.1992.235178', ['University of Utah, 
United States']*4, ) ## IEEE Website updates the name of Sehi LYi but this update is ## different from the name shown on PDF. I changed it back. # '10.1109/TVCG.2021.3114876', update_author_name( DF, '10.1109/TVCG.2021.3114876', ["Sehi L'Yi", 'Qianwen Wang', 'Fritz Lekschas', 'Nils Gehlenborg'], ) ## I found the in this paper, Some authors' affiliations contain two institutions update_country_code( DF, '10.1109/TVCG.2011.207', ['DE']*4, ) update_type( DF, '10.1109/TVCG.2011.207', ['company'] + ['education']*1 + ['company']*2, ) update_affiliations( DF, '10.1109/TVCG.2011.207', ['Fraunhofer MEVIS, Germany'] + [ 'Center of Complex Systems and Visualization (CeVis), University of Bremen, Germany']*1 + [ 'Fraunhofer MEVIS, Germany']*2, ) ## I found that in this paper, the first author has two affiliations update_country_code( DF, '10.1109/INFVIS.2004.1', ['FR']*3, ) update_type( DF, '10.1109/INFVIS.2004.1', ['education']*1 + ['nonprofit']*1 + ['education']*1 ) update_affiliations( DF, '10.1109/INFVIS.2004.1', ['ecole des mines de nantes nantes france'] + ['INRIA']*1 + ['ecole des mines de nantes nantes france'], ) return DF def manual_update(DF, DOI, AUTHOR_NAME, COL_TO_CHANGE, TEXT): """This is to manually update errors in rows where ieee author info is nan and where openalex author info is complete """ DF.loc[(DF['DOI'] == DOI) & (DF['IEEE Author Name'] == AUTHOR_NAME), COL_TO_CHANGE] = TEXT def manual_update_concat_df(DF): # DF here is concat_df manual_update( DF, '10.1109/VISUAL.1997.663848', 'R. Machiraju', 'Raw Affiliation String', 'Mississippi State University, Mississippi, United States' ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'Raw Affiliation String', 'VA Tech Hydro Corporation, Swizerland', ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'First Institution Type', 'company' ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'First Institution Country Code', 'CH', ) manual_update( DF, '10.1109/INFVIS.1999.801864', 'J. Sean', 'IEEE Author Name', 'Jeffrey Senn', ) manual_update( DF, '10.1109/INFVIS.1999.801864', 'J. Sean', 'Author Name', 'Jeffrey Senn', ) manual_update( DF, '10.1109/TVCG.2019.2934260', 'Andrew J. Solis', 'Raw Affiliation String', 'University of Texas Austin, Texas, United States', ) manual_update( DF, '10.1109/TVCG.2019.2934260', 'Andrew J. Solis', 'First Institution Name', 'University of Texas Austin', ) def get_concat_df_filled(DF): # DF here is concat_df """ find out who don't have affilition, and fill the data manually Get the subset of concat_df where there does not exist any affiliation name. Then drop this subset from concat_df Update this subset's IEEE Author Affiliation with fill_dict, and then append this updated subset to concat_df_dropped Returns: concat_df_filled, where all authors have at least one affiliation name """ fill_dict = { 'K.I. Joy': 'University of California, Davis, United States', 'H. Pfister': 'Department of Computer Science, State University of New York at Stony Brook, United States', 'A.J. Kolojechick': 'Carnegie Mellon University,School of Computer Science,Pittsburgh,United States', 'M. Roth': 'Computer Graphics Research Group, Deptartment of Computer Science, ETH Zurich, Switzerland', 'P.C. Wong': 'Pacific Northwest National Laboratory, United States', 'H. Foote': 'Pacific Northwest National Laboratory, United States', 'W. Strasser': 'Computer Graphics Lab, University of Tubingen, Germany', 'M. 
Tuveri': 'Center for Advanced Studies, Research and Development in Sardinia, Cagliari, Italy', 'N. Fanst': 'Georgia Institute of Technology, United States', 'Heike Janicke': 'Image and Signal Processing Group at the Universi ̈at Leipzig, Germany', 'A. Vilanova': 'Institute of Computer Graphics, Vienna University of Technology, Austria', 'P. Thiansathaporn': 'Department of Physics & Astronomy, University of North Carolina, Chapel Hill, United States', 'B. Hegedust': 'Institute of Computer Graphics, Vienna University of Technology, Austria', 'W.C. Flowers': 'Massachusetts Institute of Technology, United States', 'G. Turk': 'GVU Center, College of Computing, Georgia Institute of Technology, United States', 'P. Ermest': 'Philips Medical Systems, Best, Netherlands', 'T. Moller': 'Department Of Computer And Information Science, The Ohio State University, Columbus, Ohio, United States', 'K. Fostiropoulos': 'German National Research Centre for Information Technology, Germany', 'F. Sobieczky': 'University of Göttingen, Germany', 'W. Bertelheimer': 'Bayerische Motoren Werke AG (BMW) Corporation, Germany', } to_fill_df = DF[( DF['IEEE Author Affiliation'].isnull()) & ( DF['Raw Affiliation String'].isnull()) & ( DF['First Institution Name'].isnull()) ] rows_to_drop = DF.index[( DF['IEEE Author Affiliation'].isnull()) & ( DF['Raw Affiliation String'].isnull()) & ( DF['First Institution Name'].isnull()) ] concat_df_dropped = DF.drop(rows_to_drop) if concat_df_dropped.shape[0] + to_fill_df.shape[0] == DF.shape[0]: print('concat_df_dropped has correct row numbers') else: print('concat_df_dropped has incorrect row numbers') name_list = to_fill_df['IEEE Author Name'].tolist() kwargs = {'IEEE Author Affiliation' : lambda x: [fill_dict[i] for i in name_list]} to_fill_df = to_fill_df.assign(**kwargs) concat_df_filled = concat_df_dropped.append( to_fill_df, ignore_index=True).sort_values( by=['IEEE Paper Index', 'IEEE Author Position'], ).reset_index(drop=True) return concat_df_filled def recode_to_edu(DF): # df here is concat_df_filled # openalex got these institutions' type wrong. they should be education. 
edu_recode_list = [ 'Paris Diderot University', 'Paris Descartes University', 'École Polytechnique Fédérale de Lausanne', 'Johns Hopkins University School of Medicine' ] DF.loc[ DF['First Institution Name'].isin(edu_recode_list), 'First Institution Type' ] = 'education' return DF def get_alex_raw_string_correct(DF): # DF here is concat_df_filled """if openalex raw string is wrong, correct/update it with ieee author affliation """ openalex_raw_string_wrong = [ '10.1109/VISUAL.1999.809920', '10.1109/VISUAL.1999.809884', '10.1109/VISUAL.1993.398911', ] DF.loc[DF.DOI.isin(openalex_raw_string_wrong), 'Raw Affiliation String'] = DF.loc[ DF.DOI.isin(openalex_raw_string_wrong)]['IEEE Author Affiliation'] return DF def binary_type(row): if row['First Institution Type'] == 'education': binary_type = 'education' elif row['First Institution Type'] in [ 'facility', 'government', 'company', 'healthcare', 'archive', 'nonprofit','other' ]: binary_type = 'non-education' else: binary_type = np.NaN return binary_type def binary_type_by_hand(row): '''This is to transform type handcoded by me to binary type ''' if row['First Institution Type By Hand'] == 'education': binary_type = 'education' elif row['First Institution Type By Hand'] in [ 'facility', 'government', 'company', 'healthcare', 'archive', 'nonprofit', 'other', # just in case I have input these by hand: 'noneducation', 'non-education' ]: binary_type = 'non-education' else: binary_type = np.NaN return binary_type def add_binary_type(DF): # DF here is concat_df_filled DF['Binary Institution Type'] = DF.apply(binary_type, axis=1) DF['Binary Institution Type By Hand'] = DF.apply(binary_type_by_hand, axis=1) return DF def check_delete_rename(DF): # DF here is concat_df_filled # check paper index, author num, and author positions if DF['IEEE Paper Index'].tolist() == DF['Paper Index'].tolist(): print('Two paper index vectors are equal') else: print('Something wrong with paper index vectors') if DF['IEEE Number of Authors'].tolist() == DF['Number of Authors'].tolist(): print('Two author num vectors are equal') else: print('Something wrong with author num vectors') if DF['IEEE Author Position'].tolist() == DF['Author Position'].tolist(): print('Two author position vectors are equal') else: print('Something wrong with author position vectors\ , but this is expected as it indicates that the fuzzy matching above works.') # delete useless columns DF.drop(['Year', 'DOI', 'Title', 'IEEE Paper Index', 'Paper Index'], axis=1, inplace=True) # add a column called IEEE Author Affiliation Filled. It is bascially the same as # ieee author affiliation. 
The only difference is that if ieee is nan, # i get the data from openalex raw string DF['IEEE Author Affiliation Filled'] = np.where( DF['IEEE Author Affiliation'].notnull(), DF['IEEE Author Affiliation'], DF['Raw Affiliation String'], ) # rename columns DF.rename(columns={ 'IEEE Year': 'Year', 'IEEE DOI': 'DOI', 'IEEE Title': 'Title', 'IEEE Author Affiliation': 'IEEE Author Affiliation Updated', 'First Institution Name': 'First Institution Name Updated', 'Raw Affiliation String': 'Raw Affiliation String Updated', # 'First Institution Type': 'First Institution Type Updated', # 'First Institution Country Code': 'First Institution Country Code Updated', }, inplace=True) return DF def main(): ieee = update_ieee_orig(ieee_orig) diff_dois = get_diff_dois(ieee, alex) alex_new = get_alex_new(ieee, alex, diff_dois) ieee_sorted, alex_sorted = get_sorted_dfs(ieee, alex_new, papers) concat_df = get_concat_df(ieee_sorted, alex_sorted, papers) concat_df = update_with_vispubdata_author_data(vispd, concat_df) concat_df = update_concat_df(concat_df) manual_update_concat_df(concat_df) concat_df_filled = get_concat_df_filled(concat_df) concat_df_filled = recode_to_edu(concat_df_filled) concat_df_filled = get_alex_raw_string_correct(concat_df_filled) concat_df_filled = add_binary_type(concat_df_filled) concat_df_filled = check_delete_rename(concat_df_filled) return concat_df_filled if __name__ == '__main__': vispd = pd.read_csv(VISPUBDATA) doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA) ieee_orig = pd.read_csv(IEEE_AUTHOR) alex = pd.read_csv(OPENALEX_AUTHOR) papers = read_txt(PAPERS_TO_STUDY) df = main() df.to_csv(MERGED_AUTHOR_DF, index=False) |
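To see what these `update_*` helpers do in isolation, here is a minimal, self-contained sketch. The DOI, DataFrame, and values are invented for illustration; only the assignment pattern mirrors the script above:

    import pandas as pd

    # Toy frame mimicking three author rows of one (invented) paper.
    df = pd.DataFrame({
        'DOI': ['10.1109/EXAMPLE.1'] * 3,
        'First Institution Country Code': [None, 'US', None],
    })

    def update_country_code(DF, DOI, NEW_DATA):
        # Same pattern as above: assign a per-author list to all rows of one DOI.
        DF.loc[DF['DOI'] == DOI, 'First Institution Country Code'] = NEW_DATA
        return DF

    update_country_code(df, '10.1109/EXAMPLE.1', ['US'] * 3)
    print(df['First Institution Country Code'].tolist())  # ['US', 'US', 'US']

Note that the list on the right-hand side must have exactly as many elements as there are matching rows, or pandas raises a ValueError; that is why every call in the script hard-codes the paper's author count. The next script collects, for every VIS paper, the metadata of all papers that cite it from the OpenAlex API.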
import pandas as pd
import numpy as np
import requests
import random
import math
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import json

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CITATION_AUTHOR_DF = sys.argv[2]
OPENALEX_CITATION_CONCEPT_DF = sys.argv[3]
OPENALEX_CITATION_PAPER_DF = sys.argv[4]


def get_dicts(OPENALEX_PAPER_DF):  # vispd_openalex_match here is OPENALEX_PAPER_DF
    df = pd.read_csv(OPENALEX_PAPER_DF)
    dois = df['DOI'].tolist()
    urls = df['Citation API URL'].tolist()
    openalex_ids = df['OpenAlex ID'].tolist()
    years = df['Year'].tolist()
    titles = df['Title'].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    doi_url_dict = dict(zip(dois, urls))
    doi_openalexID_dict = dict(zip(dois, openalex_ids))
    return [dois, urls, doi_year_dict, doi_title_dict, doi_url_dict, doi_openalexID_dict]


def get_s():
    # set retries for status codes in [500, 502, 503, 504, 429]; also return headers
    s = requests.Session()
    retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504, 429])
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_concept_dict_list_from_concepts(doi, result, concepts):
    """returns a list of dicts"""
    openalex_year = result['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', result['id'])
    openalex_title = result['display_name']
    openalex_doi = result['doi']
    concept_dict_list = []
    num_concepts = len(concepts)
    for i in concepts:
        concept_index = concepts.index(i) + 1
        concept_name = i['display_name']
        openalex_concept_id = i['id']
        wikidata_url = i['wikidata']
        level = i['level']
        score = i['score']
        concept_dict = {
            # 'Ppaer' (sic) is kept below: it is the actual column name in the released CSVs
            'Cited Ppaer Year': doi_year_dict[doi],
            'Cited Paper DOI': doi,
            'Cited Paper Title': doi_title_dict[doi],
            'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
            'Citation Paper Year': openalex_year,
            'Citation Paper OpenAlex ID': openalex_id,
            'Citation Ppaer OpenAlex Title': openalex_title,
            'Citation Paper OpenAlex DOI': openalex_doi,
            'Number of Concepts': num_concepts,
            'Index of Concept': concept_index,
            'Concept': concept_name,
            'Concept ID': openalex_concept_id,
            'Wikidata': wikidata_url,
            'Level': level,
            'Score': score,
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def get_author_dict_list_from_authors(doi, result, authors):
    """returns a list of dicts"""
    openalex_year = result['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', result['id'])
    openalex_title = result['display_name']
    openalex_doi = result['doi']
    author_dict_list = []
    num_authors = len(authors)
    for i in authors:
        author = i['author']
        author_name = author['display_name']
        author_position = authors.index(i) + 1
        position_type = i['author_position']
        openalex_author_id = author['id']
        author_orcid = author['orcid']
        raw_affiliation_string = i['raw_affiliation_string']
        if len(i['institutions']) == 0:
            num_institutions = np.NaN
            first_institution = np.NaN
            institution_name = np.NaN
            institution_id = np.NaN
            ror = np.NaN
            country_code = np.NaN
            institution_type = np.NaN
        else:
            num_institutions = len(i['institutions'])
            first_institution = i['institutions'][0]
            # Check whether the institution object is empty. In the first citation
            # of 10.1109/TVCG.2007.70599, the first author's institution is empty,
            # which causes errors.
            if first_institution:
                institution_name = first_institution['display_name']
                institution_id = first_institution['id']
                ror = first_institution['ror']
                country_code = first_institution['country_code']
                institution_type = first_institution['type']
            else:
                institution_name = np.NaN
                institution_id = np.NaN
                ror = np.NaN
                country_code = np.NaN
                institution_type = np.NaN
        author_dict = {
            'Cited Ppaer Year': doi_year_dict[doi],
            'Cited Paper DOI': doi,
            'Cited Paper Title': doi_title_dict[doi],
            'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
            'Citation Paper Year': openalex_year,
            'Citation Paper OpenAlex ID': openalex_id,
            'Citation Ppaer OpenAlex Title': openalex_title,
            'Citation Paper OpenAlex DOI': openalex_doi,
            'Number of Authors': num_authors,
            'Author Name': author_name,
            'Author Position': author_position,
            'Author Position Type': position_type,
            'OpenAlex Author ID': openalex_author_id,
            'Author ORCID': author_orcid,
            'Number of Affiliations': num_institutions,
            'First Institution Name': institution_name,
            'Raw Affiliation String': raw_affiliation_string,
            'First Institution ID': institution_id,
            'First Institution ROR': ror,
            'First Institution Type': institution_type,
            'First Institution Country Code': country_code,
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_paper_dict_from_json_result(j, doi):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_year = j['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_doi = j['doi']
    openalex_publication_date = j['publication_date']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    # num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
    #              else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'Citation Paper OpenAlex ID': openalex_id,
        'Citation Paper OpenAlex Title': openalex_title,
        'Citation Paper OpenAlex DOI': openalex_doi,
        'Citation Paper OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        # 'Number of Pages': num_pages,
        'Number of References': num_references,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Number of Citations': num_citations,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    return paper_dict


def get_empty_dict_list(doi):
    dict_list = [{
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
    }]
    return dict_list


def get_empty_dict(doi):
    a_dict = {
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
    }
    return a_dict


def get_json_result(url, s, headers):
    """Retry on 404 or other error codes.

    This function guards against error codes. Every cited_by_api_url should
    return a status code of 200, which is why this recursive retry is safe.
    Also note that if the status code were 404, s.get(url).json() would raise,
    so there is no need to check the status code explicitly here.
    """
    try:
        j = s.get(url, headers=headers).json()
    except Exception:
        time.sleep(1)
        return get_json_result(url, s, headers)
    else:
        return j


def main(DOIS, s, headers):
    for doi in DOIS:
        # make sure the api url is not NaN (NaN != NaN):
        if doi_url_dict[doi] == doi_url_dict[doi]:
            url = doi_url_dict[doi] + '&per-page=50'
            j0 = get_json_result(url, s, headers)
            count = j0['meta']['count']
            per_page = 50
            total_pages = math.ceil(count / per_page)
            # check whether the results are empty
            if count > 0:
                # for every page
                for i in range(1, total_pages + 1):
                    list_of_concept_dict_lists = []
                    list_of_author_dict_lists = []
                    paper_dict_list = []
                    j = get_json_result(url + f'&page={i}', s, headers=headers)
                    results = j['results']
                    # for every result in a page
                    for result in results:
                        concepts = result['concepts']
                        authors = result['authorships']
                        concept_dict_list = get_concept_dict_list_from_concepts(doi, result, concepts)
                        author_dict_list = get_author_dict_list_from_authors(doi, result, authors)
                        paper_dict = get_paper_dict_from_json_result(result, doi)
                        list_of_concept_dict_lists.append(concept_dict_list)
                        list_of_author_dict_lists.append(author_dict_list)
                        paper_dict_list.append(paper_dict)
                    lists_concepts.append(list_of_concept_dict_lists)
                    lists_authors.append(list_of_author_dict_lists)
                    list_of_paper_dict_lists.append(paper_dict_list)
                    time.sleep(0.2)
            # if the results are empty:
            else:
                list_of_concept_dict_lists = []
                list_of_author_dict_lists = []
                paper_dict_list = []
                concept_dict_list = get_empty_dict_list(doi)
                author_dict_list = get_empty_dict_list(doi)
                paper_dict = get_empty_dict(doi)
                list_of_concept_dict_lists.append(concept_dict_list)
                list_of_author_dict_lists.append(author_dict_list)
                paper_dict_list.append(paper_dict)
                lists_concepts.append(list_of_concept_dict_lists)
                lists_authors.append(list_of_author_dict_lists)
                list_of_paper_dict_lists.append(paper_dict_list)
        else:
            list_of_concept_dict_lists = []
            list_of_author_dict_lists = []
            paper_dict_list = []
            concept_dict_list = get_empty_dict_list(doi)
            author_dict_list = get_empty_dict_list(doi)
            paper_dict = get_empty_dict(doi)
            list_of_concept_dict_lists.append(concept_dict_list)
            list_of_author_dict_lists.append(author_dict_list)
            paper_dict_list.append(paper_dict)
            lists_concepts.append(list_of_concept_dict_lists)
            lists_authors.append(list_of_author_dict_lists)
            list_of_paper_dict_lists.append(paper_dict_list)
        print(f'{DOIS.index(doi) + 1} is done')
        time.sleep(0.5)


if __name__ == '__main__':
    # I don't need to worry about papers having no citations: even if a paper has
    # no citations, it still has a cited_by_api_url, and the result count at that
    # URL is simply zero. main() handles this case.
    dois, urls, doi_year_dict, doi_title_dict, doi_url_dict, doi_openalexID_dict = get_dicts(OPENALEX_PAPER_DF)
    random_dois = random.sample(dois, 10)  # small sample for quick testing (unused in the full run)
    lists_concepts = []  # list of lists of concept dict lists
    lists_authors = []  # list of lists of author dict lists
    list_of_paper_dict_lists = []  # list of paper dict lists
    s, headers = get_s()
    main(dois, s, headers)
    author_df_initiate = pd.DataFrame()
    concept_df_initiate = pd.DataFrame()

    def build_df_from_lists(lists, df):
        for i in lists:
            df1 = pd.concat([pd.DataFrame(l) for l in i], ignore_index=True)
            df = df.append(df1, ignore_index=True)
        return df

    author_df = build_df_from_lists(lists_authors, author_df_initiate)
    concept_df = build_df_from_lists(lists_concepts, concept_df_initiate)
    paper_df = pd.concat([pd.DataFrame(l) for l in list_of_paper_dict_lists], ignore_index=True)
    author_df.to_csv(OPENALEX_CITATION_AUTHOR_DF, index=False)
    concept_df.to_csv(OPENALEX_CITATION_CONCEPT_DF, index=False)
    paper_df.to_csv(OPENALEX_CITATION_PAPER_DF, index=False)
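The paging logic inside `main()` can be isolated as a small generator. This is a minimal sketch assuming only `requests` and `math`; it deliberately omits the session retries, headers, and sleep calls the script uses, and the example filter URL follows the `cited_by_api_url` format that OpenAlex returns (the work ID is the one used as an example earlier in this README):

    import math
    import requests

    def fetch_all_citing_works(cited_by_api_url, per_page=50):
        """Walk every page of an OpenAlex cited_by_api_url, yielding raw work records."""
        first = requests.get(f'{cited_by_api_url}&per-page={per_page}').json()
        total_pages = math.ceil(first['meta']['count'] / per_page)
        for page in range(1, total_pages + 1):
            j = requests.get(f'{cited_by_api_url}&per-page={per_page}&page={page}').json()
            yield from j['results']

    # Example:
    # for work in fetch_all_citing_works('https://api.openalex.org/works?filter=cites:W3203914472'):
    #     print(work['id'])

When `meta.count` is zero, `total_pages` is 0 and the loop body never runs, which is exactly the "empty results" case the script handles with its empty dicts. The next script is the upstream step: it queries OpenAlex for each VIS paper itself (by title, falling back to DOI) and writes the paper, author, concept, and reference data files.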
import pandas as pd
import numpy as np
import requests
import random
import math
import csv
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
OPENALEX_AUTHOR_DF = sys.argv[4]
OPENALEX_CONCEPT_DF = sys.argv[5]
OPENALEX_REFERENCE_DF = sys.argv[6]
TITLE_QUERY_EMPTY_DOI_QUERY_404_DFS = sys.argv[7]
TITLE_QUERY_404_DFS = sys.argv[8]
DOI_QUERY_404_DFS = sys.argv[9]


def read_txt(INPUT):
    """read a txt file and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def get_dicts(VISPUBDATA_PLUS):
    # get doi_year_dict and doi_title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]


def get_concept_dict_list_from_concepts(doi, concepts):
    """returns a list of dicts"""
    concept_dict_list = []
    num_concepts = len(concepts)
    # first check whether the list of concepts is empty:
    if concepts:
        for i in concepts:
            concept_index = concepts.index(i) + 1
            concept_name = i['display_name']
            openalex_concept_id = i['id']
            wikidata_url = i['wikidata']
            level = i['level']
            score = i['score']
            concept_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of Concepts': num_concepts,
                'Index of Concept': concept_index,
                'Concept': concept_name,
                'Concept ID': openalex_concept_id,
                'Wikidata': wikidata_url,
                'Level': level,
                'Score': score,
            }
            concept_dict_list.append(concept_dict)
    # if the concept list is empty, 'Number of Concepts' will be NaN
    else:
        concept_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def get_reference_dict_list_from_referenced_works(doi, referenced_works):
    reference_dict_list = []
    num_references = len(referenced_works)
    # first check whether the list of referenced works is empty
    if referenced_works:
        for i in referenced_works:
            reference_index = referenced_works.index(i) + 1
            reference_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of References': num_references,
                'Index of Reference': reference_index,
                'Reference': i,
            }
            reference_dict_list.append(reference_dict)
    # if the references list is empty, 'Number of References' will be NaN
    else:
        reference_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        reference_dict_list.append(reference_dict)
    return reference_dict_list


def get_author_dict_list_from_authors(doi, authors):
    """returns a list of dicts"""
    author_dict_list = []
    num_authors = len(authors)
    # first check whether the authors list is empty
    if authors:
        for i in authors:
            author = i['author']
            author_name = author['display_name']
            author_position = authors.index(i) + 1
            position_type = i['author_position']
            openalex_author_id = author['id']
            author_orcid = author['orcid']
            raw_affiliation_string = i['raw_affiliation_string']
            if len(i['institutions']) == 0:
                num_institutions = np.NaN
                first_institution = np.NaN
                institution_name = np.NaN
                institution_id = np.NaN
                ror = np.NaN
                country_code = np.NaN
                institution_type = np.NaN
            else:
                num_institutions = len(i['institutions'])
                first_institution = i['institutions'][0]
                institution_name = first_institution['display_name']
                institution_id = first_institution['id']
                ror = first_institution['ror']
                country_code = first_institution['country_code']
                institution_type = first_institution['type']
            author_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of Authors': num_authors,
                'Author Name': author_name,
                'Author Position': author_position,
                'Author Position Type': position_type,
                'OpenAlex Author ID': openalex_author_id,
                'Author ORCID': author_orcid,
                'Number of Affiliations': num_institutions,
                'First Institution Name': institution_name,
                'Raw Affiliation String': raw_affiliation_string,
                'First Institution ID': institution_id,
                'First Institution ROR': ror,
                'First Institution Type': institution_type,
                'First Institution Country Code': country_code,
            }
            author_dict_list.append(author_dict)
    # if the authors list is empty, 'Number of Authors' will be NaN
    else:
        author_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_paper_dict_from_json_result(j, doi):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_publication_date = j['publication_date']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
                 else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        'Number of Pages': num_pages,
        'Number of References': num_references,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Number of Citations': num_citations,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    return paper_dict


def get_empty_dict_list(doi):
    dict_list = [{
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }]
    return dict_list


def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict


def get_title_query_response(doi):
    title = doi_title_dict[doi]
    title_to_query = re.sub(r'\:|\?|\&|\,', '', title)
    response = requests.get('https://api.openalex.org/works?filter=title.search:' + title_to_query)
    return response, title_to_query


def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count


def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response


def get_data(doi, doi_index):
    # If the doi is not in to_query_by_doi, query by title first.
    if doi not in to_query_by_doi:
        response = get_title_query_response(doi)[0]
        # If response.status_code is in retry_code, something went wrong: sleep for a
        # while and try again. Note that if the status code is 404, the doi is recorded
        # as having no match (see below, status_code != 200) rather than retried.
        while response.status_code in retry_code:
            print(f'Title query has errors for {doi_index} : {doi_title_dict[doi]}. Error status code is {response.status_code}. Retrying...')
            time.sleep(3)
            response = get_title_query_response(doi)[0]
        # if the title query succeeds:
        if response.status_code == 200:
            # get the json and check the results count:
            j, count = check_results_count(response)
            # if the count is non-zero:
            if count > 0:
                # If the doi is not in special_result_index_dict, use the result at
                # index 0; otherwise, use the index recorded for that doi.
                if doi not in special_result_index_dict:
                    correct_result = j['results'][0]
                else:
                    correct_result = j['results'][special_result_index_dict[doi]]
                authors = correct_result['authorships']
                concepts = correct_result['concepts']
                referenced_works = correct_result['referenced_works']
                paper_dict = get_paper_dict_from_json_result(correct_result, doi)
                author_dict_list = get_author_dict_list_from_authors(doi, authors)
                concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
                reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
            # if the count is zero, query by doi instead
            else:
                response2 = get_doi_query_response(doi)
                # if the status code is in retry_code, retry
                while response2.status_code in retry_code:
                    print(f'doi query has error for {doi_index} : {doi}, error status code is {response2.status_code}, retrying...')
                    time.sleep(3)
                    response2 = get_doi_query_response(doi)
                # if the doi query succeeds:
                if response2.status_code == 200:
                    j2 = response2.json()
                    authors = j2['authorships']
                    concepts = j2['concepts']
                    referenced_works = j2['referenced_works']
                    paper_dict = get_paper_dict_from_json_result(j2, doi)
                    author_dict_list = get_author_dict_list_from_authors(doi, authors)
                    concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
                    reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
                # if the doi query fails, record the doi as "title query empty, doi query 404"
                else:
                    error_status_code.append(response2.status_code)
                    title_query_empty_doi_query_404_list.append(doi)
                    paper_dict = get_empty_paper_dict(doi)
                    author_dict_list = get_empty_dict_list(doi)
                    concept_dict_list = get_empty_dict_list(doi)
                    reference_dict_list = get_empty_dict_list(doi)
                    print(f'doi query fails for {doi_index} : {doi}')
        # If the title query itself fails (most likely status code 404), which is very
        # unlikely, add the doi to title_query_404_list.
        else:
            title_query_404_list.append(doi)
            error_status_code.append(response.status_code)
            paper_dict = get_empty_paper_dict(doi)
            author_dict_list = get_empty_dict_list(doi)
            concept_dict_list = get_empty_dict_list(doi)
            reference_dict_list = get_empty_dict_list(doi)
            print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
    # if the doi is in to_query_by_doi, use the doi query
    else:
        response0 = get_doi_query_response(doi)
        # if the status code is in retry_code, retry
        while response0.status_code in retry_code:
            print(f'doi query for {doi_index} : {doi} has error, status code is {response0.status_code}, retrying...')
            time.sleep(3)
            response0 = get_doi_query_response(doi)
        # if the doi query succeeds:
        if response0.status_code == 200:
            j0 = response0.json()
            authors = j0['authorships']
            concepts = j0['concepts']
            referenced_works = j0['referenced_works']
            paper_dict = get_paper_dict_from_json_result(j0, doi)
            author_dict_list = get_author_dict_list_from_authors(doi, authors)
            concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
            reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
        # if the doi query fails, add the doi to doi_query_404_list
        else:
            error_status_code.append(response0.status_code)
            doi_query_404_list.append(doi)
            paper_dict = get_empty_paper_dict(doi)
            author_dict_list = get_empty_dict_list(doi)
            concept_dict_list = get_empty_dict_list(doi)
            reference_dict_list = get_empty_dict_list(doi)
            print(f'doi query fails for {doi_index} : {doi}')
    list_of_paper_dicts.append(paper_dict)
    list_of_author_dict_lists.append(author_dict_list)
    list_of_concept_dict_lists.append(concept_dict_list)
    list_of_reference_dict_lists.append(reference_dict_list)


def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_data(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))


if __name__ == '__main__':
    papers_to_study = read_txt(PAPERS_TO_STUDY)
    random_papers_to_study = random.sample(papers_to_study, 10)  # small sample for quick testing (unused in the full run)
    doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
    list_of_paper_dicts = []
    list_of_author_dict_lists = []
    list_of_concept_dict_lists = []
    list_of_reference_dict_lists = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    doi_query_404_list = []
    retry_code = [500, 502, 503, 504, 429]
    error_status_code = []
    to_query_by_doi = [
        '10.1109/VISUAL.2001.964489', '10.1109/VISUAL.1996.568113', '10.1109/VISUAL.1999.809896',
        '10.1109/VISUAL.1991.175771', '10.1109/VISUAL.1998.745302', '10.1109/VISUAL.1993.398868',
        '10.1109/INFVIS.2005.1532128', '10.1109/VISUAL.1993.398859', '10.1109/VISUAL.1991.175795',
        '10.1109/VISUAL.2003.1250401', '10.1109/VISUAL.1991.175789', '10.1109/VISUAL.2000.885739',
        '10.1109/TVCG.2014.2346922', '10.1109/VISUAL.1999.809871', '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.2000.885692', '10.1109/VISUAL.1991.175777', '10.1109/VISUAL.1998.745315',
        '10.1109/VISUAL.1997.663909', '10.1109/VISUAL.2000.885697', '10.1109/VISUAL.2001.964504',
        '10.1109/TVCG.2006.168', '10.1109/TVCG.2007.70617', '10.1109/VISUAL.1997.663910',
        '10.1109/VISUAL.1997.663931', '10.1109/VISUAL.2002.1183792', '10.1109/VISUAL.1992.235201',
        '10.1109/VISUAL.1996.568128', '10.1109/VISUAL.1997.663923', '10.1109/VAST.2011.6102441',
        '10.1109/VISUAL.2000.885732', '10.1109/VISUAL.2001.964522', '10.1109/VISUAL.2005.1532812',
        '10.1109/VISUAL.1998.745350', '10.1109/INFVIS.2001.963282', '10.1109/VISUAL.1995.480804',
        '10.1109/VISUAL.2005.1532847', '10.1109/INFVIS.1996.559229', '10.1109/VISUAL.2000.885738',
        '10.1109/VISUAL.1991.175800', '10.1109/VISUAL.1993.398865', '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348', '10.1109/VISUAL.1993.398867', '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900', '10.1109/VISUAL.1992.235181', '10.1109/VISUAL.1992.235195',
        '10.1109/VISUAL.2000.885719', '10.1109/VISUAL.1991.175816', '10.1109/VISUAL.1990.146414',
        '10.1109/VISUAL.1993.398861', '10.1109/VISUAL.1993.398872', '10.1109/VISUAL.1994.346292',
        '10.1109/VISUAL.1994.346295', '10.1109/VISUAL.1994.346297', '10.1109/VISUAL.1994.346301',
        '10.1109/VISUAL.1999.809913', '10.1109/VISUAL.2001.964546', '10.1109/VISUAL.2003.1250404',
        '10.1109/TVCG.2014.2346442', '10.1109/TVCG.2020.3028948', '10.1109/TVCG.2020.3030363',
        '10.1109/TVCG.2020.3030364', '10.1109/tvcg.2021.3114784', '10.1109/tvcg.2021.3114780',
        '10.1109/tvcg.2021.3114782', '10.1109/tvcg.2021.3114783', '10.1109/tvcg.2021.3114836',
        '10.1109/TVCG.2021.3064037', '10.1109/TVCG.2021.3114849', '10.1109/TVCG.2021.3114842',
        '10.1109/TVCG.2021.3114766', '10.1109/TVCG.2021.3114777',
    ]
    special_result_index_dict = {
        '10.1109/VISUAL.1992.235194': 4,
    }
    main(papers_to_study)
    paper_df = pd.DataFrame(list_of_paper_dicts)
    author_df = pd.concat([pd.DataFrame(l) for l in list_of_author_dict_lists], ignore_index=True)
    concept_df = pd.concat([pd.DataFrame(l) for l in list_of_concept_dict_lists], ignore_index=True)
    reference_df = pd.concat([pd.DataFrame(l) for l in list_of_reference_dict_lists], ignore_index=True)
    paper_df.to_csv(OPENALEX_PAPER_DF, index=False)
    author_df.to_csv(OPENALEX_AUTHOR_DF, index=False)
    concept_df.to_csv(OPENALEX_CONCEPT_DF, index=False)
    reference_df.to_csv(OPENALEX_REFERENCE_DF, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_DFS, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_DFS, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
    with open(DOI_QUERY_404_DFS, 'w') as f:
        for doi in doi_query_404_list:
            f.write("%s\n" % doi)
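The title-then-DOI fallback at the heart of `get_data()` can be sketched compactly. This is a simplified illustration, not the script's exact code: the real version also strips punctuation from titles, retries transient status codes, and records every failure:

    import requests

    def find_openalex_work(doi, title):
        """Title search first; fall back to a DOI lookup when the search is empty.
        Same fallback order as get_data() above, without the retry bookkeeping."""
        r = requests.get('https://api.openalex.org/works',
                         params={'filter': f'title.search:{title}'})
        if r.status_code == 200 and r.json()['meta']['count'] > 0:
            return r.json()['results'][0]
        r = requests.get(f'https://api.openalex.org/works/doi:{doi}')
        return r.json() if r.status_code == 200 else None

The title search is tried first because it returns the same result object as a DOI lookup while tolerating DOI mismatches; the hand-curated `to_query_by_doi` list covers the cases where the title search is known to return the wrong paper. The next script fetches metadata for every unique referenced work collected above.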
import pandas as pd
import numpy as np
import requests
import random
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

OPENALEX_REFERENCE_DF = sys.argv[1]
OPENALEX_REFERENCE_PAPER_DF_UNIQUE = sys.argv[2]
OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE = sys.argv[3]
OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE = sys.argv[4]
OPENALEX_REFERENCE_PAPER_DF = sys.argv[5]
OPENALEX_REFERENCE_AUTHOR_DF = sys.argv[6]
OPENALEX_REFERENCE_CONCEPT_DF = sys.argv[7]
OPENALEX_REFERENCE_ERROR_DF = sys.argv[8]


def get_unique_ref_urls(ref_df):  # ref_df here is OPENALEX_REFERENCE_DF
    # returns the dataframe and a list of unique reference paper urls
    ref = pd.read_csv(ref_df).dropna(subset=['Number of References'])
    unique_ref_urls = list(set(ref.Reference.tolist()))
    return ref, unique_ref_urls


def get_s():
    # set retries for status codes in [500, 502, 503, 504, 429]; also return headers
    s = requests.Session()
    retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504, 429])
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_paper_dict_from_json_result(j, url, paper_dict_list):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_publication_date = j['publication_date']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    # num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
    #              else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Reference': re.sub('//api.', '//', url),
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        # 'Number of Pages': num_pages,
        'Number of References for Reference paper': num_references,
        'Number of Citations': num_citations,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    paper_dict_list.append(paper_dict)
    return paper_dict_list


def get_author_dict_list_from_authors(j, url, author_dict_list):
    """returns a list of dicts"""
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    authors = j['authorships']
    num_authors = len(authors)
    for i in authors:
        author = i['author']
        author_name = author['display_name']
        author_position = authors.index(i) + 1
        position_type = i['author_position']
        openalex_author_id = author['id']
        author_orcid = author['orcid']
        raw_affiliation_string = i['raw_affiliation_string']
        if len(i['institutions']) == 0:
            num_institutions = np.NaN
            first_institution = np.NaN
            institution_name = np.NaN
            institution_id = np.NaN
            ror = np.NaN
            country_code = np.NaN
            institution_type = np.NaN
        else:
            num_institutions = len(i['institutions'])
            first_institution = i['institutions'][0]
            institution_name = first_institution['display_name']
            institution_id = first_institution['id']
            ror = first_institution['ror']
            country_code = first_institution['country_code']
            institution_type = first_institution['type']
        author_dict = {
            'Reference': re.sub('//api.', '//', url),
            'Reference OpenAlex Year': openalex_year,
            'Reference OpenAlex ID': openalex_id,
            'Reference OpenAlex Title': openalex_title,
            'Number of Authors': num_authors,
            'Author Name': author_name,
            'Author Position': author_position,
            'Author Position Type': position_type,
            'OpenAlex Author ID': openalex_author_id,
            'Author ORCID': author_orcid,
            'Number of Affiliations': num_institutions,
            'First Institution Name': institution_name,
            'Raw Affiliation String': raw_affiliation_string,
            'First Institution ID': institution_id,
            'First Institution ROR': ror,
            'First Institution Type': institution_type,
            'First Institution Country Code': country_code,
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_concept_dict_list_from_concepts(j, url, concept_dict_list):
    """returns a list of dicts"""
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    concepts = j['concepts']
    num_concepts = len(concepts)
    for i in concepts:
        concept_index = concepts.index(i) + 1
        concept_name = i['display_name']
        openalex_concept_id = i['id']
        wikidata_url = i['wikidata']
        level = i['level']
        score = i['score']
        concept_dict = {
            'Reference': re.sub('//api.', '//', url),
            'Reference OpenAlex Year': openalex_year,
            'Reference OpenAlex ID': openalex_id,
            'Reference OpenAlex Title': openalex_title,
            'Number of Concepts': num_concepts,
            'Index of Concept': concept_index,
            'Concept': concept_name,
            'Concept ID': openalex_concept_id,
            'Wikidata': wikidata_url,
            'Level': level,
            'Score': score,
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def main(URLS, s, headers):
    for url in URLS:
        url_index = URLS.index(url) + 1
        api_url = re.sub('https://', 'https://api.', url)
        response = s.get(api_url, headers=headers)
        # If response.status_code is in retry_code, something went wrong:
        # sleep for a while and try again. If the status code is 404, the url
        # is caught below and put into error_url_dict_list instead.
        while response.status_code in retry_code:
            print(f'doi query {url_index} : {api_url} has error, status code is {response.status_code}, retrying...')
            time.sleep(3)
            response = s.get(api_url, headers=headers)
        # If the error code is 404, response.json() below fails and that url is NOT
        # included in the paper, author, or concept lists; instead it goes into
        # error_url_dict_list. This is not a problem because later, when merging
        # with REF, the merged file simply shows NaN for 'Number of Concepts' etc.
        # In fact, even if empty dicts were created for urls with 404 status codes,
        # the final merged output would be the same.
        try:
            j = response.json()
            get_paper_dict_from_json_result(j, url, paper_dict_list)
            get_author_dict_list_from_authors(j, url, author_dict_list)
            get_concept_dict_list_from_concepts(j, url, concept_dict_list)
            print(f'{url_index} / {len(URLS)} is done')
        except Exception:
            error_url_dict = {
                'Error URL': url,
                'Error Status Code': response.status_code,
            }
            error_url_dict_list.append(error_url_dict)
            print(f'{url} : {response.status_code}')
        time.sleep(0.5)


if __name__ == '__main__':
    s, headers = get_s()
    # REF is openalex_reference_df with the rows whose 'Number of References' is missing dropped
    REF, URLS = get_unique_ref_urls(OPENALEX_REFERENCE_DF)
    random_urls = URLS[0:11]  # small slice for quick testing (unused in the full run)
    paper_dict_list = []
    author_dict_list = []
    concept_dict_list = []
    error_url_dict_list = []
    retry_code = [500, 502, 503, 504, 429]
    main(URLS, s, headers)
    paper_df = pd.DataFrame(paper_dict_list)
    author_df = pd.DataFrame(author_dict_list)
    concept_df = pd.DataFrame(concept_dict_list)
    error_df = pd.DataFrame(error_url_dict_list)
    ref_paper_df = REF.merge(paper_df, on="Reference", how='left')
    ref_author_df = REF.merge(author_df, on="Reference", how='left')
    ref_concept_df = REF.merge(concept_df, on="Reference", how='left')
    paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF_UNIQUE, index=False)
    author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE, index=False)
    concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE, index=False)
    ref_paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF, index=False)
    ref_author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF, index=False)
    ref_concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF, index=False)
    error_df.to_csv(OPENALEX_REFERENCE_ERROR_DF, index=False)
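The one non-obvious step here is the URL rewriting in `main()`: the reference data stores each work as a canonical `https://openalex.org/W...` URL, which is turned into an API request by inserting the `api.` subdomain. A sketch of just that step, mirroring the substitution the script uses (error handling and retries omitted):

    import re
    import requests

    def fetch_reference_work(reference_url):
        """Rewrite a stored work URL ('https://openalex.org/W...') into its API
        form and fetch the JSON record, as done in main() above."""
        api_url = re.sub('https://', 'https://api.', reference_url)
        r = requests.get(api_url)
        return r.json() if r.status_code == 200 else None

    # Example, using the work ID shown earlier in this README:
    # fetch_reference_work('https://openalex.org/W3203914472')

The next, much shorter script builds the list of papers to study by removing two problem DOIs from the cleaned vispubdata list.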
`scripts/get_papers_to_study.py`:

```python
import pandas as pd
import csv
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
PAPERS_TO_STUDY = sys.argv[2]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_papers_to_study(INPUT):
    # INPUT here is vispd_plus_good_papers
    vispd_plus_good_papers = read_txt(INPUT)
    to_exclude_from_analysis = [
        '10.1109/VISUAL.1990.146412',   # this one simply cannot be found by either title or doi query
        '10.1109/VISUAL.2003.1250379',  # this one is a wrong match and I can't find a way to locate it on openalex
    ]
    papers_to_study = [
        x for x in vispd_plus_good_papers if x not in to_exclude_from_analysis
    ]
    return papers_to_study

papers_to_study = get_papers_to_study(VISPD_PLUS_GOOD_PAPERS)

with open(PAPERS_TO_STUDY, 'w') as f:
    for doi in papers_to_study:
        f.write("%s\n" % doi)
```
`scripts/get_titles_2021.py`:

```python
import sys
import pandas as pd
import requests
from bs4 import BeautifulSoup

TITLES_2021 = sys.argv[1]

def get_page(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    page = soup.find('article')
    return page

page = get_page('http://ieeevis.org/year/2021/info/papers-sessions')

def get_all_title_str(page):
    """all_title_str contains both full and short papers' titles"""
    strong_elements = page.find_all('strong')
    time_str_elements = [
        x for x in strong_elements if 'CDT' in x.string or 'October' in x.string
    ]
    all_title_str = [x.string for x in strong_elements if x not in time_str_elements]
    return all_title_str

all_title_str = get_all_title_str(page)

def get_str_to_exclude(page):
    """obtain the list of short paper titles

    First, I obtain both 'strong' and 'em' elements.
    Then, I obtain the index of each line that contains 'Short Papers:'.
    That serves as the "starting index" later.
    For each line that contains 'Short Papers:', I obtain the index of the
    immediately following line that contains 'Session Chair:'. That index
    serves as the "end index".
    For each "start" and "end" pair, I get the elements in between and
    extract their strings. These include all short papers' titles.
    """
    strong_and_em = page.find_all(['strong', 'em'])
    short_paper_em_idx = [
        strong_and_em.index(i) for i in strong_and_em if 'Short Papers:' in i.string
    ]
    session_chair_em_idx = [
        strong_and_em.index(i) for i in strong_and_em if 'Session Chair:' in i.string
    ]
    end_idx_list = []
    for idx in short_paper_em_idx:
        end_idx = session_chair_em_idx.index(idx+1)
        end_idx_list.append(session_chair_em_idx[end_idx+1])
    start_end_dic = dict(zip(short_paper_em_idx, end_idx_list))
    str_to_exclude_list = []
    for start in start_end_dic.keys():
        to_exclude = strong_and_em[start:start_end_dic[start]]
        str_to_exclude = [x.string for x in to_exclude]
        str_to_exclude_list.append(str_to_exclude)
    str_to_exclude_list_flattened = [
        item for sublist in str_to_exclude_list for item in sublist
    ]
    return str_to_exclude_list_flattened

str_to_exclude = get_str_to_exclude(page)
title_str = [x for x in all_title_str if x not in str_to_exclude]
title_str.remove(
    'Jurassic Mark: Inattentional Blindness for a Datasaurus Reveals that Visualizations are Explored, not Seen'
)

# This paper changed its title for publication on TVCG
title_replace_dict = {
    'IRVINE: Using Interactive Clustering and Labeling to Analyze Correlation Patterns: A Design Study from the Manufacturing of Electrical Engines':
    'IRVINE: A Design Study on Analyzing Correlation Patterns of Electrical Engines',
}

def replace_title(TITLES, DIC):
    for i, n in enumerate(TITLES):
        if n in DIC.keys():
            TITLES[i] = DIC[n]
    return TITLES

title_str = replace_title(title_str, title_replace_dict)

if len(title_str) == 170:
    print('title_str has 170 elements. everything correct')
else:
    print('something is wrong. the length of title_str is not 170')

df = pd.DataFrame(title_str, columns=['title'])
df.to_csv(TITLES_2021, index=False)
```
`scripts/get_vispd_openalex_match_1.py`:

```python
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_1 = sys.argv[3]
TITLE_QUERY_EMPTY_DOI_QUERY_404_1 = sys.argv[4]
TITLE_QUERY_404_1 = sys.argv[5]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]

# def get_s():
#     # set retry if status codes in [500, 502, 503, 504, 429]
#     # also return headers
#     s = requests.Session()
#     retries = Retry(total=5,
#                     backoff_factor=0.1,
#                     status_forcelist=[500, 502, 503, 504, 429],
#                     )
#     s.mount('http://', HTTPAdapter(max_retries=retries))
#     headers = {
#         "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
#         'Accept': 'application/json',
#     }
#     return s, headers

def get_title_query_response(doi):
    title_original = doi_title_dict[doi]
    title = re.sub(r'\:|\?|\&|\,', '', title_original)
    response = requests.get(
        'https://api.openalex.org/works?filter=title.search:' + title)
    return response

def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count

def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response

def get_paper_dict_from_json_result(j, doi):
    openalex_id = j['id']
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue = venue['id']
    openalex_url = venue['url']
    openalex_journal = venue['display_name']
    openalex_publisher = venue['publisher']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue': openalex_venue,
        'OpenAlex Journal': openalex_journal,
        'OpenAlex Publisher': openalex_publisher,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
    }
    return paper_dict

def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict

def get_paper_dict_list(doi, doi_index):
    # query title first:
    response = get_title_query_response(doi)
    while response.status_code in retry_code:
        print(f'title query for {doi_index} : {doi} has error. Error status code is {response.status_code}. Retrying...')
        time.sleep(1)
        response = get_title_query_response(doi)
    # if title query succeeds:
    if response.status_code == 200:
        # get json and check results count:
        j = check_results_count(response)[0]
        count = check_results_count(response)[1]
        # if count is non-zero:
        if count > 0:
            first_result = j['results'][0]
            paper_dict = get_paper_dict_from_json_result(first_result, doi)
        # if count is zero, use doi query instead
        else:
            # get doi query response:
            response2 = get_doi_query_response(doi)
            while response2.status_code in retry_code:
                print(f'doi query for {doi_index} : {doi} has error. Error status code is {response2.status_code}. Retrying...')
                time.sleep(1)
                response2 = get_doi_query_response(doi)
            # if doi query succeeds:
            if response2.status_code == 200:
                j2 = response2.json()
                paper_dict = get_paper_dict_from_json_result(j2, doi)
            # empty title query, and 404 for doi query:
            else:
                error_status_code.append(response2.status_code)
                title_query_empty_doi_query_404_list.append(doi)
                paper_dict = get_empty_paper_dict(doi)
                print(f'doi query is not successful for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')
    # if title query fails:
    else:
        title_query_404_list.append(doi)
        error_status_code.append(response.status_code)
        # error_status_code.append([doi, response.status_code])
        paper_dict = get_empty_paper_dict(doi)
        print(f'title query is not successful for {doi_index} : {doi_title_dict[doi]}')
    paper_dict_list.append(paper_dict)

def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_paper_dict_list(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))

if __name__ == '__main__':
    # note on 2022-01-21: this is not a bug, but it might be error-prone:
    # I define variables here and then use them directly inside `main`
    # without passing them as parameters, e.g., `main(vispd_plus_good_papers, s)`.
    # It works, but as I said, it might be error-prone.
    vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
    doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
    doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
    retry_code = [500, 502, 503, 504, 429]
    paper_dict_list = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    error_status_code = []
    # s = get_s()[0]
    # headers = get_s()[1]
    main(vispd_plus_good_papers)
    paper_df = pd.DataFrame(paper_dict_list)
    paper_df.to_csv(VISPD_OPENALEX_MATCH_1, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_1, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_1, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
```
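The matching logic boils down to two OpenAlex endpoints: a title-search filter and a direct DOI lookup. A minimal sketch of both calls, using a DOI that appears in this repository's data (the title string here is illustrative only):

```python
import requests

# title search: returns a result list plus a count in 'meta'
r1 = requests.get('https://api.openalex.org/works',
                  params={'filter': 'title.search:Visualization'})
print(r1.json()['meta']['count'])  # number of matching works

# direct DOI lookup: returns a single work object
r2 = requests.get('https://api.openalex.org/works/doi:10.1109/TVCG.2006.168')
print(r2.json()['id'])  # canonical OpenAlex ID, e.g. 'https://openalex.org/W...'
```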
`scripts/get_vispd_openalex_match_2.py`:

```python
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_2 = sys.argv[3]
TITLE_QUERY_EMPTY_DOI_QUERY_404_2 = sys.argv[4]
TITLE_QUERY_404_2 = sys.argv[5]
DOI_QUERY_404_2 = sys.argv[6]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]

def get_title_query_response(doi):
    title_original = doi_title_dict[doi]
    title = re.sub(r'\:|\?|\&|\,', '', title_original)
    response = requests.get(
        'https://api.openalex.org/works?filter=title.search:' + title)
    return response

def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count

def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response

def get_paper_dict_from_json_result(j, doi):
    openalex_id = j['id']
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue = venue['id']
    openalex_url = venue['url']
    openalex_journal = venue['display_name']
    openalex_publisher = venue['publisher']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue': openalex_venue,
        'OpenAlex Journal': openalex_journal,
        'OpenAlex Publisher': openalex_publisher,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
    }
    return paper_dict

def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict

def update_paper_dict_list(doi, doi_index):
    if doi not in to_query_by_doi:
        # query title first:
        response = get_title_query_response(doi)
        # if status code is in retry_code, retry:
        while response.status_code in retry_code:
            print(f'title query for {doi_index} : {doi} is having errors, error status code is {response.status_code}, retrying...')
            time.sleep(1)
            response = get_title_query_response(doi)
        # if title query succeeds:
        if response.status_code == 200:
            # get json and check results count:
            j = check_results_count(response)[0]
            count = check_results_count(response)[1]
            # if count is non-zero:
            if count > 0:
                # if doi is not in special_result_index_dict, use index 0
                if doi not in list(special_result_index_dict.keys()):
                    first_result = j['results'][0]
                    paper_dict = get_paper_dict_from_json_result(first_result, doi)
                else:
                    correct_index = special_result_index_dict[doi]
                    correct_result = j['results'][correct_index]
                    paper_dict = get_paper_dict_from_json_result(correct_result, doi)
            # if count is zero, use doi query instead
            else:
                # get doi query response:
                response2 = get_doi_query_response(doi)
                # if status code is in retry_code, retry:
                while response2.status_code in retry_code:
                    print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response2.status_code}, retrying...')
                    time.sleep(1)
                    response2 = get_doi_query_response(doi)
                # if doi query succeeds:
                if response2.status_code == 200:
                    j2 = response2.json()
                    paper_dict = get_paper_dict_from_json_result(j2, doi)
                # if doi query fails, add the doi to the no-result list
                else:
                    # empty title query results and bad doi query
                    error_status_code.append(response2.status_code)
                    title_query_empty_doi_query_404_list.append(doi)
                    paper_dict = get_empty_paper_dict(doi)
                    print(f'doi query fails for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')
        # if title query fails:
        else:
            title_query_404_list.append(doi)
            error_status_code.append(response.status_code)
            paper_dict = get_empty_paper_dict(doi)
            print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
    else:
        response0 = get_doi_query_response(doi)
        # if status code is in retry_code, retry
        while response0.status_code in retry_code:
            print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response0.status_code}, retrying...')
            time.sleep(3)
            response0 = get_doi_query_response(doi)
        # if doi query succeeds:
        if response0.status_code == 200:
            j0 = response0.json()
            paper_dict = get_paper_dict_from_json_result(j0, doi)
        # if doi query fails:
        else:
            error_status_code.append(response0.status_code)
            doi_query_404_list.append(doi)
            paper_dict = get_empty_paper_dict(doi)
            print(f'doi query fails for {doi_index} : {doi}')
    paper_dict_list.append(paper_dict)

def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        update_paper_dict_list(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))

if __name__ == '__main__':
    # note on 2022-01-21: this is not a bug, but it might be error-prone:
    # I define variables here and then use them directly inside `main`
    # without passing them as parameters, e.g., `main(vispd_plus_good_papers, s)`.
    # It works, but as I said, it might be error-prone.
    vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
    doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
    doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
    retry_code = [500, 502, 503, 504, 429]
    paper_dict_list = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    doi_query_404_list = []
    error_status_code = []
    to_query_by_doi = [
        '10.1109/VISUAL.2001.964489',
        '10.1109/VISUAL.1996.568113',
        '10.1109/VISUAL.1999.809896',
        '10.1109/VISUAL.1991.175771',
        '10.1109/VISUAL.1998.745302',
        '10.1109/VISUAL.1993.398868',
        '10.1109/INFVIS.2005.1532128',
        '10.1109/VISUAL.1993.398859',
        '10.1109/VISUAL.1991.175795',
        '10.1109/VISUAL.2003.1250401',
        '10.1109/VISUAL.1991.175789',
        '10.1109/VISUAL.2000.885739',
        '10.1109/TVCG.2014.2346922',
        '10.1109/VISUAL.1999.809871',
        '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.2000.885692',
        '10.1109/VISUAL.1991.175777',
        '10.1109/VISUAL.1998.745315',
        '10.1109/VISUAL.1997.663909',
        '10.1109/VISUAL.2000.885697',
        '10.1109/VISUAL.2001.964504',
        '10.1109/TVCG.2006.168',
        '10.1109/TVCG.2007.70617',
        '10.1109/VISUAL.1997.663910',
        '10.1109/VISUAL.1997.663931',
        '10.1109/VISUAL.2002.1183792',
        '10.1109/VISUAL.1992.235201',
        '10.1109/VISUAL.1996.568128',
        '10.1109/VISUAL.1997.663923',
        '10.1109/VAST.2011.6102441',
        '10.1109/VISUAL.2000.885732',
        '10.1109/VISUAL.2001.964522',
        '10.1109/VISUAL.2005.1532812',
        '10.1109/VISUAL.1998.745350',
        '10.1109/INFVIS.2001.963282',
        '10.1109/VISUAL.1995.480804',
        '10.1109/VISUAL.2005.1532847',
        '10.1109/INFVIS.1996.559229',
        '10.1109/VISUAL.2000.885738',
        '10.1109/VISUAL.1991.175800',
        '10.1109/VISUAL.1993.398865',
        '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348',
        '10.1109/VISUAL.1993.398867',
        '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900',
        '10.1109/VISUAL.1992.235181',
        '10.1109/VISUAL.1992.235195',
        '10.1109/VISUAL.2000.885719',
        '10.1109/VISUAL.1991.175816',
        '10.1109/VISUAL.1990.146414',
        '10.1109/VISUAL.1993.398861',
        '10.1109/VISUAL.1993.398872',
        '10.1109/VISUAL.1994.346292',
        '10.1109/VISUAL.1994.346295',
        '10.1109/VISUAL.1994.346297',
        '10.1109/VISUAL.1994.346301',
        '10.1109/VISUAL.1999.809913',
        '10.1109/VISUAL.2001.964546',
        '10.1109/VISUAL.2003.1250404',
        '10.1109/TVCG.2014.2346442',
        '10.1109/TVCG.2020.3028948',
        '10.1109/TVCG.2020.3030363',
        '10.1109/TVCG.2020.3030364',
        '10.1109/tvcg.2021.3114784',
        '10.1109/tvcg.2021.3114780',
        '10.1109/tvcg.2021.3114782',
        '10.1109/tvcg.2021.3114783',
        '10.1109/tvcg.2021.3114836',
        '10.1109/TVCG.2021.3064037',
        '10.1109/TVCG.2021.3114849',
        '10.1109/TVCG.2021.3114842',
        '10.1109/TVCG.2021.3114766',
        '10.1109/TVCG.2021.3114777'
    ]
    special_result_index_dict = {
        '10.1109/VISUAL.1992.235194': 4,
    }
    main(vispd_plus_good_papers)
    paper_df = pd.DataFrame(paper_dict_list)
    paper_df.to_csv(VISPD_OPENALEX_MATCH_2, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_2, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_2, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
    with open(DOI_QUERY_404_2, 'w') as f:
        for doi in doi_query_404_list:
            f.write("%s\n" % doi)
```
`scripts/get_vispd_plus_good_papers.py`:

```python
import pandas as pd
import sys

VISPUBDATA_PLUS = sys.argv[1]
VISPD_PLUS_GOOD_PAPERS = sys.argv[2]

def get_vispd_plus_good_papers(INPUT):
    """get the list of good dois"""
    vispd_plus = pd.read_csv(INPUT)
    jc = ['J', 'C']
    good_papers = vispd_plus[
        (vispd_plus.PaperType.isin(jc)) | (vispd_plus.Year > 2020)
    ]
    dois = good_papers.loc[:, "DOI"].tolist()
    # remove the invalid DOI
    dois.remove('10.0000/00000001')
    return dois

vispd_plus_good_papers = get_vispd_plus_good_papers(VISPUBDATA_PLUS)

with open(VISPD_PLUS_GOOD_PAPERS, 'w') as f:
    for doi in vispd_plus_good_papers:
        f.write("%s\n" % doi)
```
`scripts/get_vispd_plus.py`:

```python
import sys
import pandas as pd

DOIS_2021 = sys.argv[1]
VISPUBDATA = sys.argv[2]
VISPUBDATA_PLUS = sys.argv[3]

if __name__ == '__main__':
    dois_2021_df = pd.read_csv(DOIS_2021)
    vispd = pd.read_csv(VISPUBDATA)
    vispd_plus = vispd.append(dois_2021_df, ignore_index=True)
    vispd_plus.to_csv(VISPUBDATA_PLUS, index=False)
```
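Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so this script requires an older pandas. A minimal sketch of the drop-in replacement on newer versions, using toy frames with made-up rows:

```python
import pandas as pd

vispd = pd.DataFrame({'DOI': ['10.1109/TVCG.2006.168'], 'Year': [2006]})
dois_2021_df = pd.DataFrame({'DOI': ['10.1109/TVCG.2021.3114766'], 'Year': [2021]})

# equivalent to vispd.append(dois_2021_df, ignore_index=True) on old pandas
vispd_plus = pd.concat([vispd, dois_2021_df], ignore_index=True)
```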
`scripts/get_wos_id.py`:

```python
import pandas as pd
import urllib
import requests
from bs4 import BeautifulSoup
import re
import csv
import random
import numpy as np
import time
import sys

INPUT = sys.argv[1]
OUT_FNAME = sys.argv[2]

def get_wos_id_from_doi(doi):
    url = f'http://ws.isiknowledge.com/cps/openurl/service?url_ver=Z39.88-2004&rft_id=info:doi/{doi}'
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    }
    response = requests.get(url=url, headers=headers)
    wos_url = response.history[-1].url
    wos_id_list = re.findall(r'(?<=2FWOS%3A)(.*)(?=%3F)', wos_url)
    if wos_id_list:
        wos_id = wos_id_list[0]
    else:
        wos_id = np.NaN
    doi_wos_dict = {
        'DOI': doi,
        'WOS ID': wos_id
    }
    doi_wos_dict_list.append(doi_wos_dict)

def get_dois(INPUT):
    good_dois = open(INPUT, 'r')
    reader = csv.reader(good_dois)
    allRows = [row for row in reader]
    dois = [i[0] for i in allRows]
    return dois

def build_df_from_dict_list(df, dict_list):
    """build df from a list of dictionaries

    Arguments:
        df: an empty df you just initiated
        dict_list: a list of dictionaries containing data you want to form a df

    Returns:
        The updated df
    """
    for i in dict_list:
        df_1 = pd.DataFrame([i])
        df = df.append(df_1, ignore_index=True)
    return df

def main():
    for doi in dois:
        get_wos_id_from_doi(doi)
        time.sleep(2 + random.uniform(0, 2))
        print(f'{dois.index(doi) + 1} is done')

if __name__ == '__main__':
    # initiate a list of dicts
    doi_wos_dict_list = []
    dois = get_dois(INPUT)
    main()
    # initiate a dataframe
    doi_wos_df_initiate = pd.DataFrame(columns=['DOI', 'WOS ID'])
    doi_wos_df = build_df_from_dict_list(
        doi_wos_df_initiate, doi_wos_dict_list)
    doi_wos_df.to_csv(OUT_FNAME, index=False)
```
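`build_df_from_dict_list` also relies on the removed `DataFrame.append`, and growing a frame row by row is slow. Since every dict in `doi_wos_dict_list` is flat, the whole list can be converted in one call; a minimal sketch with a made-up WOS ID:

```python
import pandas as pd

doi_wos_dict_list = [
    {'DOI': '10.1109/VISUAL.1990.146412', 'WOS ID': 'WOS000000000000'},  # made-up ID
    {'DOI': '10.1109/VISUAL.2003.1250379', 'WOS ID': float('nan')},
]
# one call replaces the per-row append loop
doi_wos_df = pd.DataFrame(doi_wos_dict_list, columns=['DOI', 'WOS ID'])
```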
`scripts/plot_data_author_chord_diagram_data.py`:

```python
import sys
import pandas as pd
import itertools
from collections import Counter

HT_CLEANED_AUTHOR_DF = sys.argv[1]
AUTHOR_CHORD_DF = sys.argv[2]
TS_AUTHOR_CHORD_DF = sys.argv[3]

def get_dic(DF):
    # DF here is HT_CLEANED_AUTHOR_DF
    """get the dictionary of bicode counts"""
    tuple_list = []
    for group in DF.groupby('DOI'):
        country_codes = list(set(group[1]['Affiliation Country Code']))
        if len(country_codes) > 1:
            tuples = [x for x in itertools.combinations(country_codes, 2)]
            tuple_list.append(tuples)
    bicode = list(itertools.chain(*tuple_list))
    bicode_counts = Counter(bicode)
    bicode_counts_dic = dict(bicode_counts)
    return bicode_counts_dic

def get_chord_df(DIC):
    # DIC here is bicode_counts_dic
    """
    Return:
        A dataframe containing three columns: source, target, value.
        Even though I am using `source` and `target`, this is an
        undirected network.
    """
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df_sorted = chord_df[
        ['source', 'target', 'value']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    return chord_df_sorted

def get_ts_chord_df(DF, ts_chord_data):
    # DF here is HT_CLEANED_AUTHOR_DF
    """
    get timeseries data: group by year first,
    get each year's data, and then concatenate
    """
    for year_group in DF.groupby("Year"):
        bicode_counts_dic = get_dic(year_group[1])
        chord_df = pd.DataFrame(
            bicode_counts_dic.items(), columns=['pairs', 'value'])
        chord_df['year'] = year_group[0]
        chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
        chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
        chord_df_sorted = chord_df[
            ['source', 'target', 'value', 'year']].sort_values(
            by='value', ascending=False).reset_index(drop=True)
        ts_chord_data.append(chord_df_sorted)
    ts_chord_df = pd.concat(ts_chord_data, ignore_index=True)
    return ts_chord_df

def rename_countries(DF):
    """convert country codes to names"""
    DF.replace({
        'CH': 'Switzerland',
        'CN': 'China',
        'DE': 'Germany',
        'CA': 'Canada',
        'FR': 'France',
        'NL': 'Netherlands',
        'AT': 'Austria',
        'AU': 'Australia',
    }, inplace=True
    )
    return DF

if __name__ == '__main__':
    HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
    ts_chord_data = []
    bicode_counts_dic = get_dic(HT_CLEANED_AUTHOR_DF)
    chord_df = get_chord_df(bicode_counts_dic)
    chord_df.to_csv(AUTHOR_CHORD_DF, index=False)
    ts_chord_df = get_ts_chord_df(HT_CLEANED_AUTHOR_DF, ts_chord_data)
    ts_chord_df.to_csv(TS_AUTHOR_CHORD_DF, index=False)
```
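The pair-counting idiom above (unordered combinations fed into a `Counter`) is the core of both this chord-diagram script and the concept co-occurrence script further down; a minimal standalone sketch with made-up country codes:

```python
import itertools
from collections import Counter

# one list of affiliation country codes per paper (made-up data)
papers = [['US', 'DE'], ['US', 'DE', 'CN'], ['US']]
pairs = []
for codes in papers:
    if len(codes) > 1:  # single-country papers contribute no pairs
        pairs.extend(itertools.combinations(codes, 2))
print(Counter(pairs))
# Counter({('US', 'DE'): 2, ('US', 'CN'): 1, ('DE', 'CN'): 1})
```

One caveat: in the real script the codes come from a `set`, so the order inside each tuple is not deterministic, which is harmless here because the network is treated as undirected.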
`scripts/plot_sankey_data.py`:

```python
import pandas as pd
import sys
import numpy as np
import itertools
from collections import Counter

VISPUBDATA_PLUS = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
REFERENCE_CONCEPT_DF = sys.argv[3]
CITATION_CONCEPT_DF = sys.argv[4]
SANKEY_AGGREGATED_DF = sys.argv[5]
SANKEY_TS_DF = sys.argv[6]

def get_vis_doi_concept_dic(DF, LEVEL):
    # DF here is OPENALEX_CONCEPT_DF
    vis_levelns_df = DF[DF.Level == LEVEL].reset_index(drop=True)
    max_score_leveln = []
    for group in vis_levelns_df.groupby('DOI'):
        max_score = max(group[1]['Score'])
        df_to_append = group[1][group[1]['Score'] == max_score]
        max_score_leveln.append(df_to_append)
    vis_leveln_df = pd.concat(max_score_leveln, ignore_index=True)
    vis_leveln_doi_concept_dic = dict(
        zip(vis_leveln_df.DOI, vis_leveln_df.Concept))
    return vis_leveln_doi_concept_dic

def get_leveln_df(DF, LEVEL, ID_NAME):
    """
    inputs:
        DF is either REF_DF or CIT_DF
        ID_NAME is either REF_ID_NAME or CIT_ID_NAME
    Returns:
        a dataframe of two columns:
            1. IEEE VIS papers' DOI
            2. REF/CIT papers' concept
    """
    dfs = []
    levelns_df = DF[DF.Level == LEVEL]
    # keep only the highest score concept
    for group in levelns_df.groupby(ID_NAME):
        dff = group[1].sort_values(by='Score', ascending=False)
        max_score = max(dff['Score'])
        dff_to_append = dff[dff['Score'] == max_score]
        dfs.append(dff_to_append)
    leveln_df = pd.concat(dfs, ignore_index=True)[['DOI', 'Concept', ID_NAME]]
    return leveln_df

def get_leveln_output_df(DF, VIS_DOI_CONCEPT_DIC, YEAR_DICT, YEAR_KEY, SUFFIX):
    """
    inputs:
        DF is either REF_LEVELN_DF or CIT_LEVELN_DF
        YEAR_DICT is either DOI_YEAR_DICT or CIT_ID_YEAR_DICT
        YEAR_KEY is either REF_YEAR_KEY or CIT_YEAR_KEY
        SUFFIX is either REF_SUFFIX or CIT_SUFFIX
    The purpose of this step:
        1. map DOI to IEEE VIS concept
        2. get the year when this citation happens
    """
    DF['IEEE VIS Concept'] = DF.DOI.apply(
        lambda x: VIS_DOI_CONCEPT_DIC[
            x] if x in VIS_DOI_CONCEPT_DIC.keys() else np.NaN
    )
    DF['Year'] = DF[YEAR_KEY].apply(lambda x: YEAR_DICT[x])
    leveln_df_nonan = DF[DF['IEEE VIS Concept'].notnull()]
    leveln_df_output = leveln_df_nonan.drop(
        columns=['DOI']).reset_index(drop=True)
    if SUFFIX == REF_SUFFIX:
        leveln_df_output['Concept'] = leveln_df_output[
            'Concept'].apply(lambda s: s + REF_SUFFIX)
    else:
        leveln_df_output['Concept'] = leveln_df_output[
            'Concept'].apply(lambda s: s + CIT_SUFFIX)
    leveln_df_output['IEEE VIS Concept'] = leveln_df_output[
        'IEEE VIS Concept'].apply(lambda s: s + "(v)")
    return leveln_df_output

def get_leveln_aggregated(SOURCE, DF, LEVEL):
    """
    inputs:
        SOURCE is either 'REF' or 'CIT'
        DF is either REF_LEVELN_OUTPUT or CIT_LEVELN_OUTPUT
    """
    if SOURCE == 'REF':
        tuples = list(zip(
            DF['Concept'],
            DF['IEEE VIS Concept'],
        ))
    else:
        tuples = list(zip(
            DF['IEEE VIS Concept'],
            DF['Concept'],
        ))
    biconcept_counts = Counter(tuples)
    dic = dict(biconcept_counts)
    sankey_df = pd.DataFrame(dic.items(), columns=['pairs', 'value'])
    sankey_df['level'] = LEVEL
    sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
    sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
    sankey_df_sorted = sankey_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
    return sankey_df_sorted

def get_ts_year_group_data(SOURCE, DF, LEVEL):
    """
    inputs:
        SOURCE is either 'REF' or 'CIT'
        DF is year_group
    This is much the same as the get_leveln_aggregated() function
    """
    if SOURCE == 'REF':
        tuples = list(zip(
            DF[1]['Concept'],
            DF[1]['IEEE VIS Concept'],
        ))
    else:
        tuples = list(zip(
            DF[1]['IEEE VIS Concept'],
            DF[1]['Concept'],
        ))
    biconcept_counts = Counter(tuples)
    dic = dict(biconcept_counts)
    sankey_df = pd.DataFrame(dic.items(), columns=['pairs', 'value'])
    sankey_df['level'] = LEVEL
    sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
    sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
    sankey_df_sorted = sankey_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
    sankey_df_sorted['year'] = DF[0]
    return sankey_df_sorted

if __name__ == '__main__':
    VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    REF_DF = pd.read_csv(REFERENCE_CONCEPT_DF)
    CIT_DF = pd.read_csv(CITATION_CONCEPT_DF)
    REF_ID_NAME = 'Reference OpenAlex ID'
    CIT_ID_NAME = 'Citation Paper OpenAlex ID'
    REF_DF = REF_DF[REF_DF[REF_ID_NAME].notnull()]
    CIT_DF = CIT_DF[CIT_DF[CIT_ID_NAME].notnull()]
    CIT_DF.rename(columns={'Cited Paper DOI': 'DOI'}, inplace=True)
    DOI_YEAR_DICT = dict(zip(
        VISPUBDATA_PLUS.DOI, VISPUBDATA_PLUS.Year
    ))
    CIT_ID_YEAR_DICT = dict(zip(
        CIT_DF[CIT_ID_NAME], CIT_DF['Citation Paper Year']
    ))
    REF_YEAR_KEY = 'DOI'
    CIT_YEAR_KEY = CIT_ID_NAME
    # Set parameters
    START_LEVEL = 0
    END_LEVEL = 3
    CUTOFF = 500
    REF_SUFFIX = '(r)'
    CIT_SUFFIX = '(c)'
    # initiate dfs
    REF_LEVELN_AGGREGATED_DFS = []
    CIT_LEVELN_AGGREGATED_DFS = []
    REF_LEVELN_TS_DFS = []
    CIT_LEVELN_TS_DFS = []
    for LEVEL in range(START_LEVEL, END_LEVEL + 1):
        VIS_DOI_CONCEPT_DIC = get_vis_doi_concept_dic(
            OPENALEX_CONCEPT_DF, LEVEL
        )
        # REFERENCE -> VIS
        REF_LEVELN_DF = get_leveln_df(
            REF_DF, LEVEL, REF_ID_NAME,
        )
        REF_LEVELN_OUTPUT = get_leveln_output_df(
            REF_LEVELN_DF,
            VIS_DOI_CONCEPT_DIC,
            DOI_YEAR_DICT,
            REF_YEAR_KEY,
            REF_SUFFIX,
        )
        REF_LEVELN_AGGREGATED = get_leveln_aggregated(
            'REF', REF_LEVELN_OUTPUT, LEVEL,
        )
        REF_LEVELN_AGGREGATED_DFS.append(REF_LEVELN_AGGREGATED)
        # TIMESERIES:
        REF_LEVELN_YEAR_GROUP_DFS = []
        for year_group in REF_LEVELN_OUTPUT.groupby('Year'):
            year_group_data = get_ts_year_group_data(
                'REF', year_group, LEVEL
            )
            REF_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
        REF_LEVELN_TS_DF = pd.concat(
            REF_LEVELN_YEAR_GROUP_DFS, ignore_index=True,
        )
        REF_LEVELN_TS_DFS.append(REF_LEVELN_TS_DF)
        # VIS -> CITATION
        CIT_LEVELN_DF = get_leveln_df(
            CIT_DF, LEVEL, CIT_ID_NAME,
        )
        CIT_LEVELN_OUTPUT = get_leveln_output_df(
            CIT_LEVELN_DF,
            VIS_DOI_CONCEPT_DIC,
            CIT_ID_YEAR_DICT,
            CIT_YEAR_KEY,
            CIT_SUFFIX,
        )
        CIT_LEVELN_AGGREGATED = get_leveln_aggregated(
            'CIT', CIT_LEVELN_OUTPUT, LEVEL,
        )
        CIT_LEVELN_AGGREGATED_DFS.append(CIT_LEVELN_AGGREGATED)
        # TIMESERIES:
        CIT_LEVELN_YEAR_GROUP_DFS = []
        for year_group in CIT_LEVELN_OUTPUT.groupby('Year'):
            year_group_data = get_ts_year_group_data(
                'CIT', year_group, LEVEL,
            )
            CIT_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
        CIT_LEVELN_TS_DF = pd.concat(
            CIT_LEVELN_YEAR_GROUP_DFS, ignore_index=True,
        )
        CIT_LEVELN_TS_DFS.append(CIT_LEVELN_TS_DF)
        print(f'level {LEVEL} is done')
    # GET AGGREGATED_DF
    ref_aggregated = pd.concat(
        REF_LEVELN_AGGREGATED_DFS, ignore_index=True,
    )
    ref_aggregated['source name'] = 'REF'
    cit_aggregated = pd.concat(
        CIT_LEVELN_AGGREGATED_DFS, ignore_index=True,
    )
    cit_aggregated['source name'] = 'VIS'
    aggregated_df = pd.concat(
        [ref_aggregated, cit_aggregated], ignore_index=True,
    )
    # GET TS_DF
    ref_timeseries = pd.concat(
        REF_LEVELN_TS_DFS, ignore_index=True,
    )
    ref_timeseries['source name'] = 'REF'
    cit_timeseries = pd.concat(
        CIT_LEVELN_TS_DFS, ignore_index=True,
    )
    cit_timeseries['source name'] = 'VIS'
    ts_df = pd.concat(
        [ref_timeseries, cit_timeseries], ignore_index=True,
    )
    # Write to file
    aggregated_df.to_csv(SANKEY_AGGREGATED_DF, index=False)
    ts_df.to_csv(SANKEY_TS_DF, index=False)
    print('sankey data has been saved!')
```
`scripts/plot_top_concepts_trends.py`:

```python
import sys
import numpy as np
import pandas as pd
from collections import Counter

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
TOP_CONCEPTS_TRENDS_DF = sys.argv[3]

def get_year_count_dic(DF):
    # DF here is openalex_paper_df
    """I want proportions, so I first need the total number of pubs each year"""
    year_count_df = DF.groupby(
        'Year').size().to_frame('count').reset_index()
    year_count_dic = dict(
        zip(year_count_df['Year'], year_count_df['count']))
    return year_count_dic

def get_top_concepts_rank_and_total(DF, LEVEL, CUTOFF):
    # DF here is OPENALEX_CONCEPT_DF
    """get the top concepts, their ranks, and their historical totals"""
    # filter by specific level
    lvl = DF[DF.Level == LEVEL]
    # get the total frequency of the concepts within that level
    lvl_df = lvl.groupby(['Concept', 'Concept ID']).size().to_frame(
        'frequency').reset_index().sort_values(
        by='frequency', ascending=False).head(CUTOFF)
    # get the rank of each of the top concepts within that level
    # generate two dics: one for rank, and the other for total
    lvl_df['rank'] = range(1, CUTOFF + 1)
    top_concepts = lvl_df['Concept']
    concept_rank_dic = dict(zip(lvl_df['Concept'], lvl_df['rank']))
    concept_historical_total_dic = dict(zip(lvl_df['Concept'], lvl_df['frequency']))
    return top_concepts, concept_rank_dic, concept_historical_total_dic

def get_ts_for_top(DF, TOP_CONCEPTS):
    # DF here is OPENALEX_CONCEPT_DF
    """
    get timeseries data for top concepts
    Returns:
        a dataframe where each row contains a concept, a year, and the
        total frequency of that concept in that year
    """
    top_concepts_ts_df = DF[DF.Concept.isin(TOP_CONCEPTS)].groupby(
        ['Concept', 'Year']).size().to_frame(
        'Concept Yearly Frequency').reset_index()
    return top_concepts_ts_df

def update_dfs(
    DF, i, TOP_RANK_DIC, TOP_TOTAL_DIC, YEAR_COUNT_DIC, DFS
):
    # DF here is TOP_CONCEPTS_TS_DF
    LEVEL = i
    dfss = []
    start = 1990
    end = 2021
    year_idx = range(start, end + 1)
    for group in DF.groupby('Concept'):
        # Normalize each concept in each level by the same time range,
        # i.e., 1990-2021
        year_frequency_dic = dict(
            zip(group[1]['Year'], group[1]['Concept Yearly Frequency']))
        concepts = [group[1].iloc[0, :].Concept] * len(year_idx)
        frequencies = [
            year_frequency_dic[
                x] if x in year_frequency_dic.keys() else 0 for x in year_idx]
        time_series_df = pd.DataFrame(
            list(zip(concepts, year_idx, frequencies)),
            columns=[f'concept_{LEVEL}', f'year_{LEVEL}', f'yearly frequency_{LEVEL}'])
        time_series_df[f'rank_{LEVEL}'] = time_series_df[f'concept_{LEVEL}'].apply(
            lambda x: TOP_RANK_DIC[x])
        time_series_df[f'level_{LEVEL}'] = LEVEL
        time_series_df[f'concept historical total_{LEVEL}'] = time_series_df[
            f'concept_{LEVEL}'].apply(
            lambda x: TOP_TOTAL_DIC[x])
        time_series_df[f'yearly vis total_{LEVEL}'] = time_series_df[f'year_{LEVEL}'].apply(
            lambda x: YEAR_COUNT_DIC[x])
        time_series_df[f'proportion_{LEVEL}'] = time_series_df[
            f'yearly frequency_{LEVEL}'] / time_series_df[f'yearly vis total_{LEVEL}']
        # time_series_df is for each concept within each level;
        # dfss contains all concepts' data within a level
        dfss.append(time_series_df.reset_index(drop=True))
    level_df_to_append = pd.concat(dfss, ignore_index=True)
    level_df_to_append.sort_values(by=[f'rank_{LEVEL}', f'year_{LEVEL}'], inplace=True)
    DFS.append(level_df_to_append.reset_index(drop=True))

if __name__ == '__main__':
    # Set parameters
    START_LEVEL = 0
    END_LEVEL = 3
    # CUTOFF = 30
    CUTOFF = 10
    OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    YEAR_COUNT_DIC = get_year_count_dic(OPENALEX_PAPER_DF)
    DFS = []
    for i in range(START_LEVEL, END_LEVEL + 1):
        TOP_CONCEPTS, TOP_RANK_DIC, TOP_TOTAL_DIC = get_top_concepts_rank_and_total(
            OPENALEX_CONCEPT_DF, i, CUTOFF
        )
        TOP_CONCEPTS_TS_DF = get_ts_for_top(
            OPENALEX_CONCEPT_DF, TOP_CONCEPTS
        )
        update_dfs(
            TOP_CONCEPTS_TS_DF, i, TOP_RANK_DIC, TOP_TOTAL_DIC,
            YEAR_COUNT_DIC, DFS
        )
    # concat, validate, and write to file
    dff = pd.concat(DFS, axis=1)
    print(dff['year_1'].tolist() == dff['year_2'].tolist())
    print(dff['year_1'].tolist() == dff['year_3'].tolist())
    print(dff['rank_1'].tolist() == dff['rank_3'].tolist())
    print(dff['rank_1'].tolist() == dff['rank_2'].tolist())
    dff.to_csv(TOP_CONCEPTS_TRENDS_DF, index=False)
```
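The `proportion_{LEVEL}` column is simply a concept's yearly frequency normalized by that year's total VIS paper count; a worked toy example (numbers made up):

```python
yearly_frequency = 12    # papers tagged with the concept that year
yearly_vis_total = 100   # all VIS papers that year
proportion = yearly_frequency / yearly_vis_total
print(proportion)  # 0.12
```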
`scripts/plot_vis_concepts_cooccurance_data.py`:

```python
import sys
import numpy as np
import pandas as pd
import itertools
from collections import Counter

OPENALEX_CONCEPT_DF = sys.argv[1]
AGGREGATED_COOCCURANCE_DF = sys.argv[2]
TS_AGGREGATED_COOCCURANCE_DF = sys.argv[3]

def get_level_df(DF, LEVEL):
    # subset by level
    level_df = DF[DF.Level == LEVEL].reset_index(drop=True)
    return level_df

def get_dic(LEVEL_DF):
    """get the dictionary of biconcept counts"""
    # initiate a tuple list
    tuple_list = []
    # for each ieeevis paper, get combinations of level concepts if more
    # than one level concept exists
    for group in LEVEL_DF.groupby('DOI'):
        concepts = list(set(group[1].Concept))
        if len(concepts) > 1:
            tuples = [x for x in itertools.combinations(concepts, 2)]
            tuple_list.append(tuples)
    # get biconcepts dictionary
    biconcepts = list(itertools.chain(*tuple_list))
    biconcept_counts_dic = dict(Counter(biconcepts))
    return biconcept_counts_dic

def update_data(DIC, LEVEL, CUTOFF, DATA):
    # DIC: biconcept_counts_dic
    # DATA: cooccurance_aggregated_data
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['level'] = LEVEL
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df = chord_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    chord_df = chord_df[chord_df['value'] >= CUTOFF]
    DATA.append(chord_df)

def update_ts_data(DIC, YEAR, LEVEL, CUTOFF, DATA):
    """get timeseries chord dataframe"""
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['year'] = YEAR
    chord_df['level'] = LEVEL
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df = chord_df[
        ['source', 'target', 'value', 'year', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    chord_df = chord_df[chord_df['value'] >= CUTOFF]
    DATA.append(chord_df)

if __name__ == '__main__':
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    # set parameters
    CUTOFF = 1  # cutoff number for cooccurrence
    START = 0   # top level
    END = 3     # lowest level
    # Get aggregated data, involving data of all levels
    cooccurance_aggregated_data = []
    # iterate through all levels
    for LEVEL in range(START, END + 1):
        LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)
        biconcept_counts_dic = get_dic(LEVEL_DF)
        update_data(
            biconcept_counts_dic, LEVEL, CUTOFF, cooccurance_aggregated_data)
    # write to file
    aggregated_df = pd.concat(cooccurance_aggregated_data, ignore_index=True)
    aggregated_df.to_csv(AGGREGATED_COOCCURANCE_DF, index=False)
    # Get timeseries data
    cooccurance_timeseries_aggregated_data = []
    for LEVEL in range(START, END + 1):
        # initiate time series data for each level;
        # it will collect each year's data within the current LEVEL
        cooccurance_timeseries_data = []
        LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)
        for YEAR_GROUP in LEVEL_DF.groupby('Year'):
            biconcept_counts_dic = get_dic(YEAR_GROUP[1])
            update_ts_data(
                biconcept_counts_dic, YEAR_GROUP[0], LEVEL, CUTOFF,
                cooccurance_timeseries_data
            )
        # this is the final data for each level
        cooccurance_timeseries_df = pd.concat(
            cooccurance_timeseries_data, ignore_index=True)
        # append this level's data to the aggregated data list
        cooccurance_timeseries_aggregated_data.append(cooccurance_timeseries_df)
    # write to file
    ts_aggregated_df = pd.concat(
        cooccurance_timeseries_aggregated_data, ignore_index=True)
    ts_aggregated_df.to_csv(TS_AGGREGATED_COOCCURANCE_DF, index=False)
```
`scripts/scrape_award_papers.py`:

```python
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import sys

# input
IEEE_AUTHOR_DF = sys.argv[1]
# output
AWARD_PAPER_DF = sys.argv[2]

def get_paragraphs(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = bs(r.text, 'html.parser')
        article = soup.find('article')
        paragraphs = list(article.stripped_strings)
        return paragraphs

def rename(x):
    if 'Honorable Mention Awards' in x:
        return 'HM'
    if 'Best Paper Award' in x:
        return 'BP'
    if 'Test of Time Award' in x:
        return 'TT'
    if 'Best Case Study Award' in x:
        return 'BCS'
    raise ValueError("Unknown award:", x)

rearranger = lambda x: [x[-1], x[-3], x[-2], x[-4], x[1], x[0]]

def get_parsed_results(years, years_idx, paragraphs):
    results = []
    intervals = zip(years_idx, years_idx[1:] + [len(paragraphs)])
    # every loop includes a year's awards
    for idx, (y1, y2) in enumerate(intervals):
        year = years[idx]
        paper_info = []  # initialize a list to store a paper's info
        for i in range(y1 + 1, y2):
            p = paragraphs[i]
            if p.endswith(('Awards:', 'Award:')):
                award = p.replace(':', '')
                award = rename(award)
                continue
            if p.endswith("\nDOI:"):
                p = p.replace(".\nDOI:", "").replace("Awarded at: ", '')
            if p == "DOI:":
                p = 'Vis'
            # every paper info has four lines: author, title, awarded at, DOI
            paper_info.append(p)
            # all DOIs happen to have "/" not used anywhere else
            if '/' in p and paragraphs[i - 1].endswith("DOI:"):
                paper_info.extend([award, year])  # add award type and year
                results.append(paper_info)
                paper_info = []
    return list(map(rearranger, results))

def doi_debug(results):
    df = pd.read_csv(IEEE_AUTHOR_DF)
    dois = df['DOI'].unique().tolist()
    dois_lower = [d.lower() for d in dois]
    for idx, res in enumerate(results):
        if res[1] in dois:
            pass
        elif res[1].lower() in dois_lower:
            i = dois_lower.index(res[1].lower())
            print(res[1] + " has been unified as --> " + dois[i])
            results[idx][1] = dois[i]
        else:
            print(f"DOI: {res[1]} does not exist in {IEEE_AUTHOR_DF}!")
    return results

def get_2021_tt_papers():
    url = 'http://ieeevis.org/year/2021/info/awards/test-of-time-awards'
    paragraphs = get_paragraphs(url)
    tracks = ['VAST', 'InfoVis', 'SciVis']
    tracks_idx = [paragraphs.index(a) for a in tracks]
    years, years_idx = [], []
    for idx, p in enumerate(paragraphs):
        p = p.replace(":", "")
        if p.isdigit():
            years.append(int(p))
            years_idx.append(idx)

    def get_track(year_idx):
        for i in range(-1, -4, -1):
            if year_idx > tracks_idx[i]:
                return tracks[i]

    results = []
    award = 'TT'
    for idx, y_idx in enumerate(years_idx):
        year = years[idx]
        title = paragraphs[y_idx + 1]
        author = paragraphs[y_idx + 2]
        doi = paragraphs[y_idx + 4]
        track = get_track(y_idx)
        results.append([year, doi, award, track, title, author])
    return doi_debug(results)

def main():
    url = 'http://ieeevis.org/year/2022/info/history/best-paper-award'
    paragraphs = get_paragraphs(url)
    years = [y for y in range(2021, 1989, -1)]
    years_idx = [paragraphs.index(str(y)) for y in years]
    assert len(years) == len(years_idx)
    results = get_parsed_results(years, years_idx, paragraphs)
    results = doi_debug(results)
    results.extend(get_2021_tt_papers())
    columns = ['Year', 'DOI', 'Award', 'Track', 'Title', 'Author']
    df = pd.DataFrame(results, columns=columns)
    df.to_csv(AWARD_PAPER_DF, index=False)

if __name__ == '__main__':
    main()
```
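The `rearranger` lambda is easiest to read with a concrete record. Each parsed entry is collected as author, title, awarded-at track, and DOI, then extended with award and year; the lambda reorders that into the output column order. A sketch with a made-up record:

```python
rearranger = lambda x: [x[-1], x[-3], x[-2], x[-4], x[1], x[0]]

# [author, title, track, DOI, award, year] -- all values made up
record = ['Jane Doe', 'Some Paper', 'InfoVis', '10.1109/EXAMPLE.1', 'BP', 2020]
print(rearranger(record))
# [2020, '10.1109/EXAMPLE.1', 'BP', 'InfoVis', 'Some Paper', 'Jane Doe']
# i.e., [Year, DOI, Award, Track, Title, Author]
```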
The `shell:` directives from the corresponding Snakefile rules, one per rule:

```
shell: "python scripts/get_titles_2021.py {output}"
shell: "python scripts/get_vispd_plus.py {input} {output}"
shell: "python scripts/get_vispd_plus_good_papers.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_1.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_2.py {input} {output}"
shell: "python scripts/get_papers_to_study.py {input} {output}"
shell: "python scripts/get_openalex_dfs.py {input} {output}"
shell: "python scripts/get_openalex_citation_dfs.py {input} {output}"
shell: "python scripts/get_ieee_author_and_paper_title.py {input} {output}"
shell: "python scripts/get_merged_author_df.py {input} {output}"
shell: "python scripts/get_openalex_reference_dfs.py {input} {output}"
shell: "python scripts/scrape_award_papers.py {input} {output}"
shell: "python scripts/get_gscholar_data.py {input} {output}"
shell: "python scripts/get_wos_id.py {input} {output}"
shell: "python scripts/CLASS_country.py {input} {output}"
shell: "python scripts/CLASS_type.py {input} {output}"
shell: "python scripts/get_HT_cleaned_author_df.py {input} {output}"
shell: "python scripts/get_HT_cleaned_paper_df.py {input} {output}"
shell: "python scripts/plot_data_author_chord_diagram_data.py {input} {output}"
shell: "python scripts/plot_vis_concepts_cooccurance_data.py {input} {output}"
shell: "python scripts/plot_top_concepts_trends.py {input} {output}"
shell: "python scripts/plot_sankey_data.py {input} {output}"
```
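Only the `shell:` lines of these rules are shown above; in the Snakefile, each one sits inside a rule that also declares its `input` and `output` files, which is how Snakemake chains the scripts into a workflow. A minimal sketch of what one full rule might look like (the rule name and file path here are hypothetical; only the shell line is from the source):

```
rule get_titles_2021:
    output:
        "data/processed/titles_2021.csv"  # hypothetical path
    shell:
        "python scripts/get_titles_2021.py {output}"
```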
Support
- Future updates