This repository contains the data files and code (data processing and analysis) for the paper Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations.
Updated Findings
In Fig. 3(d) and 3(e), we showed that the number of citations of VIS papers from non-VIS papers has been increasing dramatically, but we did not analyze the publication venues of these citing papers. We did this later and found that citations coming from IEEE Transactions on Visualization and Computer Graphics accounted for 12.4% of all 153,549 citations (undeduplicated). Citations from Computer Graphics Forum, HCI venues, PacificVis, and journals in the field of visualization such as Information Visualization and Journal of Visualization are also major sources. This indicates that the impact of VIS is mostly confined to the visualization and HCI areas. Detailed results are available at https://hongtaoh.com/files/top_venues.html.
For the replicability committee: please go to the `reproduce` folder and simply run `bash script.sh`.
Structure
This repository consists of four folders:

- `analyses_and_get_figures` contains the Jupyter notebooks that produce the statistics and figures reported in the Results section of our paper.
- `data` contains the data files we created and analyzed.
- `results` contains the output figures generated by the code in `analyses_and_get_figures`. Figures in both the paper and the supplementary material are included.
- `workflow` contains (1) scripts to obtain data, and (2) Jupyter notebooks to validate data.

`analyses_and_get_figures` and `results` are easy to understand. The most difficult and critical parts are `workflow` and `data`. For detailed data generation and processing procedures, refer to `workflow`. For detailed descriptions of the data that were generated and used in the study, refer to the `data` folder.
Important data
The most important data files in the analyses are as follows:

- `data/ht_class/ht_cleaned_author_df.csv`
- `data/ht_class/ht_cleaned_paper_df.csv`
- `data/interim/openalex_author_df.csv`
- `data/processed/openalex_concept_df.csv`
- `data/processed/large/openalex_citation_concept_df.csv`
- `data/processed/large/openalex_reference_concept_df.csv`
- `data/processed/openalex_refeernce_concept_df_unique.csv`
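These files are plain CSVs, so exploring them requires nothing beyond pandas. A minimal sketch (paths are relative to the repository root; the files under `large` must first be downloaded from OSF, as described below):

```python
import pandas as pd

# Core paper- and author-level tables (paths relative to the repository root)
author_df = pd.read_csv('data/ht_class/ht_cleaned_author_df.csv')
paper_df = pd.read_csv('data/ht_class/ht_cleaned_paper_df.csv')

# Quick sanity checks on what each table contains
print(paper_df.shape, author_df.shape)
print(paper_df.columns.tolist())
```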
Data dictionaries for public data
We have also made data that might be useful for other researchers working on scientometric analysis available on Google Sheets: https://docs.google.com/spreadsheets/d/1JRo33XurW28bGK_Snplno1dbRLDkSZf1T7JmpjNDvTw/
VIS PAPER 1990-2021

- Conference: The conference track of VIS papers. There are four tracks: InfoVis, SciVis, VAST, Vis. Since 2021, IEEE VIS no longer distinguishes between conference tracks, so we assigned the term 'VIS' to all papers published in and after 2021
- Year: The year this paper was published
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- DOI: Paper DOI
- PaperType: Either 'J' (journal paper) or 'C' (conference paper). This data is from vispubdata. For IEEE VIS 2021 papers, we classified them all as 'J'
- OpenAlex ID: The OpenAlex ID associated with this paper. With an ID, for example, W3203914472, you can access this paper's metadata on OpenAlex through https://api.openalex.org/works/W3203914472
- Number of References: Number of references as shown on OpenAlex (as of June 2022)
- Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)
- Number of Citations: Number of citations as shown on OpenAlex (as of June 2022)
- Number of Authors: Number of authors
- Cross-type Collaboration: Whether a paper involves collaborations among researchers from universities and non-educational affiliations (e.g., companies, facilities, government, healthcare, etc.)
- Cross-country Collaboration: Whether a paper involves collaborations among researchers from different countries or regions
- With US Authors: Whether a paper involves at least one author from the United States
- Both Cross-type and Cross-country Collaboration: Whether a paper is both a cross-type and a cross-country collaboration paper
- Google Scholar Citation: Citation counts as shown on Google Scholar (as of June 2022)
- Award: Whether a paper is an award-winning paper. Note that we exclude Test of Time awards
- Award Name: If a paper won an award, which award it got. BP: Best Paper; HM: Honorable Mention; BCS: Best Case Study
- Award Track: The conference track that presented this award to the paper
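As a quick illustration of how these columns can be used, here is a sketch that assumes the sheet has been exported locally as a hypothetical `papers.csv` with exactly the column names above, and that the boolean columns load as booleans:

```python
import pandas as pd

# Hypothetical local export of the "VIS PAPER 1990-2021" sheet
papers = pd.read_csv('papers.csv')

# Share of cross-country collaborations per year
share = papers.groupby('Year')['Cross-country Collaboration'].mean()
print(share.tail())

# Award-winning papers (Test of Time awards are already excluded in the data)
print(papers.loc[papers['Award'], ['Year', 'Title', 'Award Name']].head())
```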
VIS AUTHORS 1990-2021

- Year: The year this paper was published
- DOI: Paper DOI
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- Number of Authors: Number of authors
- Author Position: Author position
- Author Name: Author name
- OpenAlex Author ID: OpenAlex author ID
- Affiliation Name: Author affiliation name
- Affiliation Country Code: alpha-2 (ISO 3166) country code for affiliations
- Affiliation Type: The type of an affiliation, as defined by ROR
- Binary Type: The type of an affiliation, either education or non-education
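The Binary Type column makes it easy to measure non-academic participation over time. A sketch, assuming a hypothetical local export `authors.csv` of this sheet:

```python
import pandas as pd

# Hypothetical local export of the "VIS AUTHORS 1990-2021" sheet
authors = pd.read_csv('authors.csv')

# Fraction of author records with a non-education affiliation, by year
non_edu = authors['Binary Type'] == 'non-education'
print(non_edu.groupby(authors['Year']).mean().tail())
```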
VIS PAPER CONCEPTS

- Year: The year this paper was published
- DOI: Paper DOI
- Title: Paper title as shown on vispubdata and IEEE Xplore (for 2021 IEEE VIS papers)
- Number of Concepts: Number of concepts as shown on OpenAlex (as of June 2022)
- Index of Concept: Index of concept as shown on OpenAlex (as of June 2022)
- Concept: Concept name
- Concept ID: Concept ID on OpenAlex
- Wikidata: Link to the Wikidata page of a Concept
- Level: The level of this Concept as defined by OpenAlex. Level 0 indicates root Concepts like Computer Science and Psychology. The larger the number, the more granular a Concept is.
- Score: The score assigned to this Concept by OpenAlex. A higher score indicates this Concept is a better representation of a paper.
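Level and Score together let you pick out the dominant, reasonably specific topic of each paper. A sketch, again assuming a hypothetical local export `concepts.csv`:

```python
import pandas as pd

# Hypothetical local export of the "VIS PAPER CONCEPTS" sheet
concepts = pd.read_csv('concepts.csv')

# Keep non-root concepts above a score threshold (0.4 is arbitrary,
# chosen only for illustration), then take the best-scoring one per paper
specific = concepts[(concepts['Level'] >= 1) & (concepts['Score'] > 0.4)]
top = specific.sort_values('Score', ascending=False).groupby('DOI').head(1)
print(top[['DOI', 'Concept', 'Level', 'Score']].head())
```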
Google Scholar Citations

- Year: The year this paper was published
- DOI: Paper DOI
- IEEE Title: Paper title as shown on IEEE Xplore (as of June 2022)
- Title on Google Scholar: Paper title as shown on Google Scholar (as of June 2022)
- Citation Link: Link to papers citing a VIS paper on Google Scholar (as of June 2022)
- Citation Counts on Google Scholar: Citation counts on Google Scholar (as of June 2022)
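Because both this sheet and the paper sheet carry the DOI, the two citation sources are easy to compare. A sketch using the same hypothetical exports as above:

```python
import pandas as pd

papers = pd.read_csv('papers.csv')      # hypothetical export, see above
gscholar = pd.read_csv('gscholar.csv')  # hypothetical export of this sheet

# Distribution of the difference between the two citation sources
merged = papers.merge(gscholar, on='DOI')
diff = merged['Citation Counts on Google Scholar'] - merged['Number of Citations']
print(diff.describe())
```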
Large data
The `large` folder within `data/processed` is empty because GitHub does not allow uploading files larger than 100 MB. The large files are stored in the OSF repository at https://osf.io/zkvjm/ (OSF Storage -> large).
Dependencies
This project uses Python 3.8 with the following packages:

- snakemake
- pandas
- numpy
- matplotlib
- seaborn
- altair
- scikit-learn
- scipy
- plotnine
- beautifulsoup4
- selenium
- urllib3
- requests
- lxml

All packages can be installed with `pip install pkgname`, for example, `pip install scikit-learn`. For `lxml`, use `conda install -c anaconda lxml`. `snakemake` is used for the workflow. For details, see my tutorial on snakemake.
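If you have not used snakemake before, the idea is that each data-generation step is a rule with declared inputs and outputs, and snakemake runs whatever is out of date. A minimal hypothetical sketch (the rule, script, and raw-file names here are invented for illustration; the real rules live in the `workflow` folder):

```python
# Snakefile -- hypothetical sketch, for illustration only
rule all:
    input:
        "data/ht_class/ht_cleaned_paper_df.csv"

rule clean_paper_df:
    input:
        "data/raw/vispubdata.csv"  # invented file name
    output:
        "data/ht_class/ht_cleaned_paper_df.csv"
    shell:
        "python scripts/clean_paper_df.py {input} {output}"  # invented script name
```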
For citation analysis, we also used R. See `citation_analysis.R`.
For Python, we recommend conda and creating a virtual environment. After installing anaconda, you can create a virtual environment:

```bash
conda create --name 32vis python=3.8
conda activate 32vis
```

Then you can install packages with `conda` or `pip`. You can also use the `environment.yml` and `requirements.yml`, but they contain many packages that are not used at all.
Reproducibility
Our work is designed to be reproducible.
Re-generate data?
If you want to reproduce our work from the very beginning, after installing the necessary packages mentioned above, you can delete all folders in the `data` folder except for `raw` and `README.md`. Then:

```bash
conda activate 32vis
cd workflow
snakemake --cores 1
```
This will generate all data again. Please note that:

- We obtained data from the OpenAlex API (see the sketch after this list). However, OpenAlex updates its data every two weeks, so the data you get will differ from ours, and the degree of difference grows with time. For example, if you recreate the data ten years from now, it will look very different from ours.
- Crawling Google Scholar requires a human participant because of the reCAPTCHA security checks.
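For reference, a single OpenAlex works lookup uses the ID format described in the data dictionary above. A minimal sketch with requests (the three fields printed are standard fields of OpenAlex works records):

```python
import requests

# Fetch the metadata of one VIS paper from the OpenAlex API
work = requests.get('https://api.openalex.org/works/W3203914472').json()
print(work['title'], work['publication_year'], work['cited_by_count'])
```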
After all the data is obtained, you can run all files in `analyses_and_get_figures` to reproduce our results.
Okay with our current data?
If you don't plan to re-generate all the data but just want to reproduce the results based on the data we already have, you can simply run all files in `analyses_and_get_figures` directly.
Citation

```bibtex
@article{hao2022thirty,
  title={Thirty-two Years of IEEE VIS: Authors, Fields of Study and Citations},
  author={Hao, Hongtao and Cui, Yumian and Wang, Zhengxiang and Kim, Yea-Seul},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2022},
  doi={10.1109/TVCG.2022.3209422},
  publisher={IEEE}
}
```
Code Snippets
This script trains a logistic regression classifier on affiliation strings to predict affiliation country codes, and applies it to the merged author table:

```python
import sys
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from bs4 import BeautifulSoup


def get_simple_df(fname):
    """
    - remove nan,
    - get only two target columns, i.e., raw string and aff type
    - drop duplicates
    """
    raw_string = 'Raw Affiliation String'
    aff_type = 'First Institution Country Code'
    df = pd.read_csv(fname)
    df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
    df = df[[raw_string, aff_type]]
    df = df.drop_duplicates()
    return df


def get_df(cit_author, ref_author, oa_author):
    """concatenate, drop_duplicates, reset index, rename columns, factorize label_str

    Returns:
        the df used for model training and testing. It contains three columns:
        1. aff, which is pre-processed strings of affiliations
        2. label_str, which is country codes in strings,
        3. label, which is factorized version of country codes
    """
    df = pd.concat(
        [oa_author, ref_author, cit_author], ignore_index=True
    ).drop_duplicates().reset_index(drop=True)
    df.columns = ['aff', 'label_str']
    df = df.assign(label=pd.factorize(df['label_str'])[0])
    return df


def get_dicts(df):
    """get two dicts; id <--> cntry"""
    cntry_to_id = dict(zip(df.label_str, df.label))
    id_to_cntry = dict(zip(df.label, df.label_str))
    return cntry_to_id, id_to_cntry


def clean_text(text):
    """Takes a string and returns a string"""
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff


def logist_regression(df):
    '''
    Input:
        df: df
    Returns:
        logreg: logistic regression model
    '''
    X = df.aff
    y = df.label
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    logreg = Pipeline([
        ('vect', CountVectorizer(stop_words='english', min_df=5)),
        ('clf', LogisticRegression(max_iter=600)),
    ])
    print('model training now...')
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    target_names = list(set([id_to_cntry[x] for x in y_test]))
    f = open(CNTRY_CLASSIFICATION_REPORT, 'a')
    f.write('The following is the result for affiliation country code classification' + '\n')
    f.write('Test set accuracy %s' % accuracy_score(y_test_pred, y_test))
    f.write('\n')
    precision, recall, fscore, support = multi_score(
        y_test, y_test_pred, average='weighted')
    f.write('precision: {}'.format(precision))
    f.write('\n')
    f.write('recall: {}'.format(recall))
    f.write('\n')
    f.write('fscore: {}'.format(fscore))
    f.write('\n')
    f.write('support: {}'.format(support))
    f.write('\n')
    f.write('\n')
    f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
    # f.write(classification_report(y_test, y_test_pred, target_names=target_names))
    f.close()
    return logreg


def get_processed_merged_author(DF, LOGREG):
    '''
    Input:
        - DF: merged
        - LOGREG
    Returns:
        - DF with cntry classification results
    '''
    # clean text for affs to be predicted
    DF['IEEE Author Affiliation Filled_Processed'] = DF[
        'IEEE Author Affiliation Filled'].apply(clean_text)
    pred = LOGREG.predict(DF['IEEE Author Affiliation Filled_Processed'])
    results = [id_to_cntry[x] for x in pred]
    DF['country_code_results'] = results
    # if I have handcoded the country codes, use those first
    DF = DF.assign(country_code_results_updated=np.where(
        DF['First Institution Country Code By Hand'].notnull(),
        DF['First Institution Country Code By Hand'],
        DF['country_code_results']
    ))
    return DF


if __name__ == '__main__':
    CIT_AUTHOR = sys.argv[1]
    REF_AUTHOR = sys.argv[2]
    # openalex author df for VIS papers:
    OA_AUTHOR = sys.argv[3]
    MERGED_AUTHOR = sys.argv[4]
    MERGED_CNTRY_PREDICTED = sys.argv[5]
    CNTRY_CLASSIFICATION_REPORT = sys.argv[6]
    # load datasets:
    cit_author = get_simple_df(CIT_AUTHOR)
    ref_author = get_simple_df(REF_AUTHOR)
    oa_author = get_simple_df(OA_AUTHOR)
    merged = pd.read_csv(MERGED_AUTHOR)
    # get df for model training and testing
    df = get_df(cit_author, ref_author, oa_author)
    # clean affiliation texts
    df['aff'] = df['aff'].apply(clean_text)
    df = df.drop_duplicates()
    f = open(CNTRY_CLASSIFICATION_REPORT, 'a')
    f.write(f'there are {df.shape[0]} training examples in country classification.')
    f.write('\n')
    f.close()
    # get dicts
    cntry_to_id, id_to_cntry = get_dicts(df)
    # get logreg
    logreg = logist_regression(df)
    merged_processed = get_processed_merged_author(merged, logreg)
    # export merged_processed
    cols_to_keep = [
        'Year', 'DOI', 'Title',
        'IEEE Number of Authors', 'IEEE Author Position', 'IEEE Author Name',
        'OpenAlex Author ID', 'IEEE Author Affiliation Filled',
        'country_code_results_updated',
    ]
    col_renamer = {
        'Year': 'Year',
        'DOI': 'DOI',
        'Title': 'Title',
        'IEEE Number of Authors': 'Number of Authors',
        'IEEE Author Position': 'Author Position',
        'IEEE Author Name': 'Author Name',
        'OpenAlex Author ID': 'OpenAlex Author ID',
        'IEEE Author Affiliation Filled': 'Affiliation Name',
        'country_code_results_updated': 'Affiliation Country Code',
    }
    merged_cntry_predicted = merged_processed[cols_to_keep]
    merged_cntry_predicted.rename(columns=col_renamer).to_csv(
        MERGED_CNTRY_PREDICTED, index=False)
```
This script does the same for affiliation types, training both a multiclass and a binary (education vs. non-education) classifier:

```python
import sys
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as multi_score
from bs4 import BeautifulSoup


def get_simple_df(fname):
    """
    - remove nan,
    - get only two target columns, i.e., raw string and aff type
    - drop duplicates
    """
    raw_string = 'Raw Affiliation String'
    aff_type = 'First Institution Type'
    df = pd.read_csv(fname)
    df = df[(df[raw_string].notnull()) & (df[aff_type].notnull())]
    df = df[[raw_string, aff_type]]
    df = df.drop_duplicates()
    return df


def get_df(cit_author, ref_author, oa_author):
    """concatenate, drop_duplicates, reset index, rename columns, factorize label_str

    Returns:
        the df used for model training and testing. It contains five columns:
        1. aff, which is pre-processed strings of affiliations
        2. label_str, which is country codes in strings,
        3. label, which is factorized version of country codes
        4. binary_label_str
        5. binary_label
    """
    df = pd.concat(
        [oa_author, ref_author, cit_author], ignore_index=True
    ).drop_duplicates().reset_index(drop=True)
    df.columns = ['aff', 'label_str']
    df = df.assign(label=pd.factorize(df['label_str'])[0])
    df = df.assign(binary_label_str=np.where(
        df.label_str == 'education', 'education', 'non-education'))
    df = df.assign(binary_label=pd.factorize(df['binary_label_str'])[0])
    return df


def get_dicts(df):
    """get four dicts; id <--> type, for both binary and multiclass"""
    multi_type_to_id = dict(zip(df.label_str, df.label))
    id_to_multi_type = dict(zip(df.label, df.label_str))
    binary_type_to_id = dict(zip(df.binary_label_str, df.binary_label))
    id_to_binary_type = dict(zip(df.binary_label, df.binary_label_str))
    return multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type


def clean_text(text):
    """Takes a string and returns a string"""
    # remove html tags, lowercase, remove nonsense, remove non-letter
    aff = BeautifulSoup(text, "lxml").text
    aff = aff.lower()
    aff = re.sub(r'xa0|#n#‡#n#|#tab#|#r#|\[|\]', "", aff)
    aff = re.sub(r'[^a-z]+', ' ', aff)
    return aff


def logist_regression(df, LABEL):
    '''
    Input:
        df: df
        LABEL: 'label' if multiclass and 'binary_label' if binary
    Returns:
        logreg: logistic regression classifier (model)
    '''
    X = df.aff
    y = df[LABEL]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    logreg = Pipeline([
        ('vect', CountVectorizer(stop_words='english', min_df=2)),
        ('clf', LogisticRegression(max_iter=600)),
    ])
    print('model training now...')
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    target_names = list(set(df.label_str)) if LABEL == 'label' else list(set(df.binary_label_str))
    logreg_type = 'multiclass classification' if LABEL == 'label' else 'binary classification'
    f = open(TYPE_CLASSIFICATION_REPORT, 'a')
    f.write('The following is the result for aff type' + ' : ' + logreg_type + '\n')
    f.write('Test set accuracy %s' % accuracy_score(y_test, y_test_pred))
    f.write('\n')
    precision, recall, fscore, support = multi_score(
        y_test, y_test_pred, average='weighted')
    f.write('precision: {}'.format(precision))
    f.write('\n')
    f.write('recall: {}'.format(recall))
    f.write('\n')
    f.write('fscore: {}'.format(fscore))
    f.write('\n')
    f.write('support: {}'.format(support))
    f.write('\n')
    f.write('\n')
    f.write('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))
    # f.write('\n')
    # f.write(classification_report(y_test, y_test_pred, target_names=target_names))
    f.write('\n')
    f.write('\n')
    f.close()
    return logreg


def get_processed_merged_author(DF, LOGREG_MULTI, LOGREG_BINARY):
    '''
    Input:
        - DF: merged
        - LOGREG_MULTI
        - LOGREG_BINARY
    Returns:
        - DF with binary and multiclass classification results
    '''
    # clean text for affs to be predicted
    DF['IEEE Author Affiliation Filled_Processed'] = DF[
        'IEEE Author Affiliation Filled'].apply(clean_text)
    pred_binary = LOGREG_BINARY.predict(DF['IEEE Author Affiliation Filled_Processed'])
    pred_binary_type = [id_to_binary_type[x] for x in pred_binary]
    pred_multi = LOGREG_MULTI.predict(DF['IEEE Author Affiliation Filled_Processed'])
    pred_multi_type = [id_to_multi_type[x] for x in pred_multi]
    DF['aff_type_results_binary'] = pred_binary_type
    DF['aff_type_results_multiclass'] = pred_multi_type
    # use type by hand if exists
    DF = DF.assign(aff_type_results_binary_updated=np.where(
        DF['Binary Institution Type By Hand'].notnull(),
        DF['Binary Institution Type By Hand'],
        DF['aff_type_results_binary']
    ))
    # use type by hand if exists
    DF = DF.assign(aff_type_results_multiclass_updated=np.where(
        DF['First Institution Type By Hand'].notnull(),
        DF['First Institution Type By Hand'],
        DF['aff_type_results_multiclass']
    ))
    return DF


if __name__ == '__main__':
    CIT_AUTHOR = sys.argv[1]
    REF_AUTHOR = sys.argv[2]
    # openalex author df for VIS papers:
    OA_AUTHOR = sys.argv[3]
    MERGED_AUTHOR = sys.argv[4]
    MERGED_AFF_TYPE_PREDICTED = sys.argv[5]
    TYPE_CLASSIFICATION_REPORT = sys.argv[6]
    # load datasets:
    cit_author = get_simple_df(CIT_AUTHOR)
    ref_author = get_simple_df(REF_AUTHOR)
    oa_author = get_simple_df(OA_AUTHOR)
    merged = pd.read_csv(MERGED_AUTHOR)
    # get df for model training and testing
    df = get_df(cit_author, ref_author, oa_author)
    # clean affiliation texts
    df['aff'] = df['aff'].apply(clean_text)
    # drop duplicates after text pre-processing
    df = df.drop_duplicates()
    f = open(TYPE_CLASSIFICATION_REPORT, 'a')
    f.write(f'there are {df.shape[0]} training examples in aff type classification.')
    f.write('\n')
    f.write('\n')
    f.close()
    # get dicts
    multi_type_to_id, id_to_multi_type, binary_type_to_id, id_to_binary_type = get_dicts(df)
    # get logreg
    logreg_multi = logist_regression(df, 'label')
    logreg_binary = logist_regression(df, 'binary_label')
    merged_processed = get_processed_merged_author(merged, logreg_multi, logreg_binary)
    # export merged_processed
    cols_to_keep = [
        'Year', 'DOI', 'Title',
        'IEEE Number of Authors', 'IEEE Author Position', 'IEEE Author Name',
        'OpenAlex Author ID', 'IEEE Author Affiliation Filled',
        'aff_type_results_multiclass_updated', 'aff_type_results_binary_updated',
    ]
    col_renamer = {
        'Year': 'Year',
        'DOI': 'DOI',
        'Title': 'Title',
        'IEEE Number of Authors': 'Number of Authors',
        'IEEE Author Position': 'Author Position',
        'IEEE Author Name': 'Author Name',
        'OpenAlex Author ID': 'OpenAlex Author ID',
        'IEEE Author Affiliation Filled': 'Affiliation Name',
        'aff_type_results_multiclass_updated': 'Multiclass Affiliation Type',
        'aff_type_results_binary_updated': 'Binary Affiliation Type',
    }
    merged_aff_type_predicted = merged_processed[cols_to_keep]
    merged_aff_type_predicted.rename(columns=col_renamer).to_csv(
        MERGED_AFF_TYPE_PREDICTED, index=False)
```
This script crawls Google Scholar with Selenium to collect citation counts and citation links for each VIS paper:

```python
import sys
import time
import os
import random
import re
import csv
import urllib.parse
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementNotInteractableException

PAPERS_TO_STUDY = sys.argv[1]
IEEE_PAPER_DF = sys.argv[2]
GSCHOLAR_DATA = sys.argv[3]


def specify_driver_options():
    """specify driver options"""
    options = Options()
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference(
        "browser.helperApps.neverAsk.saveToDisk",
        "text/plain, text/txt, application/plain, application/txt")
    # return the configured options so they can be passed to the driver
    return options


def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def get_dicts(INPUT):
    # INPUT here is ieee_paper_df
    # get year_dict and title_dict
    df = pd.read_csv(INPUT)
    dois = df.loc[:, "DOI"].tolist()
    titles = df.loc[:, "IEEE Title"].tolist()
    years = df.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def get_gscholar_data_by_title(doi, doi_index):
    # TITLE QUERY
    if doi in title_recode_dict.keys():
        title = title_recode_dict[doi]
    else:
        title = doi_title_dict[doi]
    title_to_query = urllib.parse.quote_plus(title)
    doi_to_query = urllib.parse.quote_plus(doi)
    query_string = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C50&q='
    # IF DOI IN TO_QUERY_BY_DOI, USE DOI QUERY
    if doi in to_query_by_doi:
        driver.get(query_string + doi_to_query + '&btnG=')
    # IF NOT, USE TITLE QUERY
    else:
        driver.get(query_string + title_to_query + '&btnG=')
    gs_paper_e = wait.until(EC.presence_of_element_located((
        By.CSS_SELECTOR, 'h3.gs_rt')))
    gs_paper_title = gs_paper_e.text
    gs_citation_e = wait.until(EC.presence_of_element_located((
        By.XPATH, '//div[@class="gs_fl"]//child::a[3]')))
    citation_link = gs_citation_e.get_attribute('href')
    citation_count_string = gs_citation_e.get_attribute('innerHTML')
    if citation_count_string == "Related articles":
        gs_citation_count = 0
    else:
        gs_citation_count = int(re.findall(r'\d+', citation_count_string)[0])
    gscholar_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'IEEE Title': title,
        'Title on Google Scholar': gs_paper_title,
        'Citation Link': citation_link,
        'Citation Counts on Google Scholar': gs_citation_count,
    }
    gscholar_dict_list.append(gscholar_dict)


def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_gscholar_data_by_title(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.2 + random.uniform(0, 0.2))
    driver.close()
    driver.quit()


if __name__ == '__main__':
    driver = webdriver.Firefox(options=specify_driver_options())
    wait = WebDriverWait(driver, 90)
    DOIS = read_txt(PAPERS_TO_STUDY)
    doi_year_dict, doi_title_dict = get_dicts(IEEE_PAPER_DF)
    random_dois = random.sample(DOIS, 10)
    random_dois.append('10.1109/INFVIS.2001.963279')
    gscholar_dict_list = []
    title_recode_dict = {
        # If I don't change the title for querying, the results are wrong:
        # This is the real title on PDF:
        '10.1109/VISUAL.1999.809889':
            'Enabling classification and shading for 3 D texture mapping based '
            'volume rendering using OpenGL and extensions',
    }
    to_query_by_doi = [
        # If I query by title, the results are false:
        '10.1109/VISUAL.1993.398863',
        '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.1998.745315',
        '10.1109/INFVIS.2001.963282',
        '10.1109/VISUAL.1992.235194',
        '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348',
        '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900',
        '10.1109/VISUAL.2000.885719',
        '10.1109/TVCG.2021.3114849',
        '10.1109/VISUAL.1991.175771',
        '10.1109/INFVIS.2001.963279',
        '10.1109/INFVIS.2001.963295',
        '10.1109/VIS.1999.10000',
    ]
    main(DOIS)
    df = pd.DataFrame(gscholar_dict_list)
    df.to_csv(GSCHOLAR_DATA, index=False)
```
This script merges the predicted country codes and affiliation types and derives the paper-level collaboration indicators:

```python
import sys
import itertools
import pandas as pd
import numpy as np

MERGED_CNTRY_PREDICTED = sys.argv[1]
MERGED_AFF_TYPE_PREDICTED = sys.argv[2]
HT_CLEANED_AUTHOR_DF = sys.argv[3]


def get_cross_country_dic(df):
    # a paper is a cross-country collaboration if its authors'
    # affiliation country codes are not all the same
    cross_country_dic = {}
    for group in df.groupby('DOI'):
        DOI = group[0]
        country_codes = group[1]['Affiliation Country Code'].tolist()
        num_of_cntry = len(list(set(country_codes)))
        if num_of_cntry != 1:
            cross_country_dic[DOI] = True
        else:
            cross_country_dic[DOI] = False
    return cross_country_dic


def get_cross_type_dic(df):
    # a paper is a cross-type collaboration if its authors' binary
    # affiliation types (education vs. non-education) are not all the same
    cross_type_dic = {}
    for group in df.groupby('DOI'):
        DOI = group[0]
        types = group[1]['Binary Type'].tolist()
        num_of_types = len(list(set(types)))
        if num_of_types != 1:
            cross_type_dic[DOI] = True
        else:
            cross_type_dic[DOI] = False
    return cross_type_dic


if __name__ == '__main__':
    # load data
    cntry_df = pd.read_csv(MERGED_CNTRY_PREDICTED)
    type_df = pd.read_csv(MERGED_AFF_TYPE_PREDICTED)
    if cntry_df.shape[0] == type_df.shape[0]:
        print('cntry_df has the same length with type_df')
    # get the column of affiliation type
    multi_aff_types = type_df['Multiclass Affiliation Type']
    binary_aff_types = type_df['Binary Affiliation Type']
    # assign it to cntry_df and rename columns
    cntry_df = cntry_df.assign(multi_aff_type=multi_aff_types)
    cntry_df = cntry_df.assign(binary_aff_type=binary_aff_types)
    cntry_df.rename(
        columns={
            'multi_aff_type': 'Affiliation Type',
            'binary_aff_type': 'Binary Type',
        },
        inplace=True
    )
    df = cntry_df.copy()
    cross_country_dic = get_cross_country_dic(df)
    cross_type_dic = get_cross_type_dic(df)
    df['Cross-type Collaboration'] = df.DOI.apply(lambda x: cross_type_dic[x])
    df['International Collaboration'] = df.DOI.apply(lambda x: cross_country_dic[x])
    df.to_csv(HT_CLEANED_AUTHOR_DF, index=False)
```
This script assembles `ht_cleaned_paper_df.csv` by merging vispubdata, OpenAlex, author, Google Scholar, and award data:

```python
import sys
from functools import reduce
import pandas as pd
import numpy as np

PAPER_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
HT_CLEANED_AUTHOR_DF = sys.argv[4]
GSCHOLAR_DATA = sys.argv[5]
AWARD_PAPER_DF = sys.argv[6]
HT_CLEANED_PAPER_DF = sys.argv[7]


def get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY):
    cols = [
        'Conference', 'Year', 'Title', 'DOI',
        'FirstPage', 'LastPage', 'PaperType',
    ]
    vispd = VISPUBDATA_PLUS[
        VISPUBDATA_PLUS.DOI.isin(PAPER_TO_STUDY)].loc[:, cols].reset_index(drop=True)
    vispd.loc[vispd.Year == 2021, 'PaperType'] = 'J'
    return vispd


def get_alex(OPENALEX_PAPER_DF):
    cols = [
        'DOI', 'OpenAlex Year', 'OpenAlex Publication Date', 'OpenAlex ID',
        'OpenAlex Venue Name', 'OpenAlex First Page', 'OpenAlex Last Page',
        'Number of Pages', 'Number of References', 'Number of Concepts',
        'Number of Citations',
    ]
    alex = OPENALEX_PAPER_DF.loc[:, cols]
    return alex


def get_authors(HT_CLEANED_AUTHOR_DF):
    cols = [
        'DOI', 'Number of Authors', 'Cross-type Collaboration',
        'International Collaboration', 'With US Authors',
    ]
    # create the column of "With US Authors"
    for doi in list(set(HT_CLEANED_AUTHOR_DF.DOI)):
        if 'US' in HT_CLEANED_AUTHOR_DF[
                HT_CLEANED_AUTHOR_DF.DOI == doi]['Affiliation Country Code'].tolist():
            HT_CLEANED_AUTHOR_DF.loc[
                HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = True
        else:
            HT_CLEANED_AUTHOR_DF.loc[
                HT_CLEANED_AUTHOR_DF.DOI == doi, 'With US Authors'] = False
    HT_CLEANED_AUTHOR_DF.drop_duplicates(subset=['DOI'], inplace=True)
    authors = HT_CLEANED_AUTHOR_DF.loc[:, cols].reset_index(drop=True)
    # create the column of both cross-type and cross-country collaboration
    authors['Both Cross-type and Cross-country Collaboration'] = authors[
        'Cross-type Collaboration'] * authors['International Collaboration']
    # rename column
    authors.rename(
        columns={'International Collaboration': 'Cross-country Collaboration'},
        inplace=True
    )
    return authors


def get_gscholar(GSCHOLAR_DATA):
    cols = [
        'DOI', 'IEEE Title', 'Citation Counts on Google Scholar',
    ]
    gscholar = GSCHOLAR_DATA.loc[:, cols]
    return gscholar


def get_df_merged(dfs):
    df_merged = reduce(lambda left, right: pd.merge(left, right, on='DOI'), dfs)
    return df_merged


def get_award_dicts(AWARD_PAPER_DF):
    awards = AWARD_PAPER_DF[AWARD_PAPER_DF.Award != 'TT']
    kwargs = {'Track Updated': np.where(awards.Year == 2021, 'VIS', awards.Track)}
    awards = awards.assign(**kwargs)
    award_dois = awards.DOI.tolist()
    award_names = awards.Award.tolist()
    award_tracks = awards['Track Updated'].tolist()
    doi_award_name_dict = dict(zip(award_dois, award_names))
    doi_award_track_dict = dict(zip(award_dois, award_tracks))
    return award_dois, doi_award_name_dict, doi_award_track_dict


def get_df_final(df_merged, award_dois, doi_award_name_dict, doi_award_track_dict):
    df_merged['Award'] = df_merged['DOI'].apply(
        lambda x: True if x in award_dois else False)
    df_merged['Award Name'] = df_merged['DOI'].apply(
        lambda x: doi_award_name_dict[x] if x in award_dois else np.nan)
    df_merged['Award Track'] = df_merged['DOI'].apply(
        lambda x: doi_award_track_dict[x] if x in award_dois else np.nan)
    df_final = df_merged
    return df_final


def main():
    # process data
    vispd = get_vispd(VISPUBDATA_PLUS, PAPER_TO_STUDY)
    alex = get_alex(OPENALEX_PAPER_DF)
    authors = get_authors(HT_CLEANED_AUTHOR_DF)
    gscholar = get_gscholar(GSCHOLAR_DATA)
    # merge data
    dfs = [vispd, alex, authors, gscholar]
    df_merged = get_df_merged(dfs)
    # get award data
    award_dois, doi_award_name_dict, doi_award_track_dict = get_award_dicts(AWARD_PAPER_DF)
    df_final = get_df_final(
        df_merged, award_dois, doi_award_name_dict, doi_award_track_dict)
    # write to file
    df_final.to_csv(HT_CLEANED_PAPER_DF, index=False)


if __name__ == '__main__':
    # load data
    VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
    PAPER_TO_STUDY = pd.read_csv(PAPER_TO_STUDY, header=None)[0].tolist()
    OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
    HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
    GSCHOLAR_DATA = pd.read_csv(GSCHOLAR_DATA)
    AWARD_PAPER_DF = pd.read_csv(AWARD_PAPER_DF)
    main()
```
This script scrapes paper and author metadata from IEEE Xplore by parsing the metadata JSON embedded in each paper page:

```python
import sys
import json
import random
import time
import re
from io import StringIO
from html.parser import HTMLParser
import pandas as pd
import numpy as np
import requests, lxml
from bs4 import BeautifulSoup
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
IEEE_AUTHOR_DF = sys.argv[3]
IEEE_PAPER_DF = sys.argv[4]
PROBLEM_DOIS = sys.argv[5]


def get_s():
    # set retry if status codes in [500, 502, 503, 504, 429]
    # also return headers
    s = requests.Session()
    retries = Retry(
        total=5,
        backoff_factor=0.1,
        status_forcelist=[500, 502, 503, 504, 429],
    )
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def get_response(URL):
    response = s.get(url=URL, headers=headers)
    while response.status_code != 200:
        print(f'response status code is {response.status_code}. retrying now...')
        time.sleep(5)
        response = s.get(url=URL, headers=headers)
    return response


def get_soup(RESPONSE):
    html = RESPONSE.text
    soup = BeautifulSoup(html, 'lxml')
    return soup


def get_j(DOI, SOUP):
    if DOI != '10.1109/VIS.1999.10000':
        meta_str = SOUP.find_all('script')[11].string.rsplit(
            'xplGlobal.document.metadata=')[1].rsplit(
            'xplGlobal.document.userLoggedIn=')[0]
        # delete anything after the last `}`
        meta_str = meta_str.replace(re.findall(r'[^\}]+$', meta_str)[0], '')
        j = json.loads(meta_str)
    else:
        j = None
    return j


# strip html tags and entities in titles
# source: https://stackoverflow.com/a/925630
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


# def get_ieee_title(J):
#     # get ieee paper title
#     title_raw = J['title']
#     title = strip_tags(title_raw)
#     return title


def update_paper_dict_list(J, DOI):
    if DOI != '10.1109/VIS.1999.10000':
        title_raw = J['title']
        ieee_title = strip_tags(title_raw)
        ieee_doi = J['doi']
    else:
        ieee_title = doi_title_dict[DOI]
        ieee_doi = DOI
    paper_dict = {
        'Year': doi_year_dict[DOI],
        'DOI': DOI,
        'Title': doi_title_dict[DOI],
        'IEEE Title': ieee_title,
        'IEEE DOI': ieee_doi,
    }
    paper_dict_list.append(paper_dict)


def update_author_dict_list(J, DOI):
    AUTHOR_JSON = J['authors']
    for i in AUTHOR_JSON:
        try:
            first_name = i['firstName']
        except:
            first_name = None
        try:
            last_name = i['lastName']
        except:
            last_name = None
        try:
            author_name = i['name']
        except:
            author_name = None
        author_num = len(AUTHOR_JSON)
        author_position = AUTHOR_JSON.index(i) + 1
        try:
            affiliation_element = i['affiliation']
            affiliation_name = affiliation_element[0]
            affiliation_num = len(affiliation_element)
            one_affiliation = True if affiliation_num == 1 else False
        except:
            affiliation_name = None
            affiliation_num = None
            one_affiliation = None
        try:
            author_id = 'https://ieeexplore.ieee.org/author/' + i['id']
        except:
            author_id = None
        author_dict = {
            'Year': doi_year_dict[DOI],
            'DOI': DOI,
            'Title': doi_title_dict[DOI],
            # 'IEEE Title': IEEE_TITLE,
            # 'First Name': first_name,
            # 'Last Name': last_name,
            'Number of Authors': author_num,
            'Author Position': author_position,
            'Author Name': author_name,
            'Author ID': author_id,
            'Author Affiliation': affiliation_name,
            # 'Number of Affiliations': affiliation_num,
            'One Affiliation': one_affiliation,
        }
        author_dict_list.append(author_dict)


def get_empty_author_dict(DOI):
    author_dict = {
        'Year': doi_year_dict[DOI],
        'DOI': DOI,
        'Title': doi_title_dict[DOI],
    }
    author_dict_list.append(author_dict)


def main(DOIS):
    for DOI in DOIS:
        doi_index = DOIS.index(DOI) + 1
        url = 'https://doi.org/' + DOI
        response = get_response(url)
        soup = get_soup(response)
        j = get_j(DOI, soup)
        update_paper_dict_list(j, DOI)
        try:
            if DOI != '10.1109/VIS.1999.10000':
                update_author_dict_list(j, DOI)
            else:
                get_empty_author_dict(DOI)
        except:
            problem_dois_list.append(DOI)
            print(f'something wrong with {DOI}')
        time.sleep(0.4 + random.uniform(0, 0.4))
        print(f'{doi_index} is done')


if __name__ == '__main__':
    s = get_s()[0]
    headers = get_s()[1]
    PAPERS = pd.read_csv(PAPERS_TO_STUDY, header=None)
    DOIS = PAPERS[0].tolist()
    random_dois = random.sample(DOIS, 10)
    random_dois.append('10.1109/VIS.1999.10000')
    doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
    }
    author_dict_list = []
    paper_dict_list = []
    problem_dois_list = []
    # main(random_dois)
    main(DOIS)
    author_df = pd.DataFrame(author_dict_list)
    paper_df = pd.DataFrame(paper_dict_list)
    author_df.to_csv(IEEE_AUTHOR_DF, index=False)
    paper_df.to_csv(IEEE_PAPER_DF, index=False)
    with open(PROBLEM_DOIS, 'w') as f:
        for doi in problem_dois_list:
            f.write("%s\n" % doi)
```
This script merges the IEEE and OpenAlex author tables, fuzzy-matching author names and hand-correcting problematic records:

```python
import sys
import re
import csv
import difflib
import pandas as pd
import numpy as np

IEEE_AUTHOR = sys.argv[1]
OPENALEX_AUTHOR = sys.argv[2]
PAPERS_TO_STUDY = sys.argv[3]
VISPUBDATA = sys.argv[4]
MERGED_AUTHOR_DF = sys.argv[5]


def get_dicts(VISPUBDATA):
    # get year_dict and title_dict
    vispd = pd.read_csv(VISPUBDATA)
    dois = vispd.loc[:, "DOI"].tolist()
    titles = vispd.loc[:, "Title"].tolist()
    years = vispd.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return doi_year_dict, doi_title_dict


def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def update_ieee_orig(DF):  # DF here is ieee_orig
    """update ieee_orig

    ieee_orig is wrong in '10.1109/TVCG.2008.157' as it contains an additional
    author that shouldn't be there; also, ieee_orig lacks author info for
    '10.1109/VIS.1999.10000'. What this function does is to delete the
    additional author in '10.1109/TVCG.2008.157' and update info in that paper.
    Then, I added author data manually for '10.1109/VIS.1999.10000'.
    """
    DF = DF.drop(DF[DF.DOI == '10.1109/VIS.1999.10000'].index)
    row_to_drop = DF.index[DF.DOI == '10.1109/TVCG.2008.157'].tolist()[0]
    df_dropped = DF.drop([row_to_drop])
    df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Number of Authors'] -= 1
    df_dropped.loc[df_dropped.DOI == '10.1109/TVCG.2008.157', 'Author Position'] -= 1.0
    df = df_dropped
    FILL_DATA = [
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 1,
            'Author Name': 'Daniel Cohen-Or',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        },
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 2,
            'Author Name': 'David Levin',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        },
        {
            'Year': 1999,
            'DOI': '10.1109/VIS.1999.10000',
            'Title': 'Progressive Compression of Arbitrary Triangular Meshes',
            'Number of Authors': 3,
            'Author Position': 3,
            'Author Name': 'Offir Remez',
            'Author ID': np.NaN,
            'Author Affiliation': 'Tel Aviv University',
            'One Affiliation': True,
        }
    ]
    fill_data_df = pd.DataFrame(FILL_DATA)
    df = df.append(fill_data_df, ignore_index=True)
    return df


def get_diff_dois(IEEE, ALEX):  # ieee, alex
    # return a list of DOIs where alex is wrong in Number of Authors
    DOIS = list(set(IEEE.DOI))
    diff_dois = []
    for doi in DOIS:
        ieee_n = IEEE[IEEE.DOI == doi]['Number of Authors'].tolist()[0]
        alex_n = ALEX[ALEX.DOI == doi]['Number of Authors'].tolist()[0]
        if ieee_n != alex_n:
            diff_dois.append(doi)
    return diff_dois


def get_alex_new(IEEE, ALEX, DIFF_DOIS):
    """
    For DOIs where alex is wrong in Number of Authors, get correct data from
    IEEE first. Drop the rows where alex is wrong from alex, and append the
    correct ieee data to alex_dropped.

    Returns:
        alex_new, where data of Number of Authors is correct
    """
    df_to_append = IEEE[IEEE.DOI.isin(DIFF_DOIS)].iloc[:, 0:6]
    alex_dropped = ALEX.drop(ALEX[ALEX.DOI.isin(DIFF_DOIS)].index)
    alex_new = alex_dropped.append(df_to_append, ignore_index=True)
    return alex_new


def get_sorted_dfs(IEEE, ALEX_NEW, PAPERS):
    """sort ieee and alex author df by paper index and author position

    I added a variable 'Paper Index' to both ieee and alex_new. I also added
    a prefix of 'IEEE ' in ieee. Then I sort the two datasets by 'Paper Index'
    and 'Author Position'.

    Returns:
        two dataframes, ieee_sorted, and alex_new_sorted
    """
    IEEE['Paper Index'] = [PAPERS.index(i) for i in IEEE.DOI.tolist()]
    ALEX_NEW['Paper Index'] = [PAPERS.index(i) for i in ALEX_NEW.DOI.tolist()]
    IEEE = IEEE.add_prefix('IEEE ')
    alex_new_sorted = ALEX_NEW.sort_values(
        by=['Paper Index', 'Author Position'],
    ).reset_index(drop=True)
    ieee_sorted = IEEE.sort_values(
        by=['IEEE Paper Index', 'IEEE Author Position'],
    ).reset_index(drop=True)
    return ieee_sorted, alex_new_sorted


def get_concat_df(IEEE, ALEX, PAPERS):  # ieee_sorted, alex_sorted
    """check https://stackoverflow.com/a/13680953 for details"""
    fuzzy_match_df_list = []
    mismatch_doi_list = []
    for doi in PAPERS:
        df1 = IEEE[IEEE['IEEE DOI'] == doi]
        df2 = ALEX[ALEX['DOI'] == doi]
        try:
            kwargs = {'IEEE Author Name': df2['Author Name'].apply(
                lambda x: difflib.get_close_matches(
                    x, df1['IEEE Author Name'], cutoff=0.6)[0])
            }
        except:
            kwargs = {'IEEE Author Name': df1['IEEE Author Name']}
            mismatch_doi_list.append(doi)
        df2 = df2.assign(**kwargs)
        df = df1.merge(df2, on='IEEE Author Name', how='inner')
        fuzzy_match_df_list.append(df)
    print(f'in {len(mismatch_doi_list)} dois, fuzzy matching was not successful, '
          'so I assumed author position in merging')
    df = pd.concat(fuzzy_match_df_list, ignore_index=True)
    return df


def flatten(t):
    """convert list of lists to a list of items
    source: https://stackoverflow.com/a/952952
    """
    return [item for sublist in t for item in sublist]


def update_with_vispubdata_author_data(VISPD, DF):  # vispd, concat_df
    ieee_wrong = [
        '10.1109/INFVIS.2005.1532150',
        '10.1109/VISUAL.2005.1532819',
        '10.1109/VISUAL.2005.1532794',
        '10.1109/VISUAL.1992.235178',
    ]
    correct_author_num = [5, 2, 5, 4]
    correct_author_num_dict = dict(zip(ieee_wrong, correct_author_num))
    vispd_names = VISPD.loc[VISPD.DOI.isin(ieee_wrong), 'AuthorNames-Deduped'].tolist()
    dois = flatten([np.repeat(doi, correct_author_num_dict[doi]) for doi in ieee_wrong])
    years = [doi_year_dict[x] for x in dois]
    titles = [doi_title_dict[x] for x in dois]
    author_names = flatten([x.split(';') for x in vispd_names])
    author_nums = flatten([np.repeat(i, i) for i in correct_author_num])
    author_positions = flatten([range(1, i + 1) for i in correct_author_num])
    paper_index = [papers.index(doi) for doi in dois]
    DF_TO_FILL = pd.DataFrame({
        'IEEE DOI': dois,
        'DOI': dois,
        'IEEE Year': years,
        'Year': years,
        'IEEE Title': titles,
        'Title': titles,
        'IEEE Number of Authors': author_nums,
        'IEEE Author Position': author_positions,
        'IEEE Author Name': author_names,
        'Number of Authors': author_nums,
        'Author Position': author_positions,
        'Author Name': author_names,
        'IEEE Paper Index': paper_index,
        'Paper Index': paper_index,
    })
    df_dropped = DF.drop(DF[DF['IEEE DOI'].isin(ieee_wrong)].index)
    df_new = df_dropped.append(DF_TO_FILL, ignore_index=True)
    df_new = df_new.sort_values(
        by=['IEEE Paper Index', 'IEEE Author Position'],
    ).reset_index(drop=True)
    return df_new


def update_country_code(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'First Institution Country Code By Hand'] = NEW_DATA
    # this is to change openalex author names to be the same as IEEE author names
    # DF.loc[DF['DOI'] == DOI, 'Author Name'] = DF.loc[DF['DOI'] == DOI, 'IEEE Author Name']
    return DF


def update_country_code_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING,
           'First Institution Country Code By Hand'] = NEW_DATA
    return DF


def update_type(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'First Institution Type By Hand'] = NEW_DATA
    return DF


def update_type_by_raw_string(DF, RAW_STRING, NEW_DATA):
    DF.loc[DF['Raw Affiliation String'] == RAW_STRING,
           'First Institution Type By Hand'] = NEW_DATA
    return DF


def update_affiliations(DF, DOI, NEW_DATA):
    # update both ieee author affiliation, alex first institution names and raw string
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Affiliation'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'First Institution Name'] = NEW_DATA
    DF.loc[DF['DOI'] == DOI, 'Raw Affiliation String'] = NEW_DATA
    return DF


def update_author_name(DF, DOI, NEW_DATA):
    DF.loc[DF['DOI'] == DOI, 'IEEE Author Name'] = NEW_DATA
    return DF


def update_concat_df(DF):  # DF here is concat_df
    """Update data for specific DOIs

    Return:
        still concat_df, but updated
    """
    # '10.1109/VISUAL.1996.568115',
    update_country_code(DF, '10.1109/VISUAL.1996.568115', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1996.568115', ['company'] * 2 + ['facility'])
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568115',
        ['MRJ, Inc'] * 2 + ['NASA Ames Research Center'])
    # '10.1109/VISUAL.2000.885735'
    update_country_code(DF, '10.1109/VISUAL.2000.885735', np.repeat('NL', 6))
    update_type(DF, '10.1109/VISUAL.2000.885735', ['government'] * 2 + ['education'] * 4)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885735',
        np.append(
            np.repeat('Center for Mathematics and Computer Science, CWI, Amsterdam, Netherlands', 2),
            np.repeat('Swammerdam Inst. for Life Sciences, BioCentrum Amsterdam, Amsterdam, Netherlands', 4)
        ))
    # '10.1109/VISUAL.1996.568143',
    update_country_code(DF, '10.1109/VISUAL.1996.568143', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.1996.568143', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568143',
        ['Ohio State University, Columbus, OH, USA'] * 6)
    # '10.1109/VISUAL.1999.809936',
    update_country_code(DF, '10.1109/VISUAL.1999.809936', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1999.809936', ['education'] * 3)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809936',
        ['Worcester Polytechnic Institute, Worcester, MA, USA'] * 3)
    # '10.1109/INFVIS.2002.1173147',
    # IEEE Xplore got author name wrong
    update_country_code(DF, '10.1109/INFVIS.2002.1173147', ['SE', 'US', 'SE'])
    update_type(DF, '10.1109/INFVIS.2002.1173147', ['education'] * 3)
    update_affiliations(
        DF, '10.1109/INFVIS.2002.1173147',
        [
            'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
            'Dept. of Psychology, Indiana University, Bloomington, Indiana, USA',
            'Dept. of Information Science, Uppsala University, Uppsala, Sweden',
        ])
    update_author_name(
        DF, '10.1109/INFVIS.2002.1173147',
        ['M. Lind', 'G.P. Bingham', 'C. Forsell'])
    # '10.1109/VISUAL.1992.235175',
    update_country_code(DF, '10.1109/VISUAL.1992.235175', ['US'] * 12)
    update_type(
        DF, '10.1109/VISUAL.1992.235175',
        ['company'] * 3 + ['government'] * 2 + ['education'] * 6 + ['company'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.1992.235175',
        [
            'Unisys Corporation',
            'Sterling Software',
            'Unisys Corporation',
            'U.S. Environmental Protection Agency, United States',
            'U.S. Environmental Protection Agency',
            'University of Alabama in Huntsville (UAH), United States',
            'Florida State University, United States',
            'Florida State University, United States',
            'University of Wisconsin, Madison, WI, United States',
            'University of Wisconsin, Madison, WI, United States',
            'University of Wisconsin, Madison, WI, United States',
            'IBM T.J. Watson Research Center, United States',
        ])
    # '10.1109/TVCG.2006.182',
    update_country_code(DF, '10.1109/TVCG.2006.182', ['US'] * 5)
    update_type(DF, '10.1109/TVCG.2006.182', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/TVCG.2006.182',
        ['Brown University, United States'] * 5)
    # '10.1109/TVCG.2015.2467971',
    update_country_code(DF, '10.1109/TVCG.2015.2467971', ['US'] * 5)
    update_type(DF, '10.1109/TVCG.2015.2467971', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/TVCG.2015.2467971',
        ['University of North Carolina at Charlotte, NC, United States'] * 5)
    # '10.1109/SciVis.2015.7429489',
    # author affiliations listed on ieee are all WRONG!!!
    # I found the authors' correct affiliation on their ieee author id pages
    update_country_code(DF, '10.1109/SciVis.2015.7429489', ['DE'] * 5)
    update_type(DF, '10.1109/SciVis.2015.7429489', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/SciVis.2015.7429489',
        ['Technical University of Munich, Germany'] * 5)
    # '10.1109/VISUAL.2005.1532821',
    update_country_code(DF, '10.1109/VISUAL.2005.1532821', ['AT', 'HR', 'AT', 'AT', 'US'])
    update_type(DF, '10.1109/VISUAL.2005.1532821', ['company'] * 4 + ['education'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2005.1532821',
        ['VRVis Research Center Vienna, Austria'] + ['AVL-AST Zagreb, Croatia'] +
        ['VRVis Research Center Vienna, Austria'] * 2 + ['Virginia Tech'])
    # '10.1109/VISUAL.2000.885692',
    update_country_code(DF, '10.1109/VISUAL.2000.885692', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.2000.885692', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885692',
        ['University of Utah, Salt Lake City, UT, USA'] * 4 +
        ['Vanderbilt University, USA'] +
        ['University of Utah, Salt Lake City, UT, USA'])
    # '10.1109/VISUAL.1999.809912',
    update_country_code(DF, '10.1109/VISUAL.1999.809912', ['DE'] * 4)
    update_type(DF, '10.1109/VISUAL.1999.809912', ['education'] * 2 + ['healthcare'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809912',
        ['WSUGRIS, University of Tubingen, Tubingen, Germany'] * 2 +
        ['Department of Neuroradiology, University Hospital Tubingen, Tubingen, Germany'] * 2)
    # '10.1109/VISUAL.1999.809929',
    update_country_code(DF, '10.1109/VISUAL.1999.809929', ['US'] * 4)
    update_type(DF, '10.1109/VISUAL.1999.809929', ['company'] * 4)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809929',
        ['IBM T.J. Watson Research Center, United States'] * 3 + ['UBS Group AG'])
    # '10.1109/VISUAL.1999.809884',
    # In this paper, openalex got country wrong and ieee got some of the affiliation wrong
    update_country_code(DF, '10.1109/VISUAL.1999.809884', ['DE'] * 5)
    update_type(DF, '10.1109/VISUAL.1999.809884', ['nonprofit'] * 4 + ['education'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809884',
        ['German National Research Centre for Information Technology, Germany'] * 4 +
        ['Department of Physics & Astronomy, University of Heidelberg, Germany'])
    # '10.1109/VISUAL.1999.809920',
    # openalex got country wrong
    update_country_code(DF, '10.1109/VISUAL.1999.809920', ['DE'] * 5)
    # '10.1109/VISUAL.1993.398911',
    # openalex got this paper country wrong for the last two authors
    update_country_code(DF, '10.1109/VISUAL.1993.398911', ['RU'] * 4 + ['DE'] * 2)
    # '10.1109/VISUAL.2005.1532816',
    # ieee xplore got author positions and author affiliations wrong
    update_author_name(
        DF, '10.1109/VISUAL.2005.1532816',
        [
            'Gregor Schlosser',
            'Jürgen Hesser',
            'Frank Zeilfelder',
            'Christian Rossl',
            'Reinhard Manner',
            'Gunther Nurnberger',
            'Hans-Peter Seidel',
        ])
    update_country_code(DF, '10.1109/VISUAL.2005.1532816', ['DE'] * 7)
    update_type(
        DF, '10.1109/VISUAL.2005.1532816',
        ['education'] * 3 + ['nonprofit'] * 1 + ['education'] * 2 + ['nonprofit'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2005.1532816',
        ['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany'] * 2 +
        ['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
        ['Max Planck Institut für Informatik, Saarbruecken, Germany'] +
        ['ICM, Universitäten Mannheim und Heidelberg, Mannheim, Germany'] +
        ['Institut für Mathematik, Universität Mannheim, Mannheim, Germany'] +
        ['Max Planck Institut für Informatik, Saarbruecken, Germany'])
    # '10.1109/VAST.2016.7883507',
    # This is the paper where i don't have ieee author affiliation or openalex raw string,
    # but i have openalex first institution name.
    # Another note: Information on IEEE about the first two authors of this paper is WRONG!
    update_country_code(DF, '10.1109/VAST.2016.7883507', ['DE'] * 5)
    update_type(DF, '10.1109/VAST.2016.7883507', ['education'] * 5)
    update_affiliations(
        DF, '10.1109/VAST.2016.7883507',
        ['University of Stuttgart, Germany'] * 5)
    # '10.1109/VISUAL.2004.38',
    update_country_code(DF, '10.1109/VISUAL.2004.38', ['CN'] * 1 + ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.2004.38', ['education'] * 3 + ['company'] * 1)
    update_affiliations(
        DF, '10.1109/VISUAL.2004.38',
        ['Zhejiang University, China'] +
        ['Carnegie Mellon University, United States'] +
        ['Massachusetts Institute Of Technology, United States'] +
        ['Mitsubishi Electric Research Laboratories, United States'])
    """The following are cases where i have raw string, but not type or country code"""
    # '10.1109/TVCG.2006.195',
    update_country_code(DF, '10.1109/TVCG.2006.195', ['NL'] * 3)
    update_type(DF, '10.1109/TVCG.2006.195', ['education'] * 2 + ['government'] * 1)
    update_affiliations(
        DF, '10.1109/TVCG.2006.195',
        ['Swammerdam Institute for Life Sciences (SILS), University of Amsterdam, Netherlands'] * 2 +
        ['Center for Mathematics and Computer Science (CWI), Netherlands'] * 1)
    # '10.1109/VISUAL.1996.567752',
    update_country_code(DF, '10.1109/VISUAL.1996.567752', ['US'] * 3)
    update_type(DF, '10.1109/VISUAL.1996.567752', ['company'] * 3)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.567752',
        ['GE Corporate Research & Development, United States'] * 3)
    # '10.1109/VISUAL.1999.809907',
    update_country_code(DF, '10.1109/VISUAL.1999.809907', ['NL'] * 2)
    update_type(DF, '10.1109/VISUAL.1999.809907', ['government'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1999.809907',
        ['Center for Mathematics and Computer Science (CWI), Netherlands'] * 2)
    # '10.1109/VISUAL.2004.88',
    update_country_code(DF, '10.1109/VISUAL.2004.88', ['DE'] * 2)
    update_type(DF, '10.1109/VISUAL.2004.88', ['nonprofit'] + ['education'])
    update_affiliations(
        DF, '10.1109/VISUAL.2004.88',
        ['Caesar Research Center, Bonn, Germany'] +
        ['Interdisciplinary Center for Scientific Computing, Heidelberg, Germany'])
    # '10.1109/VISUAL.2004.113',
    update_type_by_raw_string(DF, 'DLR Goettingen', ['government'])
    update_country_code_by_raw_string(DF, 'DLR Goettingen', 'DE')
    # '10.1109/VISUAL.2000.885722',
    update_type_by_raw_string(DF, 'ETH Zentrum, CH - 8092 Switzerland', 'education')
    update_country_code_by_raw_string(DF, 'ETH Zentrum, CH - 8092 Switzerland', 'CH')
    # '10.1109/VISUAL.2000.885715',
    update_country_code(
        DF, '10.1109/VISUAL.2000.885715',
        ['DE'] * 3 + ['NL'] + ['DE'] + ['NL'])
    update_type(DF, '10.1109/VISUAL.2000.885715', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885715',
        ['University of Bonn, Bonn, Germany'] * 3 +
        ['Eindhoven University of Technology'] +
        ['University of Bonn, Bonn, Germany'] +
        ['Eindhoven University of Technology'])
    # '10.1109/VISUAL.2000.885731',
    update_country_code(DF, '10.1109/VISUAL.2000.885731', ['US'] * 6)
    update_type(DF, '10.1109/VISUAL.2000.885731', ['education'] * 6)
    update_affiliations(
        DF, '10.1109/VISUAL.2000.885731',
        ['Brown University, United States'] * 6)
    # '10.1109/VISUAL.1996.568133',
    update_country_code(DF, '10.1109/VISUAL.1996.568133', ['US'] * 7)
    update_type(
        DF, '10.1109/VISUAL.1996.568133',
        ['healthcare'] + ['education'] + ['facility'] * 2 + ['healthcare'] + ['education'] * 2)
    update_affiliations(
        DF, '10.1109/VISUAL.1996.568133',
        ['National Jewish Center for Immunology and Respiratory Medicine, United States'] +
        ['University of New Mexico, United States'] +
        ['Sandia National Laboratories, United States'] * 2 +
        ['National
```
Jewish Center for Immunology and Respiratory Medicine, United States'] + [ 'State University of New York at Stony Brook, United States'] + [ 'University of New Mexico, United States'] ) # '10.1109/VISUAL.2005.1532808', update_country_code( DF, '10.1109/VISUAL.2005.1532808', ['DE'], ) update_type( DF, '10.1109/VISUAL.2005.1532808', ['education'], ) update_affiliations( DF, '10.1109/VISUAL.2005.1532808', ['University of Stuttgart'] ) # '10.1109/VISUAL.1998.745350', update_country_code( DF, '10.1109/VISUAL.1998.745350', ['US']*6, ) update_type( DF, '10.1109/VISUAL.1998.745350', ['facility']*6, ) update_affiliations( DF, '10.1109/VISUAL.1998.745350', ['Naval Reseach Lab, Washington, D.C.']*6 ) # '10.1109/VISUAL.2005.1532776', update_country_code( DF, '10.1109/VISUAL.2005.1532776', ['US']*7, ) update_type( DF, '10.1109/VISUAL.2005.1532776', ['company']*3 + ['facility']*2 + ['company']*2, ) update_affiliations( DF, '10.1109/VISUAL.2005.1532776', ['Kitware, United States']*3 + [ 'Sandia National Laboratories, United States']*2 + [ 'Simmetrix, United States']*2, ) # '10.1109/VISUAL.1996.568150', update_country_code( DF, '10.1109/VISUAL.1996.568150', ['NL']*4, ) update_type( DF, '10.1109/VISUAL.1996.568150', ['nonprofit'] + ['government']*2 + ['education'] ) update_affiliations( DF, '10.1109/VISUAL.1996.568150', ['Netherlands Energy Research Foundation, Netherlands'] + [ 'Centre for Mathematics and Computer Science (CWI), Netherlands']*2 + [ 'Vrije Universiteit, Netherlands'] ) # '10.1109/VISUAL.1990.146398', update_country_code( DF, '10.1109/VISUAL.1990.146398', ['US']*4, ) update_type( DF, '10.1109/VISUAL.1990.146398', ['government'] + ['company']*3 ) update_affiliations( DF, '10.1109/VISUAL.1990.146398', ['NASA Ames Research Center, Moffett Field, CA, USA'] + [ 'Sterling Software, United States'] + [ 'Crossfield Marketing, United States'] + [ 'Crystal River Engineering, Inc., Groveland, CA, USA'] ) # '10.1109/VISUAL.1996.568120', update_country_code( DF, '10.1109/VISUAL.1996.568120', ['US']*3, ) update_type( DF, '10.1109/VISUAL.1996.568120', ['education']*3 ) update_affiliations( DF, '10.1109/VISUAL.1996.568120', ['University of Illinois at Chicago, United States'] + [ 'University of Chicago, United States'] + [ 'University of Illinois at Chicago, United States'] ) """BELOW ARE WHERE I FILL AUTHOR DATA FOR ROWS WHERE DATA WAS FROM VISPUBDATA RAW""" # '10.1109/INFVIS.2005.1532150', update_country_code( DF, '10.1109/INFVIS.2005.1532150', ['US']*5, ) update_type( DF, '10.1109/INFVIS.2005.1532150', ['education']*5, ) update_affiliations( DF, '10.1109/INFVIS.2005.1532150', ['Stanford University, United States']*5, ) # '10.1109/VISUAL.2005.1532819', update_country_code( DF, '10.1109/VISUAL.2005.1532819', ['CA']*2, ) update_type( DF, '10.1109/VISUAL.2005.1532819', ['education']*2, ) update_affiliations( DF, '10.1109/VISUAL.2005.1532819', ['University of Alberta, Canada']*2, ) # '10.1109/VISUAL.2005.1532794', update_country_code( DF, '10.1109/VISUAL.2005.1532794', ['US']*5, ) update_type( DF, '10.1109/VISUAL.2005.1532794', ['facility'] + ['education']*3 + ['facility'], ) update_affiliations( DF, '10.1109/VISUAL.2005.1532794', ['Oak Ridge National Lab, United States'] + [ 'The University of Tennessee, United States']*3 + [ 'Oak Ridge National Lab, United States'], ) # '10.1109/VISUAL.1992.235178', update_country_code( DF, '10.1109/VISUAL.1992.235178', ['US']*4, ) update_type( DF, '10.1109/VISUAL.1992.235178', ['education']*4, ) update_affiliations( DF, '10.1109/VISUAL.1992.235178', ['University of Utah, 
United States']*4, ) ## IEEE Website updates the name of Sehi LYi but this update is ## different from the name shown on PDF. I changed it back. # '10.1109/TVCG.2021.3114876', update_author_name( DF, '10.1109/TVCG.2021.3114876', ["Sehi L'Yi", 'Qianwen Wang', 'Fritz Lekschas', 'Nils Gehlenborg'], ) ## I found the in this paper, Some authors' affiliations contain two institutions update_country_code( DF, '10.1109/TVCG.2011.207', ['DE']*4, ) update_type( DF, '10.1109/TVCG.2011.207', ['company'] + ['education']*1 + ['company']*2, ) update_affiliations( DF, '10.1109/TVCG.2011.207', ['Fraunhofer MEVIS, Germany'] + [ 'Center of Complex Systems and Visualization (CeVis), University of Bremen, Germany']*1 + [ 'Fraunhofer MEVIS, Germany']*2, ) ## I found that in this paper, the first author has two affiliations update_country_code( DF, '10.1109/INFVIS.2004.1', ['FR']*3, ) update_type( DF, '10.1109/INFVIS.2004.1', ['education']*1 + ['nonprofit']*1 + ['education']*1 ) update_affiliations( DF, '10.1109/INFVIS.2004.1', ['ecole des mines de nantes nantes france'] + ['INRIA']*1 + ['ecole des mines de nantes nantes france'], ) return DF def manual_update(DF, DOI, AUTHOR_NAME, COL_TO_CHANGE, TEXT): """This is to manually update errors in rows where ieee author info is nan and where openalex author info is complete """ DF.loc[(DF['DOI'] == DOI) & (DF['IEEE Author Name'] == AUTHOR_NAME), COL_TO_CHANGE] = TEXT def manual_update_concat_df(DF): # DF here is concat_df manual_update( DF, '10.1109/VISUAL.1997.663848', 'R. Machiraju', 'Raw Affiliation String', 'Mississippi State University, Mississippi, United States' ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'Raw Affiliation String', 'VA Tech Hydro Corporation, Swizerland', ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'First Institution Type', 'company' ) manual_update( DF, '10.1109/VISUAL.2004.128', 'E. Parkinson', 'First Institution Country Code', 'CH', ) manual_update( DF, '10.1109/INFVIS.1999.801864', 'J. Sean', 'IEEE Author Name', 'Jeffrey Senn', ) manual_update( DF, '10.1109/INFVIS.1999.801864', 'J. Sean', 'Author Name', 'Jeffrey Senn', ) manual_update( DF, '10.1109/TVCG.2019.2934260', 'Andrew J. Solis', 'Raw Affiliation String', 'University of Texas Austin, Texas, United States', ) manual_update( DF, '10.1109/TVCG.2019.2934260', 'Andrew J. Solis', 'First Institution Name', 'University of Texas Austin', ) def get_concat_df_filled(DF): # DF here is concat_df """ find out who don't have affilition, and fill the data manually Get the subset of concat_df where there does not exist any affiliation name. Then drop this subset from concat_df Update this subset's IEEE Author Affiliation with fill_dict, and then append this updated subset to concat_df_dropped Returns: concat_df_filled, where all authors have at least one affiliation name """ fill_dict = { 'K.I. Joy': 'University of California, Davis, United States', 'H. Pfister': 'Department of Computer Science, State University of New York at Stony Brook, United States', 'A.J. Kolojechick': 'Carnegie Mellon University,School of Computer Science,Pittsburgh,United States', 'M. Roth': 'Computer Graphics Research Group, Deptartment of Computer Science, ETH Zurich, Switzerland', 'P.C. Wong': 'Pacific Northwest National Laboratory, United States', 'H. Foote': 'Pacific Northwest National Laboratory, United States', 'W. Strasser': 'Computer Graphics Lab, University of Tubingen, Germany', 'M. 
Tuveri': 'Center for Advanced Studies, Research and Development in Sardinia, Cagliari, Italy', 'N. Fanst': 'Georgia Institute of Technology, United States', 'Heike Janicke': 'Image and Signal Processing Group at the Universi ̈at Leipzig, Germany', 'A. Vilanova': 'Institute of Computer Graphics, Vienna University of Technology, Austria', 'P. Thiansathaporn': 'Department of Physics & Astronomy, University of North Carolina, Chapel Hill, United States', 'B. Hegedust': 'Institute of Computer Graphics, Vienna University of Technology, Austria', 'W.C. Flowers': 'Massachusetts Institute of Technology, United States', 'G. Turk': 'GVU Center, College of Computing, Georgia Institute of Technology, United States', 'P. Ermest': 'Philips Medical Systems, Best, Netherlands', 'T. Moller': 'Department Of Computer And Information Science, The Ohio State University, Columbus, Ohio, United States', 'K. Fostiropoulos': 'German National Research Centre for Information Technology, Germany', 'F. Sobieczky': 'University of Göttingen, Germany', 'W. Bertelheimer': 'Bayerische Motoren Werke AG (BMW) Corporation, Germany', } to_fill_df = DF[( DF['IEEE Author Affiliation'].isnull()) & ( DF['Raw Affiliation String'].isnull()) & ( DF['First Institution Name'].isnull()) ] rows_to_drop = DF.index[( DF['IEEE Author Affiliation'].isnull()) & ( DF['Raw Affiliation String'].isnull()) & ( DF['First Institution Name'].isnull()) ] concat_df_dropped = DF.drop(rows_to_drop) if concat_df_dropped.shape[0] + to_fill_df.shape[0] == DF.shape[0]: print('concat_df_dropped has correct row numbers') else: print('concat_df_dropped has incorrect row numbers') name_list = to_fill_df['IEEE Author Name'].tolist() kwargs = {'IEEE Author Affiliation' : lambda x: [fill_dict[i] for i in name_list]} to_fill_df = to_fill_df.assign(**kwargs) concat_df_filled = concat_df_dropped.append( to_fill_df, ignore_index=True).sort_values( by=['IEEE Paper Index', 'IEEE Author Position'], ).reset_index(drop=True) return concat_df_filled def recode_to_edu(DF): # df here is concat_df_filled # openalex got these institutions' type wrong. they should be education. 
edu_recode_list = [ 'Paris Diderot University', 'Paris Descartes University', 'École Polytechnique Fédérale de Lausanne', 'Johns Hopkins University School of Medicine' ] DF.loc[ DF['First Institution Name'].isin(edu_recode_list), 'First Institution Type' ] = 'education' return DF def get_alex_raw_string_correct(DF): # DF here is concat_df_filled """if openalex raw string is wrong, correct/update it with ieee author affliation """ openalex_raw_string_wrong = [ '10.1109/VISUAL.1999.809920', '10.1109/VISUAL.1999.809884', '10.1109/VISUAL.1993.398911', ] DF.loc[DF.DOI.isin(openalex_raw_string_wrong), 'Raw Affiliation String'] = DF.loc[ DF.DOI.isin(openalex_raw_string_wrong)]['IEEE Author Affiliation'] return DF def binary_type(row): if row['First Institution Type'] == 'education': binary_type = 'education' elif row['First Institution Type'] in [ 'facility', 'government', 'company', 'healthcare', 'archive', 'nonprofit','other' ]: binary_type = 'non-education' else: binary_type = np.NaN return binary_type def binary_type_by_hand(row): '''This is to transform type handcoded by me to binary type ''' if row['First Institution Type By Hand'] == 'education': binary_type = 'education' elif row['First Institution Type By Hand'] in [ 'facility', 'government', 'company', 'healthcare', 'archive', 'nonprofit', 'other', # just in case I have input these by hand: 'noneducation', 'non-education' ]: binary_type = 'non-education' else: binary_type = np.NaN return binary_type def add_binary_type(DF): # DF here is concat_df_filled DF['Binary Institution Type'] = DF.apply(binary_type, axis=1) DF['Binary Institution Type By Hand'] = DF.apply(binary_type_by_hand, axis=1) return DF def check_delete_rename(DF): # DF here is concat_df_filled # check paper index, author num, and author positions if DF['IEEE Paper Index'].tolist() == DF['Paper Index'].tolist(): print('Two paper index vectors are equal') else: print('Something wrong with paper index vectors') if DF['IEEE Number of Authors'].tolist() == DF['Number of Authors'].tolist(): print('Two author num vectors are equal') else: print('Something wrong with author num vectors') if DF['IEEE Author Position'].tolist() == DF['Author Position'].tolist(): print('Two author position vectors are equal') else: print('Something wrong with author position vectors\ , but this is expected as it indicates that the fuzzy matching above works.') # delete useless columns DF.drop(['Year', 'DOI', 'Title', 'IEEE Paper Index', 'Paper Index'], axis=1, inplace=True) # add a column called IEEE Author Affiliation Filled. It is bascially the same as # ieee author affiliation. 
The only difference is that if ieee is nan, # i get the data from openalex raw string DF['IEEE Author Affiliation Filled'] = np.where( DF['IEEE Author Affiliation'].notnull(), DF['IEEE Author Affiliation'], DF['Raw Affiliation String'], ) # rename columns DF.rename(columns={ 'IEEE Year': 'Year', 'IEEE DOI': 'DOI', 'IEEE Title': 'Title', 'IEEE Author Affiliation': 'IEEE Author Affiliation Updated', 'First Institution Name': 'First Institution Name Updated', 'Raw Affiliation String': 'Raw Affiliation String Updated', # 'First Institution Type': 'First Institution Type Updated', # 'First Institution Country Code': 'First Institution Country Code Updated', }, inplace=True) return DF def main(): ieee = update_ieee_orig(ieee_orig) diff_dois = get_diff_dois(ieee, alex) alex_new = get_alex_new(ieee, alex, diff_dois) ieee_sorted, alex_sorted = get_sorted_dfs(ieee, alex_new, papers) concat_df = get_concat_df(ieee_sorted, alex_sorted, papers) concat_df = update_with_vispubdata_author_data(vispd, concat_df) concat_df = update_concat_df(concat_df) manual_update_concat_df(concat_df) concat_df_filled = get_concat_df_filled(concat_df) concat_df_filled = recode_to_edu(concat_df_filled) concat_df_filled = get_alex_raw_string_correct(concat_df_filled) concat_df_filled = add_binary_type(concat_df_filled) concat_df_filled = check_delete_rename(concat_df_filled) return concat_df_filled if __name__ == '__main__': vispd = pd.read_csv(VISPUBDATA) doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA) ieee_orig = pd.read_csv(IEEE_AUTHOR) alex = pd.read_csv(OPENALEX_AUTHOR) papers = read_txt(PAPERS_TO_STUDY) df = main() df.to_csv(MERGED_AUTHOR_DF, index=False) |
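To see what these `update_*` helpers do in isolation, here is a minimal, self-contained sketch. The DOI, DataFrame, and values are invented for illustration; only the assignment pattern mirrors the script above:

    import pandas as pd

    # Toy frame mimicking three author rows of one (invented) paper.
    df = pd.DataFrame({
        'DOI': ['10.1109/EXAMPLE.1'] * 3,
        'First Institution Country Code': [None, 'US', None],
    })

    def update_country_code(DF, DOI, NEW_DATA):
        # Same pattern as above: assign a per-author list to all rows of one DOI.
        DF.loc[DF['DOI'] == DOI, 'First Institution Country Code'] = NEW_DATA
        return DF

    update_country_code(df, '10.1109/EXAMPLE.1', ['US'] * 3)
    print(df['First Institution Country Code'].tolist())  # ['US', 'US', 'US']

Note that the list on the right-hand side must have exactly as many elements as there are matching rows, or pandas raises a ValueError; that is why every call in the script hard-codes the paper's author count. The next script collects, for every VIS paper, the metadata of all papers that cite it from the OpenAlex API.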
import pandas as pd
import numpy as np
import requests
import random
import math
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import json

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CITATION_AUTHOR_DF = sys.argv[2]
OPENALEX_CITATION_CONCEPT_DF = sys.argv[3]
OPENALEX_CITATION_PAPER_DF = sys.argv[4]


def get_dicts(OPENALEX_PAPER_DF):  # vispd_openalex_match here is OPENALEX_PAPER_DF
    df = pd.read_csv(OPENALEX_PAPER_DF)
    dois = df['DOI'].tolist()
    urls = df['Citation API URL'].tolist()
    openalex_ids = df['OpenAlex ID'].tolist()
    years = df['Year'].tolist()
    titles = df['Title'].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    doi_url_dict = dict(zip(dois, urls))
    doi_openalexID_dict = dict(zip(dois, openalex_ids))
    return [dois, urls, doi_year_dict, doi_title_dict, doi_url_dict, doi_openalexID_dict]


def get_s():
    # set retries for status codes in [500, 502, 503, 504, 429]; also return headers
    s = requests.Session()
    retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504, 429])
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_concept_dict_list_from_concepts(doi, result, concepts):
    """returns a list of dicts"""
    openalex_year = result['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', result['id'])
    openalex_title = result['display_name']
    openalex_doi = result['doi']
    concept_dict_list = []
    num_concepts = len(concepts)
    for i in concepts:
        concept_index = concepts.index(i) + 1
        concept_name = i['display_name']
        openalex_concept_id = i['id']
        wikidata_url = i['wikidata']
        level = i['level']
        score = i['score']
        concept_dict = {
            # 'Ppaer' (sic) is kept below: it is the actual column name in the released CSVs
            'Cited Ppaer Year': doi_year_dict[doi],
            'Cited Paper DOI': doi,
            'Cited Paper Title': doi_title_dict[doi],
            'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
            'Citation Paper Year': openalex_year,
            'Citation Paper OpenAlex ID': openalex_id,
            'Citation Ppaer OpenAlex Title': openalex_title,
            'Citation Paper OpenAlex DOI': openalex_doi,
            'Number of Concepts': num_concepts,
            'Index of Concept': concept_index,
            'Concept': concept_name,
            'Concept ID': openalex_concept_id,
            'Wikidata': wikidata_url,
            'Level': level,
            'Score': score,
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def get_author_dict_list_from_authors(doi, result, authors):
    """returns a list of dicts"""
    openalex_year = result['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', result['id'])
    openalex_title = result['display_name']
    openalex_doi = result['doi']
    author_dict_list = []
    num_authors = len(authors)
    for i in authors:
        author = i['author']
        author_name = author['display_name']
        author_position = authors.index(i) + 1
        position_type = i['author_position']
        openalex_author_id = author['id']
        author_orcid = author['orcid']
        raw_affiliation_string = i['raw_affiliation_string']
        if len(i['institutions']) == 0:
            num_institutions = np.NaN
            first_institution = np.NaN
            institution_name = np.NaN
            institution_id = np.NaN
            ror = np.NaN
            country_code = np.NaN
            institution_type = np.NaN
        else:
            num_institutions = len(i['institutions'])
            first_institution = i['institutions'][0]
            # Check whether the institution object is empty. In the first citation
            # of 10.1109/TVCG.2007.70599, the first author's institution is empty,
            # which causes errors.
            if first_institution:
                institution_name = first_institution['display_name']
                institution_id = first_institution['id']
                ror = first_institution['ror']
                country_code = first_institution['country_code']
                institution_type = first_institution['type']
            else:
                institution_name = np.NaN
                institution_id = np.NaN
                ror = np.NaN
                country_code = np.NaN
                institution_type = np.NaN
        author_dict = {
            'Cited Ppaer Year': doi_year_dict[doi],
            'Cited Paper DOI': doi,
            'Cited Paper Title': doi_title_dict[doi],
            'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
            'Citation Paper Year': openalex_year,
            'Citation Paper OpenAlex ID': openalex_id,
            'Citation Ppaer OpenAlex Title': openalex_title,
            'Citation Paper OpenAlex DOI': openalex_doi,
            'Number of Authors': num_authors,
            'Author Name': author_name,
            'Author Position': author_position,
            'Author Position Type': position_type,
            'OpenAlex Author ID': openalex_author_id,
            'Author ORCID': author_orcid,
            'Number of Affiliations': num_institutions,
            'First Institution Name': institution_name,
            'Raw Affiliation String': raw_affiliation_string,
            'First Institution ID': institution_id,
            'First Institution ROR': ror,
            'First Institution Type': institution_type,
            'First Institution Country Code': country_code,
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_paper_dict_from_json_result(j, doi):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_year = j['publication_year']
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_doi = j['doi']
    openalex_publication_date = j['publication_date']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    # num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
    #              else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'Citation Paper OpenAlex ID': openalex_id,
        'Citation Paper OpenAlex Title': openalex_title,
        'Citation Paper OpenAlex DOI': openalex_doi,
        'Citation Paper OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        # 'Number of Pages': num_pages,
        'Number of References': num_references,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Number of Citations': num_citations,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    return paper_dict


def get_empty_dict_list(doi):
    dict_list = [{
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
    }]
    return dict_list


def get_empty_dict(doi):
    a_dict = {
        'Cited Ppaer Year': doi_year_dict[doi],
        'Cited Paper DOI': doi,
        'Cited Paper Title': doi_title_dict[doi],
        'Cited Paper OpenAlex ID': doi_openalexID_dict[doi],
    }
    return a_dict


def get_json_result(url, s, headers):
    """Retry on 404 or other error codes.

    This function guards against error codes. Every cited_by_api_url should
    return a status code of 200, which is why this recursive retry is safe.
    Also note that if the status code were 404, s.get(url).json() would raise,
    so there is no need to check the status code explicitly here.
    """
    try:
        j = s.get(url, headers=headers).json()
    except Exception:
        time.sleep(1)
        return get_json_result(url, s, headers)
    else:
        return j


def main(DOIS, s, headers):
    for doi in DOIS:
        # make sure the api url is not NaN (NaN != NaN):
        if doi_url_dict[doi] == doi_url_dict[doi]:
            url = doi_url_dict[doi] + '&per-page=50'
            j0 = get_json_result(url, s, headers)
            count = j0['meta']['count']
            per_page = 50
            total_pages = math.ceil(count / per_page)
            # check whether the results are empty
            if count > 0:
                # for every page
                for i in range(1, total_pages + 1):
                    list_of_concept_dict_lists = []
                    list_of_author_dict_lists = []
                    paper_dict_list = []
                    j = get_json_result(url + f'&page={i}', s, headers=headers)
                    results = j['results']
                    # for every result in a page
                    for result in results:
                        concepts = result['concepts']
                        authors = result['authorships']
                        concept_dict_list = get_concept_dict_list_from_concepts(doi, result, concepts)
                        author_dict_list = get_author_dict_list_from_authors(doi, result, authors)
                        paper_dict = get_paper_dict_from_json_result(result, doi)
                        list_of_concept_dict_lists.append(concept_dict_list)
                        list_of_author_dict_lists.append(author_dict_list)
                        paper_dict_list.append(paper_dict)
                    lists_concepts.append(list_of_concept_dict_lists)
                    lists_authors.append(list_of_author_dict_lists)
                    list_of_paper_dict_lists.append(paper_dict_list)
                    time.sleep(0.2)
            # if the results are empty:
            else:
                list_of_concept_dict_lists = []
                list_of_author_dict_lists = []
                paper_dict_list = []
                concept_dict_list = get_empty_dict_list(doi)
                author_dict_list = get_empty_dict_list(doi)
                paper_dict = get_empty_dict(doi)
                list_of_concept_dict_lists.append(concept_dict_list)
                list_of_author_dict_lists.append(author_dict_list)
                paper_dict_list.append(paper_dict)
                lists_concepts.append(list_of_concept_dict_lists)
                lists_authors.append(list_of_author_dict_lists)
                list_of_paper_dict_lists.append(paper_dict_list)
        else:
            list_of_concept_dict_lists = []
            list_of_author_dict_lists = []
            paper_dict_list = []
            concept_dict_list = get_empty_dict_list(doi)
            author_dict_list = get_empty_dict_list(doi)
            paper_dict = get_empty_dict(doi)
            list_of_concept_dict_lists.append(concept_dict_list)
            list_of_author_dict_lists.append(author_dict_list)
            paper_dict_list.append(paper_dict)
            lists_concepts.append(list_of_concept_dict_lists)
            lists_authors.append(list_of_author_dict_lists)
            list_of_paper_dict_lists.append(paper_dict_list)
        print(f'{DOIS.index(doi) + 1} is done')
        time.sleep(0.5)


if __name__ == '__main__':
    # I don't need to worry about papers having no citations: even if a paper has
    # no citations, it still has a cited_by_api_url, and the result count at that
    # URL is simply zero. main() handles this case.
    dois, urls, doi_year_dict, doi_title_dict, doi_url_dict, doi_openalexID_dict = get_dicts(OPENALEX_PAPER_DF)
    random_dois = random.sample(dois, 10)  # small sample for quick testing (unused in the full run)
    lists_concepts = []  # list of lists of concept dict lists
    lists_authors = []  # list of lists of author dict lists
    list_of_paper_dict_lists = []  # list of paper dict lists
    s, headers = get_s()
    main(dois, s, headers)
    author_df_initiate = pd.DataFrame()
    concept_df_initiate = pd.DataFrame()

    def build_df_from_lists(lists, df):
        for i in lists:
            df1 = pd.concat([pd.DataFrame(l) for l in i], ignore_index=True)
            df = df.append(df1, ignore_index=True)
        return df

    author_df = build_df_from_lists(lists_authors, author_df_initiate)
    concept_df = build_df_from_lists(lists_concepts, concept_df_initiate)
    paper_df = pd.concat([pd.DataFrame(l) for l in list_of_paper_dict_lists], ignore_index=True)
    author_df.to_csv(OPENALEX_CITATION_AUTHOR_DF, index=False)
    concept_df.to_csv(OPENALEX_CITATION_CONCEPT_DF, index=False)
    paper_df.to_csv(OPENALEX_CITATION_PAPER_DF, index=False)
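The paging logic inside `main()` can be isolated as a small generator. This is a minimal sketch assuming only `requests` and `math`; it deliberately omits the session retries, headers, and sleep calls the script uses, and the example filter URL follows the `cited_by_api_url` format that OpenAlex returns (the work ID is the one used as an example earlier in this README):

    import math
    import requests

    def fetch_all_citing_works(cited_by_api_url, per_page=50):
        """Walk every page of an OpenAlex cited_by_api_url, yielding raw work records."""
        first = requests.get(f'{cited_by_api_url}&per-page={per_page}').json()
        total_pages = math.ceil(first['meta']['count'] / per_page)
        for page in range(1, total_pages + 1):
            j = requests.get(f'{cited_by_api_url}&per-page={per_page}&page={page}').json()
            yield from j['results']

    # Example:
    # for work in fetch_all_citing_works('https://api.openalex.org/works?filter=cites:W3203914472'):
    #     print(work['id'])

When `meta.count` is zero, `total_pages` is 0 and the loop body never runs, which is exactly the "empty results" case the script handles with its empty dicts. The next script is the upstream step: it queries OpenAlex for each VIS paper itself (by title, falling back to DOI) and writes the paper, author, concept, and reference data files.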
import pandas as pd
import numpy as np
import requests
import random
import math
import csv
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

PAPERS_TO_STUDY = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
OPENALEX_PAPER_DF = sys.argv[3]
OPENALEX_AUTHOR_DF = sys.argv[4]
OPENALEX_CONCEPT_DF = sys.argv[5]
OPENALEX_REFERENCE_DF = sys.argv[6]
TITLE_QUERY_EMPTY_DOI_QUERY_404_DFS = sys.argv[7]
TITLE_QUERY_404_DFS = sys.argv[8]
DOI_QUERY_404_DFS = sys.argv[9]


def read_txt(INPUT):
    """read a txt file and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data


def get_dicts(VISPUBDATA_PLUS):
    # get doi_year_dict and doi_title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]


def get_concept_dict_list_from_concepts(doi, concepts):
    """returns a list of dicts"""
    concept_dict_list = []
    num_concepts = len(concepts)
    # first check whether the list of concepts is empty:
    if concepts:
        for i in concepts:
            concept_index = concepts.index(i) + 1
            concept_name = i['display_name']
            openalex_concept_id = i['id']
            wikidata_url = i['wikidata']
            level = i['level']
            score = i['score']
            concept_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of Concepts': num_concepts,
                'Index of Concept': concept_index,
                'Concept': concept_name,
                'Concept ID': openalex_concept_id,
                'Wikidata': wikidata_url,
                'Level': level,
                'Score': score,
            }
            concept_dict_list.append(concept_dict)
    # if the concept list is empty, 'Number of Concepts' will be NaN
    else:
        concept_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def get_reference_dict_list_from_referenced_works(doi, referenced_works):
    reference_dict_list = []
    num_references = len(referenced_works)
    # first check whether the list of referenced works is empty
    if referenced_works:
        for i in referenced_works:
            reference_index = referenced_works.index(i) + 1
            reference_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of References': num_references,
                'Index of Reference': reference_index,
                'Reference': i,
            }
            reference_dict_list.append(reference_dict)
    # if the references list is empty, 'Number of References' will be NaN
    else:
        reference_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        reference_dict_list.append(reference_dict)
    return reference_dict_list


def get_author_dict_list_from_authors(doi, authors):
    """returns a list of dicts"""
    author_dict_list = []
    num_authors = len(authors)
    # first check whether the authors list is empty
    if authors:
        for i in authors:
            author = i['author']
            author_name = author['display_name']
            author_position = authors.index(i) + 1
            position_type = i['author_position']
            openalex_author_id = author['id']
            author_orcid = author['orcid']
            raw_affiliation_string = i['raw_affiliation_string']
            if len(i['institutions']) == 0:
                num_institutions = np.NaN
                first_institution = np.NaN
                institution_name = np.NaN
                institution_id = np.NaN
                ror = np.NaN
                country_code = np.NaN
                institution_type = np.NaN
            else:
                num_institutions = len(i['institutions'])
                first_institution = i['institutions'][0]
                institution_name = first_institution['display_name']
                institution_id = first_institution['id']
                ror = first_institution['ror']
                country_code = first_institution['country_code']
                institution_type = first_institution['type']
            author_dict = {
                'Year': doi_year_dict[doi],
                'DOI': doi,
                'Title': doi_title_dict[doi],
                'Number of Authors': num_authors,
                'Author Name': author_name,
                'Author Position': author_position,
                'Author Position Type': position_type,
                'OpenAlex Author ID': openalex_author_id,
                'Author ORCID': author_orcid,
                'Number of Affiliations': num_institutions,
                'First Institution Name': institution_name,
                'Raw Affiliation String': raw_affiliation_string,
                'First Institution ID': institution_id,
                'First Institution ROR': ror,
                'First Institution Type': institution_type,
                'First Institution Country Code': country_code,
            }
            author_dict_list.append(author_dict)
    # if the authors list is empty, 'Number of Authors' will be NaN
    else:
        author_dict = {
            'Year': doi_year_dict[doi],
            'DOI': doi,
            'Title': doi_title_dict[doi],
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_paper_dict_from_json_result(j, doi):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_publication_date = j['publication_date']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
                 else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        'Number of Pages': num_pages,
        'Number of References': num_references,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Number of Citations': num_citations,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    return paper_dict


def get_empty_dict_list(doi):
    dict_list = [{
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }]
    return dict_list


def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict


def get_title_query_response(doi):
    title = doi_title_dict[doi]
    title_to_query = re.sub(r'\:|\?|\&|\,', '', title)
    response = requests.get('https://api.openalex.org/works?filter=title.search:' + title_to_query)
    return response, title_to_query


def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count


def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response


def get_data(doi, doi_index):
    # If the doi is not in to_query_by_doi, query by title first.
    if doi not in to_query_by_doi:
        response = get_title_query_response(doi)[0]
        # If response.status_code is in retry_code, something went wrong: sleep for a
        # while and try again. Note that if the status code is 404, the doi is recorded
        # as having no match (see below, status_code != 200) rather than retried.
        while response.status_code in retry_code:
            print(f'Title query has errors for {doi_index} : {doi_title_dict[doi]}. Error status code is {response.status_code}. Retrying...')
            time.sleep(3)
            response = get_title_query_response(doi)[0]
        # if the title query succeeds:
        if response.status_code == 200:
            # get the json and check the results count:
            j, count = check_results_count(response)
            # if the count is non-zero:
            if count > 0:
                # If the doi is not in special_result_index_dict, use the result at
                # index 0; otherwise, use the index recorded for that doi.
                if doi not in special_result_index_dict:
                    correct_result = j['results'][0]
                else:
                    correct_result = j['results'][special_result_index_dict[doi]]
                authors = correct_result['authorships']
                concepts = correct_result['concepts']
                referenced_works = correct_result['referenced_works']
                paper_dict = get_paper_dict_from_json_result(correct_result, doi)
                author_dict_list = get_author_dict_list_from_authors(doi, authors)
                concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
                reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
            # if the count is zero, query by doi instead
            else:
                response2 = get_doi_query_response(doi)
                # if the status code is in retry_code, retry
                while response2.status_code in retry_code:
                    print(f'doi query has error for {doi_index} : {doi}, error status code is {response2.status_code}, retrying...')
                    time.sleep(3)
                    response2 = get_doi_query_response(doi)
                # if the doi query succeeds:
                if response2.status_code == 200:
                    j2 = response2.json()
                    authors = j2['authorships']
                    concepts = j2['concepts']
                    referenced_works = j2['referenced_works']
                    paper_dict = get_paper_dict_from_json_result(j2, doi)
                    author_dict_list = get_author_dict_list_from_authors(doi, authors)
                    concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
                    reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
                # if the doi query fails, record the doi as "title query empty, doi query 404"
                else:
                    error_status_code.append(response2.status_code)
                    title_query_empty_doi_query_404_list.append(doi)
                    paper_dict = get_empty_paper_dict(doi)
                    author_dict_list = get_empty_dict_list(doi)
                    concept_dict_list = get_empty_dict_list(doi)
                    reference_dict_list = get_empty_dict_list(doi)
                    print(f'doi query fails for {doi_index} : {doi}')
        # If the title query itself fails (most likely status code 404), which is very
        # unlikely, add the doi to title_query_404_list.
        else:
            title_query_404_list.append(doi)
            error_status_code.append(response.status_code)
            paper_dict = get_empty_paper_dict(doi)
            author_dict_list = get_empty_dict_list(doi)
            concept_dict_list = get_empty_dict_list(doi)
            reference_dict_list = get_empty_dict_list(doi)
            print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
    # if the doi is in to_query_by_doi, use the doi query
    else:
        response0 = get_doi_query_response(doi)
        # if the status code is in retry_code, retry
        while response0.status_code in retry_code:
            print(f'doi query for {doi_index} : {doi} has error, status code is {response0.status_code}, retrying...')
            time.sleep(3)
            response0 = get_doi_query_response(doi)
        # if the doi query succeeds:
        if response0.status_code == 200:
            j0 = response0.json()
            authors = j0['authorships']
            concepts = j0['concepts']
            referenced_works = j0['referenced_works']
            paper_dict = get_paper_dict_from_json_result(j0, doi)
            author_dict_list = get_author_dict_list_from_authors(doi, authors)
            concept_dict_list = get_concept_dict_list_from_concepts(doi, concepts)
            reference_dict_list = get_reference_dict_list_from_referenced_works(doi, referenced_works)
        # if the doi query fails, add the doi to doi_query_404_list
        else:
            error_status_code.append(response0.status_code)
            doi_query_404_list.append(doi)
            paper_dict = get_empty_paper_dict(doi)
            author_dict_list = get_empty_dict_list(doi)
            concept_dict_list = get_empty_dict_list(doi)
            reference_dict_list = get_empty_dict_list(doi)
            print(f'doi query fails for {doi_index} : {doi}')
    list_of_paper_dicts.append(paper_dict)
    list_of_author_dict_lists.append(author_dict_list)
    list_of_concept_dict_lists.append(concept_dict_list)
    list_of_reference_dict_lists.append(reference_dict_list)


def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_data(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))


if __name__ == '__main__':
    papers_to_study = read_txt(PAPERS_TO_STUDY)
    random_papers_to_study = random.sample(papers_to_study, 10)  # small sample for quick testing (unused in the full run)
    doi_year_dict, doi_title_dict = get_dicts(VISPUBDATA_PLUS)
    list_of_paper_dicts = []
    list_of_author_dict_lists = []
    list_of_concept_dict_lists = []
    list_of_reference_dict_lists = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    doi_query_404_list = []
    retry_code = [500, 502, 503, 504, 429]
    error_status_code = []
    to_query_by_doi = [
        '10.1109/VISUAL.2001.964489', '10.1109/VISUAL.1996.568113', '10.1109/VISUAL.1999.809896',
        '10.1109/VISUAL.1991.175771', '10.1109/VISUAL.1998.745302', '10.1109/VISUAL.1993.398868',
        '10.1109/INFVIS.2005.1532128', '10.1109/VISUAL.1993.398859', '10.1109/VISUAL.1991.175795',
        '10.1109/VISUAL.2003.1250401', '10.1109/VISUAL.1991.175789', '10.1109/VISUAL.2000.885739',
        '10.1109/TVCG.2014.2346922', '10.1109/VISUAL.1999.809871', '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.2000.885692', '10.1109/VISUAL.1991.175777', '10.1109/VISUAL.1998.745315',
        '10.1109/VISUAL.1997.663909', '10.1109/VISUAL.2000.885697', '10.1109/VISUAL.2001.964504',
        '10.1109/TVCG.2006.168', '10.1109/TVCG.2007.70617', '10.1109/VISUAL.1997.663910',
        '10.1109/VISUAL.1997.663931', '10.1109/VISUAL.2002.1183792', '10.1109/VISUAL.1992.235201',
        '10.1109/VISUAL.1996.568128', '10.1109/VISUAL.1997.663923', '10.1109/VAST.2011.6102441',
        '10.1109/VISUAL.2000.885732', '10.1109/VISUAL.2001.964522', '10.1109/VISUAL.2005.1532812',
        '10.1109/VISUAL.1998.745350', '10.1109/INFVIS.2001.963282', '10.1109/VISUAL.1995.480804',
        '10.1109/VISUAL.2005.1532847', '10.1109/INFVIS.1996.559229', '10.1109/VISUAL.2000.885738',
        '10.1109/VISUAL.1991.175800', '10.1109/VISUAL.1993.398865', '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348', '10.1109/VISUAL.1993.398867', '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900', '10.1109/VISUAL.1992.235181', '10.1109/VISUAL.1992.235195',
        '10.1109/VISUAL.2000.885719', '10.1109/VISUAL.1991.175816', '10.1109/VISUAL.1990.146414',
        '10.1109/VISUAL.1993.398861', '10.1109/VISUAL.1993.398872', '10.1109/VISUAL.1994.346292',
        '10.1109/VISUAL.1994.346295', '10.1109/VISUAL.1994.346297', '10.1109/VISUAL.1994.346301',
        '10.1109/VISUAL.1999.809913', '10.1109/VISUAL.2001.964546', '10.1109/VISUAL.2003.1250404',
        '10.1109/TVCG.2014.2346442', '10.1109/TVCG.2020.3028948', '10.1109/TVCG.2020.3030363',
        '10.1109/TVCG.2020.3030364', '10.1109/tvcg.2021.3114784', '10.1109/tvcg.2021.3114780',
        '10.1109/tvcg.2021.3114782', '10.1109/tvcg.2021.3114783', '10.1109/tvcg.2021.3114836',
        '10.1109/TVCG.2021.3064037', '10.1109/TVCG.2021.3114849', '10.1109/TVCG.2021.3114842',
        '10.1109/TVCG.2021.3114766', '10.1109/TVCG.2021.3114777',
    ]
    special_result_index_dict = {
        '10.1109/VISUAL.1992.235194': 4,
    }
    main(papers_to_study)
    paper_df = pd.DataFrame(list_of_paper_dicts)
    author_df = pd.concat([pd.DataFrame(l) for l in list_of_author_dict_lists], ignore_index=True)
    concept_df = pd.concat([pd.DataFrame(l) for l in list_of_concept_dict_lists], ignore_index=True)
    reference_df = pd.concat([pd.DataFrame(l) for l in list_of_reference_dict_lists], ignore_index=True)
    paper_df.to_csv(OPENALEX_PAPER_DF, index=False)
    author_df.to_csv(OPENALEX_AUTHOR_DF, index=False)
    concept_df.to_csv(OPENALEX_CONCEPT_DF, index=False)
    reference_df.to_csv(OPENALEX_REFERENCE_DF, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_DFS, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_DFS, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
    with open(DOI_QUERY_404_DFS, 'w') as f:
        for doi in doi_query_404_list:
            f.write("%s\n" % doi)
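The title-then-DOI fallback at the heart of `get_data()` can be sketched compactly. This is a simplified illustration, not the script's exact code: the real version also strips punctuation from titles, retries transient status codes, and records every failure:

    import requests

    def find_openalex_work(doi, title):
        """Title search first; fall back to a DOI lookup when the search is empty.
        Same fallback order as get_data() above, without the retry bookkeeping."""
        r = requests.get('https://api.openalex.org/works',
                         params={'filter': f'title.search:{title}'})
        if r.status_code == 200 and r.json()['meta']['count'] > 0:
            return r.json()['results'][0]
        r = requests.get(f'https://api.openalex.org/works/doi:{doi}')
        return r.json() if r.status_code == 200 else None

The title search is tried first because it returns the same result object as a DOI lookup while tolerating DOI mismatches; the hand-curated `to_query_by_doi` list covers the cases where the title search is known to return the wrong paper. The next script fetches metadata for every unique referenced work collected above.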
import pandas as pd
import numpy as np
import requests
import random
import re
import sys
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

OPENALEX_REFERENCE_DF = sys.argv[1]
OPENALEX_REFERENCE_PAPER_DF_UNIQUE = sys.argv[2]
OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE = sys.argv[3]
OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE = sys.argv[4]
OPENALEX_REFERENCE_PAPER_DF = sys.argv[5]
OPENALEX_REFERENCE_AUTHOR_DF = sys.argv[6]
OPENALEX_REFERENCE_CONCEPT_DF = sys.argv[7]
OPENALEX_REFERENCE_ERROR_DF = sys.argv[8]


def get_unique_ref_urls(ref_df):  # ref_df here is OPENALEX_REFERENCE_DF
    # returns the dataframe and a list of unique reference paper urls
    ref = pd.read_csv(ref_df).dropna(subset=['Number of References'])
    unique_ref_urls = list(set(ref.Reference.tolist()))
    return ref, unique_ref_urls


def get_s():
    # set retries for status codes in [500, 502, 503, 504, 429]; also return headers
    s = requests.Session()
    retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504, 429])
    s.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        'Accept': 'application/json',
    }
    return s, headers


def get_paper_dict_from_json_result(j, url, paper_dict_list):
    """returns a dict"""
    authors = j['authorships']
    num_authors = len(authors)
    concepts = j['concepts']
    num_concepts = len(concepts)
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_publication_date = j['publication_date']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue_id = venue['id']
    openalex_url = venue['url']
    openalex_venue_name = venue['display_name']
    openalex_publisher = venue['publisher']
    publication_type = j['type']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    # num_pages = (np.NaN if openalex_first_page is None or openalex_last_page is None
    #              else int(openalex_last_page) - int(openalex_first_page) + 1)
    num_references = len(j['referenced_works'])
    num_citations = j['cited_by_count']
    # cited_by_api_url is a little complicated: in the results of a title query
    # it is a list, whereas it is a str in a doi query.
    cited_url = j['cited_by_api_url']
    cited_by_api_url = cited_url if type(cited_url) is str else cited_url[0]
    num_cited_by_api_url = 1 if type(cited_url) is str else len(cited_url)
    paper_dict = {
        'Reference': re.sub('//api.', '//', url),
        'OpenAlex Year': openalex_year,
        'OpenAlex Publication Date': openalex_publication_date,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue ID': openalex_venue_id,
        'OpenAlex Venue Name': openalex_venue_name,
        'OpenAlex Publisher': openalex_publisher,
        'Publication Type': publication_type,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
        # 'Number of Pages': num_pages,
        'Number of References for Reference paper': num_references,
        'Number of Citations': num_citations,
        'Number of Authors': num_authors,
        'Number of Concepts': num_concepts,
        'Citation API URL': cited_by_api_url,
        'Number of Citation API URLs': num_cited_by_api_url,
    }
    paper_dict_list.append(paper_dict)
    return paper_dict_list


def get_author_dict_list_from_authors(j, url, author_dict_list):
    """returns a list of dicts"""
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    authors = j['authorships']
    num_authors = len(authors)
    for i in authors:
        author = i['author']
        author_name = author['display_name']
        author_position = authors.index(i) + 1
        position_type = i['author_position']
        openalex_author_id = author['id']
        author_orcid = author['orcid']
        raw_affiliation_string = i['raw_affiliation_string']
        if len(i['institutions']) == 0:
            num_institutions = np.NaN
            first_institution = np.NaN
            institution_name = np.NaN
            institution_id = np.NaN
            ror = np.NaN
            country_code = np.NaN
            institution_type = np.NaN
        else:
            num_institutions = len(i['institutions'])
            first_institution = i['institutions'][0]
            institution_name = first_institution['display_name']
            institution_id = first_institution['id']
            ror = first_institution['ror']
            country_code = first_institution['country_code']
            institution_type = first_institution['type']
        author_dict = {
            'Reference': re.sub('//api.', '//', url),
            'Reference OpenAlex Year': openalex_year,
            'Reference OpenAlex ID': openalex_id,
            'Reference OpenAlex Title': openalex_title,
            'Number of Authors': num_authors,
            'Author Name': author_name,
            'Author Position': author_position,
            'Author Position Type': position_type,
            'OpenAlex Author ID': openalex_author_id,
            'Author ORCID': author_orcid,
            'Number of Affiliations': num_institutions,
            'First Institution Name': institution_name,
            'Raw Affiliation String': raw_affiliation_string,
            'First Institution ID': institution_id,
            'First Institution ROR': ror,
            'First Institution Type': institution_type,
            'First Institution Country Code': country_code,
        }
        author_dict_list.append(author_dict)
    return author_dict_list


def get_concept_dict_list_from_concepts(j, url, concept_dict_list):
    """returns a list of dicts"""
    openalex_id = re.sub('https://openalex.org/', '', j['id'])
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    concepts = j['concepts']
    num_concepts = len(concepts)
    for i in concepts:
        concept_index = concepts.index(i) + 1
        concept_name = i['display_name']
        openalex_concept_id = i['id']
        wikidata_url = i['wikidata']
        level = i['level']
        score = i['score']
        concept_dict = {
            'Reference': re.sub('//api.', '//', url),
            'Reference OpenAlex Year': openalex_year,
            'Reference OpenAlex ID': openalex_id,
            'Reference OpenAlex Title': openalex_title,
            'Number of Concepts': num_concepts,
            'Index of Concept': concept_index,
            'Concept': concept_name,
            'Concept ID': openalex_concept_id,
            'Wikidata': wikidata_url,
            'Level': level,
            'Score': score,
        }
        concept_dict_list.append(concept_dict)
    return concept_dict_list


def main(URLS, s, headers):
    for url in URLS:
        url_index = URLS.index(url) + 1
        api_url = re.sub('https://', 'https://api.', url)
        response = s.get(api_url, headers=headers)
        # If response.status_code is in retry_code, something went wrong:
        # sleep for a while and try again. If the status code is 404, the url
        # is caught below and put into error_url_dict_list instead.
        while response.status_code in retry_code:
            print(f'doi query {url_index} : {api_url} has error, status code is {response.status_code}, retrying...')
            time.sleep(3)
            response = s.get(api_url, headers=headers)
        # If the error code is 404, response.json() below fails and that url is NOT
        # included in the paper, author, or concept lists; instead it goes into
        # error_url_dict_list. This is not a problem because later, when merging
        # with REF, the merged file simply shows NaN for 'Number of Concepts' etc.
        # In fact, even if empty dicts were created for urls with 404 status codes,
        # the final merged output would be the same.
        try:
            j = response.json()
            get_paper_dict_from_json_result(j, url, paper_dict_list)
            get_author_dict_list_from_authors(j, url, author_dict_list)
            get_concept_dict_list_from_concepts(j, url, concept_dict_list)
            print(f'{url_index} / {len(URLS)} is done')
        except Exception:
            error_url_dict = {
                'Error URL': url,
                'Error Status Code': response.status_code,
            }
            error_url_dict_list.append(error_url_dict)
            print(f'{url} : {response.status_code}')
        time.sleep(0.5)


if __name__ == '__main__':
    s, headers = get_s()
    # REF is openalex_reference_df with the rows whose 'Number of References' is missing dropped
    REF, URLS = get_unique_ref_urls(OPENALEX_REFERENCE_DF)
    random_urls = URLS[0:11]  # small slice for quick testing (unused in the full run)
    paper_dict_list = []
    author_dict_list = []
    concept_dict_list = []
    error_url_dict_list = []
    retry_code = [500, 502, 503, 504, 429]
    main(URLS, s, headers)
    paper_df = pd.DataFrame(paper_dict_list)
    author_df = pd.DataFrame(author_dict_list)
    concept_df = pd.DataFrame(concept_dict_list)
    error_df = pd.DataFrame(error_url_dict_list)
    ref_paper_df = REF.merge(paper_df, on="Reference", how='left')
    ref_author_df = REF.merge(author_df, on="Reference", how='left')
    ref_concept_df = REF.merge(concept_df, on="Reference", how='left')
    paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF_UNIQUE, index=False)
    author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF_UNIQUE, index=False)
    concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF_UNIQUE, index=False)
    ref_paper_df.to_csv(OPENALEX_REFERENCE_PAPER_DF, index=False)
    ref_author_df.to_csv(OPENALEX_REFERENCE_AUTHOR_DF, index=False)
    ref_concept_df.to_csv(OPENALEX_REFERENCE_CONCEPT_DF, index=False)
    error_df.to_csv(OPENALEX_REFERENCE_ERROR_DF, index=False)
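The one non-obvious step here is the URL rewriting in `main()`: the reference data stores each work as a canonical `https://openalex.org/W...` URL, which is turned into an API request by inserting the `api.` subdomain. A sketch of just that step, mirroring the substitution the script uses (error handling and retries omitted):

    import re
    import requests

    def fetch_reference_work(reference_url):
        """Rewrite a stored work URL ('https://openalex.org/W...') into its API
        form and fetch the JSON record, as done in main() above."""
        api_url = re.sub('https://', 'https://api.', reference_url)
        r = requests.get(api_url)
        return r.json() if r.status_code == 200 else None

    # Example, using the work ID shown earlier in this README:
    # fetch_reference_work('https://openalex.org/W3203914472')

The next, much shorter script builds the list of papers to study by removing two problem DOIs from the cleaned vispubdata list.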
`scripts/get_papers_to_study.py`:

```python
import pandas as pd
import csv
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
PAPERS_TO_STUDY = sys.argv[2]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_papers_to_study(INPUT):
    # INPUT here is vispd_plus_good_papers
    vispd_plus_good_papers = read_txt(INPUT)
    to_exclude_from_analysis = [
        '10.1109/VISUAL.1990.146412',   # this one simply cannot be found by either title or doi query
        '10.1109/VISUAL.2003.1250379',  # this one is a wrong match and I can't find a way to locate it on openalex
    ]
    papers_to_study = [
        x for x in vispd_plus_good_papers if x not in to_exclude_from_analysis
    ]
    return papers_to_study

papers_to_study = get_papers_to_study(VISPD_PLUS_GOOD_PAPERS)

with open(PAPERS_TO_STUDY, 'w') as f:
    for doi in papers_to_study:
        f.write("%s\n" % doi)
```
`scripts/get_titles_2021.py`:

```python
import sys
import pandas as pd
import requests
from bs4 import BeautifulSoup

TITLES_2021 = sys.argv[1]

def get_page(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    page = soup.find('article')
    return page

page = get_page('http://ieeevis.org/year/2021/info/papers-sessions')

def get_all_title_str(page):
    """all_title_str contains both full and short papers' titles"""
    strong_elements = page.find_all('strong')
    time_str_elements = [
        x for x in strong_elements if 'CDT' in x.string or 'October' in x.string
    ]
    all_title_str = [x.string for x in strong_elements if x not in time_str_elements]
    return all_title_str

all_title_str = get_all_title_str(page)

def get_str_to_exclude(page):
    """obtain the list of short paper titles

    First, I obtain both 'strong' and 'em' elements.
    Then, I obtain the index of each line that contains 'Short Papers:'.
    That serves as the "starting index" later.
    For each line that contains 'Short Papers:', I obtain the index of the
    immediately following line that contains 'Session Chair:'. That index
    serves as the "end index".
    For each "start" and "end" pair, I get the elements in between and
    extract their strings. These include all short papers' titles.
    """
    strong_and_em = page.find_all(['strong', 'em'])
    short_paper_em_idx = [
        strong_and_em.index(i) for i in strong_and_em if 'Short Papers:' in i.string
    ]
    session_chair_em_idx = [
        strong_and_em.index(i) for i in strong_and_em if 'Session Chair:' in i.string
    ]
    end_idx_list = []
    for idx in short_paper_em_idx:
        end_idx = session_chair_em_idx.index(idx+1)
        end_idx_list.append(session_chair_em_idx[end_idx+1])
    start_end_dic = dict(zip(short_paper_em_idx, end_idx_list))
    str_to_exclude_list = []
    for start in start_end_dic.keys():
        to_exclude = strong_and_em[start:start_end_dic[start]]
        str_to_exclude = [x.string for x in to_exclude]
        str_to_exclude_list.append(str_to_exclude)
    str_to_exclude_list_flattened = [
        item for sublist in str_to_exclude_list for item in sublist
    ]
    return str_to_exclude_list_flattened

str_to_exclude = get_str_to_exclude(page)
title_str = [x for x in all_title_str if x not in str_to_exclude]
title_str.remove(
    'Jurassic Mark: Inattentional Blindness for a Datasaurus Reveals that Visualizations are Explored, not Seen'
)

# This paper changed its title for publication on TVCG
title_replace_dict = {
    'IRVINE: Using Interactive Clustering and Labeling to Analyze Correlation Patterns: A Design Study from the Manufacturing of Electrical Engines':
    'IRVINE: A Design Study on Analyzing Correlation Patterns of Electrical Engines',
}

def replace_title(TITLES, DIC):
    for i, n in enumerate(TITLES):
        if n in DIC.keys():
            TITLES[i] = DIC[n]
    return TITLES

title_str = replace_title(title_str, title_replace_dict)

if len(title_str) == 170:
    print('title_str has 170 elements. everything correct')
else:
    print('something is wrong. the length of title_str is not 170')

df = pd.DataFrame(title_str, columns=['title'])
df.to_csv(TITLES_2021, index=False)
```
`scripts/get_vispd_openalex_match_1.py`:

```python
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_1 = sys.argv[3]
TITLE_QUERY_EMPTY_DOI_QUERY_404_1 = sys.argv[4]
TITLE_QUERY_404_1 = sys.argv[5]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]

# def get_s():
#     # set retry if status codes in [500, 502, 503, 504, 429]
#     # also return headers
#     s = requests.Session()
#     retries = Retry(total=5,
#                     backoff_factor=0.1,
#                     status_forcelist=[500, 502, 503, 504, 429],
#                     )
#     s.mount('http://', HTTPAdapter(max_retries=retries))
#     headers = {
#         "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
#         'Accept': 'application/json',
#     }
#     return s, headers

def get_title_query_response(doi):
    title_original = doi_title_dict[doi]
    title = re.sub(r'\:|\?|\&|\,', '', title_original)
    response = requests.get(
        'https://api.openalex.org/works?filter=title.search:' + title)
    return response

def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count

def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response

def get_paper_dict_from_json_result(j, doi):
    openalex_id = j['id']
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue = venue['id']
    openalex_url = venue['url']
    openalex_journal = venue['display_name']
    openalex_publisher = venue['publisher']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue': openalex_venue,
        'OpenAlex Journal': openalex_journal,
        'OpenAlex Publisher': openalex_publisher,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
    }
    return paper_dict

def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict

def get_paper_dict_list(doi, doi_index):
    # query title first:
    response = get_title_query_response(doi)
    while response.status_code in retry_code:
        print(f'title query for {doi_index} : {doi} has error. Error status code is {response.status_code}. Retrying...')
        time.sleep(1)
        response = get_title_query_response(doi)
    # if title query succeeds:
    if response.status_code == 200:
        # get json and check results count:
        j = check_results_count(response)[0]
        count = check_results_count(response)[1]
        # if count is non-zero:
        if count > 0:
            first_result = j['results'][0]
            paper_dict = get_paper_dict_from_json_result(first_result, doi)
        # if count is zero, use doi query instead
        else:
            # get doi query response:
            response2 = get_doi_query_response(doi)
            while response2.status_code in retry_code:
                print(f'doi query for {doi_index} : {doi} has error. Error status code is {response2.status_code}. Retrying...')
                time.sleep(1)
                response2 = get_doi_query_response(doi)
            # if doi query succeeds:
            if response2.status_code == 200:
                j2 = response2.json()
                paper_dict = get_paper_dict_from_json_result(j2, doi)
            # empty title query, and 404 for doi query:
            else:
                error_status_code.append(response2.status_code)
                title_query_empty_doi_query_404_list.append(doi)
                paper_dict = get_empty_paper_dict(doi)
                print(f'doi query is not successful for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')
    # if title query fails:
    else:
        title_query_404_list.append(doi)
        error_status_code.append(response.status_code)
        # error_status_code.append([doi, response.status_code])
        paper_dict = get_empty_paper_dict(doi)
        print(f'title query is not successful for {doi_index} : {doi_title_dict[doi]}')
    paper_dict_list.append(paper_dict)

def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        get_paper_dict_list(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))

if __name__ == '__main__':
    # note on 2022-01-21: this is not a bug, but it might be error-prone:
    # I define variables here and then use them directly inside `main`
    # without passing them as parameters, e.g., `main(vispd_plus_good_papers, s)`.
    # It works, but as I said, it might be error-prone.
    vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
    doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
    doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
    retry_code = [500, 502, 503, 504, 429]
    paper_dict_list = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    error_status_code = []
    # s = get_s()[0]
    # headers = get_s()[1]
    main(vispd_plus_good_papers)
    paper_df = pd.DataFrame(paper_dict_list)
    paper_df.to_csv(VISPD_OPENALEX_MATCH_1, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_1, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_1, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
```
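The matching logic boils down to two OpenAlex endpoints: a title-search filter and a direct DOI lookup. A minimal sketch of both calls, using a DOI that appears in this repository's data (the title string here is illustrative only):

```python
import requests

# title search: returns a result list plus a count in 'meta'
r1 = requests.get('https://api.openalex.org/works',
                  params={'filter': 'title.search:Visualization'})
print(r1.json()['meta']['count'])  # number of matching works

# direct DOI lookup: returns a single work object
r2 = requests.get('https://api.openalex.org/works/doi:10.1109/TVCG.2006.168')
print(r2.json()['id'])  # canonical OpenAlex ID, e.g. 'https://openalex.org/W...'
```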
`scripts/get_vispd_openalex_match_2.py`:

```python
import requests
import csv
import pandas as pd
import random
import re
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import sys

VISPD_PLUS_GOOD_PAPERS = sys.argv[1]
VISPUBDATA_PLUS = sys.argv[2]
VISPD_OPENALEX_MATCH_2 = sys.argv[3]
TITLE_QUERY_EMPTY_DOI_QUERY_404_2 = sys.argv[4]
TITLE_QUERY_404_2 = sys.argv[5]
DOI_QUERY_404_2 = sys.argv[6]

def read_txt(INPUT):
    """read txt files and return a list"""
    raw = open(INPUT, "r")
    reader = csv.reader(raw)
    allRows = [row for row in reader]
    data = [i[0] for i in allRows]
    return data

def get_dicts(VISPUBDATA_PLUS):
    # get year_dict and title_dict
    vispd_plus = pd.read_csv(VISPUBDATA_PLUS)
    dois = vispd_plus.loc[:, "DOI"].tolist()
    titles = vispd_plus.loc[:, "Title"].tolist()
    years = vispd_plus.loc[:, "Year"].tolist()
    doi_year_dict = dict(zip(dois, years))
    doi_title_dict = dict(zip(dois, titles))
    return [doi_year_dict, doi_title_dict]

def get_title_query_response(doi):
    title_original = doi_title_dict[doi]
    title = re.sub(r'\:|\?|\&|\,', '', title_original)
    response = requests.get(
        'https://api.openalex.org/works?filter=title.search:' + title)
    return response

def check_results_count(response):
    j = response.json()
    count = j['meta']['count']
    return j, count

def get_doi_query_response(doi):
    response = requests.get("https://api.openalex.org/works/doi:" + doi)
    return response

def get_paper_dict_from_json_result(j, doi):
    openalex_id = j['id']
    openalex_title = j['display_name']
    openalex_year = j['publication_year']
    openalex_doi = j['doi']
    venue = j['host_venue']
    openalex_venue = venue['id']
    openalex_url = venue['url']
    openalex_journal = venue['display_name']
    openalex_publisher = venue['publisher']
    openalex_first_page = j['biblio']['first_page']
    openalex_last_page = j['biblio']['last_page']
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
        'OpenAlex Year': openalex_year,
        'OpenAlex ID': openalex_id,
        'OpenAlex Title': openalex_title,
        'OpenAlex DOI': openalex_doi,
        'OpenAlex URL': openalex_url,
        'OpenAlex Venue': openalex_venue,
        'OpenAlex Journal': openalex_journal,
        'OpenAlex Publisher': openalex_publisher,
        'OpenAlex First Page': openalex_first_page,
        'OpenAlex Last Page': openalex_last_page,
    }
    return paper_dict

def get_empty_paper_dict(doi):
    paper_dict = {
        'Year': doi_year_dict[doi],
        'DOI': doi,
        'Title': doi_title_dict[doi],
    }
    return paper_dict

def update_paper_dict_list(doi, doi_index):
    if doi not in to_query_by_doi:
        # query title first:
        response = get_title_query_response(doi)
        # if status code is in retry_code, retry:
        while response.status_code in retry_code:
            print(f'title query for {doi_index} : {doi} is having errors, error status code is {response.status_code}, retrying...')
            time.sleep(1)
            response = get_title_query_response(doi)
        # if title query succeeds:
        if response.status_code == 200:
            # get json and check results count:
            j = check_results_count(response)[0]
            count = check_results_count(response)[1]
            # if count is non-zero:
            if count > 0:
                # if doi is not in special_result_index_dict, use index 0
                if doi not in list(special_result_index_dict.keys()):
                    first_result = j['results'][0]
                    paper_dict = get_paper_dict_from_json_result(first_result, doi)
                else:
                    correct_index = special_result_index_dict[doi]
                    correct_result = j['results'][correct_index]
                    paper_dict = get_paper_dict_from_json_result(correct_result, doi)
            # if count is zero, use doi query instead
            else:
                # get doi query response:
                response2 = get_doi_query_response(doi)
                # if status code is in retry_code, retry:
                while response2.status_code in retry_code:
                    print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response2.status_code}, retrying...')
                    time.sleep(1)
                    response2 = get_doi_query_response(doi)
                # if doi query succeeds:
                if response2.status_code == 200:
                    j2 = response2.json()
                    paper_dict = get_paper_dict_from_json_result(j2, doi)
                # if doi query fails, add the doi to the no-result list
                else:
                    # empty title query results and bad doi query
                    error_status_code.append(response2.status_code)
                    title_query_empty_doi_query_404_list.append(doi)
                    paper_dict = get_empty_paper_dict(doi)
                    print(f'doi query fails for {doi_index} : {doi}, whose title is {doi_title_dict[doi]}')
        # if title query fails:
        else:
            title_query_404_list.append(doi)
            error_status_code.append(response.status_code)
            paper_dict = get_empty_paper_dict(doi)
            print(f'title query fails for {doi_index} : {doi_title_dict[doi]}')
    else:
        response0 = get_doi_query_response(doi)
        # if status code is in retry_code, retry
        while response0.status_code in retry_code:
            print(f'doi query for {doi_index} : {doi} is having errors, error status code is {response0.status_code}, retrying...')
            time.sleep(3)
            response0 = get_doi_query_response(doi)
        # if doi query succeeds:
        if response0.status_code == 200:
            j0 = response0.json()
            paper_dict = get_paper_dict_from_json_result(j0, doi)
        # if doi query fails:
        else:
            error_status_code.append(response0.status_code)
            doi_query_404_list.append(doi)
            paper_dict = get_empty_paper_dict(doi)
            print(f'doi query fails for {doi_index} : {doi}')
    paper_dict_list.append(paper_dict)

def main(DOIS):
    for doi in DOIS:
        doi_index = DOIS.index(doi) + 1
        update_paper_dict_list(doi, doi_index)
        print(f'{doi_index} is done')
        time.sleep(0.5)
    print(list(set(error_status_code)))

if __name__ == '__main__':
    # note on 2022-01-21: this is not a bug, but it might be error-prone:
    # I define variables here and then use them directly inside `main`
    # without passing them as parameters, e.g., `main(vispd_plus_good_papers, s)`.
    # It works, but as I said, it might be error-prone.
    vispd_plus_good_papers = read_txt(VISPD_PLUS_GOOD_PAPERS)
    doi_year_dict = get_dicts(VISPUBDATA_PLUS)[0]
    doi_title_dict = get_dicts(VISPUBDATA_PLUS)[1]
    retry_code = [500, 502, 503, 504, 429]
    paper_dict_list = []
    title_query_empty_doi_query_404_list = []
    title_query_404_list = []
    doi_query_404_list = []
    error_status_code = []
    to_query_by_doi = [
        '10.1109/VISUAL.2001.964489',
        '10.1109/VISUAL.1996.568113',
        '10.1109/VISUAL.1999.809896',
        '10.1109/VISUAL.1991.175771',
        '10.1109/VISUAL.1998.745302',
        '10.1109/VISUAL.1993.398868',
        '10.1109/INFVIS.2005.1532128',
        '10.1109/VISUAL.1993.398859',
        '10.1109/VISUAL.1991.175795',
        '10.1109/VISUAL.2003.1250401',
        '10.1109/VISUAL.1991.175789',
        '10.1109/VISUAL.2000.885739',
        '10.1109/TVCG.2014.2346922',
        '10.1109/VISUAL.1999.809871',
        '10.1109/VISUAL.1996.567807',
        '10.1109/VISUAL.2000.885692',
        '10.1109/VISUAL.1991.175777',
        '10.1109/VISUAL.1998.745315',
        '10.1109/VISUAL.1997.663909',
        '10.1109/VISUAL.2000.885697',
        '10.1109/VISUAL.2001.964504',
        '10.1109/TVCG.2006.168',
        '10.1109/TVCG.2007.70617',
        '10.1109/VISUAL.1997.663910',
        '10.1109/VISUAL.1997.663931',
        '10.1109/VISUAL.2002.1183792',
        '10.1109/VISUAL.1992.235201',
        '10.1109/VISUAL.1996.568128',
        '10.1109/VISUAL.1997.663923',
        '10.1109/VAST.2011.6102441',
        '10.1109/VISUAL.2000.885732',
        '10.1109/VISUAL.2001.964522',
        '10.1109/VISUAL.2005.1532812',
        '10.1109/VISUAL.1998.745350',
        '10.1109/INFVIS.2001.963282',
        '10.1109/VISUAL.1995.480804',
        '10.1109/VISUAL.2005.1532847',
        '10.1109/INFVIS.1996.559229',
        '10.1109/VISUAL.2000.885738',
        '10.1109/VISUAL.1991.175800',
        '10.1109/VISUAL.1993.398865',
        '10.1109/VISUAL.1993.398866',
        '10.1109/VISUAL.1998.745348',
        '10.1109/VISUAL.1993.398867',
        '10.1109/VISUAL.1997.663925',
        '10.1109/VISUAL.1993.398900',
        '10.1109/VISUAL.1992.235181',
        '10.1109/VISUAL.1992.235195',
        '10.1109/VISUAL.2000.885719',
        '10.1109/VISUAL.1991.175816',
        '10.1109/VISUAL.1990.146414',
        '10.1109/VISUAL.1993.398861',
        '10.1109/VISUAL.1993.398872',
        '10.1109/VISUAL.1994.346292',
        '10.1109/VISUAL.1994.346295',
        '10.1109/VISUAL.1994.346297',
        '10.1109/VISUAL.1994.346301',
        '10.1109/VISUAL.1999.809913',
        '10.1109/VISUAL.2001.964546',
        '10.1109/VISUAL.2003.1250404',
        '10.1109/TVCG.2014.2346442',
        '10.1109/TVCG.2020.3028948',
        '10.1109/TVCG.2020.3030363',
        '10.1109/TVCG.2020.3030364',
        '10.1109/tvcg.2021.3114784',
        '10.1109/tvcg.2021.3114780',
        '10.1109/tvcg.2021.3114782',
        '10.1109/tvcg.2021.3114783',
        '10.1109/tvcg.2021.3114836',
        '10.1109/TVCG.2021.3064037',
        '10.1109/TVCG.2021.3114849',
        '10.1109/TVCG.2021.3114842',
        '10.1109/TVCG.2021.3114766',
        '10.1109/TVCG.2021.3114777'
    ]
    special_result_index_dict = {
        '10.1109/VISUAL.1992.235194': 4,
    }
    main(vispd_plus_good_papers)
    paper_df = pd.DataFrame(paper_dict_list)
    paper_df.to_csv(VISPD_OPENALEX_MATCH_2, index=False)
    with open(TITLE_QUERY_EMPTY_DOI_QUERY_404_2, 'w') as f:
        for doi in title_query_empty_doi_query_404_list:
            f.write("%s\n" % doi)
    with open(TITLE_QUERY_404_2, 'w') as f:
        for doi in title_query_404_list:
            f.write("%s\n" % doi)
    with open(DOI_QUERY_404_2, 'w') as f:
        for doi in doi_query_404_list:
            f.write("%s\n" % doi)
```
`scripts/get_vispd_plus_good_papers.py`:

```python
import pandas as pd
import sys

VISPUBDATA_PLUS = sys.argv[1]
VISPD_PLUS_GOOD_PAPERS = sys.argv[2]

def get_vispd_plus_good_papers(INPUT):
    """get the list of good dois"""
    vispd_plus = pd.read_csv(INPUT)
    jc = ['J', 'C']
    good_papers = vispd_plus[
        (vispd_plus.PaperType.isin(jc)) | (vispd_plus.Year > 2020)
    ]
    dois = good_papers.loc[:, "DOI"].tolist()
    # remove the invalid DOI
    dois.remove('10.0000/00000001')
    return dois

vispd_plus_good_papers = get_vispd_plus_good_papers(VISPUBDATA_PLUS)

with open(VISPD_PLUS_GOOD_PAPERS, 'w') as f:
    for doi in vispd_plus_good_papers:
        f.write("%s\n" % doi)
```
`scripts/get_vispd_plus.py`:

```python
import sys
import pandas as pd

DOIS_2021 = sys.argv[1]
VISPUBDATA = sys.argv[2]
VISPUBDATA_PLUS = sys.argv[3]

if __name__ == '__main__':
    dois_2021_df = pd.read_csv(DOIS_2021)
    vispd = pd.read_csv(VISPUBDATA)
    vispd_plus = vispd.append(dois_2021_df, ignore_index=True)
    vispd_plus.to_csv(VISPUBDATA_PLUS, index=False)
```
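Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so this script requires an older pandas. A minimal sketch of the drop-in replacement on newer versions, using toy frames with made-up rows:

```python
import pandas as pd

vispd = pd.DataFrame({'DOI': ['10.1109/TVCG.2006.168'], 'Year': [2006]})
dois_2021_df = pd.DataFrame({'DOI': ['10.1109/TVCG.2021.3114766'], 'Year': [2021]})

# equivalent to vispd.append(dois_2021_df, ignore_index=True) on old pandas
vispd_plus = pd.concat([vispd, dois_2021_df], ignore_index=True)
```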
`scripts/get_wos_id.py`:

```python
import pandas as pd
import urllib
import requests
from bs4 import BeautifulSoup
import re
import csv
import random
import numpy as np
import time
import sys

INPUT = sys.argv[1]
OUT_FNAME = sys.argv[2]

def get_wos_id_from_doi(doi):
    url = f'http://ws.isiknowledge.com/cps/openurl/service?url_ver=Z39.88-2004&rft_id=info:doi/{doi}'
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    }
    response = requests.get(url=url, headers=headers)
    wos_url = response.history[-1].url
    wos_id_list = re.findall(r'(?<=2FWOS%3A)(.*)(?=%3F)', wos_url)
    if wos_id_list:
        wos_id = wos_id_list[0]
    else:
        wos_id = np.NaN
    doi_wos_dict = {
        'DOI': doi,
        'WOS ID': wos_id
    }
    doi_wos_dict_list.append(doi_wos_dict)

def get_dois(INPUT):
    good_dois = open(INPUT, 'r')
    reader = csv.reader(good_dois)
    allRows = [row for row in reader]
    dois = [i[0] for i in allRows]
    return dois

def build_df_from_dict_list(df, dict_list):
    """build df from a list of dictionaries

    Arguments:
        df: an empty df you just initiated
        dict_list: a list of dictionaries containing data you want to form a df

    Returns:
        The updated df
    """
    for i in dict_list:
        df_1 = pd.DataFrame([i])
        df = df.append(df_1, ignore_index=True)
    return df

def main():
    for doi in dois:
        get_wos_id_from_doi(doi)
        time.sleep(2 + random.uniform(0, 2))
        print(f'{dois.index(doi) + 1} is done')

if __name__ == '__main__':
    # initiate a list of dicts
    doi_wos_dict_list = []
    dois = get_dois(INPUT)
    main()
    # initiate a dataframe
    doi_wos_df_initiate = pd.DataFrame(columns=['DOI', 'WOS ID'])
    doi_wos_df = build_df_from_dict_list(
        doi_wos_df_initiate, doi_wos_dict_list)
    doi_wos_df.to_csv(OUT_FNAME, index=False)
```
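`build_df_from_dict_list` also relies on the removed `DataFrame.append`, and growing a frame row by row is slow. Since every dict in `doi_wos_dict_list` is flat, the whole list can be converted in one call; a minimal sketch with a made-up WOS ID:

```python
import pandas as pd

doi_wos_dict_list = [
    {'DOI': '10.1109/VISUAL.1990.146412', 'WOS ID': 'WOS000000000000'},  # made-up ID
    {'DOI': '10.1109/VISUAL.2003.1250379', 'WOS ID': float('nan')},
]
# one call replaces the per-row append loop
doi_wos_df = pd.DataFrame(doi_wos_dict_list, columns=['DOI', 'WOS ID'])
```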
`scripts/plot_data_author_chord_diagram_data.py`:

```python
import sys
import pandas as pd
import itertools
from collections import Counter

HT_CLEANED_AUTHOR_DF = sys.argv[1]
AUTHOR_CHORD_DF = sys.argv[2]
TS_AUTHOR_CHORD_DF = sys.argv[3]

def get_dic(DF):
    # DF here is HT_CLEANED_AUTHOR_DF
    """get the dictionary of bicode counts"""
    tuple_list = []
    for group in DF.groupby('DOI'):
        country_codes = list(set(group[1]['Affiliation Country Code']))
        if len(country_codes) > 1:
            tuples = [x for x in itertools.combinations(country_codes, 2)]
            tuple_list.append(tuples)
    bicode = list(itertools.chain(*tuple_list))
    bicode_counts = Counter(bicode)
    bicode_counts_dic = dict(bicode_counts)
    return bicode_counts_dic

def get_chord_df(DIC):
    # DIC here is bicode_counts_dic
    """
    Return:
        A dataframe containing three columns: source, target, value.
        Even though I am using `source` and `target`, this is an
        undirected network.
    """
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df_sorted = chord_df[
        ['source', 'target', 'value']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    return chord_df_sorted

def get_ts_chord_df(DF, ts_chord_data):
    # DF here is HT_CLEANED_AUTHOR_DF
    """
    get timeseries data: group by year first,
    get each year's data, and then concatenate
    """
    for year_group in DF.groupby("Year"):
        bicode_counts_dic = get_dic(year_group[1])
        chord_df = pd.DataFrame(
            bicode_counts_dic.items(), columns=['pairs', 'value'])
        chord_df['year'] = year_group[0]
        chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
        chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
        chord_df_sorted = chord_df[
            ['source', 'target', 'value', 'year']].sort_values(
            by='value', ascending=False).reset_index(drop=True)
        ts_chord_data.append(chord_df_sorted)
    ts_chord_df = pd.concat(ts_chord_data, ignore_index=True)
    return ts_chord_df

def rename_countries(DF):
    """convert country codes to names"""
    DF.replace({
        'CH': 'Switzerland',
        'CN': 'China',
        'DE': 'Germany',
        'CA': 'Canada',
        'FR': 'France',
        'NL': 'Netherlands',
        'AT': 'Austria',
        'AU': 'Australia',
    }, inplace=True
    )
    return DF

if __name__ == '__main__':
    HT_CLEANED_AUTHOR_DF = pd.read_csv(HT_CLEANED_AUTHOR_DF)
    ts_chord_data = []
    bicode_counts_dic = get_dic(HT_CLEANED_AUTHOR_DF)
    chord_df = get_chord_df(bicode_counts_dic)
    chord_df.to_csv(AUTHOR_CHORD_DF, index=False)
    ts_chord_df = get_ts_chord_df(HT_CLEANED_AUTHOR_DF, ts_chord_data)
    ts_chord_df.to_csv(TS_AUTHOR_CHORD_DF, index=False)
```
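The pair-counting idiom above (unordered combinations fed into a `Counter`) is the core of both this chord-diagram script and the concept co-occurrence script further down; a minimal standalone sketch with made-up country codes:

```python
import itertools
from collections import Counter

# one list of affiliation country codes per paper (made-up data)
papers = [['US', 'DE'], ['US', 'DE', 'CN'], ['US']]
pairs = []
for codes in papers:
    if len(codes) > 1:  # single-country papers contribute no pairs
        pairs.extend(itertools.combinations(codes, 2))
print(Counter(pairs))
# Counter({('US', 'DE'): 2, ('US', 'CN'): 1, ('DE', 'CN'): 1})
```

One caveat: in the real script the codes come from a `set`, so the order inside each tuple is not deterministic, which is harmless here because the network is treated as undirected.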
`scripts/plot_sankey_data.py`:

```python
import pandas as pd
import sys
import numpy as np
import itertools
from collections import Counter

VISPUBDATA_PLUS = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
REFERENCE_CONCEPT_DF = sys.argv[3]
CITATION_CONCEPT_DF = sys.argv[4]
SANKEY_AGGREGATED_DF = sys.argv[5]
SANKEY_TS_DF = sys.argv[6]

def get_vis_doi_concept_dic(DF, LEVEL):
    # DF here is OPENALEX_CONCEPT_DF
    vis_levelns_df = DF[DF.Level == LEVEL].reset_index(drop=True)
    max_score_leveln = []
    for group in vis_levelns_df.groupby('DOI'):
        max_score = max(group[1]['Score'])
        df_to_append = group[1][group[1]['Score'] == max_score]
        max_score_leveln.append(df_to_append)
    vis_leveln_df = pd.concat(max_score_leveln, ignore_index=True)
    vis_leveln_doi_concept_dic = dict(
        zip(vis_leveln_df.DOI, vis_leveln_df.Concept))
    return vis_leveln_doi_concept_dic

def get_leveln_df(DF, LEVEL, ID_NAME):
    """
    inputs:
        DF is either REF_DF or CIT_DF
        ID_NAME is either REF_ID_NAME or CIT_ID_NAME
    Returns:
        a dataframe of two columns:
            1. IEEE VIS papers' DOI
            2. REF/CIT papers' concept
    """
    dfs = []
    levelns_df = DF[DF.Level == LEVEL]
    # keep only the highest score concept
    for group in levelns_df.groupby(ID_NAME):
        dff = group[1].sort_values(by='Score', ascending=False)
        max_score = max(dff['Score'])
        dff_to_append = dff[dff['Score'] == max_score]
        dfs.append(dff_to_append)
    leveln_df = pd.concat(dfs, ignore_index=True)[['DOI', 'Concept', ID_NAME]]
    return leveln_df

def get_leveln_output_df(DF, VIS_DOI_CONCEPT_DIC, YEAR_DICT, YEAR_KEY, SUFFIX):
    """
    inputs:
        DF is either REF_LEVELN_DF or CIT_LEVELN_DF
        YEAR_DICT is either DOI_YEAR_DICT or CIT_ID_YEAR_DICT
        YEAR_KEY is either REF_YEAR_KEY or CIT_YEAR_KEY
        SUFFIX is either REF_SUFFIX or CIT_SUFFIX
    The purpose of this step:
        1. map DOI to IEEE VIS concept
        2. get the year when this citation happens
    """
    DF['IEEE VIS Concept'] = DF.DOI.apply(
        lambda x: VIS_DOI_CONCEPT_DIC[
            x] if x in VIS_DOI_CONCEPT_DIC.keys() else np.NaN
    )
    DF['Year'] = DF[YEAR_KEY].apply(lambda x: YEAR_DICT[x])
    leveln_df_nonan = DF[DF['IEEE VIS Concept'].notnull()]
    leveln_df_output = leveln_df_nonan.drop(
        columns=['DOI']).reset_index(drop=True)
    if SUFFIX == REF_SUFFIX:
        leveln_df_output['Concept'] = leveln_df_output[
            'Concept'].apply(lambda s: s + REF_SUFFIX)
    else:
        leveln_df_output['Concept'] = leveln_df_output[
            'Concept'].apply(lambda s: s + CIT_SUFFIX)
    leveln_df_output['IEEE VIS Concept'] = leveln_df_output[
        'IEEE VIS Concept'].apply(lambda s: s + "(v)")
    return leveln_df_output

def get_leveln_aggregated(SOURCE, DF, LEVEL):
    """
    inputs:
        SOURCE is either 'REF' or 'CIT'
        DF is either REF_LEVELN_OUTPUT or CIT_LEVELN_OUTPUT
    """
    if SOURCE == 'REF':
        tuples = list(zip(
            DF['Concept'],
            DF['IEEE VIS Concept'],
        ))
    else:
        tuples = list(zip(
            DF['IEEE VIS Concept'],
            DF['Concept'],
        ))
    biconcept_counts = Counter(tuples)
    dic = dict(biconcept_counts)
    sankey_df = pd.DataFrame(dic.items(), columns=['pairs', 'value'])
    sankey_df['level'] = LEVEL
    sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
    sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
    sankey_df_sorted = sankey_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
    return sankey_df_sorted

def get_ts_year_group_data(SOURCE, DF, LEVEL):
    """
    inputs:
        SOURCE is either 'REF' or 'CIT'
        DF is year_group
    This is much the same as the get_leveln_aggregated() function
    """
    if SOURCE == 'REF':
        tuples = list(zip(
            DF[1]['Concept'],
            DF[1]['IEEE VIS Concept'],
        ))
    else:
        tuples = list(zip(
            DF[1]['IEEE VIS Concept'],
            DF[1]['Concept'],
        ))
    biconcept_counts = Counter(tuples)
    dic = dict(biconcept_counts)
    sankey_df = pd.DataFrame(dic.items(), columns=['pairs', 'value'])
    sankey_df['level'] = LEVEL
    sankey_df['source'] = sankey_df.pairs.apply(lambda x: x[0])
    sankey_df['target'] = sankey_df.pairs.apply(lambda x: x[1])
    sankey_df_sorted = sankey_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    sankey_df_sorted['rank'] = sankey_df_sorted.index + 1
    sankey_df_sorted['year'] = DF[0]
    return sankey_df_sorted

if __name__ == '__main__':
    VISPUBDATA_PLUS = pd.read_csv(VISPUBDATA_PLUS)
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    REF_DF = pd.read_csv(REFERENCE_CONCEPT_DF)
    CIT_DF = pd.read_csv(CITATION_CONCEPT_DF)
    REF_ID_NAME = 'Reference OpenAlex ID'
    CIT_ID_NAME = 'Citation Paper OpenAlex ID'
    REF_DF = REF_DF[REF_DF[REF_ID_NAME].notnull()]
    CIT_DF = CIT_DF[CIT_DF[CIT_ID_NAME].notnull()]
    CIT_DF.rename(columns={'Cited Paper DOI': 'DOI'}, inplace=True)
    DOI_YEAR_DICT = dict(zip(
        VISPUBDATA_PLUS.DOI, VISPUBDATA_PLUS.Year
    ))
    CIT_ID_YEAR_DICT = dict(zip(
        CIT_DF[CIT_ID_NAME], CIT_DF['Citation Paper Year']
    ))
    REF_YEAR_KEY = 'DOI'
    CIT_YEAR_KEY = CIT_ID_NAME
    # Set parameters
    START_LEVEL = 0
    END_LEVEL = 3
    CUTOFF = 500
    REF_SUFFIX = '(r)'
    CIT_SUFFIX = '(c)'
    # initiate dfs
    REF_LEVELN_AGGREGATED_DFS = []
    CIT_LEVELN_AGGREGATED_DFS = []
    REF_LEVELN_TS_DFS = []
    CIT_LEVELN_TS_DFS = []
    for LEVEL in range(START_LEVEL, END_LEVEL + 1):
        VIS_DOI_CONCEPT_DIC = get_vis_doi_concept_dic(
            OPENALEX_CONCEPT_DF, LEVEL
        )
        # REFERENCE -> VIS
        REF_LEVELN_DF = get_leveln_df(
            REF_DF, LEVEL, REF_ID_NAME,
        )
        REF_LEVELN_OUTPUT = get_leveln_output_df(
            REF_LEVELN_DF,
            VIS_DOI_CONCEPT_DIC,
            DOI_YEAR_DICT,
            REF_YEAR_KEY,
            REF_SUFFIX,
        )
        REF_LEVELN_AGGREGATED = get_leveln_aggregated(
            'REF', REF_LEVELN_OUTPUT, LEVEL,
        )
        REF_LEVELN_AGGREGATED_DFS.append(REF_LEVELN_AGGREGATED)
        # TIMESERIES:
        REF_LEVELN_YEAR_GROUP_DFS = []
        for year_group in REF_LEVELN_OUTPUT.groupby('Year'):
            year_group_data = get_ts_year_group_data(
                'REF', year_group, LEVEL
            )
            REF_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
        REF_LEVELN_TS_DF = pd.concat(
            REF_LEVELN_YEAR_GROUP_DFS, ignore_index=True,
        )
        REF_LEVELN_TS_DFS.append(REF_LEVELN_TS_DF)
        # VIS -> CITATION
        CIT_LEVELN_DF = get_leveln_df(
            CIT_DF, LEVEL, CIT_ID_NAME,
        )
        CIT_LEVELN_OUTPUT = get_leveln_output_df(
            CIT_LEVELN_DF,
            VIS_DOI_CONCEPT_DIC,
            CIT_ID_YEAR_DICT,
            CIT_YEAR_KEY,
            CIT_SUFFIX,
        )
        CIT_LEVELN_AGGREGATED = get_leveln_aggregated(
            'CIT', CIT_LEVELN_OUTPUT, LEVEL,
        )
        CIT_LEVELN_AGGREGATED_DFS.append(CIT_LEVELN_AGGREGATED)
        # TIMESERIES:
        CIT_LEVELN_YEAR_GROUP_DFS = []
        for year_group in CIT_LEVELN_OUTPUT.groupby('Year'):
            year_group_data = get_ts_year_group_data(
                'CIT', year_group, LEVEL,
            )
            CIT_LEVELN_YEAR_GROUP_DFS.append(year_group_data)
        CIT_LEVELN_TS_DF = pd.concat(
            CIT_LEVELN_YEAR_GROUP_DFS, ignore_index=True,
        )
        CIT_LEVELN_TS_DFS.append(CIT_LEVELN_TS_DF)
        print(f'level {LEVEL} is done')
    # GET AGGREGATED_DF
    ref_aggregated = pd.concat(
        REF_LEVELN_AGGREGATED_DFS, ignore_index=True,
    )
    ref_aggregated['source name'] = 'REF'
    cit_aggregated = pd.concat(
        CIT_LEVELN_AGGREGATED_DFS, ignore_index=True,
    )
    cit_aggregated['source name'] = 'VIS'
    aggregated_df = pd.concat(
        [ref_aggregated, cit_aggregated], ignore_index=True,
    )
    # GET TS_DF
    ref_timeseries = pd.concat(
        REF_LEVELN_TS_DFS, ignore_index=True,
    )
    ref_timeseries['source name'] = 'REF'
    cit_timeseries = pd.concat(
        CIT_LEVELN_TS_DFS, ignore_index=True,
    )
    cit_timeseries['source name'] = 'VIS'
    ts_df = pd.concat(
        [ref_timeseries, cit_timeseries], ignore_index=True,
    )
    # Write to file
    aggregated_df.to_csv(SANKEY_AGGREGATED_DF, index=False)
    ts_df.to_csv(SANKEY_TS_DF, index=False)
    print('sankey data has been saved!')
```
`scripts/plot_top_concepts_trends.py`:

```python
import sys
import numpy as np
import pandas as pd
from collections import Counter

OPENALEX_PAPER_DF = sys.argv[1]
OPENALEX_CONCEPT_DF = sys.argv[2]
TOP_CONCEPTS_TRENDS_DF = sys.argv[3]

def get_year_count_dic(DF):
    # DF here is openalex_paper_df
    """I want proportions, so I first need the total number of pubs each year"""
    year_count_df = DF.groupby(
        'Year').size().to_frame('count').reset_index()
    year_count_dic = dict(
        zip(year_count_df['Year'], year_count_df['count']))
    return year_count_dic

def get_top_concepts_rank_and_total(DF, LEVEL, CUTOFF):
    # DF here is OPENALEX_CONCEPT_DF
    """get the top concepts, their ranks, and their historical totals"""
    # filter by specific level
    lvl = DF[DF.Level == LEVEL]
    # get the total frequency of the concepts within that level
    lvl_df = lvl.groupby(['Concept', 'Concept ID']).size().to_frame(
        'frequency').reset_index().sort_values(
        by='frequency', ascending=False).head(CUTOFF)
    # get the rank of each of the top concepts within that level
    # generate two dics: one for rank, and the other for total
    lvl_df['rank'] = range(1, CUTOFF + 1)
    top_concepts = lvl_df['Concept']
    concept_rank_dic = dict(zip(lvl_df['Concept'], lvl_df['rank']))
    concept_historical_total_dic = dict(zip(lvl_df['Concept'], lvl_df['frequency']))
    return top_concepts, concept_rank_dic, concept_historical_total_dic

def get_ts_for_top(DF, TOP_CONCEPTS):
    # DF here is OPENALEX_CONCEPT_DF
    """
    get timeseries data for top concepts
    Returns:
        a dataframe where each row contains a concept, a year, and the
        total frequency of that concept in that year
    """
    top_concepts_ts_df = DF[DF.Concept.isin(TOP_CONCEPTS)].groupby(
        ['Concept', 'Year']).size().to_frame(
        'Concept Yearly Frequency').reset_index()
    return top_concepts_ts_df

def update_dfs(
    DF, i, TOP_RANK_DIC, TOP_TOTAL_DIC, YEAR_COUNT_DIC, DFS
):
    # DF here is TOP_CONCEPTS_TS_DF
    LEVEL = i
    dfss = []
    start = 1990
    end = 2021
    year_idx = range(start, end + 1)
    for group in DF.groupby('Concept'):
        # Normalize each concept in each level by the same time range,
        # i.e., 1990-2021
        year_frequency_dic = dict(
            zip(group[1]['Year'], group[1]['Concept Yearly Frequency']))
        concepts = [group[1].iloc[0, :].Concept] * len(year_idx)
        frequencies = [
            year_frequency_dic[
                x] if x in year_frequency_dic.keys() else 0 for x in year_idx]
        time_series_df = pd.DataFrame(
            list(zip(concepts, year_idx, frequencies)),
            columns=[f'concept_{LEVEL}', f'year_{LEVEL}', f'yearly frequency_{LEVEL}'])
        time_series_df[f'rank_{LEVEL}'] = time_series_df[f'concept_{LEVEL}'].apply(
            lambda x: TOP_RANK_DIC[x])
        time_series_df[f'level_{LEVEL}'] = LEVEL
        time_series_df[f'concept historical total_{LEVEL}'] = time_series_df[
            f'concept_{LEVEL}'].apply(
            lambda x: TOP_TOTAL_DIC[x])
        time_series_df[f'yearly vis total_{LEVEL}'] = time_series_df[f'year_{LEVEL}'].apply(
            lambda x: YEAR_COUNT_DIC[x])
        time_series_df[f'proportion_{LEVEL}'] = time_series_df[
            f'yearly frequency_{LEVEL}'] / time_series_df[f'yearly vis total_{LEVEL}']
        # time_series_df is for each concept within each level;
        # dfss contains all concepts' data within a level
        dfss.append(time_series_df.reset_index(drop=True))
    level_df_to_append = pd.concat(dfss, ignore_index=True)
    level_df_to_append.sort_values(by=[f'rank_{LEVEL}', f'year_{LEVEL}'], inplace=True)
    DFS.append(level_df_to_append.reset_index(drop=True))

if __name__ == '__main__':
    # Set parameters
    START_LEVEL = 0
    END_LEVEL = 3
    # CUTOFF = 30
    CUTOFF = 10
    OPENALEX_PAPER_DF = pd.read_csv(OPENALEX_PAPER_DF)
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    YEAR_COUNT_DIC = get_year_count_dic(OPENALEX_PAPER_DF)
    DFS = []
    for i in range(START_LEVEL, END_LEVEL + 1):
        TOP_CONCEPTS, TOP_RANK_DIC, TOP_TOTAL_DIC = get_top_concepts_rank_and_total(
            OPENALEX_CONCEPT_DF, i, CUTOFF
        )
        TOP_CONCEPTS_TS_DF = get_ts_for_top(
            OPENALEX_CONCEPT_DF, TOP_CONCEPTS
        )
        update_dfs(
            TOP_CONCEPTS_TS_DF, i, TOP_RANK_DIC, TOP_TOTAL_DIC,
            YEAR_COUNT_DIC, DFS
        )
    # concat, validate, and write to file
    dff = pd.concat(DFS, axis=1)
    print(dff['year_1'].tolist() == dff['year_2'].tolist())
    print(dff['year_1'].tolist() == dff['year_3'].tolist())
    print(dff['rank_1'].tolist() == dff['rank_3'].tolist())
    print(dff['rank_1'].tolist() == dff['rank_2'].tolist())
    dff.to_csv(TOP_CONCEPTS_TRENDS_DF, index=False)
```
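The `proportion_{LEVEL}` column is simply a concept's yearly frequency normalized by that year's total VIS paper count; a worked toy example (numbers made up):

```python
yearly_frequency = 12    # papers tagged with the concept that year
yearly_vis_total = 100   # all VIS papers that year
proportion = yearly_frequency / yearly_vis_total
print(proportion)  # 0.12
```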
`scripts/plot_vis_concepts_cooccurance_data.py`:

```python
import sys
import numpy as np
import pandas as pd
import itertools
from collections import Counter

OPENALEX_CONCEPT_DF = sys.argv[1]
AGGREGATED_COOCCURANCE_DF = sys.argv[2]
TS_AGGREGATED_COOCCURANCE_DF = sys.argv[3]

def get_level_df(DF, LEVEL):
    # subset by level
    level_df = DF[DF.Level == LEVEL].reset_index(drop=True)
    return level_df

def get_dic(LEVEL_DF):
    """get the dictionary of biconcept counts"""
    # initiate a tuple list
    tuple_list = []
    # for each ieeevis paper, get combinations of level concepts if more
    # than one level concept exists
    for group in LEVEL_DF.groupby('DOI'):
        concepts = list(set(group[1].Concept))
        if len(concepts) > 1:
            tuples = [x for x in itertools.combinations(concepts, 2)]
            tuple_list.append(tuples)
    # get biconcepts dictionary
    biconcepts = list(itertools.chain(*tuple_list))
    biconcept_counts_dic = dict(Counter(biconcepts))
    return biconcept_counts_dic

def update_data(DIC, LEVEL, CUTOFF, DATA):
    # DIC: biconcept_counts_dic
    # DATA: cooccurance_aggregated_data
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['level'] = LEVEL
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df = chord_df[
        ['source', 'target', 'value', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    chord_df = chord_df[chord_df['value'] >= CUTOFF]
    DATA.append(chord_df)

def update_ts_data(DIC, YEAR, LEVEL, CUTOFF, DATA):
    """get timeseries chord dataframe"""
    chord_df = pd.DataFrame(DIC.items(), columns=['pairs', 'value'])
    chord_df['year'] = YEAR
    chord_df['level'] = LEVEL
    chord_df['source'] = chord_df.pairs.apply(lambda x: x[0])
    chord_df['target'] = chord_df.pairs.apply(lambda x: x[1])
    chord_df = chord_df[
        ['source', 'target', 'value', 'year', 'level']].sort_values(
        by='value', ascending=False).reset_index(drop=True)
    chord_df = chord_df[chord_df['value'] >= CUTOFF]
    DATA.append(chord_df)

if __name__ == '__main__':
    OPENALEX_CONCEPT_DF = pd.read_csv(OPENALEX_CONCEPT_DF)
    # set parameters
    CUTOFF = 1  # cutoff number for cooccurrence
    START = 0   # top level
    END = 3     # lowest level
    # Get aggregated data, involving data of all levels
    cooccurance_aggregated_data = []
    # iterate through all levels
    for LEVEL in range(START, END + 1):
        LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)
        biconcept_counts_dic = get_dic(LEVEL_DF)
        update_data(
            biconcept_counts_dic, LEVEL, CUTOFF, cooccurance_aggregated_data)
    # write to file
    aggregated_df = pd.concat(cooccurance_aggregated_data, ignore_index=True)
    aggregated_df.to_csv(AGGREGATED_COOCCURANCE_DF, index=False)
    # Get timeseries data
    cooccurance_timeseries_aggregated_data = []
    for LEVEL in range(START, END + 1):
        # initiate time series data for each level;
        # it will collect each year's data within the current LEVEL
        cooccurance_timeseries_data = []
        LEVEL_DF = get_level_df(OPENALEX_CONCEPT_DF, LEVEL)
        for YEAR_GROUP in LEVEL_DF.groupby('Year'):
            biconcept_counts_dic = get_dic(YEAR_GROUP[1])
            update_ts_data(
                biconcept_counts_dic, YEAR_GROUP[0], LEVEL, CUTOFF,
                cooccurance_timeseries_data
            )
        # this is the final data for each level
        cooccurance_timeseries_df = pd.concat(
            cooccurance_timeseries_data, ignore_index=True)
        # append this level's data to the aggregated data list
        cooccurance_timeseries_aggregated_data.append(cooccurance_timeseries_df)
    # write to file
    ts_aggregated_df = pd.concat(
        cooccurance_timeseries_aggregated_data, ignore_index=True)
    ts_aggregated_df.to_csv(TS_AGGREGATED_COOCCURANCE_DF, index=False)
```
`scripts/scrape_award_papers.py`:

```python
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import sys

# input
IEEE_AUTHOR_DF = sys.argv[1]
# output
AWARD_PAPER_DF = sys.argv[2]

def get_paragraphs(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = bs(r.text, 'html.parser')
        article = soup.find('article')
        paragraphs = list(article.stripped_strings)
        return paragraphs

def rename(x):
    if 'Honorable Mention Awards' in x:
        return 'HM'
    if 'Best Paper Award' in x:
        return 'BP'
    if 'Test of Time Award' in x:
        return 'TT'
    if 'Best Case Study Award' in x:
        return 'BCS'
    raise ValueError("Unknown award:", x)

rearranger = lambda x: [x[-1], x[-3], x[-2], x[-4], x[1], x[0]]

def get_parsed_results(years, years_idx, paragraphs):
    results = []
    intervals = zip(years_idx, years_idx[1:] + [len(paragraphs)])
    # every loop includes a year's awards
    for idx, (y1, y2) in enumerate(intervals):
        year = years[idx]
        paper_info = []  # initialize a list to store a paper's info
        for i in range(y1 + 1, y2):
            p = paragraphs[i]
            if p.endswith(('Awards:', 'Award:')):
                award = p.replace(':', '')
                award = rename(award)
                continue
            if p.endswith("\nDOI:"):
                p = p.replace(".\nDOI:", "").replace("Awarded at: ", '')
            if p == "DOI:":
                p = 'Vis'
            # every paper info has four lines: author, title, awarded at, DOI
            paper_info.append(p)
            # all DOIs happen to have "/" not used anywhere else
            if '/' in p and paragraphs[i - 1].endswith("DOI:"):
                paper_info.extend([award, year])  # add award type and year
                results.append(paper_info)
                paper_info = []
    return list(map(rearranger, results))

def doi_debug(results):
    df = pd.read_csv(IEEE_AUTHOR_DF)
    dois = df['DOI'].unique().tolist()
    dois_lower = [d.lower() for d in dois]
    for idx, res in enumerate(results):
        if res[1] in dois:
            pass
        elif res[1].lower() in dois_lower:
            i = dois_lower.index(res[1].lower())
            print(res[1] + " has been unified as --> " + dois[i])
            results[idx][1] = dois[i]
        else:
            print(f"DOI: {res[1]} does not exist in {IEEE_AUTHOR_DF}!")
    return results

def get_2021_tt_papers():
    url = 'http://ieeevis.org/year/2021/info/awards/test-of-time-awards'
    paragraphs = get_paragraphs(url)
    tracks = ['VAST', 'InfoVis', 'SciVis']
    tracks_idx = [paragraphs.index(a) for a in tracks]
    years, years_idx = [], []
    for idx, p in enumerate(paragraphs):
        p = p.replace(":", "")
        if p.isdigit():
            years.append(int(p))
            years_idx.append(idx)

    def get_track(year_idx):
        for i in range(-1, -4, -1):
            if year_idx > tracks_idx[i]:
                return tracks[i]

    results = []
    award = 'TT'
    for idx, y_idx in enumerate(years_idx):
        year = years[idx]
        title = paragraphs[y_idx + 1]
        author = paragraphs[y_idx + 2]
        doi = paragraphs[y_idx + 4]
        track = get_track(y_idx)
        results.append([year, doi, award, track, title, author])
    return doi_debug(results)

def main():
    url = 'http://ieeevis.org/year/2022/info/history/best-paper-award'
    paragraphs = get_paragraphs(url)
    years = [y for y in range(2021, 1989, -1)]
    years_idx = [paragraphs.index(str(y)) for y in years]
    assert len(years) == len(years_idx)
    results = get_parsed_results(years, years_idx, paragraphs)
    results = doi_debug(results)
    results.extend(get_2021_tt_papers())
    columns = ['Year', 'DOI', 'Award', 'Track', 'Title', 'Author']
    df = pd.DataFrame(results, columns=columns)
    df.to_csv(AWARD_PAPER_DF, index=False)

if __name__ == '__main__':
    main()
```
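The `rearranger` lambda is easiest to read with a concrete record. Each parsed entry is collected as author, title, awarded-at track, and DOI, then extended with award and year; the lambda reorders that into the output column order. A sketch with a made-up record:

```python
rearranger = lambda x: [x[-1], x[-3], x[-2], x[-4], x[1], x[0]]

# [author, title, track, DOI, award, year] -- all values made up
record = ['Jane Doe', 'Some Paper', 'InfoVis', '10.1109/EXAMPLE.1', 'BP', 2020]
print(rearranger(record))
# [2020, '10.1109/EXAMPLE.1', 'BP', 'InfoVis', 'Some Paper', 'Jane Doe']
# i.e., [Year, DOI, Award, Track, Title, Author]
```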
The `shell:` directives from the corresponding Snakefile rules, one per rule:

```
shell: "python scripts/get_titles_2021.py {output}"
shell: "python scripts/get_vispd_plus.py {input} {output}"
shell: "python scripts/get_vispd_plus_good_papers.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_1.py {input} {output}"
shell: "python scripts/get_vispd_openalex_match_2.py {input} {output}"
shell: "python scripts/get_papers_to_study.py {input} {output}"
shell: "python scripts/get_openalex_dfs.py {input} {output}"
shell: "python scripts/get_openalex_citation_dfs.py {input} {output}"
shell: "python scripts/get_ieee_author_and_paper_title.py {input} {output}"
shell: "python scripts/get_merged_author_df.py {input} {output}"
shell: "python scripts/get_openalex_reference_dfs.py {input} {output}"
shell: "python scripts/scrape_award_papers.py {input} {output}"
shell: "python scripts/get_gscholar_data.py {input} {output}"
shell: "python scripts/get_wos_id.py {input} {output}"
shell: "python scripts/CLASS_country.py {input} {output}"
shell: "python scripts/CLASS_type.py {input} {output}"
shell: "python scripts/get_HT_cleaned_author_df.py {input} {output}"
shell: "python scripts/get_HT_cleaned_paper_df.py {input} {output}"
shell: "python scripts/plot_data_author_chord_diagram_data.py {input} {output}"
shell: "python scripts/plot_vis_concepts_cooccurance_data.py {input} {output}"
shell: "python scripts/plot_top_concepts_trends.py {input} {output}"
shell: "python scripts/plot_sankey_data.py {input} {output}"
```
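Only the `shell:` lines of these rules are shown above; in the Snakefile, each one sits inside a rule that also declares its `input` and `output` files, which is how Snakemake chains the scripts into a workflow. A minimal sketch of what one full rule might look like (the rule name and file path here are hypothetical; only the shell line is from the source):

```
rule get_titles_2021:
    output:
        "data/processed/titles_2021.csv"  # hypothetical path
    shell:
        "python scripts/get_titles_2021.py {output}"
```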
Support
- Future updates