Why this project?
It surprised me that we have over 20 years of ICA annual conference data, and yet no one has organized it in a way that gives every researcher easy access to it. Scraping all the data manually is a painful effort, and I do not expect every scholar to do that.
But why ICA conference data? Why do we need it? What is it good for? I have the following ideas:
- To inspire new research ideas. Right now, most communication literature comes from journal papers (searched mostly via Google Scholar). Findings from conferences may provide a new perspective and inspire new directions.
- To help circumvent publication bias. Publications might be biased (for example, https://doi.org/10.1093/hcr/hqz015), and not all research projects end up being published. It is therefore important to see the topics that are researched but not published (this idea is inspired by Yiwei Xu from Cornell). ICA annual conferences are a good starting point for communication science. Note that ICA annual conferences are peer reviewed and selective, so even though these papers are not published, their quality is still guaranteed. This is different from non-peer-reviewed preprints.
- For larger scientometric analyses. The ICA annual conference dataset we collected is large: it contains over 30K papers and 70K authors (a rough guess). This dataset is useful for large-scale scientometric analysis, for example, studying the topic evolution of communication studies over the past 20 years, or studying academic collaboration and mobility within the field of communication.
- To contribute to open science. We aim to make our dataset public so that other researchers have equal access to these data (from Yiwei).
- To better understand the diversity of communication scholars and research topics. Right now, we only have access to journal data, but that covers only part of communication scholars and communication research. To get a broader picture and a deeper understanding, we need data about the conferences as well.
Data sources
Plans
I am thinking of (1) designing an interactive paper exploration system, (2) cleaning the dataset and making it public, and (3) writing a paper based on preliminary results. I do not plan to do comprehensive analyses of the data; that is the job for other scholars if they want to use our dataset.
Introduction to this Repository
This repository now has three folders:

- `Data`: where all data is stored.
- `Notebooks`: exploratory coding. It is mainly useful for me and may not be useful for others.
- `Workflow`: where all the code is stored, mostly scrapers and data processing scripts.
Data
You do not need to pay any attention to the `deprecated` folders. Right now, all preliminary data is stored in the `interim` folder. The `processed` folder contains data that are ready to analyze and visualize. There are three files now:

- `paper_df.csv`: paper data
- `author_df.csv`: author data
- `session_df.csv`: session data
Paper Data
Paper data has the following columns:
- `Paper ID`: an ID I assigned to each paper, in the format of `year-index`
- `Title`: the title of this conference paper
- `Paper Type`: the type of this presentation, either `Paper` or `Poster`. Note that the ICA website did not distinguish these two types until 2014, so all presentations prior to 2014 are classified as `Paper`, even though some might have been `Poster` instead.
- `Abstract`: paper abstract
- `Number of Authors`: number of authors of this paper
- `Year`: the year when this paper was presented
- `Session`: the specific session title
- `Division/Unit`: the division (unit) that organized this session
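As a quick sanity check, the paper table can be explored directly with pandas. A minimal sketch, assuming the file lives under `data/processed/` as described above:

```python
import pandas as pd

# assumed path, per the folder description above
paper_df = pd.read_csv('data/processed/paper_df.csv')

# papers per year; 'Year' can also be recovered from 'Paper ID' (year-index)
print(paper_df['Year'].value_counts().sort_index())

# the Paper/Poster distinction is only meaningful from 2014 onward
recent = paper_df[paper_df['Year'].astype(int) >= 2014]
print(recent['Paper Type'].value_counts())
```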
Author Data
Author data has the following columns:
- `Paper ID`: an ID I assigned to each paper, in the format of `year-index`
- `Paper Title`: the title of this conference paper
- `Year`: the year when this paper was presented
- `Number of Authors`: number of authors of this paper
- `Author Position`: the position of this author in the author list
- `Author Name`: author name
- `Author Affiliation`: author affiliation
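Because the author table has one row per (paper, author) pair, per-paper statistics need a deduplication step first. A minimal sketch, again assuming the `data/processed/` location:

```python
import pandas as pd

author_df = pd.read_csv('data/processed/author_df.csv')  # assumed path

# team sizes: deduplicate to one row per paper before describing
team_sizes = author_df.drop_duplicates('Paper ID')['Number of Authors']
print(team_sizes.describe())

# most frequently listed affiliations (raw strings, not disambiguated)
print(author_df['Author Affiliation'].value_counts().head(10))
```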
Session Data
Session data has the following columns:
- `year`: the year when this session occurred
- `session type`: either `paper session` or `interactive paper session` (i.e., poster session)
- `session title`: the title of this session
- `sub unit`: the division/unit that organized this session
- `chair name`: the name of the session chair
- `chair aff`: the affiliation of the session chair
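Note that session metadata appears to cover only 2014 to 2018, since it is produced by the two 2014-onward scrapers. A minimal sketch of exploring it, assuming the same `data/processed/` location:

```python
import pandas as pd

session_df = pd.read_csv('data/processed/session_df.csv')  # assumed path

# sessions per year, split by session type
print(session_df.groupby(['year', 'session type']).size().unstack(fill_value=0))

# which divisions/units organized the most sessions
print(session_df['sub unit'].value_counts().head(10))
```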
Workflow
I used `snakemake` to manage the workflow. Details are in the `Snakefile`.
Scripts
- `scrape_2003_2004.py` scrapes all data from 2003 to 2004
- `scrape_2005_2013.py` scrapes all data from 2005 to 2013
- `scrape_2014_onward_paper_session.py` scrapes data from 2014 to 2018, for the paper sessions
- `scrape_2014_onward_interactive_paper.py` scrapes data for posters (extended abstracts) from 2014 to 2018
- `combine_all_data.py` cleans, organizes, and concatenates all of the above
Code Snippets
`combine_all_data.py`:

```python
import pandas as pd
import numpy as np
import sys

PAPER_2003_2004 = sys.argv[1]
PAPER_2005_2013 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]
INTERACTIVE_PAPER_2014_2018 = sys.argv[4]
AUTHOR_2003_2004 = sys.argv[5]
AUTHOR_2005_2013 = sys.argv[6]
AUTHOR_2014_2018 = sys.argv[7]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[8]
SESSION_2014_2018 = sys.argv[9]
INTERACTIVE_SESSION_2014_2018 = sys.argv[10]
PAPER_DF = sys.argv[11]
AUTHOR_DF = sys.argv[12]
SESSION_DF = sys.argv[13]

if __name__ == '__main__':
    # import all data
    paper1 = pd.read_csv(PAPER_2003_2004)
    paper2 = pd.read_csv(PAPER_2005_2013)
    paper3 = pd.read_csv(PAPER_2014_2018)
    paper4 = pd.read_csv(INTERACTIVE_PAPER_2014_2018)
    author1 = pd.read_csv(AUTHOR_2003_2004)
    author2 = pd.read_csv(AUTHOR_2005_2013)
    author3 = pd.read_csv(AUTHOR_2014_2018)
    author4 = pd.read_csv(INTERACTIVE_AUTHOR_2014_2018)
    session1 = pd.read_csv(SESSION_2014_2018)
    session2 = pd.read_csv(INTERACTIVE_SESSION_2014_2018)

    # add 'Year' to paper1, paper2, author1, and author2
    paper2['Year'] = [i.split('-')[0] for i in paper2['Paper ID']]
    paper1['Year'] = [i.split('-')[0] for i in paper1['Paper ID']]
    author1['Year'] = [i.split('-')[0] for i in author1['Paper ID']]
    author2['Year'] = [i.split('-')[0] for i in author2['Paper ID']]

    # rename author3 and author4 columns
    author3.columns = [
        'Paper ID', 'Paper Title', 'Year', 'Number of Authors',
        'Author Position', 'Author Name', 'Author Affiliation'
    ]
    author4.columns = [
        'Paper ID', 'Paper Title', 'Year', 'Number of Authors',
        'Author Position', 'Author Name', 'Author Affiliation'
    ]

    # author_df
    author_df = pd.concat([author1, author2, author3, author4], axis=0)
    print(f'Author DF is done. Its shape: {author_df.shape}')

    # create a paper id -> author num dict
    id_num_author_dict = dict(zip(author_df['Paper ID'],
                                  author_df['Number of Authors']))

    # there are four missing paper ids in author2
    paper2_id = paper2['Paper ID'].tolist()
    author2_id = list(set(author2['Paper ID']))
    print(f'Number of paper ids in paper2: {len(paper2_id)}')
    print(f'Number of paper ids in author2: {len(author2_id)}')
    missing_paper_id = [x for x in paper2_id if x not in author2_id]
    print(missing_paper_id)

    # update dict
    for x in missing_paper_id:
        id_num_author_dict[x] = np.nan

    # add number of authors to paper1 and paper2
    paper1['Number of Authors'] = [id_num_author_dict[pid] for pid in paper1['Paper ID']]
    paper2['Number of Authors'] = [id_num_author_dict[pid] for pid in paper2['Paper ID']]

    # select cols; 'Sumission Type' (sic) matches the misspelled key
    # written by scrape_2005_2013.py
    paper1 = paper1[['Paper ID', 'Title', 'Type', 'Abstract',
                     'Number of Authors', 'Year']]
    paper2 = paper2[[
        'Paper ID', 'Title', 'Sumission Type', 'Abstract',
        'Number of Authors', 'Year', 'Session', 'Division/Unit'
    ]]

    # update colnames
    paper1.columns = ['Paper ID', 'Title', 'Paper Type', 'Abstract',
                      'Number of Authors', 'Year']
    paper2.columns = ['Paper ID', 'Title', 'Paper Type', 'Abstract',
                      'Number of Authors', 'Year', 'Session', 'Division/Unit']
    paper3.columns = ['Paper ID', 'Year', 'Paper Type', 'Title',
                      'Number of Authors', 'Abstract', 'Session', 'Division/Unit']
    paper4.columns = ['Paper ID', 'Year', 'Paper Type', 'Title',
                      'Number of Authors', 'Abstract', 'Session', 'Division/Unit']

    # concatenate paper df
    paper_df = pd.concat([paper1, paper2, paper3, paper4], axis=0)

    # concatenate session df
    session_df = pd.concat([session1, session2], axis=0)

    # write to file
    paper_df.to_csv(PAPER_DF, index=False)
    author_df.to_csv(AUTHOR_DF, index=False)
    session_df.to_csv(SESSION_DF, index=False)
    print('Files written. All should be in place now.')
```
`scrape_2003_2004.py`:

```python
import pandas as pd
import numpy as np
import time
import re
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

PAPER_03_04 = sys.argv[1]
AUTHOR_03_04 = sys.argv[2]


def click_on_search_papers():
    search_papers = wait.until(EC.element_to_be_clickable((
        By.CSS_SELECTOR,
        "div.menu_item__icon_text_window__text > a.mainmenu_text"
    )))
    search_papers.click()


def get_papers():
    """get all paper elements on the current page"""
    papers = driver.find_elements(
        By.CSS_SELECTOR,
        'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
    )
    return papers


def get_paper_meta(paper, year, paper_meta_dict_list):
    """get paper index, paper title, and paper type

    the author names can be found here, but I'll collect them later
    in the view page
    """
    idx = paper.find_element(By.CSS_SELECTOR, 'td[title="##"]').text  # e.g., '0001'
    paper_id = year + '-' + idx.zfill(4)
    # summary elements:
    summary = paper.find_element(By.CSS_SELECTOR, 'td[title="Summary"]')
    title = summary.find_element(By.CSS_SELECTOR, 'a.search_headingtext').text
    # note: lstrip() strips a character *set*, not a prefix;
    # the 2005-2013 script uses removeprefix() for this reason
    submission_type = summary.find_element(
        By.CSS_SELECTOR, 'td[style="padding: 5px;"]'
    ).text.lstrip(' Individual Submission type: ')
    paper_meta_dict = {
        'Paper ID': paper_id,
        'Title': title,
        'Type': submission_type
    }
    # update the dict list
    paper_meta_dict_list.append(paper_meta_dict)
    return paper_meta_dict


def open_view(paper):
    """Input: paper element

    Aim: open a new window and click 'view'
    """
    action = paper.find_element(By.CSS_SELECTOR, 'td[title="Action"]')
    view_link_e = action.find_element(
        By.CSS_SELECTOR, "li.action_list > a.fieldtext"
    )
    view_link = view_link_e.get_attribute('href')
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(view_link)


def get_title_to_check(paper_meta_dict_list):
    # there are two 'tr.header font.headingtext'; title is the second one
    headingtexts = driver.find_elements(
        By.CSS_SELECTOR, 'tr.header font.headingtext'
    )
    title_to_check = headingtexts[1].text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Title to Check'] = title_to_check
    return title_to_check


def get_authors(paper_meta_dict, author_dict_list):
    paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
    # note that this returns a list since there might be multiple authors
    authors = driver.find_elements(By.CSS_SELECTOR, 'a.search_fieldtext_name')
    for author in authors:
        author_idx = authors.index(author) + 1
        authorNum = len(authors)
        author_elements = author.text.split(' (')
        author_name = author_elements[0]
        # doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
        # some don't contain '()', i.e., affiliation info
        try:
            author_aff = author_elements[1].rstrip(')')
        except:
            author_aff = np.nan
        author_dict = {
            'Paper ID': paper_id,
            'Paper Title': title,
            'Number of Authors': authorNum,
            'Author Position': author_idx,
            'Author Name': author_name,
            'Author Affiliation': author_aff,
        }
        author_dict_list.append(author_dict)


def get_abstract(paper_meta_dict_list):
    # obtain the abstract in the newly opened page
    abstract = driver.find_element(
        By.CSS_SELECTOR, 'blockquote.tight > font.fieldtext'
    ).text
    paper_meta_dict_list[-1]['Abstract'] = abstract
    return abstract


def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
    papers = get_papers()
    for paper in papers:
        # to test: for paper in papers[0:1]:
        paper_idx = papers.index(paper) + 1
        paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
        open_view(paper)
        get_title_to_check(paper_meta_dict_list)
        get_authors(paper_meta_dict, author_dict_list)
        get_abstract(paper_meta_dict_list)
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        print(f'Page {page_num} Paper {paper_idx} is done')
        time.sleep(0.5)


def get_iterators():
    iterators = driver.find_elements(
        By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
    )
    return iterators


if __name__ == '__main__':
    # initiate lists to contain data
    paper_meta_dict_list = []
    author_dict_list = []
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2003~2004
    years = range(3, 5)
    for year in years:
        year = str(year).zfill(2)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2003/2004
        year = f'20{year}'
        print(f'{year} has started!')
        click_on_search_papers()
        # to calculate total pages
        iterators = get_iterators()
        total_pages = int(iterators[-2].text)
        for i in range(1, total_pages + 1):
            page_num = i
            if i >= 10:
                if year == '2004':
                    print('2004!')
                    select = Select(driver.find_element(
                        By.XPATH, '//div[@class="iterator"][1] // select'
                    ))
                    select.select_by_visible_text('+ 20')
                else:
                    # if '2003', click on '20'
                    iterators = get_iterators()
                    iterators[-2].click()
            iterators = get_iterators()
            for j in iterators:
                if (j.text == str(i)):
                    j.click()
                    break
            scrape_one_page(
                year, page_num, paper_meta_dict_list, author_dict_list
            )
            print(f'page {i} is done')
        # go back to the first page
        iterators = get_iterators()
        iterators[1].click()
    print('Everything done!')
    driver.close()
    driver.quit()
    print('Writing to file now...')
    pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_03_04, index=False)
    pd.DataFrame(author_dict_list).to_csv(AUTHOR_03_04, index=False)
```
`scrape_2005_2013.py`:

```python
import pandas as pd
import numpy as np
import time
import math
import re
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

PAPER_2005_2013 = sys.argv[1]
AUTHOR_2005_2013 = sys.argv[2]


def click_on_view_program():
    all_btn = driver.find_elements(
        By.CSS_SELECTOR,
        "div.menu_item__icon_text_window__text > a.mainmenu_text"
    )
    for btn in all_btn:
        if 'Program' in btn.text:
            view_program_btn = btn
            break
    view_program_btn.click()


def click_on_individual_presentations():
    '''click on 'Individual Presentations' '''
    presentations = wait.until(EC.element_to_be_clickable((
        By.XPATH, '//td[@class="tab_topped_window__tab_cell"][2]'
    )))
    presentations.click()


def get_papers():
    """get all paper elements on the current page"""
    papers = driver.find_elements(
        By.CSS_SELECTOR,
        'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
    )
    return papers


def removeprefix(text, prefix):
    # https://stackoverflow.com/a/16891418
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


def get_paper_meta(paper, year, paper_meta_dict_list):
    """get paper index, paper title, and paper type

    the author names can be found here, but I'll collect them later
    in the view page
    """
    idx = paper.find_element(By.CSS_SELECTOR, 'td[title="##"]').text  # e.g., '0001'
    paper_id = year + '-' + idx.zfill(4)
    # summary elements:
    summary = paper.find_element(By.CSS_SELECTOR, 'td[title="Summary"]')
    title = summary.find_element(By.CSS_SELECTOR, 'a.search_headingtext').text
    summary_info = summary.find_elements(
        By.CSS_SELECTOR, 'td[style="padding: 5px;"] tr'
    )
    session = np.nan
    division = np.nan
    submission_type = np.nan
    research_areas = np.nan
    for i in summary_info:
        if 'In Session Submission' in i.text:
            session = removeprefix(i.text, ' In Session Submission: ')
        elif 'Session Submission Division' in i.text:
            division = removeprefix(i.text, ' Session Submission Division: ')
        elif 'Session Submission Unit' in i.text:
            division = removeprefix(i.text, ' Session Submission Unit: ')
        elif 'Submission type' in i.text:
            submission_type = removeprefix(i.text, ' Individual Submission type: ')
        elif 'Research Areas:' in i.text:
            research_areas = removeprefix(i.text, ' Research Areas: ')
    # 'Sumission Type' (sic) is the key combine_all_data.py expects
    paper_meta_dict = {
        'Paper ID': paper_id,
        'Title': title,
        'Session': session,
        'Division/Unit': division,
        'Sumission Type': submission_type,
        'Research Areas': research_areas,
    }
    # update the dict list
    paper_meta_dict_list.append(paper_meta_dict)
    return paper_meta_dict


def open_view(paper):
    """Input: paper element

    Aim: open a new window and click 'view'
    """
    action = paper.find_element(By.CSS_SELECTOR, 'td[title="Action"]')
    view_link_e = action.find_element(
        By.CSS_SELECTOR, "li.action_list > a.fieldtext"
    )
    view_link = view_link_e.get_attribute('href')
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(view_link)


def get_title_to_check(paper_meta_dict_list):
    # there are two 'tr.header font.headingtext'; title is the second one
    headingtexts = driver.find_elements(
        By.CSS_SELECTOR, 'tr.header font.headingtext'
    )
    title_to_check = headingtexts[1].text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Title to Check'] = title_to_check
    return title_to_check


def get_session_to_check(paper_meta_dict_list):
    session_to_check = driver.find_element(
        By.CSS_SELECTOR, 'blockquote.tight > a.search_headingtext'
    )
    session_to_check = session_to_check.text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Session to Check'] = session_to_check
    return session_to_check


def get_authors(paper_meta_dict, author_dict_list):
    paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
    # note that this returns a list since there might be multiple authors
    authors = driver.find_elements(By.CSS_SELECTOR, 'a.search_fieldtext_name')
    for author in authors:
        author_idx = authors.index(author) + 1
        authorNum = len(authors)
        author_elements = author.text.split(' (')
        author_name = author_elements[0]
        # doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
        # some don't contain '()', i.e., affiliation info
        try:
            author_aff = author_elements[1].rstrip(')')
        except:
            author_aff = np.nan
        author_dict = {
            'Paper ID': paper_id,
            'Paper Title': title,
            'Number of Authors': authorNum,
            'Author Position': author_idx,
            'Author Name': author_name,
            'Author Affiliation': author_aff,
        }
        author_dict_list.append(author_dict)


def get_abstract(paper_meta_dict_list):
    # abstract
    abstract = driver.find_elements(By.CSS_SELECTOR, 'blockquote.tight')[-1]
    abstract = abstract.text
    abstract = " ".join(abstract.splitlines()).strip()
    paper_meta_dict_list[-1]['Abstract'] = abstract
    return abstract


def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
    papers = get_papers()
    for paper in papers:
        # to test: for paper in papers[0:1]:
        paper_idx = papers.index(paper) + 1
        paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
        open_view(paper)
        get_title_to_check(paper_meta_dict_list)
        get_session_to_check(paper_meta_dict_list)
        get_authors(paper_meta_dict, author_dict_list)
        get_abstract(paper_meta_dict_list)
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        print(f'Year {year}, Page {page_num} Paper {paper_idx} is done')
        time.sleep(0.05)


def get_iterators():
    iterators = driver.find_elements(
        By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
    )
    return iterators


if __name__ == '__main__':
    # initiate lists to contain data
    paper_meta_dict_list = []
    author_dict_list = []
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2005~2013
    years = range(5, 14)
    for year in years:
        year = str(year).zfill(2)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2005...2013
        year = f'20{year}'
        print(f'{year} has started!')
        click_on_view_program()
        click_on_individual_presentations()
        # to calculate total pages
        iterators = get_iterators()
        total_pages = int(iterators[-2].text)
        for i in range(1, total_pages + 1):
            print(f'page {i} has started')
            page_num = i
            if i < 10:
                pass
            elif i >= 10 and i < 17:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 10')
            elif i >= 17 and i < 27:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 20')
            elif i >= 27 and i < 37:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 30')
            else:
                iterators = get_iterators()
                iterators[-2].click()
            # this achieves something I never thought about:
            # when i == 21, after selecting '+ 20', the current iterator is 21;
            # get_iterators() skips the current iterator, so no j equals 21,
            # the loop below is a no-op, and the program goes directly
            # to scrape_one_page()
            iterators = get_iterators()
            for j in iterators:
                if (j.text == str(i)):
                    current_idx = int(j.text)
                    j.click()
                    break
            scrape_one_page(
                year, page_num, paper_meta_dict_list, author_dict_list
            )
        # go back to the first page
        iterators = get_iterators()
        iterators[1].click()
    print('Everything done!')
    driver.close()
    driver.quit()
    print('Writing to file now...')
    pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_2005_2013, index=False)
    pd.DataFrame(author_dict_list).to_csv(AUTHOR_2005_2013, index=False)
```
`scrape_2014_onward_interactive_paper.py`:

```python
import pandas as pd
import numpy as np
import time
import re
import sys
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

INTERACTIVE_SESSION_2014_2018 = sys.argv[1]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[2]
INTERACTIVE_PAPER_2014_2018 = sys.argv[3]


def click_browse_by_session_type():
    '''click on "browse by session type"'''
    browse_by_session_type = driver.find_elements(
        By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
    )[3]
    browse_by_session_type.click()


def click_interactive_paper_session():
    '''click the "interactive paper session" button'''
    paper_session = driver.find_element(
        By.XPATH,
        '//li[@class="ui-li-has-count ui-first-child"] //a[@class="ui-btn"]'
    )
    paper_session.click()


def get_sessions():
    '''these are session links'''
    sessions = driver.find_elements(
        By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
    )
    return sessions


def update_session_meta(year, session_tuples):
    '''update session metadata: session title, session sub unit,
    session chair name and affiliation
    '''
    session_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    session_title = session_title_e.text
    # sub unit, cosponsor, chair, the presentations
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    h4s_texts = [i.text for i in h4s]
    sub_unit_e_idx = h4s_texts.index('Sub Unit')
    # sub unit and chair are very tricky. Some examples:
    # year 2015, session "Environmental Journalism: Coverage, Reader
    # Response, and Mediators": 'Chair' is below 'Cosponsor'.
    # year 2015, session "B.E.S.T.: Organizations, Communication, and
    # Technology": a little strange because it has an abstract, yet it
    # does not have the gray area.
    # My conclusion: the gray box for sub unit seems to always be the
    # first one, so I can use index 4. For chair, I take its index
    # and add 5.
    try:
        sub_unit_e = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
        )[4]
        sub_unit = sub_unit_e.text
    except:
        sub_unit = None
    # if there is no 'Chair' (for example, session 200 of 2016),
    # there is no need to proceed further.
    if 'Chair' not in h4s_texts:
        chair_name = None
        chair_aff = None
    else:
        # initialize so an empty chair gray box cannot leave these unset
        chair_name = None
        chair_aff = None
        try:
            if 'Cosponsor' in h4s_texts:
                chair_e_idx = 6
            else:
                chair_e_idx = 5
            # chair_e_idx = h4s_texts.index('Chair')
            chair_graybox = driver.find_elements(
                By.CSS_SELECTOR,
                'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
            )[chair_e_idx]
            chair_es = chair_graybox.find_elements(By.CSS_SELECTOR, 'li')
            if chair_es:
                if len(chair_es) == 1:
                    chair_info = chair_es[0].text
                    chair_name = chair_info.split(', ')[0]
                    chair_aff = chair_info.split(', ')[1]
                else:
                    # this solves the issue of multiple chairs, for example,
                    # year 2018, session 'Research Escalator - Part 1'
                    chair_name = ''
                    chair_aff = ''
                    for chair_e in chair_es:
                        chair_info = chair_e.text
                        chair_name_i = chair_info.split(', ')[0]
                        chair_aff_i = chair_info.split(', ')[1]
                        chair_name += chair_name_i
                        chair_aff += chair_aff_i
                        if chair_e != chair_es[-1]:
                            chair_name += '; '
                            chair_aff += '; '
        except:
            chair_name = None
            chair_aff = None
    session_tuples.append((
        year,
        'Interactive Paper Session',
        session_title,
        sub_unit,
        chair_name,
        chair_aff,
    ))
    # return session title and sub_unit so that I can use them later
    return session_title, sub_unit


def get_author_num():
    """get the author elements and the author number, which I use later
    in get_paper_info and get_author_info
    """
    authors_e = driver.find_elements(
        By.CSS_SELECTOR,
        'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
    )[2:]
    author_num = len(authors_e)
    return authors_e, author_num


def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
    '''get author info and update author tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    for author in authors_e:
        author_position = authors_e.index(author) + 1
        # split on the first ', ' only, to handle 'person, aff, dept'
        try:
            author_name, author_aff = author.text.split(', ', 1)
        except:
            # for example: 2016, 'Gaining Access to Social Capital',
            # Louis Leung has no affiliation
            author_name = author.text
            author_aff = None
        author_tuples.append((
            paper_id,
            paper_title,
            year,
            author_num,
            author_position,
            author_name,
            author_aff
        ))


def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
    '''get paper info and update paper tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    paper_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    paper_title = paper_title_e.text
    abstract = driver.find_element(By.CSS_SELECTOR, 'blockquote > p').text
    paper_tuples.append((
        paper_id,
        year,
        'Poster',
        paper_title,
        author_num,
        abstract,
        session_title,
        sub_unit,
    ))
    # return paper title so I can use it in get_author_info
    return paper_title


def get_papers():
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
        # I do not know why, but the first two selections are not paper
        # elements; I need to remove them.
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
        )[2:]
        # this is to prevent something like the session of "Good Grief!
        # Disasters, Crises, and High-Risk Organizational Environments"
        return papers
    elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3) a.ui-icon-carat-r'
        )[2:]
        return papers
    else:
        # Why does this happen? Go to year 2016, session 262 and you'll
        # see that there are no papers; session 103 of year 2014 also
        # has no papers.
        print('THERE PROBABLY ARE NO PAPERS HERE')
        # to_scrape_later_tuples.append((year, session_index))


if __name__ == '__main__':
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2014-2018
    # years = range(14, 19)
    years = [14, 15, 16, 17, 18]
    session_tuples = []
    author_tuples = []
    paper_tuples = []
    for year in years:
        year = str(year)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2014...2018
        year = f'20{year}'
        print(f'{year} has started!')
        click_browse_by_session_type()
        click_interactive_paper_session()
        sessions = get_sessions()
        print(f'There are {len(sessions)} sessions.')
        # randomly choose 5 sessions for testing
        random_sessions = random.sample(sessions, 5)
        # to assign paper ids: initiate at 0 and add 1 each time
        paper_id = 0
        for s in sessions:
            # for s in random_sessions:
            session_index = sessions.index(s)
            s_link = s.get_attribute('href')
            # open a new window and switch to it
            driver.execute_script("window.open('');")
            driver.switch_to.window(driver.window_handles[1])
            # open the session
            driver.get(s_link)
            session_title, sub_unit = update_session_meta(year, session_tuples)
            if 'preconference:' not in session_title.lower():
                print(f'Session {session_index} has started')
                papers = get_papers()
                # sometimes `papers` is None, for example, year 2016, session
                # "Communication and Technology, Game Studies, and
                # Information Systems Joint Reception"
                if papers:
                    print(f'There are {len(papers)} papers.')
                    for p in papers:
                        # 2016, session 85 has troubles
                        try:
                            p_link = p.get_attribute('href')
                            driver.execute_script("window.open('');")
                            driver.switch_to.window(driver.window_handles[2])
                            driver.get(p_link)
                            authors_e, author_num = get_author_num()
                            paper_title = get_paper_info(
                                paper_tuples, author_num, session_title,
                                sub_unit, year, paper_id
                            )
                            get_author_info(
                                authors_e, author_num, author_tuples,
                                paper_title, paper_id, year
                            )
                        except:
                            print('This paper is unavailable.')
                        paper_id += 1
                        print(f'Paper {papers.index(p) + 1} is done.')
                        time.sleep(0.5 + random.uniform(0, 0.5))
                        # close window 2 and switch back to window 1
                        driver.close()
                        driver.switch_to.window(driver.window_handles[1])
                print(f'Session {session_index} is done.')
                time.sleep(0.5 + random.uniform(0, 0.5))
            else:
                print(f'Session {session_index} is Preconference.')
            # close window 1 and switch back to window 0
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
    print('Everything done!')
    driver.close()
    driver.quit()
    pd.DataFrame(session_tuples, columns=[
        'year', 'session type', 'session title', 'sub unit',
        'chair name', 'chair aff',
    ]).to_csv(INTERACTIVE_SESSION_2014_2018, index=False)
    pd.DataFrame(author_tuples, columns=[
        'paper id', 'paper title', 'year', 'author number',
        'author position', 'author name', 'author aff'
    ]).to_csv(INTERACTIVE_AUTHOR_2014_2018, index=False)
    pd.DataFrame(paper_tuples, columns=[
        'paper id', 'year', 'paper type', 'paper title', 'author number',
        'abstract', 'session title', 'sub unit'
    ]).to_csv(INTERACTIVE_PAPER_2014_2018, index=False)
```
`scrape_2014_onward_paper_session.py` (this mirrors the interactive paper scraper above, but targets regular paper sessions):

```python
import pandas as pd
import numpy as np
import time
import re
import sys
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

SESSION_2014_2018 = sys.argv[1]
AUTHOR_2014_2018 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]


def click_browse_by_session_type():
    '''click on "browse by session type"'''
    browse_by_session_type = driver.find_elements(
        By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
    )[3]
    browse_by_session_type.click()


def click_paper_session():
    '''click the "paper session" button'''
    paper_session = driver.find_element(
        By.XPATH,
        '//li[@class="ui-li-has-count"][3] //a[@class="ui-btn"]'
    )
    paper_session.click()


def get_sessions():
    '''these are session links'''
    sessions = driver.find_elements(
        By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
    )
    return sessions


def update_session_meta(year, session_tuples):
    '''update session metadata: session title, session sub unit,
    session chair name and affiliation
    '''
    session_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    session_title = session_title_e.text
    # sub unit, cosponsor, chair, the presentations
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    h4s_texts = [i.text for i in h4s]
    sub_unit_e_idx = h4s_texts.index('Sub Unit')
    # sub unit and chair are very tricky. Some examples:
    # year 2015, session "Environmental Journalism: Coverage, Reader
    # Response, and Mediators": 'Chair' is below 'Cosponsor'.
    # year 2015, session "B.E.S.T.: Organizations, Communication, and
    # Technology": a little strange because it has an abstract, yet it
    # does not have the gray area.
    # My conclusion: the gray box for sub unit seems to always be the
    # first one, so I can use index 4. For chair, I take its index
    # and add 5.
    try:
        sub_unit_e = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
        )[4]
        sub_unit = sub_unit_e.text
    except:
        sub_unit = None
    # if there is no 'Chair' (for example, session 200 of 2016),
    # there is no need to proceed further.
    if 'Chair' not in h4s_texts:
        chair_name = None
        chair_aff = None
    else:
        # initialize so an empty chair gray box cannot leave these unset
        chair_name = None
        chair_aff = None
        try:
            if 'Cosponsor' in h4s_texts:
                chair_e_idx = 6
            else:
                chair_e_idx = 5
            # chair_e_idx = h4s_texts.index('Chair')
            chair_graybox = driver.find_elements(
                By.CSS_SELECTOR,
                'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
            )[chair_e_idx]
            chair_es = chair_graybox.find_elements(By.CSS_SELECTOR, 'li')
            if chair_es:
                if len(chair_es) == 1:
                    chair_info = chair_es[0].text
                    chair_name = chair_info.split(', ')[0]
                    chair_aff = chair_info.split(', ')[1]
                else:
                    # this solves the issue of multiple chairs, for example,
                    # year 2018, session 'Research Escalator - Part 1'
                    chair_name = ''
                    chair_aff = ''
                    for chair_e in chair_es:
                        chair_info = chair_e.text
                        chair_name_i = chair_info.split(', ')[0]
                        chair_aff_i = chair_info.split(', ')[1]
                        chair_name += chair_name_i
                        chair_aff += chair_aff_i
                        if chair_e != chair_es[-1]:
                            chair_name += '; '
                            chair_aff += '; '
        except:
            chair_name = None
            chair_aff = None
    session_tuples.append((
        year,
        'Paper Session',
        session_title,
        sub_unit,
        chair_name,
        chair_aff,
    ))
    # return session title and sub_unit so that I can use them later
    return session_title, sub_unit


def get_author_num():
    """get the author elements and the author number, which I use later
    in get_paper_info and get_author_info
    """
    authors_e = driver.find_elements(
        By.CSS_SELECTOR,
        'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
    )[2:]
    author_num = len(authors_e)
    return authors_e, author_num


def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
    '''get author info and update author tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    for author in authors_e:
        author_position = authors_e.index(author) + 1
        # split on the first ', ' only, to handle 'person, aff, dept'
        try:
            author_name, author_aff = author.text.split(', ', 1)
        except:
            # for example: 2016, 'Gaining Access to Social Capital',
            # Louis Leung has no affiliation
            author_name = author.text
            author_aff = None
        author_tuples.append((
            paper_id,
            paper_title,
            year,
            author_num,
            author_position,
            author_name,
            author_aff
        ))


def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
    '''get paper info and update paper tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    paper_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    paper_title = paper_title_e.text
    abstract = driver.find_element(By.CSS_SELECTOR, 'blockquote > p').text
    # abstract = " ".join(abstract.splitlines()).strip()
    paper_tuples.append((
        paper_id,
        year,
        'Paper Session',
        paper_title,
        author_num,
        abstract,
        session_title,
        sub_unit,
    ))
    # return paper title so I can use it in get_author_info
    return paper_title


def get_papers():
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
        # I do not know why, but the first two selections are not paper
        # elements; I need to remove them.
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
        )[2:]
        # this is to prevent something like the session of "Good Grief!
        # Disasters, Crises, and High-Risk Organizational Environments"
        return papers
    elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3) a.ui-icon-carat-r'
        )[2:]
        return papers
    else:
        # Why does this happen? Go to year 2016, session 262 and you'll
        # see that there are no papers; session 103 of year 2014 also
        # has no papers.
        print('THERE PROBABLY ARE NO PAPERS HERE')
        # to_scrape_later_tuples.append((year, session_index))


if __name__ == '__main__':
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2014-2018
    # years = range(14, 19)
    years = [14, 15, 16, 17, 18]
    # there are always exceptions, for example, 2016 session 262
    session_tuples = []
    author_tuples = []
    paper_tuples = []
    for year in years:
        year = str(year)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2014...2018
        year = f'20{year}'
        print(f'{year} has started!')
        click_browse_by_session_type()
        click_paper_session()
        sessions = get_sessions()
        print(f'There are {len(sessions)} sessions.')
        # randomly choose 5 sessions for testing
        random_sessions = random.sample(sessions, 5)
        # to assign paper ids: initiate at 0 and add 1 each time
        paper_id = 0
        for s in sessions:
            # for s in random_sessions:
            session_index = sessions.index(s)
            s_link = s.get_attribute('href')
            # open a new window and switch to it
            driver.execute_script("window.open('');")
            driver.switch_to.window(driver.window_handles[1])
            # open the session
            driver.get(s_link)
            session_title, sub_unit = update_session_meta(year, session_tuples)
            if 'preconference:' not in session_title.lower():
                print(f'Session {session_index} has started')
                papers = get_papers()
                # sometimes `papers` is None, for example, year 2016, session
                # "Communication and Technology, Game Studies, and
                # Information Systems Joint Reception"
                if papers:
                    print(f'There are {len(papers)} papers.')
                    for p in papers:
                        # 2016, session 85 has troubles
                        try:
                            p_link = p.get_attribute('href')
                            driver.execute_script("window.open('');")
                            driver.switch_to.window(driver.window_handles[2])
                            driver.get(p_link)
                            authors_e, author_num = get_author_num()
                            paper_title = get_paper_info(
                                paper_tuples, author_num, session_title,
                                sub_unit, year, paper_id
                            )
                            get_author_info(
                                authors_e, author_num, author_tuples,
                                paper_title, paper_id, year
                            )
                        except:
                            print('This paper is unavailable.')
                        paper_id += 1
                        print(f'Paper {papers.index(p) + 1} is done.')
                        time.sleep(0.5 + random.uniform(0, 0.5))
                        # close window 2 and switch back to window 1
                        driver.close()
                        driver.switch_to.window(driver.window_handles[1])
                print(f'Session {session_index} is done.')
                time.sleep(0.5 + random.uniform(0, 0.5))
            else:
                print(f'Session {session_index} is Preconference.')
            # close window 1 and switch back to window 0
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
    print('Everything done!')
    driver.close()
    driver.quit()
    pd.DataFrame(session_tuples, columns=[
        'year', 'session type', 'session title', 'sub unit',
        'chair name', 'chair aff',
    ]).to_csv(SESSION_2014_2018, index=False)
    pd.DataFrame(author_tuples, columns=[
        'paper id', 'paper title', 'year', 'author number',
        'author position', 'author name', 'author aff'
    ]).to_csv(AUTHOR_2014_2018, index=False)
    pd.DataFrame(paper_tuples, columns=[
        'paper id', 'year', 'paper type', 'paper title', 'author number',
        'abstract', 'session title', 'sub unit'
    ]).to_csv(PAPER_2014_2018, index=False)
```
The `shell` directives of the individual Snakefile rules:

```
shell: "python scripts/scrape_2003_2004.py {output}"
shell: "python scripts/scrape_2005_2013.py {output}"
shell: "python scripts/scrape_2014_onward_paper.py {output}"
shell: "python scripts/scrape_2014_onward_interactive_paper.py {output}"
shell: "python scripts/combine_all_data.py {input} {output}"
```
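Each of these `shell:` directives sits inside a rule that declares the files it produces, which is how snakemake wires the scrapers to `combine_all_data.py`. A minimal sketch of one such rule; the rule name and output paths are assumptions, not copied from the actual `Snakefile`:

```
# Snakefile (sketch; output paths are assumed)
rule scrape_2003_2004:
    output:
        "data/interim/paper_2003_2004.csv",
        "data/interim/author_2003_2004.csv"
    shell:
        "python scripts/scrape_2003_2004.py {output}"
```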