ICA Conference Data Explorer


Why this project?

It surprised me that we have over 20 years of data from ICA annual conferences, yet no one has organized it in a way that gives every researcher easy access to it. Scraping all the data manually is a painful effort, and I do not expect every scholar to do that.

But why ICA conference data? Why do we need it? What is it good for? I have the following ideas:

  1. To inspire new research ideas. Right now, most communication literature comes from journal papers (mostly found via Google Scholar). Findings from conferences may provide new perspectives and inspire new directions.

  2. Publications might have biases (for example, https://doi.org/10.1093/hcr/hqz015). Not all research projects end up being published. To circumvent publication bias, it is important to see the topics that are researched but not published (this idea is inspired by Yiwei Xu from Cornell). ICA annual conferences are a good starting point for communication science. Note that ICA annual conferences are peer reviewed and selective; therefore, even though these papers are not published, their quality is still guaranteed. This is different from non-peer-reviewed preprints.

  3. For larger scientometric analysis. The ICA annual conference dataset we collected is large: it contains over 30K papers and 70K authors (a rough estimate). It is useful for large-scale scientometric analysis, for example, to study the topic evolution of communication studies over the past 20 years, or to study academic collaboration and mobility within the field of communication.

  4. Contribute to open science. We aim to make our dataset public so that other researchers have equal access to these data (from Yiwei).

  5. With these data, we can better understand the diversity of communication scholars and research topics. Right now, we only have access to journal data, but that covers only part of communication scholars and communication research. To get a broader picture and a deeper understanding, we need conference data as well.

Data sources


Plans

I am thinking of (1) designing an interactive paper exploration system, (2) cleaning the dataset and making it public, and (3) writing a paper based on preliminary results. I do not plan to do comprehensive analyses based on the data; that is the job for other scholars if they want to use our dataset.

Introduction to this Repository

This repository now has three folders:

  • Data: where all data is stored.

  • Notebooks: this is for exploratory coding. It is mainly useful for me, but may not be useful for others.

  • Workflow: this is where all the code is stored, mostly scrapers and data processing scripts.

Data

You do not need to pay any attention to the deprecated folder. Right now, all preliminary data is stored in the interim folder.

The processed folder contains data that are ready to analyze and visualize. There are three files now (see the loading sketch after this list):

  • paper_df.csv : paper data

  • author_df.csv : author data

  • session_df.csv : session data
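
If you want to work with these files, a minimal loading sketch with pandas might look like this (the Data/processed path is an assumption based on the folder names above):

import pandas as pd

# paths are assumptions based on the folder layout described above
paper_df = pd.read_csv('Data/processed/paper_df.csv')
author_df = pd.read_csv('Data/processed/author_df.csv')
session_df = pd.read_csv('Data/processed/session_df.csv')
print(paper_df.shape, author_df.shape, session_df.shape)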

Paper Data

Paper data has the following columns (a short usage sketch follows the list):

  • Paper ID : I assigned an ID to each paper, in the format of year-index

  • Title : the title of this conference paper

  • Paper Type : the type of this presentation, either Paper or Poster. Note that the ICA website did not distinguish between these two types until 2014. Therefore, all presentations prior to 2014 are classified as Paper, even though some might have been Posters instead.

  • Abstract : paper abstract

  • Number of Authors : number of authors in this paper

  • Year : the year when this paper was presented

  • Session : the specific session title

  • Division/Unit : the division (unit) this session is organized by.
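
As a small illustration of how these columns fit together, here is a sketch (assuming paper_df is loaded as shown earlier):

# papers per year
papers_per_year = paper_df.groupby('Year')['Paper ID'].count()

# Paper ID is in the format year-index, so the year can also be
# recovered from the ID itself
id_years = paper_df['Paper ID'].astype(str).str.split('-').str[0]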

Author Data

Author data has the following columns (a short usage sketch follows the list):

  • Paper ID : I assigned an ID to each paper, in the format of year-index

  • Paper Title : the title of this conference paper

  • Year : the year when this paper was presented

  • Number of Authors : number of authors in this paper

  • Author Position : the position of this author

  • Author Name : author name

  • Author Affiliation : author affiliation
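
Since this table has one row per (paper, author) pair, it can be joined back to the paper table on Paper ID; a sketch, assuming both tables are loaded as shown earlier:

# attach abstracts to the author-level rows
author_with_abstract = author_df.merge(
	paper_df[['Paper ID', 'Abstract']], on='Paper ID', how='left'
)

# rough count of distinct author names (names are not disambiguated)
n_author_names = author_df['Author Name'].nunique()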

Session Data

Session data has the following columns (a short usage sketch follows the list):

  • year : the year when this session occurred

  • session type : either paper session or interactive paper session (i.e., poster session)

  • session title : the title of this session

  • sub unit : the division/unit this session is organized by

  • chair name : the name of this session chair

  • chair aff : the affiliation of this session chair
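
Note that the column names in this table are lowercase. A sketch counting sessions by year and type (assuming session_df is loaded as shown earlier):

# sessions per year, broken down by session type
session_counts = session_df.groupby(['year', 'session type']).size().unstack()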

Workflow

I used snakemake to manage the workflow. Details are in the Snakefile.
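
For orientation, here is a minimal sketch of what one rule might look like, based on the shell commands excerpted at the bottom of this page (the rule name and output paths are hypothetical):

rule scrape_2003_2004:
    output:
        "data/interim/paper_2003_2004.csv",   # hypothetical output path
        "data/interim/author_2003_2004.csv"   # hypothetical output path
    shell:
        "python scripts/scrape_2003_2004.py {output}"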

Scripts

  • scrape_2003_2004.py scraped all data from 2003 to 2004

  • scrape_2005_2013.py scraped all data from 2005 to 2013

  • scrape_2014_onward_paper_session.py scraped data from 2014 to 2018, for the paper sessions.

  • scrape_2014_onward_interactive_paper.py scraped data for posters (extended abstracts) from 2014 to 2018.

  • combine_all_data.py cleaned, organized, and concatenated data.

Code Snippets

combine_all_data.py:
import pandas as pd
import numpy as np
import sys

PAPER_2003_2004 = sys.argv[1]
PAPER_2005_2013 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]
INTERACTIVE_PAPER_2014_2018 = sys.argv[4]
AUTHOR_2003_2004 = sys.argv[5]
AUTHOR_2005_2013 = sys.argv[6]
AUTHOR_2014_2018 = sys.argv[7]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[8]
SESSION_2014_2018 = sys.argv[9]
INTERACTIVE_SESSION_2014_2018 = sys.argv[10]
PAPER_DF = sys.argv[11]
AUTHOR_DF = sys.argv[12]
SESSION_DF = sys.argv[13]

if __name__ == '__main__':
	# import all data 
	paper1 = pd.read_csv(PAPER_2003_2004)
	paper2 = pd.read_csv(PAPER_2005_2013)
	paper3 = pd.read_csv(PAPER_2014_2018)
	paper4 = pd.read_csv(INTERACTIVE_PAPER_2014_2018)
	author1 = pd.read_csv(AUTHOR_2003_2004)
	author2 = pd.read_csv(AUTHOR_2005_2013)
	author3 = pd.read_csv(AUTHOR_2014_2018)
	author4 = pd.read_csv(INTERACTIVE_AUTHOR_2014_2018)
	session1 = pd.read_csv(SESSION_2014_2018)
	session2 = pd.read_csv(INTERACTIVE_SESSION_2014_2018)

	# add 'Year' to paper 1, paper2, author1, and author2
	paper2['Year'] = [i.split('-')[0] for i in paper2['Paper ID']]
	paper1['Year'] = [i.split('-')[0] for i in paper1['Paper ID']]
	author1['Year'] = [i.split('-')[0] for i in author1['Paper ID']]
	author2['Year'] = [i.split('-')[0] for i in author2['Paper ID']]

	# change author3 and author4 colname
	author3.columns = [
		'Paper ID', 'Paper Title', 'Year', 
		'Number of Authors', 'Author Position', 
		'Author Name', 'Author Affiliation'
	]
	author4.columns = [
		'Paper ID', 'Paper Title', 'Year', 
		'Number of Authors', 'Author Position', 
		'Author Name', 'Author Affiliation'
	]

	# author_df 
	author_df = pd.concat([author1, author2, author3, author4], axis = 0)

	print(f'Author DF is done. Its shape: {author_df.shape}')

	# create a paper id: author num dict 
	id_num_author_dict = dict(zip(author_df['Paper ID'], author_df['Number of Authors']))

	# there are four missing paper ids in author2
	paper2_id = paper2['Paper ID'].tolist()
	author2_id = list(set(author2['Paper ID']))
	print(f'Number of paper ids in paper2: {len(paper2_id)}')
	print(f'Number of paper ids in author2: {len(author2_id)}')
	missing_paper_id = [x for x in paper2_id if x not in author2_id]
	print(missing_paper_id)

	# update dict
	for x in missing_paper_id:
		id_num_author_dict[x] = np.nan

	# add number of authors to paper1 and paper2
	paper1['Number of Authors'] = [id_num_author_dict[pid] for pid in paper1['Paper ID']]
	paper2['Number of Authors'] = [id_num_author_dict[pid] for pid in paper2['Paper ID']]

	# select cols
	paper1 = paper1[['Paper ID', 'Title', 'Type', 'Abstract', 'Number of Authors', 'Year']]
	paper2 = paper2[[
		'Paper ID', 'Title', 'Submission Type', 
		'Abstract', 'Number of Authors', 'Year', 'Session', 'Division/Unit'
	]]

	# update column names
	paper1.columns = ['Paper ID', 'Title', 'Paper Type', 
		'Abstract', 'Number of Authors', 'Year']
	paper2.columns = ['Paper ID', 'Title', 'Paper Type', 
		'Abstract', 'Number of Authors', 'Year', 'Session', 'Division/Unit']
	paper3.columns = ['Paper ID', 'Year', 'Paper Type', 'Title', 
		'Number of Authors', 'Abstract', 'Session', 'Division/Unit']
	paper4.columns = ['Paper ID', 'Year', 'Paper Type', 'Title', 
		'Number of Authors', 'Abstract', 'Session', 'Division/Unit']

	# concatenate paper df
	paper_df = pd.concat([paper1, paper2, paper3, paper4], axis = 0)

	# concatenate session df 
	session_df = pd.concat([session1, session2], axis = 0)

	# write to file
	paper_df.to_csv(PAPER_DF, index = False)
	author_df.to_csv(AUTHOR_DF, index = False)
	session_df.to_csv(SESSION_DF, index = False)

	print('Files written. All should be in place now.')

scrape_2003_2004.py:
import pandas as pd
import numpy as np
import time 
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select
import sys

PAPER_03_04 = sys.argv[1]
AUTHOR_03_04 = sys.argv[2]

def click_on_search_papers():
	search_papers = wait.until(EC.element_to_be_clickable((
		By.CSS_SELECTOR, 
		"div.menu_item__icon_text_window__text > a.mainmenu_text"
	)))
	search_papers.click()

def get_papers():
	"""
	get all paper elements in the current page
	"""
	papers = driver.find_elements(
		By.CSS_SELECTOR, 'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
	)
	return papers

def get_paper_meta(paper, year, paper_meta_dict_list):
	"""
	get paper index, paper title, and paper_type
		the author names can be found here but I'll collect later in the view page
	"""
	idx = paper.find_element(
		By.CSS_SELECTOR, 'td[title="##"]').text
	## in the format of '0001'
	paper_id = year + '-' + idx.zfill(4)
	# summary elements:
	summary = paper.find_element(
		By.CSS_SELECTOR, 'td[title="Summary"]'
	)
	title = summary.find_element(
		By.CSS_SELECTOR, 'a.search_headingtext'
	).text
	submission_type = summary.find_element(
		By.CSS_SELECTOR, 'td[style="padding: 5px;"]'
	).text
	# lstrip() strips a character set, not a prefix, so remove the label explicitly
	_prefix = '  Individual Submission type: '
	if submission_type.startswith(_prefix):
		submission_type = submission_type[len(_prefix):]
	paper_meta_dict = {
		'Paper ID': paper_id,
		'Title': title,
		'Type': submission_type
	}
	# update the dict list
	paper_meta_dict_list.append(paper_meta_dict)
	return paper_meta_dict

def open_view(paper):
	"""
	Input:
		paper element
	Aim:
		open a new window and click 'view'
	"""
	action = paper.find_element(
		By.CSS_SELECTOR, 'td[title="Action"]'
	)
	view_link_e = action.find_element(
				By.CSS_SELECTOR, "li.action_list > a.fieldtext"
			)
	view_link = view_link_e.get_attribute('href')
	driver.execute_script("window.open('');")
	driver.switch_to.window(driver.window_handles[1])
	driver.get(view_link)

def get_title_to_check(paper_meta_dict_list):
	# there are two 'tr.header font.headingtext'
	# title is the second one
	headingtexts = driver.find_elements(
		By.CSS_SELECTOR, 'tr.header font.headingtext'
	)
	title_to_check = headingtexts[1].text
	# update the most recent paper_meta_dict_list
	paper_meta_dict_list[-1]['Title to Check'] = title_to_check
	return title_to_check

def get_authors(paper_meta_dict, author_dict_list):
	paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
	# note that authors_e will return a list since there might be multiple authors
	authors = driver.find_elements(
		By.CSS_SELECTOR, 'a.search_fieldtext_name'
	)
	for author in authors:
		author_idx = authors.index(author) + 1
		authorNum = len(authors)
		author_elements = author.text.split(' (')
		author_name = author_elements[0]
		# doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
		# some don't contain '()', i.e., affiliation info
		try:
			author_aff = author_elements[1].rstrip(')')
		except IndexError:
			author_aff = np.nan
		author_dict = {
			'Paper ID': paper_id,
			'Paper Title': title,
			'Number of Authors': authorNum,
			'Author Position': author_idx,
			'Author Name': author_name,
			'Author Affiliation': author_aff,
		}
		author_dict_list.append(author_dict)

def get_abstract(paper_meta_dict_list):
	# obtain abstract in the newly opened page
	abstract = driver.find_element(
		By.CSS_SELECTOR, 'blockquote.tight > font.fieldtext'
	).text
	paper_meta_dict_list[-1]['Abstract'] = abstract
	return abstract

def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
	papers = get_papers()
	for paper in papers:
	## to test:
	# for paper in papers[0:1]:
		paper_idx = papers.index(paper) + 1
		paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
		open_view(paper)
		get_title_to_check(paper_meta_dict_list)
		get_authors(paper_meta_dict, author_dict_list)
		get_abstract(paper_meta_dict_list)
		driver.close()
		driver.switch_to.window(driver.window_handles[0])
		print(f'Page {page_num} Paper {paper_idx} is done')
		time.sleep(0.5)

def get_iterators():
	iterators = driver.find_elements(
		By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
	)
	return iterators

if __name__ == '__main__':
	# initiate list to contain data
	paper_meta_dict_list = []
	author_dict_list = []
	driver = webdriver.Firefox()
	wait = WebDriverWait(driver, 10)
	urlBase = 'https://convention2.allacademic.com/one/ica/ica'
	# scrape 2003~2004
	years = range(3,5)
	for year in years:
		year = str(year).zfill(2)
		url = urlBase + year
		driver.get(url)
		# year in the form of 2003/2004
		year = f'20{year}'
		print(f'{year} has started!')
		click_on_search_papers()
		# to calculate total pages
		iterators = get_iterators()
		total_pages = int(iterators[-2].text)
		for i in range(1,total_pages+1):
			page_num = i
			if i >= 10:
				if year == '2004':
					print('2004!')
					select = Select(driver.find_element(
						By.XPATH, '//div[@class="iterator"][1] // select'
					))
					select.select_by_visible_text('+ 20')
				else:
					# if '2003', click on '20'
					iterators = get_iterators()
					iterators[-2].click()
			iterators = get_iterators()
			for j in iterators:
				if (j.text == str(i)):
					j.click()
					break 
			scrape_one_page(
				year,
				page_num, 
				paper_meta_dict_list, 
				author_dict_list
			)
			print(f'page {i} is done')
			# go back to the first page
			iterators = get_iterators()
			iterators[1].click()
	print('Everything done!')
	driver.close()
	driver.quit()
	print('Writing to file now...')
	pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_03_04, index = False)
	pd.DataFrame(author_dict_list).to_csv(AUTHOR_03_04, index = False)

scrape_2005_2013.py:
import pandas as pd
import numpy as np
import time 
import math
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select
import sys

PAPER_2005_2013 = sys.argv[1]
AUTHOR_2005_2013 = sys.argv[2]

def click_on_view_program():
	all_btn = driver.find_elements(
		By.CSS_SELECTOR, 
		"div.menu_item__icon_text_window__text > a.mainmenu_text"
	)
	for btn in all_btn:
		if 'Program' in btn.text:
			view_program_btn = btn 
			break
	view_program_btn.click()

def click_on_individual_presentations():
	'''
	To click on 'individual presentations'
	'''
	presentations = wait.until(EC.element_to_be_clickable((
		By.XPATH,
		'//td[@class="tab_topped_window__tab_cell"][2]'
	)))
	presentations.click()

def get_papers():
	"""
	get all paper elements in the current page
	"""
	papers = driver.find_elements(
		By.CSS_SELECTOR, 'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
	)
	return papers

def removeprefix(text, prefix):
	# https://stackoverflow.com/a/16891418
	if text.startswith(prefix):
		return text[len(prefix):]
	return text 

def get_paper_meta(paper, year, paper_meta_dict_list):
	"""
	get paper index, paper title, and paper_type
		the author names can be found here but I'll collect later in the view page
	"""
	idx = paper.find_element(
		By.CSS_SELECTOR, 'td[title="##"]').text
	## in the format of '0001'
	paper_id = year + '-' + idx.zfill(4)
	# summary elements:
	summary = paper.find_element(
		By.CSS_SELECTOR, 'td[title="Summary"]'
	)
	title = summary.find_element(
		By.CSS_SELECTOR, 'a.search_headingtext'
	).text
	summary_info = summary.find_elements(
		By.CSS_SELECTOR, 'td[style="padding: 5px;"] tr'
	)
	session = np.nan
	division = np.nan
	submission_type = np.nan
	research_areas = np.nan
	for i in summary_info:
		if 'In Session Submission' in i.text:
			session = removeprefix(i.text, '  In Session Submission: ')
		elif 'Session Submission Division' in i.text:
			division = removeprefix(i.text, '  Session Submission Division: ')
		elif 'Session Submission Unit' in i.text:
			division = removeprefix(i.text, '  Session Submission Unit: ')
		elif 'Submission type' in i.text:
			submission_type = removeprefix(i.text, '  Individual Submission type: ')
		elif 'Research Areas:' in i.text:
			research_areas = removeprefix(i.text, '  Research Areas: ')
	paper_meta_dict = {
		'Paper ID': paper_id,
		'Title': title,
		'Session': session,
		'Division/Unit': division,
		'Submission Type': submission_type,
		'Research Areas': research_areas,
	}
	# update the dict list
	paper_meta_dict_list.append(paper_meta_dict)
	return paper_meta_dict

def open_view(paper):
	"""
	Input:
		paper element
	Aim:
		open a new window and click 'view'
	"""
	action = paper.find_element(
		By.CSS_SELECTOR, 'td[title="Action"]'
	)
	view_link_e = action.find_element(
				By.CSS_SELECTOR, "li.action_list > a.fieldtext"
			)
	view_link = view_link_e.get_attribute('href')
	driver.execute_script("window.open('');")
	driver.switch_to.window(driver.window_handles[1])
	driver.get(view_link)

def get_title_to_check(paper_meta_dict_list):
	# there are two 'tr.header font.headingtext'
	# title is the second one
	headingtexts = driver.find_elements(
		By.CSS_SELECTOR, 'tr.header font.headingtext'
	)
	title_to_check = headingtexts[1].text
	# update the most recent paper_meta_dict_list
	paper_meta_dict_list[-1]['Title to Check'] = title_to_check
	return title_to_check

def get_session_to_check(paper_meta_dict_list):
	session_to_check = driver.find_element(
		By.CSS_SELECTOR, 'blockquote.tight > a.search_headingtext'
	)
	session_to_check = session_to_check.text
	# update the most recent paper_meta_dict_list
	paper_meta_dict_list[-1]['Session to Check'] = session_to_check
	return session_to_check

def get_authors(paper_meta_dict, author_dict_list):
	paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
	# note that authors_e will return a list since there might be multiple authors
	authors = driver.find_elements(
		By.CSS_SELECTOR, 'a.search_fieldtext_name'
	)
	for author in authors:
		author_idx = authors.index(author) + 1
		authorNum = len(authors)
		author_elements = author.text.split(' (')
		author_name = author_elements[0]
		# doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
		# some don't contain '()', i.e., affiliation info
		try:
			author_aff = author_elements[1].rstrip(')')
		except IndexError:
			author_aff = np.nan
		author_dict = {
			'Paper ID': paper_id,
			'Paper Title': title,
			'Number of Authors': authorNum,
			'Author Position': author_idx,
			'Author Name': author_name,
			'Author Affiliation': author_aff,
		}
		author_dict_list.append(author_dict)

def get_abstract(paper_meta_dict_list):
	# abstract
	abstract = driver.find_elements(
		By.CSS_SELECTOR, 'blockquote.tight'
	)[-1]
	abstract = abstract.text
	abstract = " ".join(abstract.splitlines()).strip()

	paper_meta_dict_list[-1]['Abstract'] = abstract

	return abstract

def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
	papers = get_papers()
	for paper in papers:
	## to test:
	# for paper in papers[0:1]:
		paper_idx = papers.index(paper) + 1
		paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
		open_view(paper)
		get_title_to_check(paper_meta_dict_list)
		get_session_to_check(paper_meta_dict_list)
		get_authors(paper_meta_dict, author_dict_list)
		get_abstract(paper_meta_dict_list)
		driver.close()
		driver.switch_to.window(driver.window_handles[0])
		print(f'Year {year}, Page {page_num} Paper {paper_idx} is done')
		time.sleep(0.05)

def get_iterators():
	iterators = driver.find_elements(
		By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
	)
	return iterators

if __name__ == '__main__':
	# initiate list to contain data
	paper_meta_dict_list = []
	author_dict_list = []
	driver = webdriver.Firefox()
	wait = WebDriverWait(driver, 10)
	urlBase = 'https://convention2.allacademic.com/one/ica/ica'
	# scrape 2005~2013
	years = range(5,14)
	for year in years:
		year = str(year).zfill(2)
		url = urlBase + year
		driver.get(url)
		# year in the form of 2005/2013
		year = f'20{year}'
		print(f'{year} has started!')
		click_on_view_program()
		click_on_individual_presentations()
		# to calculate total pages
		iterators = get_iterators()
		total_pages = int(iterators[-2].text)
		for i in range(1,total_pages+1):
			print(f'page {i} has started')
			page_num = i
			if i < 10:
				pass
			elif i >= 10 and i < 17:
				select = Select(driver.find_element(
					By.XPATH, '//div[@class="iterator"][1] // select'
				))
				select.select_by_visible_text('+ 10')
			elif i >= 17 and i < 27:
				select = Select(driver.find_element(
					By.XPATH, '//div[@class="iterator"][1] // select'
				))
				select.select_by_visible_text('+ 20')
			elif i >= 27 and i < 37:
				select = Select(driver.find_element(
					By.XPATH, '//div[@class="iterator"][1] // select'
				))
				select.select_by_visible_text('+ 30')
			else:
				iterators = get_iterators()
				iterators[-2].click()
			# a subtle but convenient behavior: when i == 21, after selecting '+ 20',
			# the current iterator is already 21. get_iterators() will skip the
			# current iterator, so no j equals 21, the for loop below matches
			# nothing, and the program goes directly to scrape_one_page()
			iterators = get_iterators()
			for j in iterators:
				if (j.text == str(i)):
					current_idx = int(j.text)
					j.click()
					break 
			scrape_one_page(
				year,
				page_num, 
				paper_meta_dict_list, 
				author_dict_list
			)
			iterators = get_iterators()
			iterators[1].click()
	print('Everything done!')
	driver.close()
	driver.quit()
	print('Writing to file now...')
	pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_2005_2013, index = False)
	pd.DataFrame(author_dict_list).to_csv(AUTHOR_2005_2013, index = False)

scrape_2014_onward_interactive_paper.py:
import pandas as pd
import numpy as np
import time 
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select
import sys
import random 

INTERACTIVE_SESSION_2014_2018 = sys.argv[1]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[2]
INTERACTIVE_PAPER_2014_2018 = sys.argv[3]

def click_browse_by_session_type():
	'''click on "browse by session type"
	'''
	browse_by_session_type = driver.find_elements(
		By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
	)[3]
	browse_by_session_type.click()

def click_interactive_paper_session():
	'''click "paper session" button
	'''
	paper_session = driver.find_element(
		By.XPATH, '//li[@class="ui-li-has-count ui-first-child"] //a[@class="ui-btn"]'
	)
	paper_session.click()

def get_sessions():
	'''These are session links
	'''
	sessions = driver.find_elements(
		By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
	)
	return sessions

def update_session_meta(year, session_tuples):
	'''update session metadata: session title, session sub unit, 
		session chair name and affiliation
	'''
	session_title_e = driver.find_element(
		By.CSS_SELECTOR, 'h3'
	)
	session_title = session_title_e.text

	# sub unit, cosponsor, chair, the presentations
	h4s = driver.find_elements(
		By.CSS_SELECTOR, 'h4'
	)
	h4s_texts = [i.text for i in h4s]
	sub_unit_e_idx = h4s_texts.index('Sub Unit')
	'''sub unit and chair are very tricky
	Some examples: year 2015, session "Environmental Journalism: Coverage, Reader Response, and Mediators"
	  in the above example, 'chair' is below 'cosponsor'
	Another example, year 2015, session 'B.E.S.T.: Organizations, Communication, and Technology'
	  This example is a little bit strange because we have 'abstract' here. However, it does not have the gray area
	My conclusion is that the gray box for the sub unit always seems to be the first one, so
	I can use index 4. For the chair, I need to take its index and add 5
	'''
	try:
		sub_unit_e = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
		)[4]
		sub_unit = sub_unit_e.text
	except:
		sub_unit = None
	# if there is no 'Chair', for example, session 200 of 2016,
	# then there is no need to proceed further. 
	if 'Chair' not in h4s_texts:
		chair_name = None
		chair_aff = None
	else:
		try:
			if 'Cosponsor' in h4s_texts:
				chair_e_idx = 6
			else:
				chair_e_idx = 5
			# chair_e_idx = h4s_texts.index('Chair')
			chair_graybox = driver.find_elements(
				By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
			)[chair_e_idx]
			chair_es = chair_graybox.find_elements(
				By.CSS_SELECTOR, 'li'
			)
			if chair_es:
				if len(chair_es) == 1:
					chair_info = chair_es[0].text
					chair_name = chair_info.split(', ')[0]
					chair_aff = chair_info.split(', ')[1]
				# this is to solve the issue of when there are multiple chairs. For example,
				# year 2018, session 'Research Escalator - Part 1'
				else:
					chair_name = ''
					chair_aff = ''
					for chair_e in chair_es:
						chair_info = chair_e.text
						chair_name_i = chair_info.split(', ')[0]
						chair_aff_i = chair_info.split(', ')[1]
						chair_name += chair_name_i
						chair_aff += chair_aff_i
						if chair_e != chair_es[-1]:
							chair_name += '; '
							chair_aff += '; '
		except:
			chair_name = None
			chair_aff = None

	session_tuples.append((
		year,
		'Interactive Paper Session',
		session_title,
		sub_unit,
		chair_name,
		chair_aff,
	))
	# return session title and sub_unit so that I can use them later
	return session_title, sub_unit

def get_author_num():
	"""This is to get authors element and author number, 
		which I use later in get paper info and author info
	"""
	authors_e = driver.find_elements(
		By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child  a.ui-icon-carat-r'
	)[2:]
	author_num = len(authors_e)
	return authors_e, author_num

def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
	'''get author info and update author tuples
	'''
	paper_id = year + '-' + str(paper_id).zfill(4)
	for author in authors_e:
		author_position = authors_e.index(author) + 1
		# split on the first ', ' only to solve the issue of 'person, aff, dept'
		try:
			author_name, author_aff = author.text.split(', ', 1)
		# For example:
		# 2016, Gaining Access to Social Capital, Louis Leung has no aff
		except ValueError:
			author_name = author.text
			author_aff = None
		author_tuples.append((
			paper_id,
			paper_title,
			year,
			author_num,
			author_position,
			author_name,
			author_aff
		))

def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
	'''get paper info and update paper tuples
	'''
	paper_id = year + '-' + str(paper_id).zfill(4)
	paper_title_e = driver.find_element(
		By.CSS_SELECTOR, 'h3'
	)
	paper_title = paper_title_e.text
	abstract = driver.find_element(
		By.CSS_SELECTOR, 'blockquote > p'
	).text 
	paper_tuples.append((
		paper_id,
		year,
		'Poster',
		paper_title, 
		author_num, 
		abstract, 
		session_title, 
		sub_unit,
	))
	# return paper title so I can use it in get_author_info
	return paper_title

def get_papers():

	h4s = driver.find_elements(
		By.CSS_SELECTOR, 'h4'
	)
	if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
		# I do not know why but the first two selections are not paper elements. I need to remove them. 
		papers = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child  a.ui-icon-carat-r'
		)[2:]
	# this is to prevent something like the session of
	# Good Grief! Disasters, Crises, and High-Risk Organizational Environments
		return papers
	elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
		papers = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3)  a.ui-icon-carat-r'
		)[2:]

		return papers
	else:
		'''Why does this happen? Go to year 2016, session 262, and you'll see that
		   there are no papers.

		   Session 103 of year 2014 also has no papers.
		'''
		# print('Something went wrong!')
		print('THERE ARE PROBABLY NO PAPERS HERE')
		# to_scrape_later_tuples.append((year, session_index))

if __name__ == '__main__':
	driver = webdriver.Firefox()
	wait = WebDriverWait(driver, 10)
	urlBase = 'https://convention2.allacademic.com/one/ica/ica'
	# scrape 2014-2018
	# years = range(14,19)
	years = [14, 15, 16, 17, 18]
	session_tuples = []
	author_tuples = []
	paper_tuples = []
	for year in years:
		year = str(year)
		url = urlBase + year
		driver.get(url)
		# year in the form of 2014/2018
		year = f'20{year}'
		print(f'{year} has started!')
		click_browse_by_session_type()
		click_interactive_paper_session()
		sessions = get_sessions()
		print(f'There are {len(sessions)} sessions.')

		# randomly choose 5 sessions for testing
		random_sessions = random.sample(sessions, 5)
		# to assign paper id. initiate it as 0 and then add 1 each time
		paper_id = 0
		for s in sessions:
		# for s in random_sessions:
			session_index = sessions.index(s)
			s_link = s.get_attribute('href')
			# open a new window
			driver.execute_script("window.open('');")
			# switch to the new window
			driver.switch_to.window(driver.window_handles[1])
			# open the session
			driver.get(s_link)
			session_title, sub_unit = update_session_meta(year, session_tuples)
			if 'preconference:' not in session_title.lower():
				print(f'Session {session_index} has started')
				papers = get_papers()
				# Sometimes paper is none, for example, year 2016, session
				# Communication and Technology, Game Studies, and Information Systems Joint Reception
				if papers:
					print(f'There are {len(papers)} papers.')
					for p in papers:
						# 2016, SESSION 85 HAS TROUBLES
						try:
							p_link = p.get_attribute('href')
							driver.execute_script("window.open('');")
							driver.switch_to.window(driver.window_handles[2])
							driver.get(p_link)
							authors_e, author_num = get_author_num()
							paper_title = get_paper_info(
								paper_tuples, 
								author_num, 
								session_title, 
								sub_unit, 
								year, 
								paper_id
							)
							get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year)
						except:
							print('This paper is unavailable.')
						paper_id += 1

						print(f'Paper {papers.index(p) + 1} is done.')
						time.sleep(0.5+random.uniform(0, 0.5)) 
						# close window 2
						driver.close()
						# switch to window 1
						driver.switch_to.window(driver.window_handles[1])

				print(f'Session {session_index} is done.')
				time.sleep(0.5+random.uniform(0, 0.5)) 
			else:
				print(f'Session {session_index} is Preconference.')
			# close window 1
			driver.close()
			# switch to window 0
			driver.switch_to.window(driver.window_handles[0])

	print('Everything done!')
	driver.close()
	driver.quit()

	pd.DataFrame(session_tuples, columns = [
		'year',
		'session type',
		'session title',
		'sub unit',
		'chair name',
		'chair aff',
		]).to_csv(INTERACTIVE_SESSION_2014_2018, index = False)
	pd.DataFrame(author_tuples, columns = [
		'paper id',
		'paper title',
		'year',
		'author number',
		'author position',
		'author name',
		'author aff'
		]).to_csv(INTERACTIVE_AUTHOR_2014_2018, index = False)
	pd.DataFrame(paper_tuples, columns = [
		'paper id',
		'year',
		'paper type',
		'paper title',
		'author number',
		'abstract',
		'session title',
		'sub unit'
		]).to_csv(INTERACTIVE_PAPER_2014_2018, index = False)

scrape_2014_onward_paper_session.py:
import pandas as pd
import numpy as np
import time 
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select
import sys
import random 

SESSION_2014_2018 = sys.argv[1]
AUTHOR_2014_2018 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]

def click_browse_by_session_type():
	'''click on "browse by session type"
	'''
	browse_by_session_type = driver.find_elements(
		By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
	)[3]
	browse_by_session_type.click()

def click_paper_session():
	'''click "paper session" button
	'''
	paper_session = driver.find_element(
		By.XPATH, '//li[@class="ui-li-has-count"][3] //a[@class="ui-btn"]'
	)
	paper_session.click()

def get_sessions():
	'''These are session links
	'''
	sessions = driver.find_elements(
		By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
	)
	return sessions

def update_session_meta(year, session_tuples):
	'''update session metadata: session title, session sub unit, 
		session chair name and affiliation
	'''
	session_title_e = driver.find_element(
		By.CSS_SELECTOR, 'h3'
	)
	session_title = session_title_e.text

	# sub unit, cosponsor, chair, the presentations
	h4s = driver.find_elements(
		By.CSS_SELECTOR, 'h4'
	)
	h4s_texts = [i.text for i in h4s]
	sub_unit_e_idx = h4s_texts.index('Sub Unit')
	'''sub unit and chair are very tricky
	Some examples: year 2015, session "Environmental Journalism: Coverage, Reader Response, and Mediators"
	  in the above example, 'chair' is below 'cosponsor'
	Another example, year 2015, session 'B.E.S.T.: Organizations, Communication, and Technology'
	  This example is a little bit strange because we have 'abstract' here. However, it does not have the gray area
	My conclusion is that the gray box for the sub unit always seems to be the first one, so
	I can use index 4. For the chair, I need to take its index and add 5
	'''
	try:
		sub_unit_e = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
		)[4]
		sub_unit = sub_unit_e.text
	except:
		sub_unit = None
	# if there is no 'Chair', for example, session 200 of 2016,
	# then there is no need to proceed further. 
	if 'Chair' not in h4s_texts:
		chair_name = None
		chair_aff = None
	else:
		try:
			if 'Cosponsor' in h4s_texts:
				chair_e_idx = 6
			else:
				chair_e_idx = 5
			# chair_e_idx = h4s_texts.index('Chair')
			chair_graybox = driver.find_elements(
				By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
			)[chair_e_idx]
			chair_es = chair_graybox.find_elements(
				By.CSS_SELECTOR, 'li'
			)
			if chair_es:
				if len(chair_es) == 1:
					chair_info = chair_es[0].text
					chair_name = chair_info.split(', ')[0]
					chair_aff = chair_info.split(', ')[1]
				# this is to solve the issue of when there are multiple chairs. For example,
				# year 2018, session 'Research Escalator - Part 1'
				else:
					chair_name = ''
					chair_aff = ''
					for chair_e in chair_es:
						chair_info = chair_e.text
						chair_name_i = chair_info.split(', ')[0]
						chair_aff_i = chair_info.split(', ')[1]
						chair_name += chair_name_i
						chair_aff += chair_aff_i
						if chair_e != chair_es[-1]:
							chair_name += '; '
							chair_aff += '; '
		except:
			chair_name = None
			chair_aff = None

	session_tuples.append((
		year,
		'Paper Session',
		session_title,
		sub_unit,
		chair_name,
		chair_aff,
	))
	# return session title and sub_unit so that I can use them later
	return session_title, sub_unit

def get_author_num():
	"""This is to get authors element and author number, 
		which I use later in get paper info and author info
	"""
	authors_e = driver.find_elements(
		By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child  a.ui-icon-carat-r'
	)[2:]
	author_num = len(authors_e)
	return authors_e, author_num

def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
	'''get author info and update author tuples
	'''
	paper_id = year + '-' + str(paper_id).zfill(4)
	for author in authors_e:
		author_position = authors_e.index(author) + 1
		# split on the first ', ' only to solve the issue of 'person, aff, dept'
		try:
			author_name, author_aff = author.text.split(', ', 1)
		# For example:
		# 2016, Gaining Access to Social Capital, Louis Leung has no aff
		except ValueError:
			author_name = author.text
			author_aff = None
		author_tuples.append((
			paper_id,
			paper_title,
			year,
			author_num,
			author_position,
			author_name,
			author_aff
		))

def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
	'''get paper info and update paper tuples
	'''
	paper_id = year + '-' + str(paper_id).zfill(4)
	paper_title_e = driver.find_element(
		By.CSS_SELECTOR, 'h3'
	)
	paper_title = paper_title_e.text
	abstract = driver.find_element(
		By.CSS_SELECTOR, 'blockquote > p'
	).text 
	# abstract = " ".join(abstract.splitlines()).strip()
	paper_tuples.append((
		paper_id,
		year,
		'Paper Session',
		paper_title, 
		author_num, 
		abstract, 
		session_title, 
		sub_unit,
	))
	# return paper title so I can use it in get_author_info
	return paper_title

def get_papers():

	h4s = driver.find_elements(
		By.CSS_SELECTOR, 'h4'
	)
	if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
		# I do not know why but the first two selections are not paper elements. I need to remove them. 
		papers = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child  a.ui-icon-carat-r'
		)[2:]
	# this is to prevent something like the session of
	# Good Grief! Disasters, Crises, and High-Risk Organizational Environments
		return papers
	elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
		papers = driver.find_elements(
			By.CSS_SELECTOR, 'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3)  a.ui-icon-carat-r'
		)[2:]

		return papers
	else:
		'''Why does this happen? Go to year 2016, session 262, and you'll see that
		   there are no papers.

		   Session 103 of year 2014 also has no papers.
		'''
		# print('Something went wrong!')
		print('THERE ARE PROBABLY NO PAPERS HERE')
		# to_scrape_later_tuples.append((year, session_index))

if __name__ == '__main__':
	driver = webdriver.Firefox()
	wait = WebDriverWait(driver, 10)
	urlBase = 'https://convention2.allacademic.com/one/ica/ica'
	# scrape 2014-2018
	# years = range(14,19)
	years = [14, 15, 16, 17, 18]
	# there are always excepts, for example, 2016 session 262

	session_tuples = []
	author_tuples = []
	paper_tuples = []
	for year in years:
		year = str(year)
		url = urlBase + year
		driver.get(url)
		# year in the form of 2014/2018
		year = f'20{year}'
		print(f'{year} has started!')
		click_browse_by_session_type()
		click_paper_session()
		sessions = get_sessions()
		print(f'There are {len(sessions)} sessions.')

		# randomly choose 5 sessions for testing
		random_sessions = random.sample(sessions, 5)

		# to assign paper id. initiate it as 0 and then add 1 each time
		paper_id = 0
		for s in sessions:
		# for s in random_sessions:
			session_index = sessions.index(s)
			s_link = s.get_attribute('href')
			# open a new window
			driver.execute_script("window.open('');")
			# switch to the new window
			driver.switch_to.window(driver.window_handles[1])
			# open the session
			driver.get(s_link)
			session_title, sub_unit = update_session_meta(year, session_tuples)
			if 'preconference:' not in session_title.lower():
				print(f'Session {session_index} has started')
				papers = get_papers()
				# Sometimes paper is none, for example, year 2016, session
				# Communication and Technology, Game Studies, and Information Systems Joint Reception
				if papers:
					print(f'There are {len(papers)} papers.')
					for p in papers:
						# 2016, SESSION 85 HAS TROUBLES
						try:
							p_link = p.get_attribute('href')
							driver.execute_script("window.open('');")
							driver.switch_to.window(driver.window_handles[2])
							driver.get(p_link)
							authors_e, author_num = get_author_num()
							paper_title = get_paper_info(
								paper_tuples, 
								author_num, 
								session_title, 
								sub_unit, 
								year, 
								paper_id
							)
							get_author_info(
								authors_e, author_num, author_tuples, paper_title, paper_id, year)
						except:
							print('This paper is unavailable.')
						paper_id += 1

						print(f'Paper {papers.index(p) + 1} is done.')
						time.sleep(0.5+random.uniform(0, 0.5)) 
						# close window 2
						driver.close()
						# switch to window 1
						driver.switch_to.window(driver.window_handles[1])

				print(f'Session {session_index} is done.')
				time.sleep(0.5+random.uniform(0, 0.5)) 
			else:
				print(f'Session {session_index} is Preconference.')
			# close window 1
			driver.close()
			# switch to window 0
			driver.switch_to.window(driver.window_handles[0])

	print('Everything done!')
	driver.close()
	driver.quit()

	pd.DataFrame(session_tuples, columns = [
		'year',
		'session type',
		'session title',
		'sub unit',
		'chair name',
		'chair aff',
		]).to_csv(SESSION_2014_2018, index = False)
	pd.DataFrame(author_tuples, columns = [
		'paper id',
		'paper title',
		'year',
		'author number',
		'author position',
		'author name',
		'author aff'
		]).to_csv(AUTHOR_2014_2018, index = False)
	pd.DataFrame(paper_tuples, columns = [
		'paper id',
		'year',
		'paper type',
		'paper title',
		'author number',
		'abstract',
		'session title',
		'sub unit'
		]).to_csv(PAPER_2014_2018, index = False)

Snakefile (excerpts):

shell: "python scripts/scrape_2003_2004.py {output}"
shell: "python scripts/scrape_2005_2013.py {output}"
shell: "python scripts/scrape_2014_onward_paper.py {output}"
shell: "python scripts/scrape_2014_onward_interactive_paper.py {output}"
shell: "python scripts/combine_all_data.py {input} {output}"

URL: https://github.com/hongtaoh/ica_conf
Name: ica_conf
Version: 1