Why this project?
It surprised me that we have over 20 years of ICA annual conference data, and yet no one has organized it in a way that gives every researcher easy access to it. Scraping all the data manually is a painful effort, and I do not expect every scholar to do that.
But why ICA conference data? Why do we need it? What is it good for? I have the following ideas:
- To inspire new research ideas. Right now, most communication literature comes from journal papers (searched mostly via Google Scholar). Findings from conferences may provide a new perspective and inspire new directions.
- To help circumvent publication bias. Publications might be biased (for example, https://doi.org/10.1093/hcr/hqz015), and not all research projects end up being published. It is therefore important to see the topics that are researched but not published (this idea is inspired by Yiwei Xu from Cornell). ICA annual conferences are a good starting point for communication science. Note that ICA annual conferences are peer reviewed and selective, so even though these papers are not published, their quality is still guaranteed. This is different from non-peer-reviewed preprints.
- For larger scientometric analyses. The ICA annual conference dataset we collected is large: it contains over 30K papers and 70K authors (a rough guess). This dataset is useful for large-scale scientometric analysis, for example, studying the topic evolution of communication studies over the past 20 years, or studying academic collaboration and mobility within the field of communication.
- To contribute to open science. We aim to make our dataset public so that other researchers have equal access to these data (from Yiwei).
- To better understand the diversity of communication scholars and research topics. Right now, we only have access to journal data, but that covers only part of communication scholars and communication research. To get a broader picture and a deeper understanding, we need data about the conferences as well.
Data sources
Plans
I am thinking of (1) designing an interactive paper exploration system, (2) cleaning the dataset and making it public, and (3) writing a paper based on preliminary results. I do not plan to do comprehensive analyses of the data; that is the job for other scholars if they want to use our dataset.
Introduction to this Repository
This repository now has three folders:

- `Data`: where all data is stored.
- `Notebooks`: exploratory coding. It is mainly useful for me and may not be useful for others.
- `Workflow`: where all the code is stored, mostly scrapers and data processing scripts.
Data
You do not need to pay any attention to the `deprecated` folders. Right now, all preliminary data is stored in the `interim` folder. The `processed` folder contains data that are ready to analyze and visualize. There are three files now:

- `paper_df.csv`: paper data
- `author_df.csv`: author data
- `session_df.csv`: session data
Paper Data
Paper data has the following columns:
- `Paper ID`: an ID I assigned to each paper, in the format of `year-index`
- `Title`: the title of this conference paper
- `Paper Type`: the type of this presentation, either `Paper` or `Poster`. Note that the ICA website did not distinguish these two types until 2014, so all presentations prior to 2014 are classified as `Paper`, even though some might have been `Poster` instead.
- `Abstract`: paper abstract
- `Number of Authors`: number of authors of this paper
- `Year`: the year when this paper was presented
- `Session`: the specific session title
- `Division/Unit`: the division (unit) that organized this session
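As a quick sanity check, the paper table can be explored directly with pandas. A minimal sketch, assuming the file lives under `data/processed/` as described above:

```python
import pandas as pd

# assumed path, per the folder description above
paper_df = pd.read_csv('data/processed/paper_df.csv')

# papers per year; 'Year' can also be recovered from 'Paper ID' (year-index)
print(paper_df['Year'].value_counts().sort_index())

# the Paper/Poster distinction is only meaningful from 2014 onward
recent = paper_df[paper_df['Year'].astype(int) >= 2014]
print(recent['Paper Type'].value_counts())
```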
Author Data
Author data has the following columns:
- `Paper ID`: an ID I assigned to each paper, in the format of `year-index`
- `Paper Title`: the title of this conference paper
- `Year`: the year when this paper was presented
- `Number of Authors`: number of authors of this paper
- `Author Position`: the position of this author in the author list
- `Author Name`: author name
- `Author Affiliation`: author affiliation
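Because the author table has one row per (paper, author) pair, per-paper statistics need a deduplication step first. A minimal sketch, again assuming the `data/processed/` location:

```python
import pandas as pd

author_df = pd.read_csv('data/processed/author_df.csv')  # assumed path

# team sizes: deduplicate to one row per paper before describing
team_sizes = author_df.drop_duplicates('Paper ID')['Number of Authors']
print(team_sizes.describe())

# most frequently listed affiliations (raw strings, not disambiguated)
print(author_df['Author Affiliation'].value_counts().head(10))
```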
Session Data
Session data has the following columns:
- `year`: the year when this session occurred
- `session type`: either `paper session` or `interactive paper session` (i.e., poster session)
- `session title`: the title of this session
- `sub unit`: the division/unit that organized this session
- `chair name`: the name of the session chair
- `chair aff`: the affiliation of the session chair
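Note that session metadata appears to cover only 2014 to 2018, since it is produced by the two 2014-onward scrapers. A minimal sketch of exploring it, assuming the same `data/processed/` location:

```python
import pandas as pd

session_df = pd.read_csv('data/processed/session_df.csv')  # assumed path

# sessions per year, split by session type
print(session_df.groupby(['year', 'session type']).size().unstack(fill_value=0))

# which divisions/units organized the most sessions
print(session_df['sub unit'].value_counts().head(10))
```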
Workflow
I used `snakemake` to manage the workflow. Details are in the `Snakefile`.
Scripts
- `scrape_2003_2004.py` scrapes all data from 2003 to 2004
- `scrape_2005_2013.py` scrapes all data from 2005 to 2013
- `scrape_2014_onward_paper_session.py` scrapes data from 2014 to 2018, for the paper sessions
- `scrape_2014_onward_interactive_paper.py` scrapes data for posters (extended abstracts) from 2014 to 2018
- `combine_all_data.py` cleans, organizes, and concatenates all of the above
Code Snippets
`combine_all_data.py`:

```python
import pandas as pd
import numpy as np
import sys

PAPER_2003_2004 = sys.argv[1]
PAPER_2005_2013 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]
INTERACTIVE_PAPER_2014_2018 = sys.argv[4]
AUTHOR_2003_2004 = sys.argv[5]
AUTHOR_2005_2013 = sys.argv[6]
AUTHOR_2014_2018 = sys.argv[7]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[8]
SESSION_2014_2018 = sys.argv[9]
INTERACTIVE_SESSION_2014_2018 = sys.argv[10]
PAPER_DF = sys.argv[11]
AUTHOR_DF = sys.argv[12]
SESSION_DF = sys.argv[13]

if __name__ == '__main__':
    # import all data
    paper1 = pd.read_csv(PAPER_2003_2004)
    paper2 = pd.read_csv(PAPER_2005_2013)
    paper3 = pd.read_csv(PAPER_2014_2018)
    paper4 = pd.read_csv(INTERACTIVE_PAPER_2014_2018)
    author1 = pd.read_csv(AUTHOR_2003_2004)
    author2 = pd.read_csv(AUTHOR_2005_2013)
    author3 = pd.read_csv(AUTHOR_2014_2018)
    author4 = pd.read_csv(INTERACTIVE_AUTHOR_2014_2018)
    session1 = pd.read_csv(SESSION_2014_2018)
    session2 = pd.read_csv(INTERACTIVE_SESSION_2014_2018)

    # add 'Year' to paper1, paper2, author1, and author2
    paper2['Year'] = [i.split('-')[0] for i in paper2['Paper ID']]
    paper1['Year'] = [i.split('-')[0] for i in paper1['Paper ID']]
    author1['Year'] = [i.split('-')[0] for i in author1['Paper ID']]
    author2['Year'] = [i.split('-')[0] for i in author2['Paper ID']]

    # rename author3 and author4 columns
    author3.columns = [
        'Paper ID', 'Paper Title', 'Year', 'Number of Authors',
        'Author Position', 'Author Name', 'Author Affiliation'
    ]
    author4.columns = [
        'Paper ID', 'Paper Title', 'Year', 'Number of Authors',
        'Author Position', 'Author Name', 'Author Affiliation'
    ]

    # author_df
    author_df = pd.concat([author1, author2, author3, author4], axis=0)
    print(f'Author DF is done. Its shape: {author_df.shape}')

    # create a paper id -> author num dict
    id_num_author_dict = dict(zip(author_df['Paper ID'],
                                  author_df['Number of Authors']))

    # there are four missing paper ids in author2
    paper2_id = paper2['Paper ID'].tolist()
    author2_id = list(set(author2['Paper ID']))
    print(f'Number of paper ids in paper2: {len(paper2_id)}')
    print(f'Number of paper ids in author2: {len(author2_id)}')
    missing_paper_id = [x for x in paper2_id if x not in author2_id]
    print(missing_paper_id)

    # update dict
    for x in missing_paper_id:
        id_num_author_dict[x] = np.nan

    # add number of authors to paper1 and paper2
    paper1['Number of Authors'] = [id_num_author_dict[pid] for pid in paper1['Paper ID']]
    paper2['Number of Authors'] = [id_num_author_dict[pid] for pid in paper2['Paper ID']]

    # select cols; 'Sumission Type' (sic) matches the misspelled key
    # written by scrape_2005_2013.py
    paper1 = paper1[['Paper ID', 'Title', 'Type', 'Abstract',
                     'Number of Authors', 'Year']]
    paper2 = paper2[[
        'Paper ID', 'Title', 'Sumission Type', 'Abstract',
        'Number of Authors', 'Year', 'Session', 'Division/Unit'
    ]]

    # update colnames
    paper1.columns = ['Paper ID', 'Title', 'Paper Type', 'Abstract',
                      'Number of Authors', 'Year']
    paper2.columns = ['Paper ID', 'Title', 'Paper Type', 'Abstract',
                      'Number of Authors', 'Year', 'Session', 'Division/Unit']
    paper3.columns = ['Paper ID', 'Year', 'Paper Type', 'Title',
                      'Number of Authors', 'Abstract', 'Session', 'Division/Unit']
    paper4.columns = ['Paper ID', 'Year', 'Paper Type', 'Title',
                      'Number of Authors', 'Abstract', 'Session', 'Division/Unit']

    # concatenate paper df
    paper_df = pd.concat([paper1, paper2, paper3, paper4], axis=0)

    # concatenate session df
    session_df = pd.concat([session1, session2], axis=0)

    # write to file
    paper_df.to_csv(PAPER_DF, index=False)
    author_df.to_csv(AUTHOR_DF, index=False)
    session_df.to_csv(SESSION_DF, index=False)
    print('Files written. All should be in place now.')
```
`scrape_2003_2004.py`:

```python
import pandas as pd
import numpy as np
import time
import re
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

PAPER_03_04 = sys.argv[1]
AUTHOR_03_04 = sys.argv[2]


def click_on_search_papers():
    search_papers = wait.until(EC.element_to_be_clickable((
        By.CSS_SELECTOR,
        "div.menu_item__icon_text_window__text > a.mainmenu_text"
    )))
    search_papers.click()


def get_papers():
    """get all paper elements on the current page"""
    papers = driver.find_elements(
        By.CSS_SELECTOR,
        'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
    )
    return papers


def get_paper_meta(paper, year, paper_meta_dict_list):
    """get paper index, paper title, and paper type

    the author names can be found here, but I'll collect them later
    in the view page
    """
    idx = paper.find_element(By.CSS_SELECTOR, 'td[title="##"]').text  # e.g., '0001'
    paper_id = year + '-' + idx.zfill(4)
    # summary elements:
    summary = paper.find_element(By.CSS_SELECTOR, 'td[title="Summary"]')
    title = summary.find_element(By.CSS_SELECTOR, 'a.search_headingtext').text
    # note: lstrip() strips a character *set*, not a prefix;
    # the 2005-2013 script uses removeprefix() for this reason
    submission_type = summary.find_element(
        By.CSS_SELECTOR, 'td[style="padding: 5px;"]'
    ).text.lstrip(' Individual Submission type: ')
    paper_meta_dict = {
        'Paper ID': paper_id,
        'Title': title,
        'Type': submission_type
    }
    # update the dict list
    paper_meta_dict_list.append(paper_meta_dict)
    return paper_meta_dict


def open_view(paper):
    """Input: paper element

    Aim: open a new window and click 'view'
    """
    action = paper.find_element(By.CSS_SELECTOR, 'td[title="Action"]')
    view_link_e = action.find_element(
        By.CSS_SELECTOR, "li.action_list > a.fieldtext"
    )
    view_link = view_link_e.get_attribute('href')
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(view_link)


def get_title_to_check(paper_meta_dict_list):
    # there are two 'tr.header font.headingtext'; title is the second one
    headingtexts = driver.find_elements(
        By.CSS_SELECTOR, 'tr.header font.headingtext'
    )
    title_to_check = headingtexts[1].text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Title to Check'] = title_to_check
    return title_to_check


def get_authors(paper_meta_dict, author_dict_list):
    paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
    # note that this returns a list since there might be multiple authors
    authors = driver.find_elements(By.CSS_SELECTOR, 'a.search_fieldtext_name')
    for author in authors:
        author_idx = authors.index(author) + 1
        authorNum = len(authors)
        author_elements = author.text.split(' (')
        author_name = author_elements[0]
        # doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
        # some don't contain '()', i.e., affiliation info
        try:
            author_aff = author_elements[1].rstrip(')')
        except:
            author_aff = np.nan
        author_dict = {
            'Paper ID': paper_id,
            'Paper Title': title,
            'Number of Authors': authorNum,
            'Author Position': author_idx,
            'Author Name': author_name,
            'Author Affiliation': author_aff,
        }
        author_dict_list.append(author_dict)


def get_abstract(paper_meta_dict_list):
    # obtain the abstract in the newly opened page
    abstract = driver.find_element(
        By.CSS_SELECTOR, 'blockquote.tight > font.fieldtext'
    ).text
    paper_meta_dict_list[-1]['Abstract'] = abstract
    return abstract


def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
    papers = get_papers()
    for paper in papers:
        # to test: for paper in papers[0:1]:
        paper_idx = papers.index(paper) + 1
        paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
        open_view(paper)
        get_title_to_check(paper_meta_dict_list)
        get_authors(paper_meta_dict, author_dict_list)
        get_abstract(paper_meta_dict_list)
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        print(f'Page {page_num} Paper {paper_idx} is done')
        time.sleep(0.5)


def get_iterators():
    iterators = driver.find_elements(
        By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
    )
    return iterators


if __name__ == '__main__':
    # initiate lists to contain data
    paper_meta_dict_list = []
    author_dict_list = []
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2003~2004
    years = range(3, 5)
    for year in years:
        year = str(year).zfill(2)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2003/2004
        year = f'20{year}'
        print(f'{year} has started!')
        click_on_search_papers()
        # to calculate total pages
        iterators = get_iterators()
        total_pages = int(iterators[-2].text)
        for i in range(1, total_pages + 1):
            page_num = i
            if i >= 10:
                if year == '2004':
                    print('2004!')
                    select = Select(driver.find_element(
                        By.XPATH, '//div[@class="iterator"][1] // select'
                    ))
                    select.select_by_visible_text('+ 20')
                else:
                    # if '2003', click on '20'
                    iterators = get_iterators()
                    iterators[-2].click()
            iterators = get_iterators()
            for j in iterators:
                if (j.text == str(i)):
                    j.click()
                    break
            scrape_one_page(
                year, page_num, paper_meta_dict_list, author_dict_list
            )
            print(f'page {i} is done')
        # go back to the first page
        iterators = get_iterators()
        iterators[1].click()
    print('Everything done!')
    driver.close()
    driver.quit()
    print('Writing to file now...')
    pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_03_04, index=False)
    pd.DataFrame(author_dict_list).to_csv(AUTHOR_03_04, index=False)
```
`scrape_2005_2013.py`:

```python
import pandas as pd
import numpy as np
import time
import math
import re
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

PAPER_2005_2013 = sys.argv[1]
AUTHOR_2005_2013 = sys.argv[2]


def click_on_view_program():
    all_btn = driver.find_elements(
        By.CSS_SELECTOR,
        "div.menu_item__icon_text_window__text > a.mainmenu_text"
    )
    for btn in all_btn:
        if 'Program' in btn.text:
            view_program_btn = btn
            break
    view_program_btn.click()


def click_on_individual_presentations():
    '''click on 'Individual Presentations' '''
    presentations = wait.until(EC.element_to_be_clickable((
        By.XPATH, '//td[@class="tab_topped_window__tab_cell"][2]'
    )))
    presentations.click()


def get_papers():
    """get all paper elements on the current page"""
    papers = driver.find_elements(
        By.CSS_SELECTOR,
        'tr.worksheet_window__row__light, tr.worksheet_window__row__dark'
    )
    return papers


def removeprefix(text, prefix):
    # https://stackoverflow.com/a/16891418
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


def get_paper_meta(paper, year, paper_meta_dict_list):
    """get paper index, paper title, and paper type

    the author names can be found here, but I'll collect them later
    in the view page
    """
    idx = paper.find_element(By.CSS_SELECTOR, 'td[title="##"]').text  # e.g., '0001'
    paper_id = year + '-' + idx.zfill(4)
    # summary elements:
    summary = paper.find_element(By.CSS_SELECTOR, 'td[title="Summary"]')
    title = summary.find_element(By.CSS_SELECTOR, 'a.search_headingtext').text
    summary_info = summary.find_elements(
        By.CSS_SELECTOR, 'td[style="padding: 5px;"] tr'
    )
    session = np.nan
    division = np.nan
    submission_type = np.nan
    research_areas = np.nan
    for i in summary_info:
        if 'In Session Submission' in i.text:
            session = removeprefix(i.text, ' In Session Submission: ')
        elif 'Session Submission Division' in i.text:
            division = removeprefix(i.text, ' Session Submission Division: ')
        elif 'Session Submission Unit' in i.text:
            division = removeprefix(i.text, ' Session Submission Unit: ')
        elif 'Submission type' in i.text:
            submission_type = removeprefix(i.text, ' Individual Submission type: ')
        elif 'Research Areas:' in i.text:
            research_areas = removeprefix(i.text, ' Research Areas: ')
    # 'Sumission Type' (sic) is the key combine_all_data.py expects
    paper_meta_dict = {
        'Paper ID': paper_id,
        'Title': title,
        'Session': session,
        'Division/Unit': division,
        'Sumission Type': submission_type,
        'Research Areas': research_areas,
    }
    # update the dict list
    paper_meta_dict_list.append(paper_meta_dict)
    return paper_meta_dict


def open_view(paper):
    """Input: paper element

    Aim: open a new window and click 'view'
    """
    action = paper.find_element(By.CSS_SELECTOR, 'td[title="Action"]')
    view_link_e = action.find_element(
        By.CSS_SELECTOR, "li.action_list > a.fieldtext"
    )
    view_link = view_link_e.get_attribute('href')
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(view_link)


def get_title_to_check(paper_meta_dict_list):
    # there are two 'tr.header font.headingtext'; title is the second one
    headingtexts = driver.find_elements(
        By.CSS_SELECTOR, 'tr.header font.headingtext'
    )
    title_to_check = headingtexts[1].text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Title to Check'] = title_to_check
    return title_to_check


def get_session_to_check(paper_meta_dict_list):
    session_to_check = driver.find_element(
        By.CSS_SELECTOR, 'blockquote.tight > a.search_headingtext'
    )
    session_to_check = session_to_check.text
    # update the most recent paper_meta_dict
    paper_meta_dict_list[-1]['Session to Check'] = session_to_check
    return session_to_check


def get_authors(paper_meta_dict, author_dict_list):
    paper_id, title = paper_meta_dict['Paper ID'], paper_meta_dict['Title']
    # note that this returns a list since there might be multiple authors
    authors = driver.find_elements(By.CSS_SELECTOR, 'a.search_fieldtext_name')
    for author in authors:
        author_idx = authors.index(author) + 1
        authorNum = len(authors)
        author_elements = author.text.split(' (')
        author_name = author_elements[0]
        # doc: https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.rstrip
        # some don't contain '()', i.e., affiliation info
        try:
            author_aff = author_elements[1].rstrip(')')
        except:
            author_aff = np.nan
        author_dict = {
            'Paper ID': paper_id,
            'Paper Title': title,
            'Number of Authors': authorNum,
            'Author Position': author_idx,
            'Author Name': author_name,
            'Author Affiliation': author_aff,
        }
        author_dict_list.append(author_dict)


def get_abstract(paper_meta_dict_list):
    # abstract
    abstract = driver.find_elements(By.CSS_SELECTOR, 'blockquote.tight')[-1]
    abstract = abstract.text
    abstract = " ".join(abstract.splitlines()).strip()
    paper_meta_dict_list[-1]['Abstract'] = abstract
    return abstract


def scrape_one_page(year, page_num, paper_meta_dict_list, author_dict_list):
    papers = get_papers()
    for paper in papers:
        # to test: for paper in papers[0:1]:
        paper_idx = papers.index(paper) + 1
        paper_meta_dict = get_paper_meta(paper, year, paper_meta_dict_list)
        open_view(paper)
        get_title_to_check(paper_meta_dict_list)
        get_session_to_check(paper_meta_dict_list)
        get_authors(paper_meta_dict, author_dict_list)
        get_abstract(paper_meta_dict_list)
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        print(f'Year {year}, Page {page_num} Paper {paper_idx} is done')
        time.sleep(0.05)


def get_iterators():
    iterators = driver.find_elements(
        By.XPATH, '//div[@class="iterator"][1]/form//a[@class="fieldtext"]'
    )
    return iterators


if __name__ == '__main__':
    # initiate lists to contain data
    paper_meta_dict_list = []
    author_dict_list = []
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2005~2013
    years = range(5, 14)
    for year in years:
        year = str(year).zfill(2)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2005...2013
        year = f'20{year}'
        print(f'{year} has started!')
        click_on_view_program()
        click_on_individual_presentations()
        # to calculate total pages
        iterators = get_iterators()
        total_pages = int(iterators[-2].text)
        for i in range(1, total_pages + 1):
            print(f'page {i} has started')
            page_num = i
            if i < 10:
                pass
            elif i >= 10 and i < 17:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 10')
            elif i >= 17 and i < 27:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 20')
            elif i >= 27 and i < 37:
                select = Select(driver.find_element(
                    By.XPATH, '//div[@class="iterator"][1] // select'
                ))
                select.select_by_visible_text('+ 30')
            else:
                iterators = get_iterators()
                iterators[-2].click()
            # this achieves something I never thought about:
            # when i == 21, after selecting '+ 20', the current iterator is 21;
            # get_iterators() skips the current iterator, so no j equals 21,
            # the loop below is a no-op, and the program goes directly
            # to scrape_one_page()
            iterators = get_iterators()
            for j in iterators:
                if (j.text == str(i)):
                    current_idx = int(j.text)
                    j.click()
                    break
            scrape_one_page(
                year, page_num, paper_meta_dict_list, author_dict_list
            )
        # go back to the first page
        iterators = get_iterators()
        iterators[1].click()
    print('Everything done!')
    driver.close()
    driver.quit()
    print('Writing to file now...')
    pd.DataFrame(paper_meta_dict_list).to_csv(PAPER_2005_2013, index=False)
    pd.DataFrame(author_dict_list).to_csv(AUTHOR_2005_2013, index=False)
```
`scrape_2014_onward_interactive_paper.py`:

```python
import pandas as pd
import numpy as np
import time
import re
import sys
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

INTERACTIVE_SESSION_2014_2018 = sys.argv[1]
INTERACTIVE_AUTHOR_2014_2018 = sys.argv[2]
INTERACTIVE_PAPER_2014_2018 = sys.argv[3]


def click_browse_by_session_type():
    '''click on "browse by session type"'''
    browse_by_session_type = driver.find_elements(
        By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
    )[3]
    browse_by_session_type.click()


def click_interactive_paper_session():
    '''click the "interactive paper session" button'''
    paper_session = driver.find_element(
        By.XPATH,
        '//li[@class="ui-li-has-count ui-first-child"] //a[@class="ui-btn"]'
    )
    paper_session.click()


def get_sessions():
    '''these are session links'''
    sessions = driver.find_elements(
        By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
    )
    return sessions


def update_session_meta(year, session_tuples):
    '''update session metadata: session title, session sub unit,
    session chair name and affiliation
    '''
    session_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    session_title = session_title_e.text
    # sub unit, cosponsor, chair, the presentations
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    h4s_texts = [i.text for i in h4s]
    sub_unit_e_idx = h4s_texts.index('Sub Unit')
    # sub unit and chair are very tricky. Some examples:
    # year 2015, session "Environmental Journalism: Coverage, Reader
    # Response, and Mediators": 'Chair' is below 'Cosponsor'.
    # year 2015, session "B.E.S.T.: Organizations, Communication, and
    # Technology": a little strange because it has an abstract, yet it
    # does not have the gray area.
    # My conclusion: the gray box for sub unit seems to always be the
    # first one, so I can use index 4. For chair, I take its index
    # and add 5.
    try:
        sub_unit_e = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
        )[4]
        sub_unit = sub_unit_e.text
    except:
        sub_unit = None
    # if there is no 'Chair' (for example, session 200 of 2016),
    # there is no need to proceed further.
    if 'Chair' not in h4s_texts:
        chair_name = None
        chair_aff = None
    else:
        # initialize so an empty chair gray box cannot leave these unset
        chair_name = None
        chair_aff = None
        try:
            if 'Cosponsor' in h4s_texts:
                chair_e_idx = 6
            else:
                chair_e_idx = 5
            # chair_e_idx = h4s_texts.index('Chair')
            chair_graybox = driver.find_elements(
                By.CSS_SELECTOR,
                'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
            )[chair_e_idx]
            chair_es = chair_graybox.find_elements(By.CSS_SELECTOR, 'li')
            if chair_es:
                if len(chair_es) == 1:
                    chair_info = chair_es[0].text
                    chair_name = chair_info.split(', ')[0]
                    chair_aff = chair_info.split(', ')[1]
                else:
                    # this solves the issue of multiple chairs, for example,
                    # year 2018, session 'Research Escalator - Part 1'
                    chair_name = ''
                    chair_aff = ''
                    for chair_e in chair_es:
                        chair_info = chair_e.text
                        chair_name_i = chair_info.split(', ')[0]
                        chair_aff_i = chair_info.split(', ')[1]
                        chair_name += chair_name_i
                        chair_aff += chair_aff_i
                        if chair_e != chair_es[-1]:
                            chair_name += '; '
                            chair_aff += '; '
        except:
            chair_name = None
            chair_aff = None
    session_tuples.append((
        year,
        'Interactive Paper Session',
        session_title,
        sub_unit,
        chair_name,
        chair_aff,
    ))
    # return session title and sub_unit so that I can use them later
    return session_title, sub_unit


def get_author_num():
    """get the author elements and the author number, which I use later
    in get_paper_info and get_author_info
    """
    authors_e = driver.find_elements(
        By.CSS_SELECTOR,
        'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
    )[2:]
    author_num = len(authors_e)
    return authors_e, author_num


def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
    '''get author info and update author tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    for author in authors_e:
        author_position = authors_e.index(author) + 1
        # split on the first ', ' only, to handle 'person, aff, dept'
        try:
            author_name, author_aff = author.text.split(', ', 1)
        except:
            # for example: 2016, 'Gaining Access to Social Capital',
            # Louis Leung has no affiliation
            author_name = author.text
            author_aff = None
        author_tuples.append((
            paper_id,
            paper_title,
            year,
            author_num,
            author_position,
            author_name,
            author_aff
        ))


def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
    '''get paper info and update paper tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    paper_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    paper_title = paper_title_e.text
    abstract = driver.find_element(By.CSS_SELECTOR, 'blockquote > p').text
    paper_tuples.append((
        paper_id,
        year,
        'Poster',
        paper_title,
        author_num,
        abstract,
        session_title,
        sub_unit,
    ))
    # return paper title so I can use it in get_author_info
    return paper_title


def get_papers():
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
        # I do not know why, but the first two selections are not paper
        # elements; I need to remove them.
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
        )[2:]
        # this is to prevent something like the session of "Good Grief!
        # Disasters, Crises, and High-Risk Organizational Environments"
        return papers
    elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3) a.ui-icon-carat-r'
        )[2:]
        return papers
    else:
        # Why does this happen? Go to year 2016, session 262 and you'll
        # see that there are no papers; session 103 of year 2014 also
        # has no papers.
        print('THERE PROBABLY ARE NO PAPERS HERE')
        # to_scrape_later_tuples.append((year, session_index))


if __name__ == '__main__':
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2014-2018
    # years = range(14, 19)
    years = [14, 15, 16, 17, 18]
    session_tuples = []
    author_tuples = []
    paper_tuples = []
    for year in years:
        year = str(year)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2014...2018
        year = f'20{year}'
        print(f'{year} has started!')
        click_browse_by_session_type()
        click_interactive_paper_session()
        sessions = get_sessions()
        print(f'There are {len(sessions)} sessions.')
        # randomly choose 5 sessions for testing
        random_sessions = random.sample(sessions, 5)
        # to assign paper ids: initiate at 0 and add 1 each time
        paper_id = 0
        for s in sessions:
            # for s in random_sessions:
            session_index = sessions.index(s)
            s_link = s.get_attribute('href')
            # open a new window and switch to it
            driver.execute_script("window.open('');")
            driver.switch_to.window(driver.window_handles[1])
            # open the session
            driver.get(s_link)
            session_title, sub_unit = update_session_meta(year, session_tuples)
            if 'preconference:' not in session_title.lower():
                print(f'Session {session_index} has started')
                papers = get_papers()
                # sometimes `papers` is None, for example, year 2016, session
                # "Communication and Technology, Game Studies, and
                # Information Systems Joint Reception"
                if papers:
                    print(f'There are {len(papers)} papers.')
                    for p in papers:
                        # 2016, session 85 has troubles
                        try:
                            p_link = p.get_attribute('href')
                            driver.execute_script("window.open('');")
                            driver.switch_to.window(driver.window_handles[2])
                            driver.get(p_link)
                            authors_e, author_num = get_author_num()
                            paper_title = get_paper_info(
                                paper_tuples, author_num, session_title,
                                sub_unit, year, paper_id
                            )
                            get_author_info(
                                authors_e, author_num, author_tuples,
                                paper_title, paper_id, year
                            )
                        except:
                            print('This paper is unavailable.')
                        paper_id += 1
                        print(f'Paper {papers.index(p) + 1} is done.')
                        time.sleep(0.5 + random.uniform(0, 0.5))
                        # close window 2 and switch back to window 1
                        driver.close()
                        driver.switch_to.window(driver.window_handles[1])
                print(f'Session {session_index} is done.')
                time.sleep(0.5 + random.uniform(0, 0.5))
            else:
                print(f'Session {session_index} is Preconference.')
            # close window 1 and switch back to window 0
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
    print('Everything done!')
    driver.close()
    driver.quit()
    pd.DataFrame(session_tuples, columns=[
        'year', 'session type', 'session title', 'sub unit',
        'chair name', 'chair aff',
    ]).to_csv(INTERACTIVE_SESSION_2014_2018, index=False)
    pd.DataFrame(author_tuples, columns=[
        'paper id', 'paper title', 'year', 'author number',
        'author position', 'author name', 'author aff'
    ]).to_csv(INTERACTIVE_AUTHOR_2014_2018, index=False)
    pd.DataFrame(paper_tuples, columns=[
        'paper id', 'year', 'paper type', 'paper title', 'author number',
        'abstract', 'session title', 'sub unit'
    ]).to_csv(INTERACTIVE_PAPER_2014_2018, index=False)
```
`scrape_2014_onward_paper_session.py` (this mirrors the interactive paper scraper above, but targets regular paper sessions):

```python
import pandas as pd
import numpy as np
import time
import re
import sys
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

SESSION_2014_2018 = sys.argv[1]
AUTHOR_2014_2018 = sys.argv[2]
PAPER_2014_2018 = sys.argv[3]


def click_browse_by_session_type():
    '''click on "browse by session type"'''
    browse_by_session_type = driver.find_elements(
        By.CSS_SELECTOR, "li.ui-li-has-icon.ui-last-child > a"
    )[3]
    browse_by_session_type.click()


def click_paper_session():
    '''click the "paper session" button'''
    paper_session = driver.find_element(
        By.XPATH,
        '//li[@class="ui-li-has-count"][3] //a[@class="ui-btn"]'
    )
    paper_session.click()


def get_sessions():
    '''these are session links'''
    sessions = driver.find_elements(
        By.CSS_SELECTOR, 'a.ul-li-has-alt-left.ui-btn'
    )
    return sessions


def update_session_meta(year, session_tuples):
    '''update session metadata: session title, session sub unit,
    session chair name and affiliation
    '''
    session_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    session_title = session_title_e.text
    # sub unit, cosponsor, chair, the presentations
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    h4s_texts = [i.text for i in h4s]
    sub_unit_e_idx = h4s_texts.index('Sub Unit')
    # sub unit and chair are very tricky. Some examples:
    # year 2015, session "Environmental Journalism: Coverage, Reader
    # Response, and Mediators": 'Chair' is below 'Cosponsor'.
    # year 2015, session "B.E.S.T.: Organizations, Communication, and
    # Technology": a little strange because it has an abstract, yet it
    # does not have the gray area.
    # My conclusion: the gray box for sub unit seems to always be the
    # first one, so I can use index 4. For chair, I take its index
    # and add 5.
    try:
        sub_unit_e = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
        )[4]
        sub_unit = sub_unit_e.text
    except:
        sub_unit = None
    # if there is no 'Chair' (for example, session 200 of 2016),
    # there is no need to proceed further.
    if 'Chair' not in h4s_texts:
        chair_name = None
        chair_aff = None
    else:
        # initialize so an empty chair gray box cannot leave these unset
        chair_name = None
        chair_aff = None
        try:
            if 'Cosponsor' in h4s_texts:
                chair_e_idx = 6
            else:
                chair_e_idx = 5
            # chair_e_idx = h4s_texts.index('Chair')
            chair_graybox = driver.find_elements(
                By.CSS_SELECTOR,
                'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow'
            )[chair_e_idx]
            chair_es = chair_graybox.find_elements(By.CSS_SELECTOR, 'li')
            if chair_es:
                if len(chair_es) == 1:
                    chair_info = chair_es[0].text
                    chair_name = chair_info.split(', ')[0]
                    chair_aff = chair_info.split(', ')[1]
                else:
                    # this solves the issue of multiple chairs, for example,
                    # year 2018, session 'Research Escalator - Part 1'
                    chair_name = ''
                    chair_aff = ''
                    for chair_e in chair_es:
                        chair_info = chair_e.text
                        chair_name_i = chair_info.split(', ')[0]
                        chair_aff_i = chair_info.split(', ')[1]
                        chair_name += chair_name_i
                        chair_aff += chair_aff_i
                        if chair_e != chair_es[-1]:
                            chair_name += '; '
                            chair_aff += '; '
        except:
            chair_name = None
            chair_aff = None
    session_tuples.append((
        year,
        'Paper Session',
        session_title,
        sub_unit,
        chair_name,
        chair_aff,
    ))
    # return session title and sub_unit so that I can use them later
    return session_title, sub_unit


def get_author_num():
    """get the author elements and the author number, which I use later
    in get_paper_info and get_author_info
    """
    authors_e = driver.find_elements(
        By.CSS_SELECTOR,
        'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
    )[2:]
    author_num = len(authors_e)
    return authors_e, author_num


def get_author_info(authors_e, author_num, author_tuples, paper_title, paper_id, year):
    '''get author info and update author tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    for author in authors_e:
        author_position = authors_e.index(author) + 1
        # split on the first ', ' only, to handle 'person, aff, dept'
        try:
            author_name, author_aff = author.text.split(', ', 1)
        except:
            # for example: 2016, 'Gaining Access to Social Capital',
            # Louis Leung has no affiliation
            author_name = author.text
            author_aff = None
        author_tuples.append((
            paper_id,
            paper_title,
            year,
            author_num,
            author_position,
            author_name,
            author_aff
        ))


def get_paper_info(paper_tuples, author_num, session_title, sub_unit, year, paper_id):
    '''get paper info and update paper tuples'''
    paper_id = year + '-' + str(paper_id).zfill(4)
    paper_title_e = driver.find_element(By.CSS_SELECTOR, 'h3')
    paper_title = paper_title_e.text
    abstract = driver.find_element(By.CSS_SELECTOR, 'blockquote > p').text
    # abstract = " ".join(abstract.splitlines()).strip()
    paper_tuples.append((
        paper_id,
        year,
        'Paper Session',
        paper_title,
        author_num,
        abstract,
        session_title,
        sub_unit,
    ))
    # return paper title so I can use it in get_author_info
    return paper_title


def get_papers():
    h4s = driver.find_elements(By.CSS_SELECTOR, 'h4')
    if h4s[-1].get_attribute('innerHTML') == 'Individual Presentations':
        # I do not know why, but the first two selections are not paper
        # elements; I need to remove them.
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:last-child a.ui-icon-carat-r'
        )[2:]
        # this is to prevent something like the session of "Good Grief!
        # Disasters, Crises, and High-Risk Organizational Environments"
        return papers
    elif h4s[-1].get_attribute('innerHTML') in ['Respondent', 'Respondents']:
        papers = driver.find_elements(
            By.CSS_SELECTOR,
            'ul.ui-listview.ui-listview-inset.ui-corner-all.ui-shadow:nth-last-child(3) a.ui-icon-carat-r'
        )[2:]
        return papers
    else:
        # Why does this happen? Go to year 2016, session 262 and you'll
        # see that there are no papers; session 103 of year 2014 also
        # has no papers.
        print('THERE PROBABLY ARE NO PAPERS HERE')
        # to_scrape_later_tuples.append((year, session_index))


if __name__ == '__main__':
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    urlBase = 'https://convention2.allacademic.com/one/ica/ica'
    # scrape 2014-2018
    # years = range(14, 19)
    years = [14, 15, 16, 17, 18]
    # there are always exceptions, for example, 2016 session 262
    session_tuples = []
    author_tuples = []
    paper_tuples = []
    for year in years:
        year = str(year)
        url = urlBase + year
        driver.get(url)
        # year in the form of 2014...2018
        year = f'20{year}'
        print(f'{year} has started!')
        click_browse_by_session_type()
        click_paper_session()
        sessions = get_sessions()
        print(f'There are {len(sessions)} sessions.')
        # randomly choose 5 sessions for testing
        random_sessions = random.sample(sessions, 5)
        # to assign paper ids: initiate at 0 and add 1 each time
        paper_id = 0
        for s in sessions:
            # for s in random_sessions:
            session_index = sessions.index(s)
            s_link = s.get_attribute('href')
            # open a new window and switch to it
            driver.execute_script("window.open('');")
            driver.switch_to.window(driver.window_handles[1])
            # open the session
            driver.get(s_link)
            session_title, sub_unit = update_session_meta(year, session_tuples)
            if 'preconference:' not in session_title.lower():
                print(f'Session {session_index} has started')
                papers = get_papers()
                # sometimes `papers` is None, for example, year 2016, session
                # "Communication and Technology, Game Studies, and
                # Information Systems Joint Reception"
                if papers:
                    print(f'There are {len(papers)} papers.')
                    for p in papers:
                        # 2016, session 85 has troubles
                        try:
                            p_link = p.get_attribute('href')
                            driver.execute_script("window.open('');")
                            driver.switch_to.window(driver.window_handles[2])
                            driver.get(p_link)
                            authors_e, author_num = get_author_num()
                            paper_title = get_paper_info(
                                paper_tuples, author_num, session_title,
                                sub_unit, year, paper_id
                            )
                            get_author_info(
                                authors_e, author_num, author_tuples,
                                paper_title, paper_id, year
                            )
                        except:
                            print('This paper is unavailable.')
                        paper_id += 1
                        print(f'Paper {papers.index(p) + 1} is done.')
                        time.sleep(0.5 + random.uniform(0, 0.5))
                        # close window 2 and switch back to window 1
                        driver.close()
                        driver.switch_to.window(driver.window_handles[1])
                print(f'Session {session_index} is done.')
                time.sleep(0.5 + random.uniform(0, 0.5))
            else:
                print(f'Session {session_index} is Preconference.')
            # close window 1 and switch back to window 0
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
    print('Everything done!')
    driver.close()
    driver.quit()
    pd.DataFrame(session_tuples, columns=[
        'year', 'session type', 'session title', 'sub unit',
        'chair name', 'chair aff',
    ]).to_csv(SESSION_2014_2018, index=False)
    pd.DataFrame(author_tuples, columns=[
        'paper id', 'paper title', 'year', 'author number',
        'author position', 'author name', 'author aff'
    ]).to_csv(AUTHOR_2014_2018, index=False)
    pd.DataFrame(paper_tuples, columns=[
        'paper id', 'year', 'paper type', 'paper title', 'author number',
        'abstract', 'session title', 'sub unit'
    ]).to_csv(PAPER_2014_2018, index=False)
```
The `shell` directives of the individual Snakefile rules:

```
shell: "python scripts/scrape_2003_2004.py {output}"
shell: "python scripts/scrape_2005_2013.py {output}"
shell: "python scripts/scrape_2014_onward_paper.py {output}"
shell: "python scripts/scrape_2014_onward_interactive_paper.py {output}"
shell: "python scripts/combine_all_data.py {input} {output}"
```
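Each of these `shell:` directives sits inside a rule that declares the files it produces, which is how snakemake wires the scrapers to `combine_all_data.py`. A minimal sketch of one such rule; the rule name and output paths are assumptions, not copied from the actual `Snakefile`:

```
# Snakefile (sketch; output paths are assumed)
rule scrape_2003_2004:
    output:
        "data/interim/paper_2003_2004.csv",
        "data/interim/author_2003_2004.csv"
    shell:
        "python scripts/scrape_2003_2004.py {output}"
```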