Social Web: (Big) Data Mining - JSB454

Title:	Social Web: (Big) Data Mining
Czech title:	Data mining
Guaranteed by:	Department of Sociology (23-KS)
Faculty:	Faculty of Social Sciences
Actual:	from 2016 to 2017
Semester:	winter
E-Credits:	7
Examination process:	winter s.:combined
Hours per week, examination:	winter s.:1/1, Ex [HT]
Capacity:	45 / unknown (30)
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	taught
Language:	English
Teaching methods:	full-time
Additional information:	https://bit.ly/socialwebdatamining
Note:	course can be enrolled in outside the study plan; enabled for web enrollment; priority enrollment if the course is part of the study plan

Guarantor:	Mgr. Jakub Růžička
Teacher(s):	Mgr. Jakub Růžička

Annotation

Last update: Mgr. Jakub Růžička (25.09.2017)

The course gives a professional and academic introduction to web & social media data mining. Emphasis is put on the intersection of data science, social sciences and computer science. Full course syllabus: bit.ly/socialwebdatamining (aiming to get a bit closer to the first part of bit.ly/datamachine).

The course is taught concurrently with the 'Social Web: (Big) Data Mining for Masters' course for graduate students (JSM575), who will help you out. In weeks when face-to-face sessions do not take place, students can use the open educational resources suggested for each session and attend a webinar to reinforce their skills. Both master's and bachelor's students can work on their research projects (final examination) in cooperation with the partner of this course, KPMG Czech Republic (home.kpmg.com/cz/en/home.html).

NOTE: Because the number of enrolled students may exceed the classroom capacity, please let me know at jameslittlerose@gmail.com if you are on the waiting list.

Aim of the course

Last update: Mgr. Jakub Růžička (20.09.2014)

Intended Learning Outcomes | in which way the course should make your life better and/or improve your skills.

Upon completion of the course, the students will be able to:

understand the intersection of data science, humanities & ICT within the realm of web & social media (big) data mining
ask meaningful questions, perform basic analytical operations on both structured & unstructured web / social media data, and draw conclusions for decision making
understand basic concepts and conduct subsequent data preprocessing, analysis & visualization related to social network analysis, web mining, social media mining & text mining
take a positive approach towards data science & computer programming, gain confidence in basic operations, and use and/or modify third-party (open) source code and/or an analytical procedure/tool
describe advanced data mining methods & applications for further self education (and/or subsequent institutional education) and/or professional/academic specialization

Course completion requirements

Last update: Mgr. Jakub Růžička (19.09.2016)

Requirements, Examination & Assignments

(I.) Webinar (30%) collaborative, teams of 2-3

(II.) Project/Research (70%) collaborative, teams of 2-3

* the percentage stands for the significance of the assignment regarding the final grade

the grade is calculated from the WEBINAR (30%) and the PROJECT/RESEARCH defence (70%) | the course is graded A (>=85%), B (>=70%), C (>=60%), D (>=50%), or E (<50%) | A, B or C is needed to pass the course
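
As a rough illustration of how the weights and thresholds above combine, here is a minimal Python sketch. The function name `final_grade` and the 0-100 input scale are assumptions made for this example; they are not part of the official course rules.

```python
def final_grade(webinar_pct, project_pct):
    """Return (total percentage, letter, passed) for two scores on a 0-100 scale."""
    # Weights taken from the course description: webinar 30%, project/research 70%.
    total = (30 * webinar_pct + 70 * project_pct) / 100
    if total >= 85:
        letter = "A"
    elif total >= 70:
        letter = "B"
    elif total >= 60:
        letter = "C"
    elif total >= 50:
        letter = "D"
    else:
        letter = "E"
    # Per the rule above, only A, B or C counts as a pass.
    passed = letter in ("A", "B", "C")
    return total, letter, passed


print(final_grade(80, 90))
```

Because the project/research defence carries 70% of the weight, a strong webinar alone cannot lift a weak project into the passing range.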

(I.) Webinar (30%) | collaborative, teams of 2-3 students

assignment:

familiarize yourself (in brief) with an assigned data mining tool and/or application (you might also choose your own if approved by the lecturer) and introduce it
replicate an analysis (cite your source) using the tool and explain the procedure & background information
prepare a short (5-15min) live webinar for your classmates & answer their questions (questions regarding your particular analysis only)
let them do peer assessment of your work

motivation:

the volume of free & open source data science procedures, tools & applications grows rapidly, so you definitely won't 'be done' after passing this course
the volume of open educational resources (text, video, interactive etc.) is huge, and the tools are usually well-documented & include sample analyses provided by their creators and/or communities
you'll learn most through a hands-on approach and you'll get feedback from your peers

(20%) brief description of the tool: what it is for | how one can use it | where one can get it & learn it

(60%) replication of an analysis: background information | clarity of the procedure

(20%) question responses: only questions related to the particular analysis count (one doesn't become an expert on a tool by replicating one analysis =))

(II.) Project/Research (70%) | collaborative, teams of 2-3 students

assignment:

mine/scrape, analyze & visualize available structured & unstructured web & social media data related to your team's area of research
prepare an executive summary in the form of a storyline highlighting the most important findings for decision making
defend your project/research (examination)

motivation:

preparation for conducting commercial and/or academic research including web & social media data mining & related analyses
an opportunity to try everything out 'under supervision' & get feedback on your work
practicing teamwork skills, organization & division of labour within a larger work group / institution

(30%) executive summary, clarity & coherence of the data story and meeting all requirements on analyses used (see below)

(40%) appropriateness & correctness of mining procedures & analyses used and of your data interpretation, consideration of limitations of your outcomes (critical context)

(30%) answers to questions regarding procedures, analyses & other 'technical' details of your project/research

Discussed within the project defence & included in the project executive summary:

the story of your data (for decision making within your specialization): visualizations, descriptions, theoretical background, interpretations & highlights
social network analysis
web scraping
social media mining
text mining & natural language processing
critical review of the project & limitations of the generalizability of your research
analytical appendix (with a hyperlink to source tables & datasets)
'technical' appendix (computations, programming code, requests, queries etc.)

Literature

Last update: Mgr. Jakub Růžička (20.09.2014)

you are not required to read any of the following, but you might find it handy when looking for inspiration, reference, sample analyses, sample code or when some part of the course takes your interest so that you want to follow up with more in-depth self-directed study
further online/paperback study resources, tutorials, libraries, applications & tools will be introduced within specific topics of the course

GOLBECK, Jennifer. ANALYZING THE SOCIAL WEB. Amsterdam: Morgan Kaufmann, 2013. ISBN 01-240-5531-1.

O'NEIL, Cathy and Rachel SCHUTT. DOING DATA SCIENCE. Sebastopol, CA: O'Reilly, 2013. ISBN 14-493-5865-9.

MCKINNEY, Wes. PYTHON FOR DATA ANALYSIS: DATA WRANGLING WITH PANDAS, NUMPY, AND IPYTHON. Beijing: O'Reilly Media. ISBN 978-1449319793.

RUSSELL, Matthew A. MINING THE SOCIAL WEB: DATA MINING FACEBOOK, TWITTER, LINKEDIN, GOOGLE+, GITHUB, AND MORE. 2nd ed. Sebastopol, CA: O'Reilly, 2014. ISBN 978-1-449-36761-9.

JANERT, Philipp K. DATA ANALYSIS WITH OPEN SOURCE TOOLS. Sebastopol, CA: O'Reilly. ISBN 05-968-0235-8.

WASSERMAN, Stanley and Katherine FAUST. SOCIAL NETWORK ANALYSIS: METHODS AND APPLICATIONS. New York: Cambridge University Press, 1994. ISBN 05-213-8707-8.

HANSEN, Derek, Ben SHNEIDERMAN and Marc SMITH. ANALYZING SOCIAL MEDIA NETWORKS WITH NODEXL: INSIGHTS FROM A CONNECTED WORLD. Burlington, MA: Morgan Kaufmann, 2011. ISBN 01-238-2229-7.

STEELE, Julie and Noah ILIINSKY. BEAUTIFUL VISUALIZATION. Sebastopol, CA: O'Reilly, 2010. ISBN 14-493-7986-9.

FRY, Ben. VISUALIZING DATA. Sebastopol, CA: O'Reilly, 2007. ISBN 05-965-1455-7.

RAJARAMAN, Anand and Jeffrey ULLMAN. MINING OF MASSIVE DATASETS. Cambridge: Cambridge University Press, 2012. ISBN 11-070-1535-9.

NORTH, Matthew. DATA MINING FOR THE MASSES. Global Text Project, 2012. ISBN 06-156-8437-8.

PROVOST, Foster. DATA SCIENCE FOR BUSINESS: WHAT YOU NEED TO KNOW ABOUT DATA MINING AND DATA-ANALYTIC THINKING. Sebastopol, CA: O'Reilly. ISBN 978-1-449-36132-7.

MINELLI, Michael, Michele CHAMBERS and Ambiga DHIRAJ. BIG DATA BIG ANALYTICS: EMERGING BUSINESS INTELLIGENCE AND ANALYTIC TRENDS FOR TODAY'S BUSINESSES. Wiley, 2013. ISBN 111814760X.

STATSOFT. ELECTRONIC STATISTICS TEXTBOOK [online]. 2013. https://www.statsoft.com/textbook

https://www.python.org/doc/

http://www.w3schools.com/

https://github.com/

http://stackexchange.com/sites#

https://developers.facebook.com/docs/

https://dev.twitter.com/docs

https://developer.linkedin.com/apis

http://instagram.com/developer/

https://developers.google.com/+/

https://developers.pinterest.com/

https://developer.foursquare.com/

http://flowingdata.com/

http://www.informationisbeautiful.net/

Teaching methods

Last update: Mgr. Jakub Růžička (20.09.2014)

the course consists of:

lectures
tutorials/seminars
guest lectures (possibly webinars)
student webinars

background, how-to, support & inspiration during lectures & tutorials/seminars and/or online course materials for self-directed students

storytelling | the course topics will be tied together by obtaining real-time (& real-life) data for the decision making of a fictional political party

teams of 2-3 students will be formed in response to the need to study a more specific area of the political campaign | teams will be differentiated by specific topic/area of interest rather than by type of analysis

collaboration | teamwork & knowledge sharing will be strongly encouraged & facilitated
| collaboration has its downsides as well, but since there are too many 'individual work' courses & too few 'team work' courses, let's try to work together for a change

workload | 150 hours:

lectures 16h
tutorials/seminars 16h
assignments

team project 70h
webinar 28h

self-study 20h

Syllabus

Last update: Mgr. Jakub Růžička (25.09.2017)

lectures are followed by tutorials in order to put knowledge into practice | the exact dates & content of the lectures may be subject to change based on pace & requirements of the course group

Session #2:
Graph Theory | Social Network Analysis | Statistical Procedures, Apps & Tools
Pseudocoding | Introduction to Programming in Python (& R language comparison) | Data Exploration & Preprocessing
Web Scraping | Data Cleaning & Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools

Session #3:
Social Media Mining | Data Cleaning & Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
Text Mining | Natural Language Processing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools
Data Visualization | Data Storytelling | Electronic Publishing | Python Implementation & Libraries, Statistical Procedures, Apps & Tools

Session #4:
Student Webinars | Introducing Various Free & Open Source Data Mining Software & Apps
Machine Learning, Recommender Systems & Other More Advanced Topics | Large-Scale DataSets | MapReduce, Hadoop, NoSQL
Course Review | Semestral Projects Consultation & Adjustments | The Remaining 99% of Data Science

Entry requirements

Last update: Mgr. Jakub Růžička (20.09.2014)

beginner (quite =)) friendly:

although the course might be challenging for students with no analytical and/or computing background (introductory-level courses and/or professional experience), most of the time you won't be required to write your own computer code 'from scratch' (that would require another course); instead, you'll be provided with working code (explained in pseudocode) that you'll customize
user-level knowledge of social media is assumed
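
To give a feel for what "working code explained in pseudocode" might look like, here is a small hypothetical sketch of a word-frequency count, a basic text mining step. It is not taken from the course materials: the sample posts, the stop-word list and the function name `top_words` are all assumptions made for illustration, and they are the kind of parts a student would customize.

```python
import re
from collections import Counter

# Hypothetical sample data; in a real project this would be scraped or mined text.
POSTS = [
    "Data mining is fun",
    "Mining social media data is useful",
]

# Hypothetical stop-word list; customize it for your own topic.
STOP_WORDS = {"is", "the", "a"}


def top_words(posts, n=3):
    # PSEUDOCODE:
    #   for each post: lowercase it -> split it into words -> drop stop words
    #   count the remaining words across all posts
    #   return the n most frequent words with their counts
    counts = Counter()
    for post in posts:
        words = re.findall(r"[a-z']+", post.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


print(top_words(POSTS))
```

Changing `POSTS`, `STOP_WORDS` or `n` is enough to adapt the sketch to another dataset without rewriting the logic.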

NOTE: Several software packages requiring installation & personalization will be used within the course. BYOD (Bring Your Own Device) is therefore recommended.