/news

Syllabus and news for COSC445/545 Fundamentals of Digital Archeology

License: GNU General Public License v3.0 (GPL-3.0)

Advertisements

  • I am looking for motivated undergraduates to work on exciting data science research involving machine learning text and image analysis starting now/Spring'19
  • I am looking for students interested in pursuing a PhD starting Fall 2019
  • If you want to learn advanced data analysis/text analysis/image analysis, you may consider the graduate course Evidence Engineering
  • If you are interested in visualization, there is COSC494/557 offered in spring

Projects are due Dec 12

Class on Dec 3

    1. NFL data Presentation
    2. Twitter Analysis Presentation
    3. Social Media Presentation

Class on Nov 30

    1. JS Frameworks Presentation
    2. VideoFile Uniqueness Presentation

Class on Nov 28

    1. Flight Cost Presentation
    2. NPM Vulnerabilities Presentation

Class on Nov 26

    1. LinkedIn Presentation
    2. Forensic Imaging Presentation

Class on Nov 19, 21

  • Wrap-up

Class on Nov 16

  • MiniProject3 due. The notebook/RStudio submission has to have:
    • a hypothesis, e.g., "I think the priority is going to be affected by ..."
    • an explanation of how the needed measures are calculated from the provided data
    • a descriptive analysis of the proposed measures
    • a transformation and cleaning statement
    • a correlation analysis and a statement about whether or not some of the measures are too correlated and need to be dropped
    • fitting of the statistical model
    • interpretation of the coefficients
  • Please use lectures/fdacStats.ipynb and lecture slides for guidance
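
As a rough illustration of the transformation, correlation, and model-fitting items listed above, here is a minimal sketch in plain Python. The data values, the measure names (`n_comments`, `n_watchers`, `priority`), and the 0.9 correlation threshold are all made up for illustration and are not part of the assignment; lectures/fdacStats.ipynb remains the reference for the actual analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical measures for six issues (illustrative values only).
n_comments = [1, 2, 4, 8, 16, 32]
n_watchers = [2, 4, 8, 16, 32, 64]
priority = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

# Transformation: counts are heavily skewed, so work on a log scale.
log_comments = [math.log2(c) for c in n_comments]
log_watchers = [math.log2(w) for w in n_watchers]

# Correlation analysis: flag predictors that are nearly collinear.
r = pearson(log_comments, log_watchers)
too_correlated = abs(r) > 0.9  # assumed threshold, not a course requirement

# Model fit on the one predictor that is kept.
intercept, slope = fit_line(log_comments, priority)
print(f"r={r:.2f} drop_one={too_correlated} "
      f"intercept={intercept:.2f} slope={slope:.2f}")
```

In this toy data the two predictors carry the same information, so one of them is dropped, and the fitted slope reads as the change in priority associated with each doubling of the comment count.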

Class on Nov 7, 9, 12, 14

  • will meet with each team to hear progress reports on final project
    • Twitter and SocialMediaVideo on 7
    • VideoFileUniqueness and NPM Vulnerabilities on 9
    • NFL + LinkedIn on 12
    • Frameworks + FlightCost on 14
    • Forensic on 16

Class on Nov 5

  • Lecture on text analysis

Class on Nov 02

  • MiniProject2 phase2 due. As of 1:30PM Nov 02:

  • I don't see forks for: nschwerz

  • I don't see any activity in the forks:

    • CipherR9
    • mander59

Class on Oct 31

  • work on final project/Miniproject

Class on Oct 29

  • MiniProject3 introduced
  • Please schedule final project presentations

Class on Oct 26

  • Guest lecture: Doina Caragea Use of Twitter for Disaster Management

Class on Oct 24

  • Finished data analysis presentation

Class on Oct 22

  • Data analysis lecture

Class on Oct 19

  • work on final+mini-projects
  • clarifications as needed

Class on Oct 17

  • MiniProject2 part 1: final final due
  • MiniProject2 part 2: clarified

Class on Oct 15

  • MiniProject2 part 1: discovery due
  • MiniProject2 part 2: retrieval introduced

Class on Oct 8, 10, 12

  • Sprint 2 of the final project due
  • Working on final project
  • Working on MiniProject2 part 1: discovery

Class on Oct 5

  • Working on final project
  • Working on MiniProject2 part 1: discovery
  • Clarifications on MiniProject2 part 1: discovery

Class on Oct 3

  • Clarifications on MiniProject2 part 1: discovery

Class on Oct 1

  • GC assignment due: any remaining questions
  • Presentations of the remaining proposals
  • Introducing MiniProject2 part 1: discovery

Class on Sep 28

  • Final project proposals are due. The group needs to submit a project proposal (1.5-2 pages in IEEE format; see https://www.overleaf.com/latex/templates/preparation-of-papers-for-ieee-sponsored-conferences-and-symposia/zfnqfzzzxghk). The proposal should provide a brief motivation for the project, a detailed discussion of the data that will be obtained or used in the project, the responsibilities of each member, a time-line of milestones, and the expected outcome.
  • Presentations of the proposals (if any are ready)
  • Finalize the proposal
  • Ensure milestone for the next sprint is set and issues (tasks) assigned for everyone on the team
  • Ensure the gcloud connection is fully functional

Class on Sep 26

  • Clarifications on the use of GC

Class on Sep 24

  • Introducing cloud infrastructure
    • assignment 1: set up use of the GC.
  • Data storage

Class on Sep 21

  • Work on the proposals for the final project

Class on Sep 19

  • Group representatives for MiniProject1
    • Group 6
    • Group 4
  • Work on the proposals for the final project

Class on Sep 17

  • Group representatives for MiniProject1

    • Group 5
    • Group 1
    • Group 3
    • Group 7
    • Group 6
  • Start working on the proposals for the final project

Class on Sep 14

  • Teams finalized for the course project
  • *** Moved from Wed *** Group presentations of the results of MiniProject1

Class on Sep 12

Class on Sep 10

  • Final pitches for course projects
  • Complete MiniProject1 and present to assigned peers

Class on Sep 07

  • Pitches for course projects

    • Determination of Video File Uniqueness
    • Scraping LinkedIn
    • Explaining Changes in Downloads
    • K-topic: Selecting optimal K
  • Clarifications for MiniProject1 - Teaming analysis

Class on Sep 05

  • Pitches for course projects

    • Text analysis of twitter data associated with disasters
    • Analysis of Social Media Videos
    • Automatic Labeling of Forensic Data
  • MiniProject1

Class on Aug 31

Class on Aug 29

Class on Aug 27

  • As of 9PM Aug 27: Practice0 is open for cloning/completion
  • Finish the lecture on the background for the class
  • Make sure ssh/putty setup works
  • Full details

Class on Aug 24

  • Make sure you accept your github invitations
  • Follow through ssh/putty setup - Full details

Class on Aug 22

  • Create your github account
    • Fork students, then create your utid.md file providing your name and interests (see Audris.md for inspiration), and also provide your utid.key with your public ssh key.
  • Make sure you do it during the class so we are ready to start on Aug 24

Information for remote participation via Zoom

Syllabus for "Fundamentals of Digital Archeology"

  • Course: COSC-445/COSC-545
  • MK-524, 2:30-3:20 MWF
  • Instructor: Audris Mockus, audris@utk.edu, office hours: MK613, on request
  • TA: Sadika Amreen, samreen@vols.utk.edu, office hours: TBD
  • Need help?

Simple rules:

  1. There are no stupid questions. However, it may be worth going over the following steps:
  2. Think of what the right answer may be.
  3. Search online: stack overflow, etc.
  4. Look through issues
  5. Post the question as an issue.
  6. Ask instructor: email for 1-on-1 help, or to set up a time to meet

Objectives

The course will combine theoretical underpinning of big data with intense practice. In particular, approaches to ethical concerns, reproducibility of the results, absence of context, missing data, and incorrect data will be both discussed and practiced by writing programs to discover the data in the cloud, to retrieve it by scraping the deep web, and by structuring, storing, and sampling it in a way suitable for subsequent decision making. At the end of the course students will be able to discover, collect, and clean digital traces, to use such traces to construct meaningful measures, and to create tools that help with decision making.

Expected Outcomes

Upon completion, students will be able to discover, gather, and analyze digital traces, will learn how to avoid mistakes common in the analysis of low-quality data, and will have produced a working analytics application.

In particular, in addition to practicing critical thinking, students will acquire the following skills:

  • Use Python and other tools to discover, retrieve, and process data.

  • Use data management techniques to store data locally and in the cloud.

  • Use data analysis methods to explore data and to make predictions.

Course Description

A great volume of complex data is generated as a result of human activities, including both work and play. To exploit that data for decision making it is necessary to create software that discovers, collects, and integrates the data.

Digital archeology relies on traces that are left over in the course of ordinary activities, for example the logs generated by sensors in mobile phones, the commits in version control systems, or the email sent and the documents edited by a knowledge worker. Understanding such traces is complicated in contrast to data collected using traditional measurement approaches.

Traditional approaches rely on a highly controlled and well-designed measurement system. In meteorology, for example, the temperature is taken in specially designed and carefully selected locations to avoid direct sunlight and to be at a fixed distance from the ground. Such measurement can then be trusted to represent these controlled conditions and the analysis of such data is, consequently, fairly straightforward.

The measurements from geolocation or other sensors in mobile phones are affected by numerous (yet not recorded) factors: was the phone kept in the pocket, was it indoors or outside? The devices are not calibrated or may not work properly, so the corresponding measurements would be inaccurate. Locations (without mobile phones) may not have any measurement, yet may be of the greatest interest. This lack of context and inaccurate or missing data necessitates fundamentally new approaches that rely on patterns of behavior to correct the data, to fill in missing observations, and to elucidate unrecorded context factors. These steps are needed to obtain meaningful results from a subsequent analysis.
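
To make the idea of filling in missing observations concrete, here is a minimal sketch that interpolates gaps in a sensor trace from the neighboring readings. Linear interpolation is only the simplest stand-in for the pattern-based corrections described above, and the function name `fill_missing` and the sample trace are hypothetical.

```python
def fill_missing(readings):
    """Linearly interpolate None gaps between known neighbors.

    Gaps at the start or end, which have only one known neighbor,
    are left as None rather than extrapolated.
    """
    filled = list(readings)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Hypothetical hourly temperature trace from a phone sensor;
# None marks a missing reading.
trace = [None, 20.0, None, None, 23.0, 24.0, None]
print(fill_missing(trace))
```

Leaving the edge gaps unfilled is a deliberate choice in this sketch: with a single known neighbor there is no pattern to interpolate from, and guessing would hide the fact that the observation is missing.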

The course will cover basic principles and effective practices to increase the integrity of the results obtained from voluminous but highly unreliable sources.

  • Ethics: legal aspects, privacy, confidentiality, governance

  • Reproducibility: version control, ipython notebook

  • Fundamentals of big data analysis: extreme distributions, transformations, quantiles, sampling strategies, and logistic regression

  • The nature of digital traces: lack of context, missing values, and incorrect data
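
As a small illustration of why extreme distributions call for quantiles and transformations, the sketch below compares the mean, the quartiles, and a log-transformed view of a skewed sample. The commit counts are invented for this example and use only the Python standard library.

```python
import math
import statistics

# Hypothetical commit counts per developer: most are small, one is extreme.
commits = [1, 1, 2, 2, 3, 4, 5, 8, 13, 10000]

# The mean is dominated by the single extreme value; the median is not.
mean = statistics.mean(commits)
median = statistics.median(commits)

# Quartiles summarize the bulk of the data regardless of the outlier.
q1, q2, q3 = statistics.quantiles(commits, n=4)

# A log transformation compresses the long right tail.
log_commits = [math.log10(c) for c in commits]

print(f"mean={mean:.1f} median={median}")
print(f"quartiles={q1}, {q2}, {q3}")
print(f"max before={max(commits)} after={max(log_commits):.1f}")
```

Here the mean lands far above the third quartile, so it describes almost none of the developers, while the quartiles and the log scale still describe the typical ones.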

Prerequisites

Students are expected to have basic programming skills, in particular, to be able to use regular expressions; programming concepts such as variables, functions, and loops; and data structures such as lists and dictionaries (for example, COSC 365).

Being familiar with version control systems (e.g., COSC 340), Python (e.g., COSC 370), and introductory-level probability (e.g., ECE 313) and statistics, such as random variables, distributions, and regression, would be beneficial but is not expected. Everyone is expected, however, to be willing and highly motivated to catch up in the areas where they have gaps in the relevant skills.

All the assignments and projects for this class will use github and Python. Knowledge of Python is not a prerequisite for this course, provided you are comfortable learning on your own as needed. While we have strived to make the programming component of this course straightforward, we will not devote much time to teaching programming, Python syntax, or any of the libraries and APIs. You should feel comfortable with:

  1. How to look up Python syntax on Google and StackOverflow.
  2. Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
  3. How to learn new libraries by reading documentation and reusing examples.
  4. Asking questions on StackOverflow or as a GitHub issue.

Requirements

These apply to real life, as well.

  • Must apply "good programming style" learned in class
    • Optimize for readability
  • Bonus points for:
    • Creativity (as long as requirements are fulfilled)

Teaming Tips

  • Agree on an editor and environment that you're comfortable with
  • The person who's less experienced/comfortable should have more keyboard time
  • Switch who's "driving" regularly
  • Make sure to save the code and send it to others on the team

Evaluation

  • Class Participation – 15%: students are expected to read all material covered in a week and come to class prepared to take part in the classroom discussions. Responding to other student questions (issues) counts as classroom participation.

  • Assignments - 40%: Each assignment will involve writing (or modifying a template of) a small Python program.

  • Project - 45%: one original project done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course that students find particularly compelling. The group needs to submit a project proposal (2 pages IEEE format) approximately 1.5 months before the end of term. The proposal should provide a brief motivation of the project, detailed discussion of the data that will be obtained or used in the project, along with a time-line of milestones, and expected outcome.

Other considerations

As a programmer you will never write anything from scratch, but will reuse code, frameworks, or ideas. You are encouraged to learn from the work of your peers. However, if you don't try to do it yourself, you will not learn. Deliberate practice (activities designed for the sole purpose of effectively improving specific aspects of an individual's performance) is the only way to reach perfection.

Please respect the terms of use and/or license of any code you find, and if you re-implement or duplicate an algorithm or code from elsewhere, credit the original source with an inline comment.

Resources

Materials

This class assumes you are confident with this material, but in case you need a brush-up...

Other

Databases
  • A MongoDB Schema Analyzer. One JavaScript file that you run with the mongo shell command on a database collection and it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.
R and data analysis
  • Modern Applied Statistics with S (4th Edition) by William N. Venables, Brian D. Ripley. ISBN 0387954570
  • R
  • Code School
  • Quick-R
Tutorials written as ipython-notebooks

GitHub