We will be using the ACL (Association for Computational Linguistics) Anthology network data set, compiled by Mark Joseph & Dragomir Radev: http://tangra.si.umich.edu/clair/anthology/.

We will use just two networks derived from this data set. The files are the weighted co-authorship network CoAuthorshipNetwork.net (how many papers two people co-authored) and the weighted citation network AuthorCitationNetwork.net (how many papers of author A cite papers of author B). In each of these networks, an author is included only if they have at least 10 papers in the ACL dataset.

Your tasks are the following:

- Load the two networks. They should both have the same number of vertices: 1559.
- Compute the density of both:
`Info > Network > General`. Which one has the higher *density*? Why could this be the case?
- Compute the clustering coefficient of both:
`Net > Vector > Clustering Coefficients > CC1`, and then `Info > Vector`. Interpret the difference. Do this on an undirected version of the citation network (`Net > Transform > Arcs->Edges`), but for the rest of the assignment use the directed version.
- In the co-authorship network, compute the *degree*, *closeness*, and *betweenness* of each author. The following is a (rather complicated) way to sort the vertices by their centralities:
  - Apply the centrality measure so that you have a vector of values for each vertex.
  - With that vector selected in the drop-down menu, select `Vector > Make Permutation`.
  - With that permutation selected in the permutation drop-down menu, select `Operations > Reorder > Network`. This will create a new network.
  - Re-calculate the centrality for the reordered network. Click on the `edit` button next to the new centrality vector. Now the vertices are ordered from least to most central, so scroll to the bottom to get the top 5 (include the list).
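As a cross-check on Pajek's output for the steps above, the density, clustering coefficient, and degree can be computed by hand. A minimal Python sketch on a made-up four-author co-authorship graph (toy data for illustration only, not the ACL networks):

```python
from itertools import combinations

# Toy undirected co-authorship graph as an adjacency dict (made-up data).
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

n = len(adj)
m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge counted twice

# Density of an undirected simple graph: existing edges / possible edges.
density = 2 * m / (n * (n - 1))

def clustering(v):
    """Local clustering coefficient: fraction of v's neighbor pairs
    that are themselves connected (0 for vertices of degree < 2)."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return 2 * links / (k * (k - 1))

degree = {v: len(adj[v]) for v in adj}
print("density =", density)                 # 4 edges of 6 possible
print({v: clustering(v) for v in adj})
```

For a directed network, the density denominator is n(n-1) without the factor of 2, which is why the directed citation network and the undirected co-authorship network are not directly comparable edge-for-edge.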

- In the citation network, compute the *indegree* and *proximity prestige* of each author.
  - For proximity prestige, you take the input domain of the vertex (everyone who cites that person, directly or indirectly) and divide its size by the average distance to those vertices. You will use `Net > Partitions > Domain > Input`. This will produce two things: a partition with the size of the input domain of each vertex, and a vector of average distances to the vertices in the input domain.
  - Create a vector from the input-domain-size partition: `Partition > Make Vector`.
  - Then select the average-distance vector in the second drop-down menu.
  - Select `Vectors > Divide First by Second`. This will be the proximity prestige of each vertex.
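The recipe above amounts to the following computation, sketched in Python on a made-up four-author citation graph (note this follows the size-divided-by-average-distance recipe as stated; some textbook definitions of proximity prestige additionally normalize the input-domain size by n-1):

```python
from collections import deque

# Made-up directed citation graph: edge u -> v means "u cites v".
cites = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": ["C"],
}

# Reverse the graph once so we can walk from a vertex to its citers.
cited_by = {v: [] for v in cites}
for u, targets in cites.items():
    for v in targets:
        cited_by[v].append(u)

def proximity_prestige(v):
    """Input-domain size divided by average distance to the domain."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # BFS over reversed edges
        x = queue.popleft()
        for u in cited_by[x]:
            if u not in dist:
                dist[u] = dist[x] + 1
                queue.append(u)
    domain = [d for u, d in dist.items() if u != v]
    if not domain:
        return 0.0
    return len(domain) / (sum(domain) / len(domain))

print({v: proximity_prestige(v) for v in cites})
```

Here C is cited by B and D directly (distance 1) and by A indirectly through B (distance 2), so its input domain has size 3 and average distance 4/3, giving prestige 2.25.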

- Look for the highest correlation between a centrality measure from the co-authorship network and a prestige measure (indegree or proximity prestige) from the citation network. Please give all pairwise correlations. Which two measures are the most correlated? Interpret. (Caution! Make sure that you are using the centrality/prestige measures with the original vertex ordering, and then find the correlations.)
  - Select a centrality measure as the first vector in the vector drop-down menu.
  - In the second drop-down menu right below it, select a prestige measure.
  - Select `Vectors > Info`. This will give you the *Pearson correlation coefficient*.
  - Make sure that the measures were applied to the original ordering of the vertices, so that you are correlating values for the same author.
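The Pearson coefficient Pajek reports can be verified by hand. A minimal Python sketch; the centrality and prestige vectors below are hypothetical values for five authors, listed in the same (original) vertex order:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measure values (same vertex ordering in both vectors).
centrality = [0.1, 0.4, 0.2, 0.9, 0.5]
prestige   = [1, 5, 2, 10, 6]
print(pearson(centrality, prestige))
```

This is exactly why the ordering caution matters: if one vector comes from the reordered network, position i in the two vectors no longer refers to the same author and the coefficient is meaningless.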

- [1 bonus] Finally, load the file CitationNetWoCoauthors.net. This is the citation network with citations between co-authors removed (the reason being that an author may be citing their own paper and in the process citing their co-authors). We're trying to get a more "unbiased" prestige measure where we don't take direct citations by co-authors into account. Compare the density of this network with the complete author citation network. What percentage of the citation edges was from co-authors?
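For the bonus comparison, the percentage of citation arcs coming from co-authors follows directly from the two arc counts. The counts below are placeholders, not the real ones; read yours from `Info > Network > General` for each network:

```python
# Hypothetical arc counts (substitute the values Pajek reports).
arcs_full = 12000      # full author citation network
arcs_without = 10500   # with citations between co-authors removed

pct_from_coauthors = 100 * (arcs_full - arcs_without) / arcs_full
print(f"{pct_from_coauthors:.1f}% of citation arcs were between co-authors")
```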

You may wish to refer to the “Power-laws: Scale-free networks” lecture notes and the “Generating and Fitting Power Law Distributions in Matlab” tutorial to figure out how to complete the tasks.

Generate 100,000 random integers from a power law distribution with exponent alpha = 2.1
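If you prefer not to use the Matlab script, power-law distributed integers can be drawn by inverse-transform sampling. A Python sketch; the rounding scheme is the common continuous approximation with x_min = 1, and the seed is arbitrary:

```python
import random

def power_law_sample(n, alpha=2.1, xmin=1, seed=0):
    """Draw n integers x >= xmin with P(x) ~ x^(-alpha), approximately,
    by inverting the continuous CDF and rounding down."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random()  # uniform in [0, 1)
        x = xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))
        out.append(int(x))
    return out

sample = power_law_sample(100_000)
print("max value:", max(sample))   # heavy tail: occasionally huge
```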

- What is the largest value in your sample? Is it possible for a node in a network to have a degree this high (assuming you don't allow multiple edges between two nodes)?
- Construct a histogram of the frequency of occurrence of each integer in your sample.
Pajek will let you calculate the degree of each individual node (`Net > Partitions > Degree > All`). Then, export the partition as a '.clu' file by clicking on the save icon to the left of the partition drop-down menu. Now you can import it into Excel or another program and histogram it. Try both a linear-scale plot and a log-log-scale plot.
- What happens to the bins with zero count in the log-log plot?
- Try a simple linear regression on the log transformation of both variables.
In Matlab, you can plot two data sets together as follows:
`plot(x1,y1,'r-',x2,y2,'b:')`. This will plot y1 vs. x1 as a red solid line, and y2 vs. x2 as a blue dotted line. (If you are using the fitlineonloglog.m Matlab script, feed it the binned data, and it will take the log of the x and y for you before doing a linear fit.) What is your value of the power-law exponent alpha? Include a plot of the data with the fit superimposed.
- Now exponentially bin the data and fit with a line. What is your value of alpha?
- Finally, do a cumulative frequency plot of the original data sample. Fit, plot, and report on the fitted exponent and the corresponding value of alpha.
- [1 bonus] Which method was the most accurate? Which one, in your opinion, gave the best view of the data and the fit?
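Both the log-log regression and the cumulative-frequency construction can be done without Matlab. A plain-Python sketch (the synthetic `freq` data is noiseless and serves only as a sanity check): fitting log(count) against log(value) on frequency data estimates -alpha, while on cumulative data it estimates -(alpha - 1), so for the cumulative fit add one to the magnitude of the fitted slope to recover alpha:

```python
import math
from collections import Counter

def fit_loglog(pairs):
    """Least-squares slope of log(y) on log(x).
    For y ~ x^(-alpha) the slope is approximately -alpha."""
    pts = [(math.log(x), math.log(y)) for x, y in pairs if x > 0 and y > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return sxy / sxx

def cumulative_counts(sample):
    """(value, number of observations >= value) pairs, ascending by value."""
    counts = Counter(sample)
    remaining = len(sample)
    out = []
    for v in sorted(counts):
        out.append((v, remaining))
        remaining -= counts[v]
    return out

# Sanity check on noiseless synthetic frequencies with alpha = 2.1:
freq = [(v, 1e6 * v ** -2.1) for v in range(1, 101)]
print("fitted alpha ~", -fit_loglog(freq))

# Cumulative counts of a tiny made-up sample:
print(cumulative_counts([1, 1, 1, 2, 2, 3]))
```

On real (noisy) samples the raw-histogram fit is known to be unstable in the sparse tail, which is precisely what the exponential-binning and cumulative-plot tasks are meant to illustrate.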

Submission of your homework is via WebCT. You must submit a single document containing all the answers and the required files.

Acknowledgement: This assignment is adapted from one by Lada Adamic.