network_anony: A C++ repository from ivanphilein

<html>

<body link="#0000ff" vlink="#0000ff" alink="#0000ff">
<h1>Network anonymization README</h1>

<h2>Index</h2>
<ul>
  <li><a href="#folder structure">Folder structure</a></li>
  <li><a href="#result data">Result data</a></li>
  <li><a href="#run step">Run Steps</a></li>
  <li><a href="#file format">File format</a></li>
  <li><a href="#problems">Problems During Implementation</a></li>
  <li><a href="#problems to fix">Problems still to fix</a></li>
  <li><a href="#future work">Future work</a></li>
</ul>

<hr>

<h2><a name="folder structure"></a>Folder structure:</h2> 
<ul>
  <li><font color="blue">preprocess</font></li>
	<ul>
		<li>FakeLabel.java, which generates fake labels for an unlabeled graph.<br>
	Input: 1. the graph 2. the number of labels on each edge 3. the leaf number (the leaf number cannot be smaller than the label number)<br>
	Output: a labeled graph. The file name looks like s2_dpinputFile_o_XLab.txt, and this file is used as input for DP and the other degree anonymization methods.</li>
		<li>HTree.java, which builds the hierarchy tree.</li>
		<li>LiveJournalData.java, which extracts the LiveJournal data.</li>
		<li>Running steps: 
		<pre>ant</pre> (generates livk.data)
		<pre>./run.sh</pre> (extracts the LiveJournal data and also adds the fake labels)
		<pre>./runDBLP.sh</pre> (only adds the fake labels, for the DBLP graph)</li>	
	</ul>
  <li><font color="blue">heuristicAlgorithm</font></li>
This is the main folder of the Group-heuristic method.
	<ul>
		<li>include folder: 
			<ul>
			<li>heuristic.h</li>
			<li>mapping.h</li>
			<li>sortMap.h: header file that defines the element (node id, node degree, checked) used to sort nodes by degree and to find the higher-degree nodes.</li>
			</ul>		
		</li>

		<li>src folder: 
			<ul>
			<li>heuristic.cpp: the k_anonymity function, which takes the graph and produces a new graph whose node degrees are anonymized.</li>
			<li>mapping.cpp: reads the graph and stores the node and edge maps (node to edge, edge id to edge, etc.).</li>
			</ul>
		</li>
		<li>Running steps:
			<pre>./make.sh</pre>
			<pre>./heu2 graph_file K Method</pre>
	Only one method currently works in the new code; this needs to be changed later.
			<br>
	See the .sh files for more examples.
		</li>
	</ul>
  <li><font color="red">arxiv</font></li>
This folder has the data files of the arXiv dataset. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="blue">DBLP</font></li>
This folder has the data files of the DBLP dataset. The files directly under DBLP are the DBLP input for edge anonymization, and the output folder holds the output of edge anonymization.
  <li><font color="red">debug</font></li>
This folder holds the debug results of the DP function. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">enron</font></li>
This folder has the data files of the Enron dataset. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="blue">figure</font></li>
This folder has all the figures (PDF files) and the R code that generates them. See more details at <a href="#result data">Result data</a>.
  <li><font color="red">livj</font></li>
This folder has the data files of the livj (LiveJournal) dataset. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">logresult</font></li>
This folder has the log files written while the program runs. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">paper</font></li>
This folder has the draft of the paper. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">rcoderef</font></li>
This folder was used to learn how to draw figures with R and can be deleted now. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">result</font></li>
The result files (Excel) used to draw the figures in the EDBT paper. <font color="red">TODO: more details need to be added later.</font>
  <li><font color="red">syndata</font></li>
This folder has the data files of the synthetic dataset. <font color="red">TODO: more details need to be added later.</font>
  <li>For other folders and edge anonymization, <font
			color="red">TODO: merge the readme.txt file. I
			will merge that later.</font></li>
  <li>Complete folder list: list all the folders here; for any folder
			not yet checked, just add a TODO note.</li>
</ul>
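<p>As an illustration of the degree-based sorting described for sortMap.h and of the goal of the k_anonymity function, here is a minimal C++ sketch. It is not taken from the repository: the names <code>SortElement</code>, <code>sortByDegreeDesc</code> and <code>isKDegreeAnonymous</code> are hypothetical, and the check assumes the usual definition of k-degree anonymity (every degree value that occurs is shared by at least k nodes).</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical element type mirroring the fields listed for sortMap.h:
// node id, node degree, and a "checked" flag.
struct SortElement {
    int nodeId;
    int degree;
    bool checked;
};

// Sort nodes so that the highest-degree nodes come first.
// Ties are broken by node id to keep the order deterministic.
void sortByDegreeDesc(std::vector<SortElement>& nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const SortElement& a, const SortElement& b) {
                  if (a.degree != b.degree) return a.degree > b.degree;
                  return a.nodeId < b.nodeId;
              });
}

// Assumed definition: the graph is k-degree anonymous when every
// degree value that occurs is shared by at least k nodes.
bool isKDegreeAnonymous(const std::vector<SortElement>& nodes, std::size_t k) {
    assert(k >= 1);
    std::map<int, std::size_t> count;  // degree -> number of nodes with it
    for (const SortElement& n : nodes) ++count[n.degree];
    for (const auto& entry : count) {
        if (entry.second < k) return false;
    }
    return true;
}
```

<p>A caller would fill a <code>std::vector&lt;SortElement&gt;</code> from the node map, call <code>sortByDegreeDesc</code> to find the larger-degree nodes first, and could use <code>isKDegreeAnonymous</code> to sanity-check an anonymized graph.</p>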
<hr>
<h2><a name="result data"></a>Result data</h2>
<style type="text/css">
td{
	width:20px;
}
tr{
	height: 40px;
}

tr.datacellone {
	background-color: #CC9999; color: black;
}
tr.datacelltwo {
	background-color: #9999CC; color: black;
	height: 40px;
}
</style>
<table border="1">
<tr color="green">
<th>Figure in the final paper</th>
<th>File name of data</th>
<th>Row numbers of data</th>
<th>Column numbers of data</th>
<th>R file</th>
<th>Corresponding PDF file</th>
</tr>
<tr class="datacelltwo">
  <td>Figure 7 Degree anonymization: Enron dataset (a)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>4-9</td>
  <td>AB,AC,AD,AE</td>
  <td>enron_degreeanony_edge_log.r</td>
  <td>enron_degreeanony_edge_log.pdf</td>
</tr>
<tr>
  <td>Figure 7 Degree anonymization: Enron dataset (b)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>4-9</td>
  <td>AF,AG,AH</td>
  <td>enron_degreeanony_time_log.r</td>
  <td>enron_degreeanony_time_log.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 8 Degree anonymization: arXiv dataset (a)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>116-122</td>
  <td>AB,AC,AD,AE</td>
  <td>arxiv_degreeanony_edge.r</td>
  <td>arxiv_degreeanony_edge.pdf</td>
</tr>
<tr>
  <td>Figure 8 Degree anonymization: arXiv dataset (b)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>116-122</td>
  <td>AF,AG,AH</td>
  <td>arxiv_degreeanony_time.r</td>
  <td>arxiv_degreeanony_time.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 9 Degree anonymization: DBLP dataset (a)</td>
  <td>result/DBLP_ALL_RESULTS.xlsx</td>
  <td>1-16</td>
  <td>A,B,C</td>
  <td>DBLP_degreeanony_edge.r</td>
  <td>DBLP_degreeanony_edge.pdf</td>
</tr>
<tr>
  <td>Figure 9 Degree anonymization: DBLP dataset (b)</td>
  <td>result/DBLP_ALL_RESULTS.xlsx</td>
  <td>1-16</td>
  <td>A,B,C</td>
  <td>DBLP_degreeanony_time.r</td>
  <td>DBLP_degreeanony_time.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 10 Degree anonymization: Synthetic dataset (a)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>11-18</td>
  <td>AB-AE (we do not show the DP here)</td>
  <td>syn_degreeanony_edge.r</td>
  <td>syn_degreeanony_edge.pdf</td>
</tr>
<tr>
  <td>Figure 10 Degree anonymization: Synthetic dataset (b)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>11-18</td>
  <td>AG-AH (we do not show the DP here)</td>
  <td>syn_degreeanony_time.r</td>
  <td>syn_degreeanony_time.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 11 Degree anonymization: Group-heuristic, Synthetic dataset, fix n = 5K (a)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>49-70</td>
  <td>AB,AD</td>
  <td>syn_degreeanony_varyd_edge.r</td>
  <td>syn_degreeanony_varyd_edge.pdf</td>
</tr>
<tr>
  <td>Figure 11 Degree anonymization: Group-heuristic, Synthetic dataset, fix n = 5K (b)</td>
  <td>result/degreeanony_noZero_2011_1212_figure.xlsx</td>
  <td>49-70</td>
  <td>AB,AG</td>
  <td>syn_degreeanony_varyd_time.r</td>
  <td>syn_degreeanony_varyd_time.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 12 Edge label anonymization:
Enron dataset, compare different pruning strategies: EL denotes the number of edge labels on each edge, fix n = 5K (a)</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>19-22</td>
  <td>V,W,X,Y,Z,AA</td>
  <td>prune_enron_tnode.r</td>
  <td>prune_enron_tnode.pdf</td>
</tr>
<tr>
  <td>Figure 12 Edge label anonymization:
Enron dataset, compare different pruning strategies: EL denotes the number of edge labels on each edge, fix n = 5K (b)</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>13-16</td>
  <td>V,W,X,Y,Z,AA</td>
  <td>prune_enron_time.r</td>
  <td>prune_enron_time.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 12 Edge label anonymization:
Enron dataset, compare different pruning strategies: EL denotes the number of edge labels on each edge, fix n = 5K (c)</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>4-16</td>
  <td>AA,AB,AC</td>
  <td>prune_enron_tnode_vs_time.r</td>
  <td>prune_enron_tnode_vs_time.pdf</td>
</tr>
<tr>
  <td>Figure 13 Edge label anonymization: Synthetic dataset (EL2)</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>76-82</td>
  <td>L,M,N,O</td>
  <td>syn_edgetime_varyn_log.r</td>
  <td>syn_edgetime_varyn_log.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 14 Edge label anonymization: DBLP dataset (EL2)</td>
  <td>result/DBLP_ALL_RESULTS.xlsx</td>
  <td>18-24</td>
  <td>A,B</td>
  <td>DBLP_edgetime_varyK.r</td>
  <td>DBLP_edgetime_varyK.pdf</td>
</tr>
<tr>
  <td>Figure 15 Edge label anonymization: arXiv dataset</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>159-164</td>
  <td>M,N,O</td>
  <td>arxiv_edgetime.r</td>
  <td>arxiv_edgetime.pdf</td>
</tr>
<tr class="datacelltwo">
  <td>Figure 16 Information loss: Enron dataset</td>
  <td>result/new_heuristic_2011_1107_figure.xlsx</td>
  <td>7-11</td>
  <td>J,V</td>
  <td>enron_infoloss.r</td>
  <td>enron_infoloss.pdf</td>
</tr>
</table>
<hr>
<h2><a name="run step"></a>Run Steps</h2>

<ul>
  <li>enron data
    <ul>
      <li>Detailed steps will come later, <font color="red">TODO</font></li>
    </ul>
  </li>
  <li>arXiv data
    <ul>
      <li>Detailed steps will come later, <font color="red">TODO</font></li>
    </ul>
  </li>
  <li>synthetic data
    <ul>
      <li>Detailed steps will come later, <font color="red">TODO</font></li>
    </ul>
  </li>
  <li>livejournal data
    <ul>
      <li>Detailed steps will come later, <font color="red">TODO</font></li>
    </ul>
  </li>
  <li>DBLP data
    <ul>
    <li>Data is generated by keywordsearch/src/networkanony/generateEdge.java:
	<pre>java -cp keywordsearch.jar networkanony.generateEdge</pre>
     </li>
     <li>Add fake labels with networdanony/preprocess/generateFakeLabel/src/FakeLabel.java:
	<pre>java -cp livej.jar FakeLabel "graph file" K leaf_num</pre>
     </li>
     <li>Move the output file (data/DBLP/s2_dpinputFile_o_2lab.txt) to networdanony/DBLP, then build and run heu2:
	<pre>g++ src/heuristic.cpp src/mapping.cpp -o heu2</pre>
	<pre>./heu2 graphfile K method</pre>
    </li>
    <li>Move the output file (data/DBLP/s3_superGraph_o_K2_2lab.txt) to networdanony/DBLP, then build and run optimal:
	<pre>make</pre>
	<pre>./optimal K Label_num Folder prune_method >>logDBLP.txt</pre>
	Here Folder is the folder that contains s3_superGraph_o_K2_2lab.txt, prune_method is the pruning method (e.g. prune123+loose), and logDBLP.txt is the log file.
    </li>
    </ul>
  </li>
</ul>

<hr>
<h2><a name="file format"></a>File format</h2>
<ul>
  <li><font color="purple">s1_DomainHierarchyGraph2lab.txt</font><br/>
  This file describes the enumeration hierarchy tree.<br/>
  Each line has two numbers: tree_node_parent (int), tree_node_child (int)</li>
  <li><font color="purple">s2_dpInputFile_o_<i>i</i>lab.txt</font><br/>
  Each line has 2+i numbers: <br/>
  graph_node_from (int), graph_node_to (int), the_first_label (int), the_second_label (int), ..., the_<i>i</i>th_label (int)<br/>
  E.g., in the file "s2_dpInputFile_o_2lab.txt", each line has 4 numbers: <br/>
  graph_node_from (int), graph_node_to (int), the_first_label (int), the_second_label (int)</li>
  <li><font color="purple">DBLPNoLabel<i>i</i>Lab.txt_k=<i>k</i>_output_M2.txt</font><br/>
  This file has a format similar to s2_dpInputFile_o_<i>i</i>lab.txt, but with three extra lines at the beginning:<br/>
  node_num (int): the total number of nodes <br/>
  edge_num (int): the total number of edges <br/>
  label_num (int): the label number <br/>
  Each following line has 2+i numbers, as in s2_dpInputFile_o_<i>i</i>lab.txt: <br/>
  graph_node_from (int), graph_node_to (int), the_first_label (int), the_second_label (int), ..., the_<i>i</i>th_label (int)</li>
  <li><font color="purple">s3_superGraph_o_K<i>k</i>_<i>i</i>lab.txt</font><br/>
  Each line has 2+i numbers: <br/>
  graph_node_from (int), graph_node_to (int), the_first_label (int), the_second_label (int), ..., the_<i>i</i>th_label (int)<br/>
  E.g., in the file "s3_superGraph_o_K2_2lab.txt", each line has 4 numbers: <br/>
  graph_node_from (int), graph_node_to (int), the_first_label (int), the_second_label (int)<br/>
  Unlike the s2 file, this file also contains the newly added edges, which carry the label "-100,-100".</li>
  <!--<li>The differences between input files for dp-heuristic/dp/group-heuristic</li>-->
</ul>
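<p>The edge-list formats above can be read with a few lines of C++. The following is only a sketch under stated assumptions, not the repository's mapping.cpp: the names <code>LabeledEdge</code>, <code>readEdges</code> and <code>countAddedEdges</code> are hypothetical, and because this README does not state the field separator, the reader accepts both whitespace and commas.</p>

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of one line: from, to, then i labels.
struct LabeledEdge {
    int from;
    int to;
    std::vector<int> labels;
};

// Reads lines of 2+i integers. Set hasHeader to true for files that
// start with the three lines node_num, edge_num and label_num.
std::vector<LabeledEdge> readEdges(std::istream& in, bool hasHeader) {
    std::string line;
    if (hasHeader) {
        // Skip the three header lines (one value per line).
        for (int i = 0; i < 3 && std::getline(in, line); ++i) {}
    }
    std::vector<LabeledEdge> edges;
    while (std::getline(in, line)) {
        for (char& c : line) {
            if (c == ',') c = ' ';  // assumption: commas may separate fields
        }
        std::istringstream fields(line);
        LabeledEdge e;
        if (!(fields >> e.from >> e.to)) continue;  // skip blank or malformed lines
        int label;
        while (fields >> label) e.labels.push_back(label);
        edges.push_back(e);
    }
    return edges;
}

// Counts edges whose labels are all -100, the marker that the
// s3_superGraph file uses for newly added edges.
std::size_t countAddedEdges(const std::vector<LabeledEdge>& edges) {
    std::size_t added = 0;
    for (const LabeledEdge& e : edges) {
        bool allMarker = !e.labels.empty();
        for (int label : e.labels) {
            if (label != -100) allMarker = false;
        }
        if (allMarker) ++added;
    }
    return added;
}
```

<p>For an s2 or s3 file, open it with <code>std::ifstream</code> and call <code>readEdges(file, false)</code>; for the DBLPNoLabel output file, pass <code>true</code> so that the three header lines are skipped.</p>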

<hr>
<h2><a name="problems"></a>Problems During Implementation</h2>
<ul>
  <li>README.txt is still unclear in places, e.g. some information applies to Enron only; this should be made clear.</li>
  <li>In the output of the edge anonymization part, the reported time should be the total time, not the "heuristic one time".</li>
</ul>

<hr>
<h2><a name="problems to fix"></a>Problems still to fix</h2>
<ul>
  <li>Finish reorganizing the folders: put all the data files in one folder and all the code in one folder, under different names.</li>
  <li>Rename the functions and give all the runnable files reasonable names.</li>
  <li>Rewrite the command .sh files.</li>
  <li>For the Group-Degree-anonymization, there were originally two methods, but only one works now and the other does not work well; check that (basically, test on different operating systems).</li>
  <li>Double-check the meaning of "heuristic one time" in the edge anonymization part.</li>
  <li>There is a segmentation fault in the edge anonymization part, in the function myGraph::writeTo.</li>
  <li>When checking Figure 12, the number of enumeration tree nodes is unclear for label numbers from 2 to 4.</li>
  <li>Change generateFakeLabel and separate all the functions (move the LiveJournal part out).</li>
  <li>The DP result and the Group result are different.</li>
</ul>