The development of network technology has greatly increased the amount of information on the Internet, and how to obtain useful information from these massive resources automatically and effectively has become a problem to be resolved. In this paper we first introduce the working principle of HTMLParser and the relevant Java regular expression techniques. We then set up a general model for automatically extracting metadata from unstructured documents. The model is based mainly on semantic features, combined with page layouts, textual information, logic an...