C#: crawler project

Hi,
Could I get easy-to-follow code examples for the following:
  1. Use a browser control to launch a request to a target website.
  2. Capture the response from the target website.
  3. Convert the response into a DOM object.
  4. Iterate through the DOM object and capture fields like "FirstName" and "LastName" if they are part of the response.
thanks

Answer:

Here is code that uses a WebRequest object to retrieve data and capture the response as a stream.
    public static Stream GetExternalData( string url, string postData, int timeout )
    {
        ServicePointManager.ServerCertificateValidationCallback += delegate( object sender,
                                                                                X509Certificate certificate,
                                                                                X509Chain chain,
                                                                                SslPolicyErrors sslPolicyErrors )
        {
            // if we trust the callee implicitly, return true...otherwise, perform validation logic
            return true; // placeholder: replace with real certificate validation if needed
        };

        WebRequest request = null;
        HttpWebResponse response = null;

        try
        {
            request = WebRequest.Create( url );
            request.Timeout = timeout; // force a quick timeout

            if( postData != null )
            {
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = System.Text.Encoding.ASCII.GetByteCount( postData ); // byte count, not char count

                using( StreamWriter requestStream = new StreamWriter( request.GetRequestStream(), System.Text.Encoding.ASCII ) )
                {
                    requestStream.Write( postData ); // the using block disposes (and closes) the writer
                }
            }

            response = (HttpWebResponse)request.GetResponse();
        }
        catch( WebException ex )
        {
            Log.LogException( ex );
        }
        finally
        {
            request = null;
        }

        if( response == null || response.StatusCode != HttpStatusCode.OK )
        {
            if( response != null )
            {
                response.Close();
                response = null;
            }

            return null;
        }

        return response.GetResponseStream();
    }
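A minimal usage sketch of the method above. The URL and timeout are placeholders, and a MemoryStream stands in for the network response here so the reading logic can be shown on its own:

```csharp
using System;
using System.IO;
using System.Text;

class UsageSketch
{
    // Reads any stream to a string; in real use this would be called on
    // the stream returned by GetExternalData.
    public static string ReadAll( Stream stream )
    {
        if( stream == null ) return null;
        using( StreamReader reader = new StreamReader( stream, Encoding.UTF8 ) )
            return reader.ReadToEnd();
    }

    static void Main()
    {
        // In real use (placeholder URL and timeout):
        // Stream data = GetExternalData( "http://example.com", null, 5000 );
        // A MemoryStream stands in for the HTTP response stream here.
        Stream data = new MemoryStream( Encoding.UTF8.GetBytes( "<html><body>hello</body></html>" ) );
        Console.WriteLine( ReadAll( data ) );
    }
}
```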
For parsing the response, I have a custom XHTML parser that I use, but it is thousands of lines of code. There are several publicly available parsers (see Darin's comment).
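For the DOM side of the question (steps 3 and 4), here is a sketch using the built-in XmlDocument. This only works if the response is well-formed XHTML; for loose real-world HTML, one of the publicly available tolerant parsers is the safer choice. The markup and the field ids below are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

class DomDemo
{
    // Step 3 + 4: load well-formed XHTML into a DOM, then walk it and
    // capture the text of any element whose id matches a wanted field name.
    public static Dictionary<string, string> ExtractFields( string xhtml, params string[] fieldIds )
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml( xhtml ); // throws XmlException if the markup is not well-formed

        var found = new Dictionary<string, string>();
        foreach( XmlNode node in doc.SelectNodes( "//*[@id]" ) )
        {
            string id = node.Attributes[ "id" ].Value;
            if( Array.IndexOf( fieldIds, id ) >= 0 )
                found[ id ] = node.InnerText;
        }
        return found;
    }

    static void Main()
    {
        // Made-up XHTML; in real use this string would come from the response stream.
        string xhtml = "<html><body>" +
                       "<span id=\"FirstName\">Jane</span>" +
                       "<span id=\"LastName\">Doe</span>" +
                       "</body></html>";

        foreach( var pair in ExtractFields( xhtml, "FirstName", "LastName" ) )
            Console.WriteLine( pair.Key + " = " + pair.Value );
    }
}
```

Note that the XPath query assumes the document has no XML namespace declaration; an XHTML document with an `xmlns` attribute would need an XmlNamespaceManager.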
EDIT: per the OP's question, headers can be added to the request to emulate a user agent. For example:

    request = (HttpWebRequest)WebRequest.Create( url );
    request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/x-shockwave-flash, */*";
    request.Timeout = timeout;
    request.Headers.Add( "Cookie", cookies );

    // manifest as a standard user agent
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";